Re: [PATCH] MAINTAINERS: Update filesystem-dax and NVDIMM entries

2019-01-24 Thread Ross Zwisler
On Thu, Jan 24, 2019 at 3:13 PM Keith Busch  wrote:
> On Thu, Jan 24, 2019 at 11:17:59AM -0800, Dan Williams wrote:
> > Ross has moved on to other areas.
> >
> > Matthew and Jan are trusted reviewers for DAX.
> >
> > Dan is now coordinating filesystem-dax pull requests.
> >
> > Ira and Keith are now involved with the NVDIMM and Device-DAX
> > sub-systems.
> >
> > The linux-nvdimm mailing list hosts a patchwork instance for both DAX
> > and NVDIMM patches.
> >
> > Cc: Jan Kara 
> > Cc: Ira Weiny 
> > Cc: Ross Zwisler 
> > Cc: Keith Busch 
> > Cc: Matthew Wilcox 
> > Signed-off-by: Dan Williams 
>
> Acked-by: Keith Busch 

Acked-by: Ross Zwisler 
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v2 2/2] [PATCH] xfs: Close race between direct IO and xfs_break_layouts()

2018-08-10 Thread Ross Zwisler
On Fri, Aug 10, 2018 at 9:23 AM Dave Jiang  wrote:
> On 08/10/2018 11:31 AM, Eric Sandeen wrote:
> > On 8/8/18 12:31 PM, Dave Jiang wrote:
> >> This patch duplicates Ross's fix for ext4, applied to XFS.
> >>
> >> If the refcount of a page is lowered between the time that it is returned
> >> by dax_busy_page() and when the refcount is again checked in
> >> xfs_break_layouts() => ___wait_var_event(), the waiting function
> >> xfs_wait_dax_page() will never be called.  This means that
> >> xfs_break_layouts() will still have 'retry' set to false, so we'll stop
> >> looping and never check the refcount of other pages in this inode.
> >>
> >> Instead, always continue looping as long as dax_layout_busy_page() gives us
> >> a page which it found with an elevated refcount.
> >
> > Hi Dave, does this have a testcase?  Have you seen the issue using Ross's
> > xfstest generic/503 or is there some other test?  Apologies if I missed
> > prior discussion on a testcase or race frequency...
>
> I do not have a testcase. I know Ross replicated it on ext4. And Jan
> asked to create the same fix with XFS when he reviewed Ross's fix for ext4.

In my testing I couldn't get this race to hit with XFS.  I couldn't
even get a failure with generic/503 when testing XFS before Dan's
initial patches went in which added xfs_break_layouts() et al.  I
think that Dan had to manually insert timing delays to get the warning
to hit for XFS when testing his patches.

The race we're fixing happens consistently with ext4, and through code
inspection we can see that the race exists in XFS as well.


Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

2018-07-31 Thread Ross Zwisler
On Tue, Jul 10, 2018 at 01:10:29PM -0600, Ross Zwisler wrote:
> Changes since v3:
>  * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
>that the {ext4,xfs}_break_layouts() calls have the same meaning.
>(Dave, Darrick and Jan)
> 
> ---
> 
> This series from Dan:
> 
> https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html
> 
> added synchronization between DAX dma and truncate/hole-punch in XFS.
> This short series adds analogous support to ext4.
> 
> I've added calls to ext4_break_layouts() everywhere that ext4 removes
> blocks from an inode's map.
> 
> The timings in XFS are such that it's difficult to hit this race.  Dan
> was able to show the race by manually introducing delays in the direct
> I/O path.
> 
> For ext4, though, it's trivial to hit this race, and a hit will result in
> a trigger of this WARN_ON_ONCE() in dax_disassociate_entry():
> 
> WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> 
> I've made an xfstest which tests all the paths where we now call
> ext4_break_layouts(). Each of the four paths easily hits this race many
> times in my test setup with the xfstest.  You can find that test here:
> 
> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016435.html
> 
> With these patches applied, I've still seen occasional hits of the above
> WARN_ON_ONCE(), which tells me that we still have some work to do.  I'll
> continue looking at these rarer hits.

One last ping on this - do we want to merge this for v4.19?  I've tracked
down the rarer warnings, and have reported the race I'm seeing here:

https://lists.01.org/pipermail/linux-nvdimm/2018-July/017205.html

AFAICT the race exists equally for XFS and ext4, and I'm not sure how to solve
it easily.  Essentially it seems like we need to synchronize not just page
faults but calls to get_page() with truncate/hole punch operations, else we
can have the reference count for a given DAX page going up and down while we
are in the middle of a truncate.  I'm not sure if this is even feasible.

The equivalent code for this series already exists in XFS, so taking the
series now gets ext4 and XFS on the same footing moving forward.  I'm in
favor of merging it now, though I can see the argument that it's not a
complete solution.

Thoughts?


Re: Help trying to use /dev/pmem for dax debugging?

2018-07-31 Thread Ross Zwisler
On Mon, Jul 30, 2018 at 07:53:12PM -0400, Theodore Y. Ts'o wrote:
> In newer kernels, it looks like you can't use /dev/pmem0 for DAX
> unless it's marked as being DAX capable.  This appears to require
> CONFIG_NVDIMM_PFN.  But when I tried to build a kernel with that
> configured, I get the following BUG:
> 
> [0.00] Linux version 4.18.0-rc4-xfstests-00031-g7c2d77aa7d80 (tytso@cwcc) (gcc version 7.3.0 (Debian 7.3.0-27)) #460 SMP Mon Jul 30 19:38:44 EDT 2018
> [0.00] Command line: systemd.show_status=auto systemd.log_level=crit root=/dev/vda console=ttyS0,115200 cmd=maint fstesttz=America/New_York fstesttyp=ext4 fstestapi=1.4 memmap=4G!9G memmap=9G!14G

Hey Ted,

You're using the memmap kernel command line parameter to reserve normal
memory to be treated as persistent memory, but you've also got kernel
address randomization turned on in your kernel config:

CONFIG_RANDOMIZE_BASE=y
CONFIG_RANDOMIZE_MEMORY=y

You need to turn these off when using the memmap kernel command line
parameter, else the memory we're reserving could overlap with addresses
used for other things.

Once that is off you probably want to double check that the addresses you're
reserving are marked as 'usable' in the e820 table.  Gory details here, sorry
for the huge link:

https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
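As a hypothetical worked example of the memmap=<size>!<offset> syntax used on
the command line above (the shell variable names below are mine, not from any
tool): the kernel treats [offset, offset + size) as emulated persistent
memory, and that physical range should fall inside a region the e820 table
reports as 'usable', which you can eyeball with `dmesg | grep e820 | grep usable`.

```shell
# memmap=4G!9G from the boot line above reserves the 4 GiB starting at 9 GiB
size_gib=4
offset_gib=9
end_gib=$((offset_gib + size_gib))
echo "memmap=${size_gib}G!${offset_gib}G reserves ${offset_gib}G-${end_gib}G"
```

This prints the range "9G-13G"; repeat the arithmetic for each memmap=
argument and check each range against the 'usable' e820 regions.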

- Ross


[fstests PATCH v3] generic/999: test DAX DMA vs truncate/hole-punch

2018-07-27 Thread Ross Zwisler
From: Ross Zwisler 

This adds a regression test for the following series:

https://lists.01.org/pipermail/linux-nvdimm/2018-July/016842.html

which adds synchronization between DAX DMA in ext4 and truncate/hole-punch.
The intention of the test is to test those specific changes, but it runs
fine both with XFS and without DAX, so I've put it in the generic tests
instead of ext4 and have not restricted it to DAX-only configurations.

When run with v4.18-rc6 + DAX + ext4, this test will hit the following
WARN_ON_ONCE() in dax_disassociate_entry():

WARN_ON_ONCE(trunc && page_ref_count(page) > 1);

If you change this to a WARN_ON() instead, you can see that each of the
four paths being exercised in this test hits that condition many times in
the one second that the subtest is being run.

Signed-off-by: Ross Zwisler 
---

Changes since v2:
 - Added detailed description to tests/generic/999 explaining the purpose
   of the test (Eryu).

 - Added _require_xfs_io_command for falloc, fpunch, fcollapse and fzero
   to account for filesystems that don't support these commands (Eryu).

 - Incorporated other feedback from Eryu.  Thanks, Eryu, for the review.

---
 .gitignore |   1 +
 src/Makefile   |   2 +-
 src/t_mmap_collision.c | 235 +
 tests/generic/999  |  64 ++
 tests/generic/999.out  |   2 +
 tests/generic/group|   1 +
 6 files changed, 304 insertions(+), 1 deletion(-)
 create mode 100644 src/t_mmap_collision.c
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out

diff --git a/.gitignore b/.gitignore
index efc73a7c..ea1aac8a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -125,6 +125,7 @@
 /src/t_holes
 /src/t_immutable
 /src/t_locks_execve
+/src/t_mmap_collision
 /src/t_mmap_cow_race
 /src/t_mmap_dio
 /src/t_mmap_fallocate
diff --git a/src/Makefile b/src/Makefile
index 9e971bcc..41826585 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -16,7 +16,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
holetest t_truncate_self t_mmap_dio af_unix t_mmap_stale_pmd \
t_mmap_cow_race t_mmap_fallocate fsync-err t_mmap_write_ro \
t_ext4_dax_journal_corruption t_ext4_dax_inline_corruption \
-   t_ofd_locks t_locks_execve
+   t_ofd_locks t_locks_execve t_mmap_collision
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
diff --git a/src/t_mmap_collision.c b/src/t_mmap_collision.c
new file mode 100644
index ..d547bc05
--- /dev/null
+++ b/src/t_mmap_collision.c
@@ -0,0 +1,235 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2018 Intel Corporation.
+ *
+ * As of kernel version 4.18-rc6 Linux has an issue with ext4+DAX where DMA
+ * and direct I/O operations aren't synchronized with respect to operations
+ * which can change the block mappings of an inode.  This means that we can
+ * schedule an I/O for an inode and have the block mapping for that inode
+ * change before the I/O is actually complete.  So, blocks which were once
+ * allocated to a given inode and then freed could still have I/O operations
+ * happening to them.  If these blocks have also been reallocated to a
+ * different inode, this interaction can lead to data corruption.
+ *
+ * This test exercises four of the paths in ext4 which hit this issue.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define PAGE(a) ((a)*0x1000)
+#define FILE_SIZE PAGE(4)
+
+void *dax_data;
+int nodax_fd;
+int dax_fd;
+bool done;
+
+#define err_exit(op)  \
+{ \
+   fprintf(stderr, "%s %s: %s\n", __func__, op, strerror(errno));\
+   exit(1);  \
+}
+
+#if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
+void punch_hole_fn(void *ptr)
+{
+   ssize_t read;
+   int rc;
+
+   while (!done) {
+   read = 0;
+
+   do {
+   rc = pread(nodax_fd, dax_data + read, FILE_SIZE - read,
+   read);
+   if (rc > 0)
+   read += rc;
+   } while (rc > 0);
+
+   if (read != FILE_SIZE || rc != 0)
+   err_exit("pread");
+
+   rc = fallocate(dax_fd,
+   FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+   0, FILE_SIZE);
+   if (rc < 0)
+   err_exit("fallocate");
+
+   usleep(rand() % 1000);
+   }
+}
+#else
+void punch_hole_fn(void *ptr) { }
+#endif
+
+#if defined(FALLOC_FL_ZERO_RANGE) &&

Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

2018-07-27 Thread Ross Zwisler
+ fsdevel and the xfs list.

On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler
 wrote:
> On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> > On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> > > Changes since v3:
> > >  * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> > >that the {ext4,xfs}_break_layouts() calls have the same meaning.
> > >(Dave, Darrick and Jan)
> >
> > How about the occasional WARN_ON_ONCE you mention below. Were you able to
> > hunt them down?
>
> The root cause of this issue is that while the ei->i_mmap_sem provides
> synchronization between ext4_break_layouts() and page faults, it doesn't
> synchronize us with the direct I/O path.  This exact same issue exists
> in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.
>
> This allows the direct I/O path to do I/O and raise & lower page->_refcount
> while we're executing a truncate/hole punch.  This leads to us trying to free
> a page with an elevated refcount.
>
> Here's one instance of the race:
>
> CPU 0   CPU 1
> -   -
> ext4_punch_hole()
>   ext4_break_layouts() # all pages have refcount=1
>
> ext4_direct_IO()
>   ... lots of layers ...
>   follow_page_pte()
> get_page() # elevates refcount
>
>   truncate_pagecache_range()
>... a few layers ...
>dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()
>
> A similar race occurs when the refcount is being dropped while we're running
> ext4_break_layouts(), and this is the one that my test was actually hitting:
>
> CPU 0   CPU 1
> -   -
> ext4_direct_IO()
>   ... lots of layers ...
>   follow_page_pte()
> get_page()
> # elevates refcount of page X
> ext4_punch_hole()
>   ext4_break_layouts() # two pages, X & Y, have refcount == 2
> __wait_var_event() # called for page X
>
>   __put_devmap_managed_page()
>   # drops refcount of X to 1
>
># __wait_var_events() checks X's refcount in "if (condition)", and breaks.
># We never actually called ext4_wait_dax_page(), so 'retry' in
># ext4_break_layouts() is still false.  Exit do/while loop in
># ext4_break_layouts, never attempting to wait on page Y which still has an
># elevated refcount of 2.
>
>   truncate_pagecache_range()
>... a few layers ...
>dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()
>
> This second race can be fixed with the patch at the end of this message,
> which I think should go in, unless there is a benefit to the current retry
> scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()?
> With this patch applied I've been able to run my unit test through
> thousands of iterations, where it used to fail consistently within 10 or
> so.
>
> Even so, I wonder if the real solution is to add synchronization between
> the direct I/O path and {ext4,xfs}_break_layouts()?  Other ideas on how
> this should be handled?
>
> --- >8 ---
>
> From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
> From: Ross Zwisler 
> Date: Wed, 25 Jul 2018 16:16:05 -0600
> Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()
>
> If the refcount of a page is lowered between the time that it is returned
> by dax_busy_page() and when the refcount is again checked in
> ext4_break_layouts() => ___wait_var_event(), the waiting function
> ext4_wait_dax_page() will never be called.  This means that
> ext4_break_layouts() will still have 'retry' set to false, so we'll stop
> looping and never check the refcount of other pages in this inode.
>
> Instead, always continue looping as long as dax_layout_busy_page() gives us
> a page which it found with an elevated refcount.
>
> Note that this works around the race exposed by my unit test, but I think
> that there is another race that needs to be addressed, probably with
> additional synchronization added between direct I/O and
> {ext4,xfs}_break_layouts().
>
> Signed-off-by: Ross Zwisler 
> ---
>  fs/ext4/inode.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)

Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

2018-07-25 Thread Ross Zwisler
On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> > Changes since v3:
> >  * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> >that the {ext4,xfs}_break_layouts() calls have the same meaning.
> >(Dave, Darrick and Jan)
> 
> How about the occasional WARN_ON_ONCE you mention below. Were you able to
> hunt them down?

The root cause of this issue is that while the ei->i_mmap_sem provides
synchronization between ext4_break_layouts() and page faults, it doesn't
synchronize us with the direct I/O path.  This exact same issue exists
in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.

This allows the direct I/O path to do I/O and raise & lower page->_refcount
while we're executing a truncate/hole punch.  This leads to us trying to free
a page with an elevated refcount.

Here's one instance of the race:

CPU 0   CPU 1
-   -
ext4_punch_hole()
  ext4_break_layouts() # all pages have refcount=1

ext4_direct_IO()
  ... lots of layers ...
  follow_page_pte()
get_page() # elevates refcount
  
  truncate_pagecache_range()
   ... a few layers ...
   dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()

A similar race occurs when the refcount is being dropped while we're running
ext4_break_layouts(), and this is the one that my test was actually hitting:

CPU 0   CPU 1
-   -
ext4_direct_IO()
  ... lots of layers ...
  follow_page_pte()
get_page()
# elevates refcount of page X
ext4_punch_hole()
  ext4_break_layouts() # two pages, X & Y, have refcount == 2
__wait_var_event() # called for page X

  __put_devmap_managed_page()
  # drops refcount of X to 1

   # __wait_var_events() checks X's refcount in "if (condition)", and breaks.
   # We never actually called ext4_wait_dax_page(), so 'retry' in
   # ext4_break_layouts() is still false.  Exit do/while loop in
   # ext4_break_layouts, never attempting to wait on page Y which still has an
   # elevated refcount of 2.

  truncate_pagecache_range()
   ... a few layers ...
   dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()

This second race can be fixed with the patch at the end of this message,
which I think should go in, unless there is a benefit to the current retry
scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()?
With this patch applied I've been able to run my unit test through
thousands of iterations, where it used to fail consistently within 10 or
so.

Even so, I wonder if the real solution is to add synchronization between
the direct I/O path and {ext4,xfs}_break_layouts()?  Other ideas on how
this should be handled?

--- >8 ---

From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
From: Ross Zwisler 
Date: Wed, 25 Jul 2018 16:16:05 -0600
Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()

If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
ext4_break_layouts() => ___wait_var_event(), the waiting function
ext4_wait_dax_page() will never be called.  This means that
ext4_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.

Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.

Note that this works around the race exposed by my unit test, but I think
that there is another race that needs to be addressed, probably with
additional synchronization added between direct I/O and
{ext4,xfs}_break_layouts().

Signed-off-by: Ross Zwisler 
---
 fs/ext4/inode.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 27e9643bc13b..fedb88104bbf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4193,9 +4193,8 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
return 0;
 }
 
-static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock)
+static void ext4_wait_dax_page(struct ext4_inode_info *ei)
 {
-   *did_unlock = true;
 	up_write(&ei->i_mmap_sem);

Re: [PATCH v2 0/6] kaddr and pfn can be NULL to ->direct_access()

2018-07-25 Thread Ross Zwisler
On Thu, Jul 26, 2018 at 12:28:43AM +0800, Huaisheng Ye wrote:
> From: Huaisheng Ye 
> 
> Changes since v1 [1]:
> * Involve the previous patches for pfn can be NULL.
> * Reword the patch descriptions according to Christian's comment.
> * According to Ross's suggestion, replace local pointer dummy_addr
>   with NULL within md/dm-writecache for direct_access.
> 
> [1]: https://lkml.org/lkml/2018/7/24/199
> 
> Some functions within fs/dax, dax/super and md/dm-writecache don't
> need to get local pointer kaddr or variable pfn from direct_access.
> Assigning NULL to kaddr or pfn to ->direct_access() is more
> straightforward and simple than offering a useless local pointer or
> variable.
> 
> So all ->direct_access() implementations need to check the pointers
> kaddr and pfn for NULL. If either of them is NULL, the caller has no
> need for that value, so this series of patches allows callers to pass
> in NULL instead of having to pass in a local pointer or variable that
> they then just throw away.

Looks good.  For the series:

Reviewed-by: Ross Zwisler 


Re: [PATCH 0/5] Do not request a pointer kaddr when not required

2018-07-24 Thread Ross Zwisler
On Tue, Jul 24, 2018 at 04:45:05PM +0800, Huaisheng Ye wrote:
> From: Huaisheng Ye 
> 
> Some functions within fs/dax and dax/super don't need to get kaddr from
> direct_access. Assigning NULL to kaddr to ->direct_access() is more
> straightforward and simple than offering a useless local pointer.
> 
> So all direct_access() implementations need to check the second-rank
> pointer kaddr for NULL. If kaddr is NULL, they don't need to calculate
> its value.
> 
> * This series are supplement to [PATCH v2 00/14]mm: Asynchronous +
>   multithreaded memmap init for ZONE_DEVICE. [1]
> 
> [1]: https://lkml.org/lkml/2018/7/16/828

This whole series looks good to me.  Just a few comments:

1) Does this series actually depend on the "Asynchronous + multithreaded
memmap init for ZONE_DEVICE" series from Dan?  It seems totally independent
to me.
I reviewed yours by applying to linux/master, which worked fine.  I ask
because Dan's series has been delayed to after v4.19, and if yours isn't
actually dependent it could possibly go in sooner.

2) I agree with Christian's comment that the changelogs could be improved
slightly.  Remember that the goal of the changelog isn't to describe *what*
the code is doing, but *why*.  We can read that the code now checks if 'kaddr'
is NULL, and if so we don't calculate it.  It's useful to say that callers may
have no need for 'kaddr', so this patch is prep for allowing them to pass in
NULL instead of having to pass in a pointer that they then just throw away.

3) I think you should make one more change to kill the unused 'dummy_addr'
variable in persistent_memory_claim().  That was the one last case of a dummy
'kaddr' type variable that I could find.


Re: [PATCH 1/5] radix tree test suite: fix mapshift build target

2018-07-21 Thread Ross Zwisler
On Sun, Jul 22, 2018 at 09:45:50AM +1000, Dave Chinner wrote:
> On Mon, Jul 16, 2018 at 07:41:32PM -0700, Matthew Wilcox wrote:
> > On Mon, Jul 16, 2018 at 03:08:20PM -0600, Ross Zwisler wrote:
> > > On Mon, Jul 16, 2018 at 12:52:49PM -0700, Matthew Wilcox wrote:
> > > > On Mon, Jul 16, 2018 at 10:07:10AM -0600, Ross Zwisler wrote:
> > > <>
> > > > OK ... what version of make are you using?  Because this works fine for 
> > > > me:
> > > > 
> > > > $ git clone linux clean
> > > > $ cd clean
> > > > $ git checkout v4.17
> > > > $ cd tools/testing/radix-tree/ 
> > > > $ git revert 8d9fa88edd5e360b71765feeadb915d4066c9684
> > > > $ make
> > > > 
> > > > $ make --version
> > > > GNU Make 4.1
> > > > Built for x86_64-pc-linux-gnu
> > > > 
> > > > It's Debian's Version: 4.1-9.1
> > > 
> > > $ make --version
> > > GNU Make 4.2.1
> > > Built for x86_64-redhat-linux-gnu
> > > 
> > > The one from Fedora 27.
> > 
> > Huh.  I just tried 4.2.1-1.1 from Debian unstable and that doesn't
> > produce the problem either.  I'm not sure how to proceed at this point.
> > I'm really not a makefile expert.
> 
> This smells like a problem we just hit with make 4.2.1 in fedora 28
> in fstests - the regex expanstion has been screwed up such that
> things like [a-z] will match [A-Z] and other things as well. Debian
> is unaffected, apparently fedora has a backport of stuff from the
> as-yet-unreleased next version of make/glibc. See this thread:
> 
> https://www.spinics.net/lists/fstests/msg10200.html
> 
> Try setting LANG=C and seeing if the problem goes away

Hey Dave, we root-caused this difference to the fact that I had 'make'
aliased to 'make -j32' in my .bashrc.  Matthew was running a single-threaded
build, while I was running a multi-threaded one.  When Matthew ran a
multi-threaded build, he was able to reproduce the issue.

So, essentially I think that the makefile just needs to be enhanced so that
all the dependencies are explicit to allow multi-threaded builds to work
properly.
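A sketch of the kind of explicit dependency that makes parallel builds order
correctly - the target names echo this thread, but the object list, recipe
bodies, and the FORCE/compare idiom are illustrative, not the actual
tools/testing/radix-tree Makefile:

```make
# Illustrative only: generated/map-shift.h must exist (and match the
# requested SHIFT) before any object that includes it is compiled, even
# under 'make -jN'.
SHIFT ?= 3
OFILES := main.o linux.o test.o regression1.o

$(OFILES): generated/map-shift.h

# Re-checked on every build; the file is only rewritten when SHIFT changed,
# so objects rebuild only when the shift actually changes.
generated/map-shift.h: FORCE
	@mkdir -p generated
	@echo '#define RADIX_TREE_MAP_SHIFT $(SHIFT)' > $@.tmp
	@cmp -s $@.tmp $@ || mv $@.tmp $@; rm -f $@.tmp

FORCE:
```

Listing the header as a prerequisite of every object is what removes the
implicit ordering assumption that a serial 'make' was hiding.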


Re: [ndctl PATCH] Documentation, create-namespace: clarify fsdax wording

2018-07-19 Thread Ross Zwisler
On Thu, Jul 19, 2018 at 6:17 PM Vishal Verma  wrote:
>
> Reword the description of the fsdax mode in the create-namespace
> documentation to better distinguish when DAX can be used vs. how the raw
> block device will use the page cache.
>
> Link: https://github.com/pmem/issues/issues/915
> Cc: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  Documentation/ndctl/ndctl-create-namespace.txt | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/ndctl/ndctl-create-namespace.txt 
> b/Documentation/ndctl/ndctl-create-namespace.txt
> index 4b8b0d1..343733d 100644
> --- a/Documentation/ndctl/ndctl-create-namespace.txt
> +++ b/Documentation/ndctl/ndctl-create-namespace.txt
> @@ -68,10 +68,11 @@ OPTIONS
>   comes at the cost of allocating per-page metadata. The
>   capacity can be allocated from "System RAM", or from a
>   reserved portion of "Persistent Memory" (see the --map=
> - option).  Note that a filesystem is required for dax
> - operation, the resulting raw block device (/dev/pmemX) will
> - use the page cache. See "devdax" mode for raw device access
> - that supports dax.
> + option).  NOTE: A filesystem that supports DAX is required
> + for dax operation. If the raw block device (/dev/pmemX) is
> + used directly without a filesystem, it will use the page
> + cache. See "devdax" mode for raw device access that supports
> + dax.

Looks good, you can add:

Reviewed-by: Ross Zwisler 


[PATCH 2/2] MAINTAINERS: Add Jan Kara for filesystem DAX

2018-07-19 Thread Ross Zwisler
Jan has been developing and reviewing filesystem DAX related changes for
quite a while now, and has agreed to help maintain this code going forward.

Thanks, Jan!

Signed-off-by: Ross Zwisler 
Cc: Jan Kara 
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 1d56b802a0fb..746a31f97353 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4274,6 +4274,7 @@ F:drivers/i2c/busses/i2c-diolan-u2c.c
 FILESYSTEM DIRECT ACCESS (DAX)
 M: Matthew Wilcox 
 M: Ross Zwisler 
+M: Jan Kara 
 L: linux-fsde...@vger.kernel.org
 S: Supported
 F: fs/dax.c
-- 
2.14.4



[PATCH 0/2] Update mail addresses, add FS DAX maintainer

2018-07-19 Thread Ross Zwisler
This series updates my preferred email for kernel work to my
zwis...@kernel.org mail address, and adds Jan as a maintainer for
filesystem DAX.  Thank you, Jan, for helping to keep filesystem DAX in a
happy state.

We are planning on pushing this up through the libnvdimm tree for the
v4.19 merge window.

Ross Zwisler (2):
  MAINTAINERS: update Ross Zwisler's email address
  MAINTAINERS: Add Jan Kara for filesystem DAX

 .mailmap|  1 +
 MAINTAINERS | 13 +++--
 2 files changed, 8 insertions(+), 6 deletions(-)

-- 
2.14.4



Re: [PATCH 1/5] radix tree test suite: fix mapshift build target

2018-07-17 Thread Ross Zwisler
On Mon, Jul 16, 2018 at 08:18:11PM -0700, Matthew Wilcox wrote:
> On Mon, Jul 16, 2018 at 10:07:10AM -0600, Ross Zwisler wrote:
> > Incidentally, in the current linux/master the radix tree test suite again
> > fails to build:
> > 
> >   $ make
> >   sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < 
> > ../../../lib/radix-tree.c > radix-tree.c
> >   sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < 
> > ../../../lib/idr.c > idr.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o main.o main.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o linux.o linux.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o test.o test.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o find_bit.o ../../lib/find_bit.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o regression1.o regression1.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o regression2.o regression2.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o regression3.o regression3.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o tag_check.o tag_check.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o multiorder.o multiorder.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o idr-test.o idr-test.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o iteration_check.o iteration_check.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o benchmark.o benchmark.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o idr.o idr.c
> >   cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   
> > -c -o radix-tree.o radix-tree.c
> >   idr.c:7:10: fatal error: linux/xarray.h: No such file or directory
> >#include 
> > ^~~~
> >   compilation terminated.
> 
> Umm.  I think I know the problem here.  I have a suspicion that either
> Fedora or you have changed make to be parallel by default (or you're
> lying to me and saying you typed 'make' when you actually typed 'make
> -j4', but I'm pretty sure you wouldn't do that).  Because there's no
> way you'd get this output if you were compiling with make -j1.
> 
> Indeed, if I revert your commit and then build with make -j4, I see the
> same error as you.  I'll look at how to fix this properly tomorrow.

Ah, yep.

  $ alias make
  alias make='make -j32'

You've found it. :)


Re: [PATCH 1/5] radix tree test suite: fix mapshift build target

2018-07-16 Thread Ross Zwisler
On Mon, Jul 16, 2018 at 12:52:49PM -0700, Matthew Wilcox wrote:
> On Mon, Jul 16, 2018 at 10:07:10AM -0600, Ross Zwisler wrote:
<>
> OK ... what version of make are you using?  Because this works fine for me:
> 
> $ git clone linux clean
> $ cd clean
> $ git checkout v4.17
> $ cd tools/testing/radix-tree/ 
> $ git revert 8d9fa88edd5e360b71765feeadb915d4066c9684
> $ make
> 
> $ make --version
> GNU Make 4.1
> Built for x86_64-pc-linux-gnu
> 
> It's Debian's Version: 4.1-9.1

$ make --version
GNU Make 4.2.1
Built for x86_64-redhat-linux-gnu

The one from Fedora 27.

> > If you want generated/map-shift.h to be rebuilt each time you run 'make' so
> > that it can take a new SHIFT argument, that's fine, but let's not make users
> > run 'make mapshift' before an actual 'make' will work, which is where
> > we're at with v4.17 with my commit reverted.
> 
> I don't want it to be rebuilt, I want it to be checked before each
> build, regenerated if SHIFT has changed, and everything to rebuild if
> it has changed.

Sure, sounds good.
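
The behavior agreed on above — check the generated header on every build, rewrite it only when SHIFT actually changed, and let make's normal mtime tracking rebuild the dependents — is commonly done with a compare-and-move idiom. A minimal sketch of that idiom; the file and macro names mirror the radix-tree test suite but are illustrative:

```shell
# Regenerate the candidate header into a temp file, then only replace the
# real header (bumping its mtime and so triggering a rebuild) if the
# contents differ from what is already there.
SHIFT="${SHIFT:-3}"
mkdir -p generated
printf '#define RADIX_TREE_MAP_SHIFT %s\n' "$SHIFT" > generated/map-shift.h.tmp
if cmp -s generated/map-shift.h.tmp generated/map-shift.h 2>/dev/null; then
	rm generated/map-shift.h.tmp    # unchanged: keep old mtime, no rebuild
else
	mv generated/map-shift.h.tmp generated/map-shift.h   # changed: rebuild
fi
cat generated/map-shift.h
```

Wired into a Makefile as the recipe for `generated/map-shift.h` (marked to run every time), this gives "checked before each build, regenerated only if SHIFT has changed".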


Re: [PATCH 1/5] radix tree test suite: fix mapshift build target

2018-07-16 Thread Ross Zwisler
On Sun, Jul 15, 2018 at 04:00:52PM -0700, Matthew Wilcox wrote:
> On Thu, May 03, 2018 at 01:24:26PM -0600, Ross Zwisler wrote:
> > The following commit
> > 
> >   commit c6ce3e2fe3da ("radix tree test suite: Add config option for map
> >   shift")
> > 
> > Introduced a phony makefile target called 'mapshift' that ends up
> > generating the file generated/map-shift.h.  This phony target was then
> > added as a dependency of the top level 'targets' build target, which is
> > what is run when you go to tools/testing/radix-tree and just type 'make'.
> > 
> > Unfortunately, this phony target doesn't actually work as a dependency, so
> > you end up getting:
> > 
> > $ make
> > make: *** No rule to make target 'generated/map-shift.h', needed by 
> > 'main.o'.  Stop.
> > make: *** Waiting for unfinished jobs
> > 
> > Fix this by making the file generated/map-shift.h our real makefile target,
> > and add this a dependency of the top level build target.
> 
> This commit breaks typing 'make SHIFT=6'.  It doesn't rebuild the
> test suite any more.  If I revert this patch, it works.  Also, I can't
> reproduce the problem you're reporting here.  So ... how do I reproduce
> it?  Otherwise, I'm just going to revert this patch since it regresses
> a feature I find useful.

The test suite builds fine for me in v4.17.  From a completely clean tree, in
tools/testing/radix-tree:

  $ make
  sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < 
../../../lib/radix-tree.c > radix-tree.c
  sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < 
../../../lib/idr.c > idr.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
main.o main.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
linux.o linux.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
test.o test.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
find_bit.o ../../lib/find_bit.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
regression1.o regression1.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
regression2.o regression2.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
regression3.o regression3.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
tag_check.o tag_check.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
multiorder.o multiorder.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
idr-test.o idr-test.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
iteration_check.o iteration_check.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
benchmark.o benchmark.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
idr.o idr.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
radix-tree.o radix-tree.c
  cc -fsanitize=address  multiorder.o radix-tree.o idr.o linux.o test.o 
find_bit.o  -lpthread -lurcu -o multiorder
  cc -fsanitize=address  main.o radix-tree.o idr.o linux.o test.o find_bit.o 
regression1.o regression2.o regression3.o tag_check.o multiorder.o idr-test.o 
iteration_check.o benchmark.o  -lpthread -lurcu -o main
  cc -fsanitize=address  idr-test.o radix-tree.o idr.o linux.o test.o 
find_bit.o  -lpthread -lurcu -o idr-test

and you can successfully run the radix tree test suite by running 'main'.

With the above mentioned commit reverted, this build fails:

  $ make
  make: *** No rule to make target 'generated/map-shift.h', needed by 'main.o'. 
 Stop.
  make: *** Waiting for unfinished jobs

If you want generated/map-shift.h to be rebuilt each time you run 'make' so
that it can take a new SHIFT argument, that's fine, but let's not make users
run 'make mapshift' before an actual 'make' will work, which is where we're at
with v4.17 with my commit reverted.

Incidentally, in the current linux/master the radix tree test suite again
fails to build:

  $ make
  sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < 
../../../lib/radix-tree.c > radix-tree.c
  sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < 
../../../lib/idr.c > idr.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
main.o main.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
linux.o linux.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
test.o test.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address   -c -o 
find_bit.o ../../lib/find_bit.c
  cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -

Re: [ndctl PATCH 1/2] ndctl, test: fix tests for the array vs object listing fix

2018-07-12 Thread Ross Zwisler
On Wed, Jul 11, 2018 at 05:56:43PM -0600, Vishal Verma wrote:
> The commit below updated json listings to always be arrays unless,
> potentially, --human was specified. As a fallout of the change, some
> unit tests that used jq to look for certain elements, or the json2var
> conversion broke in certain cases. Fix the jq query in sector-mode.sh,
> and fix json2var in test/common. The 'destructive' class of unit tests
> still need to be converted to the test/common scheme, and subsequently
> have their own local json2var instances. These will be fixed in a future
> commit by simply performing the test/common conversion.
> 
> Fixes: 72c46ab194d9 ("ndctl list: always output array without --human")
> Cc: Ross Zwisler 
> Cc: Dan Williams 
> Signed-off-by: Vishal Verma 

Nice, thanks for the fixes.


Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

2018-07-11 Thread Ross Zwisler
On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> > Changes since v3:
> >  * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> >that the {ext4,xfs}_break_layouts() calls have the same meaning.
> >(Dave, Darrick and Jan)
> 
> How about the occasional WARN_ON_ONCE you mention below. Were you able to
> hunt them down?

That's next on my list of things to look at. :)


Re: [ndctl PATCHv3] ndctl: Add 'list' verbose options

2018-07-10 Thread Ross Zwisler
On Fri, Jul 06, 2018 at 11:02:11AM -0600, Keith Busch wrote:
> The informational and miscellaneous flag options are becoming more
> numerous, and can be difficult to remember what can be listed. This
> patch adds a 'verbose' option that increases the detail listed by
> automatically enabling options and less essential information.
> 
> The verbose option can be repeated multiple times to increase the
> detail. There are currently three levels of verbose with output detail
> documented in the ndctl list manpage.
> 
> Signed-off-by: Keith Busch 
> ---
<>
> @@ -112,7 +115,7 @@ static struct json_object *region_to_json(struct 
> ndctl_region *region,
>   numa = ndctl_region_get_numa_node(region);
>   if (numa >= 0) {
>   jobj = json_object_new_int(numa);
> - if (jobj)
> + if (jobj && flags & UTIL_JSON_VERBOSE)
>   json_object_object_add(jregion, "numa_node", jobj);
>   }

This same change needs to be repeated in the other place where we print the
numa_node, in util_namespace_to_json().

The rest looks good.  You can add:

Reviewed-by: Ross Zwisler 


[fstests PATCH v2 2/2] generic/999: test DAX DMA vs truncate/hole-punch

2018-07-10 Thread Ross Zwisler
This adds a regression test for the following series:

https://lists.01.org/pipermail/linux-nvdimm/2018-July/016842.html

which adds synchronization between DAX DMA in ext4 and truncate/hole-punch.
The intention of the test is to test those specific changes, but it runs
fine both with XFS and without DAX so I've put it in the generic tests
instead of ext4 and not restricted it to only DAX configurations.

When run with v4.18-rc4 + DAX + ext4, this test will hit the following
WARN_ON_ONCE() in dax_disassociate_entry():

WARN_ON_ONCE(trunc && page_ref_count(page) > 1);

If you change this to a WARN_ON() instead, you can see that each of the
four paths being exercised in this test hits that condition many times in
the one second that the subtest is being run.

Signed-off-by: Ross Zwisler 
---
 .gitignore |   1 +
 src/Makefile   |   2 +-
 src/t_mmap_collision.c | 235 +
 tests/generic/999  |  53 +++
 tests/generic/999.out  |   2 +
 tests/generic/group|   1 +
 6 files changed, 293 insertions(+), 1 deletion(-)
 create mode 100644 src/t_mmap_collision.c
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out

diff --git a/.gitignore b/.gitignore
index efc73a7c..ea1aac8a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -125,6 +125,7 @@
 /src/t_holes
 /src/t_immutable
 /src/t_locks_execve
+/src/t_mmap_collision
 /src/t_mmap_cow_race
 /src/t_mmap_dio
 /src/t_mmap_fallocate
diff --git a/src/Makefile b/src/Makefile
index 9e971bcc..41826585 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -16,7 +16,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
holetest t_truncate_self t_mmap_dio af_unix t_mmap_stale_pmd \
t_mmap_cow_race t_mmap_fallocate fsync-err t_mmap_write_ro \
t_ext4_dax_journal_corruption t_ext4_dax_inline_corruption \
-   t_ofd_locks t_locks_execve
+   t_ofd_locks t_locks_execve t_mmap_collision
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
diff --git a/src/t_mmap_collision.c b/src/t_mmap_collision.c
new file mode 100644
index ..5f58d858
--- /dev/null
+++ b/src/t_mmap_collision.c
@@ -0,0 +1,235 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2018 Intel Corporation.
+ *
+ * As of kernel version 4.18-rc4, Linux has an issue with ext4+DAX where DMA
+ * and direct I/O operations aren't synchronized with respect to operations
+ * which can change the block mappings of an inode.  This means that we can
+ * schedule an I/O for an inode and have the block mapping for that inode
+ * change before the I/O is actually complete.  So, blocks which were once
+ * allocated to a given inode and then freed could still have I/O operations
+ * happening to them.  If these blocks have also been reallocated to a
+ * different inode, this interaction can lead to data corruption.
+ *
+ * This test exercises four of the paths in ext4 which hit this issue.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define PAGE(a) ((a)*0x1000)
+#define FILE_SIZE PAGE(4)
+
+void *dax_data;
+int nodax_fd;
+int dax_fd;
+bool done;
+
+#define err_exit(op)  \
+{ \
+   fprintf(stderr, "%s %s: %s\n", __func__, op, strerror(errno));\
+   exit(1);  \
+} \
+
+#if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
+void punch_hole_fn(void *ptr)
+{
+   ssize_t read;
+   int rc;
+
+   while (!done) {
+   read = 0;
+
+   do {
+   rc = pread(nodax_fd, dax_data + read, FILE_SIZE - read,
+   read);
+   if (rc > 0)
+   read += rc;
+   } while (rc > 0);
+
+   if (read != FILE_SIZE || rc != 0)
+   err_exit("pread");
+
+   rc = fallocate(dax_fd,
+   FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+   0, FILE_SIZE);
+   if (rc < 0)
+   err_exit("fallocate");
+
+   usleep(rand() % 1000);
+   }
+}
+#else
+void punch_hole_fn(void *ptr) { }
+#endif
+
+#if defined(FALLOC_FL_ZERO_RANGE) && defined(FALLOC_FL_KEEP_SIZE)
+void zero_range_fn(void *ptr)
+{
+   ssize_t read;
+   int rc;
+
+   while (!done) {
+   read = 0;
+
+   do {
+   rc = pread(nodax_fd, dax_data + read, FILE_SIZE - read,
+  

[fstests PATCH v2 1/2] src/: add license and copyright info to files

2018-07-10 Thread Ross Zwisler
Add copyright and license info to files that I've authored in src/.

Signed-off-by: Ross Zwisler 
---
 src/t_ext4_dax_inline_corruption.c  | 2 ++
 src/t_ext4_dax_journal_corruption.c | 2 ++
 src/t_mmap_cow_race.c   | 2 ++
 src/t_mmap_stale_pmd.c  | 2 ++
 4 files changed, 8 insertions(+)

diff --git a/src/t_ext4_dax_inline_corruption.c 
b/src/t_ext4_dax_inline_corruption.c
index b52bcc0d..e1a39a6c 100644
--- a/src/t_ext4_dax_inline_corruption.c
+++ b/src/t_ext4_dax_inline_corruption.c
@@ -1,3 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Intel Corporation. */
 #include 
 #include 
 #include 
diff --git a/src/t_ext4_dax_journal_corruption.c 
b/src/t_ext4_dax_journal_corruption.c
index fccef8f5..ba7a96e4 100644
--- a/src/t_ext4_dax_journal_corruption.c
+++ b/src/t_ext4_dax_journal_corruption.c
@@ -1,3 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Intel Corporation. */
 #include 
 #include 
 #include 
diff --git a/src/t_mmap_cow_race.c b/src/t_mmap_cow_race.c
index 207ba425..5941fcf3 100644
--- a/src/t_mmap_cow_race.c
+++ b/src/t_mmap_cow_race.c
@@ -1,3 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2017 Intel Corporation. */
 #include 
 #include 
 #include 
diff --git a/src/t_mmap_stale_pmd.c b/src/t_mmap_stale_pmd.c
index 6a52201c..a139d8ac 100644
--- a/src/t_mmap_stale_pmd.c
+++ b/src/t_mmap_stale_pmd.c
@@ -1,3 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2017 Intel Corporation. */
 #include 
 #include 
 #include 
-- 
2.14.4



[fstests PATCH v2 0/2] Add test for DAX+ext4+DMA race

2018-07-10 Thread Ross Zwisler
As of kernel version 4.18-rc4, Linux has an issue with ext4+DAX where DMA
and direct I/O operations aren't synchronized with respect to operations
which can change the block mappings of an inode.  This can potentially
lead to data corruption, and is fixed in the following kernel series:

https://lists.01.org/pipermail/linux-nvdimm/2018-July/016842.html

These patches add a regression test for that series.

---

Changes since v1:
 * Added copyright and license information to files I've previously
   authored in src/.

 * Added copyright, license information and info about the test's
   purpose to src/t_mmap_collision.c. (Dave)

 * Added comments to tests/generic/999 explaining why we unset the
   mount flags for SCRATCH_MNT. (Dave)

 * Subtests in src/t_mmap_collision.c are now conditional based on the
   existence of the necessary FALLOC_FL_* flags. (Eryu)

 * Added the test to the punch, collapse and zero groups. (Eryu)

Ross Zwisler (2):
  src/: add license and copyright info to files
  generic/999: test DAX DMA vs truncate/hole-punch

 .gitignore  |   1 +
 src/Makefile|   2 +-
 src/t_ext4_dax_inline_corruption.c  |   2 +
 src/t_ext4_dax_journal_corruption.c |   2 +
 src/t_mmap_collision.c  | 235 
 src/t_mmap_cow_race.c   |   2 +
 src/t_mmap_stale_pmd.c  |   2 +
 tests/generic/999   |  53 
 tests/generic/999.out   |   2 +
 tests/generic/group |   1 +
 10 files changed, 301 insertions(+), 1 deletion(-)
 create mode 100644 src/t_mmap_collision.c
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out

-- 
2.14.4



Re: [fstests PATCH 2/2] generic/999: test DAX DMA vs truncate/hole-punch

2018-07-10 Thread Ross Zwisler
On Thu, Jun 21, 2018 at 10:40:33AM +0800, Eryu Guan wrote:
> On Wed, Jun 20, 2018 at 04:51:47PM -0600, Ross Zwisler wrote:
<>

> > diff --git a/tests/generic/999 b/tests/generic/999
> > new file mode 100755
> > index ..8f488cb5
> > --- /dev/null
> > +++ b/tests/generic/999
> > @@ -0,0 +1,50 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2018 Intel Corporation.  All Rights Reserved.
> > +#
> > +# FS QA Test generic/999
> > +#
> > +# This is a regression test for kernel patch:
> > +#   ext4: handle layout changes to pinned DAX mapping
> > +# created by Ross Zwisler 
> > +#
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1   # failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +   cd /
> > +   rm -f $tmp.*
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +
> > +# remove previous $seqres.full before test
> > +rm -f $seqres.full
> > +
> > +# Modify as appropriate.
> > +_supported_fs generic
> > +_supported_os Linux
> > +_require_test
> > +_require_test_program "t_mmap_collision"
> 
> _require_xfs_io_command "fpunch"
> _require_xfs_io_command "fcollapse"
> _require_xfs_io_command "fzero"

I don't think I need these because I don't actually require this functionality
from xfs_io.  I did try and use xfs_io for this test, but the way the
threading needs to work I ended up having to write my own C program.  So,
even if xfs_io happens to be old and without fpunch, for example, as long as
fallocate can use FALLOC_FL_PUNCH_HOLE I think the test should do the right
thing.

> > diff --git a/tests/generic/group b/tests/generic/group
> > index 83a6fdab..793f71ed 100644
> > --- a/tests/generic/group
> > +++ b/tests/generic/group
> > @@ -501,3 +501,4 @@
> >  496 auto quick swap
> >  497 auto quick swap collapse
> >  498 auto quick log
> > +999 auto quick dax
> 
> Also need "punch collapse zero" groups.

I'm assuming I should still add my test to these 3 groups, even though I'm not
getting this functionality via xfs_io?

Thank you for the review, I'll address the rest of your comments in v2.


[PATCH v4 2/2] ext4: handle layout changes to pinned DAX mappings

2018-07-10 Thread Ross Zwisler
Follow the lead of xfs_break_dax_layouts() and add synchronization between
operations in ext4 which remove blocks from an inode (hole punch, truncate
down, etc.) and pages which are pinned due to DAX DMA operations.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
Reviewed-by: Lukas Czerner 
---
 fs/ext4/ext4.h |  1 +
 fs/ext4/extents.c  | 17 +
 fs/ext4/inode.c| 46 ++
 fs/ext4/truncate.h |  4 
 4 files changed, 68 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0b127853c584..34bccd64d83d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct 
ext4_iloc *);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
+extern int ext4_break_layouts(struct inode *);
 extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
 extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int 
nblocks);
 extern void ext4_set_inode_flags(struct inode *);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0057fe3f248d..b8161e6b88d1 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, loff_t 
offset,
 * released from page cache.
 */
down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   ret = ext4_break_layouts(inode);
+   if (ret) {
+   up_write(&EXT4_I(inode)->i_mmap_sem);
+   goto out_mutex;
+   }
+
ret = ext4_update_disksize_before_punch(inode, offset, len);
if (ret) {
up_write(&EXT4_I(inode)->i_mmap_sem);
@@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t 
offset, loff_t len)
 * page cache.
 */
down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   ret = ext4_break_layouts(inode);
+   if (ret)
+   goto out_mmap;
+
/*
 * Need to round down offset to be aligned with page size boundary
 * for page size > block size.
@@ -5641,6 +5653,11 @@ int ext4_insert_range(struct inode *inode, loff_t 
offset, loff_t len)
 * page cache.
 */
down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   ret = ext4_break_layouts(inode);
+   if (ret)
+   goto out_mmap;
+
/*
 * Need to round down to align start offset to page size boundary
 * for page size > block size.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2ea07efbe016..fadb8ecacb1e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4193,6 +4193,39 @@ int ext4_update_disksize_before_punch(struct inode 
*inode, loff_t offset,
return 0;
 }
 
+static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock)
+{
+   *did_unlock = true;
+   up_write(&ei->i_mmap_sem);
+   schedule();
+   down_write(&ei->i_mmap_sem);
+}
+
+int ext4_break_layouts(struct inode *inode)
+{
+   struct ext4_inode_info *ei = EXT4_I(inode);
+   struct page *page;
+   bool retry;
+   int error;
+
+   if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem)))
+   return -EINVAL;
+
+   do {
+   retry = false;
+   page = dax_layout_busy_page(inode->i_mapping);
+   if (!page)
+   return 0;
+
+   error = ___wait_var_event(&page->_refcount,
+   atomic_read(&page->_refcount) == 1,
+   TASK_INTERRUPTIBLE, 0, 0,
+   ext4_wait_dax_page(ei, &retry));
+   } while (error == 0 && retry);
+
+   return error;
+}
+
 /*
  * ext4_punch_hole: punches a hole in a file by releasing the blocks
  * associated with the given offset and length
@@ -4266,6 +4299,11 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, 
loff_t length)
 * page cache.
 */
down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   ret = ext4_break_layouts(inode);
+   if (ret)
+   goto out_dio;
+
first_block_offset = round_up(offset, sb->s_blocksize);
last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;
 
@@ -5554,6 +5592,14 @@ int ext4_setattr(struct dentry *dentry, struct iattr 
*attr)
ext4_wait_for_tail_page_commit(inode);
}
down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   rc = ext4_break_layouts(inode);
+   if (rc) {
+   up_write(&EXT4_I(inode)->i_mmap_sem);
+   error = rc;
+   goto err_out;
+   }
+
/*
 * Truncate pagecache after we've waited for commit
 * in data=journal mode to make pages freeab

[PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional

2018-07-10 Thread Ross Zwisler
Inodes using DAX should only ever have exceptional entries in their page
caches.  Make this clear by warning if the iteration in
dax_layout_busy_page() ever sees a non-exceptional entry, and by adding a
comment for the pagevec_release() call which only deals with struct page
pointers.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 641192808bb6..897b51e41d8f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -566,7 +566,8 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping)
if (index >= end)
break;
 
-   if (!radix_tree_exceptional_entry(pvec_ent))
+   if (WARN_ON_ONCE(
+!radix_tree_exceptional_entry(pvec_ent)))
continue;
 
xa_lock_irq(&mapping->i_pages);
@@ -578,6 +579,13 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping)
if (page)
break;
}
+
+   /*
+* We don't expect normal struct page entries to exist in our
+* tree, but we keep these pagevec calls so that this code is
+* consistent with the common pattern for handling pagevecs
+* throughout the kernel.
+*/
pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
index++;
-- 
2.14.4



[PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

2018-07-10 Thread Ross Zwisler
Changes since v3:
 * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
   that the {ext4,xfs}_break_layouts() calls have the same meaning.
   (Dave, Darrick and Jan)

---

This series from Dan:

https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html

added synchronization between DAX dma and truncate/hole-punch in XFS.
This short series adds analogous support to ext4.

I've added calls to ext4_break_layouts() everywhere that ext4 removes
blocks from an inode's map.

The timings in XFS are such that it's difficult to hit this race.  Dan
was able to show the race by manually introducing delays in the direct
I/O path.

For ext4, though, it's trivial to hit this race, and a hit will result in
a trigger of this WARN_ON_ONCE() in dax_disassociate_entry():

WARN_ON_ONCE(trunc && page_ref_count(page) > 1);

I've made an xfstest which tests all the paths where we now call
ext4_break_layouts(). Each of the four paths easily hits this race many
times in my test setup with the xfstest.  You can find that test here:

https://lists.01.org/pipermail/linux-nvdimm/2018-June/016435.html

With these patches applied, I've still seen occasional hits of the above
WARN_ON_ONCE(), which tells me that we still have some work to do.  I'll
continue looking at these more rare hits.

Ross Zwisler (2):
  dax: dax_layout_busy_page() warn on !exceptional
  ext4: handle layout changes to pinned DAX mappings

 fs/dax.c   | 10 +-
 fs/ext4/ext4.h |  1 +
 fs/ext4/extents.c  | 17 +
 fs/ext4/inode.c| 46 ++
 fs/ext4/truncate.h |  4 
 5 files changed, 77 insertions(+), 1 deletion(-)

-- 
2.14.4



[PATCH v2] mm/sparse.c: fix error path in sparse_add_one_section

2018-07-06 Thread Ross Zwisler
The following commit in -next:

commit 054620849110 ("mm/sparse.c: make sparse_init_one_section void and
remove check")

changed how the error handling in sparse_add_one_section() works.

Previously sparse_index_init() could return -EEXIST, and the function would
continue on happily.  'ret' would get unconditionally overwritten by the
result from sparse_init_one_section() and the error code after the 'out:'
label wouldn't be triggered.

With the above referenced commit, though, an -EEXIST error return from
sparse_index_init() now takes us through the function and into the error
case after 'out:'.  This eventually causes a kernel BUG, probably because
we've just freed a memory section that we successfully set up and marked as
present:

  BUG: unable to handle kernel paging request at ea000580
  RIP: 0010:memmap_init_zone+0x154/0x1cf

  Call Trace:
   move_pfn_range_to_zone+0x168/0x180
   devm_memremap_pages+0x29b/0x480
   pmem_attach_disk+0x1ae/0x6c0 [nd_pmem]
   ? devm_memremap+0x79/0xb0
   nd_pmem_probe+0x7e/0xa0 [nd_pmem]
   nvdimm_bus_probe+0x6e/0x160 [libnvdimm]
   driver_probe_device+0x310/0x480
   __device_attach_driver+0x86/0x100
   ? __driver_attach+0x110/0x110
   bus_for_each_drv+0x6e/0xb0
   __device_attach+0xe2/0x160
   device_initial_probe+0x13/0x20
   bus_probe_device+0xa6/0xc0
   device_add+0x41b/0x660
   ? lock_acquire+0xa3/0x210
   nd_async_device_register+0x12/0x40 [libnvdimm]
   async_run_entry_fn+0x3e/0x170
   process_one_work+0x230/0x680
   worker_thread+0x3f/0x3b0
   kthread+0x12f/0x150
   ? process_one_work+0x680/0x680
   ? kthread_create_worker_on_cpu+0x70/0x70
   ret_from_fork+0x3a/0x50

Fix this by clearing 'ret' back to 0 if sparse_index_init() returns
-EEXIST.  This restores the previous behavior.

Signed-off-by: Ross Zwisler 
---
 mm/sparse.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/sparse.c b/mm/sparse.c
index f55e79fda03e..eb188eb6b82d 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -770,6 +770,7 @@ int __meminit sparse_add_one_section(struct pglist_data 
*pgdat,
ret = sparse_index_init(section_nr, pgdat->node_id);
if (ret < 0 && ret != -EEXIST)
return ret;
+   ret = 0;
memmap = kmalloc_section_memmap(section_nr, pgdat->node_id, altmap);
if (!memmap)
return -ENOMEM;
-- 
2.14.4



Re: [PATCH] mm/sparse.c: fix error path in sparse_add_one_section

2018-07-06 Thread Ross Zwisler
On Fri, Jul 06, 2018 at 11:23:27PM +0200, Oscar Salvador wrote:
> On Fri, Jul 06, 2018 at 01:06:58PM -0600, Ross Zwisler wrote:
> > The following commit in -next:
> > 
> > commit 054620849110 ("mm/sparse.c: make sparse_init_one_section void and
> > remove check")
> > 
> > changed how the error handling in sparse_add_one_section() works.
> > 
> > Previously sparse_index_init() could return -EEXIST, and the function would
> > continue on happily.  'ret' would get unconditionally overwritten by the
> > result from sparse_init_one_section() and the error code after the 'out:'
> > label wouldn't be triggered.
> 
> My bad, I missed that.
> 
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index 9574113fc745..d254bd2d3289 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -753,8 +753,12 @@ int __meminit sparse_add_one_section(struct 
> > pglist_data *pgdat,
> >  * plus, it does a kmalloc
> >  */
> > ret = sparse_index_init(section_nr, pgdat->node_id);
> > -   if (ret < 0 && ret != -EEXIST)
> > -   return ret;
> > +   if (ret < 0) {
> > +   if (ret == -EEXIST)
> > +   ret = 0;
> > +   else
> > +   return ret;
> > +   }
> 
> sparse_index_init() can return:
> 
> -ENOMEM, -EEXIST or 0.
> 
> So what about this?:
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index f55e79fda03e..eb188eb6b82d 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -770,6 +770,7 @@ int __meminit sparse_add_one_section(struct pglist_data 
> *pgdat,
> ret = sparse_index_init(section_nr, pgdat->node_id);
> if (ret < 0 && ret != -EEXIST)
> return ret;
> +   ret = 0;
> 
> Does this look more clean?

Sure, that's probably better.

Andrew, what's the easiest way forward?  I can send out a v2, you can fold
this into his previous patch, or something else?


[PATCH] mm/sparse.c: fix error path in sparse_add_one_section

2018-07-06 Thread Ross Zwisler
The following commit in -next:

commit 054620849110 ("mm/sparse.c: make sparse_init_one_section void and
remove check")

changed how the error handling in sparse_add_one_section() works.

Previously sparse_index_init() could return -EEXIST, and the function would
continue on happily.  'ret' would get unconditionally overwritten by the
result from sparse_init_one_section() and the error code after the 'out:'
label wouldn't be triggered.

With the above referenced commit, though, an -EEXIST error return from
sparse_index_init() now takes us through the function and into the error
case after 'out:'.  This eventually causes a kernel BUG, probably because
we've just freed a memory section that we successfully set up and marked as
present:

  BUG: unable to handle kernel paging request at ea000580
  RIP: 0010:memmap_init_zone+0x154/0x1cf

  Call Trace:
   move_pfn_range_to_zone+0x168/0x180
   devm_memremap_pages+0x29b/0x480
   pmem_attach_disk+0x1ae/0x6c0 [nd_pmem]
   ? devm_memremap+0x79/0xb0
   nd_pmem_probe+0x7e/0xa0 [nd_pmem]
   nvdimm_bus_probe+0x6e/0x160 [libnvdimm]
   driver_probe_device+0x310/0x480
   __device_attach_driver+0x86/0x100
   ? __driver_attach+0x110/0x110
   bus_for_each_drv+0x6e/0xb0
   __device_attach+0xe2/0x160
   device_initial_probe+0x13/0x20
   bus_probe_device+0xa6/0xc0
   device_add+0x41b/0x660
   ? lock_acquire+0xa3/0x210
   nd_async_device_register+0x12/0x40 [libnvdimm]
   async_run_entry_fn+0x3e/0x170
   process_one_work+0x230/0x680
   worker_thread+0x3f/0x3b0
   kthread+0x12f/0x150
   ? process_one_work+0x680/0x680
   ? kthread_create_worker_on_cpu+0x70/0x70
   ret_from_fork+0x3a/0x50

Fix this by clearing 'ret' back to 0 if sparse_index_init() returns
-EEXIST.  This restores the previous behavior.

Signed-off-by: Ross Zwisler 
---
 mm/sparse.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index 9574113fc745..d254bd2d3289 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -753,8 +753,12 @@ int __meminit sparse_add_one_section(struct pglist_data 
*pgdat,
 * plus, it does a kmalloc
 */
ret = sparse_index_init(section_nr, pgdat->node_id);
-   if (ret < 0 && ret != -EEXIST)
-   return ret;
+   if (ret < 0) {
+   if (ret == -EEXIST)
+   ret = 0;
+   else
+   return ret;
+   }
memmap = kmalloc_section_memmap(section_nr, pgdat->node_id, altmap);
if (!memmap)
return -ENOMEM;
-- 
2.14.4



Re: [PATCH] mm/sparse: Make sparse_init_one_section void and remove check

2018-07-06 Thread Ross Zwisler
On Mon, Jul 2, 2018 at 12:48 PM Pavel Tatashin
 wrote:
>
> On Mon, Jul 2, 2018 at 11:43 AM  wrote:
> >
> > From: Oscar Salvador 
> >
> > sparse_init_one_section() is being called from two sites:
> > sparse_init() and sparse_add_one_section().
> > The former calls it from a for_each_present_section_nr() loop,
> > and the latter marks the section as present before calling it.
> > This means that when sparse_init_one_section() gets called, we already know
> > that the section is present.
> > So there is no point to double check that in the function.
> >
> > This removes the check and makes the function void.
> >
> > Signed-off-by: Oscar Salvador 
>
> Thank you Oscar.
>
> Reviewed-by: Pavel Tatashin 

It looks like this change breaks "fsdax" mode namespaces in
next-20180705.  The offending commit is:

commit 054620849110 ("mm/sparse.c: make sparse_init_one_section void
and remove check")

Here is the stack trace I get when converting a raw mode namespace to
fsdax mode, and from then on during each reboot as the namespace is
being initialized:

[6.067166] BUG: unable to handle kernel paging request at ea000580
[6.068084] PGD 13ffdd067 P4D 13ffdd067 PUD 13ffdc067 PMD 0
[6.068771] Oops: 0002 [#1] PREEMPT SMP PTI
[6.069262] CPU: 11 PID: 180 Comm: kworker/u24:2 Not tainted
4.18.0-rc3-00193-g054620849110 #1
[6.070440] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
[6.071689] Workqueue: events_unbound async_run_entry_fn
[6.072261] RIP: 0010:memmap_init_zone+0x154/0x1cf
[6.072882] Code: 48 89 c3 48 c1 eb 0c e9 82 00 00 00 48 89 da 48
b8 00 00 00 00 00 ea ff ff b9 10 00 00 00 48 c1 e2 06 48 01 c2 31 c0
48 89 d7  ab 48 b8 ff ff ff ff ff ff 7f 00 48 23 45 c0 c7 42 34 01
00 00
[6.075396] RSP: 0018:c900024bfa70 EFLAGS: 00010246
[6.076052] RAX:  RBX: 00140002 RCX: 0010
[6.076845] RDX: ea000580 RSI:  RDI: ea000580
[6.077604] RBP: c900024bfab0 R08: 0001 R09: 88010eb50d38
[6.078394] R10:  R11:  R12: 
[6.079331] R13: 0004 R14: 0001 R15: 
[6.080274] FS:  () GS:880115a0()
knlGS:
[6.081337] CS:  0010 DS:  ES:  CR0: 80050033
[6.082092] CR2: ea000580 CR3: 02824000 CR4: 06e0
[6.083032] Call Trace:
[6.083371]  move_pfn_range_to_zone+0x168/0x180
[6.083965]  devm_memremap_pages+0x29b/0x480
[6.084550]  pmem_attach_disk+0x1ae/0x6c0 [nd_pmem]
[6.085204]  ? devm_memremap+0x79/0xb0
[6.085714]  nd_pmem_probe+0x7e/0xa0 [nd_pmem]
[6.086320]  nvdimm_bus_probe+0x6e/0x160 [libnvdimm]
[6.086977]  driver_probe_device+0x310/0x480
[6.087543]  __device_attach_driver+0x86/0x100
[6.088136]  ? __driver_attach+0x110/0x110
[6.088681]  bus_for_each_drv+0x6e/0xb0
[6.089190]  __device_attach+0xe2/0x160
[6.089705]  device_initial_probe+0x13/0x20
[6.090266]  bus_probe_device+0xa6/0xc0
[6.090772]  device_add+0x41b/0x660
[6.091249]  ? lock_acquire+0xa3/0x210
[6.091743]  nd_async_device_register+0x12/0x40 [libnvdimm]
[6.092398]  async_run_entry_fn+0x3e/0x170
[6.092921]  process_one_work+0x230/0x680
[6.093455]  worker_thread+0x3f/0x3b0
[6.093930]  kthread+0x12f/0x150
[6.094362]  ? process_one_work+0x680/0x680
[6.094903]  ? kthread_create_worker_on_cpu+0x70/0x70
[6.095574]  ret_from_fork+0x3a/0x50
[6.096069] Modules linked in: nd_pmem nd_btt dax_pmem device_dax
nfit libnvdimm
[6.097179] CR2: ea000580
[6.097764] ---[ end trace a5b8bd6a5500b68c ]---

- Ross


[PATCH] nvdimm docs: remove reference to CONFIG_NFIT_TEST

2018-07-05 Thread Ross Zwisler
This symbol doesn't currently exist.  The instructions on how to compile &
load nfit_test are here:

https://github.com/pmem/ndctl

Signed-off-by: Ross Zwisler 
---
 Documentation/nvdimm/nvdimm.txt | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/Documentation/nvdimm/nvdimm.txt b/Documentation/nvdimm/nvdimm.txt
index e894de69915a..985091a86866 100644
--- a/Documentation/nvdimm/nvdimm.txt
+++ b/Documentation/nvdimm/nvdimm.txt
@@ -248,9 +248,8 @@ by a region device with a dynamically assigned id (REGION0 
- REGION5).
 be contiguous in DPA-space.
 
 This bus is provided by the kernel under the device
-/sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
-the nfit_test.ko module is loaded.  This not only test LIBNVDIMM but the
-acpi_nfit.ko driver as well.
+/sys/devices/platform/nfit_test.0 when the nfit_test.ko module is loaded.
+This not only test LIBNVDIMM but the acpi_nfit.ko driver as well.
 
 
 LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
-- 
2.14.4



Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings

2018-07-05 Thread Ross Zwisler
On Wed, Jul 04, 2018 at 08:59:52PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote:
> > On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote:
> > > On Wed 04-07-18 10:49:23, Dave Chinner wrote:
> > > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote:
> > > > > Follow the lead of xfs_break_dax_layouts() and add synchronization 
> > > > > between
> > > > > operations in ext4 which remove blocks from an inode (hole punch, 
> > > > > truncate
> > > > > down, etc.) and pages which are pinned due to DAX DMA operations.
> > > > > 
> > > > > Signed-off-by: Ross Zwisler 
> > > > > Reviewed-by: Jan Kara 
> > > > > Reviewed-by: Lukas Czerner 
> > > > > ---
> > > > > 
> > > > > Changes since v2:
> > > > >  * Added a comment to ext4_insert_range() explaining why we don't call
> > > > >ext4_break_layouts(). (Jan)
> > > > 
> > > > Which I think is wrong and will cause data corruption.
> > > > 
> > > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, 
> > > > > loff_t offset, loff_t len)
> > > > >   LLONG_MAX);
> > > > >   if (ret)
> > > > >   goto out_mmap;
> > > > > + /*
> > > > > +  * We don't need to call ext4_break_layouts() because we aren't
> > > > > +  * removing any blocks from the inode.  We are just changing 
> > > > > their
> > > > > +  * offset by inserting a hole.
> > > > > +  */
> 
> Does calling ext4_break_layouts from insert range not work?
> 
> It's my understanding that file leases are a mechanism for the
> filesystem to delegate some of its authority over physical space
> mappings to "client" software.  AFAICT it's used for mmap'ing pmem
> directly into userspace and exporting space on shared storage over
> pNFS.  Some day we might use the same mechanism for the similar things
> that RDMA does, or the swapfile code since that's essentially how it
> works today.
> 
> The other part of these file leases is that the filesystem revokes them
> any time it wants to perform a mapping operation on a file.  This breaks
> my mental model of how leases work, and if you commit to this for ext4
> then I'll have to remember that leases are different between xfs and
> ext4.  Worse, since the reason for skipping ext4_break_layouts seems to
> be the implementation detail that "DAX won't care", then someone else
> wiring up pNFS/future RDMA/whatever will also have to remember to put it
> back into ext4 or else kaboom.
> 
> Granted, Dave said all these things already, but I actually feel
> strongly enough to reiterate.

Jan, would you like me to call ext4_break_layouts() in ext4_insert_range() to
keep the lease mechanism consistent between ext4 and XFS, or would you prefer
the s/ext4_break_layouts()/ext4_dax_unmap_pages()/ rename?


Re: [PATCH v2 1/2] dax: dax_layout_busy_page() warn on !exceptional

2018-07-03 Thread Ross Zwisler
On Mon, Jul 02, 2018 at 06:15:03PM -0400, Theodore Y. Ts'o wrote:
> On Wed, Jun 27, 2018 at 03:22:51PM -0600, Ross Zwisler wrote:
> > Inodes using DAX should only ever have exceptional entries in their page
> > caches.  Make this clear by warning if the iteration in
> > dax_layout_busy_page() ever sees a non-exceptional entry, and by adding a
> > comment for the pagevec_release() call which only deals with struct page
> > pointers.
> > 
> > Signed-off-by: Ross Zwisler 
> > Reviewed-by: Jan Kara 
> 
> Thanks, applied (to the ext4 tree).  If someone thinks they should go
> in via some other tree, holler.
> 
>   - Ted

Hey Ted,

It looks like you only picked up patch 1/2?  (I'm looking at the 'dev' branch
in your repo.)  Was that intentional?

You can find the final version of the 2nd patch here:

https://lists.01.org/pipermail/linux-nvdimm/2018-July/016602.html

Thanks,
- Ross


[PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings

2018-07-02 Thread Ross Zwisler
Follow the lead of xfs_break_dax_layouts() and add synchronization between
operations in ext4 which remove blocks from an inode (hole punch, truncate
down, etc.) and pages which are pinned due to DAX DMA operations.
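The synchronization loop added below can be sketched in userspace. This is only an illustration of the retry pattern, not the kernel code: busy_pages, find_busy_page(), and wait_for_page() are hypothetical stand-ins for dax_layout_busy_page() and the ___wait_var_event() sleep on the page refcount.

```c
#include <assert.h>
#include <stddef.h>

static int busy_pages;			/* pages pinned by DAX DMA */

static void *find_busy_page(void)
{
	return busy_pages ? (void *)&busy_pages : NULL;
}

static void wait_for_page(void)
{
	/* In the kernel this drops i_mmap_sem, sleeps until the page
	 * refcount falls to one, then retakes the lock. */
	busy_pages--;
}

static int break_layouts(void)
{
	int retry;

	do {
		retry = 0;
		if (!find_busy_page())
			return 0;	/* no pinned pages remain */
		wait_for_page();
		retry = 1;		/* lock was dropped: re-scan */
	} while (retry);

	return 0;
}
```

The key point the sketch captures is that after every wait the whole mapping must be re-scanned, because the lock may have been dropped while sleeping and new pages may have been pinned.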

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
Reviewed-by: Lukas Czerner 
---

Changes since v2:
 * Added a comment to ext4_insert_range() explaining why we don't call
   ext4_break_layouts(). (Jan)

 * Added Lukas's reviewed-by.

Thanks again to Lukas & Jan for their reviews.

---
 fs/ext4/ext4.h |  1 +
 fs/ext4/extents.c  | 17 +
 fs/ext4/inode.c| 46 ++
 fs/ext4/truncate.h |  4 
 4 files changed, 68 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0b127853c584..34bccd64d83d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct 
ext4_iloc *);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
+extern int ext4_break_layouts(struct inode *);
 extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
 extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int 
nblocks);
 extern void ext4_set_inode_flags(struct inode *);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0057fe3f248d..2d0f7e942b05 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, loff_t 
offset,
 * released from page cache.
 */
	down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   ret = ext4_break_layouts(inode);
+   if (ret) {
+   up_write(&EXT4_I(inode)->i_mmap_sem);
+   goto out_mutex;
+   }
+
ret = ext4_update_disksize_before_punch(inode, offset, len);
if (ret) {
	up_write(&EXT4_I(inode)->i_mmap_sem);
@@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t 
offset, loff_t len)
 * page cache.
 */
	down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   ret = ext4_break_layouts(inode);
+   if (ret)
+   goto out_mmap;
+
/*
 * Need to round down offset to be aligned with page size boundary
 * for page size > block size.
@@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t 
offset, loff_t len)
LLONG_MAX);
if (ret)
goto out_mmap;
+   /*
+* We don't need to call ext4_break_layouts() because we aren't
+* removing any blocks from the inode.  We are just changing their
+* offset by inserting a hole.
+*/
truncate_pagecache(inode, ioffset);
 
credits = ext4_writepage_trans_blocks(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2ea07efbe016..fadb8ecacb1e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4193,6 +4193,39 @@ int ext4_update_disksize_before_punch(struct inode 
*inode, loff_t offset,
return 0;
 }
 
+static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock)
+{
+   *did_unlock = true;
+   up_write(&ei->i_mmap_sem);
+   schedule();
+   down_write(&ei->i_mmap_sem);
+}
+
+int ext4_break_layouts(struct inode *inode)
+{
+   struct ext4_inode_info *ei = EXT4_I(inode);
+   struct page *page;
+   bool retry;
+   int error;
+
+   if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem)))
+   return -EINVAL;
+
+   do {
+   retry = false;
+   page = dax_layout_busy_page(inode->i_mapping);
+   if (!page)
+   return 0;
+
+   error = ___wait_var_event(&page->_refcount,
+   atomic_read(&page->_refcount) == 1,
+   TASK_INTERRUPTIBLE, 0, 0,
+   ext4_wait_dax_page(ei, &retry));
+   } while (error == 0 && retry);
+
+   return error;
+}
+
 /*
  * ext4_punch_hole: punches a hole in a file by releasing the blocks
  * associated with the given offset and length
@@ -4266,6 +4299,11 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, 
loff_t length)
 * page cache.
 */
	down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   ret = ext4_break_layouts(inode);
+   if (ret)
+   goto out_dio;
+
first_block_offset = round_up(offset, sb->s_blocksize);
last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;
 
@@ -5554,6 +5592,14 @@ int ext4_setattr(struct dentry *dentry, struct iattr 
*attr)
ext4_wait_for_tail_page_commit(inode);
}
	down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   rc = ext4_break_layouts(inode);
+   if (rc)

Re: [PATCH v2 2/2] ext4: handle layout changes to pinned DAX mappings

2018-07-02 Thread Ross Zwisler
On Mon, Jul 02, 2018 at 09:59:48AM +0200, Lukas Czerner wrote:
> On Fri, Jun 29, 2018 at 09:13:00AM -0600, Ross Zwisler wrote:
> > On Fri, Jun 29, 2018 at 02:02:23PM +0200, Lukas Czerner wrote:
> > > On Wed, Jun 27, 2018 at 03:22:52PM -0600, Ross Zwisler wrote:
> > > > Follow the lead of xfs_break_dax_layouts() and add synchronization 
> > > > between
> > > > operations in ext4 which remove blocks from an inode (hole punch, 
> > > > truncate
> > > > down, etc.) and pages which are pinned due to DAX DMA operations.
> > > > 
> > > > Signed-off-by: Ross Zwisler 
> > > > Reviewed-by: Jan Kara 
> > > > ---
> > > >  fs/ext4/ext4.h |  1 +
> > > >  fs/ext4/extents.c  | 12 
> > > >  fs/ext4/inode.c| 46 ++
> > > >  fs/ext4/truncate.h |  4 
> > > >  4 files changed, 63 insertions(+)
> > > > 
> > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > > index 0b127853c584..34bccd64d83d 100644
> > > > --- a/fs/ext4/ext4.h
> > > > +++ b/fs/ext4/ext4.h
> > > > @@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, 
> > > > struct ext4_iloc *);
> > > >  extern int ext4_inode_attach_jinode(struct inode *inode);
> > > >  extern int ext4_can_truncate(struct inode *inode);
> > > >  extern int ext4_truncate(struct inode *);
> > > > +extern int ext4_break_layouts(struct inode *);
> > > >  extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t 
> > > > length);
> > > >  extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int 
> > > > nblocks);
> > > >  extern void ext4_set_inode_flags(struct inode *);
> > > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > > > index 0057fe3f248d..a6aef06f455b 100644
> > > > --- a/fs/ext4/extents.c
> > > > +++ b/fs/ext4/extents.c
> > > > @@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, 
> > > > loff_t offset,
> > > >  * released from page cache.
> > > >  */
> > > > down_write(&EXT4_I(inode)->i_mmap_sem);
> > > > +
> > > > +   ret = ext4_break_layouts(inode);
> > > > +   if (ret) {
> > > > +   up_write(&EXT4_I(inode)->i_mmap_sem);
> > > > +   goto out_mutex;
> > > > +   }
> > > > +
> > > > ret = ext4_update_disksize_before_punch(inode, offset, 
> > > > len);
> > > > if (ret) {
> > > > up_write(&EXT4_I(inode)->i_mmap_sem);
> > > > @@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, 
> > > > loff_t offset, loff_t len)
> > > >  * page cache.
> > > >  */
> > > > down_write(&EXT4_I(inode)->i_mmap_sem);
> > > > +
> > > > +   ret = ext4_break_layouts(inode);
> > > > +   if (ret)
> > > > +   goto out_mmap;
> > > 
> > > Hi,
> > > 
> > > don't we need to do the same for ext4_insert_range() since we're about
> > > to truncate_pagecache() as well ?
> > > 
> > > /thinking out loud/
> > > Xfs seems to do this before every fallocate operation, but in ext4
> > > it does not seem to be needed, at least for a simple allocating fallocate...
> > 
> > I saw the case in ext4_insert_range(), and decided that we didn't need to
> > worry about synchronizing with DAX because no blocks were being removed from
> > the inode's extent map.  IIUC the truncate_pagecache() call is needed 
> > because
> > we are unmapping and removing any page cache mappings for the part of the 
> > file
> > after the insert because those blocks are now at a different offset in the
> > inode.  Because at the end of the operation we haven't removed any DAX pages
> > from the inode, we have nothing that we need to synchronize.
> > 
> > Hmm, unless this is a failure case we care about fixing?
> >  1) schedule I/O via O_DIRECT to page X
> >  2) fallocate(FALLOC_FL_INSERT_RANGE) to block < X, shifting X to a larger
> > offset
> >  3) O_DIRECT I/O from 1) completes, but ends up writing into the *new* block
> > that resides at X - the I/O from 1) completes
> > 
> > In this case the user is running I/O and issuing the fallocate at the same
> > time, and the sequencing could have worked out that #1 and #2 were reversed,
> > giving you the same behavior.  IMO this seems fine and that we shouldn't 
> > have
> > the DAX synchronization call in ext4_insert_range(), but I'm happy to add it
> > if I'm wrong.
> 
> Hi,
> 
> I think you're right, this case might not matter much. I am just worried
> about unforeseen consequences of changing the layout with dax pages
> mapped. I guess we can also add this later if we discover anything.
> 
> You can add
> 
> Reviewed-by: Lukas Czerner 
> 
> Thanks!
> -Lukas

Thank you for the review.  I'll add a comment to help explain my reasoning, as
Jan suggested.


[ndctl PATCH v2 2/2] ndctl list: always output array without --human

2018-06-29 Thread Ross Zwisler
If "ndctl list" is only printing a single object it currently does not
print it in an array.  When printing multiple objects, though, an array is
used.  This simplifies the output for single objects when a human is
looking at it but it creates complications for parsers like jq.  jq
wants to know whether it is trying to parse an object or an array of
objects.

For example, here's how you print just the mode of a single namespace:

 # ndctl list -n namespace1.0 | jq ".mode"
 "raw"

This works on a single object.  If you have multiple objects in a JSON
array you do this:

  # ndctl list | jq -r '(.[]) | .mode'
  raw
  raw

If you want to write a generic script that works in both cases you have to
do something quite complicated like:

  # ndctl list | jq -r '((. | arrays | .[]), . | objects | .mode)'
  raw
  raw

Instead of pushing the burden onto the user, make ndctl consistently output
an array even if there is only one element.  This lets the parsing code
be simpler and only worry about one case.

If the user specifies the --human or -u flags we will assume that a person
is looking at the output and not a script and will continue to output a
single element which is not part of an array.
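The output rule described above can be sketched as follows. This is an illustrative sketch only, not ndctl's actual implementation: UTIL_JSON_HUMAN and display_one() are hypothetical names standing in for the real flag and util_display_json_array().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define UTIL_JSON_HUMAN (1 << 0)	/* hypothetical flag bit */

/* A bare object is emitted only when --human was requested; otherwise
 * even a single object is wrapped in a JSON array so parsers always
 * see one consistent shape. */
static void display_one(char *buf, size_t len, const char *obj,
			unsigned long flags)
{
	if (flags & UTIL_JSON_HUMAN)
		snprintf(buf, len, "%s", obj);		/* for people */
	else
		snprintf(buf, len, "[%s]", obj);	/* for scripts */
}
```

A script can then unconditionally use `jq -r '.[] | .mode'` regardless of how many objects were listed.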

Signed-off-by: Ross Zwisler 
---
 daxctl/list.c| 11 +++
 ndctl/bus.c  |  2 +-
 ndctl/dimm.c |  2 +-
 ndctl/inject-smart.c |  2 +-
 ndctl/list.c |  8 
 util/json.c  |  5 +++--
 util/json.h  |  3 ++-
 7 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/daxctl/list.c b/daxctl/list.c
index ed9e76c..58c35fa 100644
--- a/daxctl/list.c
+++ b/daxctl/list.c
@@ -83,6 +83,7 @@ int cmd_list(int argc, const char **argv, void *ctx)
struct json_object *jregions = NULL;
struct json_object *jdevs = NULL;
struct daxctl_region *region;
+   unsigned long list_flags;
int i;
 
 argc = parse_options(argc, argv, options, u, 0);
@@ -100,6 +101,8 @@ int cmd_list(int argc, const char **argv, void *ctx)
if (num_list_flags() == 0)
list.devs = true;
 
+   list_flags = listopts_to_flags();
+
daxctl_region_foreach(ctx, region) {
struct json_object *jregion = NULL;
 
@@ -117,7 +120,7 @@ int cmd_list(int argc, const char **argv, void *ctx)
}
 
jregion = util_daxctl_region_to_json(region,
-   param.dev, listopts_to_flags());
+   param.dev, list_flags);
if (!jregion) {
fail("\n");
continue;
@@ -125,13 +128,13 @@ int cmd_list(int argc, const char **argv, void *ctx)
json_object_array_add(jregions, jregion);
} else if (list.devs)
jdevs = util_daxctl_devs_to_list(region, jdevs,
-   param.dev, listopts_to_flags());
+   param.dev, list_flags);
}
 
if (jregions)
-   util_display_json_array(stdout, jregions);
+   util_display_json_array(stdout, jregions, list_flags);
else if (jdevs)
-   util_display_json_array(stdout, jdevs);
+   util_display_json_array(stdout, jdevs, list_flags);
 
if (did_fail)
return -ENOMEM;
diff --git a/ndctl/bus.c b/ndctl/bus.c
index b7b7d65..bcb8c71 100644
--- a/ndctl/bus.c
+++ b/ndctl/bus.c
@@ -88,7 +88,7 @@ static int bus_action(int argc, const char **argv, const char 
*usage,
}
 
if (success)
-   util_display_json_array(stdout, jbuses);
+   util_display_json_array(stdout, jbuses, 0);
else
json_object_put(jbuses);
 
diff --git a/ndctl/dimm.c b/ndctl/dimm.c
index 6964c5e..97643a3 100644
--- a/ndctl/dimm.c
+++ b/ndctl/dimm.c
@@ -1081,7 +1081,7 @@ static int dimm_action(int argc, const char **argv, void 
*ctx,
}
 
if (actx.jdimms)
-   util_display_json_array(actx.f_out, actx.jdimms);
+   util_display_json_array(actx.f_out, actx.jdimms, 0);
 
if (actx.f_out != stdout)
fclose(actx.f_out);
diff --git a/ndctl/inject-smart.c b/ndctl/inject-smart.c
index 2e62c89..8da893c 100644
--- a/ndctl/inject-smart.c
+++ b/ndctl/inject-smart.c
@@ -375,7 +375,7 @@ static int dimm_inject_smart(struct ndctl_dimm *dimm)
jhealth = util_dimm_health_to_json(dimm);
if (jhealth) {
json_object_object_add(jdimm, "health", jhealth);
-   util_display_json_array(stdout, jdimms);
+   util_display_json_array(stdout, jdimms, sctx.flags);
}
}
 out:
diff --git a/ndctl/list.c b/ndctl/list.c
index 030d73f..a06edc9 100644
--- a/ndctl/list.c
+++ b/ndctl/list.c
@@ -379,7 +379,7 @@ s

[ndctl PATCH v2 1/2] ndctl: simplify JSON print flag handling

2018-06-29 Thread Ross Zwisler
json_object_to_json_string_ext()'s second argument is a flags field which
we always want to be JSON_C_TO_STRING_PRETTY.  We were going to a lot of
trouble for this, though.  We had multiple variables set to be this one
flag and util_display_json_array() took it as an argument, even though it
never varied.

Instead, just pass in the necessary flag when calling
json_object_to_json_string_ext(), removing the local variables and the extra
argument to util_display_json_array().

Signed-off-by: Ross Zwisler 
Acked-by: Dan Williams 
---
 daxctl/list.c|  5 ++---
 ndctl/bus.c  |  3 +--
 ndctl/dimm.c |  3 +--
 ndctl/inject-smart.c |  3 +--
 ndctl/list.c | 11 +--
 util/json.c  |  3 ++-
 util/json.h  |  2 +-
 7 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/daxctl/list.c b/daxctl/list.c
index 254f0ac..ed9e76c 100644
--- a/daxctl/list.c
+++ b/daxctl/list.c
@@ -50,7 +50,6 @@ static struct {
 };
 
 static int did_fail;
-static int jflag = JSON_C_TO_STRING_PRETTY;
 
 #define fail(fmt, ...) \
 do { \
@@ -130,9 +129,9 @@ int cmd_list(int argc, const char **argv, void *ctx)
}
 
if (jregions)
-   util_display_json_array(stdout, jregions, jflag);
+   util_display_json_array(stdout, jregions);
else if (jdevs)
-   util_display_json_array(stdout, jdevs, jflag);
+   util_display_json_array(stdout, jdevs);
 
if (did_fail)
return -ENOMEM;
diff --git a/ndctl/bus.c b/ndctl/bus.c
index fc31d06..b7b7d65 100644
--- a/ndctl/bus.c
+++ b/ndctl/bus.c
@@ -88,8 +88,7 @@ static int bus_action(int argc, const char **argv, const char 
*usage,
}
 
if (success)
-   util_display_json_array(stdout, jbuses,
-   JSON_C_TO_STRING_PRETTY);
+   util_display_json_array(stdout, jbuses);
else
json_object_put(jbuses);
 
diff --git a/ndctl/dimm.c b/ndctl/dimm.c
index 1c16899..6964c5e 100644
--- a/ndctl/dimm.c
+++ b/ndctl/dimm.c
@@ -1081,8 +1081,7 @@ static int dimm_action(int argc, const char **argv, void 
*ctx,
}
 
if (actx.jdimms)
-   util_display_json_array(actx.f_out, actx.jdimms,
-   JSON_C_TO_STRING_PRETTY);
+   util_display_json_array(actx.f_out, actx.jdimms);
 
if (actx.f_out != stdout)
fclose(actx.f_out);
diff --git a/ndctl/inject-smart.c b/ndctl/inject-smart.c
index 60de9fe..2e62c89 100644
--- a/ndctl/inject-smart.c
+++ b/ndctl/inject-smart.c
@@ -375,8 +375,7 @@ static int dimm_inject_smart(struct ndctl_dimm *dimm)
jhealth = util_dimm_health_to_json(dimm);
if (jhealth) {
json_object_object_add(jdimm, "health", jhealth);
-   util_display_json_array(stdout, jdimms,
-   JSON_C_TO_STRING_PRETTY);
+   util_display_json_array(stdout, jdimms);
}
}
 out:
diff --git a/ndctl/list.c b/ndctl/list.c
index 6cf7c39..030d73f 100644
--- a/ndctl/list.c
+++ b/ndctl/list.c
@@ -56,7 +56,6 @@ static unsigned long listopts_to_flags(void)
 struct util_filter_params param;
 
 static int did_fail;
-static int jflag = JSON_C_TO_STRING_PRETTY;
 
 #define fail(fmt, ...) \
 do { \
@@ -380,7 +379,7 @@ static int list_display(struct list_filter_arg *lfa)
struct json_object *jbuses = lfa->jbuses;
 
if (jbuses)
-   util_display_json_array(stdout, jbuses, jflag);
+   util_display_json_array(stdout, jbuses);
else if ((!!jdimms + !!jregions + !!jnamespaces) > 1) {
struct json_object *jplatform = json_object_new_object();
 
@@ -397,14 +396,14 @@ static int list_display(struct list_filter_arg *lfa)
json_object_object_add(jplatform, "namespaces",
jnamespaces);
printf("%s\n", json_object_to_json_string_ext(jplatform,
-   jflag));
+   JSON_C_TO_STRING_PRETTY));
json_object_put(jplatform);
} else if (jdimms)
-   util_display_json_array(stdout, jdimms, jflag);
+   util_display_json_array(stdout, jdimms);
else if (jregions)
-   util_display_json_array(stdout, jregions, jflag);
+   util_display_json_array(stdout, jregions);
else if (jnamespaces)
-   util_display_json_array(stdout, jnamespaces, jflag);
+   util_display_json_array(stdout, jnamespaces);
return 0;
 }
 
diff --git a/util/json.c b/util/json.c
index 94ed948..ff894c7 100644
--- a/util/json.c
+++ b/util/json.c
@@ -105,9 +105,10 @@ struct json_object *util_json_object_hex(unsigned long 
long val,
return jobj;
 }
 
-void util_display_json_array(FILE *f_out, struct js

[ndctl PATCH v2 0/2] always output array without --human

2018-06-29 Thread Ross Zwisler
If "ndctl list" is only printing a single object it currently does not
print it in an array.  When printing multiple objects, though, an array is
used.  This simplifies the output for single objects when a human is
looking at it but it creates complications for parsers like jq.  jq
wants to know whether it is trying to parse an object or an array of
objects.

The first patch in this series is cleanup I noticed while fixing the
above issue with the second patch.

---

Changes since v1:
 * Dropped patch 2 that dealt with cleaning up the way flags are handled
   in ndctl/list.c. (Dan)

 * Added Dan's Ack to patch 1.

Ross Zwisler (2):
  ndctl: simplify JSON print flag handling
  ndctl list: always output array without --human

 daxctl/list.c| 12 +++-
 ndctl/bus.c  |  3 +--
 ndctl/dimm.c |  3 +--
 ndctl/inject-smart.c |  3 +--
 ndctl/list.c | 11 +--
 util/json.c  |  6 --
 util/json.h  |  3 ++-
 7 files changed, 21 insertions(+), 20 deletions(-)

-- 
2.14.4



Re: [ndctl PATCH 2/3] ndctl/list.c: simplify flags handling

2018-06-29 Thread Ross Zwisler
On Fri, Jun 29, 2018 at 04:57:29PM -0700, Dan Williams wrote:
> On Fri, Jun 29, 2018 at 4:39 PM, Ross Zwisler
>  wrote:
> > ndctl/list.c keeps track of a set of enum util_json_flags which are based
> > on the command line options passed in.  This handling is a little messy,
> > though.  We have an accessor function, listopts_to_flags(), which is used
> > multiple times.  We sometimes pass the flags around as function arguments,
> > and we also have a copy stashed in the local struct list_filter_arg called
> > "lfa".  In some functions we access the flags in multiple ways.
> >
> > These flags are local to ndctl/list.c, are set exactly once per invocation
> > and never change.  Create a variable local to that file, initialize it once
> > and use it everywhere.
> >
> > Signed-off-by: Ross Zwisler 
> > ---
> >  ndctl/list.c  | 54 +++---
> >  util/filter.h |  1 -
> >  2 files changed, 27 insertions(+), 28 deletions(-)
> 
> I don't see the point of this thrash. It's not a win code size wise
> and I think it needlessly adds more dependence on global variables.

I disagree.  We currently have the flags value passed around as function
arguments, we have a version stashed in a data structure we pass around, and
we have one which is already essentially a global variable that is accessed
via an accessor function.  Having all three is complex and unnecessary,
especially considering that the flags never change.

Can we just choose one way of accessing the flags and use it everywhere?  If
not a global variable, which would you like to use?


[ndctl PATCH 3/3] ndctl list: always output array without --human

2018-06-29 Thread Ross Zwisler
If "ndctl list" is only printing a single object it currently does not
print it in an array.  When printing multiple objects, though, an array is
used.  This simplifies the output for single objects when a human is
looking at it but it creates complications for parsers like jq.  jq
wants to know whether it is trying to parse an object or an array of
objects.

For example, here's how you print just the mode of a single namespace:

 # ndctl list -n namespace1.0 | jq ".mode"
 "raw"

This works on a single object.  If you have multiple objects in a JSON
array you do this:

  # ndctl list | jq -r '(.[]) | .mode'
  raw
  raw

If you want to write a generic script that works in both cases you have to
do something quite complicated like:

  # ndctl list | jq -r '((. | arrays | .[]), . | objects | .mode)'
  raw
  raw

Instead of pushing the burden onto the user, make ndctl consistently output
an array even if there is only one element.  This lets the parsing code
be simpler and only worry about one case.

If the user specifies the --human or -u flags we will assume that a person
is looking at the output and not a script and will continue to output a
single element which is not part of an array.

Signed-off-by: Ross Zwisler 
---
 daxctl/list.c| 11 +++
 ndctl/bus.c  |  2 +-
 ndctl/dimm.c |  2 +-
 ndctl/inject-smart.c |  2 +-
 ndctl/list.c |  8 
 util/json.c  |  5 +++--
 util/json.h  |  3 ++-
 7 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/daxctl/list.c b/daxctl/list.c
index ed9e76c..58c35fa 100644
--- a/daxctl/list.c
+++ b/daxctl/list.c
@@ -83,6 +83,7 @@ int cmd_list(int argc, const char **argv, void *ctx)
struct json_object *jregions = NULL;
struct json_object *jdevs = NULL;
struct daxctl_region *region;
+   unsigned long list_flags;
int i;
 
 argc = parse_options(argc, argv, options, u, 0);
@@ -100,6 +101,8 @@ int cmd_list(int argc, const char **argv, void *ctx)
if (num_list_flags() == 0)
list.devs = true;
 
+   list_flags = listopts_to_flags();
+
daxctl_region_foreach(ctx, region) {
struct json_object *jregion = NULL;
 
@@ -117,7 +120,7 @@ int cmd_list(int argc, const char **argv, void *ctx)
}
 
jregion = util_daxctl_region_to_json(region,
-   param.dev, listopts_to_flags());
+   param.dev, list_flags);
if (!jregion) {
fail("\n");
continue;
@@ -125,13 +128,13 @@ int cmd_list(int argc, const char **argv, void *ctx)
json_object_array_add(jregions, jregion);
} else if (list.devs)
jdevs = util_daxctl_devs_to_list(region, jdevs,
-   param.dev, listopts_to_flags());
+   param.dev, list_flags);
}
 
if (jregions)
-   util_display_json_array(stdout, jregions);
+   util_display_json_array(stdout, jregions, list_flags);
else if (jdevs)
-   util_display_json_array(stdout, jdevs);
+   util_display_json_array(stdout, jdevs, list_flags);
 
if (did_fail)
return -ENOMEM;
diff --git a/ndctl/bus.c b/ndctl/bus.c
index b7b7d65..bcb8c71 100644
--- a/ndctl/bus.c
+++ b/ndctl/bus.c
@@ -88,7 +88,7 @@ static int bus_action(int argc, const char **argv, const char 
*usage,
}
 
if (success)
-   util_display_json_array(stdout, jbuses);
+   util_display_json_array(stdout, jbuses, 0);
else
json_object_put(jbuses);
 
diff --git a/ndctl/dimm.c b/ndctl/dimm.c
index 6964c5e..97643a3 100644
--- a/ndctl/dimm.c
+++ b/ndctl/dimm.c
@@ -1081,7 +1081,7 @@ static int dimm_action(int argc, const char **argv, void 
*ctx,
}
 
if (actx.jdimms)
-   util_display_json_array(actx.f_out, actx.jdimms);
+   util_display_json_array(actx.f_out, actx.jdimms, 0);
 
if (actx.f_out != stdout)
fclose(actx.f_out);
diff --git a/ndctl/inject-smart.c b/ndctl/inject-smart.c
index 2e62c89..8da893c 100644
--- a/ndctl/inject-smart.c
+++ b/ndctl/inject-smart.c
@@ -375,7 +375,7 @@ static int dimm_inject_smart(struct ndctl_dimm *dimm)
jhealth = util_dimm_health_to_json(dimm);
if (jhealth) {
json_object_object_add(jdimm, "health", jhealth);
-   util_display_json_array(stdout, jdimms);
+   util_display_json_array(stdout, jdimms, sctx.flags);
}
}
 out:
diff --git a/ndctl/list.c b/ndctl/list.c
index 32f679b..5d173d7 100644
--- a/ndctl/list.c
+++ b/ndctl/list.c
@@ -366,7 +366,7 @@ s

[ndctl PATCH 2/3] ndctl/list.c: simplify flags handling

2018-06-29 Thread Ross Zwisler
ndctl/list.c keeps track of a set of enum util_json_flags which are based
on the command line options passed in.  This handling is a little messy,
though.  We have an accessor function, listopts_to_flags(), which is used
multiple times.  We sometimes pass the flags around as function arguments,
and we also have a copy stashed in the local struct list_filter_arg called
"lfa".  In some functions we access the flags in multiple ways.

These flags are local to ndctl/list.c, are set exactly once per invocation
and never change.  Create a variable local to that file, initialize it once
and use it everywhere.
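
The shape of the change can be sketched in Python (the flag values here are
invented for illustration; the real UTIL_JSON_* constants live in
util/json.h): the option booleans are folded into a bitmask exactly once,
and every later consumer just reads the mask:

```python
# Hypothetical stand-ins for the UTIL_JSON_* flag bits.
UTIL_JSON_IDLE         = 1 << 0
UTIL_JSON_MEDIA_ERRORS = 1 << 1
UTIL_JSON_DAX          = 1 << 2
UTIL_JSON_DAX_DEVS     = 1 << 3
UTIL_JSON_HUMAN        = 1 << 4

def listopts_to_flags(idle=False, media_errors=False, dax=False, human=False):
    """Fold the command-line booleans into one bitmask, once per invocation."""
    flags = 0
    if idle:
        flags |= UTIL_JSON_IDLE
    if media_errors:
        flags |= UTIL_JSON_MEDIA_ERRORS
    if dax:
        flags |= UTIL_JSON_DAX | UTIL_JSON_DAX_DEVS
    if human:
        flags |= UTIL_JSON_HUMAN
    return flags

# Set once; everything after this point only ever reads list_flags.
list_flags = listopts_to_flags(dax=True)
print(bool(list_flags & UTIL_JSON_DAX_DEVS))  # → True
```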

Signed-off-by: Ross Zwisler 
---
 ndctl/list.c  | 54 +++---
 util/filter.h |  1 -
 2 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/ndctl/list.c b/ndctl/list.c
index 030d73f..32f679b 100644
--- a/ndctl/list.c
+++ b/ndctl/list.c
@@ -38,20 +38,7 @@ static struct {
bool firmware;
 } list;
 
-static unsigned long listopts_to_flags(void)
-{
-   unsigned long flags = 0;
-
-   if (list.idle)
-   flags |= UTIL_JSON_IDLE;
-   if (list.media_errors)
-   flags |= UTIL_JSON_MEDIA_ERRORS;
-   if (list.dax)
-   flags |= UTIL_JSON_DAX | UTIL_JSON_DAX_DEVS;
-   if (list.human)
-   flags |= UTIL_JSON_HUMAN;
-   return flags;
-}
+static unsigned long list_flags;
 
 struct util_filter_params param;
 
@@ -64,8 +51,7 @@ do { \
VERSION, __func__, __LINE__, ##__VA_ARGS__); \
 } while (0)
 
-static struct json_object *region_to_json(struct ndctl_region *region,
-   unsigned long flags)
+static struct json_object *region_to_json(struct ndctl_region *region)
 {
struct json_object *jregion = json_object_new_object();
struct json_object *jobj, *jbbs, *jmappings = NULL;
@@ -83,13 +69,13 @@ static struct json_object *region_to_json(struct 
ndctl_region *region,
goto err;
json_object_object_add(jregion, "dev", jobj);
 
-   jobj = util_json_object_size(ndctl_region_get_size(region), flags);
+   jobj = util_json_object_size(ndctl_region_get_size(region), list_flags);
if (!jobj)
goto err;
json_object_object_add(jregion, "size", jobj);
 
jobj = util_json_object_size(ndctl_region_get_available_size(region),
-   flags);
+   list_flags);
if (!jobj)
goto err;
json_object_object_add(jregion, "available_size", jobj);
@@ -118,7 +104,8 @@ static struct json_object *region_to_json(struct 
ndctl_region *region,
iset = ndctl_region_get_interleave_set(region);
if (iset) {
jobj = util_json_object_hex(
-   ndctl_interleave_set_get_cookie(iset), flags);
+   ndctl_interleave_set_get_cookie(iset),
+   list_flags);
if (!jobj)
fail("\n");
else
@@ -147,7 +134,7 @@ static struct json_object *region_to_json(struct 
ndctl_region *region,
json_object_object_add(jregion, "mappings", jmappings);
}
 
-   jmapping = util_mapping_to_json(mapping, listopts_to_flags());
+   jmapping = util_mapping_to_json(mapping, list_flags);
if (!jmapping) {
fail("\n");
continue;
@@ -162,7 +149,7 @@ static struct json_object *region_to_json(struct 
ndctl_region *region,
json_object_object_add(jregion, "state", jobj);
}
 
-   jbbs = util_region_badblocks_to_json(region, &bb_count, flags);
+   jbbs = util_region_badblocks_to_json(region, &bb_count, list_flags);
if (bb_count) {
jobj = json_object_new_int(bb_count);
if (!jobj) {
@@ -171,7 +158,7 @@ static struct json_object *region_to_json(struct 
ndctl_region *region,
}
json_object_object_add(jregion, "badblock_count", jobj);
}
-   if ((flags & UTIL_JSON_MEDIA_ERRORS) && jbbs)
+   if ((list_flags & UTIL_JSON_MEDIA_ERRORS) && jbbs)
json_object_object_add(jregion, "badblocks", jbbs);
 
pd = ndctl_region_get_persistence_domain(region);
@@ -222,7 +209,7 @@ static void filter_namespace(struct ndctl_namespace *ndns,
lfa->jnamespaces);
}
 
-   jndns = util_namespace_to_json(ndns, lfa->flags);
+   jndns = util_namespace_to_json(ndns, list_flags);
if (!jndns) {
fail("\n");
return;
@@ -256,7 +243,7 @@ static bool filter_region(struct ndctl_region *region,
lfa->jregions);
}
 
-   jregion = region_to

[ndctl PATCH 1/3] ndctl: simplify JSON print flag handling

2018-06-29 Thread Ross Zwisler
json_object_to_json_string_ext()'s second argument is a flags field which
we always want to be JSON_C_TO_STRING_PRETTY.  We were going to a lot of
trouble for this, though.  We had multiple variables set to be this one
flag and util_display_json_array() took it as an argument, even though it
never varied.

Instead, just pass in the necessary flag when calling
json_object_to_json_string_ext(), removing the local variables and the extra
argument to util_display_json_array().

Signed-off-by: Ross Zwisler 
---
 daxctl/list.c|  5 ++---
 ndctl/bus.c  |  3 +--
 ndctl/dimm.c |  3 +--
 ndctl/inject-smart.c |  3 +--
 ndctl/list.c | 11 +--
 util/json.c  |  3 ++-
 util/json.h  |  2 +-
 7 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/daxctl/list.c b/daxctl/list.c
index 254f0ac..ed9e76c 100644
--- a/daxctl/list.c
+++ b/daxctl/list.c
@@ -50,7 +50,6 @@ static struct {
 };
 
 static int did_fail;
-static int jflag = JSON_C_TO_STRING_PRETTY;
 
 #define fail(fmt, ...) \
 do { \
@@ -130,9 +129,9 @@ int cmd_list(int argc, const char **argv, void *ctx)
}
 
if (jregions)
-   util_display_json_array(stdout, jregions, jflag);
+   util_display_json_array(stdout, jregions);
else if (jdevs)
-   util_display_json_array(stdout, jdevs, jflag);
+   util_display_json_array(stdout, jdevs);
 
if (did_fail)
return -ENOMEM;
diff --git a/ndctl/bus.c b/ndctl/bus.c
index fc31d06..b7b7d65 100644
--- a/ndctl/bus.c
+++ b/ndctl/bus.c
@@ -88,8 +88,7 @@ static int bus_action(int argc, const char **argv, const char 
*usage,
}
 
if (success)
-   util_display_json_array(stdout, jbuses,
-   JSON_C_TO_STRING_PRETTY);
+   util_display_json_array(stdout, jbuses);
else
json_object_put(jbuses);
 
diff --git a/ndctl/dimm.c b/ndctl/dimm.c
index 1c16899..6964c5e 100644
--- a/ndctl/dimm.c
+++ b/ndctl/dimm.c
@@ -1081,8 +1081,7 @@ static int dimm_action(int argc, const char **argv, void 
*ctx,
}
 
if (actx.jdimms)
-   util_display_json_array(actx.f_out, actx.jdimms,
-   JSON_C_TO_STRING_PRETTY);
+   util_display_json_array(actx.f_out, actx.jdimms);
 
if (actx.f_out != stdout)
fclose(actx.f_out);
diff --git a/ndctl/inject-smart.c b/ndctl/inject-smart.c
index 60de9fe..2e62c89 100644
--- a/ndctl/inject-smart.c
+++ b/ndctl/inject-smart.c
@@ -375,8 +375,7 @@ static int dimm_inject_smart(struct ndctl_dimm *dimm)
jhealth = util_dimm_health_to_json(dimm);
if (jhealth) {
json_object_object_add(jdimm, "health", jhealth);
-   util_display_json_array(stdout, jdimms,
-   JSON_C_TO_STRING_PRETTY);
+   util_display_json_array(stdout, jdimms);
}
}
 out:
diff --git a/ndctl/list.c b/ndctl/list.c
index 6cf7c39..030d73f 100644
--- a/ndctl/list.c
+++ b/ndctl/list.c
@@ -56,7 +56,6 @@ static unsigned long listopts_to_flags(void)
 struct util_filter_params param;
 
 static int did_fail;
-static int jflag = JSON_C_TO_STRING_PRETTY;
 
 #define fail(fmt, ...) \
 do { \
@@ -380,7 +379,7 @@ static int list_display(struct list_filter_arg *lfa)
struct json_object *jbuses = lfa->jbuses;
 
if (jbuses)
-   util_display_json_array(stdout, jbuses, jflag);
+   util_display_json_array(stdout, jbuses);
else if ((!!jdimms + !!jregions + !!jnamespaces) > 1) {
struct json_object *jplatform = json_object_new_object();
 
@@ -397,14 +396,14 @@ static int list_display(struct list_filter_arg *lfa)
json_object_object_add(jplatform, "namespaces",
jnamespaces);
printf("%s\n", json_object_to_json_string_ext(jplatform,
-   jflag));
+   JSON_C_TO_STRING_PRETTY));
json_object_put(jplatform);
} else if (jdimms)
-   util_display_json_array(stdout, jdimms, jflag);
+   util_display_json_array(stdout, jdimms);
else if (jregions)
-   util_display_json_array(stdout, jregions, jflag);
+   util_display_json_array(stdout, jregions);
else if (jnamespaces)
-   util_display_json_array(stdout, jnamespaces, jflag);
+   util_display_json_array(stdout, jnamespaces);
return 0;
 }
 
diff --git a/util/json.c b/util/json.c
index 94ed948..ff894c7 100644
--- a/util/json.c
+++ b/util/json.c
@@ -105,9 +105,10 @@ struct json_object *util_json_object_hex(unsigned long 
long val,
return jobj;
 }
 
-void util_display_json_array(FILE *f_out, struct json_object *jarray, int 
jflag

Re: [PATCH v2 2/2] ext4: handle layout changes to pinned DAX mappings

2018-06-29 Thread Ross Zwisler
On Fri, Jun 29, 2018 at 02:02:23PM +0200, Lukas Czerner wrote:
> On Wed, Jun 27, 2018 at 03:22:52PM -0600, Ross Zwisler wrote:
> > Follow the lead of xfs_break_dax_layouts() and add synchronization between
> > operations in ext4 which remove blocks from an inode (hole punch, truncate
> > down, etc.) and pages which are pinned due to DAX DMA operations.
> > 
> > Signed-off-by: Ross Zwisler 
> > Reviewed-by: Jan Kara 
> > ---
> >  fs/ext4/ext4.h |  1 +
> >  fs/ext4/extents.c  | 12 
> >  fs/ext4/inode.c| 46 ++
> >  fs/ext4/truncate.h |  4 
> >  4 files changed, 63 insertions(+)
> > 
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 0b127853c584..34bccd64d83d 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct 
> > ext4_iloc *);
> >  extern int ext4_inode_attach_jinode(struct inode *inode);
> >  extern int ext4_can_truncate(struct inode *inode);
> >  extern int ext4_truncate(struct inode *);
> > +extern int ext4_break_layouts(struct inode *);
> >  extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t 
> > length);
> >  extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int 
> > nblocks);
> >  extern void ext4_set_inode_flags(struct inode *);
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index 0057fe3f248d..a6aef06f455b 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, 
> > loff_t offset,
> >  * released from page cache.
> >  */
> > down_write(&EXT4_I(inode)->i_mmap_sem);
> > +
> > +   ret = ext4_break_layouts(inode);
> > +   if (ret) {
> > +   up_write(&EXT4_I(inode)->i_mmap_sem);
> > +   goto out_mutex;
> > +   }
> > +
> > ret = ext4_update_disksize_before_punch(inode, offset, len);
> > if (ret) {
> > up_write(&EXT4_I(inode)->i_mmap_sem);
> > @@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t 
> > offset, loff_t len)
> >  * page cache.
> >  */
> > down_write(&EXT4_I(inode)->i_mmap_sem);
> > +
> > +   ret = ext4_break_layouts(inode);
> > +   if (ret)
> > +   goto out_mmap;
> 
> Hi,
> 
> don't we need to do the same for ext4_insert_range() since we're about
> to truncate_pagecache() as well ?
> 
> /thinking out loud/
> Xfs seems to do this before every fallocate operation, but in ext4
> it does not seem to be needed at least for simply allocating falocate...

I saw the case in ext4_insert_range(), and decided that we didn't need to
worry about synchronizing with DAX because no blocks were being removed from
the inode's extent map.  IIUC the truncate_pagecache() call is needed because
we are unmapping and removing any page cache mappings for the part of the file
after the insert because those blocks are now at a different offset in the
inode.  Because at the end of the operation we haven't removed any DAX pages
from the inode, we have nothing that we need to synchronize.

Hmm, unless this is a failure case we care about fixing?
 1) schedule I/O via O_DIRECT to page X
 2) fallocate(FALLOC_FL_INSERT_RANGE) to block < X, shifting X to a larger
offset
 3) O_DIRECT I/O from 1) completes, but ends up writing into the *new* block
that resides at X

In this case the user is running I/O and issuing the fallocate at the same
time, and the sequencing could have worked out that #1 and #2 were reversed,
giving you the same behavior.  IMO this seems fine and that we shouldn't have
the DAX synchronization call in ext4_insert_range(), but I'm happy to add it
if I'm wrong.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v3 1/3] pmem: only set QUEUE_FLAG_DAX for fsdax mode

2018-06-28 Thread Ross Zwisler
On Thu, Jun 28, 2018 at 05:42:34PM +, Kani, Toshi wrote:
> On Tue, 2018-06-26 at 16:04 -0600, Ross Zwisler wrote:
> > On Tue, Jun 26, 2018 at 02:51:52PM -0700, Dan Williams wrote:
> > > On Tue, Jun 26, 2018 at 2:31 PM, Kani, Toshi  wrote:
> > > > On Tue, 2018-06-26 at 14:28 -0700, Dan Williams wrote:
> > > > > On Tue, Jun 26, 2018 at 2:23 PM, Kani, Toshi  
> > > > > wrote:
> > > > > > On Tue, 2018-06-26 at 14:02 -0700, Dan Williams wrote:
> > > > > > > On Tue, Jun 26, 2018 at 1:54 PM, Kani, Toshi  
> > > > > > > wrote:
> > > > > 
> > > > > [..]
> > > > > > > > When this dm change was made, the pmem driver supported DAX for 
> > > > > > > > both raw
> > > > > > > > and memory modes (note: sector mode does not use the pmem 
> > > > > > > > driver).  I
> > > > > > > > think the issue was introduced when we dropped DAX support from 
> > > > > > > > raw
> > > > > > > > mode.
> > > > > > > 
> > > > > > > Still DAX with raw mode never really worked any way. It was also
> > > > > > > something that was broken from day one. So what happens to 
> > > > > > > someone who
> > > > > > > happened to avoid all the problems with page-less DAX and enabled
> > > > > > > device-mapper on top? That failure mode detail needs to be added 
> > > > > > > to
> > > > > > > this changelog if we want to propose this for -stable.
> > > > > > 
> > > > > > My point is that the behavior should be consistent between pmem and
> > > > > > device-mapper.  When -o dax succeeds on a pmem, then it should 
> > > > > > succeed
> > > > > > on a device-mapper on top of that pmem.
> > > > > > 
> > > > > > Has the drop of dax support from raw mode made to -stable back to 
> > > > > > the
> > > > > > baseline accepted 545ed20e6df6?  It will introduce inconsistency,
> > > > > > otherwise.
> > > > > 
> > > > > That commit, 569d0365f571 "dax: require 'struct page' by default for
> > > > > filesystem dax", has not been tagged for -stable.
> > > > 
> > > > Then, Fixes tag should be set to 569d0365f571 to keep the behavior
> > > > consistent.
> > > 
> > > Sure, and the failure mode is...? I'm thinking the commit log should say:
> > > 
> > > "Starting with commit 569d0365f571 "dax: require 'struct page' by
> > > default for filesystem dax", dax is no longer supported for page-less
> > > configurations. However, device-mapper sees the QUEUE_FLAG_DAX still
> > > being set and falsely assumes that DAX is enabled, this leads to
> > > "
> > 
> > Dan is correct that there is no user visible change for this.  It is the 
> > right
> > thing to do for consistency and sanity, but it doesn't actually have user
> > visible behavior that needs to be backported to stable.
> > 
> > Toshi is correct that this change is only for raw mode namespaces, not btt
> > namespaces.
> > 
> > I'll adjust the changelog and remove the stable flag for v5, and I'll add a
> > Fixes: tag for patch 2.
> 
> Hi Ross,
> 
> Your patches look good.  But I am still not clear about the Fixes &
> stable handling.  Talking about user visible behavior, I do not think we
> had any issue until dax support was dropped from raw mode.  Until then,
> the pmem driver supported dax for all modes, and the check for
> direct_access worked.

I agree that the fsdax + raw mode failure mode I mentioned in my cover letter
only started when we restricted filesystem DAX to having struct page, but I
think that the other failure mode, fsdax + some random block driver (I used
brd) was present in DM from the beginning.

In any case, I think both are fixed with the patches, and I think it's fine
that all 3 get thrown at stable.  Thanks, Mike, for the help.


Re: [ndctl PATCH] Documentation: add a newline in namespace Theory of Operations

2018-06-27 Thread Ross Zwisler
On Wed, Jun 27, 2018 at 04:15:43PM -0600, Vishal Verma wrote:
> The first bullet in the modes description was merged in with the
> previous paragraph in the online version of the man pages. Fix by adding
> a newline before the bulleted list.
> 
> Reported-by: Ross Zwisler 
> Signed-off-by: Vishal Verma 
> ---
>  Documentation/ndctl/namespace-description.txt | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/Documentation/ndctl/namespace-description.txt 
> b/Documentation/ndctl/namespace-description.txt
> index 1dd687e..94999e5 100644
> --- a/Documentation/ndctl/namespace-description.txt
> +++ b/Documentation/ndctl/namespace-description.txt
> @@ -21,6 +21,7 @@ area.
>  A namespace can be provisioned to operate in one of 4 modes, 'fsdax',
>  'devdax', 'sector', and 'raw'. Here are the expected usage models for
>  these modes:
> +
>   - fsdax: Filesystem-DAX mode is the default mode of a namespace
> when specifying 'ndctl create-namespace' with no options. It creates
> a block device (/dev/pmemX[.Y]) that supports the DAX capabilities

Cool, thanks for fixing.


[PATCH v2 1/2] dax: dax_layout_busy_page() warn on !exceptional

2018-06-27 Thread Ross Zwisler
Inodes using DAX should only ever have exceptional entries in their page
caches.  Make this clear by warning if the iteration in
dax_layout_busy_page() ever sees a non-exceptional entry, and by adding a
comment for the pagevec_release() call which only deals with struct page
pointers.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 641192808bb6..897b51e41d8f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -566,7 +566,8 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping)
if (index >= end)
break;
 
-   if (!radix_tree_exceptional_entry(pvec_ent))
+   if (WARN_ON_ONCE(
+!radix_tree_exceptional_entry(pvec_ent)))
continue;
 
xa_lock_irq(&mapping->i_pages);
@@ -578,6 +579,13 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping)
if (page)
break;
}
+
+   /*
+* We don't expect normal struct page entries to exist in our
+* tree, but we keep these pagevec calls so that this code is
+* consistent with the common pattern for handling pagevecs
+* throughout the kernel.
+*/
pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
index++;
-- 
2.14.4



[PATCH v2 2/2] ext4: handle layout changes to pinned DAX mappings

2018-06-27 Thread Ross Zwisler
Follow the lead of xfs_break_dax_layouts() and add synchronization between
operations in ext4 which remove blocks from an inode (hole punch, truncate
down, etc.) and pages which are pinned due to DAX DMA operations.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
---
 fs/ext4/ext4.h |  1 +
 fs/ext4/extents.c  | 12 
 fs/ext4/inode.c| 46 ++
 fs/ext4/truncate.h |  4 
 4 files changed, 63 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0b127853c584..34bccd64d83d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct 
ext4_iloc *);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
+extern int ext4_break_layouts(struct inode *);
 extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
 extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int 
nblocks);
 extern void ext4_set_inode_flags(struct inode *);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0057fe3f248d..a6aef06f455b 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, loff_t 
offset,
 * released from page cache.
 */
down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   ret = ext4_break_layouts(inode);
+   if (ret) {
+   up_write(&EXT4_I(inode)->i_mmap_sem);
+   goto out_mutex;
+   }
+
ret = ext4_update_disksize_before_punch(inode, offset, len);
if (ret) {
up_write(&EXT4_I(inode)->i_mmap_sem);
@@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t 
offset, loff_t len)
 * page cache.
 */
down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   ret = ext4_break_layouts(inode);
+   if (ret)
+   goto out_mmap;
+
/*
 * Need to round down offset to be aligned with page size boundary
 * for page size > block size.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2ea07efbe016..fadb8ecacb1e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4193,6 +4193,39 @@ int ext4_update_disksize_before_punch(struct inode 
*inode, loff_t offset,
return 0;
 }
 
+static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock)
+{
+   *did_unlock = true;
+   up_write(&ei->i_mmap_sem);
+   schedule();
+   down_write(&ei->i_mmap_sem);
+}
+
+int ext4_break_layouts(struct inode *inode)
+{
+   struct ext4_inode_info *ei = EXT4_I(inode);
+   struct page *page;
+   bool retry;
+   int error;
+
+   if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem)))
+   return -EINVAL;
+
+   do {
+   retry = false;
+   page = dax_layout_busy_page(inode->i_mapping);
+   if (!page)
+   return 0;
+
+   error = ___wait_var_event(&page->_refcount,
+   atomic_read(&page->_refcount) == 1,
+   TASK_INTERRUPTIBLE, 0, 0,
+   ext4_wait_dax_page(ei, &retry));
+   } while (error == 0 && retry);
+
+   return error;
+}
+
 /*
  * ext4_punch_hole: punches a hole in a file by releasing the blocks
  * associated with the given offset and length
@@ -4266,6 +4299,11 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, 
loff_t length)
 * page cache.
 */
down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   ret = ext4_break_layouts(inode);
+   if (ret)
+   goto out_dio;
+
first_block_offset = round_up(offset, sb->s_blocksize);
last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;
 
@@ -5554,6 +5592,14 @@ int ext4_setattr(struct dentry *dentry, struct iattr 
*attr)
ext4_wait_for_tail_page_commit(inode);
}
down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   rc = ext4_break_layouts(inode);
+   if (rc) {
+   up_write(&EXT4_I(inode)->i_mmap_sem);
+   error = rc;
+   goto err_out;
+   }
+
/*
 * Truncate pagecache after we've waited for commit
 * in data=journal mode to make pages freeable.
diff --git a/fs/ext4/truncate.h b/fs/ext4/truncate.h
index 0cb13badf473..bcbe3668c1d4 100644
--- a/fs/ext4/truncate.h
+++ b/fs/ext4/truncate.h
@@ -11,6 +11,10 @@
  */
 static inline void ext4_truncate_failed_write(struct inode *inode)
 {
+   /*
+* We don't need to call ext4_break_layouts() because the blocks we
+* are truncating were never visible to userspace.
+*/
down_write(&EXT4_I(inode)->

[PATCH v2 0/2] ext4: fix DAX dma vs truncate/hole-punch

2018-06-27 Thread Ross Zwisler
This series from Dan:

https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html

added synchronization between DAX dma and truncate/hole-punch in XFS.
This short series adds analogous support to ext4.

I've added calls to ext4_break_layouts() everywhere that ext4 removes
blocks from an inode's map.

The timings in XFS are such that it's difficult to hit this race.  Dan
was able to show the race by manually introducing delays in the direct
I/O path.

For ext4, though, it's trivial to hit this race, and a hit will result in
a trigger of this WARN_ON_ONCE() in dax_disassociate_entry():

WARN_ON_ONCE(trunc && page_ref_count(page) > 1);

I've made an xfstest which tests all the paths where we now call
ext4_break_layouts(). Each of the four paths easily hits this race many
times in my test setup with the xfstest.  You can find that test here:

https://lists.01.org/pipermail/linux-nvdimm/2018-June/016435.html

With these patches applied, I've still seen occasional hits of the above
WARN_ON_ONCE(), which tells me that we still have some work to do.  I'll
continue looking at these more rare hits.
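
The retry pattern these patches add (shared with xfs_break_dax_layouts())
can be sketched in userspace Python; Page, layout_busy_page() and the
polling wait below are simplified, invented stand-ins for struct page,
dax_layout_busy_page() and ___wait_var_event():

```python
import threading
import time

class Page:
    def __init__(self, refcount):
        self.refcount = refcount

def layout_busy_page(pages):
    """Stand-in for dax_layout_busy_page(): first page still referenced."""
    for page in pages:
        if page.refcount > 1:
            return page
    return None

def break_layouts(pages, saw_busy):
    """Stand-in for ext4_break_layouts(): retry until no page is busy."""
    while True:
        page = layout_busy_page(pages)
        if page is None:
            return
        saw_busy.set()
        while page.refcount > 1:      # stand-in for ___wait_var_event()
            time.sleep(0.001)

pages = [Page(2), Page(3)]
saw_busy = threading.Event()

def dma_complete():
    saw_busy.wait()                   # wait until a busy page was observed
    for page in pages:
        page.refcount = 1             # "DMA" finishes, references dropped

t = threading.Thread(target=dma_complete)
t.start()
break_layouts(pages, saw_busy)
t.join()
print([page.refcount for page in pages])  # → [1, 1]
```

The key point the sketch shows is the outer loop: after waiting for one
busy page, the whole inode is rescanned, since other pages may still be
pinned.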

--- 

Changes in v2:
 * A little cleanup to each patch as suggested by Jan.
 * Removed the ext4_break_layouts() call in ext4_truncate_failed_write()
   and added a comment instead. (Jan)
 * Added reviewed-by tags from Jan.

Ross Zwisler (2):
  dax: dax_layout_busy_page() warn on !exceptional
  ext4: handle layout changes to pinned DAX mappings

 fs/dax.c   | 10 +-
 fs/ext4/ext4.h |  1 +
 fs/ext4/extents.c  | 12 
 fs/ext4/inode.c| 46 ++
 fs/ext4/truncate.h |  4 
 5 files changed, 72 insertions(+), 1 deletion(-)

-- 
2.14.4



Re: [PATCH 2/2] ext4: handle layout changes to pinned DAX mappings

2018-06-27 Thread Ross Zwisler
On Fri, Jun 22, 2018 at 10:19:15AM +0200, Jan Kara wrote:
> On Wed 20-06-18 16:15:03, Ross Zwisler wrote:
> > Follow the lead of xfs_break_dax_layouts() and add synchronization between
> > operations in ext4 which remove blocks from an inode (hole punch, truncate
> > down, etc.) and pages which are pinned due to DAX DMA operations.
> > 
> > Signed-off-by: Ross Zwisler 
> > ---
> >  fs/ext4/ext4.h |  1 +
> >  fs/ext4/extents.c  | 12 
> >  fs/ext4/inode.c| 48 
> >  fs/ext4/truncate.h |  1 +
> >  4 files changed, 62 insertions(+)
> > 
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 0b127853c584..34bccd64d83d 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct 
> > ext4_iloc *);
> >  extern int ext4_inode_attach_jinode(struct inode *inode);
> >  extern int ext4_can_truncate(struct inode *inode);
> >  extern int ext4_truncate(struct inode *);
> > +extern int ext4_break_layouts(struct inode *);
> >  extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t 
> > length);
> >  extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int 
> > nblocks);
> >  extern void ext4_set_inode_flags(struct inode *);
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index 0057fe3f248d..a6aef06f455b 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, 
> > loff_t offset,
> >  * released from page cache.
> >  */
> > down_write(&EXT4_I(inode)->i_mmap_sem);
> > +
> > +   ret = ext4_break_layouts(inode);
> > +   if (ret) {
> > up_write(&EXT4_I(inode)->i_mmap_sem);
> > +   goto out_mutex;
> > +   }
> > +
> > ret = ext4_update_disksize_before_punch(inode, offset, len);
> > if (ret) {
> > up_write(&EXT4_I(inode)->i_mmap_sem);
> > @@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t 
> > offset, loff_t len)
> >  * page cache.
> >  */
> > down_write(&EXT4_I(inode)->i_mmap_sem);
> > +
> > +   ret = ext4_break_layouts(inode);
> > +   if (ret)
> > +   goto out_mmap;
> > +
> > /*
> >  * Need to round down offset to be aligned with page size boundary
> >  * for page size > block size.
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 2ea07efbe016..c795e5118745 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4193,6 +4193,41 @@ int ext4_update_disksize_before_punch(struct inode 
> > *inode, loff_t offset,
> > return 0;
> >  }
> >  
> > +static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool 
> > *did_unlock)
> > +{
> > +   *did_unlock = true;
> > +   up_write(&ei->i_mmap_sem);
> > +   schedule();
> > +   down_write(&ei->i_mmap_sem);
> > +}
> > +
> > +int ext4_break_layouts(struct inode *inode)
> > +{
> > +   struct ext4_inode_info *ei = EXT4_I(inode);
> > +   struct page *page;
> > +   bool retry;
> > +   int error;
> > +
> > +   if (unlikely(!rwsem_is_locked(&ei->i_mmap_sem))) {
> > +   WARN_ON_ONCE(1);
> > +   return -EINVAL;
> > +   }
> 
> This could be shortened as:
> 
> if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem))) {
> }
> 
> couldn't it?

Yep, that's much better.  Thank you for the review.

> Besides that the patch looks good to me. You can add:
> 
> Reviewed-by: Jan Kara 
> 
> And I'm really wondering which protection are we still missing that you are
> still able to hit the warning with these patches applied.

I'm also very curious and am going to dig into that this week.


[PATCH v5 2/3] dax: bdev_dax_supported() check for QUEUE_FLAG_DAX

2018-06-26 Thread Ross Zwisler
Add an explicit check for QUEUE_FLAG_DAX to __bdev_dax_supported().  This
is needed for DM configurations where the first element in the dm-linear or
dm-stripe target supports DAX, but other elements do not.  Without this
check __bdev_dax_supported() will pass for such devices, letting a
filesystem on that device mount with the DAX option.

Signed-off-by: Ross Zwisler 
Suggested-by: Mike Snitzer 
Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support")
Cc: sta...@vger.kernel.org
---
 drivers/dax/super.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 903d9c473749..45276abf03aa 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -86,6 +86,7 @@ bool __bdev_dax_supported(struct block_device *bdev, int 
blocksize)
 {
struct dax_device *dax_dev;
bool dax_enabled = false;
+   struct request_queue *q;
pgoff_t pgoff;
int err, id;
void *kaddr;
@@ -99,6 +100,13 @@ bool __bdev_dax_supported(struct block_device *bdev, int 
blocksize)
return false;
}
 
+   q = bdev_get_queue(bdev);
+   if (!q || !blk_queue_dax(q)) {
+   pr_debug("%s: error: request queue doesn't support dax\n",
+   bdevname(bdev, buf));
+   return false;
+   }
+
err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
if (err) {
pr_debug("%s: error: unaligned partition for dax\n",
-- 
2.14.4



[PATCH v5 0/3] Fix DM DAX handling

2018-06-26 Thread Ross Zwisler
This series fixes a few issues that I found with DM's handling of DAX
devices.  Here are some of the issues I found:

 * We can create a dm-stripe or dm-linear device which is made up of an
   fsdax PMEM namespace and a raw PMEM namespace but which can hold a
   filesystem mounted with the -o dax mount option.  DAX operations to
   the raw PMEM namespace part lack struct page and can fail in
   interesting/unexpected ways when doing things like fork(), examining
   memory with gdb, etc.

 * We can create a dm-stripe or dm-linear device which is made up of an
   fsdax PMEM namespace and a BRD ramdisk which can hold a filesystem
   mounted with the -o dax mount option.  All I/O to this filesystem
   will fail.

---

Changes since v4:
 * No code changes.
 * Updated the changelogs for patches 1 and 3.
 * Removed the Cc: stable from patch 1.
 * Added a Fixes: tag to patch 2.

Ross Zwisler (3):
  pmem: only set QUEUE_FLAG_DAX for fsdax mode
  dax: bdev_dax_supported() check for QUEUE_FLAG_DAX
  dm: prevent DAX mounts if not supported

 drivers/dax/super.c   | 8 ++++++++
 drivers/md/dm-table.c | 7 ++++---
 drivers/md/dm.c   | 3 +--
 drivers/nvdimm/pmem.c | 3 ++-
 4 files changed, 15 insertions(+), 6 deletions(-)

-- 
2.14.4



[PATCH v5 1/3] pmem: only set QUEUE_FLAG_DAX for fsdax mode

2018-06-26 Thread Ross Zwisler
QUEUE_FLAG_DAX is an indication that a given block device supports
filesystem DAX and should not be set for PMEM namespaces which are in "raw"
mode.  These namespaces lack struct page and are prevented from
participating in filesystem DAX as of commit 569d0365f571 "dax: require
'struct page' by default for filesystem dax".

Signed-off-by: Ross Zwisler 
Suggested-by: Mike Snitzer 
---
 drivers/nvdimm/pmem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 68940356cad3..8b1fd7f1a224 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -414,7 +414,8 @@ static int pmem_attach_disk(struct device *dev,
blk_queue_logical_block_size(q, pmem_sector_size(ndns));
blk_queue_max_hw_sectors(q, UINT_MAX);
blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
-   blk_queue_flag_set(QUEUE_FLAG_DAX, q);
+   if (pmem->pfn_flags & PFN_MAP)
+   blk_queue_flag_set(QUEUE_FLAG_DAX, q);
q->queuedata = pmem;
 
disk = alloc_disk_node(0, nid);
-- 
2.14.4



[PATCH v5 3/3] dm: prevent DAX mounts if not supported

2018-06-26 Thread Ross Zwisler
Currently device_supports_dax() just checks to see if the QUEUE_FLAG_DAX
flag is set on the device's request queue to decide whether or not the
device supports filesystem DAX.  Really we should be using
bdev_dax_supported() like filesystems do at mount time.  This performs
other tests like checking to make sure the dax_direct_access() path works.

We also explicitly clear QUEUE_FLAG_DAX on the DM device's request queue if
any of the underlying devices do not support DAX.  This makes the handling
of QUEUE_FLAG_DAX consistent with the setting/clearing of most other flags
in dm_table_set_restrictions().

Now that bdev_dax_supported() explicitly checks for QUEUE_FLAG_DAX, this
will ensure that filesystems built upon DM devices will only be able to
mount with DAX if all underlying devices also support DAX.

Signed-off-by: Ross Zwisler 
Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support")
Cc: sta...@vger.kernel.org
---
 drivers/md/dm-table.c | 7 ++++---
 drivers/md/dm.c   | 3 +--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 938766794c2e..3d0e2c198f06 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -885,9 +885,7 @@ EXPORT_SYMBOL_GPL(dm_table_set_type);
 static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
   sector_t start, sector_t len, void *data)
 {
-   struct request_queue *q = bdev_get_queue(dev->bdev);
-
-   return q && blk_queue_dax(q);
+   return bdev_dax_supported(dev->bdev, PAGE_SIZE);
 }
 
 static bool dm_table_supports_dax(struct dm_table *t)
@@ -1907,6 +1905,9 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 
if (dm_table_supports_dax(t))
blk_queue_flag_set(QUEUE_FLAG_DAX, q);
+   else
+   blk_queue_flag_clear(QUEUE_FLAG_DAX, q);
+
if (dm_table_supports_dax_write_cache(t))
dax_write_cache(t->md->dax_dev, true);
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e65429a29c06..bef5a3f243ed 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1056,8 +1056,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
if (len < 1)
goto out;
nr_pages = min(len, nr_pages);
-   if (ti->type->direct_access)
-   ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
+   ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
 
  out:
dm_put_live_table(md, srcu_idx);
-- 
2.14.4



Re: [PATCH v3 1/3] pmem: only set QUEUE_FLAG_DAX for fsdax mode

2018-06-26 Thread Ross Zwisler
On Tue, Jun 26, 2018 at 02:51:52PM -0700, Dan Williams wrote:
> On Tue, Jun 26, 2018 at 2:31 PM, Kani, Toshi  wrote:
> > On Tue, 2018-06-26 at 14:28 -0700, Dan Williams wrote:
> >> On Tue, Jun 26, 2018 at 2:23 PM, Kani, Toshi  wrote:
> >> > On Tue, 2018-06-26 at 14:02 -0700, Dan Williams wrote:
> >> > > On Tue, Jun 26, 2018 at 1:54 PM, Kani, Toshi  
> >> > > wrote:
> >>
> >> [..]
> >> > > > When this dm change was made, the pmem driver supported DAX for both 
> >> > > > raw
> >> > > > and memory modes (note: sector mode does not use the pmem driver).  I
> >> > > > think the issue was introduced when we dropped DAX support from raw
> >> > > > mode.
> >> > >
> >> > > Still DAX with raw mode never really worked any way. It was also
> >> > > something that was broken from day one. So what happens to someone who
> >> > > happened to avoid all the problems with page-less DAX and enabled
> >> > > device-mapper on top? That failure mode detail needs to be added to
> >> > > this changelog if we want to propose this for -stable.
> >> >
> >> > My point is that the behavior should be consistent between pmem and
> >> > device-mapper.  When -o dax succeeds on a pmem, then it should succeed
> >> > on a device-mapper on top of that pmem.
> >> >
> >> > Has the drop of dax support from raw mode made to -stable back to the
> >> > baseline accepted 545ed20e6df6?  It will introduce inconsistency,
> >> > otherwise.
> >>
> >> That commit, 569d0365f571 "dax: require 'struct page' by default for
> >> filesystem dax", has not been tagged for -stable.
> >
> > Then, Fixes tag should be set to 569d0365f571 to keep the behavior
> > consistent.
> 
> Sure, and the failure mode is...? I'm thinking the commit log should say:
> 
> "Starting with commit 569d0365f571 "dax: require 'struct page' by
> default for filesystem dax", dax is no longer supported for page-less
> configurations. However, device-mapper sees the QUEUE_FLAG_DAX still
> being set and falsely assumes that DAX is enabled, this leads to
> "

Dan is correct that there is no user visible change for this.  It is the right
thing to do for consistency and sanity, but it doesn't actually have user
visible behavior that needs to be backported to stable.

Toshi is correct that this change is only for raw mode namespaces, not btt
namespaces.

I'll adjust the changelog and remove the stable flag for v5, and I'll add a
Fixes: tag for patch 2.


Re: [PATCH v3 1/3] pmem: only set QUEUE_FLAG_DAX for fsdax mode

2018-06-26 Thread Ross Zwisler
On Tue, Jun 26, 2018 at 12:07:40PM -0700, Dan Williams wrote:
> On Tue, Jun 26, 2018 at 11:58 AM, Mike Snitzer  wrote:
> > On Tue, Jun 26 2018 at  2:52pm -0400,
> > Dan Williams  wrote:
> >
> >> On Tue, Jun 26, 2018 at 10:59 AM, Ross Zwisler
> >>  wrote:
> >> > QUEUE_FLAG_DAX is an indication that a given block device supports
> >> > filesystem DAX and should not be set for PMEM namespaces which are in 
> >> > "raw"
> >> > or "sector" modes.  These namespaces lack struct page and are prevented
> >> > from participating in filesystem DAX.
> >> >
> >> > Signed-off-by: Ross Zwisler 
> >> > Suggested-by: Mike Snitzer 
> >> > Cc: sta...@vger.kernel.org
> >>
> >> Why is this cc: stable? What is the user visible impact of this change
> >> especially given the requirement to validate QUEUE_FLAG_DAX with
> >> bdev_dax_supported()? Patch looks good, but it's just a cosmetic fixup
> >> afaics.
> >
> > This isn't cosmetic when you consider that stacking up a DM device is
> > looking at this flag to determine whether a table does or does _not_
> > support DAX.
> >
> > So this patch, in conjunction with the other changes in the series, is
> > certainly something I'd consider appropriate for stable.
> 
> I think this classifies as something that never worked correctly and
> is not a regression. It does not identify which commit it is repairing
> or the user visible failure mode.

Ah, do I need a Fixes: tag for patch 2, then?  That one *does* need to go to
stable, I think.


Re: [PATCH v3 1/3] pmem: only set QUEUE_FLAG_DAX for fsdax mode

2018-06-26 Thread Ross Zwisler
On Tue, Jun 26, 2018 at 02:58:30PM -0400, Mike Snitzer wrote:
> On Tue, Jun 26 2018 at  2:52pm -0400,
> Dan Williams  wrote:
> 
> > On Tue, Jun 26, 2018 at 10:59 AM, Ross Zwisler
> >  wrote:
> > > QUEUE_FLAG_DAX is an indication that a given block device supports
> > > filesystem DAX and should not be set for PMEM namespaces which are in 
> > > "raw"
> > > or "sector" modes.  These namespaces lack struct page and are prevented
> > > from participating in filesystem DAX.
> > >
> > > Signed-off-by: Ross Zwisler 
> > > Suggested-by: Mike Snitzer 
> > > Cc: sta...@vger.kernel.org
> > 
> > Why is this cc: stable? What is the user visible impact of this change
> > especially given the requirement to validate QUEUE_FLAG_DAX with
> > bdev_dax_supported()? Patch looks good, but it's just a cosmetic fixup
> > afaics.
> 
> This isn't cosmetic when you consider that stacking up a DM device is
> looking at this flag to determine whether a table does or does _not_
> support DAX.
> 
> So this patch, in conjunction with the other changes in the series, is
> certainly something I'd consider appropriate for stable.
> 
> Mike

Because in patch 3 of this series we now use the full bdev_dax_supported()
instead of just checking the queue flag in device_supports_dax(), I agree that
this isn't strictly necessary for stable.  device_supports_dax() will still
notice that the raw/sector namespaces don't support DAX because
bdev_dax_supported() will fail, and we'll end up doing the right thing and not
setting QUEUE_FLAG_DAX on the DM device.

I think maybe it's good to have in stable for completeness (and it's a very
small change), but if we drop it from stable the code will still do the right
thing AFAICT.


[PATCH v4 3/3] dm: prevent DAX mounts if not supported

2018-06-26 Thread Ross Zwisler
Currently device_supports_dax() just checks to see if the QUEUE_FLAG_DAX
flag is set on the device's request queue to decide whether or not the
device supports filesystem DAX.  Really we should be using
bdev_dax_supported() like filesystems do at mount time.  This performs
other tests like checking to make sure the dax_direct_access() path works.

Conditionally set QUEUE_FLAG_DAX on the DM device's request queue based on
whether all underlying devices support DAX.  Now that bdev_dax_supported()
explicitly checks for this flag, this will ensure that filesystems built
upon DM devices will only be able to mount with DAX if all underlying
devices also support DAX.

Signed-off-by: Ross Zwisler 
Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support")
Cc: sta...@vger.kernel.org

---

Changes in v4:
 * Set/clear QUEUE_FLAG_DAX in dm_table_set_restrictions(). (Mike)
---
 drivers/md/dm-table.c | 7 ++++---
 drivers/md/dm.c   | 3 +--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 938766794c2e..3d0e2c198f06 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -885,9 +885,7 @@ EXPORT_SYMBOL_GPL(dm_table_set_type);
 static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
   sector_t start, sector_t len, void *data)
 {
-   struct request_queue *q = bdev_get_queue(dev->bdev);
-
-   return q && blk_queue_dax(q);
+   return bdev_dax_supported(dev->bdev, PAGE_SIZE);
 }
 
 static bool dm_table_supports_dax(struct dm_table *t)
@@ -1907,6 +1905,9 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 
if (dm_table_supports_dax(t))
blk_queue_flag_set(QUEUE_FLAG_DAX, q);
+   else
+   blk_queue_flag_clear(QUEUE_FLAG_DAX, q);
+
if (dm_table_supports_dax_write_cache(t))
dax_write_cache(t->md->dax_dev, true);
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e65429a29c06..bef5a3f243ed 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1056,8 +1056,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
if (len < 1)
goto out;
nr_pages = min(len, nr_pages);
-   if (ti->type->direct_access)
-   ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
+   ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
 
  out:
dm_put_live_table(md, srcu_idx);
-- 
2.14.4



[PATCH v3 3/3] dm: prevent DAX mounts if not supported

2018-06-26 Thread Ross Zwisler
Currently device_supports_dax() just checks to see if the QUEUE_FLAG_DAX
flag is set on the device's request queue to decide whether or not the
device supports filesystem DAX.  Really we should be using
bdev_dax_supported() like filesystems do at mount time.  This performs
other tests like checking to make sure the dax_direct_access() path works.

Conditionally set QUEUE_FLAG_DAX on the DM device's request queue based on
whether all underlying devices support DAX.  Now that bdev_dax_supported()
explicitly checks for this flag, this will ensure that filesystems built
upon DM devices will only be able to mount with DAX if all underlying
devices also support DAX.

Signed-off-by: Ross Zwisler 
Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support")
Cc: sta...@vger.kernel.org
---
 drivers/md/dm-ioctl.c | 5 +++++
 drivers/md/dm-table.c | 7 +++----
 drivers/md/dm.c   | 3 +--
 include/linux/device-mapper.h | 5 +++++
 4 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index b810ea77e6b1..0055bdbee5b1 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1350,6 +1350,11 @@ static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_si
goto err_unlock_md_type;
}
 
+   if (dm_table_supports_dax(t))
+   blk_queue_flag_set(QUEUE_FLAG_DAX, md->queue);
+   else
+   blk_queue_flag_clear(QUEUE_FLAG_DAX, md->queue);
+
dm_unlock_md_type(md);
 
/* stage inactive table */
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 938766794c2e..c673b4a51fb2 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -885,12 +885,10 @@ EXPORT_SYMBOL_GPL(dm_table_set_type);
 static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
   sector_t start, sector_t len, void *data)
 {
-   struct request_queue *q = bdev_get_queue(dev->bdev);
-
-   return q && blk_queue_dax(q);
+   return bdev_dax_supported(dev->bdev, PAGE_SIZE);
 }
 
-static bool dm_table_supports_dax(struct dm_table *t)
+bool dm_table_supports_dax(struct dm_table *t)
 {
struct dm_target *ti;
unsigned i;
@@ -909,6 +907,7 @@ static bool dm_table_supports_dax(struct dm_table *t)
 
return true;
 }
+EXPORT_SYMBOL_GPL(dm_table_supports_dax);
 
 static bool dm_table_does_not_support_partial_completion(struct dm_table *t);
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e65429a29c06..bef5a3f243ed 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1056,8 +1056,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
if (len < 1)
goto out;
nr_pages = min(len, nr_pages);
-   if (ti->type->direct_access)
-   ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
+   ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
 
  out:
dm_put_live_table(md, srcu_idx);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 6fb0808e87c8..45ea9e1f9af9 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -461,6 +461,11 @@ void dm_table_add_target_callbacks(struct dm_table *t, struct dm_target_callback
  */
 void dm_table_set_type(struct dm_table *t, enum dm_queue_mode type);
 
+/*
+ * Check to see if this target type and all table devices support DAX.
+ */
+bool dm_table_supports_dax(struct dm_table *t);
+
 /*
  * Finally call this to make the table ready for use.
  */
-- 
2.14.4



[PATCH v3 0/3] Fix DM DAX handling

2018-06-26 Thread Ross Zwisler
This series fixes a few issues that I found with DM's handling of DAX
devices.  Here are some of the issues I found:

 * We can create a dm-stripe or dm-linear device which is made up of an
   fsdax PMEM namespace and a raw PMEM namespace but which can hold a
   filesystem mounted with the -o dax mount option.  DAX operations to
   the raw PMEM namespace part lack struct page and can fail in
   interesting/unexpected ways when doing things like fork(), examining
   memory with gdb, etc.

 * We can create a dm-stripe or dm-linear device which is made up of an
   fsdax PMEM namespace and a BRD ramdisk which can hold a filesystem
   mounted with the -o dax mount option.  All I/O to this filesystem
   will fail.

---

Changes since v2:
  * Only set QUEUE_FLAG_DAX for fsdax mode PMEM namespaces. (Mike)
  * Check for QUEUE_FLAG_DAX in __bdev_dax_supported(). (Mike)
  * Get rid of DM_TYPE_DAX_BIO_BASED reworks. (Mike)
  * Dropped the first 2 prep patches of v2 since they were merged for
v4.18-rc1.  (Thanks, Darrick!)

---

Mike, can you take this series through your tree?

Personally I think this should be treated as a bug fix and merged in the
v4.18-rc* series.

Ross Zwisler (3):
  pmem: only set QUEUE_FLAG_DAX for fsdax mode
  dax: bdev_dax_supported() check for QUEUE_FLAG_DAX
  dm: prevent DAX mounts if not supported

 drivers/dax/super.c   | 8 ++++++++
 drivers/md/dm-ioctl.c | 5 +++++
 drivers/md/dm-table.c | 7 +++----
 drivers/md/dm.c   | 3 +--
 drivers/nvdimm/pmem.c | 3 ++-
 include/linux/device-mapper.h | 5 +++++
 6 files changed, 24 insertions(+), 7 deletions(-)

-- 
2.14.4



[PATCH v3 2/3] dax: bdev_dax_supported() check for QUEUE_FLAG_DAX

2018-06-26 Thread Ross Zwisler
Add an explicit check for QUEUE_FLAG_DAX to __bdev_dax_supported().  This
is needed for DM configurations where the first element in the dm-linear or
dm-stripe target supports DAX, but other elements do not.  Without this
check __bdev_dax_supported() will pass for such devices, letting a
filesystem on that device mount with the DAX option.

Signed-off-by: Ross Zwisler 
Suggested-by: Mike Snitzer 
Cc: sta...@vger.kernel.org
---
 drivers/dax/super.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 903d9c473749..45276abf03aa 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -86,6 +86,7 @@ bool __bdev_dax_supported(struct block_device *bdev, int blocksize)
 {
struct dax_device *dax_dev;
bool dax_enabled = false;
+   struct request_queue *q;
pgoff_t pgoff;
int err, id;
void *kaddr;
@@ -99,6 +100,13 @@ bool __bdev_dax_supported(struct block_device *bdev, int blocksize)
return false;
}
 
+   q = bdev_get_queue(bdev);
+   if (!q || !blk_queue_dax(q)) {
+   pr_debug("%s: error: request queue doesn't support dax\n",
+   bdevname(bdev, buf));
+   return false;
+   }
+
	err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
if (err) {
pr_debug("%s: error: unaligned partition for dax\n",
-- 
2.14.4



[PATCH v3 1/3] pmem: only set QUEUE_FLAG_DAX for fsdax mode

2018-06-26 Thread Ross Zwisler
QUEUE_FLAG_DAX is an indication that a given block device supports
filesystem DAX and should not be set for PMEM namespaces which are in "raw"
or "sector" modes.  These namespaces lack struct page and are prevented
from participating in filesystem DAX.

Signed-off-by: Ross Zwisler 
Suggested-by: Mike Snitzer 
Cc: sta...@vger.kernel.org
---
 drivers/nvdimm/pmem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 68940356cad3..8b1fd7f1a224 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -414,7 +414,8 @@ static int pmem_attach_disk(struct device *dev,
blk_queue_logical_block_size(q, pmem_sector_size(ndns));
blk_queue_max_hw_sectors(q, UINT_MAX);
blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
-   blk_queue_flag_set(QUEUE_FLAG_DAX, q);
+   if (pmem->pfn_flags & PFN_MAP)
+   blk_queue_flag_set(QUEUE_FLAG_DAX, q);
q->queuedata = pmem;
 
disk = alloc_disk_node(0, nid);
-- 
2.14.4



Re: [PATCH v2 4/7] dm: prevent DAX mounts if not supported

2018-06-25 Thread Ross Zwisler
On Wed, Jun 20, 2018 at 11:17:49AM -0400, Mike Snitzer wrote:
> On Mon, Jun 04 2018 at  7:15pm -0400,
> Ross Zwisler  wrote:
> 
> > On Fri, Jun 01, 2018 at 05:55:13PM -0400, Mike Snitzer wrote:
> > > On Tue, May 29 2018 at  3:51pm -0400,
> > > Ross Zwisler  wrote:
> > > 
> > > > Currently the code in dm_dax_direct_access() only checks whether the 
> > > > target
> > > > type has a direct_access() operation defined, not whether the underlying
> > > > block devices all support DAX.  This latter property can be seen by 
> > > > looking
> > > > at whether we set the QUEUE_FLAG_DAX request queue flag when creating 
> > > > the
> > > > DM device.
> > > 
> > > Wait... I thought DAX support was all or nothing?
> > 
> > Right, it is, and that's what I'm trying to capture.  The point of this 
> > series
> > is to make sure that we don't use DAX thru DM if one of the DM members 
> > doesn't
> > support DAX.
> > 
> > This is a bit tricky, though, because as you've pointed out there are a lot 
> > of
> > elements that go into a block device actually supporting DAX.  
> > 
> > First, the block device has to have a direct_access() operation defined in 
> > its
> > struct dax_operations table.  This is a static definition in the drivers,
> > though, so it's necessary but not sufficient.  For example, the PMEM driver
> > always defines a direct_access() operation, but depending on the mode of the
> > namespace (raw, fsdax or sector) it may or may not support DAX.
> > 
> > The next step is that a driver needs to say that the block queue supports
> > QUEUE_FLAG_DAX.  This again is necessary but not sufficient.  The PMEM 
> > driver
> > currently sets this for all namespace modes, but I agree that this should be
> > restricted to modes that support DAX.  Even once we do that, though, for the
> > block driver this isn't fully sufficient.  We'd really like users to call
> > bdev_dax_supported() so it can run some additional tests to make sure that 
> > DAX
> > will work.
> > 
> > So, the real test that filesystems rely on is bdev_dax_supported().
> > 
> > The trick is that with DM we need to verify each block device via
> > bdev_dax_supported() just like a filesystem would, and then have some way of
> > communicating the result of all those checks to the filesystem which is
> > eventually mounted on the DM device.  At DAX mount time the filesystem will
> > call bdev_dax_supported() on the DM device, but it'll really only check the
> > first device.  
> > 
> > So, the strategy is to have DM manually check each member device via
> > bdev_dax_supported() then if they all pass set QUEUE_FLAG_DAX.  This then
> > becomes our one source of truth on whether or not a DM device supports DAX.
> > When the filesystem mounts with DAX support it'll also run
> > bdev_dax_supported(), but if we have QUEUE_FLAG_DAX set on the DM device, we
> > know that this check will pass.
> > 
> > > > This is problematic if we have, for example, a dm-linear device made up 
> > > > of
> > > > a PMEM namespace in fsdax mode followed by a ramdisk from BRD.
> > > > QUEUE_FLAG_DAX won't be set on the dm-linear device's request queue, but
> > > > we have a working direct_access() entry point and the first member of 
> > > > the
> > > > dm-linear set *does* support DAX.
> > > 
> > > If you don't have a uniformly capable device then it is very dangerous
> > > to advertise that the entire device has a certain capability.  That
> > > completely bit me in the past with discard (because for every IO I
> > > wasn't then checking if the destination device supported discards).
> > >
> > > It is all well and good that you're adding that check here.  But what I
> > > don't like is how you're saying QUEUE_FLAG_DAX implies direct_access()
> > > operation exists.. yet for raw PMEM namespaces we just discussed how
> > > that is a lie.
> > 
> > QUEUE_FLAG_DAX does imply that direct_access() exits.  However, as discussed
> > above for a given bdev we really do need to check bdev_dax_supported().
> > 
> > > So this type of change showcases how the QUEUE_FLAG_DAX doesn't _really_
> > > imply direct_access() exists.
> > > 
> > > > This allows the user to create a filesystem on the dm-linear device, and
> > > > then mount it with DAX.  The filesystem's bdev_dax_supported() test will
> > > > pass because it'll 

Re: [fstests PATCH 1/2] src/: fix up mmap() error checking

2018-06-22 Thread Ross Zwisler
On Fri, Jun 22, 2018 at 10:28:38AM +0800, Eryu Guan wrote:
> On Wed, Jun 20, 2018 at 04:51:46PM -0600, Ross Zwisler wrote:
> > I noticed that in some of my C tests in src/ I was incorrectly checking for
> > mmap() failure by looking for NULL instead of MAP_FAILED.  Fix those and
> > clean up some places where we were testing against -1 (the actual value of
> > MAP_FAILED) which was manually being cast to a pointer.
> > 
> > Signed-off-by: Ross Zwisler 
> 
> BTW, I've applied this patch, no need to resend it in v2.

Okay, thanks for letting me know.


Re: [fstests PATCH 2/2] generic/999: test DAX DMA vs truncate/hole-punch

2018-06-21 Thread Ross Zwisler
On Thu, Jun 21, 2018 at 12:18:42PM +1000, Dave Chinner wrote:
> On Wed, Jun 20, 2018 at 04:51:47PM -0600, Ross Zwisler wrote:
<>
> > +rm -f $seqres.full
> > +
> > +# Modify as appropriate.
> > +_supported_fs generic
> > +_supported_os Linux
> > +_require_test
> > +_require_test_program "t_mmap_collision"
> > +
> > +# turn off DAX on our scratch device so we can get normal O_DIRECT behavior
> > +export MOUNT_OPTIONS=""
> > +_scratch_unmount >> $seqres.full 2>&1
> > +_scratch_mount >> $seqres.full 2>&1
> 
> _exclude_scratch_mount_option "dax"
> 
> i.e. we skip the test if "-o dax" is in the mount options. We try
> not to turn off all mount options, because then we never exercise
> any filesystem configuration other than mount defaults, regardless of
> what other mount options the tester has specified on the CLI or
> in config files...
> 
> > +Silence is golden
> > diff --git a/tests/generic/group b/tests/generic/group
> > index 83a6fdab..793f71ed 100644
> > --- a/tests/generic/group
> > +++ b/tests/generic/group
> > @@ -501,3 +501,4 @@
> >  496 auto quick swap
> >  497 auto quick swap collapse
> >  498 auto quick log
> > +999 auto quick dax
> 
> Why should this be in a "dax" group when the test explicitly turns
> off the DAX mount option?

Right, sorry, this is a bit confusing.  We only turn off dax for the scratch
device, but we use the full mount options for the test device.  To reproduce
the bug I'm looking for we would like those mount options to include -o dax
(hence the membership in the dax group), but the test runs fine without this
so I didn't want to require the dax mount option for the test to run.

I'll add comments to this effect, and I'll address the rest of your feedback
in v2.

Thank you for the review.


[fstests PATCH 2/2] generic/999: test DAX DMA vs truncate/hole-punch

2018-06-20 Thread Ross Zwisler
This adds a regression test for the following series:

https://lists.01.org/pipermail/linux-nvdimm/2018-June/016431.html

which added synchronization between DAX DMA in ext4 and
truncate/hole-punch.  The intention of the test is to test those specific
changes, but it runs fine both with XFS and without DAX, so I've put it in
the generic tests instead of ext4 and not restricted it to only DAX
configurations.

When run with v4.18-rc1 + DAX + ext4, this test will hit the following
WARN_ON_ONCE() in dax_disassociate_entry():

WARN_ON_ONCE(trunc && page_ref_count(page) > 1);

If you change this to a WARN_ON() instead, you can see that each of the
four paths being exercised in this test hits that condition many times in
the one second that the subtest is being run.

Signed-off-by: Ross Zwisler 
---
 .gitignore |   1 +
 src/Makefile   |   2 +-
 src/t_mmap_collision.c | 208 +
 tests/generic/999  |  50 
 tests/generic/999.out  |   2 +
 tests/generic/group|   1 +
 6 files changed, 263 insertions(+), 1 deletion(-)
 create mode 100644 src/t_mmap_collision.c
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out

diff --git a/.gitignore b/.gitignore
index efc73a7c..936955e0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -126,6 +126,7 @@
 /src/t_immutable
 /src/t_locks_execve
 /src/t_mmap_cow_race
+/src/t_mmap_collision
 /src/t_mmap_dio
 /src/t_mmap_fallocate
 /src/t_mmap_stale_pmd
diff --git a/src/Makefile b/src/Makefile
index b06b7e25..2dbe5e7e 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -15,7 +15,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
holetest t_truncate_self t_mmap_dio af_unix t_mmap_stale_pmd \
t_mmap_cow_race t_mmap_fallocate fsync-err t_mmap_write_ro \
t_ext4_dax_journal_corruption t_ext4_dax_inline_corruption \
-   t_ofd_locks t_locks_execve
+   t_ofd_locks t_locks_execve t_mmap_collision
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
diff --git a/src/t_mmap_collision.c b/src/t_mmap_collision.c
new file mode 100644
index ..f652e913
--- /dev/null
+++ b/src/t_mmap_collision.c
@@ -0,0 +1,208 @@
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#define PAGE(a) ((a)*0x1000)
+#define FILE_SIZE PAGE(4)
+
+void *dax_data;
+int nodax_fd;
+int dax_fd;
+bool done;
+
+#define err_exit(op)  \
+{ \
+   fprintf(stderr, "%s %s: %s\n", __func__, op, strerror(errno));\
+   exit(1);  \
+} \
+
+void punch_hole_fn(void *ptr)
+{
+   ssize_t read;
+   int rc;
+
+   while (!done) {
+   read = 0;
+
+   do {
+   rc = pread(nodax_fd, dax_data + read, FILE_SIZE - read,
+   read);
+   if (rc > 0)
+   read += rc;
+   } while (rc > 0);
+
+   if (read != FILE_SIZE || rc != 0)
+   err_exit("pread");
+
+   rc = fallocate(dax_fd,
+   FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+   0, FILE_SIZE);
+   if (rc < 0)
+   err_exit("fallocate");
+
+   usleep(rand() % 1000);
+   }
+}
+
+void zero_range_fn(void *ptr)
+{
+   ssize_t read;
+   int rc;
+
+   while (!done) {
+   read = 0;
+
+   do {
+   rc = pread(nodax_fd, dax_data + read, FILE_SIZE - read,
+   read);
+   if (rc > 0)
+   read += rc;
+   } while (rc > 0);
+
+   if (read != FILE_SIZE || rc != 0)
+   err_exit("pread");
+
+   rc = fallocate(dax_fd,
+   FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
+   0, FILE_SIZE);
+   if (rc < 0)
+   err_exit("fallocate");
+
+   usleep(rand() % 1000);
+   }
+}
+
+void truncate_down_fn(void *ptr)
+{
+   ssize_t read;
+   int rc;
+
+   while (!done) {
+   read = 0;
+
+   if (ftruncate(dax_fd, 0) < 0)
+   err_exit("ftruncate");
+   if (fallocate(dax_fd, 0, 0, FILE_SIZE) < 0)
+   err_exit("fallocate");
+
+   do {
+  

[PATCH 1/2] dax: dax_layout_busy_page() warn on !exceptional

2018-06-20 Thread Ross Zwisler
Inodes using DAX should only ever have exceptional entries in their page
caches.  Make this clear by warning if the iteration in
dax_layout_busy_page() ever sees a non-exceptional entry, and by adding a
comment for the pagevec_release() call which only deals with struct page
pointers.

Signed-off-by: Ross Zwisler 
---
 fs/dax.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 641192808bb6..4a5e31f8a2d4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -566,7 +566,7 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
if (index >= end)
break;
 
-   if (!radix_tree_exceptional_entry(pvec_ent))
+   if (WARN_ON_ONCE(!radix_tree_exceptional_entry(pvec_ent)))
continue;
 
xa_lock_irq(&mapping->i_pages);
@@ -578,6 +578,13 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
if (page)
break;
}
+
+   /*
+* We don't expect normal struct page entries to exist in our
+* tree, but we keep these pagevec calls so that this code is
+* consistent with the common pattern for handling pagevecs
+* throughout the kernel.
+*/
pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
index++;
-- 
2.14.4

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 2/2] ext4: handle layout changes to pinned DAX mappings

2018-06-20 Thread Ross Zwisler
Follow the lead of xfs_break_dax_layouts() and add synchronization between
operations in ext4 which remove blocks from an inode (hole punch, truncate
down, etc.) and pages which are pinned due to DAX DMA operations.

Signed-off-by: Ross Zwisler 
---
 fs/ext4/ext4.h |  1 +
 fs/ext4/extents.c  | 12 ++++++++++++
 fs/ext4/inode.c| 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/truncate.h |  1 +
 4 files changed, 62 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0b127853c584..34bccd64d83d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
+extern int ext4_break_layouts(struct inode *);
 extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks);
 extern void ext4_set_inode_flags(struct inode *);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0057fe3f248d..a6aef06f455b 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 * released from page cache.
 */
down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   ret = ext4_break_layouts(inode);
+   if (ret) {
+   up_write(&EXT4_I(inode)->i_mmap_sem);
+   goto out_mutex;
+   }
+
ret = ext4_update_disksize_before_punch(inode, offset, len);
if (ret) {
up_write(&EXT4_I(inode)->i_mmap_sem);
@@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 * page cache.
 */
down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   ret = ext4_break_layouts(inode);
+   if (ret)
+   goto out_mmap;
+
/*
 * Need to round down offset to be aligned with page size boundary
 * for page size > block size.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2ea07efbe016..c795e5118745 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4193,6 +4193,41 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
return 0;
 }
 
+static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock)
+{
+   *did_unlock = true;
+   up_write(&ei->i_mmap_sem);
+   schedule();
+   down_write(&ei->i_mmap_sem);
+}
+
+int ext4_break_layouts(struct inode *inode)
+{
+   struct ext4_inode_info *ei = EXT4_I(inode);
+   struct page *page;
+   bool retry;
+   int error;
+
+   if (unlikely(!rwsem_is_locked(&ei->i_mmap_sem))) {
+   WARN_ON_ONCE(1);
+   return -EINVAL;
+   }
+
+   do {
+   retry = false;
+   page = dax_layout_busy_page(inode->i_mapping);
+   if (!page)
+   return 0;
+
+   error = ___wait_var_event(&page->_refcount,
+   atomic_read(&page->_refcount) == 1,
+   TASK_INTERRUPTIBLE, 0, 0,
+   ext4_wait_dax_page(ei, &retry));
+   } while (error == 0 && retry);
+
+   return error;
+}
+
 /*
  * ext4_punch_hole: punches a hole in a file by releasing the blocks
  * associated with the given offset and length
@@ -4266,6 +4301,11 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 * page cache.
 */
down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   ret = ext4_break_layouts(inode);
+   if (ret)
+   goto out_dio;
+
first_block_offset = round_up(offset, sb->s_blocksize);
last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;
 
@@ -5554,6 +5594,14 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
ext4_wait_for_tail_page_commit(inode);
}
down_write(&EXT4_I(inode)->i_mmap_sem);
+
+   rc = ext4_break_layouts(inode);
+   if (rc) {
+   up_write(&EXT4_I(inode)->i_mmap_sem);
+   error = rc;
+   goto err_out;
+   }
+
/*
 * Truncate pagecache after we've waited for commit
 * in data=journal mode to make pages freeable.
diff --git a/fs/ext4/truncate.h b/fs/ext4/truncate.h
index 0cb13badf473..a3b78241e9f6 100644
--- a/fs/ext4/truncate.h
+++ b/fs/ext4/truncate.h
@@ -12,6 +12,7 @@
 static inline void ext4_truncate_failed_write(struct inode *inode)
 {
down_write(&EXT4_I(inode)->i_mmap_sem);
+   ext4_break_layouts(inode);
truncate_inode_pages(inode->i_mapping, inode->i_size);
ext4_tru

[PATCH 0/2] ext4: fix DAX dma vs truncate/hole-punch

2018-06-20 Thread Ross Zwisler
This series from Dan:

https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html

added synchronization between DAX dma and truncate/hole-punch in XFS.
This short series adds analogous support to ext4.

I've added calls to ext4_break_layouts() everywhere that ext4 removes
blocks from an inode's map.

The timings in XFS are such that it's difficult to hit this race.  Dan
was able to show the race by manually introducing delays in the direct
I/O path.

For ext4, though, it's trivial to hit this race, and a hit will result in
a trigger of this WARN_ON_ONCE() in dax_disassociate_entry():

WARN_ON_ONCE(trunc && page_ref_count(page) > 1);

I've made an xfstest which tests all the paths where we now call
ext4_break_layouts(), with the exception of the site in
ext4_truncate_failed_write() which is an error path.  Each of the other
4 paths easily hits this race many times in my test setup with the
xfstest.  I'll post this test shortly.

With these patches applied, I've still seen occasional hits of the above
WARN_ON_ONCE(), which tells me that we still have some work to do.  I'll
continue looking at these more rare hits while we review this set.

Ross Zwisler (2):
  dax: dax_layout_busy_page() warn on !exceptional
  ext4: handle layout changes to pinned DAX mappings

 fs/dax.c   |  9 ++++++++-
 fs/ext4/ext4.h |  1 +
 fs/ext4/extents.c  | 12 ++++++++++++
 fs/ext4/inode.c| 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/truncate.h |  1 +
 5 files changed, 70 insertions(+), 1 deletion(-)

-- 
2.14.4



Re: [PATCH] tools/testing/nvdimm: advertise a write cache for nfit_test

2018-06-18 Thread Ross Zwisler
On Mon, Jun 18, 2018 at 05:17:02PM -0600, Vishal Verma wrote:
> Commit 546eb0317cfa "libnvdimm, pmem: Do not flush power-fail protected CPU 
> caches"
> fixed the write_cache detection to correctly show the lack of a write
> cache based on the platform capabilities described in the ACPI NFIT. The
> nfit_test unit tests expected a write cache to be present, so change the
> nfit test namespaces to only advertise a persistence domain limited to
> the memory controller. This allows the kernel to show a write_cache
> attribute, and the test behaviour remains unchanged.
> 
> Cc: Ross Zwisler 
> Cc: Dan Williams 
> Signed-off-by: Vishal Verma 

Sure, this makes sense to me.  Thanks for the fix.

Reviewed-by: Ross Zwisler 


Re: [fstests PATCH] generic/223: skip when using DAX

2018-06-13 Thread Ross Zwisler
On Wed, Jun 13, 2018 at 03:25:58PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 13, 2018 at 03:07:42PM -0600, Ross Zwisler wrote:
> > As of these upstream kernel commits:
> > 
> > commit 6e2608dfd934 ("xfs, dax: introduce xfs_dax_aops")
> > commit 5f0663bb4a64 ("ext4, dax: introduce ext4_dax_aops")
> > 
> > generic/223 fails on XFS and ext4 because filesystems mounted with DAX no
> > longer support bmap.  This is desired behavior and will not be fixed,
> > according to:
> > 
> > https://lists.01.org/pipermail/linux-nvdimm/2018-April/015383.html
> > 
> > So, just skip over generic/223 when using DAX so we don't throw false
> > positive test failures.
> 
> Just because we decided not to support FIBMAP on XFSDAX doesn't mean we
> should let this test bitrot. :)
> 
> Just out of curiosity, does the following patch fix g/223 for you?

Yep!  This makes generic/223 pass in my setup for both DAX and non-DAX, with
both XFS and ext4.


Re: [PATCH v11 4/7] mm, fs, dax: handle layout changes to pinned dax mappings

2018-06-12 Thread Ross Zwisler
On Fri, May 18, 2018 at 06:35:13PM -0700, Dan Williams wrote:
> Background:
> 
> get_user_pages() in the filesystem pins file backed memory pages for
> access by devices performing dma. However, it only pins the memory pages
> not the page-to-file offset association. If a file is truncated the
> pages are mapped out of the file and dma may continue indefinitely into
> a page that is owned by a device driver. This breaks coherency of the
> file vs dma, but the assumption is that if userspace wants the
> file-space truncated it does not matter what data is inbound from the
> device, it is not relevant anymore. The only expectation is that dma can
> safely continue while the filesystem reallocates the block(s).
> 
> Problem:
> 
> This expectation that dma can safely continue while the filesystem
> changes the block map is broken by dax. With dax the target dma page
> *is* the filesystem block. The model of leaving the page pinned for dma,
> but truncating the file block out of the file, means that the filesytem
> is free to reallocate a block under active dma to another file and now
> the expected data-incoherency situation has turned into active
> data-corruption.
> 
> Solution:
> 
> Defer all filesystem operations (fallocate(), truncate()) on a dax mode
> file while any page/block in the file is under active dma. This solution
> assumes that dma is transient. Cases where dma operations are known to
> not be transient, like RDMA, have been explicitly disabled via
> commits like 5f1d43de5416 "IB/core: disable memory registration of
> filesystem-dax vmas".
> 
> The dax_layout_busy_page() routine is called by filesystems with a lock
> held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
> The process of looking up a busy page invalidates all mappings
> to trigger any subsequent get_user_pages() to block on i_mmap_lock.
> The filesystem continues to call dax_layout_busy_page() until it finally
> returns no more active pages. This approach assumes that the page
> pinning is transient, if that assumption is violated the system would
> have likely hung from the uncompleted I/O.
> 
> Cc: Jeff Moyer 
> Cc: Dave Chinner 
> Cc: Matthew Wilcox 
> Cc: Alexander Viro 
> Cc: "Darrick J. Wong" 
> Cc: Ross Zwisler 
> Cc: Dave Hansen 
> Cc: Andrew Morton 
> Reported-by: Christoph Hellwig 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Jan Kara 
> Signed-off-by: Dan Williams 
> ---
<>
> @@ -492,6 +505,90 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
>   return entry;
>  }
>  
> +/**
> + * dax_layout_busy_page - find first pinned page in @mapping
> + * @mapping: address space to scan for a page with ref count > 1
> + *
> + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> + * 'onlined' to the page allocator so they are considered idle when
> + * page->count == 1. A filesystem uses this interface to determine if
> + * any page in the mapping is busy, i.e. for DMA, or other
> + * get_user_pages() usages.
> + *
> + * It is expected that the filesystem is holding locks to block the
> + * establishment of new mappings in this address_space. I.e. it expects
> + * to be able to run unmap_mapping_range() and subsequently not race
> + * mapping_mapped() becoming true.
> + */
> +struct page *dax_layout_busy_page(struct address_space *mapping)
> +{
> + pgoff_t indices[PAGEVEC_SIZE];
> + struct page *page = NULL;
> + struct pagevec pvec;
> + pgoff_t index, end;
> + unsigned i;
> +
> + /*
> +  * In the 'limited' case get_user_pages() for dax is disabled.
> +  */
> + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> + return NULL;
> +
> + if (!dax_mapping(mapping) || !mapping_mapped(mapping))
> + return NULL;
> +
> + pagevec_init(&pvec);
> + index = 0;
> + end = -1;
> +
> + /*
> +  * If we race get_user_pages_fast() here either we'll see the
> +  * elevated page count in the pagevec_lookup and wait, or
> +  * get_user_pages_fast() will see that the page it took a reference
> +  * against is no longer mapped in the page tables and bail to the
> +  * get_user_pages() slow path.  The slow path is protected by
> +  * pte_lock() and pmd_lock(). New references are not taken without
> +  * holding those locks, and unmap_mapping_range() will not zero the
> +  * pte or pmd without holding the respective lock, so we are
> +  * guaranteed to either see new references or prevent new
> +  * references from being established.
> +  */
> + unmap_mapping_range(mapping, 0, 0, 1);
> +
> + while (index < end 

Re: [PATCH v4 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages

2018-06-12 Thread Ross Zwisler
On Fri, Jun 08, 2018 at 04:51:19PM -0700, Dan Williams wrote:
> mce: Uncorrected hardware memory error in user-access at af34214200
> {1}[Hardware Error]: It has been corrected by h/w and requires no further 
> action
> mce: [Hardware Error]: Machine check events logged
> {1}[Hardware Error]: event severity: corrected
> Memory failure: 0xaf34214: reserved kernel page still referenced by 1 
> users
> [..]
> Memory failure: 0xaf34214: recovery action for reserved kernel page: 
> Failed
> mce: Memory error not recovered
> 
> In contrast to typical memory, dev_pagemap pages may be dax mapped. With
> dax there is no possibility to map in another page dynamically since dax
> establishes 1:1 physical address to file offset associations. Also
> dev_pagemap pages associated with NVDIMM / persistent memory devices can
> internal remap/repair addresses with poison. While memory_failure()
> assumes that it can discard typical poisoned pages and keep them
> unmapped indefinitely, dev_pagemap pages may be returned to service
> after the error is cleared.
> 
> Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
> dev_pagemap pages that have poison consumed by userspace. Mark the
> memory as UC instead of unmapping it completely to allow ongoing access
> via the device driver (nd_pmem). Later, nd_pmem will grow support for
> marking the page back to WB when the error is cleared.
> 
> Cc: Jan Kara 
> Cc: Christoph Hellwig 
> Cc: Jérôme Glisse 
> Cc: Matthew Wilcox 
> Cc: Naoya Horiguchi 
> Cc: Ross Zwisler 
> Signed-off-by: Dan Williams 
> ---
<>
> +static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> + struct dev_pagemap *pgmap)
> +{
> + const bool unmap_success = true;
> + unsigned long size;
> + struct page *page;
> + LIST_HEAD(tokill);
> + int rc = -EBUSY;
> + loff_t start;
> +
> + /*
> +  * Prevent the inode from being freed while we are interrogating
> +  * the address_space, typically this would be handled by
> +  * lock_page(), but dax pages do not use the page lock.
> +  */
> + page = dax_lock_page(pfn);
> + if (!page)
> + goto out;
> +
> + if (hwpoison_filter(page)) {
> + rc = 0;
> + goto unlock;
> + }
> +
> + switch (pgmap->type) {
> + case MEMORY_DEVICE_PRIVATE:
> + case MEMORY_DEVICE_PUBLIC:
> + /*
> +  * TODO: Handle HMM pages which may need coordination
> +  * with device-side memory.
> +  */
> + goto unlock;
> + default:
> + break;
> + }
> +
> + /*
> +  * If the page is not mapped in userspace then report it as
> +  * unhandled.
> +  */
> + size = dax_mapping_size(page);
> + if (!size) {
> + pr_err("Memory failure: %#lx: failed to unmap page\n", pfn);
> + goto unlock;
> + }
> +
> + SetPageHWPoison(page);
> +
> + /*
> +  * Unlike System-RAM there is no possibility to swap in a
> +  * different physical page at a given virtual address, so all
> +  * userspace consumption of ZONE_DEVICE memory necessitates
> +  * SIGBUS (i.e. MF_MUST_KILL)
> +  */
> + flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> + collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);

You know "flags & MF_ACTION_REQUIRED" will always be true, so you can just
pass in MF_ACTION_REQUIRED or even just "true".

> +
> + start = (page->index << PAGE_SHIFT) & ~(size - 1);
> + unmap_mapping_range(page->mapping, start, start + size, 0);
> +
> + kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, ilog2(size),

You know "flags & MF_MUST_KILL" will always be true, so you can just pass in
MF_MUST_KILL or even just "true".

Also, you can get rid of the constant "unmap_success" if you want and just
pass in false as the 3rd argument.


Re: [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page()

2018-06-12 Thread Ross Zwisler
On Fri, Jun 08, 2018 at 04:51:14PM -0700, Dan Williams wrote:
> In preparation for implementing support for memory poison (media error)
> handling via dax mappings, implement a lock_page() equivalent. Poison
> error handling requires rmap and needs guarantees that the page->mapping
> association is maintained / valid (inode not freed) for the duration of
> the lookup.
> 
> In the device-dax case it is sufficient to simply hold a dev_pagemap
> reference. In the filesystem-dax case we need to use the entry lock.
> 
> Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
> protect against the inode being freed, and revalidates the page->mapping
> association under xa_lock().
> 
> Signed-off-by: Dan Williams 
> ---
>  fs/dax.c|   76 
> +++
>  include/linux/dax.h |   15 ++
>  2 files changed, 91 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index cccf6cad1a7a..b7e71b108fcf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>   }
>  }
>  
> +struct page *dax_lock_page(unsigned long pfn)
> +{
> + pgoff_t index;
> + struct inode *inode;
> + wait_queue_head_t *wq;
> + void *entry = NULL, **slot;
> + struct address_space *mapping;
> + struct wait_exceptional_entry_queue ewait;
> + struct page *ret = NULL, *page = pfn_to_page(pfn);
> +
> + rcu_read_lock();
> + for (;;) {
> + mapping = READ_ONCE(page->mapping);

Why the READ_ONCE()?

> +
> + if (!mapping || !IS_DAX(mapping->host))

Might read better using the dax_mapping() helper.

Also, forgive my ignorance, but this implies that dev dax has page->mapping
set up and that that inode will have IS_DAX set, right?  This will let us get
past this point for device DAX, and we'll bail out at the S_ISCHR() check?

> + break;
> +
> + /*
> +  * In the device-dax case there's no need to lock, a
> +  * struct dev_pagemap pin is sufficient to keep the
> +  * inode alive.
> +  */
> + inode = mapping->host;
> + if (S_ISCHR(inode->i_mode)) {
> + ret = page;

'ret' isn't actually used for anything in this function, we just
unconditionally return 'page'.

> + break;
> + }
> +
> + xa_lock_irq(&mapping->i_pages);
> + if (mapping != page->mapping) {
> + xa_unlock_irq(&mapping->i_pages);
> + continue;
> + }
> + index = page->index;
> +
> + init_wait(&ewait.wait);
> + ewait.wait.func = wake_exceptional_entry_func;
> +
> + entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
> + &slot);
> + if (!entry ||

So if we do a lookup and there is no entry in the tree, we won't add an empty
entry and lock it, we'll just return with no entry in the tree and nothing
locked.

Then, when we call dax_unlock_page(), we'll eventually hit a WARN_ON_ONCE() in 
dax_unlock_mapping_entry() when we see entry is 0.  And, in that gap we've got
nothing locked so page faults could have happened, etc... (which would mean
that instead of WARN_ON_ONCE() for an empty entry, we'd hit it instead for an
unlocked entry).

Is that okay?  Or do we need to insert a locked empty entry here?


Re: [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page()

2018-06-12 Thread Ross Zwisler
On Mon, Jun 11, 2018 at 05:41:46PM +0200, Jan Kara wrote:
> On Fri 08-06-18 16:51:14, Dan Williams wrote:
> > In preparation for implementing support for memory poison (media error)
> > handling via dax mappings, implement a lock_page() equivalent. Poison
> > error handling requires rmap and needs guarantees that the page->mapping
> > association is maintained / valid (inode not freed) for the duration of
> > the lookup.
> > 
> > In the device-dax case it is sufficient to simply hold a dev_pagemap
> > reference. In the filesystem-dax case we need to use the entry lock.
> > 
> > Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
> > protect against the inode being freed, and revalidates the page->mapping
> > association under xa_lock().
> > 
> > Signed-off-by: Dan Williams 
> 
> Some comments below...
> 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index cccf6cad1a7a..b7e71b108fcf 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
> > }
> >  }
> >  
> > +struct page *dax_lock_page(unsigned long pfn)
> > +{
> 
> Why do you return struct page here? Any reason behind that? Because struct
> page exists and can be accessed through pfn_to_page() regardless of result
> of this function so it looks a bit confusing. Also dax_lock_page() name
> seems a bit confusing. Maybe dax_lock_pfn_mapping_entry()?

It's also a bit awkward that the functions are asymmetric in their arguments:
dax_lock_page(pfn) vs dax_unlock_page(struct page)

Looking at dax_lock_page(), we only use 'pfn' to get 'page', so maybe it would
be cleaner to just always deal with struct page, i.e.:

void dax_lock_page(struct page *page);
void dax_unlock_page(struct page *page);


Re: [qemu PATCH 1/5] gitignore: ignore generated qapi job files

2018-06-08 Thread Ross Zwisler
On Fri, Jun 08, 2018 at 09:59:00AM -0500, Eric Blake wrote:
> On 06/07/2018 05:31 PM, Ross Zwisler wrote:
> > With a fully built QEMU I currently see the following with "git status":
> > 
> >Untracked files:
> >  (use "git add ..." to include in what will be committed)
> > 
> > qapi/qapi-commands-job.c
> > qapi/qapi-commands-job.h
> > qapi/qapi-events-job.c
> > qapi/qapi-events-job.h
> > qapi/qapi-types-job.c
> > qapi/qapi-types-job.h
> > qapi/qapi-visit-job.c
> > qapi/qapi-visit-job.h
> > 
> > These are all generated files.  Ignore them.
> > 
> > Signed-off-by: Ross Zwisler 
> > Cc: Kevin Wolf 
> > Cc: Eric Blake 
> > Fixes: commit bf42508f24ee ("job: Introduce qapi/job.json")
> 
> You're the third to post this:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg07280.html
> https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg01624.html

I pulled your master branch yesterday before posting the series.  If multiple
people are sending fixes to the same issue and the fixes are correct, the
easiest way to stop getting those fixes is probably to merge one of the
patches.


Re: [Qemu-devel] [qemu PATCH 2/5] acpi: "make check" should fail on asl mismatch

2018-06-08 Thread Ross Zwisler
On Fri, Jun 08, 2018 at 07:17:51AM +0200, Thomas Huth wrote:
> On 08.06.2018 01:09, Michael S. Tsirkin wrote:
> > On Thu, Jun 07, 2018 at 04:31:08PM -0600, Ross Zwisler wrote:
> >> Currently if "make check" detects a mismatch in the ASL generated during
> >> testing, we print an error such as:
> >>
> >>   acpi-test: Warning! SSDT mismatch. Actual [asl:/tmp/asl-QZDWJZ.dsl,
> >>   aml:/tmp/aml-T8JYJZ], Expected [asl:/tmp/asl-DTWVJZ.dsl,
> >>   aml:tests/acpi-test-data/q35/SSDT.dimmpxm].
> >>
> >> but the testing still exits with good shell status.  This is wrong, and
> >> makes bisecting such a failure difficult.
> >>
> >> Signed-off-by: Ross Zwisler 
> > 
> > Failing would also mean that any change must update the expected files
> > at the same time.  And that in turn is problematic because expected
> > files are binary and can't be merged.
> > 
> > In other words the way we devel ACPI right now means that bisect will
> > periodically produce a diff, it's not an error.
> 
> But apparently the current way also allows that real bug go unnoticed
> for a while, until somebody accidentially spots the warning in the
> output of "make check". Wouldn't it be better to fail at CI time
> already? If a merge of the file is required, you can still resolve that
> manually (i.e. by rebasing one of the pull requests).

I share this point of view.  The unit tests only add value if we keep them up
to date and passing as we modify the source.  The ACPI tables in this case
were broken in an innocuous way and just needed to be updated to match again,
but it means that the tests for them are now basically turned off.  Someone
else could come along and break the ACPI table in a real and harmful way, and
nobody would notice because the end result would still just be an ACPI table
mismatch that is non-fatal and ignored.


[qemu PATCH 4/5] machine: fix some misspelled words

2018-06-07 Thread Ross Zwisler
Normally this might not be worth fixing, but several of these are strings
which are displayed to users.

Signed-off-by: Ross Zwisler 
---
 hw/core/machine.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 617e5f8d75..a21269fa39 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -609,7 +609,7 @@ static void machine_class_init(ObjectClass *oc, void *data)
 machine_get_igd_gfx_passthru, machine_set_igd_gfx_passthru,
&error_abort);
 object_class_property_set_description(oc, "igd-passthru",
-"Set on/off to enable/disable igd passthrou", _abort);
+"Set on/off to enable/disable igd passthru", _abort);
 
 object_class_property_add_str(oc, "firmware",
 machine_get_firmware, machine_set_firmware,
@@ -633,7 +633,7 @@ static void machine_class_init(ObjectClass *oc, void *data)
 machine_get_memory_encryption, machine_set_memory_encryption,
&error_abort);
 object_class_property_set_description(oc, "memory-encryption",
-"Set memory encyption object to use", _abort);
+"Set memory encryption object to use", _abort);
 }
 
 static void machine_class_base_init(ObjectClass *oc, void *data)
@@ -806,7 +806,7 @@ void machine_run_board_init(MachineState *machine)
 for (i = 0; machine_class->valid_cpu_types[i]; i++) {
 if (object_class_dynamic_cast(class,
   machine_class->valid_cpu_types[i])) {
-/* The user specificed CPU is in the valid field, we are
+/* The user specified CPU is in the valid field, we are
  * good to go.
  */
 break;
-- 
2.14.4



[qemu PATCH 1/5] gitignore: ignore generated qapi job files

2018-06-07 Thread Ross Zwisler
With a fully built QEMU I currently see the following with "git status":

  Untracked files:
(use "git add ..." to include in what will be committed)

qapi/qapi-commands-job.c
qapi/qapi-commands-job.h
qapi/qapi-events-job.c
qapi/qapi-events-job.h
qapi/qapi-types-job.c
qapi/qapi-types-job.h
qapi/qapi-visit-job.c
qapi/qapi-visit-job.h

These are all generated files.  Ignore them.

Signed-off-by: Ross Zwisler 
Cc: Kevin Wolf 
Cc: Eric Blake 
Fixes: commit bf42508f24ee ("job: Introduce qapi/job.json")
---
 .gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.gitignore b/.gitignore
index 81e1f2fb0f..2980090f0a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -100,6 +100,7 @@
 /qapi/qapi-visit-ui.[ch]
 /qapi/qapi-visit.[ch]
 /qapi/qapi-doc.texi
+/qapi/qapi-*-job.[ch]
 /qemu-doc.html
 /qemu-doc.info
 /qemu-doc.txt
-- 
2.14.4



[qemu PATCH 2/5] acpi: "make check" should fail on asl mismatch

2018-06-07 Thread Ross Zwisler
Currently if "make check" detects a mismatch in the ASL generated during
testing, we print an error such as:

  acpi-test: Warning! SSDT mismatch. Actual [asl:/tmp/asl-QZDWJZ.dsl,
  aml:/tmp/aml-T8JYJZ], Expected [asl:/tmp/asl-DTWVJZ.dsl,
  aml:tests/acpi-test-data/q35/SSDT.dimmpxm].

but the testing still exits with good shell status.  This is wrong, and
makes bisecting such a failure difficult.

Signed-off-by: Ross Zwisler 
---
 tests/bios-tables-test.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tests/bios-tables-test.c b/tests/bios-tables-test.c
index 256d463cb8..9b5db1ee8f 100644
--- a/tests/bios-tables-test.c
+++ b/tests/bios-tables-test.c
@@ -469,6 +469,7 @@ static void test_acpi_asl(test_data *data)
 }
 }
   }
+  g_test_fail();
 }
 g_string_free(asl, true);
 g_string_free(exp_asl, true);
-- 
2.14.4



[qemu PATCH 3/5] hw/i386: Update SSDT table used by "make check"

2018-06-07 Thread Ross Zwisler
This commit:

commit aa78a16d8645 ("hw/i386: Rename 2.13 machine types to 3.0")

updated the name used to create the q35 machine, which in turn changed the
SSDT table which is generated when we run "make check":

  acpi-test: Warning! SSDT mismatch. Actual [asl:/tmp/asl-QZDWJZ.dsl,
  aml:/tmp/aml-T8JYJZ], Expected [asl:/tmp/asl-DTWVJZ.dsl,
  aml:tests/acpi-test-data/q35/SSDT.dimmpxm].

Here's the only difference, aside from the checksum:

  < Name (MEMA, 0x07FFF000)
  ---
  > Name (MEMA, 0x07FFE000)

Update the binary table that we compare against so it reflects this name
change.

Signed-off-by: Ross Zwisler 
Cc: Peter Maydell 
Cc: Cornelia Huck 
Cc: Thomas Huth 
Cc: Eduardo Habkost 
Fixes: commit aa78a16d8645 ("hw/i386: Rename 2.13 machine types to 3.0")
---
 tests/acpi-test-data/q35/SSDT.dimmpxm | Bin 685 -> 685 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)

diff --git a/tests/acpi-test-data/q35/SSDT.dimmpxm b/tests/acpi-test-data/q35/SSDT.dimmpxm
index 8ba0e67cb72daa81a65da4906d37a5e0f4af1fd4..2d5b721bcf9c398feb6d005761f898015042e8a4 100644
GIT binary patch
delta 23
fcmZ3>x|WqIIM^j*EfWI+qr*n71x(Bz{<8xBPCEwk

delta 23
fcmZ3>x|WqIIM^j*EfWI+W57nP1x(Bj{<8xBPMZev

-- 
2.14.4



Re: [PATCH v3 2/4] libnvdimm: unconditionally deep flush on *sync

2018-06-06 Thread Ross Zwisler
On Wed, Jun 06, 2018 at 10:57:59AM -0700, Dan Williams wrote:
> On Wed, Jun 6, 2018 at 9:45 AM, Ross Zwisler
>  wrote:
> > Prior to this commit we would only do a "deep flush" (have nvdimm_flush()
> > write to each of the flush hints for a region) in response to an
> > msync/fsync/sync call if the nvdimm_has_cache() returned true at the time
> > we were setting up the request queue.  This happens due to the write cache
> > value passed in to blk_queue_write_cache(), which then causes the block
> > layer to send down BIOs with REQ_FUA and REQ_PREFLUSH set.  We do have a
> > "write_cache" sysfs entry for namespaces, i.e.:
> >
> >   /sys/bus/nd/devices/pfn0.1/block/pmem0/dax/write_cache
> >
> > which can be used to control whether or not the kernel thinks a given
> > namespace has a write cache, but this didn't modify the deep flush behavior
> > that we set up when the driver was initialized.  Instead, it only modified
> > whether or not DAX would flush CPU caches via dax_flush() in response to
> > *sync calls.
> >
> > Simplify this by making the *sync deep flush always happen, regardless of
> > the write cache setting of a namespace.  The DAX CPU cache flushing will
> > still be controlled by the write_cache setting of the namespace.
> >
> > Signed-off-by: Ross Zwisler 
> > Suggested-by: Dan Williams 
> 
> Looks, good. I believe we want this one and ["PATCH v3 4/4] libnvdimm:
> don't flush power-fail protected CPU caches" marked for -stable and
> tagged with:
> 
> Fixes: 5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices
> via fsync()")
> 
> ...any concerns with that?

Nope, sounds good.  Can you fix that up when you apply, or would it be helpful
for me to send another revision with those tags?


Re: [PATCH v2 5/7] dm: remove DM_TYPE_DAX_BIO_BASED dm_queue_mode

2018-06-06 Thread Ross Zwisler
On Mon, Jun 04, 2018 at 08:46:28PM -0400, Mike Snitzer wrote:
> On Mon, Jun 04 2018 at  7:24pm -0400,
> Ross Zwisler  wrote:
> 
> > On Fri, Jun 01, 2018 at 06:04:43PM -0400, Mike Snitzer wrote:
> > > On Tue, May 29 2018 at  3:51pm -0400,
> > > Ross Zwisler  wrote:
> > > 
> > > > The DM_TYPE_DAX_BIO_BASED dm_queue_mode was introduced to prevent DM
> > > > devices that could possibly support DAX from transitioning into DM 
> > > > devices
> > > > that cannot support DAX.
> > > > 
> > > > For example, the following transition will currently fail:
> > > > 
> > > >  dm-linear: [fsdax pmem][fsdax pmem] => [fsdax pmem][fsdax raw]
> > > >   DM_TYPE_DAX_BIO_BASED   DM_TYPE_BIO_BASED
> > > > 
> > > > but these will both succeed:
> > > > 
> > > >  dm-linear: [fsdax pmem][brd ramdisk] => [fsdax pmem][fsdax raw]
> > > >   DM_TYPE_DAX_BIO_BASEDDM_TYPE_BIO_BASED
> > > > 
> > > 
> > > I fail to see how this succeeds given
> > > drivers/md/dm-ioctl.c:is_valid_type() only allows transitions from:
> > > 
> > > DM_TYPE_BIO_BASED => DM_TYPE_DAX_BIO_BASED
> > 
> > Right, sorry, that was a typo.  What I meant was:
> > 
> > > For example, the following transition will currently fail:
> > > 
> > >  dm-linear: [fsdax pmem][fsdax pmem] => [fsdax pmem][fsdax raw]
> > >   DM_TYPE_DAX_BIO_BASED   DM_TYPE_BIO_BASED
> > > 
> > > but these will both succeed:
> > > 
> > >  dm-linear: [fsdax pmem][brd ramdisk] => [fsdax pmem][fsdax raw]
> > > DM_TYPE_BIO_BASEDDM_TYPE_BIO_BASED
> > > 
> > >  dm-linear: [fsdax pmem][fsdax raw] => [fsdax pmem][fsdax pmem]
> > > DM_TYPE_BIO_BASEDDM_TYPE_DAX_BIO_BASED
> > 
> > So we allow 2 of the 3 transitions, but the reason that we disallow the 
> > third
> > isn't fully clear to me.
> > 
> > > >  dm-linear: [fsdax pmem][fsdax raw] => [fsdax pmem][fsdax pmem]
> > > > DM_TYPE_BIO_BASEDDM_TYPE_DAX_BIO_BASED
> > > > 
> > > > This seems arbitrary, as really the choice on whether to use DAX 
> > > > happens at
> > > > filesystem mount time.  There's no guarantee that the in the first case
> > > > (double fsdax pmem) we were using the dax mount option with our file
> > > > system.
> > > > 
> > > > Instead, get rid of DM_TYPE_DAX_BIO_BASED and all the special casing 
> > > > around
> > > > it, and instead make the request queue's QUEUE_FLAG_DAX be our one 
> > > > source
> > > > of truth.  If this is set, we can use DAX, and if not, not.  We keep 
> > > > this
> > > > up to date in table_load() as the table changes.  As with regular block
> > > > devices the filesystem will then know at mount time whether DAX is a
> > > > supported mount option or not.
> > > 
> > > If you don't think you need this specialization that is fine.. but DM
> > > devices supporting suspending (as part of table reloads) so is there any
> > > risk that there will be inflight IO (say if someone did 'dmsetup suspend
> > > --noflush').. and then upon reload the device type changed out from
> > > under us.. anyway, I don't have all the PMEM DAX stuff paged back into
> > > my head yet.
> > > 
> > > But this just seems like we really shouldn't be allowing the
> > > transition from what was DM_TYPE_DAX_BIO_BASED back to DM_TYPE_BIO_BASED
> > 
> > I admit I don't fully understand all the ways that DM supports suspending 
> > and
> > resuming devices.  Is there actually a case where we can change out the DM
> > devices while I/O is running, and somehow end up trying to issue a DAX I/O 
> > to
> > a device that doesn't support DAX?
> 
> Yes, provided root permissions, it's very easy to dmsetup suspend/load/resume
> to replace any portion of the DM device's logical address space to map to an
> entirely different DM target (with a different backing store).  It's
> pretty intrusive to do such things, but easily done and powerful.
> 
> Mike

Hmmm, I don't understand how you can do this if there is a filesystem built on
your DM device?  Say you have a DM device, either striped or linear, that is
made up of 2 devices, and then you use dmsetup to replace one of the DM member
devices with something else.  You've just swapped out half

Re: [qemu PATCH v4 3/4] nvdimm, acpi: support NFIT platform capabilities

2018-06-06 Thread Ross Zwisler
On Wed, Jun 06, 2018 at 11:00:33AM -0600, Ross Zwisler wrote:
> On Wed, Jun 06, 2018 at 10:50:51AM -0600, Ross Zwisler wrote:
> > On Tue, Jun 05, 2018 at 03:21:30PM -0700, Dan Williams wrote:
> > > On Tue, Jun 5, 2018 at 3:07 PM, Ross Zwisler
> > >  wrote:
> > > > On Tue, Jun 05, 2018 at 09:37:25PM +0300, Michael S. Tsirkin wrote:
> > > >> On Tue, Jun 05, 2018 at 11:15:00AM -0700, Dan Williams wrote:
> > > >> > On Tue, Jun 5, 2018 at 9:42 AM, Ross Zwisler
> > > >> >  wrote:
> > > >> > > On Tue, Jun 05, 2018 at 06:25:27PM +0300, Michael S. Tsirkin wrote:
> > > >> > >> On Mon, May 21, 2018 at 10:32:02AM -0600, Ross Zwisler wrote:
> > > >> > >> > Add a machine command line option to allow the user to control 
> > > >> > >> > the Platform
> > > >> > >> > Capabilities Structure in the virtualized NFIT.  This Platform 
> > > >> > >> > Capabilities
> > > >> > >> > Structure was added in ACPI 6.2 Errata A.
> > > >> > >> >
> > > >> > >> > Signed-off-by: Ross Zwisler 
> > > >> > >>
> > > >> > >> I tried playing with it and encoding the capabilities is
> > > >> > >> quite awkward.
> > > >> > >>
> > > >> > >> Can we add bits for specific capabilities instead of nvdimm-cap?
> > > >> > >>
> > > >> > >> How about:
> > > >> > >>
> > > >> > >> "cpu-flush-on-power-loss-cap"
> > > >> > >> "memory-flush-on-power-loss-cap"
> > > >> > >> "byte-addressable-mirroring-cap"
> > > >> > >
> > > >> > > Hmmm...I don't like that as much because:
> > > >> > >
> > > >> > > a) It's very verbose.  Looking at my current qemu command line few 
> > > >> > > other
> > > >> > >options require that many characters, and you'd commonly be 
> > > >> > > defining more
> > > >> > >than one of these for a given VM.
> > > >> > >
> > > >> > > b) It means that the QEMU will need to be updated if/when new 
> > > >> > > flags are added,
> > > >> > >because we'll have to have new options for each flag.  The 
> > > >> > > current
> > > >> > >implementation is more future-proof because you can specify any 
> > > >> > > flags
> > > >> > >value you want.
> > > >> > >
> > > >> > > However, if you feel strongly about this, I'll make the change.
> > > >> >
> > > >> > Straw-man: Could we do something similar with what we are doing in 
> > > >> > ndctl?
> > > >> >
> > > >> > enum ndctl_persistence_domain {
> > > >> > PERSISTENCE_NONE = 0,
> > > >> > PERSISTENCE_MEM_CTRL = 10,
> > > >> > PERSISTENCE_CPU_CACHE = 20,
> > > >> > PERSISTENCE_UNKNOWN = INT_MAX,
> > > >> > };
> > > >> >
> > > >> > ...and have the command line take a number where "10" and "20" are
> > > >> > supported today, but allows us to adapt to new persistence domains in
> > > >> > the future.
> > > >>
> > > >> I'm fine with that except can we have symbolic names instead of numbers
> > > >> on command line?
> > > >>
> > > >> --
> > > >> MST
> > > >
> > > > Okay, we can move to the symbolic names.  Do you want them to be that 
> > > > long, or
> > > > would:
> > > >
> > > > nvdimm-cap-cpu
> > > > nvdimm-cap-mem-ctrl
> > > > nvdimm-cap-mirroring
> > > 
> > > Wait, why is mirroring part of this?
> > > 
> > > I was thinking this option would be:
> > > 
> > > --persistence-domain={cpu,mem-ctrl}
> > > 
> > > ...and try not to let ACPI specifics leak into the qemu command line
> > > interface. For example PowerPC qemu could have a persistence domain
> > > communicated via Open Firmware or some other mechanism.
> > 
> > Sure, this seems fine, though we may want to throw an "nvdimm" in the name
> > somewhere so it's clear what the option affects.
> > 
> > nvdimm-persistence={cpu,mem-ctrl} maybe?
> > 
> > Michael, does this work for you?
> 
> Hmm...also, the version of these patches that used numeric values did go
> upstream in QEMU.  So, do we need to leave that interface intact, and just add
> a new one that uses symbolic names?  Or did you still just want to replace the
> numeric one since it hasn't appeared in a QEMU release yet?

I guess another alternative would be to just add symbolic name options to the
existing command line option, i.e. allow all of these:

nvdimm-cap=3# CPU cache
nvdimm-cap=cpu  # CPU cache
nvdimm-cap=2# memory controller
nvdimm-cap=mem-ctrl # memory controller

And just have them as aliases.  This retains backwards compatibility with
what is already there, allows for other numeric values without QEMU updates if
other bits are defined (though we are still tightly tied to ACPI in this
case), and provides a symbolic name which is easier to use.

Thoughts?
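For what it's worth, the alias scheme is easy to express. Below is a minimal user-space sketch (a hypothetical helper, not actual QEMU code) that accepts either a symbolic name or a raw numeric capabilities value, using the cpu=3 / mem-ctrl=2 mapping from the examples above:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch of the alias idea: accept either a raw numeric
 * NFIT capabilities value or a symbolic name.  The mapping
 * (cpu -> 3, mem-ctrl -> 2) follows the examples in this mail.
 */
static int parse_nvdimm_cap(const char *arg, uint64_t *cap)
{
    char *end;
    uint64_t val;

    if (strcmp(arg, "cpu") == 0) {      /* CPU cache persistence */
        *cap = 3;
        return 0;
    }
    if (strcmp(arg, "mem-ctrl") == 0) { /* memory controller persistence */
        *cap = 2;
        return 0;
    }
    val = strtoull(arg, &end, 0);       /* fall back to a raw number */
    if (*arg == '\0' || *end != '\0')
        return -1;                      /* neither a name nor a number */
    *cap = val;
    return 0;
}
```

This keeps backwards compatibility with the numeric interface while letting new numeric bits through without a QEMU update.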


Re: [qemu PATCH v4 3/4] nvdimm, acpi: support NFIT platform capabilities

2018-06-06 Thread Ross Zwisler
On Wed, Jun 06, 2018 at 10:50:51AM -0600, Ross Zwisler wrote:
> On Tue, Jun 05, 2018 at 03:21:30PM -0700, Dan Williams wrote:
> > On Tue, Jun 5, 2018 at 3:07 PM, Ross Zwisler
> >  wrote:
> > > On Tue, Jun 05, 2018 at 09:37:25PM +0300, Michael S. Tsirkin wrote:
> > >> On Tue, Jun 05, 2018 at 11:15:00AM -0700, Dan Williams wrote:
> > >> > On Tue, Jun 5, 2018 at 9:42 AM, Ross Zwisler
> > >> >  wrote:
> > >> > > On Tue, Jun 05, 2018 at 06:25:27PM +0300, Michael S. Tsirkin wrote:
> > >> > >> On Mon, May 21, 2018 at 10:32:02AM -0600, Ross Zwisler wrote:
> > >> > >> > Add a machine command line option to allow the user to control 
> > >> > >> > the Platform
> > >> > >> > Capabilities Structure in the virtualized NFIT.  This Platform 
> > >> > >> > Capabilities
> > >> > >> > Structure was added in ACPI 6.2 Errata A.
> > >> > >> >
> > >> > >> > Signed-off-by: Ross Zwisler 
> > >> > >>
> > >> > >> I tried playing with it and encoding the capabilities is
> > >> > >> quite awkward.
> > >> > >>
> > >> > >> Can we add bits for specific capabilities instead of nvdimm-cap?
> > >> > >>
> > >> > >> How about:
> > >> > >>
> > >> > >> "cpu-flush-on-power-loss-cap"
> > >> > >> "memory-flush-on-power-loss-cap"
> > >> > >> "byte-addressable-mirroring-cap"
> > >> > >
> > >> > > Hmmm...I don't like that as much because:
> > >> > >
> > >> > > a) It's very verbose.  Looking at my current qemu command line few 
> > >> > > other
> > >> > >options require that many characters, and you'd commonly be 
> > >> > > defining more
> > >> > >than one of these for a given VM.
> > >> > >
> > >> > > b) It means that the QEMU will need to be updated if/when new flags 
> > >> > > are added,
> > >> > >because we'll have to have new options for each flag.  The current
> > >> > >implementation is more future-proof because you can specify any 
> > >> > > flags
> > >> > >value you want.
> > >> > >
> > >> > > However, if you feel strongly about this, I'll make the change.
> > >> >
> > >> > Straw-man: Could we do something similar with what we are doing in 
> > >> > ndctl?
> > >> >
> > >> > enum ndctl_persistence_domain {
> > >> > PERSISTENCE_NONE = 0,
> > >> > PERSISTENCE_MEM_CTRL = 10,
> > >> > PERSISTENCE_CPU_CACHE = 20,
> > >> > PERSISTENCE_UNKNOWN = INT_MAX,
> > >> > };
> > >> >
> > >> > ...and have the command line take a number where "10" and "20" are
> > >> > supported today, but allows us to adapt to new persistence domains in
> > >> > the future.
> > >>
> > >> I'm fine with that except can we have symbolic names instead of numbers
> > >> on command line?
> > >>
> > >> --
> > >> MST
> > >
> > > Okay, we can move to the symbolic names.  Do you want them to be that 
> > > long, or
> > > would:
> > >
> > > nvdimm-cap-cpu
> > > nvdimm-cap-mem-ctrl
> > > nvdimm-cap-mirroring
> > 
> > Wait, why is mirroring part of this?
> > 
> > I was thinking this option would be:
> > 
> > --persistence-domain={cpu,mem-ctrl}
> > 
> > ...and try not to let ACPI specifics leak into the qemu command line
> > interface. For example PowerPC qemu could have a persistence domain
> > communicated via Open Firmware or some other mechanism.
> 
> Sure, this seems fine, though we may want to throw an "nvdimm" in the name
> somewhere so it's clear what the option affects.
> 
> nvdimm-persistence={cpu,mem-ctrl} maybe?
> 
> Michael, does this work for you?

Hmm...also, the version of these patches that used numeric values did go
upstream in QEMU.  So, do we need to leave that interface intact, and just add
a new one that uses symbolic names?  Or did you still just want to replace the
numeric one since it hasn't appeared in a QEMU release yet?


Re: [qemu PATCH v4 3/4] nvdimm, acpi: support NFIT platform capabilities

2018-06-06 Thread Ross Zwisler
On Tue, Jun 05, 2018 at 03:21:30PM -0700, Dan Williams wrote:
> On Tue, Jun 5, 2018 at 3:07 PM, Ross Zwisler
>  wrote:
> > On Tue, Jun 05, 2018 at 09:37:25PM +0300, Michael S. Tsirkin wrote:
> >> On Tue, Jun 05, 2018 at 11:15:00AM -0700, Dan Williams wrote:
> >> > On Tue, Jun 5, 2018 at 9:42 AM, Ross Zwisler
> >> >  wrote:
> >> > > On Tue, Jun 05, 2018 at 06:25:27PM +0300, Michael S. Tsirkin wrote:
> >> > >> On Mon, May 21, 2018 at 10:32:02AM -0600, Ross Zwisler wrote:
> >> > >> > Add a machine command line option to allow the user to control the 
> >> > >> > Platform
> >> > >> > Capabilities Structure in the virtualized NFIT.  This Platform 
> >> > >> > Capabilities
> >> > >> > Structure was added in ACPI 6.2 Errata A.
> >> > >> >
> >> > >> > Signed-off-by: Ross Zwisler 
> >> > >>
> >> > >> I tried playing with it and encoding the capabilities is
> >> > >> quite awkward.
> >> > >>
> >> > >> Can we add bits for specific capabilities instead of nvdimm-cap?
> >> > >>
> >> > >> How about:
> >> > >>
> >> > >> "cpu-flush-on-power-loss-cap"
> >> > >> "memory-flush-on-power-loss-cap"
> >> > >> "byte-addressable-mirroring-cap"
> >> > >
> >> > > Hmmm...I don't like that as much because:
> >> > >
> >> > > a) It's very verbose.  Looking at my current qemu command line few 
> >> > > other
> >> > >options require that many characters, and you'd commonly be 
> >> > > defining more
> >> > >than one of these for a given VM.
> >> > >
> >> > > b) It means that the QEMU will need to be updated if/when new flags 
> >> > > are added,
> >> > >because we'll have to have new options for each flag.  The current
> >> > >implementation is more future-proof because you can specify any 
> >> > > flags
> >> > >value you want.
> >> > >
> >> > > However, if you feel strongly about this, I'll make the change.
> >> >
> >> > Straw-man: Could we do something similar with what we are doing in ndctl?
> >> >
> >> > enum ndctl_persistence_domain {
> >> > PERSISTENCE_NONE = 0,
> >> > PERSISTENCE_MEM_CTRL = 10,
> >> > PERSISTENCE_CPU_CACHE = 20,
> >> > PERSISTENCE_UNKNOWN = INT_MAX,
> >> > };
> >> >
> >> > ...and have the command line take a number where "10" and "20" are
> >> > supported today, but allows us to adapt to new persistence domains in
> >> > the future.
> >>
> >> I'm fine with that except can we have symbolic names instead of numbers
> >> on command line?
> >>
> >> --
> >> MST
> >
> > Okay, we can move to the symbolic names.  Do you want them to be that long, 
> > or
> > would:
> >
> > nvdimm-cap-cpu
> > nvdimm-cap-mem-ctrl
> > nvdimm-cap-mirroring
> 
> Wait, why is mirroring part of this?
> 
> I was thinking this option would be:
> 
> --persistence-domain={cpu,mem-ctrl}
> 
> ...and try not to let ACPI specifics leak into the qemu command line
> interface. For example PowerPC qemu could have a persistence domain
> communicated via Open Firmware or some other mechanism.

Sure, this seems fine, though we may want to throw an "nvdimm" in the name
somewhere so it's clear what the option affects.

nvdimm-persistence={cpu,mem-ctrl} maybe?

Michael, does this work for you?


[PATCH v3 1/4] pmem, libnvdimm: complete REQ_FLUSH => REQ_PREFLUSH

2018-06-06 Thread Ross Zwisler
Complete the move from REQ_FLUSH to REQ_PREFLUSH that apparently started
way back in v4.8.

Signed-off-by: Ross Zwisler 
---
 drivers/nvdimm/pmem.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9d714926ecf5..252adfab1e47 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -164,11 +164,6 @@ static blk_status_t pmem_do_bvec(struct pmem_device *pmem, 
struct page *page,
return rc;
 }
 
-/* account for REQ_FLUSH rename, replace with REQ_PREFLUSH after v4.8-rc1 */
-#ifndef REQ_FLUSH
-#define REQ_FLUSH REQ_PREFLUSH
-#endif
-
 static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 {
blk_status_t rc = 0;
@@ -179,7 +174,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, 
struct bio *bio)
struct pmem_device *pmem = q->queuedata;
struct nd_region *nd_region = to_region(pmem);
 
-   if (bio->bi_opf & REQ_FLUSH)
+   if (bio->bi_opf & REQ_PREFLUSH)
nvdimm_flush(nd_region);
 
	do_acct = nd_iostat_start(bio, &start);
-- 
2.14.4



[PATCH v3 4/4] libnvdimm: don't flush power-fail protected CPU caches

2018-06-06 Thread Ross Zwisler
This commit:

5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via fsync()")

intended to make sure that deep flush was always available even on
platforms which support a power-fail protected CPU cache.  An unintended
side effect of this change was that we also lost the ability to skip
flushing CPU caches on platforms with a power-fail protected CPU cache.

Fix this by skipping the low level cache flushing in dax_flush() if we have
CPU caches which are power-fail protected.  The user can still override this
behavior by manually setting the write_cache state of a namespace.  See
libndctl's ndctl_namespace_write_cache_is_enabled(),
ndctl_namespace_enable_write_cache() and
ndctl_namespace_disable_write_cache() functions.

Signed-off-by: Ross Zwisler 
Fixes: 5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via 
fsync()")
---
 drivers/nvdimm/region_devs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index a612be6f019d..ec3543b83330 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1132,7 +1132,8 @@ EXPORT_SYMBOL_GPL(nvdimm_has_flush);
 
 int nvdimm_has_cache(struct nd_region *nd_region)
 {
-   return is_nd_pmem(&nd_region->dev);
+   return is_nd_pmem(&nd_region->dev) &&
+   !test_bit(ND_REGION_PERSIST_CACHE, &nd_region->flags);
 }
 EXPORT_SYMBOL_GPL(nvdimm_has_cache);
 
-- 
2.14.4
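To make the resulting behavior concrete, here is a user-space model of the nvdimm_has_cache() logic after this patch (assumed names, not kernel code): a pmem region reports a cache to flush only when the platform has not advertised a power-fail protected CPU cache.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model of the nvdimm_has_cache() change, not kernel code.
 * ND_REGION_PERSIST_CACHE is treated here as a plain flag bit. */
#define MODEL_ND_REGION_PERSIST_CACHE (1u << 0)

struct model_region {
    bool is_pmem;
    unsigned int flags;
};

/* After the patch: a region "has a cache" (and so needs CPU cache
 * flushes on *sync) only if it is pmem AND the platform did not
 * advertise a power-fail protected CPU cache. */
static bool model_nvdimm_has_cache(const struct model_region *r)
{
    return r->is_pmem && !(r->flags & MODEL_ND_REGION_PERSIST_CACHE);
}
```

The write_cache sysfs attribute can still override the resulting default, as the commit message notes.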



[PATCH v3 2/4] libnvdimm: unconditionally deep flush on *sync

2018-06-06 Thread Ross Zwisler
Prior to this commit we would only do a "deep flush" (have nvdimm_flush()
write to each of the flush hints for a region) in response to an
msync/fsync/sync call if the nvdimm_has_cache() returned true at the time
we were setting up the request queue.  This happens due to the write cache
value passed in to blk_queue_write_cache(), which then causes the block
layer to send down BIOs with REQ_FUA and REQ_PREFLUSH set.  We do have a
"write_cache" sysfs entry for namespaces, i.e.:

  /sys/bus/nd/devices/pfn0.1/block/pmem0/dax/write_cache

which can be used to control whether or not the kernel thinks a given
namespace has a write cache, but this didn't modify the deep flush behavior
that we set up when the driver was initialized.  Instead, it only modified
whether or not DAX would flush CPU caches via dax_flush() in response to
*sync calls.

Simplify this by making the *sync deep flush always happen, regardless of
the write cache setting of a namespace.  The DAX CPU cache flushing will
still be controlled by the write_cache setting of the namespace.

Signed-off-by: Ross Zwisler 
Suggested-by: Dan Williams 
---
 drivers/nvdimm/pmem.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 252adfab1e47..97b4c39a9267 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -294,7 +294,7 @@ static int pmem_attach_disk(struct device *dev,
 {
	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
struct nd_region *nd_region = to_nd_region(dev->parent);
-   int nid = dev_to_node(dev), fua, wbc;
+   int nid = dev_to_node(dev), fua;
	struct resource *res = &nsio->res;
struct resource bb_res;
struct nd_pfn *nd_pfn = NULL;
@@ -330,7 +330,6 @@ static int pmem_attach_disk(struct device *dev,
dev_warn(dev, "unable to guarantee persistence of writes\n");
fua = 0;
}
-   wbc = nvdimm_has_cache(nd_region);
 
if (!devm_request_mem_region(dev, res->start, resource_size(res),
	dev_name(&ndns->dev))) {
@@ -377,7 +376,7 @@ static int pmem_attach_disk(struct device *dev,
return PTR_ERR(addr);
pmem->virt_addr = addr;
 
-   blk_queue_write_cache(q, wbc, fua);
+   blk_queue_write_cache(q, true, fua);
blk_queue_make_request(q, pmem_make_request);
blk_queue_physical_block_size(q, PAGE_SIZE);
blk_queue_logical_block_size(q, pmem_sector_size(ndns));
@@ -408,7 +407,7 @@ static int pmem_attach_disk(struct device *dev,
put_disk(disk);
return -ENOMEM;
}
-   dax_write_cache(dax_dev, wbc);
+   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
pmem->dax_dev = dax_dev;
 
gendev = disk_to_dev(disk);
-- 
2.14.4



[PATCH v3 3/4] libnvdimm: use dax_write_cache* helpers

2018-06-06 Thread Ross Zwisler
Use dax_write_cache() and dax_write_cache_enabled() instead of open coding
the bit operations.

Signed-off-by: Ross Zwisler 
---
 drivers/dax/super.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..c2c46f96b18c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -182,8 +182,7 @@ static ssize_t write_cache_show(struct device *dev,
if (!dax_dev)
return -ENXIO;
 
-   rc = sprintf(buf, "%d\n", !!test_bit(DAXDEV_WRITE_CACHE,
-   &dax_dev->flags));
+   rc = sprintf(buf, "%d\n", !!dax_write_cache_enabled(dax_dev));
put_dax(dax_dev);
return rc;
 }
@@ -201,10 +200,8 @@ static ssize_t write_cache_store(struct device *dev,
 
if (rc)
len = rc;
-   else if (write_cache)
-   set_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags);
else
-   clear_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags);
+   dax_write_cache(dax_dev, write_cache);
 
put_dax(dax_dev);
return len;
@@ -286,7 +283,7 @@ EXPORT_SYMBOL_GPL(dax_copy_from_iter);
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
 {
-   if (unlikely(!test_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags)))
+   if (unlikely(!dax_write_cache_enabled(dax_dev)))
return;
 
arch_wb_cache_pmem(addr, size);
-- 
2.14.4



Re: [PATCH v2 3/3] libnvdimm: don't flush power-fail protected CPU caches

2018-06-06 Thread Ross Zwisler
On Tue, Jun 05, 2018 at 07:00:14PM -0700, Dan Williams wrote:
> On Tue, Jun 5, 2018 at 4:58 PM, Ross Zwisler
>  wrote:
> > This commit:
> >
> > 5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via 
> > fsync()")
> >
> > intended to make sure that deep flush was always available even on
> > platforms which support a power-fail protected CPU cache.  An unintended
> > side effect of this change was that we also lost the ability to skip
> > flushing CPU caches on platforms with a power-fail protected CPU cache.
> >
> > Signed-off-by: Ross Zwisler 
> > Fixes: 5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via 
> > fsync()")
> > ---
> >  drivers/dax/super.c   | 14 +-
> >  drivers/nvdimm/pmem.c |  2 ++
> >  include/linux/dax.h   |  4 
> >  3 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index c2c46f96b18c..80253c531a9b 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -152,6 +152,8 @@ enum dax_device_flags {
> > DAXDEV_ALIVE,
> > /* gate whether dax_flush() calls the low level flush routine */
> > DAXDEV_WRITE_CACHE,
> > +   /* only flush the CPU caches if they are not power fail protected */
> > +   DAXDEV_FLUSH_ON_SYNC,
> 
> I'm not grokking why we need DAXDEV_FLUSH_ON_SYNC. The power fail
> protected status of the cache only determines the default for
> DAXDEV_WRITE_CACHE.

Ah, yea, that's much cleaner.  I'll send out a v3.
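A user-space model of that simplification (hypothetical, with assumed names): the power-fail protection of the cache only picks the default of the write-cache flag, and dax_flush() keeps gating on that single flag, which the user can still override.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the suggested design, not the kernel implementation:
 * no second DAXDEV_FLUSH_ON_SYNC flag; persistence only sets the
 * default of the write-cache flag. */
struct model_dax_dev {
    bool write_cache;   /* models DAXDEV_WRITE_CACHE */
    bool flushed;       /* records whether a low-level flush ran */
};

static void model_dax_dev_init(struct model_dax_dev *d,
                               bool cache_power_fail_protected)
{
    /* Protected caches default to "no write cache to flush". */
    d->write_cache = !cache_power_fail_protected;
    d->flushed = false;
}

static void model_dax_flush(struct model_dax_dev *d)
{
    if (!d->write_cache)
        return;         /* skip: caches are treated as persistent */
    d->flushed = true;  /* stands in for arch_wb_cache_pmem() */
}
```

The sysfs write_cache override then amounts to flipping `write_cache` after init.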


[PATCH v2 1/3] libnvdimm: unconditionally deep flush on *sync

2018-06-05 Thread Ross Zwisler
Prior to this commit we would only do a "deep flush" in response to an
msync/fsync/sync call if the nvdimm_has_cache() returned true at the time
we were setting up the request queue.  This happens due to the write cache
value passed in to blk_queue_write_cache().  We do have a "write_cache"
sysfs entry for namespaces, i.e.:

  /sys/bus/nd/devices/pfn0.1/block/pmem0/dax/write_cache

which can be used to control whether or not the kernel thinks a given
namespace has a write cache, but this didn't modify the deep flush behavior
that we set up when the driver was initialized.  Instead, it only modified
whether or not DAX would flush CPU caches in response to *sync calls.

Simplify this by making the *sync "deep flush" always happen, regardless of
the write cache setting of a namespace.  The DAX CPU cache flushing will be
controlled by a combination of the write_cache setting as well as whether
the platform supports flush-on-fail CPU caches.

Signed-off-by: Ross Zwisler 
Suggested-by: Dan Williams 
---
 drivers/nvdimm/pmem.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9d714926ecf5..a152dd9e4134 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -299,7 +299,7 @@ static int pmem_attach_disk(struct device *dev,
 {
	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
struct nd_region *nd_region = to_nd_region(dev->parent);
-   int nid = dev_to_node(dev), fua, wbc;
+   int nid = dev_to_node(dev), fua;
	struct resource *res = &nsio->res;
struct resource bb_res;
struct nd_pfn *nd_pfn = NULL;
@@ -335,7 +335,6 @@ static int pmem_attach_disk(struct device *dev,
dev_warn(dev, "unable to guarantee persistence of writes\n");
fua = 0;
}
-   wbc = nvdimm_has_cache(nd_region);
 
if (!devm_request_mem_region(dev, res->start, resource_size(res),
	dev_name(&ndns->dev))) {
@@ -382,7 +381,7 @@ static int pmem_attach_disk(struct device *dev,
return PTR_ERR(addr);
pmem->virt_addr = addr;
 
-   blk_queue_write_cache(q, wbc, fua);
+   blk_queue_write_cache(q, true, fua);
blk_queue_make_request(q, pmem_make_request);
blk_queue_physical_block_size(q, PAGE_SIZE);
blk_queue_logical_block_size(q, pmem_sector_size(ndns));
@@ -413,7 +412,7 @@ static int pmem_attach_disk(struct device *dev,
put_disk(disk);
return -ENOMEM;
}
-   dax_write_cache(dax_dev, wbc);
+   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
pmem->dax_dev = dax_dev;
 
gendev = disk_to_dev(disk);
-- 
2.14.4



[PATCH v2 3/3] libnvdimm: don't flush power-fail protected CPU caches

2018-06-05 Thread Ross Zwisler
This commit:

5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via fsync()")

intended to make sure that deep flush was always available even on
platforms which support a power-fail protected CPU cache.  An unintended
side effect of this change was that we also lost the ability to skip
flushing CPU caches on platforms with a power-fail protected CPU cache.

Signed-off-by: Ross Zwisler 
Fixes: 5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via 
fsync()")
---
 drivers/dax/super.c   | 14 +-
 drivers/nvdimm/pmem.c |  2 ++
 include/linux/dax.h   |  4 
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c2c46f96b18c..80253c531a9b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -152,6 +152,8 @@ enum dax_device_flags {
DAXDEV_ALIVE,
/* gate whether dax_flush() calls the low level flush routine */
DAXDEV_WRITE_CACHE,
+   /* only flush the CPU caches if they are not power fail protected */
+   DAXDEV_FLUSH_ON_SYNC,
 };
 
 /**
@@ -283,7 +285,8 @@ EXPORT_SYMBOL_GPL(dax_copy_from_iter);
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
 {
-   if (unlikely(!dax_write_cache_enabled(dax_dev)))
+   if (unlikely(!dax_write_cache_enabled(dax_dev)) ||
+   !test_bit(DAXDEV_FLUSH_ON_SYNC, &dax_dev->flags))
return;
 
arch_wb_cache_pmem(addr, size);
@@ -310,6 +313,15 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(dax_write_cache_enabled);
 
+void dax_flush_on_sync(struct dax_device *dax_dev, bool flush)
+{
+   if (flush)
+   set_bit(DAXDEV_FLUSH_ON_SYNC, &dax_dev->flags);
+   else
+   clear_bit(DAXDEV_FLUSH_ON_SYNC, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(dax_flush_on_sync);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
	lockdep_assert_held(&dax_srcu);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index a152dd9e4134..e8c2795bf766 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -413,6 +413,8 @@ static int pmem_attach_disk(struct device *dev,
return -ENOMEM;
}
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
+   dax_flush_on_sync(dax_dev,
	!test_bit(ND_REGION_PERSIST_CACHE, &nd_region->flags));
pmem->dax_dev = dax_dev;
 
gendev = disk_to_dev(disk);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f9eb22ad341e..4575742508b0 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -32,6 +32,7 @@ void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
 bool dax_write_cache_enabled(struct dax_device *dax_dev);
+void dax_flush_on_sync(struct dax_device *dax_dev, bool flush);
 #else
 static inline struct dax_device *dax_get_by_host(const char *host)
 {
@@ -59,6 +60,9 @@ static inline bool dax_write_cache_enabled(struct dax_device *dax_dev)
 {
return false;
 }
+static inline void dax_flush_on_sync(struct dax_device *dax_dev, bool flush)
+{
+}
 #endif
 
 struct writeback_control;
-- 
2.14.4



[PATCH v2 2/3] libnvdimm: use dax_write_cache* helpers

2018-06-05 Thread Ross Zwisler
Use dax_write_cache() and dax_write_cache_enabled() instead of open coding
the bit operations.

Signed-off-by: Ross Zwisler 
---
 drivers/dax/super.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..c2c46f96b18c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -182,8 +182,7 @@ static ssize_t write_cache_show(struct device *dev,
if (!dax_dev)
return -ENXIO;
 
-   rc = sprintf(buf, "%d\n", !!test_bit(DAXDEV_WRITE_CACHE,
-   &dax_dev->flags));
+   rc = sprintf(buf, "%d\n", !!dax_write_cache_enabled(dax_dev));
put_dax(dax_dev);
return rc;
 }
@@ -201,10 +200,8 @@ static ssize_t write_cache_store(struct device *dev,
 
if (rc)
len = rc;
-   else if (write_cache)
-   set_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags);
else
-   clear_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags);
+   dax_write_cache(dax_dev, write_cache);
 
put_dax(dax_dev);
return len;
@@ -286,7 +283,7 @@ EXPORT_SYMBOL_GPL(dax_copy_from_iter);
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
 {
-   if (unlikely(!test_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags)))
+   if (unlikely(!dax_write_cache_enabled(dax_dev)))
return;
 
arch_wb_cache_pmem(addr, size);
-- 
2.14.4



Re: [qemu PATCH v4 3/4] nvdimm, acpi: support NFIT platform capabilities

2018-06-05 Thread Ross Zwisler
On Tue, Jun 05, 2018 at 09:37:25PM +0300, Michael S. Tsirkin wrote:
> On Tue, Jun 05, 2018 at 11:15:00AM -0700, Dan Williams wrote:
> > On Tue, Jun 5, 2018 at 9:42 AM, Ross Zwisler
> >  wrote:
> > > On Tue, Jun 05, 2018 at 06:25:27PM +0300, Michael S. Tsirkin wrote:
> > >> On Mon, May 21, 2018 at 10:32:02AM -0600, Ross Zwisler wrote:
> > >> > Add a machine command line option to allow the user to control the 
> > >> > Platform
> > >> > Capabilities Structure in the virtualized NFIT.  This Platform 
> > >> > Capabilities
> > >> > Structure was added in ACPI 6.2 Errata A.
> > >> >
> > >> > Signed-off-by: Ross Zwisler 
> > >>
> > >> I tried playing with it and encoding the capabilities is
> > >> quite awkward.
> > >>
> > >> Can we add bits for specific capabilities instead of nvdimm-cap?
> > >>
> > >> How about:
> > >>
> > >> "cpu-flush-on-power-loss-cap"
> > >> "memory-flush-on-power-loss-cap"
> > >> "byte-addressable-mirroring-cap"
> > >
> > > Hmmm...I don't like that as much because:
> > >
> > > a) It's very verbose.  Looking at my current qemu command line few other
> > >options require that many characters, and you'd commonly be defining 
> > > more
> > >than one of these for a given VM.
> > >
> > > b) It means that the QEMU will need to be updated if/when new flags are 
> > > added,
> > >because we'll have to have new options for each flag.  The current
> > >implementation is more future-proof because you can specify any flags
> > >value you want.
> > >
> > > However, if you feel strongly about this, I'll make the change.
> > 
> > Straw-man: Could we do something similar with what we are doing in ndctl?
> > 
> > enum ndctl_persistence_domain {
> > PERSISTENCE_NONE = 0,
> > PERSISTENCE_MEM_CTRL = 10,
> > PERSISTENCE_CPU_CACHE = 20,
> > PERSISTENCE_UNKNOWN = INT_MAX,
> > };
> > 
> > ...and have the command line take a number where "10" and "20" are
> > supported today, but allows us to adapt to new persistence domains in
> > the future.
> 
> I'm fine with that except can we have symbolic names instead of numbers
> on command line?
> 
> -- 
> MST

Okay, we can move to the symbolic names.  Do you want them to be that long, or
would:

nvdimm-cap-cpu
nvdimm-cap-mem-ctrl
nvdimm-cap-mirroring

or something be better?


Re: [PATCH 2/2] libnvdimm: don't flush power-fail protected CPU caches

2018-06-05 Thread Ross Zwisler
On Tue, Jun 05, 2018 at 02:20:38PM -0700, Dan Williams wrote:
> On Tue, Jun 5, 2018 at 1:58 PM, Ross Zwisler
>  wrote:
> > This commit:
> >
> > 5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via 
> > fsync()")
> >
> > intended to make sure that deep flush was always available even on
> > platforms which support a power-fail protected CPU cache.  An unintended
> > side effect of this change was that we also lost the ability to skip
> > flushing CPU caches on those power-fail protected platforms.
> >
> > Signed-off-by: Ross Zwisler 
> > Fixes: 5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via 
> > fsync()")
> > ---
> >  drivers/dax/super.c   | 20 +++-
> >  drivers/nvdimm/pmem.c |  2 ++
> >  include/linux/dax.h   |  9 +
> >  3 files changed, 30 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index c2c46f96b18c..457e0bb6c936 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -152,6 +152,8 @@ enum dax_device_flags {
> > DAXDEV_ALIVE,
> > /* gate whether dax_flush() calls the low level flush routine */
> > DAXDEV_WRITE_CACHE,
> > +   /* only flush the CPU caches if they are not power fail protected */
> > +   DAXDEV_FLUSH_ON_SYNC,
> >  };
> >
> >  /**
> > @@ -283,7 +285,8 @@ EXPORT_SYMBOL_GPL(dax_copy_from_iter);
> >  void arch_wb_cache_pmem(void *addr, size_t size);
> >  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
> >  {
> > -   if (unlikely(!dax_write_cache_enabled(dax_dev)))
> > +   if (unlikely(!dax_write_cache_enabled(dax_dev)) ||
> > +   !dax_flush_on_sync_enabled(dax_dev))
> 
> This seems backwards. I think we should teach the pmem driver to still
> issue deep flush even when dax_write_cache_enabled() is false.

That does still happen.  Deep flush is essentially controlled by the 'wbc'
variable in pmem_attach_disk(), which we use to set blk_queue_write_cache().
My understanding is that this causes the block layer to send down
REQ_FUA/REQ_PREFLUSH BIOs, and it's in response to these that we do a deep
flush via nvdimm_flush().  Whether this happens is totally up to the device's
write cache setting, and doesn't look at whether the platform has
flush-on-fail CPU caches.

This does bring up another wrinkle, though: we export a write_cache sysfs
entry that you can use to change the write cache setting of a namespace:

i.e.:
/sys/bus/nd/devices/pfn0.1/block/pmem0/dax/write_cache

This changes whether or not the DAXDEV_WRITE_CACHE flag is set, but does *not*
change whether the block queue says it supports a write cache
(blk_queue_write_cache()).  So, the sysfs entry ends up controlling whether or
not we do CPU cache flushing via DAX, but does not do anything with the deep
flush code.

I'm guessing this should be fixed?  I'll go take a look...


[PATCH 1/2] libnvdimm: use dax_write_cache* helpers

2018-06-05 Thread Ross Zwisler
Use dax_write_cache() and dax_write_cache_enabled() instead of open coding
the bit operations.

Signed-off-by: Ross Zwisler 
---
 drivers/dax/super.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..c2c46f96b18c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -182,8 +182,7 @@ static ssize_t write_cache_show(struct device *dev,
if (!dax_dev)
return -ENXIO;
 
-   rc = sprintf(buf, "%d\n", !!test_bit(DAXDEV_WRITE_CACHE,
-   &dax_dev->flags));
+   rc = sprintf(buf, "%d\n", !!dax_write_cache_enabled(dax_dev));
put_dax(dax_dev);
return rc;
 }
@@ -201,10 +200,8 @@ static ssize_t write_cache_store(struct device *dev,
 
if (rc)
len = rc;
-   else if (write_cache)
-   set_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags);
else
-   clear_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags);
+   dax_write_cache(dax_dev, write_cache);
 
put_dax(dax_dev);
return len;
@@ -286,7 +283,7 @@ EXPORT_SYMBOL_GPL(dax_copy_from_iter);
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
 {
-   if (unlikely(!test_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags)))
+   if (unlikely(!dax_write_cache_enabled(dax_dev)))
return;
 
arch_wb_cache_pmem(addr, size);
-- 
2.14.4



[PATCH 2/2] libnvdimm: don't flush power-fail protected CPU caches

2018-06-05 Thread Ross Zwisler
This commit:

5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via fsync()")

intended to make sure that deep flush was always available even on
platforms which support a power-fail protected CPU cache.  An unintended
side effect of this change was that we also lost the ability to skip
flushing CPU caches on those power-fail protected platforms.

Signed-off-by: Ross Zwisler 
Fixes: 5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
---
 drivers/dax/super.c   | 20 +++-
 drivers/nvdimm/pmem.c |  2 ++
 include/linux/dax.h   |  9 +
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c2c46f96b18c..457e0bb6c936 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -152,6 +152,8 @@ enum dax_device_flags {
DAXDEV_ALIVE,
/* gate whether dax_flush() calls the low level flush routine */
DAXDEV_WRITE_CACHE,
+   /* only flush the CPU caches if they are not power fail protected */
+   DAXDEV_FLUSH_ON_SYNC,
 };
 
 /**
@@ -283,7 +285,8 @@ EXPORT_SYMBOL_GPL(dax_copy_from_iter);
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
 {
-   if (unlikely(!dax_write_cache_enabled(dax_dev)))
+   if (unlikely(!dax_write_cache_enabled(dax_dev)) ||
+   !dax_flush_on_sync_enabled(dax_dev))
return;
 
arch_wb_cache_pmem(addr, size);
@@ -310,6 +313,21 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(dax_write_cache_enabled);
 
+void dax_flush_on_sync(struct dax_device *dax_dev, bool flush)
+{
+   if (flush)
+   set_bit(DAXDEV_FLUSH_ON_SYNC, &dax_dev->flags);
+   else
+   clear_bit(DAXDEV_FLUSH_ON_SYNC, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(dax_flush_on_sync);
+
+bool dax_flush_on_sync_enabled(struct dax_device *dax_dev)
+{
+   return test_bit(DAXDEV_FLUSH_ON_SYNC, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(dax_flush_on_sync_enabled);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
lockdep_assert_held(&dax_srcu);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9d714926ecf5..faeb2deae7f0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -414,6 +414,8 @@ static int pmem_attach_disk(struct device *dev,
return -ENOMEM;
}
dax_write_cache(dax_dev, wbc);
+   dax_flush_on_sync(dax_dev,
+   !test_bit(ND_REGION_PERSIST_CACHE, &nd_region->flags));
pmem->dax_dev = dax_dev;
 
gendev = disk_to_dev(disk);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f9eb22ad341e..1e6086405a1a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -32,6 +32,8 @@ void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
 bool dax_write_cache_enabled(struct dax_device *dax_dev);
+void dax_flush_on_sync(struct dax_device *dax_dev, bool flush);
+bool dax_flush_on_sync_enabled(struct dax_device *dax_dev);
 #else
 static inline struct dax_device *dax_get_by_host(const char *host)
 {
@@ -59,6 +61,13 @@ static inline bool dax_write_cache_enabled(struct dax_device *dax_dev)
 {
return false;
 }
+static inline void dax_flush_on_sync(struct dax_device *dax_dev, bool flush)
+{
+}
+static inline bool dax_flush_on_sync_enabled(struct dax_device *dax_dev)
+{
+   return false;
+}
 #endif
 
 struct writeback_control;
-- 
2.14.4



Re: [qemu PATCH v4 3/4] nvdimm, acpi: support NFIT platform capabilities

2018-06-05 Thread Ross Zwisler
On Tue, Jun 05, 2018 at 06:25:27PM +0300, Michael S. Tsirkin wrote:
> On Mon, May 21, 2018 at 10:32:02AM -0600, Ross Zwisler wrote:
> > Add a machine command line option to allow the user to control the Platform
> > Capabilities Structure in the virtualized NFIT.  This Platform Capabilities
> > Structure was added in ACPI 6.2 Errata A.
> > 
> > Signed-off-by: Ross Zwisler 
> 
> I tried playing with it and encoding the capabilities is
> quite awkward.
> 
> Can we add bits for specific capabilities instead of nvdimm-cap?
> 
> How about:
> 
> "cpu-flush-on-power-loss-cap"
> "memory-flush-on-power-loss-cap"
> "byte-addressable-mirroring-cap"

Hmmm...I don't like that as much because:

a) It's very verbose.  Looking at my current qemu command line, few other
   options require that many characters, and you'd commonly be defining more
   than one of these for a given VM.

b) It means that QEMU will need to be updated if/when new flags are added,
   because we'll have to have new options for each flag.  The current
   implementation is more future-proof because you can specify any flags
   value you want.

However, if you feel strongly about this, I'll make the change.


Re: [PATCH v2 5/7] dm: remove DM_TYPE_DAX_BIO_BASED dm_queue_mode

2018-06-04 Thread Ross Zwisler
On Fri, Jun 01, 2018 at 06:04:43PM -0400, Mike Snitzer wrote:
> On Tue, May 29 2018 at  3:51pm -0400,
> Ross Zwisler  wrote:
> 
> > The DM_TYPE_DAX_BIO_BASED dm_queue_mode was introduced to prevent DM
> > devices that could possibly support DAX from transitioning into DM devices
> > that cannot support DAX.
> > 
> > For example, the following transition will currently fail:
> > 
> >  dm-linear: [fsdax pmem][fsdax pmem] => [fsdax pmem][fsdax raw]
> >   DM_TYPE_DAX_BIO_BASED   DM_TYPE_BIO_BASED
> > 
> > but these will both succeed:
> > 
> >  dm-linear: [fsdax pmem][brd ramdisk] => [fsdax pmem][fsdax raw]
> >   DM_TYPE_DAX_BIO_BASEDDM_TYPE_BIO_BASED
> > 
> 
> I fail to see how this succeeds given
> drivers/md/dm-ioctl.c:is_valid_type() only allows transitions from:
> 
> DM_TYPE_BIO_BASED => DM_TYPE_DAX_BIO_BASED

Right, sorry, that was a typo.  What I meant was:

> For example, the following transition will currently fail:
> 
>  dm-linear: [fsdax pmem][fsdax pmem] => [fsdax pmem][fsdax raw]
>   DM_TYPE_DAX_BIO_BASED   DM_TYPE_BIO_BASED
> 
> but these will both succeed:
> 
>  dm-linear: [fsdax pmem][brd ramdisk] => [fsdax pmem][fsdax raw]
> DM_TYPE_BIO_BASEDDM_TYPE_BIO_BASED
> 
>  dm-linear: [fsdax pmem][fsdax raw] => [fsdax pmem][fsdax pmem]
> DM_TYPE_BIO_BASEDDM_TYPE_DAX_BIO_BASED

So we allow 2 of the 3 transitions, but the reason that we disallow the third
isn't fully clear to me.

> >  dm-linear: [fsdax pmem][fsdax raw] => [fsdax pmem][fsdax pmem]
> > DM_TYPE_BIO_BASEDDM_TYPE_DAX_BIO_BASED
> > 
> > This seems arbitrary, as really the choice on whether to use DAX happens at
> > filesystem mount time.  There's no guarantee that the in the first case
> > (double fsdax pmem) we were using the dax mount option with our file
> > system.
> > 
> > Instead, get rid of DM_TYPE_DAX_BIO_BASED and all the special casing around
> > it, and instead make the request queue's QUEUE_FLAG_DAX be our one source
> > of truth.  If this is set, we can use DAX, and if not, not.  We keep this
> > up to date in table_load() as the table changes.  As with regular block
> > devices the filesystem will then know at mount time whether DAX is a
> > supported mount option or not.
> 
> If you don't think you need this specialization that is fine.. but DM
> devices supporting suspending (as part of table reloads) so is there any
> risk that there will be inflight IO (say if someone did 'dmsetup suspend
> --noflush').. and then upon reload the device type changed out from
> under us.. anyway, I don't have all the PMEM DAX stuff paged back into
> my head yet.
> 
> But this just seems like we really shouldn't be allowing the
> transition from what was DM_TYPE_DAX_BIO_BASED back to DM_TYPE_BIO_BASED

I admit I don't fully understand all the ways that DM supports suspending and
resuming devices.  Is there actually a case where we can change out the DM
devices while I/O is running, and somehow end up trying to issue a DAX I/O to
a device that doesn't support DAX?

Toshi, do you have a test case that shows this somehow?


Re: [PATCH v2 4/7] dm: prevent DAX mounts if not supported

2018-06-04 Thread Ross Zwisler
On Fri, Jun 01, 2018 at 05:55:13PM -0400, Mike Snitzer wrote:
> On Tue, May 29 2018 at  3:51pm -0400,
> Ross Zwisler  wrote:
> 
> > Currently the code in dm_dax_direct_access() only checks whether the target
> > type has a direct_access() operation defined, not whether the underlying
> > block devices all support DAX.  This latter property can be seen by looking
> > at whether we set the QUEUE_FLAG_DAX request queue flag when creating the
> > DM device.
> 
> Wait... I thought DAX support was all or nothing?

Right, it is, and that's what I'm trying to capture.  The point of this series
is to make sure that we don't use DAX thru DM if one of the DM members doesn't
support DAX.

This is a bit tricky, though, because as you've pointed out there are a lot of
elements that go into a block device actually supporting DAX.  

First, the block device has to have a direct_access() operation defined in its
struct dax_operations table.  This is a static definition in the drivers,
though, so it's necessary but not sufficient.  For example, the PMEM driver
always defines a direct_access() operation, but depending on the mode of the
namespace (raw, fsdax or sector) it may or may not support DAX.

The next step is that a driver needs to say that the block queue supports
QUEUE_FLAG_DAX.  This again is necessary but not sufficient.  The PMEM driver
currently sets this for all namespace modes, but I agree that this should be
restricted to modes that support DAX.  Even once we do that, though, for the
block driver this isn't fully sufficient.  We'd really like users to call
bdev_dax_supported() so it can run some additional tests to make sure that DAX
will work.

So, the real test that filesystems rely on is bdev_dax_suppported().

The trick is that with DM we need to verify each block device via
bdev_dax_supported() just like a filesystem would, and then have some way of
communicating the result of all those checks to the filesystem which is
eventually mounted on the DM device.  At DAX mount time the filesystem will
call bdev_dax_supported() on the DM device, but it'll really only check the
first device.  

So, the strategy is to have DM manually check each member device via
bdev_dax_supported() then if they all pass set QUEUE_FLAG_DAX.  This then
becomes our one source of truth on whether or not a DM device supports DAX.
When the filesystem mounts with DAX support it'll also run
bdev_dax_supported(), but if we have QUEUE_FLAG_DAX set on the DM device, we
know that this check will pass.

> > This is problematic if we have, for example, a dm-linear device made up of
> > a PMEM namespace in fsdax mode followed by a ramdisk from BRD.
> > QUEUE_FLAG_DAX won't be set on the dm-linear device's request queue, but
> > we have a working direct_access() entry point and the first member of the
> > dm-linear set *does* support DAX.
> 
> If you don't have a uniformly capable device then it is very dangerous
> to advertise that the entire device has a certain capability.  That
> completely bit me in the past with discard (because for every IO I
> wasn't then checking if the destination device supported discards).
>
> It is all well and good that you're adding that check here.  But what I
> don't like is how you're saying QUEUE_FLAG_DAX implies direct_access()
> operation exists.. yet for raw PMEM namespaces we just discussed how
> that is a lie.

QUEUE_FLAG_DAX does imply that direct_access() exists.  However, as discussed
above for a given bdev we really do need to check bdev_dax_supported().

> SO this type of change showcases how the QUEUE_FLAG_DAX doesn't _really_
> imply direct_access() exists.
> 
> > This allows the user to create a filesystem on the dm-linear device, and
> > then mount it with DAX.  The filesystem's bdev_dax_supported() test will
> > pass because it'll operate on the first member of the dm-linear device,
> > which happens to be a fsdax PMEM namespace.
> > 
> > All DAX I/O will then fail to that dm-linear device because the lack of
> > QUEUE_FLAG_DAX prevents fs_dax_get_by_bdev() from working.  This means that
> > the struct dax_device isn't ever set in the filesystem, so
> > dax_direct_access() will always return -EOPNOTSUPP.
> 
> Now you've lost me... these past 2 paragraphs.  Why can a user mount it
> is DAX mode?  Because bdev_dax_supported() only accesses the first
> portion (which happens to have DAX capabilities?)

Right.  bdev_dax_supported() runs all of its checks, and because they are
running against the first block device in the dm set, they all pass.  But the
overall DM device does not actually support DAX.

> Isn't this exactly why you should be checking for QUEUE_FLAG_DAX in the
> caller (bdev_dax_supported)?  Why not use bdev_get_queue() and verify
> QUEUE_FLAG_DAX is set in there?

I'll look into that.
