[ndctl PATCH] ndctl: revert glob support for dimm commands

2016-10-19 Thread Dan Williams
As pointed out by Eric, files in the local directory with names of the
form 'nmemX' can break the recently added glob support by filtering
otherwise valid dimm names in a glob like 'nmem[X-Y]'.  For robustness,
the glob argument would always need to be shell escaped, but at that
point it takes fewer characters to simply rely on shell expansion of the
form nmem{X..Y}, if available.

Reported-by: Eric Blake 
Signed-off-by: Dan Williams 
---
 ndctl/builtin-dimm.c |   70 +-
 1 file changed, 13 insertions(+), 57 deletions(-)

diff --git a/ndctl/builtin-dimm.c b/ndctl/builtin-dimm.c
index 304bd83a33da..19036fd38b46 100644
--- a/ndctl/builtin-dimm.c
+++ b/ndctl/builtin-dimm.c
@@ -10,7 +10,6 @@
  * FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License for
  * more details.
  */
-#include <glob.h>
 #include 
 #include 
 #include 
@@ -774,72 +773,35 @@ static int dimm_action(int argc, const char **argv, struct ndctl_ctx *ctx,
 		int (*action)(struct ndctl_dimm *dimm, struct action_context *actx),
 		const struct option *options, const char *usage)
 {
-   int rc = 0, count, err = 0, glob_cnt = 0;
struct action_context actx = { NULL, NULL };
+   int i, rc = 0, count, err = 0;
const char * const u[] = {
usage,
NULL
};
-   char *all[] = { "all " };
-   glob_t glob_buf;
-   size_t i;
+   unsigned long id;
 
 argc = parse_options(argc, argv, options, u, 0);
 
if (argc == 0)
usage_with_options(u, options);
-   for (i = 0; i < (size_t) argc; i++) {
-   char *path;
-
+   for (i = 0; i < argc; i++) {
if (strcmp(argv[i], "all") == 0) {
argv[0] = "all";
argc = 1;
-   glob_cnt = 0;
break;
}
-   rc = asprintf(&path, "/sys/bus/nd/devices/%s", argv[i]);
-   if (rc < 0) {
-   fprintf(stderr, "failed to parse %s\n", argv[i]);
-   usage_with_options(u, options);
-   }
 
-   rc = glob(path, glob_cnt++ ? GLOB_APPEND : 0, NULL, &glob_buf);
-   switch (rc) {
-   case GLOB_NOSPACE:
-   case GLOB_ABORTED:
-   fprintf(stderr, "failed to parse %s\n", argv[i]);
-   usage_with_options(u, options);
-   break;
-   case GLOB_NOMATCH:
-   case 0:
-   break;
+   if (sscanf(argv[i], "nmem%lu", &id) != 1) {
+   fprintf(stderr, "'%s' is not a valid dimm name\n",
+   argv[i]);
+   err++;
}
-   free(path);
-   }
-
-   if (!glob_cnt)
-   glob_buf.gl_pathc = 0;
-   count = 0;
-   for (i = 0; i < glob_buf.gl_pathc; i++) {
-   char *dimm_name = strrchr(glob_buf.gl_pathv[i], '/');
-   unsigned long id;
-
-   if (!dimm_name++)
-   continue;
-   if (sscanf(dimm_name, "nmem%lu", &id) == 1)
-   count++;
}
 
-   if (strcmp(argv[0], "all") == 0) {
-   glob_buf.gl_pathc = ARRAY_SIZE(all);
-   glob_buf.gl_pathv = all;
-   } else if (!count) {
-   fprintf(stderr, "Error: ' ");
-   for (i = 0; i < (size_t) argc; i++)
-   fprintf(stderr, "%s ", argv[i]);
-   fprintf(stderr, "' does not specify any present devices\n");
-   fprintf(stderr, "See 'ndctl list -D'\n");
+   if (err == argc) {
usage_with_options(u, options);
+   return -EINVAL;
}
 
if (param.json) {
@@ -864,23 +826,20 @@ static int dimm_action(int argc, const char **argv, struct ndctl_ctx *ctx,
ndctl_set_log_priority(ctx, LOG_DEBUG);
 
rc = 0;
+   err = 0;
count = 0;
-   for (i = 0; i < glob_buf.gl_pathc; i++) {
-   char *dimm_name = strrchr(glob_buf.gl_pathv[i], '/');
+   for (i = 0; i < argc; i++) {
struct ndctl_dimm *dimm;
struct ndctl_bus *bus;
-   unsigned long id;
 
-   if (!dimm_name++)
-   continue;
-   if (sscanf(dimm_name, "nmem%lu", &id) != 1)
+   if (sscanf(argv[i], "nmem%lu", &id) != 1)
continue;
 
ndctl_bus_foreach(ctx, bus) {
if (!util_bus_filter(bus, param.bus))
continue;
ndctl_dimm_foreach(bus, dimm) {
-   if (!util_dimm_filter(dimm, dimm_name))
+   if (!util_dimm_filter(dimm, argv[i]))
  

Re: [PATCH 0/3] iopmem : A block device for PCIe memory

2016-10-19 Thread Stephen Bates
> >>
> >> If you're only using the block-device as an entry-point to create
> >> dax-mappings then a device-dax (drivers/dax/) character-device might
> >> be a better fit.
> >>
> >
> > We chose a block device because we felt it was intuitive for users to
> > carve up a memory region by putting a DAX filesystem on it and creating
> > files on that DAX-aware FS. It seemed like a convenient way to
> > partition up the region and to be easily able to get the DMA address
> > for the memory backing the device.
> >
> > That said I would be very keen to get other people's thoughts on how
> > they would like to see this done. And I know some people have had some
> > reservations about using DAX mounted FS to do this in the past.
>
> I guess it depends on the expected size of these devices' BARs, but I
> get the sense they may be smaller / more precious such that you
> wouldn't want to spend capacity on filesystem metadata? For the target
> use case is it assumed that these device BARs are always backed by
> non-volatile memory?  Otherwise this is a mkfs each boot for a
> volatile device.

Dan

Fair point and this is a concern I share. We are not assuming that all
iopmem devices are backed by non-volatile memory so the mkfs
recreation comment is valid. All in all I think you are persuading us
to take a look at /dev/dax ;-). I will see if anyone else chips in
with their thoughts on this.

>
> >>
> >> > 2. Memory Segment Spacing. This patch has the same limitations that
> >> > ZONE_DEVICE does in that memory regions must be spaced at least
> >> > SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
> >> > BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> >> > be usable on neighboring BARs. For our purposes, this is not an issue as
> >> > we'd only be looking at enabling a single BAR in a given PCIe device.
> >> > More exotic use cases may have problems with this.
> >>
> >> I'm working on patches for 4.10 to allow mixing multiple
> >> devm_memremap_pages() allocations within the same physical section.
> >> Hopefully this won't be a problem going forward.
> >>
> >
> > Thanks Dan. Your patches will help address the problem of how to
> > partition a /dev/dax device but they don't help the case when BARs
> > themselves are small, closely spaced and non-segment aligned. However
> > I think most people using iopmem will want to use reasonably large
> > BARs so I am not sure item 2 is that big of an issue.
>
> I think you might have misunderstood what I'm proposing.  The patches
> I'm working on are separate from a facility to carve up a /dev/dax
> device.  The effort is to allow devm_memremap_pages() to maintain
> several allocations within the same 128MB section.  I need this for
> persistent memory to handle platforms that mix pmem and system-ram in
> the same section.  I want to be able to map ZONE_DEVICE pages for a
> portion of a section and be able to remove portions of a section that
> may collide with allocations of a different lifetime.

Oh I did misunderstand. This is very cool and would be useful to us.
One more reason to consider moving to /dev/dax in the next spin of
this patchset ;-).

Thanks

Stephen
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [Qemu-devel] [ndctl PATCH 0/8] dimm label space initialization support

2016-10-19 Thread Dan Williams
On Wed, Oct 19, 2016 at 12:41 PM, Dan Williams  wrote:
> On Wed, Oct 19, 2016 at 11:42 AM, Eric Blake  wrote:
>> On 10/19/2016 11:47 AM, Dan Williams wrote:
>>> The 4.9 kernel added support for sub-dividing PMEM.  With this kernel
>>> patch [1] on top of that baseline, the PMEM-sub-division support can be
>>> enabled for QEMU-KVM and any other platforms that advertise both un-aliased
>>> PMEM regions and support for the label DSM commands [2].
>>>
>>> Given this increasing need to perform a label management operation
>>> across a set of DIMMs this update also adds glob(3) support.  For
>>> example you can now write commands like:
>>>
>>> ndctl zero-labels nmem[2-4]
>>
>> This is slightly scary, as it depends on the user not having any file
>> named nmem2, nmem3, or nmem4 in the current working directory.  Your
>> example should probably encourage proper shell quoting, as in:
>>
>> ndctl zero-labels 'nmem[2-4]'
>
> True.  Although, the glob is run against the list of present device
> names in the system, so local files named nmem should change the
> operation of the command.

s/should/shouldn't/

In any event I don't see the danger in leaving it in, and my fingers
default to [2-4] vs {2..4}.


Re: [PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

2016-10-19 Thread Dan Williams
On Wed, Oct 19, 2016 at 11:40 AM, Stephen Bates  wrote:
> On Wed, Oct 19, 2016 at 10:50:25AM -0700, Dan Williams wrote:
>> On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates  wrote:
>> > From: Logan Gunthorpe 
>> >
>> > We build on recent work that adds memory regions owned by a device
>> > driver (ZONE_DEVICE) [1] and to add struct page support for these new
>> > regions of memory [2].
>> >
>> > 1. Add an extra flags argument into dev_memremap_pages to take in a
>> > MEMREMAP_XX argument. We update the existing calls to this function to
>> > reflect the change.
>> >
>> > 2. For completeness, we add MEMREMAP_WT support to the memremap;
>> > however we have no actual need for this functionality.
>> >
>> > 3. We add the static functions, add_zone_device_pages and
>> > remove_zone_device pages. These are similar to arch_add_memory except
>> > they don't create the memory mapping. We don't believe these need to be
>> > made arch specific, but are open to other opinions.
>> >
>> > 4. dev_memremap_pages and devm_memremap_pages_release are updated to
>> > treat IO memory slightly differently. For IO memory we use a combination
>> > of the appropriate io_remap function and the zone_device pages functions
>> > created above. A flags variable and kaddr pointer are added to struct
>> > page_mem to facilitate this for the release function. We also set up
>> > the page attribute tables for the mapped region correctly based on the
>> > desired mapping.
>> >
>>
>> This description says "what" is being done, but not "why".
>
> Hi Dan
>
> We discuss the motivation in the cover letter.
>
>>
>> In the cover letter, "[PATCH 0/3] iopmem : A block device for PCIe
>> memory",  it mentions that the lack of I/O coherency is a known issue
>> and users of this functionality need to be cognizant of the pitfalls.
>> If that is the case why do we need support for different cpu mapping
>> types than the default write-back cache setting?  It's up to the
>> application to handle cpu cache flushing similar to what we require of
>> device-dax users in the persistent memory case.
>
> Some of the iopmem hardware we have tested has certain alignment
> restrictions on BAR accesses. At the very least we require write
> combine mappings for these. We then felt it appropriate to add the
> other mappings for the sake of completeness.

If the device can support write-combine then it can support bursts, so
I wonder why it couldn't support read bursts for cache fills...


Re: [PATCH 0/3] iopmem : A block device for PCIe memory

2016-10-19 Thread Dan Williams
On Wed, Oct 19, 2016 at 11:48 AM, Stephen Bates  wrote:
> On Tue, Oct 18, 2016 at 08:51:15PM -0700, Dan Williams wrote:
>> [ adding Ashok and David for potential iommu comments ]
>>
>
> Hi Dan
>
> Thanks for adding Ashok and David!
>
>>
>> I agree with the motivation and the need for a solution, but I have
>> some questions about this implementation.
>>
>> >
>> > Consumers
>> > -
>> >
>> > We provide a PCIe device driver in an accompanying patch that can be
>> > used to map any PCIe BAR into a DAX capable block device. For
>> > non-persistent BARs this simply serves as an alternative to using
>> > system memory bounce buffers. For persistent BARs this can serve as an
>> > additional storage device in the system.
>>
>> Why block devices?  I wonder if iopmem was initially designed back
>> when we were considering enabling DAX for raw block devices.  However,
>> that support has since been ripped out / abandoned.  You currently
>> need a filesystem on top of a block-device to get DAX operation.
>> Putting xfs or ext4 on top of PCI-E memory mapped range seems awkward
>> if all you want is a way to map the bar for another PCI-E device in
>> the topology.
>>
>> If you're only using the block-device as an entry-point to create
>> dax-mappings then a device-dax (drivers/dax/) character-device might
>> be a better fit.
>>
>
> We chose a block device because we felt it was intuitive for users to
> carve up a memory region by putting a DAX filesystem on it and creating
> files on that DAX-aware FS. It seemed like a convenient way to
> partition up the region and to be easily able to get the DMA address
> for the memory backing the device.
>
> That said I would be very keen to get other people's thoughts on how
> they would like to see this done. And I know some people have had some
> reservations about using DAX mounted FS to do this in the past.

I guess it depends on the expected size of these devices' BARs, but I
get the sense they may be smaller / more precious such that you
wouldn't want to spend capacity on filesystem metadata? For the target
use case is it assumed that these device BARs are always backed by
non-volatile memory?  Otherwise this is a mkfs each boot for a
volatile device.

>>
>> > 2. Memory Segment Spacing. This patch has the same limitations that
>> > ZONE_DEVICE does in that memory regions must be spaced at least
>> > SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
>> > BARs can be placed closer together than this. Thus ZONE_DEVICE would not
>> > be usable on neighboring BARs. For our purposes, this is not an issue as
>> > we'd only be looking at enabling a single BAR in a given PCIe device.
>> > More exotic use cases may have problems with this.
>>
>> I'm working on patches for 4.10 to allow mixing multiple
>> devm_memremap_pages() allocations within the same physical section.
>> Hopefully this won't be a problem going forward.
>>
>
> Thanks Dan. Your patches will help address the problem of how to
> partition a /dev/dax device but they don't help the case when BARs
> themselves are small, closely spaced and non-segment aligned. However
> I think most people using iopmem will want to use reasonably large
> BARs so I am not sure item 2 is that big of an issue.

I think you might have misunderstood what I'm proposing.  The patches
I'm working on are separate from a facility to carve up a /dev/dax
device.  The effort is to allow devm_memremap_pages() to maintain
several allocations within the same 128MB section.  I need this for
persistent memory to handle platforms that mix pmem and system-ram in
the same section.  I want to be able to map ZONE_DEVICE pages for a
portion of a section and be able to remove portions of a section that
may collide with allocations of a different lifetime.


Re: [Qemu-devel] [ndctl PATCH 0/8] dimm label space initialization support

2016-10-19 Thread Dan Williams
On Wed, Oct 19, 2016 at 11:42 AM, Eric Blake  wrote:
> On 10/19/2016 11:47 AM, Dan Williams wrote:
>> The 4.9 kernel added support for sub-dividing PMEM.  With this kernel
>> patch [1] on top of that baseline, the PMEM-sub-division support can be
>> enabled for QEMU-KVM and any other platforms that advertise both un-aliased
>> PMEM regions and support for the label DSM commands [2].
>>
>> Given this increasing need to perform a label management operation
>> across a set of DIMMs this update also adds glob(3) support.  For
>> example you can now write commands like:
>>
>> ndctl zero-labels nmem[2-4]
>
> This is slightly scary, as it depends on the user not having any file
> named nmem2, nmem3, or nmem4 in the current working directory.  Your
> example should probably encourage proper shell quoting, as in:
>
> ndctl zero-labels 'nmem[2-4]'

True.  Although, the glob is run against the list of present device
names in the system, so local files named nmem should change the
operation of the command.


[PATCH v8 08/16] dax: remove dax_pmd_fault()

2016-10-19 Thread Ross Zwisler
dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX
fault handler.  This fault handler has been disabled for several kernel
releases, and support for PMDs will be reintroduced using the struct iomap
interface instead.

Signed-off-by: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c| 213 
 include/linux/dax.h |   6 +-
 2 files changed, 1 insertion(+), 218 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7238702..3d0b103 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -915,219 +915,6 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-/*
- * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
- * more often than one might expect in the below function.
- */
-#define PG_PMD_COLOUR  ((PMD_SIZE >> PAGE_SHIFT) - 1)
-
-static void __dax_dbg(struct buffer_head *bh, unsigned long address,
-   const char *reason, const char *fn)
-{
-   if (bh) {
-   char bname[BDEVNAME_SIZE];
-   bdevname(bh->b_bdev, bname);
-   pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
-   "length %zd fallback: %s\n", fn, current->comm,
-   address, bname, bh->b_state, (u64)bh->b_blocknr,
-   bh->b_size, reason);
-   } else {
-   pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
-   current->comm, address, reason);
-   }
-}
-
-#define dax_pmd_dbg(bh, address, reason)   __dax_dbg(bh, address, reason, "dax_pmd")
-
-/**
- * dax_pmd_fault - handle a PMD fault on a DAX file
- * @vma: The virtual memory area where the fault occurred
- * @vmf: The description of the fault
- * @get_block: The filesystem method used to translate file offsets to blocks
- *
- * When a page fault occurs, filesystems may call this helper in their
- * pmd_fault handler for DAX files.
- */
-int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-   pmd_t *pmd, unsigned int flags, get_block_t get_block)
-{
-   struct file *file = vma->vm_file;
-   struct address_space *mapping = file->f_mapping;
-   struct inode *inode = mapping->host;
-   struct buffer_head bh;
-   unsigned blkbits = inode->i_blkbits;
-   unsigned long pmd_addr = address & PMD_MASK;
-   bool write = flags & FAULT_FLAG_WRITE;
-   struct block_device *bdev;
-   pgoff_t size, pgoff;
-   sector_t block;
-   int result = 0;
-   bool alloc = false;
-
-   /* dax pmd mappings require pfn_t_devmap() */
-   if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
-   return VM_FAULT_FALLBACK;
-
-   /* Fall back to PTEs if we're going to COW */
-   if (write && !(vma->vm_flags & VM_SHARED)) {
-   split_huge_pmd(vma, pmd, address);
-   dax_pmd_dbg(NULL, address, "cow write");
-   return VM_FAULT_FALLBACK;
-   }
-   /* If the PMD would extend outside the VMA */
-   if (pmd_addr < vma->vm_start) {
-   dax_pmd_dbg(NULL, address, "vma start unaligned");
-   return VM_FAULT_FALLBACK;
-   }
-   if ((pmd_addr + PMD_SIZE) > vma->vm_end) {
-   dax_pmd_dbg(NULL, address, "vma end unaligned");
-   return VM_FAULT_FALLBACK;
-   }
-
-   pgoff = linear_page_index(vma, pmd_addr);
-   size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-   if (pgoff >= size)
-   return VM_FAULT_SIGBUS;
-   /* If the PMD would cover blocks out of the file */
-   if ((pgoff | PG_PMD_COLOUR) >= size) {
-   dax_pmd_dbg(NULL, address,
-   "offset + huge page size > file size");
-   return VM_FAULT_FALLBACK;
-   }
-
-   memset(&bh, 0, sizeof(bh));
-   bh.b_bdev = inode->i_sb->s_bdev;
-   block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
-
-   bh.b_size = PMD_SIZE;
-
-   if (get_block(inode, block, &bh, 0) != 0)
-   return VM_FAULT_SIGBUS;
-
-   if (!buffer_mapped(&bh) && write) {
-   if (get_block(inode, block, &bh, 1) != 0)
-   return VM_FAULT_SIGBUS;
-   alloc = true;
-   WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
-   }
-
-   bdev = bh.b_bdev;
-
-   if (bh.b_size < PMD_SIZE) {
-   dax_pmd_dbg(&bh, address, "allocated block too small");
-   return VM_FAULT_FALLBACK;
-   }
-
-   /*
-* If we allocated new storage, make sure no process has any
-* zero pages covering this hole
-*/
-   if (alloc) {
-   loff_t lstart = pgoff << PAGE_SHIFT;
-   loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
-   truncate_pagecache_range(inode, lstart, lend);

[PATCH v8 10/16] dax: add dax_iomap_sector() helper function

2016-10-19 Thread Ross Zwisler
To be able to correctly calculate the sector from a file position and a
struct iomap there is a complex little bit of logic that currently happens
in both dax_iomap_actor() and dax_iomap_fault().  This will need to be
repeated yet again in the DAX PMD fault handler when it is added, so break
it out into a helper function.

Signed-off-by: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fdbd7a1..7737954 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1030,6 +1030,11 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
+static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
+{
+   return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
+}
+
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct iomap *iomap)
@@ -1055,8 +1060,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct blk_dax_ctl dax = { 0 };
ssize_t map_len;
 
-   dax.sector = iomap->blkno +
-   (((pos & PAGE_MASK) - iomap->offset) >> 9);
+   dax.sector = dax_iomap_sector(iomap, pos);
dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
map_len = dax_map_atomic(iomap->bdev, &dax);
if (map_len < 0) {
@@ -1193,7 +1197,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
goto unlock_entry;
}
 
-   sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);
+   sector = dax_iomap_sector(&iomap, pos);
 
if (vmf->cow_page) {
switch (iomap.type) {
-- 
2.7.4



[PATCH v8 11/16] dax: dax_iomap_fault() needs to call iomap_end()

2016-10-19 Thread Ross Zwisler
Currently iomap_end() doesn't do anything for DAX page faults for both ext2
and XFS.  ext2_iomap_end() just checks for a write underrun, and
xfs_file_iomap_end() checks to see if it needs to finish a delayed
allocation.  However, in the future iomap_end() calls might be needed to
make sure we have balanced allocations, locks, etc.  So, add calls to
iomap_end() with appropriate error handling to dax_iomap_fault().

Signed-off-by: Ross Zwisler 
Suggested-by: Jan Kara 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 37 +
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7737954..6edd89b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1165,6 +1165,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
struct iomap iomap = { 0 };
unsigned flags = 0;
int error, major = 0;
+   int locked_status = 0;
void *entry;
 
/*
@@ -1194,7 +1195,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
goto unlock_entry;
if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
error = -EIO;   /* fs corruption? */
-   goto unlock_entry;
+   goto finish_iomap;
}
 
sector = dax_iomap_sector(&iomap, pos);
@@ -1216,13 +1217,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
 
if (error)
-   goto unlock_entry;
+   goto finish_iomap;
if (!radix_tree_exceptional_entry(entry)) {
vmf->page = entry;
-   return VM_FAULT_LOCKED;
+   locked_status = VM_FAULT_LOCKED;
+   } else {
+   vmf->entry = entry;
+   locked_status = VM_FAULT_DAX_LOCKED;
}
-   vmf->entry = entry;
-   return VM_FAULT_DAX_LOCKED;
+   goto finish_iomap;
}
 
switch (iomap.type) {
@@ -1237,8 +1240,10 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
break;
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
-   if (!(vmf->flags & FAULT_FLAG_WRITE))
-   return dax_load_hole(mapping, entry, vmf);
+   if (!(vmf->flags & FAULT_FLAG_WRITE)) {
+   locked_status = dax_load_hole(mapping, entry, vmf);
+   break;
+   }
/*FALLTHRU*/
default:
WARN_ON_ONCE(1);
@@ -1246,14 +1251,30 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
break;
}
 
+ finish_iomap:
+   if (ops->iomap_end) {
+   if (error) {
+   /* keep previous error */
+   ops->iomap_end(inode, pos, PAGE_SIZE, 0, flags,
+   &iomap);
+   } else {
+   error = ops->iomap_end(inode, pos, PAGE_SIZE,
+   PAGE_SIZE, flags, &iomap);
+   }
+   }
  unlock_entry:
-   put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+   if (!locked_status || error)
+   put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  out:
if (error == -ENOMEM)
return VM_FAULT_OOM | major;
/* -EBUSY is fine, somebody else faulted on the same PTE */
if (error < 0 && error != -EBUSY)
return VM_FAULT_SIGBUS | major;
+   if (locked_status) {
+   WARN_ON_ONCE(error); /* -EBUSY from ops->iomap_end? */
+   return locked_status;
+   }
return VM_FAULT_NOPAGE | major;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
-- 
2.7.4



[PATCH v8 14/16] dax: add struct iomap based DAX PMD support

2016-10-19 Thread Ross Zwisler
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
---
 fs/dax.c| 378 +++-
 include/linux/dax.h |  55 ++--
 mm/filemap.c|   3 +-
 3 files changed, 386 insertions(+), 50 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0582c7c..1fdf4c0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -76,6 +76,26 @@ static void dax_unmap_atomic(struct block_device *bdev,
blk_queue_exit(bdev->bd_queue);
 }
 
+static int dax_is_pmd_entry(void *entry)
+{
+   return (unsigned long)entry & RADIX_DAX_PMD;
+}
+
+static int dax_is_pte_entry(void *entry)
+{
+   return !((unsigned long)entry & RADIX_DAX_PMD);
+}
+
+static int dax_is_zero_entry(void *entry)
+{
+   return (unsigned long)entry & RADIX_DAX_HZP;
+}
+
+static int dax_is_empty_entry(void *entry)
+{
+   return (unsigned long)entry & RADIX_DAX_EMPTY;
+}
+
 struct page *read_dax_sector(struct block_device *bdev, sector_t n)
 {
struct page *page = alloc_pages(GFP_KERNEL, 0);
@@ -281,7 +301,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
 * queue to the start of that PMD.  This ensures that all offsets in
 * the range covered by the PMD map to the same bit lock.
 */
-   if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+   if (dax_is_pmd_entry(entry))
index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
 
key->mapping = mapping;
@@ -413,36 +433,116 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
  * radix tree entry locked. If the radix tree doesn't contain given index,
  * create empty exceptional entry for the index and return with it locked.
  *
+ * When requesting an entry with size RADIX_DAX_PMD, grab_mapping_entry() will
+ * either return that locked entry or will return an error.  This error will
+ * happen if there are any 4k entries (either zero pages or DAX entries)
+ * within the 2MiB range that we are requesting.
+ *
+ * We always favor 4k entries over 2MiB entries. There isn't a flow where we
+ * evict 4k entries in order to 'upgrade' them to a 2MiB entry.  A 2MiB
+ * insertion will fail if it finds any 4k entries already in the tree, and a
+ * 4k insertion will cause an existing 2MiB entry to be unmapped and
+ * downgraded to 4k entries.  This happens for both 2MiB huge zero pages as
+ * well as 2MiB empty entries.
+ *
+ * The exception to this downgrade path is for 2MiB DAX PMD entries that have
+ * real storage backing them.  We will leave these real 2MiB DAX entries in
+ * the tree, and PTE writes will simply dirty the entire 2MiB DAX entry.
+ *
  * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)

[PATCH v8 12/16] dax: move RADIX_DAX_* defines to dax.h

2016-10-19 Thread Ross Zwisler
The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
empty entry.

Instead of shifting the defines between dax.c and dax.h as they are
individually used in other code, just move them wholesale to dax.h so
they'll be available when we need them.

Signed-off-by: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c| 14 --
 include/linux/dax.h | 15 ++-
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6edd89b..c45cc4d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -34,20 +34,6 @@
 #include 
 #include "internal.h"
 
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-   RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-   RADIX_TREE_EXCEPTIONAL_ENTRY))
-
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a3dfee4..e9ea78c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -8,8 +8,21 @@
 
 struct iomap_ops;
 
-/* We use lowest available exceptional entry bit for locking */
+/*
+ * We use lowest available bit in exceptional entry for locking, other two
+ * bits to determine entry type. In total 3 special bits.
+ */
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+   RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
+   RADIX_TREE_EXCEPTIONAL_ENTRY))
+
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
struct iomap_ops *ops);
-- 
2.7.4

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v8 16/16] dax: remove "depends on BROKEN" from FS_DAX_PMD

2016-10-19 Thread Ross Zwisler
Now that DAX PMD faults are once again working and are now participating in
DAX's radix tree locking scheme, allow their config option to be enabled.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
---
 fs/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 4bd03a2..8e9e5f41 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -55,7 +55,6 @@ config FS_DAX_PMD
depends on FS_DAX
depends on ZONE_DEVICE
depends on TRANSPARENT_HUGEPAGE
-   depends on BROKEN
 
 endif # BLOCK
 
-- 
2.7.4



[PATCH v8 13/16] dax: move put_(un)locked_mapping_entry() in dax.c

2016-10-19 Thread Ross Zwisler
No functional change.

The static functions put_locked_mapping_entry() and
put_unlocked_mapping_entry() will soon be used in error cases in
grab_mapping_entry(), so move their definitions above this function.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 50 +-
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index c45cc4d..0582c7c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -382,6 +382,31 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
}
 }
 
+static void put_locked_mapping_entry(struct address_space *mapping,
+pgoff_t index, void *entry)
+{
+   if (!radix_tree_exceptional_entry(entry)) {
+   unlock_page(entry);
+   put_page(entry);
+   } else {
+   dax_unlock_mapping_entry(mapping, index);
+   }
+}
+
+/*
+ * Called when we are done with radix tree entry we looked up via
+ * get_unlocked_mapping_entry() and which we didn't lock in the end.
+ */
+static void put_unlocked_mapping_entry(struct address_space *mapping,
+  pgoff_t index, void *entry)
+{
+   if (!radix_tree_exceptional_entry(entry))
+   return;
+
+   /* We have to wake up next waiter for the radix tree entry lock */
+   dax_wake_mapping_entry_waiter(mapping, index, entry, false);
+}
+
 /*
  * Find radix tree entry at given index. If it points to a page, return with
  * the page locked. If it points to the exceptional entry, return with the
@@ -486,31 +511,6 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
-static void put_locked_mapping_entry(struct address_space *mapping,
-pgoff_t index, void *entry)
-{
-   if (!radix_tree_exceptional_entry(entry)) {
-   unlock_page(entry);
-   put_page(entry);
-   } else {
-   dax_unlock_mapping_entry(mapping, index);
-   }
-}
-
-/*
- * Called when we are done with radix tree entry we looked up via
- * get_unlocked_mapping_entry() and which we didn't lock in the end.
- */
-static void put_unlocked_mapping_entry(struct address_space *mapping,
-  pgoff_t index, void *entry)
-{
-   if (!radix_tree_exceptional_entry(entry))
-   return;
-
-   /* We have to wake up next waiter for the radix tree entry lock */
-   dax_wake_mapping_entry_waiter(mapping, index, entry, false);
-}
-
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
-- 
2.7.4



[PATCH v8 09/16] dax: correct dax iomap code namespace

2016-10-19 Thread Ross Zwisler
The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
should be part of the "dax" namespace and not the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.

Signed-off-by: Ross Zwisler 
Suggested-by: Dave Chinner 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c| 16 
 fs/ext2/file.c  |  6 +++---
 fs/xfs/xfs_file.c   |  8 
 include/linux/dax.h |  4 ++--
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 3d0b103..fdbd7a1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1031,7 +1031,7 @@ EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
 static loff_t
-iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct iomap *iomap)
 {
struct iov_iter *iter = data;
@@ -1088,7 +1088,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 }
 
 /**
- * iomap_dax_rw - Perform I/O to a DAX file
+ * dax_iomap_rw - Perform I/O to a DAX file
  * @iocb:  The control block for this I/O
  * @iter:  The addresses to do I/O from or to
  * @ops:   iomap ops passed from the file system
@@ -1098,7 +1098,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
  * and evicting any page cache pages in the region under I/O.
  */
 ssize_t
-iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
struct iomap_ops *ops)
 {
struct address_space *mapping = iocb->ki_filp->f_mapping;
@@ -1128,7 +1128,7 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 
while (iov_iter_count(iter)) {
ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
-   iter, iomap_dax_actor);
+   iter, dax_iomap_actor);
if (ret <= 0)
break;
pos += ret;
@@ -1138,10 +1138,10 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
iocb->ki_pos += done;
return done ? done : ret;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_rw);
+EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
 /**
- * iomap_dax_fault - handle a page fault on a DAX file
+ * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @ops: iomap ops passed from the file system
@@ -1150,7 +1150,7 @@ EXPORT_SYMBOL_GPL(iomap_dax_rw);
  * or mkwrite handler for DAX files. Assumes the caller has done all the
  * necessary locking for the page fault to proceed successfully.
  */
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
struct iomap_ops *ops)
 {
struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1252,5 +1252,5 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
return VM_FAULT_SIGBUS | major;
return VM_FAULT_NOPAGE | major;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_fault);
+EXPORT_SYMBOL_GPL(dax_iomap_fault);
 #endif /* CONFIG_FS_IOMAP */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index fb88b51..b0f2415 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -38,7 +38,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
return 0; /* skip atime */
 
inode_lock_shared(inode);
-   ret = iomap_dax_rw(iocb, to, _iomap_ops);
+   ret = dax_iomap_rw(iocb, to, _iomap_ops);
inode_unlock_shared(inode);
 
file_accessed(iocb->ki_filp);
@@ -62,7 +62,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (ret)
goto out_unlock;
 
-   ret = iomap_dax_rw(iocb, from, _iomap_ops);
+   ret = dax_iomap_rw(iocb, from, _iomap_ops);
if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
i_size_write(inode, iocb->ki_pos);
mark_inode_dirty(inode);
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
}
down_read(>dax_sem);
 
-   ret = iomap_dax_fault(vma, vmf, _iomap_ops);
+   ret = dax_iomap_fault(vma, vmf, _iomap_ops);
 
up_read(>dax_sem);
if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a314fc7..e7f35d54 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -344,7 +344,7 @@ xfs_file_dax_read(
return 0; /* skip atime */
 
xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-   ret 

[PATCH v8 06/16] dax: consistent variable naming for DAX entries

2016-10-19 Thread Ross Zwisler
No functional change.

Consistently use the variable name 'entry' instead of 'ret' for DAX radix
tree entries.  This was already happening in most of the code, so update
get_unlocked_mapping_entry(), grab_mapping_entry() and
dax_unlock_mapping_entry().

Signed-off-by: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 219fa2b..835e7f0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -357,7 +357,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
 static void *get_unlocked_mapping_entry(struct address_space *mapping,
pgoff_t index, void ***slotp)
 {
-   void *ret, **slot;
+   void *entry, **slot;
struct wait_exceptional_entry_queue ewait;
wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
 
@@ -367,13 +367,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
ewait.key.index = index;
 
for (;;) {
-   ret = __radix_tree_lookup(>page_tree, index, NULL,
+   entry = __radix_tree_lookup(>page_tree, index, NULL,
  );
-   if (!ret || !radix_tree_exceptional_entry(ret) ||
+   if (!entry || !radix_tree_exceptional_entry(entry) ||
!slot_locked(mapping, slot)) {
if (slotp)
*slotp = slot;
-   return ret;
+   return entry;
}
prepare_to_wait_exclusive(wq, ,
  TASK_UNINTERRUPTIBLE);
@@ -396,13 +396,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  */
 static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-   void *ret, **slot;
+   void *entry, **slot;
 
 restart:
spin_lock_irq(>tree_lock);
-   ret = get_unlocked_mapping_entry(mapping, index, );
+   entry = get_unlocked_mapping_entry(mapping, index, );
/* No entry for given index? Make sure radix tree is big enough. */
-   if (!ret) {
+   if (!entry) {
int err;
 
spin_unlock_irq(>tree_lock);
@@ -410,10 +410,10 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
if (err)
return ERR_PTR(err);
-   ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+   entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
   RADIX_DAX_ENTRY_LOCK);
spin_lock_irq(>tree_lock);
-   err = radix_tree_insert(>page_tree, index, ret);
+   err = radix_tree_insert(>page_tree, index, entry);
radix_tree_preload_end();
if (err) {
spin_unlock_irq(>tree_lock);
@@ -425,11 +425,11 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
/* Good, we have inserted empty locked entry into the tree. */
mapping->nrexceptional++;
spin_unlock_irq(>tree_lock);
-   return ret;
+   return entry;
}
/* Normal page in radix tree? */
-   if (!radix_tree_exceptional_entry(ret)) {
-   struct page *page = ret;
+   if (!radix_tree_exceptional_entry(entry)) {
+   struct page *page = entry;
 
get_page(page);
spin_unlock_irq(>tree_lock);
@@ -442,9 +442,9 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
}
return page;
}
-   ret = lock_slot(mapping, slot);
+   entry = lock_slot(mapping, slot);
spin_unlock_irq(>tree_lock);
-   return ret;
+   return entry;
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
@@ -469,11 +469,11 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-   void *ret, **slot;
+   void *entry, **slot;
 
spin_lock_irq(>tree_lock);
-   ret = __radix_tree_lookup(>page_tree, index, NULL, );
-   if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
+   entry = __radix_tree_lookup(>page_tree, index, NULL, );
+   if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
 !slot_locked(mapping, slot))) {
spin_unlock_irq(>tree_lock);
return;
-- 
2.7.4



[PATCH v8 05/16] dax: remove the last BUG_ON() from fs/dax.c

2016-10-19 Thread Ross Zwisler
Don't take down the kernel if we get an invalid 'from' and 'length'
argument pair.  Just warn once and return an error.

Signed-off-by: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index e52e754..219fa2b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1194,7 +1194,8 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
/* Block boundary? Nothing to do */
if (!length)
return 0;
-   BUG_ON((offset + length) > PAGE_SIZE);
+   if (WARN_ON_ONCE((offset + length) > PAGE_SIZE))
+   return -EINVAL;
 
memset(, 0, sizeof(bh));
bh.b_bdev = inode->i_sb->s_bdev;
-- 
2.7.4



[PATCH v8 00/16] re-enable DAX PMD support

2016-10-19 Thread Ross Zwisler
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

Changes since v7:
 - Rebased on v4.9-rc1, dropping one ext4 patch that had already been merged.
 - Added Reviewed-by tags from Jan Kara.

Here is a tree containing my changes:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_v8

Ross Zwisler (16):
  ext4: tell DAX the size of allocation holes
  dax: remove buffer_size_valid()
  ext2: remove support for DAX PMD faults
  dax: make 'wait_table' global variable static
  dax: remove the last BUG_ON() from fs/dax.c
  dax: consistent variable naming for DAX entries
  dax: coordinate locking for offsets in PMD range
  dax: remove dax_pmd_fault()
  dax: correct dax iomap code namespace
  dax: add dax_iomap_sector() helper function
  dax: dax_iomap_fault() needs to call iomap_end()
  dax: move RADIX_DAX_* defines to dax.h
  dax: move put_(un)locked_mapping_entry() in dax.c
  dax: add struct iomap based DAX PMD support
  xfs: use struct iomap based DAX PMD fault path
  dax: remove "depends on BROKEN" from FS_DAX_PMD

 fs/Kconfig  |   1 -
 fs/dax.c| 826 +---
 fs/ext2/file.c  |  35 +--
 fs/ext4/inode.c |   3 +
 fs/xfs/xfs_aops.c   |  26 +-
 fs/xfs/xfs_aops.h   |   3 -
 fs/xfs/xfs_file.c   |  10 +-
 include/linux/dax.h |  58 +++-
 mm/filemap.c|   5 +-
 9 files changed, 537 insertions(+), 430 deletions(-)

-- 
2.7.4



[PATCH v8 02/16] dax: remove buffer_size_valid()

2016-10-19 Thread Ross Zwisler
Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
Reviewed-by: Christoph Hellwig 
---
 fs/dax.c | 22 +-
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 014defd..b09817a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -123,19 +123,6 @@ static bool buffer_written(struct buffer_head *bh)
return buffer_mapped(bh) && !buffer_unwritten(bh);
 }
 
-/*
- * When ext4 encounters a hole, it returns without modifying the buffer_head
- * which means that we can't trust b_size.  To cope with this, we set b_state
- * to 0 before calling get_block and, if any bit is set, we know we can trust
- * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
- * and would save us time calling get_block repeatedly.
- */
-static bool buffer_size_valid(struct buffer_head *bh)
-{
-   return bh->b_state != 0;
-}
-
-
 static sector_t to_sector(const struct buffer_head *bh,
const struct inode *inode)
 {
@@ -177,8 +164,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
rc = get_block(inode, block, bh, rw == WRITE);
if (rc)
break;
-   if (!buffer_size_valid(bh))
-   bh->b_size = 1 << blkbits;
bh_max = pos - first + bh->b_size;
bdev = bh->b_bdev;
/*
@@ -1012,12 +997,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
bdev = bh.b_bdev;
 
-   /*
-* If the filesystem isn't willing to tell us the length of a hole,
-* just fall back to PTEs.  Calling get_block 512 times in a loop
-* would be silly.
-*/
-   if (!buffer_size_valid() || bh.b_size < PMD_SIZE) {
+   if (bh.b_size < PMD_SIZE) {
dax_pmd_dbg(, address, "allocated block too small");
return VM_FAULT_FALLBACK;
}
-- 
2.7.4



Re: [PATCH 0/3] iopmem : A block device for PCIe memory

2016-10-19 Thread Stephen Bates
On Tue, Oct 18, 2016 at 08:51:15PM -0700, Dan Williams wrote:
> [ adding Ashok and David for potential iommu comments ]
>

Hi Dan

Thanks for adding Ashok and David!

>
> I agree with the motivation and the need for a solution, but I have
> some questions about this implementation.
>
> >
> > Consumers
> > -
> >
> > We provide a PCIe device driver in an accompanying patch that can be
> > used to map any PCIe BAR into a DAX capable block device. For
> > non-persistent BARs this simply serves as an alternative to using
> > system memory bounce buffers. For persistent BARs this can serve as an
> > additional storage device in the system.
>
> Why block devices?  I wonder if iopmem was initially designed back
> when we were considering enabling DAX for raw block devices.  However,
> that support has since been ripped out / abandoned.  You currently
> need a filesystem on top of a block-device to get DAX operation.
> Putting xfs or ext4 on top of PCI-E memory mapped range seems awkward
> if all you want is a way to map the bar for another PCI-E device in
> the topology.
>
> If you're only using the block-device as a entry-point to create
> dax-mappings then a device-dax (drivers/dax/) character-device might
> be a better fit.
>

We chose a block device because we felt it was intuitive for users to
carve up a memory region by putting a DAX filesystem on it and creating
files on that DAX-aware FS. It seemed like a convenient way to
partition up the region and to easily get the DMA address for the
memory backing the device.

That said, I would be very keen to get other people's thoughts on how
they would like to see this done. And I know some people have had
reservations about using a DAX-mounted FS to do this in the past.

>
> > 2. Memory Segment Spacing. This patch has the same limitations that
> > ZONE_DEVICE does in that memory regions must be spaces at least
> > SECTION_SIZE bytes part. On x86 this is 128MB and there are cases where
> > BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> > be usable on neighboring BARs. For our purposes, this is not an issue as
> > we'd only be looking at enabling a single BAR in a given PCIe device.
> > More exotic use cases may have problems with this.
>
> I'm working on patches for 4.10 to allow mixing multiple
> devm_memremap_pages() allocations within the same physical section.
> Hopefully this won't be a problem going forward.
>

Thanks Dan. Your patches will help address the problem of how to
partition a /dev/dax device, but they don't help the case where the BARs
themselves are small, closely spaced and non-segment-aligned. However,
I think most people using iopmem will want to use reasonably large
BARs, so I am not sure item 2 is that big of an issue.

> I haven't yet grokked the motivation for this, but I'll go comment on
> that separately.

Thanks Dan!


Re: [PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

2016-10-19 Thread Stephen Bates
On Wed, Oct 19, 2016 at 10:50:25AM -0700, Dan Williams wrote:
> On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates  wrote:
> > From: Logan Gunthorpe 
> >
> > We build on recent work that adds memory regions owned by a device
> > driver (ZONE_DEVICE) [1] and to add struct page support for these new
> > regions of memory [2].
> >
> > 1. Add an extra flags argument into dev_memremap_pages to take in a
> > MEMREMAP_XX argument. We update the existing calls to this function to
> > reflect the change.
> >
> > 2. For completeness, we add MEMREMAP_WT support to the memremap;
> > however we have no actual need for this functionality.
> >
> > 3. We add the static functions, add_zone_device_pages and
> > remove_zone_device pages. These are similar to arch_add_memory except
> > they don't create the memory mapping. We don't believe these need to be
> > made arch specific, but are open to other opinions.
> >
> > 4. dev_memremap_pages and devm_memremap_pages_release are updated to
> > treat IO memory slightly differently. For IO memory we use a combination
> > of the appropriate io_remap function and the zone_device pages functions
> > created above. A flags variable and kaddr pointer are added to struct
> > page_mem to facilitate this for the release function. We also set up
> > the page attribute tables for the mapped region correctly based on the
> > desired mapping.
> >
>
> This description says "what" is being done, but not "why".

Hi Dan

We discuss the motivation in the cover letter.

>
> In the cover letter, "[PATCH 0/3] iopmem : A block device for PCIe
> memory",  it mentions that the lack of I/O coherency is a known issue
> and users of this functionality need to be cognizant of the pitfalls.
> If that is the case why do we need support for different cpu mapping
> types than the default write-back cache setting?  It's up to the
> application to handle cache cpu flushing similar to what we require of
> device-dax users in the persistent memory case.

Some of the iopmem hardware we have tested has certain alignment
restrictions on BAR accesses. At the very least we require write
combine mappings for these. We then felt it appropriate to add the
other mappings for the sake of completeness.

Cheers

Stephen


Re: [PATCH 19/20] dax: Protect PTE modification on WP fault by radix tree entry lock

2016-10-19 Thread Ross Zwisler
On Wed, Oct 19, 2016 at 09:25:05AM +0200, Jan Kara wrote:
> On Tue 18-10-16 13:53:32, Ross Zwisler wrote:
> > On Tue, Sep 27, 2016 at 06:08:23PM +0200, Jan Kara wrote:
> > > - void *entry;
> > > + void *entry, **slot;
> > >   pgoff_t index = vmf->pgoff;
> > >  
> > >   spin_lock_irq(>tree_lock);
> > > - entry = get_unlocked_mapping_entry(mapping, index, NULL);
> > > - if (!entry || !radix_tree_exceptional_entry(entry))
> > > - goto out;
> > > + entry = get_unlocked_mapping_entry(mapping, index, );
> > > + if (!entry || !radix_tree_exceptional_entry(entry)) {
> > > + if (entry)
> > > + put_unlocked_mapping_entry(mapping, index, entry);
> > 
> > I don't think you need this call to put_unlocked_mapping_entry().  If we get
> > in here we know that 'entry' is a page cache page, in which case
> > put_unlocked_mapping_entry() will just return without doing any work.
> 
> Right, but that is just an implementation detail internal to how the
> locking works. The rules are simple to avoid issues and thus the invariant
> is: Once you call get_unlocked_mapping_entry() you either have to lock the
> entry and then call put_locked_mapping_entry() or you have to drop it with
> put_unlocked_mapping_entry(). Once you add arguments about entry types
> etc., errors are much easier to make...

Makes sense.  You can add:

Reviewed-by: Ross Zwisler 


Re: [PATCH 16/20] mm: Provide helper for finishing mkwrite faults

2016-10-19 Thread Ross Zwisler
On Wed, Oct 19, 2016 at 09:16:00AM +0200, Jan Kara wrote:
> On Tue 18-10-16 12:35:25, Ross Zwisler wrote:
> > On Tue, Sep 27, 2016 at 06:08:20PM +0200, Jan Kara wrote:
> > > Provide a helper function for finishing write faults due to PTE being
> > > read-only. The helper will be used by DAX to avoid the need of
> > > complicating generic MM code with DAX locking specifics.
> > > 
> > > Signed-off-by: Jan Kara 
> > > ---
> > >  include/linux/mm.h |  1 +
> > >  mm/memory.c| 65 +++---
> > >  2 files changed, 39 insertions(+), 27 deletions(-)
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 1055f2ece80d..e5a014be8932 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -617,6 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct 
> > > vm_area_struct *vma)
> > >  int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
> > >   struct page *page);
> > >  int finish_fault(struct vm_fault *vmf);
> > > +int finish_mkwrite_fault(struct vm_fault *vmf);
> > >  #endif
> > >  
> > >  /*
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index f49e736d6a36..8c8cb7f2133e 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2266,6 +2266,36 @@ oom:
> > >   return VM_FAULT_OOM;
> > >  }
> > >  
> > > +/**
> > > + * finish_mkrite_fault - finish page fault making PTE writeable once the page
> >   finish_mkwrite_fault
> 
> Fixed, thanks.
> 
> > > @@ -2315,26 +2335,17 @@ static int wp_page_shared(struct vm_fault *vmf)
> > >   put_page(vmf->page);
> > >   return tmp;
> > >   }
> > > - /*
> > > -  * Since we dropped the lock we need to revalidate
> > > -  * the PTE as someone else may have changed it.  If
> > > -  * they did, we just return, as we can count on the
> > > -  * MMU to tell us if they didn't also make it writable.
> > > -  */
> > > - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > > - vmf->address, >ptl);
> > > - if (!pte_same(*vmf->pte, vmf->orig_pte)) {
> > > + tmp = finish_mkwrite_fault(vmf);
> > > + if (unlikely(!tmp || (tmp &
> > > +   (VM_FAULT_ERROR | VM_FAULT_NOPAGE {
> > 
> > The 'tmp' return from finish_mkwrite_fault() can only be 0 or 
> > VM_FAULT_WRITE.
> > I think this test should just be 
> > 
> > if (unlikely(!tmp)) {
> 
> Right, finish_mkwrite_fault() cannot currently throw other errors than
> "retry needed" which is indicated by tmp == 0. However I'd prefer to keep
> symmetry with finish_fault() handler which can throw other errors and
> better be prepared to handle them from finish_mkwrite_fault() as well.

Fair enough.  You can add:

Reviewed-by: Ross Zwisler 


[ndctl PATCH 4/8] ndctl: merge {enable, disable}-dimm with label commands

2016-10-19 Thread Dan Williams
Combine builtin-xable-dimm.c and builtin-label.c into builtin-dimm.c to
give glob support to the enable-dimm and disable-dimm commands. The
command calling conventions already share a  argument, so for
consistency enable glob support across the set.

Signed-off-by: Dan Williams 
---
 ndctl/Makefile.am  |3 -
 ndctl/builtin-dimm.c   |   40 +++
 ndctl/builtin-xable-dimm.c |  115 
 3 files changed, 39 insertions(+), 119 deletions(-)
 rename ndctl/{builtin-labels.c => builtin-dimm.c} (90%)
 delete mode 100644 ndctl/builtin-xable-dimm.c

diff --git a/ndctl/Makefile.am b/ndctl/Makefile.am
index dc990988ed43..263a8c4337dd 100644
--- a/ndctl/Makefile.am
+++ b/ndctl/Makefile.am
@@ -6,11 +6,10 @@ ndctl_SOURCES = ndctl.c \
builtin-create-nfit.c \
builtin-xaction-namespace.c \
builtin-xable-region.c \
-   builtin-xable-dimm.c \
+   builtin-dimm.c \
builtin-list.c \
builtin-test.c \
builtin-help.c \
-   builtin-labels.c \
util/json.c
 
 if ENABLE_SMART
diff --git a/ndctl/builtin-labels.c b/ndctl/builtin-dimm.c
similarity index 90%
rename from ndctl/builtin-labels.c
rename to ndctl/builtin-dimm.c
index ddd3eb578341..6d0c343a29e3 100644
--- a/ndctl/builtin-labels.c
+++ b/ndctl/builtin-dimm.c
@@ -70,6 +70,22 @@ struct action_context {
FILE *f_out;
 };
 
+static int action_disable(struct ndctl_dimm *dimm, struct action_context *actx)
+{
+   if (ndctl_dimm_is_active(dimm)) {
+   fprintf(stderr, "%s is active, skipping...\n",
+   ndctl_dimm_get_devname(dimm));
+   return -EBUSY;
+   }
+
+   return ndctl_dimm_disable(dimm);
+}
+
+static int action_enable(struct ndctl_dimm *dimm, struct action_context *actx)
+{
+   return ndctl_dimm_enable(dimm);
+}
+
 static int action_zero(struct ndctl_dimm *dimm, struct action_context *actx)
 {
return ndctl_dimm_zero_labels(dimm);
@@ -335,7 +351,7 @@ static const struct option read_options[] = {
OPT_END(),
 };
 
-static const struct option zero_options[] = {
+static const struct option base_options[] = {
BASE_OPTIONS(),
OPT_END(),
 };
@@ -496,10 +512,30 @@ int cmd_read_labels(int argc, const char **argv, struct ndctl_ctx *ctx)
 
 int cmd_zero_labels(int argc, const char **argv, struct ndctl_ctx *ctx)
 {
-   int count = dimm_action(argc, argv, ctx, action_zero, zero_options,
+   int count = dimm_action(argc, argv, ctx, action_zero, base_options,
"ndctl zero-labels  [..] []");
 
fprintf(stderr, "zeroed %d nmem%s\n", count >= 0 ? count : 0,
count > 1 ? "s" : "");
return count >= 0 ? 0 : EXIT_FAILURE;
 }
+
+int cmd_disable_dimm(int argc, const char **argv, struct ndctl_ctx *ctx)
+{
+   int count = dimm_action(argc, argv, ctx, action_disable, base_options,
+   "ndctl disable-dimm  [..] []");
+
+   fprintf(stderr, "disabled %d nmem%s\n", count >= 0 ? count : 0,
+   count > 1 ? "s" : "");
+   return count >= 0 ? 0 : EXIT_FAILURE;
+}
+
+int cmd_enable_dimm(int argc, const char **argv, struct ndctl_ctx *ctx)
+{
+   int count = dimm_action(argc, argv, ctx, action_enable, base_options,
+   "ndctl enable-dimm  [..] []");
+
+   fprintf(stderr, "enabled %d nmem%s\n", count >= 0 ? count : 0,
+   count > 1 ? "s" : "");
+   return count >= 0 ? 0 : EXIT_FAILURE;
+}
diff --git a/ndctl/builtin-xable-dimm.c b/ndctl/builtin-xable-dimm.c
deleted file mode 100644
index f0d298837aeb..
--- a/ndctl/builtin-xable-dimm.c
+++ /dev/null
@@ -1,115 +0,0 @@
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-static const char *dimm_bus;
-
-static const struct option dimm_options[] = {
-   OPT_STRING('b', "bus", _bus, "bus-id",
-   " must be on a bus with an id/provider of "),
-   OPT_END(),
-};
-
-static const char *parse_dimm_options(int argc, const char **argv,
-   char *xable_usage)
-{
-   const char * const u[] = {
-   xable_usage,
-   NULL
-   };
-   int i;
-
-argc = parse_options(argc, argv, dimm_options, u, 0);
-
-   if (argc == 0)
-   error("specify a dimm to disable, or \"all\"\n");
-   for (i = 1; i < argc; i++)
-   error("unknown extra parameter \"%s\"\n", argv[i]);
-   if (argc == 0 || argc > 1) {
-   usage_with_options(u, dimm_options);
-   return NULL; /* we won't return from usage_with_options() */
-   }
-   return argv[0];
-}
-
-static int do_xable_dimm(const char *dimm_arg,
-   int (*xable_fn)(struct ndctl_dimm *), struct ndctl_ctx *ctx)
-{
-   int rc = -ENXIO, skip = 0, success 

[ndctl PATCH] Documentation: fix up init-labels man page

2016-10-19 Thread Dan Williams
Delete the -o option and fix the description for --force.

Signed-off-by: Dan Williams 
---
 Documentation/ndctl-init-labels.txt |7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/Documentation/ndctl-init-labels.txt b/Documentation/ndctl-init-labels.txt
index c01f31b0532a..6e5cc3578efa 100644
--- a/Documentation/ndctl-init-labels.txt
+++ b/Documentation/ndctl-init-labels.txt
@@ -70,12 +70,11 @@ Create a namespace in that region:
 OPTIONS
-------
 include::labels-options.txt[]
--o::
-   output file
 -f::
 --force::
-   parse the label data into json assuming the 'NVDIMM Namespace
-   Specification' format.
+   Force initialization of the label space even if there appears to
+   be an existing / valid namespace index.  Warning, this will
+   destroy all defined namespaces on the dimm.
 
 SEE ALSO
 

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[ndctl PATCH 5/8] libndctl: add ndctl_cmd_cfg_read_get_size()

2016-10-19 Thread Dan Williams
Allow the size of the configuration label region to be retrieved from
the cfg_read command.  This lets applications pass around a successful
cfg_read command object without needing access to the original cfg_size
command.

Signed-off-by: Dan Williams 
---
 ndctl/lib/libndctl.c   |9 +
 ndctl/lib/libndctl.sym |1 +
 ndctl/libndctl.h.in|1 +
 3 files changed, 11 insertions(+)

diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index 7d870b082cd4..82b11a0d3ac9 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -2057,6 +2057,15 @@ NDCTL_EXPORT ssize_t ndctl_cmd_cfg_read_get_data(struct ndctl_cmd *cfg_read,
return len;
 }
 
+NDCTL_EXPORT ssize_t ndctl_cmd_cfg_read_get_size(struct ndctl_cmd *cfg_read)
+{
+   if (cfg_read->type != ND_CMD_GET_CONFIG_DATA || cfg_read->status > 0)
+   return -EINVAL;
+   if (cfg_read->status < 0)
+   return cfg_read->status;
+   return cfg_read->iter.total_xfer;
+}
+
 NDCTL_EXPORT ssize_t ndctl_cmd_cfg_write_set_data(struct ndctl_cmd *cfg_write,
void *buf, unsigned int len, unsigned int offset)
 {
diff --git a/ndctl/lib/libndctl.sym b/ndctl/lib/libndctl.sym
index dfc44ae51362..6bd032ea7526 100644
--- a/ndctl/lib/libndctl.sym
+++ b/ndctl/lib/libndctl.sym
@@ -210,6 +210,7 @@ global:
ndctl_btt_is_configured;
ndctl_cmd_cfg_size_get_size;
ndctl_cmd_cfg_read_get_data;
+   ndctl_cmd_cfg_read_get_size;
ndctl_cmd_cfg_write_set_data;
ndctl_cmd_cfg_write_zero_data;
ndctl_cmd_unref;
diff --git a/ndctl/libndctl.h.in b/ndctl/libndctl.h.in
index 803368bc1acd..d3379f76e1f9 100644
--- a/ndctl/libndctl.h.in
+++ b/ndctl/libndctl.h.in
@@ -356,6 +356,7 @@ unsigned long ndctl_dimm_get_available_labels(struct ndctl_dimm *dimm);
 unsigned int ndctl_cmd_cfg_size_get_size(struct ndctl_cmd *cfg_size);
 ssize_t ndctl_cmd_cfg_read_get_data(struct ndctl_cmd *cfg_read, void *buf,
unsigned int len, unsigned int offset);
+ssize_t ndctl_cmd_cfg_read_get_size(struct ndctl_cmd *cfg_read);
 ssize_t ndctl_cmd_cfg_write_set_data(struct ndctl_cmd *cfg_write, void *buf,
unsigned int len, unsigned int offset);
 ssize_t ndctl_cmd_cfg_write_zero_data(struct ndctl_cmd *cfg_write);
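The error-handling convention of the new accessor can be exercised in
isolation; the sketch below mocks the relevant ndctl_cmd fields with
illustrative names (not the library's internal layout):

```c
#include <errno.h>

#define ND_CMD_GET_CONFIG_DATA 5	/* kernel UAPI command number */

/* Mock of the ndctl_cmd fields the accessor consults */
struct mock_cmd {
	int type;			/* command type */
	int status;			/* 0 done, >0 not yet run, <0 -errno */
	unsigned int total_xfer;	/* bytes transferred */
};

/* Mirrors ndctl_cmd_cfg_read_get_size(): a wrong command type or a
 * not-yet-submitted command yields -EINVAL, a failed command propagates
 * its error code, and success returns the transferred size. */
static long cfg_read_get_size(const struct mock_cmd *cmd)
{
	if (cmd->type != ND_CMD_GET_CONFIG_DATA || cmd->status > 0)
		return -EINVAL;
	if (cmd->status < 0)
		return cmd->status;
	return cmd->total_xfer;
}
```

A caller can therefore branch on the sign of the return value alone,
without holding on to the cfg_size command.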



[ndctl PATCH 7/8] ndctl: init-labels command

2016-10-19 Thread Dan Williams
For environments like QEMU that have label support but do not have
aliased BLK capacity, the kernel by default will ignore labels and
produce a namespace that matches the boundaries defined in the NFIT.
Kernels starting with v4.10 enable pmem-subdivision support if the
DIMM has a valid namespace label index block.

The 'ndctl init-labels' command writes an empty namespace label index
block to convert the pmem region to labeled mode, or to otherwise
repair a label area.

Cc: 
Signed-off-by: Dan Williams 
---
 Documentation/Makefile.am   |1 
 Documentation/ndctl-init-labels.txt |   83 +++
 ndctl/builtin-dimm.c|  395 +++
 ndctl/builtin.h |1 
 ndctl/ndctl.c   |1 
 5 files changed, 479 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ndctl-init-labels.txt

diff --git a/Documentation/Makefile.am b/Documentation/Makefile.am
index 63ef1ce7f2d7..4448064dd1b9 100644
--- a/Documentation/Makefile.am
+++ b/Documentation/Makefile.am
@@ -2,6 +2,7 @@ man1_MANS = \
ndctl.1 \
ndctl-zero-labels.1 \
ndctl-read-labels.1 \
+   ndctl-init-labels.1 \
ndctl-enable-region.1 \
ndctl-disable-region.1 \
ndctl-enable-dimm.1 \
diff --git a/Documentation/ndctl-init-labels.txt b/Documentation/ndctl-init-labels.txt
new file mode 100644
index ..c01f31b0532a
--- /dev/null
+++ b/Documentation/ndctl-init-labels.txt
@@ -0,0 +1,83 @@
+ndctl-init-labels(1)
+====================
+
+NAME
+----
+ndctl-init-labels - initialize the label data area on a dimm or set of dimms
+
+SYNOPSIS
+--------
+[verse]
+'ndctl init-labels' <nmem0> [<nmem1>..<nmemN>] [<options>]
+
+include::labels-description.txt[]
+By default, and in kernels prior to v4.10, the kernel only honors labels
+when a DIMM aliases PMEM and BLK capacity. Starting with v4.10 the
+kernel will honor labels for sub-dividing PMEM if all the DIMMs in an
+interleave set / region have a valid namespace index block.
+
+This command can be used to initialize the namespace index block if it
+is missing or reinitialize it if it is damaged.  Note that
+reinitialization effectively destroys all existing namespace labels on
+the DIMM.
+
+EXAMPLE
+-------
+Find the DIMMs that comprise a given region:
+[verse]
+# ndctl list -RD --region=region1
+{
+  "dimms":[
+{
+  "dev":"nmem0",
+  "id":"8680-56341200"
+}
+  ],
+  "regions":[
+{
+  "dev":"region1",
+  "size":268435456,
+  "available_size":0,
+  "type":"pmem",
+  "mappings":[
+{
+  "dimm":"nmem0",
+  "offset":13958643712,
+  "length":268435456
+}
+  ]
+}
+  ]
+}
+
+Disable that region so the DIMM label area can be written from
+userspace:
+[verse]
+# ndctl disable-region region1
+
+Initialize labels:
+[verse]
+# ndctl init-labels nmem0
+
+Re-enable the region:
+[verse]
+# ndctl enable-region region1
+
+Create a namespace in that region:
+[verse]
+# ndctl create-namespace --region=region1
+
+OPTIONS
+-------
+include::labels-options.txt[]
+-o::
+   output file
+-f::
+--force::
+   parse the label data into json assuming the 'NVDIMM Namespace
+   Specification' format.
+
+SEE ALSO
+--------
+http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf[NVDIMM Namespace
+Specification]
diff --git a/ndctl/builtin-dimm.c b/ndctl/builtin-dimm.c
index 34ad1d9b47e7..399f0c32b816 100644
--- a/ndctl/builtin-dimm.c
+++ b/ndctl/builtin-dimm.c
@@ -17,14 +17,15 @@
 #include 
 #include 
 #include 
+#include 
 #include 
-#include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
-#define CCAN_SHORT_TYPES_H
+#include 
 #include 
 #include 
 
@@ -65,6 +66,8 @@ struct namespace_label {
le32 unused;
 };
 
+static const char NSINDEX_SIGNATURE[] = "NAMESPACE_INDEX\0";
+
 struct action_context {
struct json_object *jdimms;
FILE *f_out;
@@ -344,13 +347,381 @@ static int action_read(struct ndctl_dimm *dimm, struct action_context *actx)
return rc;
 }
 
+struct nvdimm_data {
+   struct ndctl_dimm *dimm;
+   struct ndctl_cmd *cmd_read;
+   unsigned long config_size;
+   struct log_ctx ctx;
+   void *data;
+   int nsindex_size;
+   int ns_current, ns_next;
+};
+
+/*
+ * Note, best_seq(), inc_seq(), fletcher64(), sizeof_namespace_index()
+ * nvdimm_num_label_slots(), label_validate(), and label_write_index()
+ * are copied from drivers/nvdimm/label.c in the Linux kernel with the
+ * following modifications:
+ * 1/ s,nd_,,gc
+ * 2/ s,ndd->nsarea.config_size,ndd->config_size,gc
+ * 3/ s,dev_dbg(dev,dbg(ndd,gc
+ * 4/ s,__le,le,gc
+ * 5/ s,__cpu_to,cpu_to,gc
+ * 6/ remove flags argument to label_write_index
+ * 7/ dropped clear_bit_le() usage in label_write_index
+ */
+
+static u64 fletcher64(void *addr, size_t len, bool le)
+{
+   u32 *buf = addr;
+   u32 lo32 = 0;
+   u64 hi32 = 0;
+   

[ndctl PATCH 6/8] ndctl: provide a read_labels() helper

2016-10-19 Thread Dan Williams
In preparation for adding an 'init-labels' command, factor out the
common portion of reading the label space from a dimm.

Signed-off-by: Dan Williams 
---
 ndctl/builtin-dimm.c |   44 ++--
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/ndctl/builtin-dimm.c b/ndctl/builtin-dimm.c
index 6d0c343a29e3..34ad1d9b47e7 100644
--- a/ndctl/builtin-dimm.c
+++ b/ndctl/builtin-dimm.c
@@ -284,46 +284,62 @@ static int dump_bin(FILE *f_out, struct ndctl_cmd *cmd_read, ssize_t size)
return 0;
 }
 
-static int action_read(struct ndctl_dimm *dimm, struct action_context *actx)
+static struct ndctl_cmd *read_labels(struct ndctl_dimm *dimm)
 {
struct ndctl_bus *bus = ndctl_dimm_get_bus(dimm);
struct ndctl_cmd *cmd_size, *cmd_read;
-   ssize_t size;
int rc;
 
rc = ndctl_bus_wait_probe(bus);
if (rc < 0)
-   return rc;
+   return NULL;
 
cmd_size = ndctl_dimm_cmd_new_cfg_size(dimm);
if (!cmd_size)
-   return -ENOTTY;
+   return NULL;
rc = ndctl_cmd_submit(cmd_size);
if (rc || ndctl_cmd_get_firmware_status(cmd_size))
goto out_size;
 
cmd_read = ndctl_dimm_cmd_new_cfg_read(cmd_size);
-   if (!cmd_read) {
-   rc = -ENOTTY;
+   if (!cmd_read)
goto out_size;
-   }
rc = ndctl_cmd_submit(cmd_read);
if (rc || ndctl_cmd_get_firmware_status(cmd_read))
goto out_read;
 
-   size = ndctl_cmd_cfg_size_get_size(cmd_size);
+   ndctl_cmd_unref(cmd_size);
+   return cmd_read;
+
+ out_read:
+   ndctl_cmd_unref(cmd_read);
+ out_size:
+   ndctl_cmd_unref(cmd_size);
+   return NULL;
+}
+
+static int action_read(struct ndctl_dimm *dimm, struct action_context *actx)
+{
+   struct ndctl_cmd *cmd_read;
+   ssize_t size;
+   int rc = 0;
+
+   cmd_read = read_labels(dimm);
+   if (!cmd_read)
+   return -ENXIO;
+
+   size = ndctl_cmd_cfg_read_get_size(cmd_read);
if (actx->jdimms) {
struct json_object *jdimm = dump_json(dimm, cmd_read, size);
-   if (!jdimm)
-   return -ENOMEM;
-   json_object_array_add(actx->jdimms, jdimm);
+
+   if (jdimm)
+   json_object_array_add(actx->jdimms, jdimm);
+   else
+   rc = -ENOMEM;
} else
rc = dump_bin(actx->f_out, cmd_read, size);
 
- out_read:
ndctl_cmd_unref(cmd_read);
- out_size:
-   ndctl_cmd_unref(cmd_size);
 
return rc;
 }



[ndctl PATCH 8/8] ndctl: check-labels command

2016-10-19 Thread Dan Williams
Attempt to parse the namespace index block on a DIMM and return success
if it validates.

Running this command in verbose mode will report errors in the index
block.

Signed-off-by: Dan Williams 
---
 Documentation/Makefile.am|1 +
 Documentation/ndctl-check-labels.txt |   25 +
 ndctl/builtin-dimm.c |   33 ++---
 ndctl/builtin.h  |1 +
 ndctl/ndctl.c|1 +
 5 files changed, 58 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/ndctl-check-labels.txt

diff --git a/Documentation/Makefile.am b/Documentation/Makefile.am
index 4448064dd1b9..adcc9e7dde8e 100644
--- a/Documentation/Makefile.am
+++ b/Documentation/Makefile.am
@@ -3,6 +3,7 @@ man1_MANS = \
ndctl-zero-labels.1 \
ndctl-read-labels.1 \
ndctl-init-labels.1 \
+   ndctl-check-labels.1 \
ndctl-enable-region.1 \
ndctl-disable-region.1 \
ndctl-enable-dimm.1 \
diff --git a/Documentation/ndctl-check-labels.txt b/Documentation/ndctl-check-labels.txt
new file mode 100644
index ..22d219cc9afd
--- /dev/null
+++ b/Documentation/ndctl-check-labels.txt
@@ -0,0 +1,25 @@
+ndctl-check-labels(1)
+=====================
+
+NAME
+----
+ndctl-check-labels - determine if the given dimms have a valid namespace index block
+
+SYNOPSIS
+--------
+[verse]
+'ndctl check-labels' <nmem0> [<nmem1>..<nmemN>] [<options>]
+
+include::labels-description.txt[]
+In addition to checking if a label area has a valid index block, running
+this command in verbose mode reports the reason the index block is
+deemed invalid.
+
+OPTIONS
+-------
+include::labels-options.txt[]
+
+SEE ALSO
+--------
+http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf[NVDIMM Namespace
+Specification]
diff --git a/ndctl/builtin-dimm.c b/ndctl/builtin-dimm.c
index 399f0c32b816..304bd83a33da 100644
--- a/ndctl/builtin-dimm.c
+++ b/ndctl/builtin-dimm.c
@@ -652,7 +652,7 @@ static struct parameters {
bool verbose;
 } param;
 
-static int action_init(struct ndctl_dimm *dimm, struct action_context *actx)
+static int __action_init(struct ndctl_dimm *dimm, int chk_only)
 {
struct nvdimm_data __ndd, *ndd = &__ndd;
struct ndctl_cmd *cmd_read;
@@ -686,13 +686,19 @@ static int action_init(struct ndctl_dimm *dimm, struct action_context *actx)
 * another administrative action, the kernel will fail writes to
 * the label area.
 */
-   if (ndctl_dimm_is_active(dimm)) {
+   if (!chk_only && ndctl_dimm_is_active(dimm)) {
err(ndd, "regions active, abort label write\n");
rc = -EBUSY;
goto out;
}
 
-   if (label_validate(ndd) >= 0 && !param.force) {
+   rc = label_validate(ndd);
+   if (chk_only) {
+   rc = rc >= 0 ? 0 : -ENXIO;
+   goto out;
+   }
+
+   if (rc >= 0 && !param.force) {
err(ndd, "error: labels already initialized\n");
rc = -EBUSY;
goto out;
@@ -722,6 +728,17 @@ static int action_init(struct ndctl_dimm *dimm, struct action_context *actx)
return rc;
 }
 
+static int action_init(struct ndctl_dimm *dimm, struct action_context *actx)
+{
+   return __action_init(dimm, 0);
+}
+
+static int action_check(struct ndctl_dimm *dimm, struct action_context *actx)
+{
+   return __action_init(dimm, 1);
+}
+
+
 #define BASE_OPTIONS() \
 OPT_STRING('b', "bus", &param.bus, "bus-id", \
	"<nmem> must be on a bus with an id/provider of <bus-id>"), \
@@ -927,6 +944,16 @@ int cmd_init_labels(int argc, const char **argv, struct ndctl_ctx *ctx)
return count >= 0 ? 0 : EXIT_FAILURE;
 }
 
+int cmd_check_labels(int argc, const char **argv, struct ndctl_ctx *ctx)
+{
+   int count = dimm_action(argc, argv, ctx, action_check, base_options,
+   "ndctl check-labels <nmem0> [<nmem1>..<nmemN>] [<options>]");
+
+   fprintf(stderr, "successfully verified %d nmem%s\n",
+   count >= 0 ? count : 0, count > 1 ? "s" : "");
+   return count >= 0 ? 0 : EXIT_FAILURE;
+}
+
 int cmd_disable_dimm(int argc, const char **argv, struct ndctl_ctx *ctx)
 {
int count = dimm_action(argc, argv, ctx, action_disable, base_options,
diff --git a/ndctl/builtin.h b/ndctl/builtin.h
index efa90c0146ee..0293335c127e 100644
--- a/ndctl/builtin.h
+++ b/ndctl/builtin.h
@@ -21,6 +21,7 @@ int cmd_disable_dimm(int argc, const char **argv, struct ndctl_ctx *ctx);
 int cmd_zero_labels(int argc, const char **argv, struct ndctl_ctx *ctx);
 int cmd_read_labels(int argc, const char **argv, struct ndctl_ctx *ctx);
 int cmd_init_labels(int argc, const char **argv, struct ndctl_ctx *ctx);
+int cmd_check_labels(int argc, const char **argv, struct ndctl_ctx *ctx);
 int cmd_list(int argc, const char **argv, struct ndctl_ctx *ctx);
 int cmd_help(int argc, const char **argv, struct ndctl_ctx *ctx);
 #ifdef ENABLE_TEST
diff --git a/ndctl/ndctl.c b/ndctl/ndctl.c
index 

[ndctl PATCH 3/8] ndctl: glob support for label commands

2016-10-19 Thread Dan Williams
Permit a shell glob of dimm names to be passed to {read,zero}-labels.

For example:

ndctl zero-labels nmem[0-3]
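A shell normally expands such a glob before ndctl sees it; the matching
semantics can be sketched with fnmatch(3) (illustrative only — this
patch implements the expansion differently, via glob(3) over sysfs
device paths):

```c
#include <fnmatch.h>

/* Return 1 if a dimm name like "nmem2" matches a shell-style
 * pattern like "nmem[0-3]", the style of argument this patch accepts. */
static int dimm_name_matches(const char *name, const char *pattern)
{
	return fnmatch(pattern, name, 0) == 0;
}
```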

Signed-off-by: Dan Williams 
---
 ndctl/builtin-labels.c |  329 +---
 1 file changed, 170 insertions(+), 159 deletions(-)

diff --git a/ndctl/builtin-labels.c b/ndctl/builtin-labels.c
index c85069d97655..ddd3eb578341 100644
--- a/ndctl/builtin-labels.c
+++ b/ndctl/builtin-labels.c
@@ -10,6 +10,7 @@
  * FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License for
  * more details.
  */
+#include 
 #include 
 #include 
 #include 
@@ -25,6 +26,7 @@
 #include 
 #define CCAN_SHORT_TYPES_H
 #include 
+#include 
 
 enum {
NSINDEX_SIG_LEN = 16,
@@ -63,87 +65,14 @@ struct namespace_label {
le32 unused;
 };
 
-static int do_zero_dimm(struct ndctl_dimm *dimm, const char **argv, int argc,
-   bool verbose)
-{
-   struct ndctl_ctx *ctx = ndctl_dimm_get_ctx(dimm);
-   int i, rc, log;
-
-   for (i = 0; i < argc; i++)
-   if (util_dimm_filter(dimm, argv[i]))
-   break;
-   if (i >= argc)
-   return -ENODEV;
-
-   log = ndctl_get_log_priority(ctx);
-   if (verbose)
-   ndctl_set_log_priority(ctx, LOG_DEBUG);
-   rc = ndctl_dimm_zero_labels(dimm);
-   ndctl_set_log_priority(ctx, log);
-
-   return rc;
-}
+struct action_context {
+   struct json_object *jdimms;
+   FILE *f_out;
+};
 
-int cmd_zero_labels(int argc, const char **argv, struct ndctl_ctx *ctx)
+static int action_zero(struct ndctl_dimm *dimm, struct action_context *actx)
 {
-   const char *nmem_bus = NULL;
-   bool verbose = false;
-   const struct option nmem_options[] = {
-   OPT_STRING('b', "bus", &nmem_bus, "bus-id",
-   "<nmem> must be on a bus with an id/provider of <bus-id>"),
-   OPT_BOOLEAN('v',"verbose", &verbose, "turn on debug"),
-   OPT_END(),
-   };
-   const char * const u[] = {
-   "ndctl zero-labels <nmem0> [<nmem1>..<nmemN>] [<options>]",
-   NULL
-   };
-   struct ndctl_dimm *dimm;
-   struct ndctl_bus *bus;
-   int i, rc, count, err = 0;
-
-argc = parse_options(argc, argv, nmem_options, u, 0);
-
-   if (argc == 0)
-   usage_with_options(u, nmem_options);
-   for (i = 0; i < argc; i++) {
-   unsigned long id;
-
-   if (strcmp(argv[i], "all") == 0)
-   continue;
-   if (sscanf(argv[i], "nmem%lu", &id) != 1) {
-   fprintf(stderr, "unknown extra parameter \"%s\"\n",
-   argv[i]);
-   usage_with_options(u, nmem_options);
-   }
-   }
-
-   count = 0;
-ndctl_bus_foreach(ctx, bus) {
-   if (!util_bus_filter(bus, nmem_bus))
-   continue;
-
-   ndctl_dimm_foreach(bus, dimm) {
-   rc = do_zero_dimm(dimm, argv, argc, verbose);
-   if (rc == 0)
-   count++;
-   else if (rc && !err)
-   err = rc;
-   }
-   }
-   rc = err;
-
-   fprintf(stderr, "zeroed %d nmem%s\n", count, count > 1 ? "s" : "");
-
-   /*
-* 0 if all dimms zeroed, count if at least 1 dimm zeroed, < 0
-* if all errors
-*/
-   if (rc == 0)
-   return 0;
-   if (count)
-   return count;
-   return rc;
+   return ndctl_dimm_zero_labels(dimm);
 }
 
 static struct json_object *dump_label_json(struct ndctl_cmd *cmd_read, ssize_t size)
@@ -339,28 +268,16 @@ static int dump_bin(FILE *f_out, struct ndctl_cmd *cmd_read, ssize_t size)
return 0;
 }
 
-static int do_read_dimm(FILE *f_out, struct ndctl_dimm *dimm, const char **argv,
-   int argc, bool verbose, struct json_object *jdimms)
+static int action_read(struct ndctl_dimm *dimm, struct action_context *actx)
 {
-   struct ndctl_ctx *ctx = ndctl_dimm_get_ctx(dimm);
struct ndctl_bus *bus = ndctl_dimm_get_bus(dimm);
struct ndctl_cmd *cmd_size, *cmd_read;
ssize_t size;
-   int i, rc, log;
-
-   for (i = 0; i < argc; i++)
-   if (util_dimm_filter(dimm, argv[i]))
-   break;
-   if (i >= argc)
-   return -ENODEV;
-
-   log = ndctl_get_log_priority(ctx);
-   if (verbose)
-   ndctl_set_log_priority(ctx, LOG_DEBUG);
+   int rc;
 
rc = ndctl_bus_wait_probe(bus);
if (rc < 0)
-   goto out;
+   return rc;
 
cmd_size = ndctl_dimm_cmd_new_cfg_size(dimm);
if (!cmd_size)
@@ -379,116 +296,210 @@ static int do_read_dimm(FILE *f_out, struct ndctl_dimm *dimm, const char **argv,
goto out_read;
 
size = ndctl_cmd_cfg_size_get_size(cmd_size);
-

[ndctl PATCH 2/8] ndctl: consolidate label commands into a single file

2016-10-19 Thread Dan Williams
Merge ndctl/builtin-zero-labels.c and ndctl/builtin-read-labels.c into a
single file; no functional change.

Signed-off-by: Dan Williams 
---
 ndctl/Makefile.am   |3 -
 ndctl/builtin-labels.c  |   84 +++
 ndctl/builtin-zero-labels.c |   92 ---
 3 files changed, 84 insertions(+), 95 deletions(-)
 rename ndctl/{builtin-read-labels.c => builtin-labels.c} (83%)
 delete mode 100644 ndctl/builtin-zero-labels.c

diff --git a/ndctl/Makefile.am b/ndctl/Makefile.am
index 29237b76713a..dc990988ed43 100644
--- a/ndctl/Makefile.am
+++ b/ndctl/Makefile.am
@@ -10,8 +10,7 @@ ndctl_SOURCES = ndctl.c \
builtin-list.c \
builtin-test.c \
builtin-help.c \
-   builtin-zero-labels.c \
-   builtin-read-labels.c \
+   builtin-labels.c \
util/json.c
 
 if ENABLE_SMART
diff --git a/ndctl/builtin-read-labels.c b/ndctl/builtin-labels.c
similarity index 83%
rename from ndctl/builtin-read-labels.c
rename to ndctl/builtin-labels.c
index 207d44ce7f68..c85069d97655 100644
--- a/ndctl/builtin-read-labels.c
+++ b/ndctl/builtin-labels.c
@@ -1,4 +1,3 @@
-
 /*
  * Copyright (c) 2016, Intel Corporation.
  *
@@ -64,6 +63,89 @@ struct namespace_label {
le32 unused;
 };
 
+static int do_zero_dimm(struct ndctl_dimm *dimm, const char **argv, int argc,
+   bool verbose)
+{
+   struct ndctl_ctx *ctx = ndctl_dimm_get_ctx(dimm);
+   int i, rc, log;
+
+   for (i = 0; i < argc; i++)
+   if (util_dimm_filter(dimm, argv[i]))
+   break;
+   if (i >= argc)
+   return -ENODEV;
+
+   log = ndctl_get_log_priority(ctx);
+   if (verbose)
+   ndctl_set_log_priority(ctx, LOG_DEBUG);
+   rc = ndctl_dimm_zero_labels(dimm);
+   ndctl_set_log_priority(ctx, log);
+
+   return rc;
+}
+
+int cmd_zero_labels(int argc, const char **argv, struct ndctl_ctx *ctx)
+{
+   const char *nmem_bus = NULL;
+   bool verbose = false;
+   const struct option nmem_options[] = {
+   OPT_STRING('b', "bus", &nmem_bus, "bus-id",
+   "<nmem> must be on a bus with an id/provider of <bus-id>"),
+   OPT_BOOLEAN('v',"verbose", &verbose, "turn on debug"),
+   OPT_END(),
+   };
+   const char * const u[] = {
+   "ndctl zero-labels <nmem0> [<nmem1>..<nmemN>] [<options>]",
+   NULL
+   };
+   struct ndctl_dimm *dimm;
+   struct ndctl_bus *bus;
+   int i, rc, count, err = 0;
+
+argc = parse_options(argc, argv, nmem_options, u, 0);
+
+   if (argc == 0)
+   usage_with_options(u, nmem_options);
+   for (i = 0; i < argc; i++) {
+   unsigned long id;
+
+   if (strcmp(argv[i], "all") == 0)
+   continue;
+   if (sscanf(argv[i], "nmem%lu", &id) != 1) {
+   fprintf(stderr, "unknown extra parameter \"%s\"\n",
+   argv[i]);
+   usage_with_options(u, nmem_options);
+   }
+   }
+
+   count = 0;
+ndctl_bus_foreach(ctx, bus) {
+   if (!util_bus_filter(bus, nmem_bus))
+   continue;
+
+   ndctl_dimm_foreach(bus, dimm) {
+   rc = do_zero_dimm(dimm, argv, argc, verbose);
+   if (rc == 0)
+   count++;
+   else if (rc && !err)
+   err = rc;
+   }
+   }
+   rc = err;
+
+   fprintf(stderr, "zeroed %d nmem%s\n", count, count > 1 ? "s" : "");
+
+   /*
+* 0 if all dimms zeroed, count if at least 1 dimm zeroed, < 0
+* if all errors
+*/
+   if (rc == 0)
+   return 0;
+   if (count)
+   return count;
+   return rc;
+}
+
 static struct json_object *dump_label_json(struct ndctl_cmd *cmd_read, ssize_t size)
 {
struct json_object *jarray = json_object_new_array();
diff --git a/ndctl/builtin-zero-labels.c b/ndctl/builtin-zero-labels.c
deleted file mode 100644
index 2fe466e6ba34..
--- a/ndctl/builtin-zero-labels.c
+++ /dev/null
@@ -1,92 +0,0 @@
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-static int do_zero_dimm(struct ndctl_dimm *dimm, const char **argv, int argc,
-   bool verbose)
-{
-   struct ndctl_ctx *ctx = ndctl_dimm_get_ctx(dimm);
-   int i, rc, log;
-
-   for (i = 0; i < argc; i++)
-   if (util_dimm_filter(dimm, argv[i]))
-   break;
-   if (i >= argc)
-   return -ENODEV;
-
-   log = ndctl_get_log_priority(ctx);
-   if (verbose)
-   ndctl_set_log_priority(ctx, LOG_DEBUG);
-   rc = ndctl_dimm_zero_labels(dimm);
-   

[ndctl PATCH 1/8] libndctl: fix error returns for unsigned apis

2016-10-19 Thread Dan Williams
For unsigned attributes the library inconsistently uses ULLONG_MAX /
ULONG_MAX as the error return.  Fix up ndctl_region_get_available_size()
and ndctl_dimm_get_available_labels() to report errors following this
scheme.

Signed-off-by: Dan Williams 
---
 ndctl/lib/libndctl.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index 02524501db16..7d870b082cd4 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -1707,11 +1707,11 @@ NDCTL_EXPORT unsigned long long ndctl_region_get_available_size(
	if (snprintf(path, len, "%s/available_size", region->region_path) >= len) {
err(ctx, "%s: buffer too small!\n",
ndctl_region_get_devname(region));
-   return -ENOMEM;
+   return ULLONG_MAX;
}
 
if (sysfs_read_attr(ctx, path, buf) < 0)
-   return -ENXIO;
+   return ULLONG_MAX;
 
return strtoull(buf, NULL, 0);
 }
@@ -2519,11 +2519,11 @@ NDCTL_EXPORT unsigned long ndctl_dimm_get_available_labels(
if (snprintf(path, len, "%s/available_slots", dimm->dimm_path) >= len) {
err(ctx, "%s: buffer too small!\n",
ndctl_dimm_get_devname(dimm));
-   return -ENOMEM;
+   return ULONG_MAX;
}
 
if (sysfs_read_attr(ctx, path, buf) < 0)
-   return 0;
+   return ULONG_MAX;
 
return strtoul(buf, NULL, 0);
 }
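Callers of these unsigned APIs test the sentinel value rather than a
negative errno.  A minimal sketch of the convention — read_size_attr()
is a hypothetical stand-in for the sysfs read plus strtoull() parse
above:

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical reader: NULL input models a failed sysfs_read_attr().
 * Errors are signaled with ULLONG_MAX, never with a negative errno,
 * since the return type is unsigned. */
static unsigned long long read_size_attr(const char *buf)
{
	if (!buf)
		return ULLONG_MAX;	/* error sentinel */
	return strtoull(buf, NULL, 0);
}
```

A caller then tests `if (size == ULLONG_MAX)` instead of casting the
result to a signed type.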



[PATCH 0/2] libnvdimm: dynamic label support

2016-10-19 Thread Dan Williams
Allow label support to be dynamically enabled for PMEM regions that are
not aliased with BLK.  By default these PMEM regions are surfaced as a
single, label-less namespace.  However, if the backing dimm(s) for the
region support labels, then PMEM sub-division support can be enabled by:
  1/ disable the pmem region
  2/ write a valid namespace index block to the label area
  3/ re-enable the region and create a namespace

See the man page for the "ndctl init-labels" command.

---

Dan Williams (2):
  libnvdimm: allow a platform to force enable label support
  tools/testing/nvdimm: dynamic label support


 drivers/nvdimm/dimm.c|2 ++
 drivers/nvdimm/dimm_devs.c   |7 +++
 drivers/nvdimm/nd.h  |1 +
 tools/testing/nvdimm/test/nfit.c |   30 +++---
 4 files changed, 33 insertions(+), 7 deletions(-)


[PATCH 1/2] libnvdimm: allow a platform to force enable label support

2016-10-19 Thread Dan Williams
Platforms like QEMU-KVM implement an NFIT table and label DSMs.
However, since that environment does not define an aliased
configuration, the labels are currently ignored and the kernel registers
a single full-sized pmem-namespace per region. Now that the kernel
supports sub-divisions of pmem regions the labels have a purpose.
Arrange for the labels to be honored when we find an existing / valid
namespace index block.

Cc: 
Cc: Haozhong Zhang 
Cc: Xiao Guangrong 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/dimm.c  |2 ++
 drivers/nvdimm/dimm_devs.c |7 +++
 drivers/nvdimm/nd.h|1 +
 3 files changed, 10 insertions(+)

diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 619834e144d1..ee0b412827bf 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -64,6 +64,8 @@ static int nvdimm_probe(struct device *dev)
nd_label_copy(ndd, to_next_namespace_index(ndd),
to_current_namespace_index(ndd));
rc = nd_label_reserve_dpa(ndd);
+   if (ndd->ns_current >= 0)
+   nvdimm_set_aliasing(dev);
nvdimm_bus_unlock(dev);
 
if (rc)
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index d614493ad5ac..0eedc49e0d47 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -184,6 +184,13 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
return rc;
 }
 
+void nvdimm_set_aliasing(struct device *dev)
+{
+   struct nvdimm *nvdimm = to_nvdimm(dev);
+
+   nvdimm->flags |= NDD_ALIASING;
+}
+
 static void nvdimm_release(struct device *dev)
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 065abf1b8f32..35dd75057e16 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -238,6 +238,7 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
void *buf, size_t len);
 long nvdimm_clear_poison(struct device *dev, phys_addr_t phys,
unsigned int len);
+void nvdimm_set_aliasing(struct device *dev);
 struct nd_btt *to_nd_btt(struct device *dev);
 
 struct nd_gen_sb {



Re: [PATCH v3 2/2] libnvdimm: use generic iostat interfaces

2016-10-19 Thread Dan Williams
On Wed, Oct 19, 2016 at 7:19 AM, Toshi Kani  wrote:
> nd_iostat_start() and nd_iostat_end() implement the same functionality
> that generic_start_io_acct() and generic_end_io_acct() already provide.
>
> Change nd_iostat_start() and nd_iostat_end() to call the generic iostat
> interfaces.  There is no change in the nd interfaces.
>
> Signed-off-by: Toshi Kani 
> Cc: Andrew Morton 
> Cc: Dan Williams 
> Cc: Alexander Viro 
> Cc: Dave Chinner 
> Cc: Ross Zwisler 

Thanks, applied to for-4.10/libnvdimm.


[PATCH v3 2/2] libnvdimm: use generic iostat interfaces

2016-10-19 Thread Toshi Kani
nd_iostat_start() and nd_iostat_end() implement the same functionality
that generic_start_io_acct() and generic_end_io_acct() already provide.

Change nd_iostat_start() and nd_iostat_end() to call the generic iostat
interfaces.  There is no change in the nd interfaces.

Signed-off-by: Toshi Kani 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Alexander Viro 
Cc: Dave Chinner 
Cc: Ross Zwisler 
---
 drivers/nvdimm/core.c |   29 -
 drivers/nvdimm/nd.h   |   11 +--
 2 files changed, 9 insertions(+), 31 deletions(-)

diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 7ceba08..9303cfe 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -317,35 +317,6 @@ ssize_t nd_sector_size_store(struct device *dev, const char *buf,
}
 }
 
-void __nd_iostat_start(struct bio *bio, unsigned long *start)
-{
-   struct gendisk *disk = bio->bi_bdev->bd_disk;
-   const int rw = bio_data_dir(bio);
-   int cpu = part_stat_lock();
-
-   *start = jiffies;
-   part_round_stats(cpu, &disk->part0);
-   part_stat_inc(cpu, &disk->part0, ios[rw]);
-   part_stat_add(cpu, &disk->part0, sectors[rw], bio_sectors(bio));
-   part_inc_in_flight(&disk->part0, rw);
-   part_stat_unlock();
-}
-EXPORT_SYMBOL(__nd_iostat_start);
-
-void nd_iostat_end(struct bio *bio, unsigned long start)
-{
-   struct gendisk *disk = bio->bi_bdev->bd_disk;
-   unsigned long duration = jiffies - start;
-   const int rw = bio_data_dir(bio);
-   int cpu = part_stat_lock();
-
-   part_stat_add(cpu, &disk->part0, ticks[rw], duration);
-   part_round_stats(cpu, &disk->part0);
-   part_dec_in_flight(&disk->part0, rw);
-   part_stat_unlock();
-}
-EXPORT_SYMBOL(nd_iostat_end);
-
 static ssize_t commands_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index d3b2fca..065abf1 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -377,10 +377,17 @@ static inline bool nd_iostat_start(struct bio *bio, unsigned long *start)
if (!blk_queue_io_stat(disk->queue))
return false;
 
-   __nd_iostat_start(bio, start);
+   *start = jiffies;
+   generic_start_io_acct(bio_data_dir(bio),
+ bio_sectors(bio), &disk->part0);
return true;
 }
-void nd_iostat_end(struct bio *bio, unsigned long start);
+static inline void nd_iostat_end(struct bio *bio, unsigned long start)
+{
+   struct gendisk *disk = bio->bi_bdev->bd_disk;
+
+   generic_end_io_acct(bio_data_dir(bio), &disk->part0, start);
+}
 static inline bool is_bad_pmem(struct badblocks *bb, sector_t sector,
unsigned int len)
 {


[PATCH v3 1/2] DAX: enable iostat for read/write

2016-10-19 Thread Toshi Kani
DAX IO path does not support iostat, but its metadata IO path does.
Therefore, iostat shows metadata IO statistics only, which has been
confusing to users.

Add iostat support to the DAX read/write path.

Note, iostat still does not support the DAX mmap path, as it allows
user applications to access the storage directly.

Signed-off-by: Toshi Kani 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Alexander Viro 
Cc: Dave Chinner 
Cc: Ross Zwisler 
---
 fs/dax.c |   15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 014defd..2646969 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -265,9 +265,12 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
ssize_t retval = -EINVAL;
loff_t pos = iocb->ki_pos;
loff_t end = pos + iov_iter_count(iter);
+   struct gendisk *disk;
+   unsigned long start = 0;
 
memset(&bh, 0, sizeof(bh));
bh.b_bdev = inode->i_sb->s_bdev;
+   disk = bh.b_bdev->bd_disk;
 
if ((flags & DIO_LOCKING) && iov_iter_rw(iter) == READ)
inode_lock(inode);
@@ -276,8 +279,20 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
if (!(flags & DIO_SKIP_DIO_COUNT))
inode_dio_begin(inode);
 
+   if (blk_queue_io_stat(disk->queue)) {
+   int sec = iov_iter_count(iter) >> 9;
+
+   start = jiffies;
+   generic_start_io_acct(iov_iter_rw(iter),
+ (!sec) ? 1 : sec, &disk->part0);
+   }
+
retval = dax_io(inode, iter, pos, end, get_block, &bh);
 
+   if (start)
+   generic_end_io_acct(iov_iter_rw(iter),
+   &disk->part0, start);
+
if ((flags & DIO_LOCKING) && iov_iter_rw(iter) == READ)
inode_unlock(inode);
 
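The `iov_iter_count(iter) >> 9` conversion above divides the byte count by
512 to get sectors, and the `(!sec) ? 1 : sec` expression rounds a
sub-sector request up to one sector so the I/O still registers in the
statistics. As a plain sketch (the helper name is illustrative, not part of
the patch):

```c
#include <assert.h>

/* Byte-to-sector conversion as used in the hunk above: shift by 9
 * (512-byte sectors) and round a sub-sector request up to one sector
 * so the I/O is still counted. Helper name is illustrative only. */
static inline int dax_acct_sectors(unsigned long bytes)
{
	int sec = bytes >> 9;

	return sec ? sec : 1;
}
```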


Re: [PATCH v2 34/37] docs: fix locations of several documents that got moved

2016-10-19 Thread Mauro Carvalho Chehab
Em Wed, 19 Oct 2016 12:34:42 +0200
Pavel Machek  escreveu:

> Hi!
> 
> 
> > --- a/Documentation/ABI/testing/sysfs-kernel-slab
> > +++ b/Documentation/ABI/testing/sysfs-kernel-slab
> > @@ -347,7 +347,7 @@ Description:
> > because of fragmentation, SLUB will retry with the minimum order
> > possible depending on its characteristics.
> > When debug_guardpage_minorder=N (N > 0) parameter is specified
> > -   (see Documentation/kernel-parameters.txt), the minimum possible
> > -   (see Documentation/admin-guide/kernel-parameters.rst), the minimum possible
> > order is used and this sysfs entry can not be used to change
> > the order at run time.  
> 
> Dunno, but kernel-parameters.txt was already quite long... for a file
> that is referenced quite often. Adding admin-guide/ into the path does
> not really help.

The long string starts with Documentation/ :) There are some discussions
about changing it to doc/ (or docs/). Also, as you said, kernel-parameters
is already a long name. Perhaps we could use "kernel-parms" instead.

If we rename kernel-parameters.rst to kernel-parms.rst, plus the doc/ rename,
the string actually gets shorter:

-   (see Documentation/kernel-parameters.txt), the minimum possible
+   (see doc/admin-guide/kernel-parms.rst), the minimum possible

> Maybe admin-guide should go directly into Documentation/ , as that's
> what our users are interested in?

There are several problems if we keep them at Documentation/ dir:

- We'll end by mixing documents already converted to ReST with documents
  not converted yet;

- A rename is needed anyway, as Sphinx only accepts ReST files that end
  with the extension(s) defined at Documentation/conf.py (currently,
  .rst);

- A partial documentation build is made by sub-directory. If we put
  files under /Documentation, there's no way to build just one
  book.

> Plus, I'm not sure how many developers will figure out that process/
> is what describes kernel patch submission process. We have processes
> in the kernel, after all...

Do you have a better idea?

> Could we leave symlinks in place? 

My initial patch did that. It created symlinks in Documentation/user
(which was the previous name for the admin's guide).

The big issue is how to identify which files at Documentation/ have
not been converted yet.

On this patch series, we opted to move the file and keep just a
reference to the most relevant ones, pointing to the new location:

https://git.linuxtv.org/mchehab/experimental.git/commit/?h=lkml-books-v2=217462c76ee1c12b45750723059a3461018b6975

https://git.linuxtv.org/mchehab/experimental.git/commit/?h=lkml-books-v2=d15595a318356804ed1bc42f509e72de9d031513

We could do something similar to Documentation/kernel-parameters.txt.
Yet, IMHO, it is best to keep this to the absolute minimum of files, as it
also makes it harder to identify which txt files still require conversion.

> People say "please follow
> CodingStyle" when reviewing patches. Saying "please follow
> process/coding-style.rst" ... somehow I don't think that's going to
> happen.

As we have a Documentation/CodingStyle (pointing to the new location),
this should still work.

That said, with my maintainer's hat on, when I review patches from a newbie,
I actually prefer to point them to some location with the current
practices, as they usually don't know much about the kernel's way of
receiving patches. So, I reply with something like:

"Please read this:
https://mchehab.fedorapeople.org/kernel_docs/process/index.html
 with instructions about how to submit your work"

For an old contributor, I just say: "please follow the coding style"
or I reply with the output of checkpatch.pl.

Maybe it is just me, but I very much prefer to point to some URL with
everything packed together than to do several reviews until someone
RTFM (Reads The Fine Manual) and starts to submit it right.

Thanks,
Mauro


Re: [PATCH 19/20] dax: Protect PTE modification on WP fault by radix tree entry lock

2016-10-19 Thread Jan Kara
On Tue 18-10-16 13:53:32, Ross Zwisler wrote:
> On Tue, Sep 27, 2016 at 06:08:23PM +0200, Jan Kara wrote:
> > -   void *entry;
> > +   void *entry, **slot;
> > pgoff_t index = vmf->pgoff;
> >  
> > spin_lock_irq(&mapping->tree_lock);
> > -   entry = get_unlocked_mapping_entry(mapping, index, NULL);
> > -   if (!entry || !radix_tree_exceptional_entry(entry))
> > -   goto out;
> > +   entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +   if (!entry || !radix_tree_exceptional_entry(entry)) {
> > +   if (entry)
> > +   put_unlocked_mapping_entry(mapping, index, entry);
> 
> I don't think you need this call to put_unlocked_mapping_entry().  If we get
> in here we know that 'entry' is a page cache page, in which case
> put_unlocked_mapping_entry() will just return without doing any work.

Right, but that is just an implementation detail internal to how the
locking works. To avoid issues, the rules are kept simple and the invariant
is: once you call get_unlocked_mapping_entry(), you either have to lock the
entry and then call put_locked_mapping_entry(), or you have to drop it with
put_unlocked_mapping_entry(). Once you start adding exceptions based on
entry types etc., errors become much easier to make...
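The invariant Jan describes can be modeled as a tiny state machine: after
get_unlocked you must leave via exactly one of the two put paths. A toy
user-space sketch (state names and helpers are illustrative, not the DAX
code):

```c
#include <assert.h>

/* Toy model of the entry-lock protocol discussed above: after
 * get_unlocked_mapping_entry() the caller either locks the entry and
 * later calls put_locked_mapping_entry(), or drops it with
 * put_unlocked_mapping_entry(). Asserts catch protocol violations. */
enum entry_state { ENTRY_IDLE, ENTRY_HELD_UNLOCKED, ENTRY_HELD_LOCKED };

static enum entry_state model_get_unlocked(enum entry_state s)
{
	assert(s == ENTRY_IDLE);
	return ENTRY_HELD_UNLOCKED;
}

static enum entry_state model_lock_entry(enum entry_state s)
{
	assert(s == ENTRY_HELD_UNLOCKED);
	return ENTRY_HELD_LOCKED;
}

static enum entry_state model_put_unlocked(enum entry_state s)
{
	assert(s == ENTRY_HELD_UNLOCKED);
	return ENTRY_IDLE;
}

static enum entry_state model_put_locked(enum entry_state s)
{
	assert(s == ENTRY_HELD_LOCKED);
	return ENTRY_IDLE;
}
```

Either exit path returns the state machine to idle; skipping the put, or
calling the wrong one, trips an assert — mirroring how the simple rule
makes misuse easy to spot.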

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 18/20] dax: Make cache flushing protected by entry lock

2016-10-19 Thread Jan Kara
On Tue 18-10-16 13:20:13, Ross Zwisler wrote:
> On Tue, Sep 27, 2016 at 06:08:22PM +0200, Jan Kara wrote:
> > Currently, flushing of caches for DAX mappings was ignoring entry lock.
> > So far this was ok (modulo a bug that a difference in entry lock could
> > cause cache flushing to be mistakenly skipped) but in the following
> > patches we will write-protect PTEs on cache flushing and clear dirty
> > tags. For that we will need more exclusion. So do cache flushing under
> > an entry lock. This allows us to remove one lock-unlock pair of
> > mapping->tree_lock as a bonus.
> > 
> > Signed-off-by: Jan Kara 
> 
> > @@ -716,15 +736,13 @@ static int dax_writeback_one(struct block_device *bdev,
> > }
> >  
> > wb_cache_pmem(dax.addr, dax.size);
> > -
> > -   spin_lock_irq(&mapping->tree_lock);
> > -   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> > -   spin_unlock_irq(&mapping->tree_lock);
> > - unmap:
> > +unmap:
> > dax_unmap_atomic(bdev, &dax);
> > +   put_locked_mapping_entry(mapping, index, entry);
> > return ret;
> >  
> > - unlock:
> > +put_unlock:
> 
> I know there's an ongoing debate about this, but can you please stick a space
> in front of the labels to make the patches pretty & to be consistent with the
> rest of the DAX code?

OK, done.

> Reviewed-by: Ross Zwisler 

Thanks!

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 16/20] mm: Provide helper for finishing mkwrite faults

2016-10-19 Thread Jan Kara
On Tue 18-10-16 12:35:25, Ross Zwisler wrote:
> On Tue, Sep 27, 2016 at 06:08:20PM +0200, Jan Kara wrote:
> > Provide a helper function for finishing write faults due to PTE being
> > read-only. The helper will be used by DAX to avoid the need of
> > complicating generic MM code with DAX locking specifics.
> > 
> > Signed-off-by: Jan Kara 
> > ---
> >  include/linux/mm.h |  1 +
> >  mm/memory.c| 65 +++---
> >  2 files changed, 39 insertions(+), 27 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 1055f2ece80d..e5a014be8932 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -617,6 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct 
> > vm_area_struct *vma)
> >  int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
> > struct page *page);
> >  int finish_fault(struct vm_fault *vmf);
> > +int finish_mkwrite_fault(struct vm_fault *vmf);
> >  #endif
> >  
> >  /*
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f49e736d6a36..8c8cb7f2133e 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2266,6 +2266,36 @@ oom:
> > return VM_FAULT_OOM;
> >  }
> >  
> > +/**
> > + * finish_mkrite_fault - finish page fault making PTE writeable once the page
>   finish_mkwrite_fault

Fixed, thanks.

> > @@ -2315,26 +2335,17 @@ static int wp_page_shared(struct vm_fault *vmf)
> > put_page(vmf->page);
> > return tmp;
> > }
> > -   /*
> > -* Since we dropped the lock we need to revalidate
> > -* the PTE as someone else may have changed it.  If
> > -* they did, we just return, as we can count on the
> > -* MMU to tell us if they didn't also make it writable.
> > -*/
> > -   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > -   vmf->address, &vmf->ptl);
> > -   if (!pte_same(*vmf->pte, vmf->orig_pte)) {
> > +   tmp = finish_mkwrite_fault(vmf);
> > +   if (unlikely(!tmp || (tmp &
> > + (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
> 
> The 'tmp' return from finish_mkwrite_fault() can only be 0 or VM_FAULT_WRITE.
> I think this test should just be 
> 
>   if (unlikely(!tmp)) {

Right, finish_mkwrite_fault() cannot currently return errors other than
"retry needed", which is indicated by tmp == 0. However, I'd prefer to keep
symmetry with the finish_fault() handler, which can return other errors,
and be prepared to handle them from finish_mkwrite_fault() as well.
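The check under discussion bails out both when tmp == 0 ("retry needed")
and when any error/nopage bit is set. A user-space model of that predicate,
using illustrative bit values rather than the kernel's VM_FAULT_* codes:

```c
#include <assert.h>

/* Model of the return-value test discussed above: tmp == 0 means
 * "retry needed", and any error/nopage bit also takes the bail-out
 * path. Bit values are illustrative, not the kernel's VM_FAULT_*. */
#define MODEL_FAULT_WRITE  0x01
#define MODEL_FAULT_ERROR  0x02
#define MODEL_FAULT_NOPAGE 0x04

static int mkwrite_must_bail(int tmp)
{
	return !tmp || (tmp & (MODEL_FAULT_ERROR | MODEL_FAULT_NOPAGE));
}
```

With the current code only 0 and the write bit are ever returned, so only
the `!tmp` arm fires; the mask keeps the predicate correct if error returns
are added later, which is the symmetry argument above.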

Honza
-- 
Jan Kara 
SUSE Labs, CR

