Re: [Linux-nvdimm] another pmem variant

2015-03-25 Thread Dan Williams
On Wed, Mar 25, 2015 at 10:04 AM, Christoph Hellwig h...@lst.de wrote:
 On Wed, Mar 25, 2015 at 10:00:26AM -0700, Dan Williams wrote:
 The kernel command line would simply be the standard/existing memmap=
 to reserve a memory range.  Then, when the platform device loads, it
 does a request_firmware() to inject a binary table that further carves
 memory into ranges to which the pmem driver attaches.  No need for the
 legacy system BIOS to be upgraded to the new way.

 Ewww...

 It does do the right thing in kernel space.  The userspace utility
 creates the binary table (once) that can be compiled into the platform
 device driver or auto-loaded by an initrd.  The problem with a new
 memmap= is that it is too coarse.  For example you can't do things
 like specify a pmem range per-NUMA node.

 Sure you can as long as you know the layout.  memmap= can be specified
 multiple times.   Again, I see absolutely zero benefit of doing crap
 like request_firmware() to convert interface, and I'm also tired of
 having this talk about code that will eventually be released and should
 be superior (and from all that I can guess so far will actually be far
 worse).

You and me both...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-nvdimm] [PATCH 3/3] x86: add support for the non-standard protected e820 type

2015-03-25 Thread Dan Williams
On Wed, Mar 25, 2015 at 1:23 PM, Ross Zwisler
ross.zwis...@linux.intel.com wrote:
 On Wed, 2015-03-25 at 17:04 +0100, Christoph Hellwig wrote:
 Various recent bioses support NVDIMMs or ADR using a non-standard
 e820 memory type, and Intel supplied reference Linux code using this
 type to various vendors.

 Wire this e820 table type up to export platform devices for the pmem
 driver so that we can use it in Linux, and also provide a memmap=
 argument to manually tag memory as protected, which can be used
 if the bios doesn't use the standard nonstandard interface, or
 we just want to test the pmem driver with regular memory.

 Based on an earlier patch from Dave Jiang dave.ji...@intel.com
 Signed-off-by: Christoph Hellwig h...@lst.de

 snip

 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
 index b7d31ca..93a27e4 100644
 --- a/arch/x86/Kconfig
 +++ b/arch/x86/Kconfig
 @@ -1430,6 +1430,19 @@ config ILLEGAL_POINTER_VALUE

  source mm/Kconfig

 +config X86_PMEM_LEGACY
 +	bool "Support non-standard NVDIMMs and ADR protected memory"
 +	help
 +	  Treat memory marked using the non-standard e820 type of 12 as used
 +	  by the Intel Sandy Bridge-EP reference BIOS as protected memory.
 +	  The kernel will then offer these regions to the pmem driver so
 +	  they can be used for persistent storage.
 +
 +	  If you say N the kernel will treat the ADR region like an e820
 +	  reserved region.
 +
 +	  Say Y if unsure

 Would it make sense to have this default to y, or is that too strong?

We never default new enabling to y.  Maybe some exceptions, but this
isn't one of them in my mind.


Re: [Linux-nvdimm] [PATCH 3/3] x86: add support for the non-standard protected e820 type

2015-03-25 Thread Dan Williams
On Wed, Mar 25, 2015 at 9:04 AM, Christoph Hellwig h...@lst.de wrote:
 Various recent bioses support NVDIMMs or ADR using a non-standard
 e820 memory type, and Intel supplied reference Linux code using this
 type to various vendors.

 Wire this e820 table type up to export platform devices for the pmem
 driver so that we can use it in Linux, and also provide a memmap=
 argument to manually tag memory as protected, which can be used
 if the bios doesn't use the standard nonstandard interface, or
 we just want to test the pmem driver with regular memory.

 Based on an earlier patch from Dave Jiang dave.ji...@intel.com
 Signed-off-by: Christoph Hellwig h...@lst.de
 ---
[..]
 +static __init int register_pmem_devices(void)
 +{
 +	int i;
 +
 +	for (i = 0; i < e820.nr_map; i++) {
 +		struct e820entry *ei = &e820.map[i];
 +
 +		if (ei->type == E820_PROTECTED_KERN) {
 +			struct resource res = {
 +				.flags	= IORESOURCE_MEM,
 +				.start	= ei->addr,
 +				.end	= ei->addr + ei->size - 1,
 +			};
 +			register_pmem_device(&res);
 +		}
 +	}
 +
 +	return 0;
 +}

Aside from the s/E820_PROTECTED_KERN/E820_PMEM/ suggestion this looks
ok to me.  The vaporware new way can be a superset of this
mechanism.
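
For readers who want to poke at the loop above outside the kernel, here is a minimal userspace sketch of the same scan. It assumes nothing beyond what the patch shows: walk the table, pick out type-12 entries, and turn each into an inclusive start/end range. All `sketch_*` and `SKETCH_*` names are made up for illustration and are not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_E820_RAM    1
#define SKETCH_E820_TYPE12 12	/* the non-standard type discussed above */

struct sketch_e820_entry {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
};

struct sketch_range {
	uint64_t start;
	uint64_t end;	/* inclusive, like struct resource */
};

/*
 * Copy every non-empty type-12 entry into out[] as an inclusive range.
 * Returns the number of ranges written, never more than max.
 */
size_t sketch_collect_pmem(const struct sketch_e820_entry *map, size_t nr,
			   struct sketch_range *out, size_t max)
{
	size_t found = 0;

	assert(map != NULL || nr == 0);
	assert(out != NULL || max == 0);

	for (size_t i = 0; i < nr && found < max; i++) {
		if (map[i].type != SKETCH_E820_TYPE12 || map[i].size == 0)
			continue;
		out[found].start = map[i].addr;
		out[found].end = map[i].addr + map[i].size - 1;
		found++;
	}
	return found;
}
```

The `- 1` matters: a 1 GiB region starting at 4 GiB ends at 0x13fffffff, not 0x140000000.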


Re: [Linux-nvdimm] [PATCH 4/6] SQUSHME: pmem: Micro cleaning

2015-03-31 Thread Dan Williams
On Tue, Mar 31, 2015 at 8:24 AM, Boaz Harrosh b...@plexistor.com wrote:
 On 03/31/2015 06:17 PM, Dan Williams wrote:
 On Tue, Mar 31, 2015 at 6:27 AM, Boaz Harrosh b...@plexistor.com wrote:

 Some error checks had unlikely some did not. Put unlikely
 on all error handling paths.
 (I like unlikely for error paths specially for readability)

 unlikely() is not a readability hint; it's specifically for branches
 where profiling shows that adding it makes a difference.  Just delete
 them all until profiling shows they make a difference.  They certainly
 don't make a difference in the slow paths.


 Why?

Because the compiler and CPU already do a decent job, and if you get
the frequency wrong it can hurt performance [1].

It's premature optimization to sprinkle them around, especially in slow paths.

[1]: https://lwn.net/Articles/420019/
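
For context, a small sketch of what the hint is. The two macros have the same shape as the kernel's and rely on the GCC/Clang `__builtin_expect` builtin; the function around them is a made-up example, and whether the "rare" branch really is rare is exactly what profiling would have to confirm.

```c
#include <assert.h>
#include <stddef.h>

/* Branch-prediction hints; the value of the expression is unchanged. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/*
 * Hypothetical hot loop: sum the non-negative entries and count the rest.
 * The hint only affects code layout, never the result.
 */
long sketch_sum_nonnegative(const int *v, size_t n, size_t *skipped)
{
	long total = 0;

	assert(v != NULL || n == 0);
	assert(skipped != NULL);

	*skipped = 0;
	for (size_t i = 0; i < n; i++) {
		if (unlikely(v[i] < 0)) {
			(*skipped)++;
			continue;
		}
		total += v[i];
	}
	return total;
}
```

Removing `unlikely()` from the condition gives the same results; the only possible difference is in generated code layout, which is why it is not a readability annotation.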


Re: [Linux-nvdimm] [PATCH 4/6] SQUSHME: pmem: Micro cleaning

2015-03-31 Thread Dan Williams
On Tue, Mar 31, 2015 at 6:27 AM, Boaz Harrosh b...@plexistor.com wrote:

 Some error checks had unlikely some did not. Put unlikely
 on all error handling paths.
 (I like unlikely for error paths specially for readability)

unlikely() is not a readability hint; it's specifically for branches
where profiling shows that adding it makes a difference.  Just delete
them all until profiling shows they make a difference.  They certainly
don't make a difference in the slow paths.


Re: [Linux-nvdimm] another pmem variant V2

2015-03-31 Thread Dan Williams
On Tue, Mar 31, 2015 at 10:24 AM, Christoph Hellwig h...@lst.de wrote:
 On Tue, Mar 31, 2015 at 06:44:56PM +0200, Ingo Molnar wrote:
 I'd be fine with that too - mind sending an updated series?

 I will send an updated one tonight or early tomorrow.

 Btw, do you want to keep the E820_PRAM name instead of E820_PMEM?
 Seems like most people either don't care or prefer E820_PMEM. I'm
 fine either way.

FWIW, I like the idea of having a separate E820_PRAM name for type-12
memory versus a future UEFI memory type that can't yet be disclosed.
The E820_PRAM type potentially has the property of being relegated to
legacy NVDIMMs.  We can later add E820_PMEM as a memory type that, for
example, is not automatically backed by struct page.  That said, I'm
fine either way.


Re: [PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY

2015-02-23 Thread Dan Williams
On Mon, 2015-02-23 at 14:33 +0200, Boaz Harrosh wrote:
 There is something not very nice (Gentlemen nice) In current
 e820.c code.
 
 At Multiple places for example like (@ memblock_x86_fill()) it will
 add the different memory resources *except the E820_RESERVED type*
 
 Then at e820_reserve_resources() it will mark all !E820_RESERVED
 as busy.
 
 This is all fine when we have only the known types one of:
   E820_RESERVED_KERN:
   E820_RAM:
   E820_ACPI:
   E820_NVS:
   E820_UNUSABLE:
   E820_RESERVED:
 
 But if the system encounters a brand new memory type it will
 not add it to any memory list, But will proceed to mark it
 BUSY. So now any other Driver in the system that does know
 how to deal with this new type, is not able to call
 request_mem_region_exclusive() on this new type because it is
 hard coded BUSY even though nothing really uses it.
 
 So make any unknown type behave like E820_RESERVED memory,
 it will show up as available to first caller of
 request_mem_region_exclusive().
 
 I Also change the string representation of an unknown type
 from reserved (So to not confuse with memmap reserved
 region). And call it reserved-unknown
 I wish I could return reserved-type-X But this is not possible
 because one must return a constant, code-segment, string.
 
 (NOTE: These unknown-types where called reserved in
  /proc/iomem and in dmesg but behaved differently. What this
  patch does is name them differently but let them behave
  the same)
 
 By Popular demand An Extra WARNING message is printed if
 an UNKNOWN is found. It will look like this:
   e820: WARNING [mem 0x1-0x1] is unknown type 12

I don't think we need to warn that an unknown range was published, just
warn if it is consumed.

Something like these incremental changes.  I don't see the need for
patch 2 or either version of patch 3.

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 1afa5518baa6..2e755a92d84f 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -134,11 +134,6 @@ static void __init __e820_add_region(struct e820map *e820x, u64 start, u64 size,
 		return;
 	}
 
-	if (unlikely(_is_unknown_type(type)))
-		pr_warn("e820: WARNING [mem %#010llx-%#010llx] is unknown type %d\n",
-		       (unsigned long long) start,
-		       (unsigned long long) (start + size - 1), type);
-
 	e820x->map[x].addr = start;
 	e820x->map[x].size = size;
 	e820x->map[x].type = type;
@@ -938,7 +933,7 @@ static inline const char *e820_type_to_string(int e820_type)
 	case E820_NVS:	return "ACPI Non-volatile Storage";
 	case E820_UNUSABLE:	return "Unusable memory";
 	case E820_RESERVED:	return "reserved";
-	default:	return "reserved-unkown";
+	default:	return iomem_unknown_resource_name;
 	}
 }
 
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 2c525078..d857e79b4bf2 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -194,6 +194,9 @@ extern struct resource * __request_region(struct resource *,
resource_size_t n,
const char *name, int flags);
 
+/* For uniquely tagging unknown memory so we can warn when it is consumed */
+extern const char iomem_unknown_resource_name[];
+
 /* Compatibility cruft */
 #define release_region(start,n)	__release_region(&ioport_resource, (start), (n))
 #define check_mem_region(start,n)	__check_region(&iomem_resource, (start), (n))
diff --git a/kernel/resource.c b/kernel/resource.c
index 0bcebffc4e77..38b36c212a48 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1040,6 +1040,8 @@ resource_size_t resource_alignment(struct resource *res)
 
 static DECLARE_WAIT_QUEUE_HEAD(muxed_resource_wait);
 
+const char iomem_unknown_resource_name[] = { "reserved-unknown" };
+
 /**
  * __request_region - create a new busy resource region
  * @parent: parent resource descriptor
@@ -1092,6 +1094,15 @@ struct resource * __request_region(struct resource *parent,
 		break;
 	}
 	write_unlock(&resource_lock);
+
+	if (res && res->parent
+	    && res->parent->name == iomem_unknown_resource_name) {
+		add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);
+		pr_warn("request unknown region [mem %#010llx-%#010llx] %s\n",
+			res->start, res->end,
+			res->name);
+	}
+
 	return res;
 }
 EXPORT_SYMBOL(__request_region);
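
The check in the last hunk compares name pointers, not string contents, so only resources whose name is that one shared array count as "unknown". A minimal userspace sketch of the idea follows, using made-up `sketch_*` names rather than the kernel's struct resource:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One shared array; its address, not its contents, is the tag. */
const char sketch_unknown_name[] = "reserved-unknown";

struct sketch_region {
	const char *name;
	const struct sketch_region *parent;
};

/* True when res sits directly under a parent tagged as unknown. */
bool sketch_consumes_unknown(const struct sketch_region *res)
{
	return res && res->parent &&
	       res->parent->name == sketch_unknown_name;
}

/* Usage: a byte-identical copy of the string is not treated as the tag. */
void sketch_tag_demo(void)
{
	char copy[] = "reserved-unknown";
	struct sketch_region tagged = { sketch_unknown_name, NULL };
	struct sketch_region lookalike = { copy, NULL };
	struct sketch_region a = { "pmem", &tagged };
	struct sketch_region b = { "pmem", &lookalike };

	assert(sketch_consumes_unknown(&a));
	assert(!sketch_consumes_unknown(&b));
}
```

The trade-off is that the tag costs no flag bit but requires every producer to use the exported array instead of a string literal.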




Re: [PATCH 2/3 v4] resource: Add new flag IORESOURCE_MEM_WARN

2015-02-24 Thread Dan Williams
On Tue, Feb 24, 2015 at 7:00 AM, Boaz Harrosh b...@plexistor.com wrote:

 Resource providers set this flag if they want
 that request_region will print a warning in dmesg
 if this particular memory resource is locked by a driver.

 Thous acting as a Protocol Police about experimental
 devices that did not pass a committee approval.

 The Only user of  this flag is x86/kernel/e820.c that
 wants to WARN about UNKNOWN memory types.

 NOTE: It would be preferred if I defined a general flag say
   IORESOURCE_WARN, where any kind of resource provider
   can WARN on use, but we have run out of flags in the
   32bit long systems. So I defined a free bit from the
   resource specific flags for mem resources. This is
   why I need to check if this is a memory resource first
   so not to conflict with other resource specific flags.
   (Though actually no one is using this specific bit)

 CC: Thomas Gleixner t...@linutronix.de
 CC: Ingo Molnar mi...@redhat.com
 CC: H. Peter Anvin h...@zytor.com
 CC: x...@kernel.org
 CC: Dan Williams dan.j.willi...@intel.com
 CC: Andrew Morton a...@linux-foundation.org
 CC: Bjorn Helgaas bhelg...@google.com
 CC: Vivek Goyal vgo...@redhat.com
 Signed-off-by: Boaz Harrosh b...@plexistor.com
 ---
  arch/x86/kernel/e820.c | 3 +++
  include/linux/ioport.h | 1 +
  kernel/resource.c  | 9 -
  3 files changed, 12 insertions(+), 1 deletion(-)

 diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
 index 1a8a1c3..105bb37 100644
 --- a/arch/x86/kernel/e820.c
 +++ b/arch/x86/kernel/e820.c
 @@ -961,6 +961,9 @@ void __init e820_reserve_resources(void)

 		res->flags = IORESOURCE_MEM;

 +		if (_is_unknown_type(e820.map[i].type))
 +			res->flags |= IORESOURCE_MEM_WARN;
 +
 /*
  * don't register the region that could be conflicted with
  * pci device BAR resource and insert them later in
 diff --git a/include/linux/ioport.h b/include/linux/ioport.h
 index 2c525022..f78972b 100644
 --- a/include/linux/ioport.h
 +++ b/include/linux/ioport.h
 @@ -90,6 +90,7 @@ struct resource {
  #define IORESOURCE_MEM_32BIT		(3<<3)
  #define IORESOURCE_MEM_SHADOWABLE	(1<<5)	/* dup: IORESOURCE_SHADOWABLE */
  #define IORESOURCE_MEM_EXPANSIONROM	(1<<6)
 +#define IORESOURCE_MEM_WARN		(1<<7)	/* WARN if requested by driver */

  /* PnP I/O specific bits (IORESOURCE_BITS) */
  #define IORESOURCE_IO_16BIT_ADDR	(1<<0)
 diff --git a/kernel/resource.c b/kernel/resource.c
 index 19f2357..4bab16f 100644
 --- a/kernel/resource.c
 +++ b/kernel/resource.c
 @@ -1075,8 +1075,15 @@ struct resource * __request_region(struct resource *parent,
 			break;
 		if (conflict != parent) {
 			parent = conflict;
 -			if (!(conflict->flags & IORESOURCE_BUSY))
 +			if (!(conflict->flags & IORESOURCE_BUSY)) {
 +				if (unlikely(

No need for unlikely(), this isn't a hot path.

 +					(resource_type(conflict) == IORESOURCE_MEM)
 +					&& (conflict->flags & IORESOURCE_MEM_WARN)))
 +					pr_warn("request region with unknown memory type [mem %#010llx-%#010llx] %s\n",
 +						conflict->start, conflict->end,
 +						conflict->name);

I think this should also dump the res-name to identify who is requesting it.
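
As an aside on why the patch checks the resource type before testing the flag: the low flag bits are type-specific, so the same bit can mean something else on a non-memory resource. A small sketch of that check, with illustrative `SKETCH_*` constants laid out in the style of include/linux/ioport.h (the values and the IO-side bit are assumptions, not copied from the header):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative layout: a type field plus type-specific low bits. */
#define SKETCH_TYPE_MASK	0x00001f00UL
#define SKETCH_TYPE_IO		0x00000100UL
#define SKETCH_TYPE_MEM		0x00000200UL

/* Bit 7 is shared: its meaning depends on the resource type. */
#define SKETCH_MEM_WARN		(1UL << 7)	/* only meaningful for MEM */
#define SKETCH_IO_BIT7		(1UL << 7)	/* hypothetical IO-specific use */

/* Warn only when the resource is memory AND carries the MEM warn bit. */
bool sketch_should_warn(unsigned long flags)
{
	assert((flags & SKETCH_TYPE_MASK) != 0);	/* caller must set a type */

	return (flags & SKETCH_TYPE_MASK) == SKETCH_TYPE_MEM &&
	       (flags & SKETCH_MEM_WARN) != 0;
}
```

Testing the bit alone would also fire for an IO resource that happens to use bit 7 for something unrelated.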


Re: [PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY

2015-02-25 Thread Dan Williams
On Mon, Feb 23, 2015 at 11:59 PM, Boaz Harrosh b...@plexistor.com wrote:
 No, this is a complete HACK, since when do we hard code specific (GLOBAL)
 ARCHs strings in common code. Please look at linux/ioport.h see the richness
 of options for all kind of buses and systems. The flag system works perfectly
 and I just continue this here.

 And really DAN, you prefer a global string that's dead garbage in 99% of 
 arches
 to a simple bit flag definition that costs nothing? I don't think so.

Glad we're moving ahead with the IORESOURCE_MEM_WARN solution rather
than this or the 64-bit-limited IORESOURCE_WARN approach.


 + add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);

 NACK!!


I disagree.  Ultimately what goes into kernel/resource.c is not up to
me, but firmware/driver combinations that subvert standards should be
flagged by the kernel.  Stepping back from the original motivation, in
the general case, an unknown memory type is indiscernible from a BIOS
bug.

TAINT_FIRMWARE_WORKAROUND is simply a notification that firmware needs
to be updated, and I believe a driver attaching to unknown memory is
such an event.  It does not block a user from using that memory
however he or she sees fit.


Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper

2015-03-26 Thread Dan Williams
On Thu, Mar 26, 2015 at 11:46 AM, Elliott, Robert (Server Storage)
elli...@hp.com wrote:


 -Original Message-
 From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
 ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
 Sent: Thursday, March 26, 2015 11:43 AM
 To: Boaz Harrosh
 Cc: Christoph Hellwig; Ingo Molnar; linux-nvd...@ml01.01.org; linux-
 fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; x...@kernel.org;
 ross.zwis...@linux.intel.com; ax...@kernel.dk
 Subject: Re: [PATCH 2/3] x86: add a is_e820_ram() helper

 On Thu, Mar 26, 2015 at 05:49:38PM +0200, Boaz Harrosh wrote:
   + memmap=nn[KMG]!ss[KMG]
   + [KNL,X86] Mark specific memory as protected.
   + Region of memory to be used, from ss to ss+nn.
   + The memory region may be marked as e820 type 12 (0xc)
   + and is NVDIMM or ADR memory.
   +
 
  Do we need to escape \! this character on grub command line ? It might
  help to note that. I did like the original | BTW

 No need to escape it on the kvm command line, which is where I tested
 this flag only so far.  If there is a strong argument for | I'm happy
 to change it.

 I agree with Boaz that ! is a nuisance if loading pmem as a module
 with modprobe from bash.

This is a core kernel command line, not a module parameter.  I'm not
saying that it should stay !, but modprobe will not need to deal
with it.
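
To make the proposed syntax concrete, here is a rough userspace sketch of parsing `nn[KMG]!ss[KMG]` into a size and a start address. It is only loosely modeled on how the kernel's memparse() treats suffixes; the function names are invented for this example and error handling is minimal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a number with an optional K/M/G suffix; *end is set past it. */
static uint64_t sketch_parse_size(const char *s, const char **end)
{
	char *e;
	uint64_t v = strtoull(s, &e, 0);

	switch (*e) {
	case 'G': case 'g':
		v <<= 10;	/* fall through */
	case 'M': case 'm':
		v <<= 10;	/* fall through */
	case 'K': case 'k':
		v <<= 10;
		e++;
		break;
	default:
		break;
	}
	*end = e;
	return v;
}

/* Parse "nn[KMG]!ss[KMG]": nn is the size, ss the start address. */
bool sketch_parse_memmap(const char *arg, uint64_t *start, uint64_t *size)
{
	const char *p, *q;

	assert(arg && start && size);

	*size = sketch_parse_size(arg, &p);
	if (p == arg || *p != '!')
		return false;
	*start = sketch_parse_size(p + 1, &q);
	if (q == p + 1 || *q != '\0')
		return false;
	return true;
}
```

With this reading, `memmap=4G!12G` marks 4 GiB starting at the 12 GiB boundary, and repeating the option with different ranges would cover the per-NUMA-node case raised earlier in the thread. When typing such a string in an interactive bash session, single quotes avoid history expansion of the `!`.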


Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper

2015-03-26 Thread Dan Williams
On Thu, Mar 26, 2015 at 1:01 AM, Christoph Hellwig h...@lst.de wrote:
 On Wed, Mar 25, 2015 at 07:15:42PM -0700, Dan Williams wrote:
 Random thought, type-12 memory happens to correspond to legacy
 NVDIMM systems with smaller capacities.  Perhaps new NVDIMM should
 not be is_e820_ram() by default?

 Let's look into that once we can see the spec..

  Based on an earlier patch from Dave Jiang dave.ji...@intel.com.

 ...which was based on an earlier patch by me, its been nearly 4 years
 to come full circle.

 That's the attribution in the patch I have access to.  I can add you
 to the credits if you want.

Yes, please attribute Dave and myself.

...and for the series: Acked-by: Dan Williams dan.j.willi...@intel.com


Re: [Linux-nvdimm] [PATCH] SQUASHME: Streamline pmem.c

2015-03-26 Thread Dan Williams
On Thu, Mar 26, 2015 at 10:02 AM, Boaz Harrosh b...@plexistor.com wrote:

 Christoph why did you choose the fat and ugly version of
 pmem.c beats me.

Boaz, I am so very tired of your snide commentary.  It severely
detracts from the technical merit of your patches.  Please stop.


Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper

2015-03-25 Thread Dan Williams
On Wed, Mar 25, 2015 at 9:04 AM, Christoph Hellwig h...@lst.de wrote:
 This will allow to deal with persistent memory which needs to be
 treated like ram in many, but not all cases.

Random thought, type-12 memory happens to correspond to legacy
NVDIMM systems with smaller capacities.  Perhaps new NVDIMM should
not be is_e820_ram() by default?

 Based on an earlier patch from Dave Jiang dave.ji...@intel.com.

...which was based on an earlier patch by me; it's been nearly 4 years
to come full circle.


Re: [Linux-nvdimm] [PATCH 1/3] pmem: Initial version of persistent memory driver

2015-03-26 Thread Dan Williams
On Thu, Mar 26, 2015 at 1:32 AM, Christoph Hellwig h...@lst.de wrote:
 From: Ross Zwisler ross.zwis...@linux.intel.com

 PMEM is a new driver that presents a reserved range of memory as a
 block device.  This is useful for developing with NV-DIMMs, and
 can be used with volatile memory as a development platform.

 Signed-off-by: Ross Zwisler ross.zwis...@linux.intel.com
 [hch: convert to use a platform_device for discovery, fix partition
  support]
 Signed-off-by: Christoph Hellwig h...@lst.de
 Tested-by: Ross Zwisler ross.zwis...@linux.intel.com
 ---
  MAINTAINERS|   6 +
  drivers/block/Kconfig  |  13 ++
  drivers/block/Makefile |   1 +
  drivers/block/pmem.c   | 373 
 +
  4 files changed, 393 insertions(+)
  create mode 100644 drivers/block/pmem.c

 diff --git a/MAINTAINERS b/MAINTAINERS
 index 358eb01..efacf2b 100644
 --- a/MAINTAINERS
 +++ b/MAINTAINERS
 @@ -8063,6 +8063,12 @@ S:   Maintained
  F: Documentation/blockdev/ramdisk.txt
  F: drivers/block/brd.c

 +PERSISTENT MEMORY DRIVER
 +M: Ross Zwisler ross.zwis...@linux.intel.com
 +L: linux-nvd...@lists.01.org
 +S: Supported
 +F: drivers/block/pmem.c
 +
  RANDOM NUMBER DRIVER
  M: Theodore Ts'o ty...@mit.edu
  S: Maintained
 diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
 index 1b8094d..9284aaf 100644
 --- a/drivers/block/Kconfig
 +++ b/drivers/block/Kconfig
 @@ -404,6 +404,19 @@ config BLK_DEV_RAM_DAX
   and will prevent RAM block device backing store memory from being
   allocated from highmem (only a problem for highmem systems).

 +config BLK_DEV_PMEM
 +	tristate "Persistent memory block device support"
 +	help
 +	  Saying Y here will allow you to use a contiguous range of reserved
 +	  memory as one or more block devices.  Memory for PMEM should be
 +	  reserved using the memmap kernel parameter.
 +
 +	  To compile this driver as a module, choose M here: the module will be
 +	  called pmem.
 +
 +	  Most normal users won't need this functionality, and can thus say N
 +	  here.
 +
  config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
 diff --git a/drivers/block/Makefile b/drivers/block/Makefile
 index 02b688d..9cc6c18 100644
 --- a/drivers/block/Makefile
 +++ b/drivers/block/Makefile
 @@ -14,6 +14,7 @@ obj-$(CONFIG_PS3_VRAM)+= ps3vram.o
  obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o
  obj-$(CONFIG_AMIGA_Z2RAM)  += z2ram.o
  obj-$(CONFIG_BLK_DEV_RAM)  += brd.o
 +obj-$(CONFIG_BLK_DEV_PMEM) += pmem.o
  obj-$(CONFIG_BLK_DEV_LOOP) += loop.o
  obj-$(CONFIG_BLK_CPQ_DA)   += cpqarray.o
  obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
 diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
 new file mode 100644
 index 000..545b13b
 --- /dev/null
 +++ b/drivers/block/pmem.c
 @@ -0,0 +1,373 @@
 +/*
 + * Persistent Memory Driver
 + * Copyright (c) 2014, Intel Corporation.
 + *
 + * This program is free software; you can redistribute it and/or modify it
 + * under the terms and conditions of the GNU General Public License,
 + * version 2, as published by the Free Software Foundation.
 + *
 + * This program is distributed in the hope it will be useful, but WITHOUT
 + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
 + * more details.
 + *
 + * This driver is heavily based on drivers/block/brd.c.
 + * Copyright (C) 2007 Nick Piggin
 + * Copyright (C) 2007 Novell Inc.
 + */
 +
 +#include <asm/cacheflush.h>
 +#include <linux/blkdev.h>
 +#include <linux/hdreg.h>
 +#include <linux/init.h>
 +#include <linux/platform_device.h>
 +#include <linux/module.h>
 +#include <linux/moduleparam.h>
 +#include <linux/slab.h>
 +
 +#define SECTOR_SHIFT		9
 +#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
 +#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
 +
 +#define PMEM_MINORS		16
 +
 +struct pmem_device {
 +	struct request_queue	*pmem_queue;
 +	struct gendisk		*pmem_disk;
 +
 +	/* One contiguous memory region per device */
 +	phys_addr_t		phys_addr;
 +	void			*virt_addr;
 +	size_t			size;
 +};
 +
 +static int pmem_major;
 +static atomic_t pmem_index;
 +
 +static int pmem_getgeo(struct block_device *bd, struct hd_geometry *geo)
 +{
 +	/* some standard values */
 +	geo->heads = 1 << 6;
 +	geo->sectors = 1 << 5;
 +	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
 +	return 0;
 +}
 +
 +/*
 + * direct translation from (pmem,sector) => void*
 + * We do not require that sector be page aligned.
 + * The return value will point to the beginning of the page containing the
 + * given sector, not to the sector itself.
 + */
 +static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector)
 +{
 +   

Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper

2015-03-26 Thread Dan Williams
On Thu, Mar 26, 2015 at 8:49 AM, Boaz Harrosh b...@plexistor.com wrote:
 On 03/26/2015 11:34 AM, Christoph Hellwig wrote:
 +/*
 + * This is a non-standardized way to represent ADR or NVDIMM regions that
 + * persist over a reboot.  The kernel will ignore their special capabilities
 + * unless the CONFIG_X86_PMEM_LEGACY option is set.
 + *
 + * Note that older platforms also used 6 for the same type of memory,
 + * but newer versions switched to 12 as 6 was assigned differently.  Some
 + * time they will learn..
 + */
 +#define E820_PRAM12

 Why the PRAM Name. For one 2/3 of this patch say PMEM the Kconfig
 to enable is _PMEM_, the driver stack that gets loaded is pmem,
 so PRAM is unexpected.

 Also I do believe PRAM is not the correct name. Yes NvDIMMs are RAM,
 but there are other not RAM technologies that can be supported exactly
 the same way.
 MEM is a more general name meaning on the memory bus. I think.

 I would love the consistency.

One of the nice side effects of having a PRAM name is that we can
later add a UEFI PMEM type, where the distinction is that PRAM is
included in the system memory map by default and PMEM is analogous
to IOMEM.  Just a thought...


Re: [Linux-nvdimm] [PATCH 1/3] pmem: Initial version of persistent memory driver

2015-03-26 Thread Dan Williams
On Thu, Mar 26, 2015 at 7:52 AM, Boaz Harrosh b...@plexistor.com wrote:
 On 03/26/2015 04:12 PM, Dan Williams wrote:
 On Thu, Mar 26, 2015 at 1:32 AM, Christoph Hellwig h...@lst.de wrote:
 From: Ross Zwisler ross.zwis...@linux.intel.com


 Dan something is Broken with you mailer program it keeps dropping the
 CC when sending replies.

 For example Both me and Ross who were on CC got dropped, Jens Axboe
 though got add back.

 Its not only this email, it is all the emails in this series, please
 check what is going on.

They show up in the archives:
https://lists.01.org/pipermail/linux-nvdimm/2015-March/thread.html

Sometimes vger.kernel.org drops intel.com mails, it's outside my control.


Re: [PATCH 7/8] pmem: Add support for page structs

2015-03-23 Thread Dan Williams
On Thu, Mar 5, 2015 at 3:59 AM, Boaz Harrosh b...@plexistor.com wrote:

 One of the current shortcomings of the NVDIMM/PMEM
 support is that this memory does not have a page-struct(s)
 associated with its memory and therefor cannot be passed
 to a block-device or network or DMAed in any way through
 another device in the system.

 The use of add_persistent_memory() fixes all this. After this patch
 an FS can do:
 bdev_direct_access(,pfn,);

Hmm, can we do this mapping on demand per direct access mapping rather
than unconditionally for each range that pmem is handling?

Going forward I don't think we want to be tied to guaranteeing that
plain bdev_direct_access() always yields pfn_to_page()-capable pfns.

Perhaps a DAX_MAP_PFN flag or something along those lines?


Re: [PATCH 02/21] ND NFIT-Defined/NVIDIMM Subsystem

2015-04-20 Thread Dan Williams
On Mon, Apr 20, 2015 at 12:06 AM, Ingo Molnar mi...@kernel.org wrote:

 * Dan Williams dan.j.willi...@intel.com wrote:

 Maintainer information and documentation for drivers/block/nd/

 Cc: Andy Lutomirski l...@amacapital.net
 Cc: Boaz Harrosh b...@plexistor.com
 Cc: H. Peter Anvin h...@zytor.com
 Cc: Jens Axboe ax...@fb.com
 Cc: Ingo Molnar mi...@kernel.org
 Cc: Christoph Hellwig h...@lst.de
 Cc: Neil Brown ne...@suse.de
 Cc: Greg KH gre...@linuxfoundation.org
 Signed-off-by: Dan Williams dan.j.willi...@intel.com
 ---
  Documentation/blockdev/nd.txt |  867 
 +
  MAINTAINERS   |   34 +-
  2 files changed, 895 insertions(+), 6 deletions(-)
  create mode 100644 Documentation/blockdev/nd.txt

 diff --git a/Documentation/blockdev/nd.txt b/Documentation/blockdev/nd.txt
 new file mode 100644
 index ..bcfdf21063ab
 --- /dev/null
 +++ b/Documentation/blockdev/nd.txt
 @@ -0,0 +1,867 @@
 + The NFIT-Defined/NVDIMM Sub-system (ND)
 +
 +  nd - kernel abi / device-model  ndctl - userspace helper library
 + linux-nvd...@lists.01.org
 +v9: April 17th, 2015
 +
 +
 +  Glossary
 +
 +  Overview
 +Supporting Documents
 +Git Trees
 +
 +  NFIT Terminology and NVDIMM Types

 [...]

 +The “NVDIMM Firmware Interface Table” (NFIT) [...]

 Ok, I'll bite.

 So why on earth is this whole concept and the naming itself
 ('drivers/block/nd/' stands for 'NFIT Defined', apparently) revolving
 around a specific 'firmware' mindset and revolving around specific,
 weirdly named, overly complicated looking firmware interfaces that
 come with their own new weird glossary??

There are only three core properties of NVDIMMs that this implementation
cares about.

1/ directly mapped interleaved persistent memory (PMEM)
2/ indirect mmio aperture accessed (windowed) persistent memory (BLK)
3/ the possibility that those 2 access modes may alias the same
on-media addresses

Most of the complexity of the implementation is dealing with aspect 3, but
that complexity can be, and is, bypassed in places.
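
To make aspect 3 concrete, here is a minimal userspace sketch (not code from the patch set) of what "alias" means: a PMEM mapping and a BLK aperture alias when the dimm-relative ranges they cover overlap.  The struct and function names are hypothetical, and the sketch assumes start + size does not overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: a range of on-media addresses, [start, start + size). */
struct demo_dpa_range {
	uint64_t start;
	uint64_t size;
};

/*
 * Two ranges alias if they share at least one on-media address.
 * Assumes start + size does not wrap around.
 */
static bool demo_ranges_alias(const struct demo_dpa_range *pmem,
			      const struct demo_dpa_range *blk)
{
	if (pmem->size == 0 || blk->size == 0)
		return false;
	return pmem->start < blk->start + blk->size &&
	       blk->start < pmem->start + pmem->size;
}
```

When no such overlap exists (e.g. a simple PMEM-only range), there is nothing to arbitrate, which is the case where the complexity can be bypassed.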

 Firmware might be a discovery method - or not. A non-volatile device
 might be e820 enumerated, or PCI discovered - potentially with all
 discovery handled by the driver.

PCI-attached non-volatile memory is NVMe.  ND handles address
ranges that support direct CPU load/store.

 Why do you restrict this driver to a naming and design that is so
 firmware centric?

PMEM, BLK, and the fact that they may alias are the generic properties
that are independent of the specification.  Granted some of the NFIT
terminology has leaked past the point of initial table parsing, but
it's too early to start claiming restrictive design.  We already
support three ways of attaching PMEM with varying degrees of backing
complexity, and we're more than willing to beat NFIT back where it
makes sense to accommodate more non-NFIT NVDIMM implementations.

 Discovery matters, but what matters _most_ to devices is actually its
 runtime properties and runtime implementation - and I sure hope
 firmware has no active role in that!

It doesn't.  Once PMEM and BLK aliasing are resolved the firmware is
out of the picture.  In some cases this aliasing is resolved from the
outset (simple memory range, type-12 etc...), and the bulk of the
implementation is bypassed in that case.

 I really think this is backwards from the get go, it gives me a
 feeling of someone having spent way too much time in committee and too
 little time spent thinking about simple, proper kernel design and
 reusing existing terminology ...

The simple paths are there, in addition to support for the rest of the
spec.  Do we have an existing term for a dimm-relative-address in the
kernel?  Some of this is simply novel to the kernel.

 Also:

 +  nd - kernel abi / device-model  ndctl - userspace helper library

 WTF is a 'kernel ABI'??

ABI as in Documentation/ABI/: the sysfs layout and the ioctls for passing
a handful of management commands to firmware.  Wherever possible all
the slow path configuration is done with sysfs.


Re: [PATCH 02/21] ND NFIT-Defined/NVIDIMM Subsystem

2015-04-21 Thread Dan Williams
On Mon, Apr 20, 2015 at 8:57 AM, Dan Williams dan.j.willi...@intel.com wrote:
 On Mon, Apr 20, 2015 at 5:53 AM, Christoph Hellwig h...@lst.de wrote:
 Once I'll go through this in more detail I'll comment more.

 Sounds good.

Given that the ACPICA folks are going to define their own nfit.h with
possibly different structure names, that damage should be limited to
just acpi.c.  Currently, changing nfit.h structure field names would
impact multiple files.  It's a straightforward rework to disentangle,
I'll post patches soon.


Re: [Linux-nvdimm] [PATCH 08/21] nd: ndctl.h, the nd ioctl abi

2015-04-21 Thread Dan Williams
On Tue, Apr 21, 2015 at 2:20 PM, Toshi Kani toshi.k...@hp.com wrote:
 On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote:
 Most configuration of the nd-subsystem is done via nd-sysfs.  However,
 the NFIT specification defines a small set of messages that can be
 passed to the subsystem via platform-firmware-defined methods.  The
 command set (as of the current version of the NFIT-DSM spec) is:

 NFIT_CMD_SMART: media health and diagnostics
 NFIT_CMD_GET_CONFIG_SIZE: size of the label space
 NFIT_CMD_GET_CONFIG_DATA: read label
 NFIT_CMD_SET_CONFIG_DATA: write label
 NFIT_CMD_VENDOR: vendor-specific command passthrough
 NFIT_CMD_ARS_CAP: report address-range-scrubbing capabilities
 NFIT_CMD_START_ARS: initiate scrubbing
 NFIT_CMD_QUERY_ARS: report on scrubbing state
 NFIT_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

 Most of the commands target a specific dimm.  However, the
 address-range-scrubbing commands target the entire NFIT-bus / platform.
 The 'commands' attribute of an nd-bus or an nd-dimm enumerates the
 supported commands for that object.

 Cc: linux-a...@vger.kernel.org
 Cc: Robert Moore robert.mo...@intel.com
 Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com
 Reported-by: Nicholas Moulin nicholas.w.mou...@linux.intel.com
 Signed-off-by: Dan Williams dan.j.willi...@intel.com
 ---
  drivers/block/nd/Kconfig  |   11 +
  drivers/block/nd/acpi.c   |  333 +
  drivers/block/nd/bus.c|  230 
  drivers/block/nd/core.c   |   17 ++
  drivers/block/nd/dimm_devs.c  |   69 
  drivers/block/nd/nd-private.h |   11 +
  drivers/block/nd/nd.h |   21 +++
  drivers/block/nd/test/nfit.c  |   89 +++
  include/uapi/linux/Kbuild |1
  include/uapi/linux/ndctl.h|  178 ++
  10 files changed, 950 insertions(+), 10 deletions(-)
  create mode 100644 drivers/block/nd/nd.h
  create mode 100644 include/uapi/linux/ndctl.h

 diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
 index 0106b3807202..6c15d10bf4e0 100644
 --- a/drivers/block/nd/Kconfig
 +++ b/drivers/block/nd/Kconfig
 @@ -42,6 +42,17 @@ config NFIT_ACPI
 enables the core to craft ACPI._DSM messages for platform/dimm
 configuration.

 +config NFIT_ACPI_DEBUG
 + bool "NFIT ACPI: Turn on extra debugging"
 + depends on NFIT_ACPI
 + depends on DYNAMIC_DEBUG
 + default n
 + help
 +   Enabling this option causes the nd_acpi driver to dump the
 +   input and output buffers of _DSM operations on the ACPI0012
 +   device, which can be very verbose.  Leave it disabled unless
 +   you are debugging a hardware / firmware issue.
 +
  config NFIT_TEST
   tristate "NFIT TEST: Manufactured NFIT for interface testing"
   depends on DMA_CMA
 diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
 index 48db723d7a90..073ff28fdbfe 100644
 --- a/drivers/block/nd/acpi.c
 +++ b/drivers/block/nd/acpi.c
 @@ -13,8 +13,10 @@
  #include <linux/list.h>
  #include <linux/acpi.h>
  #include <linux/mutex.h>
 +#include <linux/ndctl.h>
  #include <linux/module.h>
  #include "nfit.h"
 +#include "nd.h"

  enum {
   NFIT_ACPI_NOTIFY_TABLE = 0x80,
 @@ -26,20 +28,330 @@ struct acpi_nfit {
   struct nd_bus *nd_bus;
  };

 +static struct acpi_nfit *to_acpi_nfit(struct nfit_bus_descriptor *nfit_desc)
 +{
 + return container_of(nfit_desc, struct acpi_nfit, nfit_desc);
 +}
 +
 +#define NFIT_ACPI_MAX_ELEM 4
 +struct nfit_cmd_desc {
 + int in_num;
 + int out_num;
 + u32 in_sizes[NFIT_ACPI_MAX_ELEM];
 + int out_sizes[NFIT_ACPI_MAX_ELEM];
 +};
 +
 +static const struct nfit_cmd_desc nfit_dimm_descs[] = {
 + [NFIT_CMD_IMPLEMENTED] = { },
 + [NFIT_CMD_SMART] = {
 + .out_num = 2,
 + .out_sizes = { 4, 8, },
 + },
 + [NFIT_CMD_SMART_THRESHOLD] = {
 + .out_num = 2,
 + .out_sizes = { 4, 8, },
 + },
 + [NFIT_CMD_DIMM_FLAGS] = {
 + .out_num = 2,
 + .out_sizes = { 4, 4 },
 + },
 + [NFIT_CMD_GET_CONFIG_SIZE] = {
 + .out_num = 3,
 + .out_sizes = { 4, 4, 4, },
 + },
 + [NFIT_CMD_GET_CONFIG_DATA] = {
 + .in_num = 2,
 + .in_sizes = { 4, 4, },
 + .out_num = 2,
 + .out_sizes = { 4, UINT_MAX, },
 + },
 + [NFIT_CMD_SET_CONFIG_DATA] = {
 + .in_num = 3,
 + .in_sizes = { 4, 4, UINT_MAX, },
 + .out_num = 1,
 + .out_sizes = { 4, },
 + },
 + [NFIT_CMD_VENDOR] = {
 + .in_num = 3,
 + .in_sizes = { 4, 4, UINT_MAX, },
 + .out_num = 3,
 + .out_sizes = { 4, 4, UINT_MAX, },
 + },
 +};
 +
 +static const struct nfit_cmd_desc nfit_acpi_descs[] = {
 + [NFIT_CMD_IMPLEMENTED] = { },
 + [NFIT_CMD_ARS_CAP] = {
 + .in_num = 2,
 + .in_sizes

Re: [Linux-nvdimm] [PATCH 04/21] nd: create an 'nd_bus' from an 'nfit_desc'

2015-04-21 Thread Dan Williams
On Tue, Apr 21, 2015 at 12:55 PM, Toshi Kani toshi.k...@hp.com wrote:
 On Tue, 2015-04-21 at 12:58 -0700, Dan Williams wrote:
 On Tue, Apr 21, 2015 at 12:35 PM, Toshi Kani toshi.k...@hp.com wrote:
  On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote:
   :
  +
  +static int nd_mem_init(struct nd_bus *nd_bus)
  +{
  + struct nd_spa *nd_spa;
  +
  + /*
  +  * For each SPA-DCR address range find its corresponding
  +  * MEMDEV(s).  From each MEMDEV find the corresponding DCR.
  +  * Then, try to find a SPA-BDW and a corresponding BDW that
  +  * references the DCR.  Throw it all into an nd_mem object.
  +  * Note, that BDWs are optional.
  +  */
  + list_for_each_entry(nd_spa, &nd_bus->spas, list) {
  + u16 spa_index = readw(&nd_spa->nfit_spa->spa_index);
  + int type = nfit_spa_type(nd_spa->nfit_spa);
  + struct nd_mem *nd_mem, *found;
  + struct nd_memdev *nd_memdev;
  + u16 dcr_index;
  +
  + if (type != NFIT_SPA_DCR)
  + continue;
 
  This function requires NFIT_SPA_DCR, SPA Range Structure with NVDIMM
  Control Region GUID, for initializing an nd_mem object.  However,
  battery-backed DIMMs do not have such control region SPA.  IIUC, the
  NFIT spec does not require NFIT_SPA_DCR.
 
  Can you change this function to work with NFIT_SPA_PM as well?

 NFIT_SPA_PM ranges are handled separately from nd_mem_init().  See
 nd_region_create() in patch 10.

 If nd_mem_init() does not initialize nd_mem objects, nd_bus_probe() in
 core.c fails in nd_bus_init_interleave_sets() and skips all subsequent
 nd_bus_xxx() calls.  So, nd_region_create() won't be called.

 nd_bus_init_interleave_sets() fails because init_interleave_set()
 returns -ENODEV if (!nd_mem).

Ah, ok, your test case is specifying PMEM backed by memory device
info.  We have a test case for simple ranges (nfit_test1_setup()), but
it doesn't hit this bug because it does not specify any memory-device
tables.

Thanks, will fix this in v2 of the patch set.

 BTW, there are two nd_bus_probe() in bus.c and core.c, which is
 confusing.

Ok, will fix this as well in the v2 posting.


Re: [Linux-nvdimm] [PATCH 04/21] nd: create an 'nd_bus' from an 'nfit_desc'

2015-04-22 Thread Dan Williams
On Wed, Apr 22, 2015 at 9:39 AM, Toshi Kani toshi.k...@hp.com wrote:
 On Tue, 2015-04-21 at 13:35 -0700, Dan Williams wrote:
 On Tue, Apr 21, 2015 at 12:55 PM, Toshi Kani toshi.k...@hp.com wrote:
  On Tue, 2015-04-21 at 12:58 -0700, Dan Williams wrote:
  On Tue, Apr 21, 2015 at 12:35 PM, Toshi Kani toshi.k...@hp.com wrote:
   On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote:
:
   +
   +static int nd_mem_init(struct nd_bus *nd_bus)
   +{
   + struct nd_spa *nd_spa;
   +
   + /*
   +  * For each SPA-DCR address range find its corresponding
   +  * MEMDEV(s).  From each MEMDEV find the corresponding DCR.
   +  * Then, try to find a SPA-BDW and a corresponding BDW that
   +  * references the DCR.  Throw it all into an nd_mem object.
   +  * Note, that BDWs are optional.
   +  */
   + list_for_each_entry(nd_spa, &nd_bus->spas, list) {
   + u16 spa_index = readw(&nd_spa->nfit_spa->spa_index);
   + int type = nfit_spa_type(nd_spa->nfit_spa);
   + struct nd_mem *nd_mem, *found;
   + struct nd_memdev *nd_memdev;
   + u16 dcr_index;
   +
   + if (type != NFIT_SPA_DCR)
   + continue;
  
   This function requires NFIT_SPA_DCR, SPA Range Structure with NVDIMM
   Control Region GUID, for initializing an nd_mem object.  However,
   battery-backed DIMMs do not have such control region SPA.  IIUC, the
   NFIT spec does not require NFIT_SPA_DCR.
  
   Can you change this function to work with NFIT_SPA_PM as well?
 
  NFIT_SPA_PM ranges are handled separately from nd_mem_init().  See
  nd_region_create() in patch 10.
 
  If nd_mem_init() does not initialize nd_mem objects, nd_bus_probe() in
  core.c fails in nd_bus_init_interleave_sets() and skips all subsequent
  nd_bus_xxx() calls.  So, nd_region_create() won't be called.
 
  nd_bus_init_interleave_sets() fails because init_interleave_set()
  returns -ENODEV if (!nd_mem).

 Ah, ok, your test case is specifying PMEM backed by memory device
 info.  We have a test case for simple ranges (nfit_test1_setup()), but
 it doesn't hit this bug because it does not specify any memory-device
 tables.

 Yes, we have NFIT table with SPA range (PM), memory device to SPA, and
 NVDIMM control region structures.  With the memory device to SPA
 structure, this code requires full sets of information, including the
 namespace label data in _DSM [1], which is outside of ACPI 6.0 and is
 optional.  Battery-backed DIMMs do not have such label data.

This is what nd_namespace_io devices are for, they do not require labels.

Question, if you don't have labels and you don't have DSMs then why
publish a MEMDEV table at all?  Why not simply publish an anonymous
range?  See nfit_test1_setup().

 It needs
 to work with NFIT table with these structures without this _DSM or with
 a different type of _DSM which this code may or may not need to support.
 It should also check Region Format Interface Code (RFIC) in the NVDIMM
 control region structure before assuming this _DSM is present to
 implement RFIC 0x0201.

Ok I can look into adding this check, but I don't think it is
necessary if you simply refrain from publishing a MEMDEV entry.


Re: [Linux-nvdimm] [PATCH 04/21] nd: create an 'nd_bus' from an 'nfit_desc'

2015-04-22 Thread Dan Williams
On Wed, Apr 22, 2015 at 11:00 AM, Linda Knippers linda.knipp...@hp.com wrote:
 On 4/22/2015 1:03 PM, Dan Williams wrote:
 On Wed, Apr 22, 2015 at 9:39 AM, Toshi Kani toshi.k...@hp.com wrote:
 On Tue, 2015-04-21 at 13:35 -0700, Dan Williams wrote:
 On Tue, Apr 21, 2015 at 12:55 PM, Toshi Kani toshi.k...@hp.com wrote:
 On Tue, 2015-04-21 at 12:58 -0700, Dan Williams wrote:
 On Tue, Apr 21, 2015 at 12:35 PM, Toshi Kani toshi.k...@hp.com wrote:
 On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote:
  :
 +
 +static int nd_mem_init(struct nd_bus *nd_bus)
 +{
 + struct nd_spa *nd_spa;
 +
 + /*
 +  * For each SPA-DCR address range find its corresponding
 +  * MEMDEV(s).  From each MEMDEV find the corresponding DCR.
 +  * Then, try to find a SPA-BDW and a corresponding BDW that
 +  * references the DCR.  Throw it all into an nd_mem object.
 +  * Note, that BDWs are optional.
 +  */
 + list_for_each_entry(nd_spa, &nd_bus->spas, list) {
 + u16 spa_index = readw(&nd_spa->nfit_spa->spa_index);
 + int type = nfit_spa_type(nd_spa->nfit_spa);
 + struct nd_mem *nd_mem, *found;
 + struct nd_memdev *nd_memdev;
 + u16 dcr_index;
 +
 + if (type != NFIT_SPA_DCR)
 + continue;

 This function requires NFIT_SPA_DCR, SPA Range Structure with NVDIMM
 Control Region GUID, for initializing an nd_mem object.  However,
 battery-backed DIMMs do not have such control region SPA.  IIUC, the
 NFIT spec does not require NFIT_SPA_DCR.

 Can you change this function to work with NFIT_SPA_PM as well?

 NFIT_SPA_PM ranges are handled separately from nd_mem_init().  See
 nd_region_create() in patch 10.

 If nd_mem_init() does not initialize nd_mem objects, nd_bus_probe() in
 core.c fails in nd_bus_init_interleave_sets() and skips all subsequent
 nd_bus_xxx() calls.  So, nd_region_create() won't be called.

 nd_bus_init_interleave_sets() fails because init_interleave_set()
 returns -ENODEV if (!nd_mem).

 Ah, ok, your test case is specifying PMEM backed by memory device
 info.  We have a test case for simple ranges (nfit_test1_setup()), but
 it doesn't hit this bug because it does not specify any memory-device
 tables.

 Yes, we have NFIT table with SPA range (PM), memory device to SPA, and
 NVDIMM control region structures.  With the memory device to SPA
 structure, this code requires full sets of information, including the
 namespace label data in _DSM [1], which is outside of ACPI 6.0 and is
 optional.  Battery-backed DIMMs do not have such label data.

 This is what nd_namespace_io devices are for, they do not require labels.

 Question, if you don't have labels and you don't have DSMs then why
 publish a MEMDEV table at all?  Why not simply publish an anonymous
 range?  See nfit_test1_setup().

 The MEMDEV table provides useful information, and there may be _DSMs,
 perhaps just not the same _DSM as some other devices.

 It needs
 to work with NFIT table with these structures without this _DSM or with
 a different type of _DSM which this code may or may not need to support.
 It should also check Region Format Interface Code (RFIC) in the NVDIMM
 control region structure before assuming this _DSM is present to
 implement RFIC 0x0201.

 Ok I can look into adding this check, but I don't think it is
 necessary if you simply refrain from publishing a MEMDEV entry.

 But we need the MEMDEV. And as Toshi mentions, we could have other
 RFICs with other _DSMs than your example.  That's why there is an RFIC.

Wait, point of clarification, DCRs (dimm-control-regions) have RFICs,
not MEMDEVs (memory-device-to-spa-mapping).  Toshi's original report
was that an NFIT with a SPA+MEMDEV was failing to enable a PMEM
device.  That specific problem can be fixed by either deleting the
MEMDEV, or adding a DCR.

Of course, if you add a DCR with a different intended DSM layout than
the DSM-example-interface the driver will need to add support for
handling that case.


Re: [Linux-nvdimm] [PATCH 04/21] nd: create an 'nd_bus' from an 'nfit_desc'

2015-04-22 Thread Dan Williams
On Wed, Apr 22, 2015 at 11:23 AM, Toshi Kani toshi.k...@hp.com wrote:
 On Wed, 2015-04-22 at 11:20 -0700, Dan Williams wrote:
 On Wed, Apr 22, 2015 at 11:00 AM, Linda Knippers linda.knipp...@hp.com 
 wrote:
 Wait, point of clarification, DCRs (dimm-control-regions) have RFICs,
 not MEMDEVs (memory-device-to-spa-mapping).  Toshi's original report
 was that an NFIT with a SPA+MEMDEV was failing to enable a PMEM
 device.  That specific problem can be fixed by either deleting the
 MEMDEV, or adding a DCR.

 By a DCR, do you mean a DCR structure or SPA with Control Region GUID?

Hmm, I meant a DCR as defined below.  I agree you would not need a SPA-DCR.

 Adding a DCR structure does not solve this issue since it requires SPA
 with Control Region GUID, which battery-backed DIMMs do not have.

I would not go that far, half of a DCR entry is relevant for any
NVDIMM, and half is only relevant if a DIMM offers BLK access:

struct acpi_nfit_dcr {
u16 type;
u16 length;
u16 dcr_index;
u16 vendor_id;
u16 device_id;
u16 revision_id;
u16 sub_vendor_id;
u16 sub_device_id;
u16 sub_revision_id;
u8 reserved[6];
u32 serial_number;
u16 fic;
/* BLK relevant fields start here */
u16 num_bcw;
u64 bcw_size;
u64 cmd_offset;
u64 cmd_size;
u64 status_offset;
u64 status_size;
u16 flags;
u8 reserved2[6];
};

 Of course, if you add a DCR with a different intended DSM layout than
 the DSM-example-interface the driver will need to add support for
 handling that case.

 Yes, we are considering adding different _DSMs for management.  We do not need
 the nd_acpi driver to support it now, but we need this framework to work
 without the DSM-example-interface present.


One possible workaround is that I could ignore MEMDEV entries that do
not have a corresponding DCR.  This would enable nd_namespace_io
devices to be surfaced for your use case.  Would that work for you?
I.e. do you need the nfit_handle exposed?
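
A rough sketch of that workaround, in plain userspace C rather than the driver's actual data structures: walk the MEMDEV entries and skip, rather than fail on, any entry whose dcr_index matches no DCR in the table.  The demo_* structs and helpers are hypothetical, simplified stand-ins and are not taken from the patch set.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for parsed NFIT entries. */
struct demo_memdev {
	uint32_t nfit_handle;
	uint16_t dcr_index;	/* which DCR this MEMDEV refers to */
};

struct demo_dcr {
	uint16_t dcr_index;
};

/* Return the DCR a MEMDEV refers to, or NULL if the table has none. */
static const struct demo_dcr *demo_find_dcr(const struct demo_memdev *memdev,
		const struct demo_dcr *dcrs, size_t num_dcrs)
{
	size_t i;

	for (i = 0; i < num_dcrs; i++)
		if (dcrs[i].dcr_index == memdev->dcr_index)
			return &dcrs[i];
	return NULL;
}

/* Count the MEMDEVs that are kept; the rest are skipped, not errors. */
static size_t demo_count_usable_memdevs(const struct demo_memdev *memdevs,
		size_t num_memdevs, const struct demo_dcr *dcrs,
		size_t num_dcrs)
{
	size_t i, usable = 0;

	for (i = 0; i < num_memdevs; i++) {
		if (!demo_find_dcr(&memdevs[i], dcrs, num_dcrs))
			continue;	/* ignore: no corresponding DCR */
		usable++;
	}
	return usable;
}
```

The trade-off is the one raised in the question above: a skipped MEMDEV contributes nothing to the device model, so its nfit_handle would not be exposed.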


Re: [Linux-nvdimm] [PATCH 19/21] nd: infrastructure for btt devices

2015-04-22 Thread Dan Williams
On Wed, Apr 22, 2015 at 12:12 PM, Elliott, Robert (Server Storage)
elli...@hp.com wrote:
 -Original Message-
 From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
 Dan Williams
 Sent: Friday, April 17, 2015 8:37 PM
 To: linux-nvd...@lists.01.org
 Subject: [Linux-nvdimm] [PATCH 19/21] nd: infrastructure for btt devices

 ...
 +/*
 + * btt_sb_checksum: compute checksum for btt info block
 + *
 + * Returns a fletcher64 checksum of everything in the given info block
 + * except the last field (since that's where the checksum lives).
 + */
 +u64 btt_sb_checksum(struct btt_sb *btt_sb)
 +{
 + u64 sum, sum_save;
 +
 + sum_save = btt_sb->checksum;
 + btt_sb->checksum = 0;
 + sum = nd_fletcher64(btt_sb, sizeof(*btt_sb));
 + btt_sb->checksum = sum_save;
 + return sum;
 +}
 +EXPORT_SYMBOL(btt_sb_checksum);
 ...

 Of all the functions with prototypes in nd.h, this is the only
 function that doesn't have a name starting with nd_.

 Following such a convention helps ease setting up ftrace filters.

Sure, I'll fix that up.
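
For readers following along, the save / zero / sum / restore pattern in the quoted function can be tried out in userspace with a sketch like the one below.  This is an illustration under assumptions, not the patch's code: demo_fletcher64() is a generic Fletcher-style sum over 32-bit words in host byte order, struct demo_sb is a made-up layout, and neither is claimed to match nd_fletcher64() or struct btt_sb.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fletcher-style 64-bit sum over 32-bit words (host byte order here). */
static uint64_t demo_fletcher64(const void *addr, size_t len)
{
	const unsigned char *p = addr;
	uint32_t lo32 = 0;
	uint64_t hi32 = 0;
	size_t i;

	for (i = 0; i < len / sizeof(uint32_t); i++) {
		uint32_t word;

		memcpy(&word, p + i * sizeof(word), sizeof(word));
		lo32 += word;
		hi32 += lo32;
	}
	return hi32 << 32 | lo32;
}

/* Hypothetical info block: payload followed by its own checksum. */
struct demo_sb {
	uint32_t fields[6];
	uint64_t checksum;	/* last field, excluded from the sum */
};

/* Sum the block as if the checksum field were zero, then restore it. */
static uint64_t demo_sb_checksum(struct demo_sb *sb)
{
	uint64_t sum, sum_save;

	sum_save = sb->checksum;
	sb->checksum = 0;
	sum = demo_fletcher64(sb, sizeof(*sb));
	sb->checksum = sum_save;
	return sum;
}
```

Because the stored checksum is zeroed before summing and put back afterwards, recomputing the sum on a block that already carries its checksum gives the same value, which is what makes verification a simple comparison.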


Re: [Linux-nvdimm] [PATCH 00/21] ND: NFIT-Defined / NVDIMM Subsystem

2015-04-22 Thread Dan Williams
On Wed, Apr 22, 2015 at 12:06 PM, Elliott, Robert (Server Storage)
elli...@hp.com wrote:
 -Original Message-
 From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
 Dan Williams
 Sent: Friday, April 17, 2015 8:35 PM
 To: linux-nvd...@lists.01.org
 Subject: [Linux-nvdimm] [PATCH 00/21] ND: NFIT-Defined / NVDIMM Subsystem

 ...
  create mode 100644 drivers/block/nd/acpi.c
  create mode 100644 drivers/block/nd/blk.c
  create mode 100644 drivers/block/nd/bus.c
  create mode 100644 drivers/block/nd/core.c
 ...

 The kernel already has lots of files with these names:
  5 acpi.c
 10 bus.c
 66 core.c

 I often use ctags like this:
 vim -t core.c
 but that doesn’t immediately work with common filenames - it
 presents a list of all 66 files to choose from.

 Also, blk.c is a name one might expect to see in the block/
 directory (e.g., next to blk.h).

 An nd_ prefix on all the filenames would help.


I picked up the "don't duplicate the directory name in the source file
name" approach from a review comment from Linus on a SCSI driver a
long time back (iirc).  I'm not motivated to stop that practice now.


Re: [Linux-nvdimm] [PATCH 04/21] nd: create an 'nd_bus' from an 'nfit_desc'

2015-04-22 Thread Dan Williams
On Wed, Apr 22, 2015 at 12:38 PM, Toshi Kani toshi.k...@hp.com wrote:
 On Wed, 2015-04-22 at 12:28 -0700, Dan Williams wrote:
 On Wed, Apr 22, 2015 at 11:23 AM, Toshi Kani toshi.k...@hp.com wrote:
  On Wed, 2015-04-22 at 11:20 -0700, Dan Williams wrote:
  On Wed, Apr 22, 2015 at 11:00 AM, Linda Knippers linda.knipp...@hp.com 
  wrote:
  Wait, point of clarification, DCRs (dimm-control-regions) have RFICs,
  not MEMDEVs (memory-device-to-spa-mapping).  Toshi's original report
  was that an NFIT with a SPA+MEMDEV was failing to enable a PMEM
  device.  That specific problem can be fixed by either deleting the
  MEMDEV, or adding a DCR.
 
  By a DCR, do you mean a DCR structure or SPA with Control Region GUID?

 Hmm, I meant a DCR as defined below.  I agree you would not need a SPA-DCR.

  Adding a DCR structure does not solve this issue since it requires SPA
  with Control Region GUID, which battery-backed DIMMs do not have.

 I would not go that far, half of a DCR entry is relevant for any
 NVDIMM, and half is only relevant if a DIMM offers BLK access:

 struct acpi_nfit_dcr {
 u16 type;
 u16 length;
 u16 dcr_index;
 u16 vendor_id;
 u16 device_id;
 u16 revision_id;
 u16 sub_vendor_id;
 u16 sub_device_id;
 u16 sub_revision_id;
 u8 reserved[6];
 u32 serial_number;
 u16 fic;
 /* BLK relevant fields start here */
 u16 num_bcw;
 u64 bcw_size;
 u64 cmd_offset;
 u64 cmd_size;
 u64 status_offset;
 u64 status_size;
 u16 flags;
 u8 reserved2[6];
 };

 Yes, we do have a DCR entry.  But we do not have a SPA-DCR.

Got it. will fix.


Re: [Linux-nvdimm] [PATCH 05/21] nfit-test: manufactured NFITs for interface development

2015-04-24 Thread Dan Williams
On Fri, Apr 24, 2015 at 2:59 PM, Linda Knippers linda.knipp...@hp.com wrote:
 On 4/24/2015 5:50 PM, Dan Williams wrote:
 On Fri, Apr 24, 2015 at 2:47 PM, Linda Knippers linda.knipp...@hp.com 
 wrote:
 On 4/17/2015 9:35 PM, Dan Williams wrote:
 :
 diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
 index 5fa74f124b3e..0106b3807202 100644
 --- a/drivers/block/nd/Kconfig
 +++ b/drivers/block/nd/Kconfig
 @@ -41,4 +41,24 @@ config NFIT_ACPI
 register the platform-global NFIT blob with the core.  Also
 enables the core to craft ACPI._DSM messages for platform/dimm
 configuration.
 +
 +config NFIT_TEST
 + tristate "NFIT TEST: Manufactured NFIT for interface testing"
 + depends on DMA_CMA
 + depends on ND_CORE=m
 + depends on m
 + help
 +   For development purposes register a manufactured
 +   NFIT table to verify the resulting device model topology.
 +   Note, this module arranges for ioremap_cache() to be
 +   overridden locally to allow simulation of system-memory as an
 +   io-memory-resource.
 +
 +   Note, this test expects to be able to find at least
 +   256MB of CMA space (CONFIG_CMA_SIZE_MBYTES) or it will fail to

 It seems to actually be wanting >= 584MB.

 Ah, true, this Kconfig text is stale.  Will fix.

 Thanks.  One more question...

 +#ifdef CONFIG_CMA_SIZE_MBYTES
 +#define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
 +#else
 +#define CMA_SIZE_MBYTES 0
 +#endif
 +
 +static __init int nfit_test_init(void)
 +{
 + int rc, i;
 +
 + if (CMA_SIZE_MBYTES < 584) {
 + pr_err("need CONFIG_CMA_SIZE_MBYTES >= 584 to load\n");
 + return -EINVAL;
 + }
 +

 Since the kernel takes a cma= boot parameter, it would be nice if
 this check is against what the kernel is using rather than the config
 option.  Is that possible?

Yeah, that would be more friendly.  I also think we can reduce the BLK
aperture sizes.  Since those don't need to be DAX capable they can
come from vmalloc memory rather than CMA.  I'll take a look.


Re: [Linux-nvdimm] [PATCH 08/21] nd: ndctl.h, the nd ioctl abi

2015-04-24 Thread Dan Williams
On Fri, Apr 24, 2015 at 8:56 AM, Toshi Kani toshi.k...@hp.com wrote:
 On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote:
 Most configuration of the nd-subsystem is done via nd-sysfs.  However,
 the NFIT specification defines a small set of messages that can be
 passed to the subsystem via platform-firmware-defined methods.  The
 command set (as of the current version of the NFIT-DSM spec) is:

 NFIT_CMD_SMART: media health and diagnostics
 NFIT_CMD_GET_CONFIG_SIZE: size of the label space
 NFIT_CMD_GET_CONFIG_DATA: read label
 NFIT_CMD_SET_CONFIG_DATA: write label
 NFIT_CMD_VENDOR: vendor-specific command passthrough
 NFIT_CMD_ARS_CAP: report address-range-scrubbing capabilities
 NFIT_CMD_START_ARS: initiate scrubbing
 NFIT_CMD_QUERY_ARS: report on scrubbing state
 NFIT_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

 nd/bus.c provides two features, 1) the top level ND bus driver which
 is the central part of the ND, and 2) the ioctl interface specific to
 the example-DSM-interface.  I think the example-DSM-specific part should
 be put into an example-DSM-support module, so that the ND can support
 other _DSMs as necessary.  Also, _DSM needs to be handled as optional.

I don't think it needs to be separated, they'll both end up using the
same infrastructure just with different UUIDs on the ACPI device
interface or different format-interface-codes.  A firmware
implementation is also free to disable individual DSMs (see
nd_acpi_add_dimm).  That said, you're right, we do need a fix to allow
PMEM from DIMMs without DSMs to activate.


Re: [Linux-nvdimm] [PATCH 08/21] nd: ndctl.h, the nd ioctl abi

2015-04-24 Thread Dan Williams
On Fri, Apr 24, 2015 at 9:09 AM, Toshi Kani toshi.k...@hp.com wrote:
 On Fri, 2015-04-24 at 09:56 -0600, Toshi Kani wrote:
 On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote:
  Most configuration of the nd-subsystem is done via nd-sysfs.  However,
  the NFIT specification defines a small set of messages that can be
  passed to the subsystem via platform-firmware-defined methods.  The
  command set (as of the current version of the NFIT-DSM spec) is:
 
  NFIT_CMD_SMART: media health and diagnostics
  NFIT_CMD_GET_CONFIG_SIZE: size of the label space
  NFIT_CMD_GET_CONFIG_DATA: read label
  NFIT_CMD_SET_CONFIG_DATA: write label
  NFIT_CMD_VENDOR: vendor-specific command passthrough
  NFIT_CMD_ARS_CAP: report address-range-scrubbing capabilities
  NFIT_CMD_START_ARS: initiate scrubbing
  NFIT_CMD_QUERY_ARS: report on scrubbing state
  NFIT_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

 nd/bus.c provides two features, 1) the top level ND bus driver which
 is the central part of the ND, and 2) the ioctl interface specific to
 the example-DSM-interface.  I think the example-DSM-specific part should
 be put into an example-DSM-support module, so that the ND can support
 other _DSMs as necessary.  Also, _DSM needs to be handled as optional.

 And the same for nd/acpi.c, which is 1) the ACPI0012 handler, and 2)
 the example-DSM-support module.  I think they need to be separated.


Ok, send me a patch as I'm not sure what type of separation you are proposing.


Re: [Linux-nvdimm] [PATCH 08/21] nd: ndctl.h, the nd ioctl abi

2015-04-24 Thread Dan Williams
On Fri, Apr 24, 2015 at 10:18 AM, Toshi Kani toshi.k...@hp.com wrote:
 On Fri, 2015-04-24 at 09:25 -0700, Dan Williams wrote:
 On Fri, Apr 24, 2015 at 8:56 AM, Toshi Kani toshi.k...@hp.com wrote:
  On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote:
  Most configuration of the nd-subsystem is done via nd-sysfs.  However,
  the NFIT specification defines a small set of messages that can be
  passed to the subsystem via platform-firmware-defined methods.  The
  command set (as of the current version of the NFIT-DSM spec) is:
 
  NFIT_CMD_SMART: media health and diagnostics
  NFIT_CMD_GET_CONFIG_SIZE: size of the label space
  NFIT_CMD_GET_CONFIG_DATA: read label
  NFIT_CMD_SET_CONFIG_DATA: write label
  NFIT_CMD_VENDOR: vendor-specific command passthrough
  NFIT_CMD_ARS_CAP: report address-range-scrubbing capabilities
  NFIT_CMD_START_ARS: initiate scrubbing
  NFIT_CMD_QUERY_ARS: report on scrubbing state
  NFIT_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events
 
  nd/bus.c provides two features, 1) the top level ND bus driver which
  is the central part of the ND, and 2) the ioctl interface specific to
  the example-DSM-interface.  I think the example-DSM-specific part should
  be put into an example-DSM-support module, so that the ND can support
  other _DSMs as necessary.  Also, _DSM needs to be handled as optional.

 I don't think it needs to be separated, they'll both end up using the
 same infrastructure just with different UUIDs on the ACPI device
 interface or different format-interface-codes.  A firmware
 implementation is also free to disable individual DSMs (see
 nd_acpi_add_dimm).

 Well, ioctl cmd# is essentially func# of the _DSM, and each cmd
 structure needs to match with its _DSM output data structure.  So, I do
 not think these cmds will work for other _DSMs.  That said, the ND is
 complex enough already, and we should not make it more complicated for
 the initial version...  So, how about changing the name of /dev/ndctl0
 to indicate RFIC 0x0201, ex. /dev/nd0201ctl0?  That should allow
 separate ioctl()s for other RFICs.  The code can be updated when other
 _DSM actually needs to be supported by the ND.

No, all you need is unique command names (see libndctl
ndctl_{bus|dimm}_is_cmd_supported()) and then translate the ND cmd
number to the firmware function number in the provider.  It just so
happens that for this first set of commands the ND cmd number matches
the ACPI device function number in the DSM-interface-example, but
there is no reason that need always be the case.
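
A minimal sketch of the translation described above, written in Python
for brevity rather than kernel C.  The provider names, the command
numbers and the second provider's function numbers are all hypothetical;
they only illustrate that the ND cmd number and the firmware function
number need not be the same.

```python
# Generic ND command names mapped to ND command numbers (illustrative values).
ND_CMDS = {"smart": 1, "get_config_size": 4, "get_config_data": 5}

# Per-provider translation from ND command number to firmware function number.
# "example_dsm" happens to be an identity mapping.  "other_dsm" is a
# hypothetical provider that numbers its functions differently and does not
# implement every command.
PROVIDER_FUNCS = {
    "example_dsm": {1: 1, 4: 4, 5: 5},
    "other_dsm": {1: 10, 5: 12},
}


def is_cmd_supported(provider, cmd_name):
    """Return True if the provider implements the named ND command."""
    cmd = ND_CMDS.get(cmd_name)
    return cmd is not None and cmd in PROVIDER_FUNCS[provider]


def to_firmware_func(provider, cmd_name):
    """Translate a unique ND command name to the provider's function number."""
    if not is_cmd_supported(provider, cmd_name):
        raise ValueError(f"{cmd_name} is not supported by {provider}")
    return PROVIDER_FUNCS[provider][ND_CMDS[cmd_name]]
```

With this shape userspace only ever sees the command names, and each
provider is free to renumber or omit functions.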


Re: [PATCH 01/21] e820, efi: add ACPI 6.0 persistent memory types

2015-04-20 Thread Dan Williams
On Sun, Apr 19, 2015 at 12:46 AM, Boaz Harrosh b...@plexistor.com wrote:
 On 04/18/2015 04:35 AM, Dan Williams wrote:
[..]
 diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
 index 11cc7d54ec3f..410af501a941 100644
 --- a/arch/x86/kernel/e820.c
 +++ b/arch/x86/kernel/e820.c
 @@ -137,6 +137,8 @@ static void __init e820_print_type(u32 type)
   case E820_RESERVED_KERN:
   printk(KERN_CONT "usable");
   break;
 + case E820_PMEM:
 + case E820_PRAM:

 NACK!

 This is the most important print in the system and it is a pure
 user interface.  It has no effect whatsoever on functionality.
 It is there to inform the user, through dmesg, of the contents of the
 table.

It still describes how the memory is used, namely that it is reserved
for a driver.  I don't see how increasing the verbosity here improves
debug given the alternatives, see below...


   case E820_RESERVED:
   printk(KERN_CONT "reserved");
   break;
 @@ -149,9 +151,6 @@ static void __init e820_print_type(u32 type)
   case E820_UNUSABLE:
   printk(KERN_CONT "unusable");
   break;

 +   case E820_PMEM:
 - case E820_PRAM:
 - printk(KERN_CONT "persistent (type %u)", type);
 - break;

 Just add the new (7) entry here please. Here Christoph has bike shed
 it for you.

/proc/iomem has these details to differentiate PRAM and PMEM as well
as show which driver(s)/device(s) have claimed the range(s).

   default:
   printk(KERN_CONT "type %u", type);
   break;

Here is where you can see undefined/unknown types.

 @@ -919,10 +918,26 @@ static inline const char *e820_type_to_string(int e820_type)
   case E820_NVS:  return "ACPI Non-volatile Storage";
   case E820_UNUSABLE: return "Unusable memory";
   case E820_PRAM: return "Persistent RAM";
 + case E820_PMEM: return "Persistent I/O Memory";
   default:return "reserved";
   }
  }

 +static bool do_mark_busy(u32 type, struct resource *res)
 +{
 + if (res->start < (1ULL<<20))
 + return true;
 +
 + switch (type) {
 + case E820_RESERVED:
 + case E820_PRAM:
 + case E820_PMEM:
 + return false;
 + default:
 + return true;
 + }

 Sigh. Again an unknown type comes out busy. Busy means
 resource used. It does *not* mean unknown type.

 It just forces researchers to ignore the return value of
 request_region. And not be protected by double lock. It
 does not really prevent anything

You're free to submit a standalone patch to change this policy... see
the new OEM-reserved memory types in ACPI 6.

That said, I think we're better off with the current policy.  If
unknown memory types had been treated as permanently busy back when we
initially started experimenting with NVDIMM support (2010), I doubt
the e820-type-12 prototype would ever have escaped the lab, and we
could have avoided a good amount of confusion.
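
A minimal sketch of the policy under discussion, modelled in Python
rather than kernel C.  It mirrors the do_mark_busy() hunk quoted above;
the e820 type values are the usual x86 ones, and the function is only a
userspace illustration, not the kernel implementation.

```python
# e820 memory types (x86 values; 7 is ACPI 6.0 PMEM, 12 is the legacy PRAM type).
E820_RAM, E820_RESERVED, E820_ACPI, E820_NVS, E820_UNUSABLE = 1, 2, 3, 4, 5
E820_PMEM, E820_PRAM = 7, 12


def do_mark_busy(e820_type, start):
    """Return True if a range should be marked busy in the resource tree."""
    # Anything that starts below 1 MiB is marked busy regardless of type.
    if start < (1 << 20):
        return True
    # Ranges that a driver is expected to claim later are left non-busy.
    if e820_type in (E820_RESERVED, E820_PRAM, E820_PMEM):
        return False
    # Everything else, including unknown types, is marked busy.
    return True
```

Under this policy an unknown type (say 99) above 1 MiB comes out busy,
which is the behaviour being objected to above.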


Re: [PATCH 02/21] ND NFIT-Defined/NVDIMM Subsystem

2015-04-20 Thread Dan Williams
On Mon, Apr 20, 2015 at 5:53 AM, Christoph Hellwig h...@lst.de wrote:
 [I haven't much time to look through the patches, so only high level
  hand wavey comments for now, sorry..]

 On Mon, Apr 20, 2015 at 01:14:42AM -0700, Dan Williams wrote:
  So why on earth is this whole concept and the naming itself
  ('drivers/block/nd/' stands for 'NFIT Defined', apparently) revolving
  around a specific 'firmware' mindset and revolving around specific,
  weirdly named, overly complicated looking firmware interfaces that
  come with their own new weird glossary??

 There's only three core properties of NVDIMMs that this implementation
 cares about.

 1/ directly mapped interleaved persistent memory (PMEM)
 2/ indirect mmio aperture accessed (windowed) persistent memory (BLK)
 3/ the possibility that those 2 access modes may alias the same
 on-media addresses

 Most of the complexity of the implementation is dealing with aspect 3,
 but that complexity can be, and is, bypassed in places.

  Firmware might be a discovery method - or not. A non-volatile device
  might be e820 enumerated, or PCI discovered - potentially with all
  discovery handled by the driver.

 PCI attached non-volatile memory is NVMe.  ND is handling address
 ranges that support direct cpu load store.

 But those can be attached in all kinds of different ways.  It's not like
 this is a new thing - they've been used in Storage OEM systems for a long
 time, both on Intel platforms and other CPUs.

 And the current pmem.c can also handle cases like a PCI card exposing
 a large mmio region that can be used as persistent memory.

 So a big vote from me into naming this the pmem subsystem and trying
 to have names not too tied to one specific firmware interface.

While I understand a kernel developer's natural aversion to anything
committee defined, NFIT does seem to be a superset of all the base
mechanisms needed to describe NVDIMM resources.  Also, it's worth
noting that the meaning of 'N' in ND is purposefully vague.  The whole
point of listing it as Nfit-Defined / NvDimm Subsystem was to
indicate that ND is generic and could also refer generally to
Non-volatile-Devices.  What's missing, in my opinion, is an existing
NVDIMM platform that would like to leverage some of the base enabling
that this sub-system provides and will never have an NFIT capability.
In the absence of alternative concerns/implementations we reached for
NFIT terminology out of convenience, but I'm all for deprecating
NFIT-Defined as one of the meanings of 'ND'.

 Once I'll go through this in more detail I'll comment more.

Sounds good.


Re: [Linux-nvdimm] [PATCH 04/21] nd: create an 'nd_bus' from an 'nfit_desc'

2015-04-21 Thread Dan Williams
On Tue, Apr 21, 2015 at 12:35 PM, Toshi Kani toshi.k...@hp.com wrote:
 On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote:
  :
 +
 +static int nd_mem_init(struct nd_bus *nd_bus)
 +{
 + struct nd_spa *nd_spa;
 +
 + /*
 +  * For each SPA-DCR address range find its corresponding
 +  * MEMDEV(s).  From each MEMDEV find the corresponding DCR.
 +  * Then, try to find a SPA-BDW and a corresponding BDW that
 +  * references the DCR.  Throw it all into an nd_mem object.
 +  * Note, that BDWs are optional.
 +  */
 + list_for_each_entry(nd_spa, &nd_bus->spas, list) {
 + u16 spa_index = readw(&nd_spa->nfit_spa->spa_index);
 + int type = nfit_spa_type(nd_spa->nfit_spa);
 + struct nd_mem *nd_mem, *found;
 + struct nd_memdev *nd_memdev;
 + u16 dcr_index;
 +
 + if (type != NFIT_SPA_DCR)
 + continue;

 This function requires NFIT_SPA_DCR, SPA Range Structure with NVDIMM
 Control Region GUID, for initializing an nd_mem object.  However,
 battery-backed DIMMs do not have such control region SPA.  IIUC, the
 NFIT spec does not require NFIT_SPA_DCR.

 Can you change this function to work with NFIT_SPA_PM as well?

NFIT_SPA_PM ranges are handled separately from nd_mem_init().  See
nd_region_create() in patch 10.
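
The lookup chain in the nd_mem_init() comment quoted above can be
sketched roughly as follows.  This is a simplified Python model, not the
kernel code: the dictionary keys are hypothetical stand-ins for the NFIT
table fields, and the SPA-BDW step is folded into the BDW lookup.

```python
def build_nd_mems(spas, memdevs, dcrs, bdws):
    """Join SPA-DCR ranges to their MEMDEV, DCR and optional BDW entries.

    spas:    list of {"index": int, "type": "dcr" | "pm" | ...}
    memdevs: list of {"spa_index": int, "dcr_index": int}
    dcrs:    list of {"index": int}
    bdws:    list of {"dcr_index": int}
    """
    mems = []
    for spa in spas:
        # Only control-region ranges start the chain; PM ranges are
        # handled elsewhere.
        if spa["type"] != "dcr":
            continue
        for memdev in memdevs:
            if memdev["spa_index"] != spa["index"]:
                continue
            dcr = next((d for d in dcrs if d["index"] == memdev["dcr_index"]), None)
            if dcr is None:
                continue
            # BDWs are optional, so a missing one is recorded as None.
            bdw = next((b for b in bdws if b["dcr_index"] == dcr["index"]), None)
            mems.append({"spa": spa, "memdev": memdev, "dcr": dcr, "bdw": bdw})
    return mems
```

A table set with only a PM range yields no entries here, which matches
the point that such ranges take a different path.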


Re: [Linux-nvdimm] [PATCH 05/21] nfit-test: manufactured NFITs for interface development

2015-04-24 Thread Dan Williams
On Fri, Apr 24, 2015 at 2:47 PM, Linda Knippers linda.knipp...@hp.com wrote:
 On 4/17/2015 9:35 PM, Dan Williams wrote:
 :
 diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
 index 5fa74f124b3e..0106b3807202 100644
 --- a/drivers/block/nd/Kconfig
 +++ b/drivers/block/nd/Kconfig
 @@ -41,4 +41,24 @@ config NFIT_ACPI
 register the platform-global NFIT blob with the core.  Also
 enables the core to craft ACPI._DSM messages for platform/dimm
 configuration.
 +
 +config NFIT_TEST
 + tristate "NFIT TEST: Manufactured NFIT for interface testing"
 + depends on DMA_CMA
 + depends on ND_CORE=m
 + depends on m
 + help
 +   For development purposes register a manufactured
 +   NFIT table to verify the resulting device model topology.
 +   Note, this module arranges for ioremap_cache() to be
 +   overridden locally to allow simulation of system-memory as an
 +   io-memory-resource.
 +
 +   Note, this test expects to be able to find at least
 +   256MB of CMA space (CONFIG_CMA_SIZE_MBYTES) or it will fail to

 It seems to actually be wanting >= 584MB.

Ah, true, this Kconfig text is stale.  Will fix.


Re: [PATCH v2 00/20] libnd: non-volatile memory device support

2015-04-28 Thread Dan Williams
On Tue, Apr 28, 2015 at 5:25 PM, Rafael J. Wysocki r...@rjwysocki.net wrote:
 On Tuesday, April 28, 2015 02:24:12 PM Dan Williams wrote:
 Changes since v1 [1]: Incorporates feedback received prior to April 24.

 1/ Ingo said [2]:

So why on earth is this whole concept and the naming itself
('drivers/block/nd/' stands for 'NFIT Defined', apparently)
revolving around a specific 'firmware' mindset and revolving
around specific, weirdly named, overly complicated looking
firmware interfaces that come with their own new weird
glossary??

Indeed, we of course consulted the NFIT specification to determine
the shape of the sub-system, but then let its terms and data
structures permeate too deep into the implementation.  That is fixed
now with all NFIT specifics factored out into acpi.c.  The NFIT is no
longer required reading to review libnd.  Only three concepts are
needed:

   i/ PMEM - contiguous memory range where cpu stores are
persistent once they are flushed through the memory
  controller.

  ii/ BLK - mmio apertures (sliding windows) that can be
programmed to access an aperture's-worth of persistent
  media at a time.

 iii/ DPA - dimm-physical-address, address space local to a
dimm.  A dimm may provide both PMEM-mode and BLK-mode
access to a range of DPA.  libnd manages allocation of DPA
  to either PMEM or BLK-namespaces to resolve this aliasing.

The v1..v2 diffstat below shows the migration of nfit-specifics to
acpi.c and the new state of libnd being nfit-free.  nd now only
refers to non-volatile devices.  Note, reworked documentation will
return once the review has settled.

Documentation/blockdev/nd.txt |  867 -
MAINTAINERS   |   34 +-
arch/ia64/kernel/efi.c|5 +-
arch/x86/kernel/e820.c|   11 +-
arch/x86/kernel/pmem.c|2 +-
drivers/block/Makefile|2 +-
drivers/block/nd/Kconfig  |  135 ++--
drivers/block/nd/Makefile |   32 +-
drivers/block/nd/acpi.c   | 1506 +++--
drivers/block/nd/acpi_nfit.h  |  321 
drivers/block/nd/blk.c|   27 +-
drivers/block/nd/btt.c|6 +-
drivers/block/nd/btt_devs.c   |8 +-
drivers/block/nd/bus.c|  337 +
drivers/block/nd/core.c   |  574 +-
drivers/block/nd/dimm.c   |   11 -
drivers/block/nd/dimm_devs.c  |  292 ++-
drivers/block/nd/e820.c   |  100 +++
drivers/block/nd/libnd.h  |  122 +++
drivers/block/nd/namespace_devs.c |   10 +-
drivers/block/nd/nd-private.h |  107 +--
drivers/block/nd/nd.h |   91 +--
drivers/block/nd/nfit.h   |  238 --
drivers/block/nd/pmem.c   |   56 +-
drivers/block/nd/region.c |   78 +-
drivers/block/nd/region_devs.c|  783 +++
drivers/block/nd/test/iomap.c |   86 +--
drivers/block/nd/test/nfit.c  | 1115 +++
drivers/block/nd/test/nfit_test.h |   15 +-
include/uapi/linux/ndctl.h|  130 ++--
30 files changed, 3166 insertions(+), 3935 deletions(-)
delete mode 100644 Documentation/blockdev/nd.txt
create mode 100644 drivers/block/nd/acpi_nfit.h
create mode 100644 drivers/block/nd/e820.c
create mode 100644 drivers/block/nd/libnd.h
delete mode 100644 drivers/block/nd/nfit.h

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-April/000484.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-April/000520.html

 2/ Christoph asked the pmem ida conversion to be moved to its own patch
(done), and to consider leaving the current pmem.c in drivers/block/.
Instead, I converted the e820-type-12 enabling to be the first
non-ACPI-NFIT based consumer of libnd.  The new nd_e820 driver simply
registers e820-type-12 ranges as libnd PMEM regions.  Among other
things this conversion enables BTT for these ranges.  The alternative
is to move drivers/block/nd/nd.h internals out to include/linux/
which I think is worse.

 3/ Toshi reported that the NFIT parsing fails to handle the case of a
PMEM range with a single-dimm (non-aliasing) interleave description.
Support for this case was added and is tested by default by the
nfit_test.1 configuration.

 4/ Toshi reported that we should not be treating a missing _STA property
as a dimm disabled by firmware case.  (fixed).

 5/ Christoph noted that ND_ARCH_HAS_IOREMAP_CACHE needs to be moved to
arch code.  It is gone for now and we'll revisit when adding cached
mappings back to the PMEM driver.

 6/ Toshi mentioned that the presence of two different nd_bus_probe()
functions was confusing.  (cleaned up

Re: [PATCH v2 00/20] libnd: non-volatile memory device support

2015-04-28 Thread Dan Williams
On Tue, Apr 28, 2015 at 1:52 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Tue, Apr 28, 2015 at 11:24 AM, Dan Williams dan.j.willi...@intel.com 
 wrote:
 Changes since v1 [1]: Incorporates feedback received prior to April 24.

 1/ Ingo said [2]:

So why on earth is this whole concept and the naming itself
('drivers/block/nd/' stands for 'NFIT Defined', apparently)
revolving around a specific 'firmware' mindset and revolving
around specific, weirdly named, overly complicated looking
firmware interfaces that come with their own new weird
glossary??

Indeed, we of course consulted the NFIT specification to determine
the shape of the sub-system, but then let its terms and data
structures permeate too deep into the implementation.  That is fixed
now with all NFIT specifics factored out into acpi.c.  The NFIT is no
longer required reading to review libnd.  Only three concepts are
needed:

   i/ PMEM - contiguous memory range where cpu stores are
  persistent once they are flushed through the memory
  controller.

  ii/ BLK - mmio apertures (sliding windows) that can be
  programmed to access an aperture's-worth of persistent
  media at a time.

 iii/ DPA - dimm-physical-address, address space local to a
  dimm.  A dimm may provide both PMEM-mode and BLK-mode
  access to a range of DPA.  libnd manages allocation of DPA
  to either PMEM or BLK-namespaces to resolve this aliasing.

 Mostly for my understanding: is there a name for address relative to
 the address lines on the DIMM?  That is, a DIMM that exposes 8 GB of
 apparent physical memory, possibly interleaved, broken up, or weirdly
 remapped by the memory controller, would still have addresses between
 0 and 8 GB.  Some of those might be PMEM windows, some might be MMIO,
 some might be BLK apertures, etc.

 IIUC DPA refers to actual addressable storage, not this type of address?

No, DPA is exactly as you describe above.  You can't directly access
it except through a PMEM mapping (possibly interleaved with DPA from
other DIMMs) or a BLK aperture (mmio window into DPA).
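
A minimal sketch of how an offset into an interleaved PMEM range could
resolve to a (dimm, DPA) pair.  It assumes a plain round-robin
interleave with a fixed line size; the real layout is whatever the
platform's interleave description says, and the function name and
parameters here are hypothetical.

```python
def pmem_offset_to_dpa(offset, num_dimms, line_size, dpa_base=0):
    """Map an offset into an interleaved PMEM range to (dimm index, DPA)."""
    line = offset // line_size            # which interleave line the offset hits
    dimm = line % num_dimms               # lines rotate across the dimms
    # Each dimm sees every num_dimms-th line, packed contiguously in its DPA space.
    dpa = dpa_base + (line // num_dimms) * line_size + offset % line_size
    return dimm, dpa
```

For two dimms and a 4096-byte line, offset 4096 lands on dimm 1 at DPA
0, and offset 8197 lands back on dimm 0 at DPA 4101.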


Re: [PATCH v2 00/20] libnd: non-volatile memory device support

2015-04-28 Thread Dan Williams
On Tue, Apr 28, 2015 at 2:06 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Tue, Apr 28, 2015 at 1:59 PM, Dan Williams dan.j.willi...@intel.com 
 wrote:
 On Tue, Apr 28, 2015 at 1:52 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Tue, Apr 28, 2015 at 11:24 AM, Dan Williams dan.j.willi...@intel.com 
 wrote:
 Changes since v1 [1]: Incorporates feedback received prior to April 24.

 1/ Ingo said [2]:

So why on earth is this whole concept and the naming itself
('drivers/block/nd/' stands for 'NFIT Defined', apparently)
revolving around a specific 'firmware' mindset and revolving
around specific, weirdly named, overly complicated looking
firmware interfaces that come with their own new weird
glossary??

Indeed, we of course consulted the NFIT specification to determine
the shape of the sub-system, but then let its terms and data
structures permeate too deep into the implementation.  That is fixed
now with all NFIT specifics factored out into acpi.c.  The NFIT is no
longer required reading to review libnd.  Only three concepts are
needed:

   i/ PMEM - contiguous memory range where cpu stores are
  persistent once they are flushed through the memory
  controller.

  ii/ BLK - mmio apertures (sliding windows) that can be
  programmed to access an aperture's-worth of persistent
  media at a time.

 iii/ DPA - dimm-physical-address, address space local to a
  dimm.  A dimm may provide both PMEM-mode and BLK-mode
  access to a range of DPA.  libnd manages allocation of DPA
  to either PMEM or BLK-namespaces to resolve this aliasing.

 Mostly for my understanding: is there a name for address relative to
 the address lines on the DIMM?  That is, a DIMM that exposes 8 GB of
 apparent physical memory, possibly interleaved, broken up, or weirdly
 remapped by the memory controller, would still have addresses between
 0 and 8 GB.  Some of those might be PMEM windows, some might be MMIO,
 some might be BLK apertures, etc.

 IIUC DPA refers to actual addressable storage, not this type of address?

 No, DPA is exactly as you describe above.  You can't directly access
 it except through a PMEM mapping (possibly interleaved with DPA from
 other DIMMs) or a BLK aperture (mmio window into DPA).

 So the thing I'm describing has no name, then?  Oh, well.

What?  The thing you are describing *is* DPA.


Re: [PATCH v2 20/20] libnd, nd_acpi, nd_blk: driver for BLK-mode access persistent memory

2015-04-28 Thread Dan Williams
On Tue, Apr 28, 2015 at 2:10 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Tue, Apr 28, 2015 at 11:26 AM, Dan Williams dan.j.willi...@intel.com 
 wrote:
 From: Ross Zwisler ross.zwis...@linux.intel.com

 The libnd implementation handles allocating dimm address space (DPA)
 between PMEM and BLK mode interfaces.  After DPA has been allocated from
 a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
 as a struct bio based block device. Unlike PMEM, BLK is required to
 handle platform specific details like mmio register formats and memory
 controller interleave.  For this reason the libnd generic nd_blk driver
 calls back into the bus provider to carry out the I/O.

 This initial implementation handles the BLK interface defined by the
 ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
 DCR (dimm control region), BDW (block data window), IDT (interleave
 descriptor) NFIT structures and the hardware register format.
 [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
 [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

 Cc: Andy Lutomirski l...@amacapital.net
 Cc: Boaz Harrosh b...@plexistor.com
 Cc: H. Peter Anvin h...@zytor.com
 Cc: Jens Axboe ax...@fb.com
 Cc: Ingo Molnar mi...@kernel.org
 Cc: Christoph Hellwig h...@lst.de
 Signed-off-by: Ross Zwisler ross.zwis...@linux.intel.com
 Signed-off-by: Dan Williams dan.j.willi...@intel.com
 ---
  drivers/block/nd/Kconfig  |   12 +
  drivers/block/nd/Makefile |3
  drivers/block/nd/acpi.c   |  422 +++--
  drivers/block/nd/acpi_nfit.h  |   47 
  drivers/block/nd/blk.c|  264 +++
  drivers/block/nd/libnd.h  |   11 +
  drivers/block/nd/namespace_devs.c |   47 
  drivers/block/nd/nd-private.h |3
  drivers/block/nd/nd.h |   16 +
  drivers/block/nd/region.c |8 +
  drivers/block/nd/region_devs.c|   65 +-
  drivers/block/nd/test/nfit.c  |   29 +++
  drivers/block/nd/test/nfit_test.h |2
  13 files changed, 891 insertions(+), 38 deletions(-)
  create mode 100644 drivers/block/nd/blk.c

 diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
 index 612bf2b14283..bac4290129fc 100644
 --- a/drivers/block/nd/Kconfig
 +++ b/drivers/block/nd/Kconfig
 @@ -95,6 +95,18 @@ config BLK_DEV_PMEM

   Say Y if you want to use a NVDIMM described by ACPI, E820, etc...

 +config ND_BLK
 +   tristate "BLK: Block data window (aperture) device support"
 +   depends on LIBND
 +   default ND_ACPI
 +   help
 + This driver performs I/O using a set of mmio windows on a
 + dimm.  The set of apertures will all access the one DIMM.
 + Multiple windows allow multiple threads to have different
 + portions of the dimm open at one time.
 +
 + Say Y if you want to use a NVDIMM with BLK-mode capability
 +

 This describes how it works, not what it is.  How about:

 This driver exposes NVDIMM BLK regions as block devices.  BLK regions
 are regions of NVDIMM storage that are sector-addressable, not
 byte-addressable, and do not support DAX.

They *are* byte-addressable albeit through an indirection window.  The
indirection windows are too small for DAX to be a viable access mode.
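
A toy model of that indirection, in Python, assuming a single aperture
of fixed size that must be re-programmed whenever an access leaves the
current window.  The class and its register-less "program" step are
hypothetical; the real interface is defined by the platform's mmio
register format.

```python
class Aperture:
    """A sliding window onto a dimm's DPA space (toy model)."""

    def __init__(self, media, window_size):
        self.media = media              # bytearray standing in for the dimm media
        self.window_size = window_size
        self.base = 0                   # DPA currently mapped at the window start

    def program(self, dpa):
        # Point the window at the aligned chunk that contains dpa.
        self.base = dpa - dpa % self.window_size

    def read(self, dpa, length):
        """Read bytes at any byte address, one window's worth at a time."""
        out = bytearray()
        while length > 0:
            self.program(dpa)
            off = dpa - self.base
            n = min(length, self.window_size - off)
            out += self.media[self.base + off:self.base + off + n]
            dpa += n
            length -= n
        return bytes(out)
```

Every byte is reachable, but only after the window has been moved, which
is why a direct mmap()-style mapping of the whole range is not on offer.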


Re: [PATCH v2 01/20] e820, efi: add ACPI 6.0 persistent memory types

2015-04-28 Thread Dan Williams
On Tue, Apr 28, 2015 at 1:49 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Tue, Apr 28, 2015 at 11:24 AM, Dan Williams dan.j.willi...@intel.com 
 wrote:
 diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
 index 11cc7d54ec3f..d38b53a7e9b2 100644
 --- a/arch/x86/kernel/e820.c
 +++ b/arch/x86/kernel/e820.c
 @@ -149,6 +149,7 @@ static void __init e820_print_type(u32 type)
 case E820_UNUSABLE:
 printk(KERN_CONT "unusable");
 break;
 +   case E820_PMEM:
 case E820_PRAM:
 printk(KERN_CONT "persistent (type %u)", type);
 break;

 I'd kind of like to make it more clear what's going on here.  It
 doesn't help that the spec chose poor names.

 How about "NVDIMM physical aperture" for E820_PMEM and "legacy
 persistent RAM" for E820_PRAM?

The term aperture to me implies this BLK (mmio-windowed) mode of
accessing persistent media that the NFIT specification introduces.  In
fact, those ranges are mapped E820_RESERVED.  E820_PMEM really is a
memory range that happens to be persistent.

 Otherwise this looks generally sensible, although I don't really
 understand why e820_type_to_string and e820_print_type are different.

e820_type_to_string() appears in /proc/iomem and seems to afford
being more descriptive than e820_print_type(), whose output just
scrolls by in dmesg, but I'm just guessing.


Re: [Linux-nvdimm] [PATCH v2 00/20] libnd: non-volatile memory device support

2015-04-28 Thread Dan Williams
On Tue, Apr 28, 2015 at 2:24 PM, Elliott, Robert (Server Storage)
elli...@hp.com wrote:
 -Original Message-
 From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
 Dan Williams
 Sent: Tuesday, April 28, 2015 1:24 PM
 To: linux-nvd...@lists.01.org
 Cc: Neil Brown; Dave Chinner; H. Peter Anvin; Christoph Hellwig; Rafael J.
 Wysocki; Robert Moore; Ingo Molnar; linux-a...@vger.kernel.org; Jens Axboe;
 Borislav Petkov; Thomas Gleixner; Greg KH; linux-kernel@vger.kernel.org;
 Andy Lutomirski; Andrew Morton; Linus Torvalds
 Subject: [Linux-nvdimm] [PATCH v2 00/20] libnd: non-volatile memory device
 support

 Changes since v1 [1]: Incorporates feedback received prior to April 24.

 Here are some comments on the sysfs properties reported for a pmem device.
 They are based on v1, but I don't think v2 changes anything.

 1. This confuses lsblk (part of util-linux):
 /sys/block/pmem0/device/type:4

 lsblk shows:
 NAME  MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
 pmem0 251:0    0   8G  0 worm
 pmem1 251:16   0   8G  0 worm
 pmem2 251:32   0   8G  0 worm
 pmem3 251:48   0   8G  0 worm
 pmem4 251:64   0   8G  0 worm
 pmem5 251:80   0   8G  0 worm
 pmem6 251:96   0   8G  0 worm
 pmem7 251:112  0   8G  0 worm

 lsblk's blkdev_scsi_type_to_name() considers 4 to mean
 SCSI_TYPE_WORM (write once read many ... used for certain optical
 and tape drives).

Why is lsblk assuming these are scsi devices?  I'll need to go check that out.

 I'm not sure what nd and pmem are doing to result in that value.

That is their libnd-specific device type number from
include/uapi/linux/ndctl.h.  4 == ND_DEVICE_NAMESPACE_IO.  lsblk has no
business interpreting this as something SCSI specific.
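
A rough model of the collision, in Python.  The table holds a subset of
the SCSI peripheral device types; the lookup function is only a
stand-in for what lsblk appears to do with device/type and is not its
actual code.

```python
# Subset of SCSI peripheral device type codes and the names lsblk prints.
SCSI_TYPE_NAMES = {0: "disk", 1: "tape", 4: "worm", 5: "rom"}

# libnd device type for a pmem namespace, as stated in this thread.
ND_DEVICE_NAMESPACE_IO = 4


def scsi_name_for(device_type):
    """Interpret a sysfs device/type value as if it were a SCSI type code."""
    return SCSI_TYPE_NAMES.get(device_type)
```

Feeding it ND_DEVICE_NAMESPACE_IO yields "worm", which is exactly the
TYPE column shown in the lsblk output above.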

 2. To avoid confusing software trying to detect fast storage vs.
 slow storage devices via sysfs, this value should be 0:
 /sys/block/pmem0/queue/rotational:1

 That can be done by adding this shortly after the blk_alloc_queue call:
 queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);

Yeah, good catch.

 3. Is there any reason to have a 512 KiB limit on the transfer
 length?
 /sys/block/pmem0/queue/max_hw_sectors_kb:512

 That is from:
 blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);

I'd only change this from the default if performance testing showed it
made a non-trivial difference.
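
For reference, the arithmetic behind the 512 figure: the limit is given
in 512-byte sectors and sysfs reports it in KiB.  A small Python check,
with a helper name that is only illustrative:

```python
SECTOR_SIZE = 512  # bytes per sector as counted by the block layer


def max_hw_sectors_kb(max_hw_sectors):
    """Convert a max_hw_sectors limit to the KiB value shown in sysfs."""
    return max_hw_sectors * SECTOR_SIZE // 1024
```

So a limit of 1024 sectors corresponds to max_hw_sectors_kb reading 512.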

 4. These are read-writeable, but IOs never reach a queue, so
 the queue size is irrelevant and merging never happens:
 /sys/block/pmem0/queue/nomerges:0
 /sys/block/pmem0/queue/nr_requests:128

 Consider making them both read-only with:
 * nomerges set to 2 (no merging happening)
 * nr_requests as small as the block layer allows to avoid
 wasting memory.

 5. No scatter-gather lists are created by the driver, so these
 read-only fields are meaningless:
 /sys/block/pmem0/queue/max_segments:128
 /sys/block/pmem0/queue/max_segment_size:65536

 Is there a better way to report them as irrelevant?

Again it comes back to the question of whether these default settings
are actively harmful.


 6. There is no completion processing, so the read-writeable
 cpu affinity is not used:
 /sys/block/pmem0/queue/rq_affinity:0

 Consider making it read-only and set to 2, meaning the
 completions always run on the requesting CPU.

There are no completions with pmem, the entire I/O path is
synchronous.  Ideally, this attribute would disappear for a pmem
queue, not be set to 2.

 7. With mmap() allowing less than logical block sized accesses
 to the device, this could be considered misleading:
 /sys/block/pmem0/queue/physical_block_size:512

I don't see how it is misleading.  If you access it as a block device
the block size is 512.  If the application is mmap() + DAX aware it
knows that the physical_block_size is being bypassed.


 Perhaps that needs to be 1 byte or a cacheline size (64 bytes
 on x86) to indicate that direct partial logical block accesses
 are possible.

No, because that breaks the definition of a block device.  Through the
bdev interface it's always accessed a block at a time.

 The btt driver could report 512 as one indication
 it is different.

 I wouldn't be surprised if smaller values than the logical block
 size confused some software, though.

Precisely why we shouldn't go there with pmem.
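
A small Python helper for inspecting the queue attributes discussed in
this message on a running system.  The function name and its
sysfs_root parameter are hypothetical conveniences; the attribute
paths are the ones quoted above.

```python
import os


def read_queue_attrs(dev, names, sysfs_root="/sys/block"):
    """Return {attribute name: value string or None} for a block device."""
    attrs = {}
    for name in names:
        path = os.path.join(sysfs_root, dev, "queue", name)
        try:
            with open(path) as f:
                attrs[name] = f.read().strip()
        except OSError:
            # The attribute is absent or unreadable on this kernel or device.
            attrs[name] = None
    return attrs
```

For example, read_queue_attrs("pmem0", ["rotational", "nomerges",
"physical_block_size"]) gathers the values in one dictionary.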


Re: [Linux-nvdimm] [PATCH v2 08/20] libnd, nd_acpi: regions (block-data-window, persistent memory, volatile memory)

2015-04-29 Thread Dan Williams
On Wed, Apr 29, 2015 at 8:53 AM, Elliott, Robert (Server Storage)
elli...@hp.com wrote:
 -Original Message-
 From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
 Dan Williams
 Sent: Tuesday, April 28, 2015 1:25 PM
 Subject: [Linux-nvdimm] [PATCH v2 08/20] libnd, nd_acpi: regions (block-
 data-window, persistent memory, volatile memory)

 A region device represents the maximum capacity of a BLK range (mmio
 block-data-window(s)), or a PMEM range (DAX-capable persistent memory or
 volatile memory), without regard for aliasing.  Aliasing, in the
 dimm-local address space (DPA), is resolved by metadata on a dimm to
 designate which exclusive interface will access the aliased DPA ranges.
 Support for the per-dimm metadata/label arrives in a subsequent
 patch.

 The name format of region devices is regionN where, like dimms, N is
 a global ida index assigned at discovery time.  This id is not reliable
 across reboots nor in the presence of hotplug.  Look to attributes of
 the region or static id-data of the sub-namespace to generate a
 persistent name.
 ...
 +++ b/drivers/block/nd/region_devs.c
 ...
 +static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 + struct nd_region_desc *ndr_desc, struct device_type *dev_type)
 +{
 + struct nd_region *nd_region;
 + struct device *dev;
 + u16 i;
 +
 + for (i = 0; i < ndr_desc->num_mappings; i++) {
 + struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
 + struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
 +
 + if ((nd_mapping->start | nd_mapping->size) % SZ_4K) {
 + dev_err(&nd_bus->dev, "%pf: %s mapping%d is not 4K aligned\n",
 + __builtin_return_address(0),

 Please use KiB rather than the unclear K.

Ok.

 Same comment for a dev_dbg print in patch 14.

It's a debug statement, but ok.
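
As an aside on the alignment test in the hunk quoted above: OR-ing start
and size checks both at once, because for a power-of-two alignment any
low bit set in either operand survives the OR.  A Python sketch of the
same check, with a hypothetical helper name:

```python
SZ_4K = 4096


def mapping_is_aligned(start, size, align=SZ_4K):
    """True if both start and size are multiples of a power-of-two align."""
    # A low-order bit set in either value is still set after the OR,
    # so one modulo covers both operands.
    return (start | size) % align == 0
```

This only works because 4 KiB is a power of two; for other alignments
the two values would need to be tested separately.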

[..]

 Could this include nd in the name, like ndregion%d?

 The other dev_set_name calls in this patch set use:
 btt%d
 ndbus%d
 nmem%d
 namespace%d.%d

 which are a bit more distinctive.

They sit on an nd bus and don't have global device nodes, so I don't
see a need to make them any more distinctive.


Re: [GIT] Networking

2015-04-29 Thread Dan Williams
On Wed, 2015-04-29 at 17:17 +0200, D.S. Ljungmark wrote:
 On 29/04/15 16:51, Denys Vlasenko wrote:
  On Wed, Apr 1, 2015 at 9:48 PM, David Miller da...@davemloft.net wrote:
  D.S. Ljungmark (1):
ipv6: Don't reduce hop limit for an interface
  
  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6fd99094de2b83d1d4c8457f2c83483b2828e75a
  
  I was testing this change and apparently it doesn't close the hole.
  
  The python script I use to send RAs:
  
  #!/usr/bin/env python
  import sys
  import time
  import scapy.all
  from scapy.layers.inet6 import *
  ip = IPv6()
  # ip.dst = 'ff02::1'
  ip.dst = sys.argv[1]
  icmp = ICMPv6ND_RA()
  icmp.chlim = 1
  for x in range(10):
  send(ip/icmp)
  time.sleep(1)
  
  # ./ipv6-hop-limit.py fe80::21e:37ff:fed0:5006
  .
  Sent 1 packets.
  ...10 times...
  Sent 1 packets.
  
  After I do this, on the targeted machine I check hop_limits:
  
  # for f in /proc/sys/net/ipv6/conf/*/hop_limit; do echo -n $f:; cat $f; done
  /proc/sys/net/ipv6/conf/all/hop_limit:64
  /proc/sys/net/ipv6/conf/default/hop_limit:64
  /proc/sys/net/ipv6/conf/enp0s25/hop_limit:1  === THIS
  /proc/sys/net/ipv6/conf/lo/hop_limit:64
  /proc/sys/net/ipv6/conf/wlp3s0/hop_limit:64
  
  As you see, the interface which received RAs still lowered
  its hop_limit to 1. I take it means that the bug is still present
  (right? I'm not a network guy...).
 
 It might not be present in the _kernel_. Do you run NetworkManager on
 your system? If so, see below.
 
  
  I triple-checked that I do run the kernel with the fix.
  Further investigation shows that the code touched by the fix
  is not even reached, hop_limit is changed elsewhere.
  
  I'm willing to test additional patches.
 
 NetworkManager had its own re-implementation of the bug. It got fixed
 with NetworkManager commit:
 
 commit bdaaf9849b0cacf131b71fa2ae168f5db796874f
 Author: Thomas Haller thal...@redhat.com
 Date:   Wed Apr 8 15:54:30 2015 +0200
 
 platform: don't accept lowering IPv6 hop-limit from RA (CVE-2015-2924)
 
 
 
 Before that commit, NetworkManager would take the RA packet, extract
 the hop limit, and write it to the sysctl itself.

Yup, we basically followed the original kernel logic here, so we needed
to patch it in NM as well.  It's been backported to NM 0.9.10, 1.0, and
obviously is in git master.
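
A rough sketch of the policy both fixes describe, namely never letting an RA lower the hop limit.  This is not the kernel or NetworkManager code; the function name is hypothetical, and treating an advertised value of 0 as "unspecified" is an assumption taken from the Router Advertisement format:

```c
#include <stdint.h>

/* Hypothetical helper, not the kernel or NetworkManager implementation. */
static uint8_t next_hop_limit(uint8_t current, uint8_t advertised)
{
	/* 0 in an RA means "unspecified": keep the current value. */
	if (advertised == 0)
		return current;
	/* Only ever raise the limit; a lower advertised value is ignored. */
	return advertised > current ? advertised : current;
}
```

With this rule the chlim = 1 advertisements from the test script leave an interface at 64.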

Dan



Re: [Linux-nvdimm] [PATCH v2 11/20] libnd, nd_pmem: add libnd support to the pmem driver

2015-04-29 Thread Dan Williams
On Tue, Apr 28, 2015 at 3:58 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Tue, Apr 28, 2015 at 3:21 PM, Phil Pokorny
 ppoko...@penguincomputing.com wrote:
 On Tue, Apr 28, 2015 at 2:04 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Tue, Apr 28, 2015 at 11:25 AM, Dan Williams dan.j.willi...@intel.com 
 wrote:
[..]
 This is such a mess that I think this driver should maybe flat-out
 refuse to load in this type of configuration without some scary module
 option.  I have some NVDIMMs that report as type 12 but need two extra
 out-of-tree drivers to work safely.  First, they need i2c_imc or the
 equivalent (I'll try to resubmit that soon).  Second, they need secret
 magic NDAed register poking.  The latter is very problematic.

 At the very least, I think we should discourage people who don't
 really know what they're doing from using this driver without care.

The benefit of the type-12 experiment having not made it very far out
of the lab is that it may be feasible to whitelist known platforms
where we believe ADR is available.  Otherwise, the presence of the
NFIT asserts platform persistent memory support.


Re: [PATCH v2 20/20] libnd, nd_acpi, nd_blk: driver for BLK-mode access persistent memory

2015-04-29 Thread Dan Williams
On Tue, Apr 28, 2015 at 4:06 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Tue, Apr 28, 2015 at 3:30 PM, Dan Williams dan.j.willi...@intel.com 
 wrote:
 On Tue, Apr 28, 2015 at 2:10 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Tue, Apr 28, 2015 at 11:26 AM, Dan Williams dan.j.willi...@intel.com 
 wrote:
 From: Ross Zwisler ross.zwis...@linux.intel.com

 The libnd implementation handles allocating dimm address space (DPA)
 between PMEM and BLK mode interfaces.  After DPA has been allocated from
 a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
 as a struct bio based block device. Unlike PMEM, BLK is required to
 handle platform specific details like mmio register formats and memory
 controller interleave.  For this reason the libnd generic nd_blk driver
 calls back into the bus provider to carry out the I/O.

 This initial implementation handles the BLK interface defined by the
 ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
 DCR (dimm control region), BDW (block data window), IDT (interleave
 descriptor) NFIT structures and the hardware register format.
 [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
 [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

 Cc: Andy Lutomirski l...@amacapital.net
 Cc: Boaz Harrosh b...@plexistor.com
 Cc: H. Peter Anvin h...@zytor.com
 Cc: Jens Axboe ax...@fb.com
 Cc: Ingo Molnar mi...@kernel.org
 Cc: Christoph Hellwig h...@lst.de
 Signed-off-by: Ross Zwisler ross.zwis...@linux.intel.com
 Signed-off-by: Dan Williams dan.j.willi...@intel.com
 ---
  drivers/block/nd/Kconfig  |   12 +
  drivers/block/nd/Makefile |3
  drivers/block/nd/acpi.c   |  422 
 +++--
  drivers/block/nd/acpi_nfit.h  |   47 
  drivers/block/nd/blk.c|  264 +++
  drivers/block/nd/libnd.h  |   11 +
  drivers/block/nd/namespace_devs.c |   47 
  drivers/block/nd/nd-private.h |3
  drivers/block/nd/nd.h |   16 +
  drivers/block/nd/region.c |8 +
  drivers/block/nd/region_devs.c|   65 +-
  drivers/block/nd/test/nfit.c  |   29 +++
  drivers/block/nd/test/nfit_test.h |2
  13 files changed, 891 insertions(+), 38 deletions(-)
  create mode 100644 drivers/block/nd/blk.c

 diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
 index 612bf2b14283..bac4290129fc 100644
 --- a/drivers/block/nd/Kconfig
 +++ b/drivers/block/nd/Kconfig
 @@ -95,6 +95,18 @@ config BLK_DEV_PMEM

   Say Y if you want to use a NVDIMM described by ACPI, E820, etc...

 +config ND_BLK
 +   tristate BLK: Block data window (aperture) device support
 +   depends on LIBND
 +   default ND_ACPI
 +   help
 + This driver performs I/O using a set of mmio windows on a
 + dimm.  The set of apertures will all access the one DIMM.
 + Multiple windows allow multiple threads to have different
 + portions of the dimm open at one time.
 +
 + Say Y if you want to use a NVDIMM with BLK-mode capability
 +

 This describes how it works, not what it is.  How about:

 This driver exposes NVDIMM BLK regions as block devices.  BLK regions
 are regions of NVDIMM storage that are sector-addressable, not
 byte-addressable, and do not support DAX.

 They *are* byte-addressable albeit through an indirection window.  The
 indirection windows are too small for DAX to be a viable access mode.

 Right, I was assuming incorrectly that the sector-atomic thing was a
 necessary part of BLK, or at least of this implementation.

 Anyway, I think my point stands: let's describe what these drivers do
 from a user's perspective, not how they work.

Agreed.  How about:

Support NVDIMMs, or other devices, that implement a BLK-mode access
capability.  BLK-mode access uses memory-mapped-i/o apertures to
access persistent media.

Say Y if your platform firmware emits an ACPI.NFIT table
(CONFIG_ND_ACPI), or otherwise exposes BLK-mode capabilities.
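
To illustrate what "byte-addressable through an indirection window" means, here is a toy userspace model.  Everything in it is an assumption made for illustration (struct blk_dev, blk_read, the 8-byte aperture); it does not reflect the NFIT register format or the nd_blk driver:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define APERTURE_SIZE 8 /* deliberately tiny; real windows are far larger */

/* Hypothetical model of a device: backing media plus one sliding window. */
struct blk_dev {
	const uint8_t *media;
	size_t media_size;
	size_t window_base; /* media offset the aperture currently exposes */
};

/* Read len bytes at media offset off, moving the window as needed. */
static int blk_read(struct blk_dev *dev, size_t off, void *buf, size_t len)
{
	uint8_t *out = buf;

	if (off > dev->media_size || len > dev->media_size - off)
		return -1;
	while (len) {
		size_t in_win, chunk;

		/* "Program" the aperture to the window containing off. */
		dev->window_base = off - (off % APERTURE_SIZE);
		in_win = off - dev->window_base;
		chunk = APERTURE_SIZE - in_win;
		if (chunk > len)
			chunk = len;
		memcpy(out, dev->media + dev->window_base + in_win, chunk);
		out += chunk;
		off += chunk;
		len -= chunk;
	}
	return 0;
}
```

Any byte can be reached, but every access pays for repositioning the window, which is why such a region is handled as a block device and not mapped for DAX.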


Re: [GIT] Networking

2015-04-29 Thread Dan Williams
On Wed, 2015-04-29 at 18:55 +0200, D.S. Ljungmark wrote:
 
 On 29/04/15 18:50, Dan Williams wrote:
  On Wed, 2015-04-29 at 17:17 +0200, D.S. Ljungmark wrote:
  On 29/04/15 16:51, Denys Vlasenko wrote:
  On Wed, Apr 1, 2015 at 9:48 PM, David Miller da...@davemloft.net wrote:
  D.S. Ljungmark (1):
ipv6: Don't reduce hop limit for an interface
 
  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6fd99094de2b83d1d4c8457f2c83483b2828e75a
 
  I was testing this change and apparently it doesn't close the hole.
 
  The python script I use to send RAs:
 -
  #!/usr/bin/env python
  import sys
  import time
  import scapy.all
  from scapy.layers.inet6 import *
  ip = IPv6()
  # ip.dst = 'ff02::1'
  ip.dst = sys.argv[1]
  icmp = ICMPv6ND_RA()
  icmp.chlim = 1
  for x in range(10):
  send(ip/icmp)
  time.sleep(1)
 
  # ./ipv6-hop-limit.py fe80::21e:37ff:fed0:5006
  .
  Sent 1 packets.
  ...10 times...
  Sent 1 packets.
 
  After I do this, on the targeted machine I check hop_limits:
 
  # for f in /proc/sys/net/ipv6/conf/*/hop_limit; do echo -n $f:; cat $f; 
  done
  /proc/sys/net/ipv6/conf/all/hop_limit:64
  /proc/sys/net/ipv6/conf/default/hop_limit:64
  /proc/sys/net/ipv6/conf/enp0s25/hop_limit:1  === THIS
  /proc/sys/net/ipv6/conf/lo/hop_limit:64
  /proc/sys/net/ipv6/conf/wlp3s0/hop_limit:64
 
  As you see, the interface which received RAs still lowered
  its hop_limit to 1. I take it means that the bug is still present
  (right? I'm not a network guy...).
 
  It might not be present in the _kernel_. Do you run NetworkManager on
  your system? If so, see below.
 
 
  I triple-checked that I do run the kernel with the fix.
  Further investigation shows that the code touched by the fix
  is not even reached, hop_limit is changed elsewhere.
 
  I'm willing to test additional patches.
 
  NetworkManager had its own re-implementation of the bug. It got fixed
  with NetworkManager commit:
 
  commit bdaaf9849b0cacf131b71fa2ae168f5db796874f
  Author: Thomas Haller thal...@redhat.com
  Date:   Wed Apr 8 15:54:30 2015 +0200
 
  platform: don't accept lowering IPv6 hop-limit from RA (CVE-2015-2924)
 
 
 
  Before that commit, NetworkManager would take the RA packet, extract
  the hop limit, and write it to the sysctl itself.
  
  Yup, we basically followed the original kernel logic here, so we needed
  to patch it in NM as well.  It's been backported to NM 0.9.10, 1.0, and
  obviously is in git master.
  
 
 Are there any release announcements for NetworkManager? Or a place to
 link for official releases/homepage?

The mailing list:
https://mail.gnome.org/mailman/listinfo/networkmanager-list

The project site: https://wiki.gnome.org/Projects/NetworkManager

Dan



Re: [Linux-nvdimm] [PATCH v2 10/20] pmem: use ida

2015-04-29 Thread Dan Williams
On Wed, Apr 29, 2015 at 11:25 AM, Toshi Kani toshi.k...@hp.com wrote:
 Hi Dan,

 Thanks for the update.  This version of the patchset enumerates our NFIT
 table properly. :-)

 On Tue, 2015-04-28 at 14:25 -0400, Dan Williams wrote:
 In preparation for the pmem driver attaching to pmem-namespaces emitted
 by libnd, convert it to use an ida instead of an always increasing
 atomic index.  This provides a bit of stability to pmem device names in
 the presence of driver re-bind events.
   :
 @@ -122,20 +123,26 @@ static struct pmem_device *pmem_alloc(struct device 
 *dev, struct resource *res)
  {
   struct pmem_device *pmem;
   struct gendisk *disk;
 - int idx, err;
 + int err;

   err = -ENOMEM;
   pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
   if (!pmem)
   goto out;

 + pmem->id = ida_simple_get(&pmem_ida, 0, 0, GFP_KERNEL);

 nd_pmem_probe() is called asynchronously via async_schedule_domain().
 We have seen a case where the region#-pmem# binding becomes
 inconsistent across a reboot when there are 8 NVDIMM cards (reported by
 Robert Elliott).  This leads the user to access the wrong device.

 I think pmem id needs to be assigned before async_schedule_domain(), and
 cascaded to nd_pmem_probe().


I'll take a look at making this better, but it will never be
bulletproof.  For the same reason that root=UUID=uuid is preferred
over root=/dev/sda userspace should never rely on consistent pmem
device names from boot to boot.
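
A small sketch of why the names follow probe order.  It assumes an allocator that, like an ida, hands out the lowest free id; struct tiny_ida and its helpers are invented for this example and are not the kernel API:

```c
#include <stdint.h>

/* Hypothetical stand-in for an ida: a 32-entry bitmap of ids in use. */
struct tiny_ida {
	uint32_t used;
};

/* Return the lowest free id, or -1 when all 32 are taken. */
static int tiny_ida_get(struct tiny_ida *ida)
{
	for (int id = 0; id < 32; id++) {
		if (!(ida->used & (UINT32_C(1) << id))) {
			ida->used |= UINT32_C(1) << id;
			return id;
		}
	}
	return -1;
}

/* Release an id so a later caller can reuse it. */
static void tiny_ida_put(struct tiny_ida *ida, int id)
{
	if (id >= 0 && id < 32)
		ida->used &= ~(UINT32_C(1) << id);
}
```

Whichever probe calls tiny_ida_get() first becomes id 0, so with asynchronous probing the region-to-pmemN mapping is decided by scheduling and not by topology.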


Re: [Linux-nvdimm] [PATCH v2 10/20] pmem: use ida

2015-04-29 Thread Dan Williams
On Wed, Apr 29, 2015 at 1:49 PM, Linda Knippers linda.knipp...@hp.com wrote:
 On 4/29/2015 2:53 PM, Toshi Kani wrote:
 What's the right answer for this in the long run?

Short term, /dev/disk/by-uuid to take a stable identifier from the
contents of the device.

Longer term teach udev to populate /dev/disk/by-id with stable names
for libnd devices.  The trick is identifiers for interleaved PMEM
ranges comprised of multiple physical devices.  I'm thinking something
like /dev/disk/by-id/nd-set_cookie


Re: [PATCH v2 02/20] libnd, nd_acpi: initial libnd infrastructure and NFIT support

2015-05-01 Thread Dan Williams
On Thu, Apr 30, 2015 at 6:21 PM, Rafael J. Wysocki r...@rjwysocki.net wrote:
 On Thursday, April 30, 2015 05:39:06 PM Dan Williams wrote:
 On Thu, Apr 30, 2015 at 4:23 PM, Rafael J. Wysocki r...@rjwysocki.net 
 wrote:
[..]
  +if ND_DEVICES
  +
  +config LIBND
  + tristate LIBND: libnd device driver support
  + help
  +   Platform agnostic device model for a libnd bus.  Publishes
  +   resources for a PMEM (persistent-memory) driver and/or BLK
  +   (sliding mmio window(s)) driver to attach.  Exposes a device
  +   topology under a ndX bus device, a /dev/ndctlX bus-ioctl
  +   message passing interface, and a /dev/nmemX dimm-ioctl
  +   message interface for each memory device registered on the
  +   bus instance.  A userspace library ndctl provides an API
  +   to enumerate/manage this subsystem.
  +
  +config ND_ACPI
  + tristate ACPI: NFIT to libnd bus support
  + select LIBND
  + depends on ACPI
  + help
  +   Infrastructure to probe ACPI 6 compliant platforms for
  +   NVDIMMs (NFIT) and register a libnd device tree.  In
  +   addition to storage devices this also enables libnd to craft
  +   ACPI._DSM messages for platform/dimm configuration.
 
  I'm wondering if the two CONFIG options above really need to be 
  user-selectable?
 
  For example, what reason people (who've already selected ND_DEVICES) may 
  have
  for not selecting ND_ACPI if ACPI is set?


 Later on in the series we introduce ND_E820 which supports creating a
 libnd-bus from e820-type-12 memory ranges on pre-NFIT systems.  I'm
 also considering a configfs defined libnd-bus because e820 types are
 not nearly enough information to safely define nvdimm resources
 outside of NFIT.

 I hope these are not mutually exclusive with ND_ACPI?  Otherwise distros
 will have problems with supporting them in one kernel.

You can have ND_E820 support and ND_ACPI support in the same system.
Likely an NFIT enabled system will never have e820-type-12 ranges, but
if a user messes up and uses the new memmap=ss!nn command line to
overlap NFIT-defined memory then the request_mem_region() calls in the
driver will collide.  First to load wins in that scenario.
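
The "first to load wins" behaviour can be pictured with a toy claim table.  This is only a model of exclusive range reservation under assumed names (struct claim_table, claim_range); it is not request_mem_region() itself:

```c
#include <stdbool.h>
#include <stdint.h>

struct mem_range {
	uint64_t start;
	uint64_t size;
};

/* Hypothetical fixed-size table of ranges that have already been claimed. */
struct claim_table {
	struct mem_range r[8];
	int n;
};

/* Half-open interval overlap test: [start, start + size). */
static bool ranges_overlap(struct mem_range a, struct mem_range b)
{
	return a.start < b.start + b.size && b.start < a.start + a.size;
}

/* Claim a range; fail if it overlaps an earlier claim or the table is full. */
static bool claim_range(struct claim_table *t, struct mem_range want)
{
	for (int i = 0; i < t->n; i++)
		if (ranges_overlap(t->r[i], want))
			return false;
	if (t->n == 8)
		return false;
	t->r[t->n++] = want;
	return true;
}
```

If an NFIT-defined range is claimed first, a later claim for an overlapping memmap= range fails, and the reverse holds when the load order is swapped.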

 If ND_E820 and ND_ACPI aren't mutually exclusive, I still don't see a good
 enough reason for asking users about ND_ACPI.  Why would I ever say No
 here if I said Yes or Module to ND_DEVICES?

I agree that if the user selects ND_DEVICES then ND_ACPI should
probably default on, but otherwise turning it off is a useful option.
If you know your system is pre-ACPI-6 then why bother including
support?

  +
  +endif
  diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
  new file mode 100644
  index ..944b5947c0cb
  --- /dev/null
  +++ b/drivers/block/nd/Makefile
  @@ -0,0 +1,6 @@
  +obj-$(CONFIG_LIBND) += libnd.o
  +obj-$(CONFIG_ND_ACPI) += nd_acpi.o
  +
  +nd_acpi-y := acpi.o
  +
  +libnd-y := core.o
 
  OK, so it looks like no modules, just built-in code, right?
 

 Um, no, both CONFIG_ND_ACPI and CONFIG_LIBND can be =m.

 OK

 [cut]

  +static int nd_acpi_remove(struct acpi_device *adev)
  +{
  + struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(&adev->dev);
  +
  + nd_bus_unregister(acpi_desc->nd_bus);
  + return 0;
  +}
  +
  +static void nd_acpi_notify(struct acpi_device *adev, u32 event)
  +{
  + /* TODO: handle ACPI_NOTIFY_BUS_CHECK notification */
  + dev_dbg(&adev->dev, "%s: event: %d\n", __func__, event);
  +}
  +
  +static const struct acpi_device_id nd_acpi_ids[] = {
  + { "ACPI0012", 0 },
  + { "", 0 },
  +};
  +MODULE_DEVICE_TABLE(acpi, nd_acpi_ids);
  +
  +static struct acpi_driver nd_acpi_driver = {
  + .name = KBUILD_MODNAME,
  + .ids = nd_acpi_ids,
  + .flags = ACPI_DRIVER_ALL_NOTIFY_EVENTS,
  + .ops = {
  + .add = nd_acpi_add,
  + .remove = nd_acpi_remove,
  + .notify = nd_acpi_notify
  + },
  +};
 
  Since this is going to be non-modular built-in code, please use an ACPI
  scan handler instead of using a driver here.  acpi_memhotplug.c does that,
  you can use it as an example, but I guess you don't need to enable hotplug
  for it to start with.


 No, you misunderstood, this will certainly be modular and loaded on-demand.

 OK

 So please drop the .notify thing at least for now.  It most likely doesn't do
 what you need anyway.

The .notify handler will eventually be filled in to handle hot-add of
NFIT structures, but yes I'll drop it for now.


Re: [PATCH v2 02/20] libnd, nd_acpi: initial libnd infrastructure and NFIT support

2015-04-30 Thread Dan Williams
On Thu, Apr 30, 2015 at 4:23 PM, Rafael J. Wysocki r...@rjwysocki.net wrote:
 On Tuesday, April 28, 2015 02:24:23 PM Dan Williams wrote:
 1/ Autodetect an NFIT table for the ACPI namespace device with _HID of
ACPI0012

 2/ libnd bus registration

 The NFIT provided by ACPI is one possible method by which platforms will
 discover NVDIMM resources.  However, the intent of the nd_bus_descriptor
 abstraction is to abstract provider specific details, leaving libnd
 to be independent of the specific NVDIMM resource discovery mechanism.
 This flexibility is exploited later to implement custom-defined nd
 buses.

 Cc: linux-a...@vger.kernel.org
 Cc: Robert Moore robert.mo...@intel.com
 Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com
 Signed-off-by: Dan Williams dan.j.willi...@intel.com
 ---
  drivers/block/Kconfig |2
  drivers/block/Makefile|1
  drivers/block/nd/Kconfig  |   40 +++
  drivers/block/nd/Makefile |6 +
  drivers/block/nd/acpi.c   |  475 
 +
  drivers/block/nd/acpi_nfit.h  |  254 ++
  drivers/block/nd/core.c   |   67 ++
  drivers/block/nd/libnd.h  |   33 +++
  drivers/block/nd/nd-private.h |   23 ++
  9 files changed, 901 insertions(+)
  create mode 100644 drivers/block/nd/Kconfig
  create mode 100644 drivers/block/nd/Makefile
  create mode 100644 drivers/block/nd/acpi.c
  create mode 100644 drivers/block/nd/acpi_nfit.h
  create mode 100644 drivers/block/nd/core.c
  create mode 100644 drivers/block/nd/libnd.h
  create mode 100644 drivers/block/nd/nd-private.h

 diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
 index eb1fed5bd516..dfe40e5ca9bd 100644
 --- a/drivers/block/Kconfig
 +++ b/drivers/block/Kconfig
 @@ -321,6 +321,8 @@ config BLK_DEV_NVME
 To compile this driver as a module, choose M here: the
 module will be called nvme.

 +source drivers/block/nd/Kconfig
 +
  config BLK_DEV_SKD
   tristate STEC S1120 Block Driver
   depends on PCI
 diff --git a/drivers/block/Makefile b/drivers/block/Makefile
 index 9cc6c18a1c7e..07a6acecf4d8 100644
 --- a/drivers/block/Makefile
 +++ b/drivers/block/Makefile
 @@ -24,6 +24,7 @@ obj-$(CONFIG_CDROM_PKTCDVD) += pktcdvd.o
  obj-$(CONFIG_MG_DISK)+= mg_disk.o
  obj-$(CONFIG_SUNVDC) += sunvdc.o
  obj-$(CONFIG_BLK_DEV_NVME)   += nvme.o
 +obj-$(CONFIG_ND_DEVICES) += nd/
  obj-$(CONFIG_BLK_DEV_SKD)+= skd.o
  obj-$(CONFIG_BLK_DEV_OSD)+= osdblk.o

 diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
 new file mode 100644
 index ..6d5d6b732f82
 --- /dev/null
 +++ b/drivers/block/nd/Kconfig
 @@ -0,0 +1,40 @@
 +menuconfig ND_DEVICES
 + bool NVDIMM Support
 + depends on PHYS_ADDR_T_64BIT
 + help
 +   Generic support for non-volatile memory devices including
 +   ACPI-6-NFIT defined resources.  On platforms that define an
 +   NFIT, or otherwise can discover NVDIMM resources, a libnd
 +   bus is registered to advertise PMEM (persistent memory)
 +   namespaces (/dev/pmemX) and BLK (sliding mmio window(s))
 +   namespaces (/dev/ndX). A PMEM namespace refers to a memory
 +   resource that may span multiple DIMMs and support DAX (see
 +   CONFIG_DAX).  A BLK namespace refers to an NVDIMM control
 +   region which exposes an mmio register set for windowed
 +   access mode to non-volatile memory.
 +
 +if ND_DEVICES
 +
 +config LIBND
 + tristate LIBND: libnd device driver support
 + help
 +   Platform agnostic device model for a libnd bus.  Publishes
 +   resources for a PMEM (persistent-memory) driver and/or BLK
 +   (sliding mmio window(s)) driver to attach.  Exposes a device
 +   topology under a ndX bus device, a /dev/ndctlX bus-ioctl
 +   message passing interface, and a /dev/nmemX dimm-ioctl
 +   message interface for each memory device registered on the
 +   bus instance.  A userspace library ndctl provides an API
 +   to enumerate/manage this subsystem.
 +
 +config ND_ACPI
 + tristate ACPI: NFIT to libnd bus support
 + select LIBND
 + depends on ACPI
 + help
 +   Infrastructure to probe ACPI 6 compliant platforms for
 +   NVDIMMs (NFIT) and register a libnd device tree.  In
 +   addition to storage devices this also enables libnd to craft
 +   ACPI._DSM messages for platform/dimm configuration.

 I'm wondering if the two CONFIG options above really need to be 
 user-selectable?

 For example, what reason people (who've already selected ND_DEVICES) may have
 for not selecting ND_ACPI if ACPI is set?


Later on in the series we introduce ND_E820 which supports creating a
libnd-bus from e820-type-12 memory ranges on pre-NFIT systems.  I'm
also considering a configfs defined libnd-bus because e820 types are
not nearly enough information to safely define nvdimm resources
outside of NFIT.


 +
 +endif
 diff --git a/drivers/block/nd/Makefile b

Re: [Linux-nvdimm] [PATCH v2 05/20] libnd, nd_acpi: dimm/memory-devices

2015-05-01 Thread Dan Williams
On Fri, May 1, 2015 at 12:15 PM, Toshi Kani toshi.k...@hp.com wrote:
 On Fri, 2015-05-01 at 11:43 -0700, Dan Williams wrote:
 On Fri, May 1, 2015 at 11:19 AM, Toshi Kani toshi.k...@hp.com wrote:
  On Fri, 2015-05-01 at 11:22 -0700, Dan Williams wrote:
  On Fri, May 1, 2015 at 10:48 AM, Toshi Kani toshi.k...@hp.com wrote:
   On Tue, 2015-04-28 at 14:24 -0400, Dan Williams wrote:
   Register the memory devices described in the nfit as libnd 'dimm'
   devices on an nd bus.  The kernel assigned device id for dimms is
   dynamic.  If userspace needs a more static identifier it should consult
   a provider-specific attribute.  In the case where NFIT is the provider,
   the 'nmemX/nfit/handle' or 'nmemX/nfit/serial' attributes may be used
   for this purpose.
:
   +
   +static int nd_acpi_register_dimms(struct acpi_nfit_desc *acpi_desc)
   +{
   + struct nfit_mem *nfit_mem;
   +
   + list_for_each_entry(nfit_mem, &acpi_desc->dimms, list) {
   + struct nd_dimm *nd_dimm;
   + unsigned long flags = 0;
   + u32 nfit_handle;
   +
   + nfit_handle = __to_nfit_memdev(nfit_mem)->nfit_handle;
   + nd_dimm = nd_acpi_dimm_by_handle(acpi_desc, nfit_handle);
   + if (nd_dimm) {
   + /*
   +  * If for some reason we find multiple DCRs the
   +  * first one wins
   +  */
   + dev_err(acpi_desc->dev, "duplicate DCR detected: %s\n",
   + nd_dimm_name(nd_dimm));
   + continue;
   + }
   +
   + if (nfit_mem->bdw && nfit_mem->memdev_pmem)
   + flags |= NDD_ALIASING;
  
   Does this check work for a NVDIMM card which has multiple pmem regions
   with label info, but does not have any bdw region configured?
 
  If you have multiple pmem regions then you don't have aliasing and
  don't need a label.  You'll get an nd_namespace_io per region.
 
   The code assumes that namespace_pmem (NDD_ALIASING) and namespace_blk
   have label info.  There may be an NVDIMM card with a single blk region
   without label info.
 
  I'd really like to suggest that labels are only for resolving aliasing
  and that if you have a BLK-only NVDIMM you'll get an automatic
  namespace created the same as a PMEM-only.  Partitioning is always
  there to provide sub-divisions of a namespace.  The only reason to
  support multiple BLK-namespaces per-region is to give each a different
  sector size.  I may eventually need to relent on this position, but
  I'd really like to understand the use case for requiring labels when
  aliasing is not present as it seems like a waste to me.
 
  By looking at the callers of is_namespace_pmem() and is_namespace_blk(),
  such as nd_namespace_label_update(), I am concerned that the namespace
  types are also used for indicating the presence of a label.  Is it OK for
  nd_namespace_label_update() to do nothing when there is no aliasing?

 Did you forget to answer this question?  I am not asking to have a
 label.  I am asking if the namespace types can handle it correctly.
 Restating the nd_namespace_label_update() example:
  - namespace_io case: Skip, but a label may still exist. Correct?
  - namespace_blk case: Proceed, but blk does not require a label.

Ah, ok.  This is handled by nd_namespace_attr_visible(): only labelled
namespaces have writable sysfs attributes.  This would need to be
extended for a label-less BLK namespace type.
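
As a sketch of that visibility rule, assuming the labelled types are namespace_pmem and namespace_blk as discussed in this thread.  The enum, helper names and mode values are illustrative and are not taken from the patch:

```c
#include <stdbool.h>

/* Hypothetical namespace kinds mirroring the three types in this thread. */
enum ns_type { NS_IO, NS_PMEM, NS_BLK };

/* In the scheme under discussion only pmem and blk namespaces carry labels. */
static bool ns_is_labelled(enum ns_type type)
{
	return type == NS_PMEM || type == NS_BLK;
}

/* Attribute mode: writable when a label backs the value, else read-only. */
static unsigned int ns_attr_mode(enum ns_type type)
{
	return ns_is_labelled(type) ? 0644 : 0444;
}
```

A label-less BLK namespace would need a fourth case here that behaves like NS_IO.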


Re: [Linux-nvdimm] [PATCH v2 05/20] libnd, nd_acpi: dimm/memory-devices

2015-05-01 Thread Dan Williams
On Fri, May 1, 2015 at 10:48 AM, Toshi Kani toshi.k...@hp.com wrote:
 On Tue, 2015-04-28 at 14:24 -0400, Dan Williams wrote:
 Register the memory devices described in the nfit as libnd 'dimm'
 devices on an nd bus.  The kernel assigned device id for dimms is
 dynamic.  If userspace needs a more static identifier it should consult
 a provider-specific attribute.  In the case where NFIT is the provider,
 the 'nmemX/nfit/handle' or 'nmemX/nfit/serial' attributes may be used
 for this purpose.
  :
 +
 +static int nd_acpi_register_dimms(struct acpi_nfit_desc *acpi_desc)
 +{
 + struct nfit_mem *nfit_mem;
 +
 + list_for_each_entry(nfit_mem, &acpi_desc->dimms, list) {
 + struct nd_dimm *nd_dimm;
 + unsigned long flags = 0;
 + u32 nfit_handle;
 +
 + nfit_handle = __to_nfit_memdev(nfit_mem)->nfit_handle;
 + nd_dimm = nd_acpi_dimm_by_handle(acpi_desc, nfit_handle);
 + if (nd_dimm) {
 + /*
 +  * If for some reason we find multiple DCRs the
 +  * first one wins
 +  */
 + dev_err(acpi_desc->dev, "duplicate DCR detected: %s\n",
 + nd_dimm_name(nd_dimm));
 + continue;
 + }
 +
 + if (nfit_mem->bdw && nfit_mem->memdev_pmem)
 + flags |= NDD_ALIASING;

 Does this check work for a NVDIMM card which has multiple pmem regions
 with label info, but does not have any bdw region configured?

If you have multiple pmem regions then you don't have aliasing and
don't need a label.  You'll get an nd_namespace_io per region.
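
Put as a standalone sketch, the aliasing decision in the quoted hunk reduces to the following.  The struct and function names are hypothetical, and the bit value given to NDD_ALIASING is chosen for this example only:

```c
#include <stdbool.h>

#define NDD_ALIASING (1UL << 0) /* bit position chosen for this sketch only */

/* Hypothetical summary of what the NFIT reports for one dimm. */
struct dimm_caps {
	bool has_bdw;         /* a block data window is described */
	bool has_pmem_memdev; /* the dimm also contributes to a PMEM range */
};

/* Aliasing needs both interfaces; either one alone leaves flags clear. */
static unsigned long dimm_flags(struct dimm_caps caps)
{
	unsigned long flags = 0;

	if (caps.has_bdw && caps.has_pmem_memdev)
		flags |= NDD_ALIASING;
	return flags;
}
```

A PMEM-only dimm, however many regions it spans, never sets the flag and so never needs a label to resolve aliasing.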

 The code assumes that namespace_pmem (NDD_ALIASING) and namespace_blk
 have label info.  There may be an NVDIMM card with a single blk region
 without label info.

I'd really like to suggest that labels are only for resolving aliasing
and that if you have a BLK-only NVDIMM you'll get an automatic
namespace created the same as a PMEM-only.  Partitioning is always
there to provide sub-divisions of a namespace.  The only reason to
support multiple BLK-namespaces per-region is to give each a different
sector size.  I may eventually need to relent on this position, but
I'd really like to understand the use case for requiring labels when
aliasing is not present as it seems like a waste to me.

 Instead of using the namespace types to assume the label info, how about
 adding a flag to indicate the presence of the label info?  This avoids
 the separation of namespace_io and namespace_pmem for the same pmem
 driver.

To what benefit?


Re: [Linux-nvdimm] [PATCH v2 05/20] libnd, nd_acpi: dimm/memory-devices

2015-05-01 Thread Dan Williams
On Fri, May 1, 2015 at 11:19 AM, Toshi Kani toshi.k...@hp.com wrote:
 On Fri, 2015-05-01 at 11:22 -0700, Dan Williams wrote:
 On Fri, May 1, 2015 at 10:48 AM, Toshi Kani toshi.k...@hp.com wrote:
  On Tue, 2015-04-28 at 14:24 -0400, Dan Williams wrote:
  Register the memory devices described in the nfit as libnd 'dimm'
  devices on an nd bus.  The kernel assigned device id for dimms is
  dynamic.  If userspace needs a more static identifier it should consult
  a provider-specific attribute.  In the case where NFIT is the provider,
  the 'nmemX/nfit/handle' or 'nmemX/nfit/serial' attributes may be used
  for this purpose.
   :
  +
  +static int nd_acpi_register_dimms(struct acpi_nfit_desc *acpi_desc)
  +{
  + struct nfit_mem *nfit_mem;
  +
  + list_for_each_entry(nfit_mem, &acpi_desc->dimms, list) {
  + struct nd_dimm *nd_dimm;
  + unsigned long flags = 0;
  + u32 nfit_handle;
  +
  + nfit_handle = __to_nfit_memdev(nfit_mem)->nfit_handle;
  + nd_dimm = nd_acpi_dimm_by_handle(acpi_desc, nfit_handle);
  + if (nd_dimm) {
  + /*
  +  * If for some reason we find multiple DCRs the
  +  * first one wins
  +  */
  + dev_err(acpi_desc->dev, "duplicate DCR detected: %s\n",
  + nd_dimm_name(nd_dimm));
  + continue;
  + }
  +
  + if (nfit_mem->bdw && nfit_mem->memdev_pmem)
  + flags |= NDD_ALIASING;
 
  Does this check work for a NVDIMM card which has multiple pmem regions
  with label info, but does not have any bdw region configured?

 If you have multiple pmem regions then you don't have aliasing and
 don't need a label.  You'll get an nd_namespace_io per region.

  The code assumes that namespace_pmem (NDD_ALIASING) and namespace_blk
  have label info.  There may be an NVDIMM card with a single blk region
  without label info.

 I'd really like to suggest that labels are only for resolving aliasing
 and that if you have a BLK-only NVDIMM you'll get an automatic
 namespace created the same as a PMEM-only.  Partitioning is always
 there to provide sub-divisions of a namespace.  The only reason to
 support multiple BLK-namespaces per-region is to give each a different
 sector size.  I may eventually need to relent on this position, but
 I'd really like to understand the use case for requiring labels when
 aliasing is not present as it seems like a waste to me.

 By looking at the callers of is_namespace_pmem() and is_namespace_blk(),
 such as nd_namespace_label_update(), I am concerned that the namespace
 types are also used for indicating the presence of a label.  Is it OK for
 nd_namespace_label_update() to do nothing when there is no aliasing?

  Instead of using the namespace types to assume the label info, how about
  adding a flag to indicate the presence of the label info?  This avoids
  the separation of namespace_io and namespace_pmem for the same pmem
  driver.

 To what benefit?

 Why do they need to be separated? Having alias or not should not make
 the pmem namespace different.

The intent is to maximize the number of devices that can be
immediately attached to nd_pmem and nd_blk without user intervention.
nd_namespace_io is a pmem namespace where the boundaries are 100%
described by the NFIT / parent-region.


[PATCH v2 06/20] libnd: ndctl.h, the nd ioctl abi

2015-04-28 Thread Dan Williams
Most configuration of the nd-subsystem is done via nd-sysfs attributes.
However, some nd buses, particularly the ACPI.NFIT bus, define a small
set of messages that can be passed to the platform.  For convenience we
derive the initial nd-ioctl-command formats directly from the NFIT DSM
formats.

ND_CMD_SMART: media health and diagnostics
ND_CMD_GET_CONFIG_SIZE: size of the label space
ND_CMD_GET_CONFIG_DATA: read label space
ND_CMD_SET_CONFIG_DATA: write label space
ND_CMD_VENDOR: vendor-specific command passthrough
ND_CMD_ARS_CAP: report address-range-scrubbing capabilities
ND_CMD_START_ARS: initiate scrubbing
ND_CMD_QUERY_ARS: report on scrubbing state
ND_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

If a platform later defines different commands than this set it is
straightforward to extend support to those formats.

Most of the commands target a specific dimm.  However, the
address-range-scrubbing commands target the bus.  The 'commands'
attribute in sysfs of an nd-bus or an nd-dimm enumerates the supported
commands for that object.
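
As a rough illustration of how userspace might consume the 'commands'
attribute, the sketch below checks whether a command name appears in the
text read from it.  It assumes the attribute is a whitespace-separated
list of names; that format, the example name "get_size", and the helper
nd_commands_has() are hypothetical and not taken from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true when 'name' appears as a whole token
 * in 'attr', the text read from a 'commands' sysfs attribute.  Assumes
 * tokens are separated by spaces, tabs or newlines (an assumption, the
 * patch description does not spell out the format).
 */
bool nd_commands_has(const char *attr, const char *name)
{
	static const char sep[] = " \t\n";
	size_t name_len;
	const char *p = attr;

	assert(attr && name);
	name_len = strlen(name);
	if (name_len == 0)
		return false;

	while (*p) {
		size_t tok_len;

		p += strspn(p, sep);		/* skip separators */
		tok_len = strcspn(p, sep);	/* length of the next token */
		if (tok_len == name_len && strncmp(p, name, name_len) == 0)
			return true;
		p += tok_len;
	}
	return false;
}
```

Matching whole tokens rather than substrings keeps a short name from
matching a longer one that merely starts with the same characters.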

Cc: linux-a...@vger.kernel.org
Cc: Robert Moore robert.mo...@intel.com
Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com
Reported-by: Nicholas Moulin nicholas.w.mou...@linux.intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Kconfig  |   12 ++
 drivers/block/nd/acpi.c   |  237 ++
 drivers/block/nd/acpi_nfit.h  |3 
 drivers/block/nd/bus.c|  324 -
 drivers/block/nd/core.c   |   16 ++
 drivers/block/nd/dimm_devs.c  |   38 -
 drivers/block/nd/libnd.h  |   25 +++
 drivers/block/nd/nd-private.h |3 
 drivers/block/nd/test/nfit.c  |   78 ++
 include/uapi/linux/Kbuild |1 
 include/uapi/linux/ndctl.h|  178 +++
 11 files changed, 903 insertions(+), 12 deletions(-)
 create mode 100644 include/uapi/linux/ndctl.h

diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 09f0135147ca..d2d84451e82c 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -37,6 +37,18 @@ config ND_ACPI
  addition to storage devices this also enables libnd craft
  ACPI._DSM messages for platform/dimm configuration.
 
+config ND_ACPI_DEBUG
+   bool "ACPI: Extra nd_acpi debugging"
+   depends on ND_ACPI
+   depends on DYNAMIC_DEBUG
+   default n
+   help
+ Enabling this option causes the nd_acpi driver to dump the
+ input and output buffers of _DSM operations on the ACPI0012
+ device and its children.  This can be very verbose, so leave
+ it disabled unless you are debugging a hardware / firmware
+ issue.
+
 config NFIT_TEST
tristate "NFIT TEST: Manufactured NFIT for interface testing"
depends on DMA_CMA
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index af6684341c9b..c46e166695f7 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -12,6 +12,7 @@
  */
 #include <linux/list_sort.h>
 #include <linux/module.h>
+#include <linux/ndctl.h>
 #include <linux/list.h>
 #include <linux/acpi.h>
 #include "acpi_nfit.h"
@@ -25,11 +26,160 @@ enum {
NFIT_ACPI_NOTIFY_TABLE = 0x80,
 };
 
+static u8 nd_acpi_uuids[2][16]; /* initialized at nd_acpi_init */
+
+static u8 *nd_acpi_bus_uuid(void)
+{
+   return nd_acpi_uuids[0];
+}
+
+static u8 *nd_acpi_dimm_uuid(void)
+{
+   return nd_acpi_uuids[1];
+}
+
+static struct acpi_nfit_desc *to_acpi_nfit_desc(struct nd_bus_descriptor *nd_desc)
+{
+   return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
+}
+
+static struct acpi_device *to_acpi_dev(struct acpi_nfit_desc *acpi_desc)
+{
+   struct nd_bus_descriptor *nd_desc = &acpi_desc->nd_desc;
+
+   /*
+* If provider == 'ACPI.NFIT' we can assume 'dev' is a struct
+* acpi_device.
+*/
+   if (!nd_desc->provider_name
+   || strcmp(nd_desc->provider_name, "ACPI.NFIT") != 0)
+   return NULL;
+
+   return to_acpi_device(acpi_desc->dev);
+}
+
 static int nd_acpi_ctl(struct nd_bus_descriptor *nd_desc,
struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
unsigned int buf_len)
 {
-   return -ENOTTY;
+   struct acpi_nfit_desc *acpi_desc = to_acpi_nfit_desc(nd_desc);
+   const struct nd_cmd_desc const *desc = NULL;
+   union acpi_object in_obj, in_buf, *out_obj;
+   struct device *dev = acpi_desc->dev;
+   const char *cmd_name, *dimm_name;
+   unsigned long dsm_mask;
+   acpi_handle handle;
+   u32 offset;
+   int rc, i;
+   u8 *uuid;
+
+   if (nd_dimm) {
+   struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+   struct acpi_device *adev = nfit_mem->adev;
+
+   if (!adev)
+   return -ENOTTY;
+   dimm_name = dev_name(&adev->dev

[PATCH v2 09/20] libnd: support for legacy (non-aliasing) nvdimms

2015-04-28 Thread Dan Williams
The libnd region driver is an intermediary driver that translates
non-volatile regions into namespace sub-devices that are surfaced by
persistent memory block-device drivers (PMEM and BLK).

ACPI 6 introduces the concept that a given nvdimm may offer multiple
access modes to its media through either direct PMEM load/store access,
or windowed BLK mode.  Existing nvdimms mostly implement a PMEM
interface, some offer a BLK-like mode, but never both.  If an nvdimm is
single interfaced, then there is no need for dimm metadata labels.  For
these devices we can take the region boundaries directly to create a
child namespace device (nd_namespace_io).
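
A minimal sketch of that rule, assuming a simplified region with a single
'aliasing' flag: when no aliasing is present the namespace simply inherits
the region boundaries, otherwise the boundaries have to come from labels.
The toy_* types and function are hypothetical and are not the structures
added by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for a region and an io-namespace. */
struct toy_region {
	uint64_t start;		/* first byte of the region */
	uint64_t size;		/* region length in bytes */
	bool aliasing;		/* true when PMEM and BLK share dimm capacity */
};

struct toy_namespace_io {
	uint64_t start;
	uint64_t size;
};

/*
 * Returns 0 and fills 'ns' from the region boundaries when no labels are
 * needed, or -1 when the region aliases and labels must be consulted.
 */
int toy_region_to_namespace_io(const struct toy_region *r,
		struct toy_namespace_io *ns)
{
	assert(r && ns);
	if (r->aliasing)
		return -1;	/* boundaries must come from labels instead */
	ns->start = r->start;
	ns->size = r->size;
	return 0;
}
```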

Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Makefile |2 +
 drivers/block/nd/acpi.c   |1 
 drivers/block/nd/bus.c|   26 +
 drivers/block/nd/core.c   |   13 +++-
 drivers/block/nd/dimm.c   |2 -
 drivers/block/nd/libnd.h  |6 +-
 drivers/block/nd/namespace_devs.c |  111 +
 drivers/block/nd/nd-private.h |9 ++-
 drivers/block/nd/nd.h |8 +++
 drivers/block/nd/region.c |   88 +
 drivers/block/nd/region_devs.c|   61 
 include/linux/nd.h|   10 +++
 include/uapi/linux/ndctl.h|   10 +++
 13 files changed, 338 insertions(+), 9 deletions(-)
 create mode 100644 drivers/block/nd/namespace_devs.c
 create mode 100644 drivers/block/nd/region.c

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 6010469c4d4c..0fb0891e1817 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -23,3 +23,5 @@ libnd-y += bus.o
 libnd-y += dimm_devs.o
 libnd-y += dimm.o
 libnd-y += region_devs.o
+libnd-y += region.o
+libnd-y += namespace_devs.o
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index 41d0bb732b3e..c3dda74f73d7 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -774,6 +774,7 @@ static struct attribute_group nd_acpi_region_attribute_group = {
 static const struct attribute_group *nd_acpi_region_attribute_groups[] = {
&nd_region_attribute_group,
&nd_mapping_attribute_group,
+   &nd_device_attribute_group,
&nd_acpi_region_attribute_group,
NULL,
 };
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 7bd79f30e5e7..46568d182559 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -13,6 +13,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/vmalloc.h>
 #include <linux/uaccess.h>
+#include <linux/module.h>
 #include <linux/fcntl.h>
 #include <linux/async.h>
 #include <linux/ndctl.h>
@@ -33,6 +34,12 @@ static int to_nd_device_type(struct device *dev)
 {
if (is_nd_dimm(dev))
return ND_DEVICE_DIMM;
+   else if (is_nd_pmem(dev))
+   return ND_DEVICE_REGION_PMEM;
+   else if (is_nd_blk(dev))
+   return ND_DEVICE_REGION_BLK;
+   else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
+   return nd_region_to_namespace_type(to_nd_region(dev->parent));
 
return 0;
 }
@@ -50,27 +57,46 @@ static int nd_bus_match(struct device *dev, struct device_driver *drv)
return test_bit(to_nd_device_type(dev), &nd_drv->type);
 }
 
+static struct module *to_bus_provider(struct device *dev)
+{
+   /* pin bus providers while regions are enabled */
+   if (is_nd_pmem(dev) || is_nd_blk(dev)) {
+   struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+   return nd_bus->module;
+   }
+   return NULL;
+}
+
 static int nd_bus_probe(struct device *dev)
 {
struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+   struct module *provider = to_bus_provider(dev);
struct nd_bus *nd_bus = walk_to_nd_bus(dev);
int rc;
 
+   if (!try_module_get(provider))
+   return -ENXIO;
+
rc = nd_drv->probe(dev);
dev_dbg(&nd_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
dev_name(dev), rc);
+   if (rc != 0)
+   module_put(provider);
return rc;
 }
 
 static int nd_bus_remove(struct device *dev)
 {
struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+   struct module *provider = to_bus_provider(dev);
struct nd_bus *nd_bus = walk_to_nd_bus(dev);
int rc;
 
rc = nd_drv->remove(dev);
dev_dbg(&nd_bus->dev, "%s.remove(%s) = %d\n", dev->driver->name,
dev_name(dev), rc);
+   module_put(provider);
return rc;
 }
 
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index d8d1c9cb3f16..646e424ae36c 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -133,8 +133,8 @@ struct attribute_group nd_bus_attribute_group = {
 };
 EXPORT_SYMBOL_GPL(nd_bus_attribute_group);
 
-struct nd_bus *nd_bus_register(struct device *parent

[PATCH v2 04/20] libnd: ndctl class device, and nd bus attributes

2015-04-28 Thread Dan Williams
This is the position (device topology) independent method to find all
the libnd buses in the system.  The expectation is that there will only
ever be one nd bus discovered via /sys/class/nd/ndctl0.  However, we
allow for the possibility of multiple buses and they will be listed in
discovery order as ndctl0...ndctlN.  This character device hosts the
ioctl for passing control messages (inspired by the ACPI-NFIT DSM
interface commands).
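
To make the discovery step concrete, here is a small userspace sketch
that recognizes "ndctlN" entry names and extracts N.  The helper name
nd_ctl_index() is hypothetical; only the ndctl0...ndctlN naming comes
from the description above.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: given a directory entry name found under
 * /sys/class/nd, return the bus index N for names of the form "ndctlN",
 * or -1 when the name does not match.
 */
int nd_ctl_index(const char *name)
{
	static const char prefix[] = "ndctl";
	size_t plen = sizeof(prefix) - 1;
	char *end;
	long val;

	assert(name);
	if (strncmp(name, prefix, plen) != 0)
		return -1;
	if (!isdigit((unsigned char)name[plen]))
		return -1;	/* rejects "ndctl", "ndctl-1" and "ndctl 1" */

	errno = 0;
	val = strtol(name + plen, &end, 10);
	if (errno || *end != '\0' || val > INT_MAX)
		return -1;
	return (int)val;
}
```

A caller would typically apply this to each name returned by readdir()
on /sys/class/nd and open the matching character device for ioctls.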

Note, nd_ioctl() and the backing ->ndctl() implementation are defined in
a subsequent patch.

Cc: Neil Brown ne...@suse.de
Cc: Greg KH gre...@linuxfoundation.org
Cc: linux-a...@vger.kernel.org
Cc: Robert Moore robert.mo...@intel.com
Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Makefile |1 
 drivers/block/nd/acpi.c   |   29 ++
 drivers/block/nd/acpi_nfit.h  |5 ++
 drivers/block/nd/bus.c|   83 +++
 drivers/block/nd/core.c   |   87 -
 drivers/block/nd/libnd.h  |5 ++
 drivers/block/nd/nd-private.h |6 +++
 drivers/block/nd/test/nfit.c  |3 +
 8 files changed, 217 insertions(+), 2 deletions(-)
 create mode 100644 drivers/block/nd/bus.c

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index cf064db92589..7defe18ed009 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_NFIT_TEST) += test/
 nd_acpi-y := acpi.o
 
 libnd-y := core.o
+libnd-y += bus.o
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index 54344ef9c837..dd8505f766ed 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -341,6 +341,34 @@ static int nfit_mem_init(struct acpi_nfit_desc *acpi_desc)
return 0;
 }
 
+static ssize_t revision_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct nd_bus *nd_bus = to_nd_bus(dev);
+   struct nd_bus_descriptor *nd_desc = to_nd_desc(nd_bus);
+   struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+
+   return sprintf(buf, "%d\n", acpi_desc->nfit->revision);
+}
+static DEVICE_ATTR_RO(revision);
+
+static struct attribute *nd_acpi_attributes[] = {
+   &dev_attr_revision.attr,
+   NULL,
+};
+
+static struct attribute_group nd_acpi_attribute_group = {
+   .name = "nfit",
+   .attrs = nd_acpi_attributes,
+};
+
+const struct attribute_group *nd_acpi_attribute_groups[] = {
+   &nd_bus_attribute_group,
+   &nd_acpi_attribute_group,
+   NULL,
+};
+EXPORT_SYMBOL_GPL(nd_acpi_attribute_groups);
+
 int nd_acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 {
struct device *dev = acpi_desc->dev;
@@ -408,6 +436,7 @@ static int nd_acpi_add(struct acpi_device *adev)
nd_desc = &acpi_desc->nd_desc;
nd_desc->provider_name = "ACPI.NFIT";
nd_desc->ndctl = nd_acpi_ctl;
+   nd_desc->attr_groups = nd_acpi_attribute_groups;
 
acpi_desc->nd_bus = nd_bus_register(dev, nd_desc);
if (!acpi_desc->nd_bus)
diff --git a/drivers/block/nd/acpi_nfit.h b/drivers/block/nd/acpi_nfit.h
index a26f69e32244..b65745ca3cbc 100644
--- a/drivers/block/nd/acpi_nfit.h
+++ b/drivers/block/nd/acpi_nfit.h
@@ -261,5 +261,10 @@ static inline struct acpi_nfit_memdev *__to_nfit_memdev(struct nfit_mem *nfit_me
return nfit_mem->memdev_pmem;
 }
 
+static inline struct acpi_nfit_desc *to_acpi_desc(struct nd_bus_descriptor *nd_desc)
+{
+   return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
+}
+
 int nd_acpi_nfit_init(struct acpi_nfit_desc *nfit, acpi_size sz);
 #endif /* __NFIT_H__ */
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
new file mode 100644
index ..635f2e926426
--- /dev/null
+++ b/drivers/block/nd/bus.c
@@ -0,0 +1,83 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/uaccess.h>
+#include <linux/fcntl.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/io.h>
+#include "nd-private.h"
+
+static int nd_bus_major;
+static struct class *nd_class;
+
+int nd_bus_create_ndctl(struct nd_bus *nd_bus)
+{
+   dev_t devt = MKDEV(nd_bus_major, nd_bus->id);
+   struct device *dev;
+
+   dev = device_create(nd_class, &nd_bus->dev, devt, nd_bus, "ndctl%d",
+   nd_bus->id);
+
+   if (IS_ERR(dev)) {
+   dev_dbg(&nd_bus->dev, "failed to register ndctl%d: %ld\n

[PATCH v2 16/20] libnd: write pmem label set

2015-04-28 Thread Dan Williams
After 'uuid', 'size', and optionally 'alt_name' have been set to valid
values the labels on the dimms can be updated.

Write procedure is:
1/ Allocate and write new labels in the next index
2/ Free the old labels in the working copy
3/ Write the bitmap and the label space on the dimm
4/ Write the index to make the update valid

Label ranges directly mirror the dpa resource values for the given
label_id of the namespace.
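
The ordering of the four steps can be sketched with a toy in-memory
model.  This is an illustration under simplifying assumptions (two index
copies, an 8-slot bitmap, one word per label, an explicit 'current'
field); it does not reproduce the on-media format or the code in
label.c, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NSLOT 8

/* Toy model only: real index blocks and labels carry far more state. */
struct toy_index {
	uint32_t seq;		/* the newer index has the higher number */
	uint8_t free;		/* bit n set => label slot n is free */
};

struct toy_dimm {
	struct toy_index index[2];	/* two index copies on the "media" */
	uint32_t label[TOY_NSLOT];	/* label space, one word per label */
	int current;			/* which index copy is valid */
};

/*
 * Replace the label in 'old_slot' with 'new_label', following the four
 * steps of the write procedure.  Returns the slot that now holds the
 * label, or -1 when no free slot exists.
 */
int toy_label_update(struct toy_dimm *d, int old_slot, uint32_t new_label)
{
	const struct toy_index *cur = &d->index[d->current];
	int next = 1 - d->current;
	struct toy_index work = *cur;	/* working copy of the index */
	int slot;

	assert(old_slot >= 0 && old_slot < TOY_NSLOT);

	/* 1/ allocate a slot for the new label in the next index */
	for (slot = 0; slot < TOY_NSLOT; slot++)
		if (work.free & (1u << slot))
			break;
	if (slot == TOY_NSLOT)
		return -1;
	work.free &= (uint8_t)~(1u << slot);

	/* 2/ free the old label in the working copy */
	work.free |= (uint8_t)(1u << old_slot);

	/* 3/ write the label space and the bitmap to the dimm */
	d->label[slot] = new_label;
	d->index[next].free = work.free;

	/* 4/ write the index last; only now does the update become valid */
	d->index[next].seq = cur->seq + 1;
	d->current = next;
	return slot;
}
```

Because the old index copy and the old label slot are left untouched
until step 4, an interruption before that point leaves the previous
label set intact in this model.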

Cc: Greg KH gre...@linuxfoundation.org
Cc: Neil Brown ne...@suse.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/dimm_devs.c  |   49 ++
 drivers/block/nd/label.c  |  327 +
 drivers/block/nd/label.h  |6 +
 drivers/block/nd/namespace_devs.c |   82 -
 drivers/block/nd/nd.h |3 
 5 files changed, 453 insertions(+), 14 deletions(-)

diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 4aa5654354ac..358b2a06d680 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -132,6 +132,55 @@ int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd)
return rc;
 }
 
+int nd_dimm_set_config_data(struct nd_dimm_drvdata *ndd, size_t offset,
+   void *buf, size_t len)
+{
+   int rc = validate_dimm(ndd);
+   size_t max_cmd_size, buf_offset;
+   struct nd_cmd_set_config_hdr *cmd;
+   struct nd_bus *nd_bus = walk_to_nd_bus(ndd->dev);
+   struct nd_bus_descriptor *nd_desc = nd_bus->nd_desc;
+
+   if (rc)
+   return rc;
+
+   if (!ndd->data)
+   return -ENXIO;
+
+   if (offset + len > ndd->nsarea.config_size)
+   return -ENXIO;
+
+   max_cmd_size = min_t(u32, PAGE_SIZE, len);
+   max_cmd_size = min_t(u32, max_cmd_size, ndd->nsarea.max_xfer);
+   cmd = kzalloc(max_cmd_size + sizeof(*cmd) + sizeof(u32), GFP_KERNEL);
+   if (!cmd)
+   return -ENOMEM;
+
+   for (buf_offset = 0; len; len -= cmd->in_length,
+   buf_offset += cmd->in_length) {
+   size_t cmd_size;
+   u32 *status;
+
+   cmd->in_offset = offset + buf_offset;
+   cmd->in_length = min(max_cmd_size, len);
+   memcpy(cmd->in_buf, buf + buf_offset, cmd->in_length);
+
+   /* status is output in the last 4-bytes of the command buffer */
+   cmd_size = sizeof(*cmd) + cmd->in_length + sizeof(u32);
+   status = ((void *) cmd) + cmd_size - sizeof(u32);
+
+   rc = nd_desc->ndctl(nd_desc, to_nd_dimm(ndd->dev),
+   ND_CMD_SET_CONFIG_DATA, cmd, cmd_size);
+   if (rc || *status) {
+   rc = rc ? rc : -ENXIO;
+   break;
+   }
+   }
+   kfree(cmd);
+
+   return rc;
+}
+
 static void nd_dimm_release(struct device *dev)
 {
struct nd_dimm *nd_dimm = to_nd_dimm(dev);
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
index b55fa2a6f872..78898b642191 100644
--- a/drivers/block/nd/label.c
+++ b/drivers/block/nd/label.c
@@ -12,6 +12,7 @@
  */
 #include <linux/device.h>
 #include <linux/ndctl.h>
+#include <linux/slab.h>
 #include <linux/io.h>
 #include <linux/nd.h>
 #include "nd-private.h"
@@ -57,6 +58,11 @@ size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd)
return ndd->nsindex_size;
 }
 
+static int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd)
+{
+   return ndd->nsarea.config_size / 129;
+}
+
 int nd_label_validate(struct nd_dimm_drvdata *ndd)
 {
/*
@@ -202,23 +208,30 @@ static struct nd_namespace_label __iomem *nd_label_base(struct nd_dimm_drvdata *
return base + 2 * sizeof_namespace_index(ndd);
 }
 
+static int to_slot(struct nd_dimm_drvdata *ndd,
+   struct nd_namespace_label __iomem *nd_label)
+{
+   return nd_label - nd_label_base(ndd);
+}
+
 #define for_each_clear_bit_le(bit, addr, size) \
for ((bit) = find_next_zero_bit_le((addr), (size), 0);  \
 (bit) < (size);\
 (bit) = find_next_zero_bit_le((addr), (size), (bit) + 1))
 
 /**
- * preamble_current - common variable initialization for nd_label_* routines
+ * preamble_index - common variable initialization for nd_label_* routines
  * @nd_dimm: dimm container for the relevant label set
+ * @idx: namespace_index index
  * @nsindex: on return set to the currently active namespace index
  * @free: on return set to the free label bitmap in the index
  * @nslot: on return set to the number of slots in the label space
  */
-static bool preamble_current(struct nd_dimm_drvdata *ndd,
+static bool preamble_index(struct nd_dimm_drvdata *ndd, int idx,
struct nd_namespace_index **nsindex,
unsigned long **free, u32 *nslot)
 {
-   *nsindex = to_current_namespace_index(ndd);
+   *nsindex = to_namespace_index(ndd, idx);
if (*nsindex == NULL)
return

[PATCH v2 08/20] libnd, nd_acpi: regions (block-data-window, persistent memory, volatile memory)

2015-04-28 Thread Dan Williams
A region device represents the maximum capacity of a BLK range (mmio
block-data-window(s)), or a PMEM range (DAX-capable persistent memory or
volatile memory), without regard for aliasing.  Aliasing, in the
dimm-local address space (DPA), is resolved by metadata on a dimm to
designate which exclusive interface will access the aliased DPA ranges.
Support for the per-dimm metadata/label arrives in a subsequent
patch.

The name format of region devices is regionN where, like dimms, N is
a global ida index assigned at discovery time.  This id is not reliable
across reboots nor in the presence of hotplug.  Look to attributes of
the region or static id-data of the sub-namespace to generate a
persistent name.

Regions have 2 generic attributes, size and mappings, where:
- size: the block-data-window accessible capacity or the span of the
  spa-range in the case of pm.

- mappingN: a tuple describing a dimm's contribution to the region's
  capacity in the format (nmemX,dpa,size).  For a
  PMEM-region there will be at least one mapping per dimm in the interleave
  set.  For a BLK-region there is only mapping0 listing the starting dimm
  offset of the block-data-window and the available capacity of that
  window (matches size above).
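
As an illustration of consuming a mappingN attribute from userspace, the
sketch below parses one tuple.  It assumes the text looks like
"nmem3,0x10000000,0x4000000", without parentheses and with any trailing
newline already stripped; the exact number formatting is an assumption,
and the toy_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct toy_mapping {
	unsigned int nmem;	/* X in nmemX */
	uint64_t dpa;		/* dimm-physical-address where the mapping starts */
	uint64_t size;		/* bytes contributed by this dimm */
};

/* Parse one unsigned field that must be followed by 'term'. */
int toy_parse_u64(const char **p, int base, char term, uint64_t *out)
{
	char *end;
	unsigned long long v;

	if (**p < '0' || **p > '9')
		return -1;	/* strtoull alone would accept signs and blanks */
	errno = 0;
	v = strtoull(*p, &end, base);
	if (errno || end == *p || *end != term)
		return -1;
	*out = v;
	*p = (term == '\0') ? end : end + 1;
	return 0;
}

/* Returns 0 on success, -1 if 'text' is not of the form nmemX,dpa,size. */
int toy_parse_mapping(const char *text, struct toy_mapping *m)
{
	const char *p = text;
	uint64_t id, dpa, size;

	assert(text && m);
	if (strncmp(p, "nmem", 4) != 0)
		return -1;
	p += 4;
	if (toy_parse_u64(&p, 10, ',', &id) || id > 0xffffffffu)
		return -1;
	/* base 0 accepts both decimal and 0x-prefixed hex values */
	if (toy_parse_u64(&p, 0, ',', &dpa) || toy_parse_u64(&p, 0, '\0', &size))
		return -1;
	m->nmem = (unsigned int)id;
	m->dpa = dpa;
	m->size = size;
	return 0;
}
```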

The max number of mappings per region is hard coded per the constraints of
sysfs attribute groups.  That said the number of mappings per region should
never exceed the maximum number of possible dimms in the system.  If the
current number turns out to not be enough then the mappings attribute
clarifies how many there are supposed to be. 32 should be enough for
anybody.

Cc: Neil Brown ne...@suse.de
Cc: linux-a...@vger.kernel.org
Cc: Greg KH gre...@linuxfoundation.org
Cc: Robert Moore robert.mo...@intel.com
Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Makefile  |1 
 drivers/block/nd/acpi.c|  130 ++
 drivers/block/nd/libnd.h   |   25 +++
 drivers/block/nd/nd-private.h  |3 
 drivers/block/nd/nd.h  |   11 +
 drivers/block/nd/region_devs.c |  294 
 6 files changed, 463 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/nd/region_devs.c

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 842ba13253fd..6010469c4d4c 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -22,3 +22,4 @@ libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
 libnd-y += dimm.o
+libnd-y += region_devs.o
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index bb0c2c764e78..41d0bb732b3e 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -751,12 +751,136 @@ static void nd_acpi_init_dsms(struct acpi_nfit_desc *acpi_desc)
set_bit(i, &nd_desc->dsm_mask);
 }
 
+static ssize_t spa_index_show(struct device *dev,
+struct device_attribute *attr, char *buf)
+{
+struct nd_region *nd_region = to_nd_region(dev);
+struct nfit_spa *nfit_spa = nd_region_provider_data(nd_region);
+
+return sprintf(buf, "%d\n", nfit_spa->spa->spa_index);
+}
+static DEVICE_ATTR_RO(spa_index);
+
+static struct attribute *nd_acpi_region_attributes[] = {
+   &dev_attr_spa_index.attr,
+   NULL,
+};
+
+static struct attribute_group nd_acpi_region_attribute_group = {
+   .name = "nfit",
+   .attrs = nd_acpi_region_attributes,
+};
+
+static const struct attribute_group *nd_acpi_region_attribute_groups[] = {
+   &nd_region_attribute_group,
+   &nd_mapping_attribute_group,
+   &nd_acpi_region_attribute_group,
+   NULL,
+};
+
+static int nd_acpi_register_region(struct acpi_nfit_desc *acpi_desc,
+   struct nfit_spa *nfit_spa)
+{
+   static struct nd_mapping nd_mappings[ND_MAX_MAPPINGS];
+   struct acpi_nfit_spa *spa = nfit_spa->spa;
+   struct nfit_memdev *nfit_memdev;
+   struct nd_region_desc ndr_desc;
+   int spa_type, count = 0;
+   struct resource res;
+   u16 spa_index;
+
+   spa_type = nfit_spa_type(spa);
+   spa_index = spa->spa_index;
+   if (spa_index == 0) {
+   dev_dbg(acpi_desc->dev, "%s: detected invalid spa index\n",
+   __func__);
+   return 0;
+   }
+
+   memset(&res, 0, sizeof(res));
+   memset(&nd_mappings, 0, sizeof(nd_mappings));
+   memset(&ndr_desc, 0, sizeof(ndr_desc));
+   res.start = spa->spa_base;
+   res.end = res.start + spa->spa_length - 1;
+   ndr_desc.res = &res;
+   ndr_desc.provider_data = nfit_spa;
+   ndr_desc.attr_groups = nd_acpi_region_attribute_groups;
+   list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
+   struct acpi_nfit_memdev *memdev = nfit_memdev->memdev;
+   struct nd_mapping *nd_mapping;
+   struct nd_dimm *nd_dimm;
+
+   if (memdev->spa_index != spa_index

[PATCH v2 14/20] libnd: pmem label sets and namespace instantiation.

2015-04-28 Thread Dan Williams
A complete label set is a PMEM-label per dimm where all the UUIDs
match and the interleave set cookie matches an active interleave set.

Present a sysfs ABI for manipulation of a PMEM-namespace's 'alt_name',
'uuid', and 'size' attributes.  A later patch will make these settings
persistent by writing back the label.

Note that PMEM allocations grow forwards from the start of an interleave
set (lowest dimm-physical-address (DPA)).  BLK-namespaces that alias
with a PMEM interleave set will grow allocations backward from the
highest DPA.
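
The two growth directions can be modelled as two cursors moving toward
each other inside one dimm's DPA range.  The sketch below is a toy
allocator built on that assumption; the real allocator tracks individual
resources, and the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one dimm's DPA range shared by PMEM and BLK allocations. */
struct toy_dpa {
	uint64_t pmem_end;	/* first byte above the PMEM allocations */
	uint64_t blk_start;	/* lowest byte used by BLK allocations */
};

void toy_dpa_init(struct toy_dpa *d, uint64_t base, uint64_t size)
{
	assert(size <= UINT64_MAX - base);
	d->pmem_end = base;		/* PMEM grows up from the lowest DPA */
	d->blk_start = base + size;	/* BLK grows down from the highest DPA */
}

/* Bytes still unclaimed between the two allocation fronts. */
uint64_t toy_dpa_free(const struct toy_dpa *d)
{
	return d->blk_start - d->pmem_end;
}

/* Returns the start DPA of the new PMEM extent, or UINT64_MAX on failure. */
uint64_t toy_alloc_pmem(struct toy_dpa *d, uint64_t len)
{
	uint64_t start = d->pmem_end;

	if (len == 0 || len > toy_dpa_free(d))
		return UINT64_MAX;
	d->pmem_end += len;
	return start;
}

/* Returns the start DPA of the new BLK extent, or UINT64_MAX on failure. */
uint64_t toy_alloc_blk(struct toy_dpa *d, uint64_t len)
{
	if (len == 0 || len > toy_dpa_free(d))
		return UINT64_MAX;
	d->blk_start -= len;
	return d->blk_start;
}
```

Growing from opposite ends lets either kind of namespace expand into the
remaining free space without relocating the other.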

Cc: Greg KH gre...@linuxfoundation.org
Cc: Neil Brown ne...@suse.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/bus.c|6 
 drivers/block/nd/core.c   |   64 ++
 drivers/block/nd/dimm.c   |2 
 drivers/block/nd/dimm_devs.c  |  127 +
 drivers/block/nd/label.c  |   54 ++
 drivers/block/nd/label.h  |3 
 drivers/block/nd/libnd.h  |2 
 drivers/block/nd/namespace_devs.c | 1024 +
 drivers/block/nd/nd-private.h |   14 +
 drivers/block/nd/nd.h |   33 +
 drivers/block/nd/pmem.c   |   22 +
 drivers/block/nd/region_devs.c|  145 +
 include/linux/nd.h|   24 +
 include/uapi/linux/ndctl.h|4 
 14 files changed, 1512 insertions(+), 12 deletions(-)

diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 8afb8d4a7e81..819259e92468 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -364,8 +364,10 @@ u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd,
 }
 EXPORT_SYMBOL_GPL(nd_cmd_out_size);
 
-static void wait_nd_bus_probe_idle(struct nd_bus *nd_bus)
+void wait_nd_bus_probe_idle(struct device *dev)
 {
+   struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
do {
if (nd_bus->probe_active == 0)
break;
@@ -384,7 +386,7 @@ static int nd_cmd_clear_to_send(struct nd_dimm *nd_dimm, unsigned int cmd)
return 0;
 
nd_bus = walk_to_nd_bus(&nd_dimm->dev);
-   wait_nd_bus_probe_idle(nd_bus);
+   wait_nd_bus_probe_idle(&nd_bus->dev);
 
if (atomic_read(&nd_dimm->busy))
return -EBUSY;
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 603970d0ef3a..cf64b7a50d3a 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -13,6 +13,7 @@
 #include <linux/export.h>
 #include <linux/module.h>
 #include <linux/device.h>
+#include <linux/ctype.h>
 #include <linux/ndctl.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
@@ -106,6 +107,69 @@ struct nd_bus *walk_to_nd_bus(struct device *nd_dev)
return NULL;
 }
 
+static bool is_uuid_sep(char sep)
+{
+   if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+   return true;
+   return false;
+}
+
+static int nd_uuid_parse(struct device *dev, u8 *uuid_out, const char *buf,
+   size_t len)
+{
+   const char *str = buf;
+   u8 uuid[16];
+   int i;
+
+   for (i = 0; i < 16; i++) {
+   if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+   dev_dbg(dev, "%s: pos: %d buf[%zd]: %c buf[%zd]: %c\n",
+   __func__, i, str - buf, str[0],
+   str + 1 - buf, str[1]);
+   return -EINVAL;
+   }
+
+   uuid[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+   str += 2;
+   if (is_uuid_sep(*str))
+   str++;
+   }
+
+   memcpy(uuid_out, uuid, sizeof(uuid));
+   return 0;
+}
+
+/**
+ * nd_uuid_store: common implementation for writing 'uuid' sysfs attributes
+ * @dev: container device for the uuid property
+ * @uuid_out: uuid buffer to replace
+ * @buf: raw sysfs buffer to parse
+ *
+ * Enforce that uuids can only be changed while the device is disabled
+ * (driver detached)
+ * LOCKING: expects device_lock() is held on entry
+ */
+int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf,
+   size_t len)
+{
+   u8 uuid[16];
+   int rc;
+
+   if (dev->driver)
+   return -EBUSY;
+
+   rc = nd_uuid_parse(dev, uuid, buf, len);
+   if (rc)
+   return rc;
+
+   kfree(*uuid_out);
+   *uuid_out = kmemdup(uuid, sizeof(uuid), GFP_KERNEL);
+   if (!(*uuid_out))
+   return -ENOMEM;
+
+   return 0;
+}
+
 static ssize_t commands_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/block/nd/dimm.c b/drivers/block/nd/dimm.c
index 5477176c5de0..e2f964308672 100644
--- a/drivers/block/nd/dimm.c
+++ b/drivers/block/nd/dimm.c
@@ -86,7 +86,7 @@ static int nd_dimm_remove(struct device *dev)
nd_bus_lock(dev);
dev_set_drvdata(dev, NULL);
for_each_dpa_resource_safe(ndd, res, _r)
-   __release_region(&ndd->dpa, res->start, resource_size(res

[PATCH v2 10/20] pmem: use ida

2015-04-28 Thread Dan Williams
In preparation for the pmem driver attaching to pmem-namespaces emitted
by libnd, convert it to use an ida instead of an always increasing
atomic index.  This provides a bit of stability to pmem device names in
the presence of driver re-bind events.
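
The difference from an always-increasing counter is that a released id
becomes available again.  The userspace sketch below imitates that
behaviour with a bitmap that hands out the lowest free id; it is a toy
stand-in, not the kernel's ida implementation, and the toy_* names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_IDS 32

/* Toy stand-in for an ida: one bit per id, lowest free id is handed out. */
struct toy_ida {
	uint32_t used;
};

/* Returns the lowest unused id, or -1 when all ids are taken. */
int toy_ida_get(struct toy_ida *ida)
{
	int id;

	for (id = 0; id < TOY_MAX_IDS; id++) {
		if (!(ida->used & (UINT32_C(1) << id))) {
			ida->used |= UINT32_C(1) << id;
			return id;
		}
	}
	return -1;
}

/* Give an id back so that a later toy_ida_get() can reuse it. */
void toy_ida_remove(struct toy_ida *ida, int id)
{
	assert(id >= 0 && id < TOY_MAX_IDS);
	ida->used &= ~(UINT32_C(1) << id);
}
```

With a plain counter, unbinding and rebinding the first device would
produce a new, higher number; here it gets id 0 back as long as nothing
else claimed it in between.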

Cc: Christoph Hellwig h...@lst.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/pmem.c |   22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index eabf4a8d0085..e3cf9142b172 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -34,10 +34,11 @@ struct pmem_device {
phys_addr_t phys_addr;
void*virt_addr;
size_t  size;
+   int id;
 };
 
 static int pmem_major;
-static atomic_t pmem_index;
+static DEFINE_IDA(pmem_ida);
 
 static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
unsigned int len, unsigned int off, int rw,
@@ -122,20 +123,26 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
 {
struct pmem_device *pmem;
struct gendisk *disk;
-   int idx, err;
+   int err;
 
err = -ENOMEM;
pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
if (!pmem)
goto out;
 
+   pmem->id = ida_simple_get(&pmem_ida, 0, 0, GFP_KERNEL);
+   if (pmem->id < 0) {
+   err = pmem->id;
+   goto out_free_dev;
+   }
+
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
 
err = -EINVAL;
if (!request_mem_region(pmem->phys_addr, pmem->size, "pmem")) {
dev_warn(dev, "could not reserve region [0x%pa:0x%zx]\n", &pmem->phys_addr, pmem->size);
-   goto out_free_dev;
+   goto out_free_ida;
}
 
/*
@@ -159,15 +166,13 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
if (!disk)
goto out_free_queue;
 
-   idx = atomic_inc_return(&pmem_index) - 1;
-
disk->major = pmem_major;
-   disk->first_minor   = PMEM_MINORS * idx;
+   disk->first_minor   = PMEM_MINORS * pmem->id;
disk->fops  = &pmem_fops;
disk->private_data  = pmem;
disk->queue = pmem->pmem_queue;
disk->flags = GENHD_FL_EXT_DEVT;
-   sprintf(disk->disk_name, "pmem%d", idx);
+   sprintf(disk->disk_name, "pmem%d", pmem->id);
disk->driverfs_dev = dev;
set_capacity(disk, pmem->size >> 9);
pmem->pmem_disk = disk;
@@ -182,6 +187,8 @@ out_unmap:
iounmap(pmem->virt_addr);
 out_release_region:
release_mem_region(pmem->phys_addr, pmem->size);
+out_free_ida:
+   ida_simple_remove(&pmem_ida, pmem->id);
 out_free_dev:
kfree(pmem);
 out:
@@ -195,6 +202,7 @@ static void pmem_free(struct pmem_device *pmem)
blk_cleanup_queue(pmem->pmem_queue);
iounmap(pmem->virt_addr);
release_mem_region(pmem->phys_addr, pmem->size);
+   ida_simple_remove(&pmem_ida, pmem->id);
kfree(pmem);
 }
 



[PATCH v2 00/20] libnd: non-volatile memory device support

2015-04-28 Thread Dan Williams
 to 128M as only the simulated DAX regions need CMA.  The rest
   can use vmalloc().

---

Available here:
  git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm nd-v2

---

Dan Williams (18):
  e820, efi: add ACPI 6.0 persistent memory types
  libnd, nd_acpi: initial libnd infrastructure and NFIT support
  nd_acpi, nfit-test: manufactured NFITs for interface development
  libnd: ndctl class device, and nd bus attributes
  libnd, nd_acpi: dimm/memory-devices
  libnd: ndctl.h, the nd ioctl abi
  libnd, nd_dimm: dimm driver and base libnd device-driver infrastructure
  libnd, nd_acpi: regions (block-data-window, persistent memory, volatile 
memory)
  libnd: support for legacy (non-aliasing) nvdimms
  pmem: use ida
  libnd, nd_pmem: add libnd support to the pmem driver
  libnd, nd_acpi: add interleave-set state-tracking infrastructure
  libnd: namespace indices: read and validate
  libnd: pmem label sets and namespace instantiation.
  libnd: blk labels and namespace instantiation
  libnd: write pmem label set
  libnd: write blk label set
  libnd: infrastructure for btt devices

Ross Zwisler (1):
  libnd, nd_acpi, nd_blk: driver for BLK-mode access persistent memory

Vishal Verma (1):
  nd_btt: atomic sector updates


 Documentation/blockdev/btt.txt|  273 ++
 arch/arm64/kernel/efi.c   |1 
 arch/ia64/kernel/efi.c|4 
 arch/x86/boot/compressed/eboot.c  |4 
 arch/x86/include/uapi/asm/e820.h  |1 
 arch/x86/kernel/e820.c|   26 +
 arch/x86/kernel/pmem.c|2 
 arch/x86/platform/efi/efi.c   |3 
 drivers/block/Kconfig |   13 
 drivers/block/Makefile|2 
 drivers/block/nd/Kconfig  |  129 +++
 drivers/block/nd/Makefile |   41 +
 drivers/block/nd/acpi.c   | 1505 +
 drivers/block/nd/acpi_nfit.h  |  321 +++
 drivers/block/nd/blk.c|  264 ++
 drivers/block/nd/btt.c| 1423 +++
 drivers/block/nd/btt.h|  185 
 drivers/block/nd/btt_devs.c   |  443 ++
 drivers/block/nd/bus.c|  770 +
 drivers/block/nd/core.c   |  471 ++
 drivers/block/nd/dimm.c   |  115 +++
 drivers/block/nd/dimm_devs.c  |  507 +++
 drivers/block/nd/e820.c   |  100 ++
 drivers/block/nd/label.c  |  925 
 drivers/block/nd/label.h  |  143 +++
 drivers/block/nd/libnd.h  |  122 +++
 drivers/block/nd/namespace_devs.c | 1701 +
 drivers/block/nd/nd-private.h |  114 ++
 drivers/block/nd/nd.h |  261 ++
 drivers/block/nd/pmem.c   |  114 ++
 drivers/block/nd/region.c |  159 +++
 drivers/block/nd/region_devs.c|  637 ++
 drivers/block/nd/test/Makefile|5 
 drivers/block/nd/test/iomap.c |  151 +++
 drivers/block/nd/test/nfit.c  | 1131 +
 drivers/block/nd/test/nfit_test.h |   26 +
 include/linux/efi.h   |3 
 include/linux/nd.h|   98 ++
 include/uapi/linux/Kbuild |1 
 include/uapi/linux/ndctl.h|  199 
 40 files changed, 12345 insertions(+), 48 deletions(-)
 create mode 100644 Documentation/blockdev/btt.txt
 create mode 100644 drivers/block/nd/Kconfig
 create mode 100644 drivers/block/nd/Makefile
 create mode 100644 drivers/block/nd/acpi.c
 create mode 100644 drivers/block/nd/acpi_nfit.h
 create mode 100644 drivers/block/nd/blk.c
 create mode 100644 drivers/block/nd/btt.c
 create mode 100644 drivers/block/nd/btt.h
 create mode 100644 drivers/block/nd/btt_devs.c
 create mode 100644 drivers/block/nd/bus.c
 create mode 100644 drivers/block/nd/core.c
 create mode 100644 drivers/block/nd/dimm.c
 create mode 100644 drivers/block/nd/dimm_devs.c
 create mode 100644 drivers/block/nd/e820.c
 create mode 100644 drivers/block/nd/label.c
 create mode 100644 drivers/block/nd/label.h
 create mode 100644 drivers/block/nd/libnd.h
 create mode 100644 drivers/block/nd/namespace_devs.c
 create mode 100644 drivers/block/nd/nd-private.h
 create mode 100644 drivers/block/nd/nd.h
 rename drivers/block/{pmem.c => nd/pmem.c} (68%)
 create mode 100644 drivers/block/nd/region.c
 create mode 100644 drivers/block/nd/region_devs.c
 create mode 100644 drivers/block/nd/test/Makefile
 create mode 100644 drivers/block/nd/test/iomap.c
 create mode 100644 drivers/block/nd/test/nfit.c
 create mode 100644 drivers/block/nd/test/nfit_test.h
 create mode 100644 include/linux/nd.h
 create mode 100644 include/uapi/linux/ndctl.h


[PATCH v2 03/20] nd_acpi, nfit-test: manufactured NFITs for interface development

2015-04-28 Thread Dan Williams
Manually create and register NFITs to describe 2 topologies.  Topology1
is an advanced plausible configuration for BLK/PMEM aliased NVDIMMs.
Topology2 is an example configuration for current platforms that only
ship with a persistent address range.

 Kernel provider nfit_test.0 produces an NFIT with the following attributes:

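                              (a)               (b)           DIMM   BLK-REGION
           +-------------------+--------+--------+--------+
 +------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
 | imc0 +--+- - - region0- - - +--------+        +--------+
 +--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
    |      +-------------------+--------v        v--------+
 +--+---+                               |                 |
 | cpu0 |                                     region1
 +--+---+                               |                 |
    |      +----------------------------^        ^--------+
 +--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
 | imc1 +--+----------------------------|        +--------+
 +------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
           +----------------------------+--------+--------+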

 *) In this layout we have four dimms and two memory controllers in one
socket.  Each unique interface (block or pmem) to DPA space
is identified by a region device with a dynamically assigned id.

 *) The first portion of dimm0 and dimm1 are interleaved as REGION0.
A single pmem namespace is created in the REGION0-spa-range
that spans dimm0 and dimm1 with a user-specified name of pm0.0.
Some of that interleaved spa range is reclaimed as bdw
accessed space starting at offset (a) into each dimm.  In that
reclaimed space we create two bdw namespaces from REGION2 and
REGION3 where blk2.0 and blk3.0 are just human readable names
that could be set to any user-desired name in the label.

 *) In the last portion of dimm0 and dimm1 we have an interleaved
spa range, REGION1, that spans those two dimms as well as dimm2
and dimm3.  Some of REGION1 is allocated to a pmem namespace named
pm1.0; the rest is reclaimed in 4 bdw namespaces (one for each
dimm in the interleave set): blk2.1, blk3.1, blk4.0, and
blk5.0.

 *) The portions of dimm2 and dimm3 that do not participate in the
REGION1 interleaved spa range (i.e. the DPA addresses below
offset (b)) are also included in the blk4.0 and blk5.0
namespaces.  Note that this example shows that bdw namespaces
don't need to be contiguous in DPA-space.

 Kernel provider nfit_test.1 produces an NFIT with the following attributes:

                             region2
        +---------------------+
        |---------------------|
        ||       pm2.0       ||
        |---------------------|
        +---------------------+

 *) Describes a simple system-physical-address range with no backing
dimm or interleave description.

Cc: linux-a...@vger.kernel.org
Cc: Robert Moore robert.mo...@intel.com
Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Kconfig  |   19 +
 drivers/block/nd/Makefile |   15 +
 drivers/block/nd/acpi.c   |3 
 drivers/block/nd/acpi_nfit.h  |   11 
 drivers/block/nd/test/Makefile|5 
 drivers/block/nd/test/iomap.c |  151 +
 drivers/block/nd/test/nfit.c  | 1025 +
 drivers/block/nd/test/nfit_test.h |   26 +
 8 files changed, 1254 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/nd/test/Makefile
 create mode 100644 drivers/block/nd/test/iomap.c
 create mode 100644 drivers/block/nd/test/nfit.c
 create mode 100644 drivers/block/nd/test/nfit_test.h

diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 6d5d6b732f82..09f0135147ca 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -37,4 +37,23 @@ config ND_ACPI
  addition to storage devices this also enables libnd craft
  ACPI._DSM messages for platform/dimm configuration.
 
+config NFIT_TEST
+   tristate "NFIT TEST: Manufactured NFIT for interface testing"
+   depends on DMA_CMA
+   depends on LIBND=m
+   depends on ND_ACPI
+   depends on m
+   help
+ For development purposes register a manufactured
+ NFIT table to verify the resulting device model topology.
+ Note, this module arranges for ioremap_cache() to be
+ overridden locally to allow simulation of system-memory as an
+ io-memory-resource.
+
+ Note, this test expects to be able to find at least 256MB of
+ CMA space (CONFIG_CMA_SIZE_MBYTES, cma=) or it will fail to
+ load.
+
+ Say N unless you are doing development of the 'nd' subsystem.
+
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 944b5947c0cb..cf064db92589 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd

[PATCH v2 05/20] libnd, nd_acpi: dimm/memory-devices

2015-04-28 Thread Dan Williams
Register the memory devices described in the nfit as libnd 'dimm'
devices on an nd bus.  The kernel assigned device id for dimms is
dynamic.  If userspace needs a more static identifier it should consult
a provider-specific attribute.  In the case where NFIT is the provider,
the 'nmemX/nfit/handle' or 'nmemX/nfit/serial' attributes may be used
for this purpose.
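
As an illustration of consulting such a provider-specific attribute
from userspace, here is a minimal sketch that reads one hexadecimal
sysfs value (the show routines in this patch print with "%#x\n").  The
helper name sketch_read_hex_attr() is hypothetical, and the sysfs
directory that holds 'nmemX/nfit/handle' is not stated in this message,
so the caller is assumed to supply the full path.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read a hexadecimal attribute such as "0x10\n" from the file at @path.
 * Returns 0 and stores the value in *out on success, -1 on any error.
 * Hypothetical helper, not part of ndctl or the kernel.
 */
int sketch_read_hex_attr(const char *path, unsigned long *out)
{
	char buf[64];
	char *end;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	/* base 16 accepts an optional "0x" prefix */
	*out = strtoul(buf, &end, 16);
	if (errno != 0 || end == buf)
		return -1;
	return 0;
}
```

A tool that wants a stable dimm identity would read the 'handle' or
'serial' attribute this way and key its configuration on that value
rather than on the dynamic nmemX number.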

Cc: Neil Brown ne...@suse.de
Cc: linux-a...@vger.kernel.org
Cc: Greg KH gre...@linuxfoundation.org
Cc: Robert Moore robert.mo...@intel.com
Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Makefile |1 
 drivers/block/nd/acpi.c   |  160 +
 drivers/block/nd/acpi_nfit.h  |1 
 drivers/block/nd/bus.c|   14 +++-
 drivers/block/nd/core.c   |   29 +++
 drivers/block/nd/dimm_devs.c  |   92 
 drivers/block/nd/libnd.h  |   11 +++
 drivers/block/nd/nd-private.h |   12 +++
 8 files changed, 318 insertions(+), 2 deletions(-)
 create mode 100644 drivers/block/nd/dimm_devs.c

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 7defe18ed009..35e4c1a7a8ff 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -20,3 +20,4 @@ nd_acpi-y := acpi.o
 
 libnd-y := core.o
 libnd-y += bus.o
+libnd-y += dimm_devs.o
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index dd8505f766ed..af6684341c9b 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -369,6 +369,164 @@ const struct attribute_group *nd_acpi_attribute_groups[] 
= {
 };
 EXPORT_SYMBOL_GPL(nd_acpi_attribute_groups);
 
+static struct acpi_nfit_memdev *to_nfit_memdev(struct device *dev)
+{
+   struct nd_dimm *nd_dimm = to_nd_dimm(dev);
+   struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+
+   return __to_nfit_memdev(nfit_mem);
+}
+
+static struct acpi_nfit_dcr *to_nfit_dcr(struct device *dev)
+{
+   struct nd_dimm *nd_dimm = to_nd_dimm(dev);
+   struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+
+   return nfit_mem->dcr;
+}
+
+static ssize_t handle_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct acpi_nfit_memdev *memdev = to_nfit_memdev(dev);
+
+   return sprintf(buf, "%#x\n", memdev->nfit_handle);
+}
+static DEVICE_ATTR_RO(handle);
+
+static ssize_t phys_id_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct acpi_nfit_memdev *memdev = to_nfit_memdev(dev);
+
+   return sprintf(buf, "%#x\n", memdev->phys_id);
+}
+static DEVICE_ATTR_RO(phys_id);
+
+static ssize_t vendor_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct acpi_nfit_dcr *dcr = to_nfit_dcr(dev);
+
+   return sprintf(buf, "%#x\n", dcr->vendor_id);
+}
+static DEVICE_ATTR_RO(vendor);
+
+static ssize_t rev_id_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct acpi_nfit_dcr *dcr = to_nfit_dcr(dev);
+
+   return sprintf(buf, "%#x\n", dcr->revision_id);
+}
+static DEVICE_ATTR_RO(rev_id);
+
+static ssize_t device_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct acpi_nfit_dcr *dcr = to_nfit_dcr(dev);
+
+   return sprintf(buf, "%#x\n", dcr->device_id);
+}
+static DEVICE_ATTR_RO(device);
+
+static ssize_t format_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct acpi_nfit_dcr *dcr = to_nfit_dcr(dev);
+
+   return sprintf(buf, "%#x\n", dcr->fic);
+}
+static DEVICE_ATTR_RO(format);
+
+static ssize_t serial_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct acpi_nfit_dcr *dcr = to_nfit_dcr(dev);
+
+   return sprintf(buf, "%#x\n", dcr->serial_number);
+}
+static DEVICE_ATTR_RO(serial);
+
+static struct attribute *nd_acpi_dimm_attributes[] = {
+   &dev_attr_handle.attr,
+   &dev_attr_phys_id.attr,
+   &dev_attr_vendor.attr,
+   &dev_attr_device.attr,
+   &dev_attr_format.attr,
+   &dev_attr_serial.attr,
+   &dev_attr_rev_id.attr,
+   NULL,
+};
+
+static umode_t nd_acpi_dimm_attr_visible(struct kobject *kobj, struct 
attribute *a, int n)
+{
+   struct device *dev = container_of(kobj, struct device, kobj);
+
+   if (to_nfit_dcr(dev))
+   return a->mode;
+   else
+   return 0;
+}
+
+static struct attribute_group nd_acpi_dimm_attribute_group = {
+   .name = "nfit",
+   .attrs = nd_acpi_dimm_attributes,
+   .is_visible = nd_acpi_dimm_attr_visible,
+};
+
+static const struct attribute_group *nd_acpi_dimm_attribute_groups[] = {
+   &nd_acpi_dimm_attribute_group,
+   NULL,
+};
+
+static struct nd_dimm *nd_acpi_dimm_by_handle(struct acpi_nfit_desc *acpi_desc,
+   u32 nfit_handle)
+{
+   struct nfit_mem *nfit_mem;
+
+   list_for_each_entry

[PATCH v2 07/20] libnd, nd_dimm: dimm driver and base libnd device-driver infrastructure

2015-04-28 Thread Dan Williams
* Implement the device-model infrastructure for loading modules and
  attaching drivers to nd devices.  This is a simple association of a
  nd-device-type number with a driver that has a bitmask of supported
  device types.  To facilitate userspace bind/unbind operations 'modalias'
  and 'devtype', that also appear in the uevent, are added as generic
  sysfs attributes for all nd devices.  The reason for the device-type
  number is to support sub-types within a given parent devtype, be it a
  vendor-specific sub-type or otherwise.

* The first consumer of this infrastructure is the driver
  for dimm devices.  It simply uses control messages to retrieve and
  store the configuration-data image (label set) from each dimm.

Note: nd_device_register() arranges for asynchronous registration of
  nd bus devices by default.
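
The association described in the first bullet can be pictured as a
bit test: a driver advertises a mask of device-type numbers and the bus
matches a device when its type's bit is set.  The sketch below is a
userspace illustration under that reading, not the patch's code.  The
sketch_* names and the numeric type values are assumptions; only the
ND_DEVICE_DIMM name and the "0 means no known type" convention appear
in the diff.

```c
#include <stdbool.h>

/* Hypothetical device-type numbers (values assumed for illustration). */
enum sketch_devtype {
	SKETCH_DEVICE_DIMM = 1,
	SKETCH_DEVICE_REGION = 2,
};

struct sketch_driver {
	const char *name;
	unsigned long type_mask;	/* bit N set: driver supports type N */
};

/*
 * Return true when @drv supports @devtype.  Type 0 is treated as
 * "unknown device" and never matches.
 */
bool sketch_driver_matches(const struct sketch_driver *drv, int devtype)
{
	if (devtype <= 0 || devtype >= (int)(sizeof(drv->type_mask) * 8))
		return false;
	return (drv->type_mask & (1UL << devtype)) != 0;
}
```

Using a mask rather than a single number lets one driver claim several
sub-types, which is the stated reason for the device-type number.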

Cc: Greg KH gre...@linuxfoundation.org
Cc: Neil Brown ne...@suse.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Makefile |1 
 drivers/block/nd/acpi.c   |   13 ++-
 drivers/block/nd/bus.c|  168 +
 drivers/block/nd/core.c   |   43 ++
 drivers/block/nd/dimm.c   |   93 +++
 drivers/block/nd/dimm_devs.c  |  136 -
 drivers/block/nd/libnd.h  |2 
 drivers/block/nd/nd-private.h |8 +-
 drivers/block/nd/nd.h |   34 
 include/linux/nd.h|   39 ++
 include/uapi/linux/ndctl.h|6 +
 11 files changed, 528 insertions(+), 15 deletions(-)
 create mode 100644 drivers/block/nd/dimm.c
 create mode 100644 drivers/block/nd/nd.h
 create mode 100644 include/linux/nd.h

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 35e4c1a7a8ff..842ba13253fd 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -21,3 +21,4 @@ nd_acpi-y := acpi.o
 libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
+libnd-y += dimm.o
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index c46e166695f7..bb0c2c764e78 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -22,6 +22,10 @@ static bool warn_checksum;
 module_param(warn_checksum, bool, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(warn_checksum, "Turn checksum errors into warnings");
 
+static bool force_enable_dimms;
+module_param(force_enable_dimms, bool, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
+
 enum {
NFIT_ACPI_NOTIFY_TABLE = 0x80,
 };
@@ -627,6 +631,7 @@ static struct attribute_group nd_acpi_dimm_attribute_group 
= {
 
 static const struct attribute_group *nd_acpi_dimm_attribute_groups[] = {
&nd_dimm_attribute_group,
+   &nd_device_attribute_group,
&nd_acpi_dimm_attribute_group,
NULL,
 };
@@ -663,7 +668,7 @@ static int nd_acpi_add_dimm(struct acpi_nfit_desc 
*acpi_desc,
if (!adev_dimm) {
dev_err(dev, "no ACPI.NFIT device with _ADR %#x, disabling...\n",
nfit_handle);
-   return -ENODEV;
+   return force_enable_dimms ? 0 : -ENODEV;
}
 
status = acpi_evaluate_integer(adev_dimm->handle, "_STA", NULL, &sta);
@@ -684,12 +689,13 @@ static int nd_acpi_add_dimm(struct acpi_nfit_desc 
*acpi_desc,
if (acpi_check_dsm(adev_dimm->handle, uuid, 1, 1ULL << i))
set_bit(i, &nfit_mem->dsm_mask);
 
-   return rc;
+   return force_enable_dimms ? 0 : rc;
 }
 
 static int nd_acpi_register_dimms(struct acpi_nfit_desc *acpi_desc)
 {
struct nfit_mem *nfit_mem;
+   int dimm_count = 0;
 
list_for_each_entry(nfit_mem, &acpi_desc->dimms, list) {
struct nd_dimm *nd_dimm;
@@ -723,9 +729,10 @@ static int nd_acpi_register_dimms(struct acpi_nfit_desc 
*acpi_desc)
return -ENOMEM;
 
nfit_mem->nd_dimm = nd_dimm;
+   dimm_count++;
}
 
-   return 0;
+   return nd_bus_validate_dimm_count(acpi_desc->nd_bus, dimm_count);
 }
 
 static void nd_acpi_init_dsms(struct acpi_nfit_desc *acpi_desc)
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index a271e01af4a9..7bd79f30e5e7 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -16,19 +16,183 @@
 #include <linux/fcntl.h>
 #include <linux/async.h>
 #include <linux/ndctl.h>
+#include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/io.h>
 #include <linux/mm.h>
+#include <linux/nd.h>
 #include "nd-private.h"
+#include "nd.h"
 
 int nd_dimm_major;
 static int nd_bus_major;
 static struct class *nd_class;
 
-struct bus_type nd_bus_type = {
+static int to_nd_device_type(struct device *dev)
+{
+   if (is_nd_dimm(dev))
+   return ND_DEVICE_DIMM;
+
+   return 0;
+}
+
+static int nd_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
+{
+   return add_uevent_var(env, MODALIAS= ND_DEVICE_MODALIAS_FMT

[PATCH v2 02/20] libnd, nd_acpi: initial libnd infrastructure and NFIT support

2015-04-28 Thread Dan Williams
1/ Autodetect an NFIT table for the ACPI namespace device with _HID of
   ACPI0012

2/ libnd bus registration

The NFIT provided by ACPI is one possible method by which platforms will
discover NVDIMM resources.  However, the intent of the nd_bus_descriptor
abstraction is to abstract provider specific details, leaving libnd
to be independent of the specific NVDIMM resource discovery mechanism.
This flexibility is exploited later to implement custom-defined nd
buses.

Cc: linux-a...@vger.kernel.org
Cc: Robert Moore robert.mo...@intel.com
Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/Kconfig |2 
 drivers/block/Makefile|1 
 drivers/block/nd/Kconfig  |   40 +++
 drivers/block/nd/Makefile |6 +
 drivers/block/nd/acpi.c   |  475 +
 drivers/block/nd/acpi_nfit.h  |  254 ++
 drivers/block/nd/core.c   |   67 ++
 drivers/block/nd/libnd.h  |   33 +++
 drivers/block/nd/nd-private.h |   23 ++
 9 files changed, 901 insertions(+)
 create mode 100644 drivers/block/nd/Kconfig
 create mode 100644 drivers/block/nd/Makefile
 create mode 100644 drivers/block/nd/acpi.c
 create mode 100644 drivers/block/nd/acpi_nfit.h
 create mode 100644 drivers/block/nd/core.c
 create mode 100644 drivers/block/nd/libnd.h
 create mode 100644 drivers/block/nd/nd-private.h

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index eb1fed5bd516..dfe40e5ca9bd 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -321,6 +321,8 @@ config BLK_DEV_NVME
  To compile this driver as a module, choose M here: the
  module will be called nvme.
 
+source "drivers/block/nd/Kconfig"
+
 config BLK_DEV_SKD
tristate "STEC S1120 Block Driver"
depends on PCI
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 9cc6c18a1c7e..07a6acecf4d8 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_CDROM_PKTCDVD)   += pktcdvd.o
 obj-$(CONFIG_MG_DISK)  += mg_disk.o
 obj-$(CONFIG_SUNVDC)   += sunvdc.o
 obj-$(CONFIG_BLK_DEV_NVME) += nvme.o
+obj-$(CONFIG_ND_DEVICES)   += nd/
 obj-$(CONFIG_BLK_DEV_SKD)  += skd.o
 obj-$(CONFIG_BLK_DEV_OSD)  += osdblk.o
 
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
new file mode 100644
index 000000000000..6d5d6b732f82
--- /dev/null
+++ b/drivers/block/nd/Kconfig
@@ -0,0 +1,40 @@
+menuconfig ND_DEVICES
+   bool "NVDIMM Support"
+   depends on PHYS_ADDR_T_64BIT
+   help
+ Generic support for non-volatile memory devices including
+ ACPI-6-NFIT defined resources.  On platforms that define an
+ NFIT, or otherwise can discover NVDIMM resources, a libnd
+ bus is registered to advertise PMEM (persistent memory)
+ namespaces (/dev/pmemX) and BLK (sliding mmio window(s))
+ namespaces (/dev/ndX). A PMEM namespace refers to a memory
+ resource that may span multiple DIMMs and support DAX (see
+ CONFIG_DAX).  A BLK namespace refers to an NVDIMM control
+ region which exposes an mmio register set for windowed
+ access mode to non-volatile memory.
+
+if ND_DEVICES
+
+config LIBND
+   tristate "LIBND: libnd device driver support"
+   help
+ Platform agnostic device model for a libnd bus.  Publishes
+ resources for a PMEM (persistent-memory) driver and/or BLK
+ (sliding mmio window(s)) driver to attach.  Exposes a device
+ topology under a ndX bus device, a /dev/ndctlX bus-ioctl
+ message passing interface, and a /dev/nmemX dimm-ioctl
+ message interface for each memory device registered on the
+ bus.  A userspace library, ndctl, provides an API
+ to enumerate/manage this subsystem.
+
+config ND_ACPI
+   tristate "ACPI: NFIT to libnd bus support"
+   select LIBND
+   depends on ACPI
+   help
+ Infrastructure to probe ACPI 6 compliant platforms for
+ NVDIMMs (NFIT) and register a libnd device tree.  In
+ addition to storage devices this also enables libnd craft
+ ACPI._DSM messages for platform/dimm configuration.
+
+endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
new file mode 100644
index 000000000000..944b5947c0cb
--- /dev/null
+++ b/drivers/block/nd/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_LIBND) += libnd.o
+obj-$(CONFIG_ND_ACPI) += nd_acpi.o
+
+nd_acpi-y := acpi.o
+
+libnd-y := core.o
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
new file mode 100644
index 000000000000..9f0b24390d1b
--- /dev/null
+++ b/drivers/block/nd/acpi.c
@@ -0,0 +1,475 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License

[PATCH v2 01/20] e820, efi: add ACPI 6.0 persistent memory types

2015-04-28 Thread Dan Williams
ACPI 6.0 formalizes e820-type-7 and efi-type-14 as persistent memory.
Mark it reserved and allow it to be claimed by a persistent memory
device driver.

This definition is in addition to the Linux kernel's existing type-12
definition that was recently added in support of shipping platforms with
NVDIMM support that predate ACPI 6.0 (which now classifies type-12 as
OEM reserved).  We may choose to exploit this wealth of definitions for
NVDIMMs to differentiate E820_PRAM (type-12) from E820_PMEM (type-7).
One potential differentiation is that PMEM is not backed by struct page
by default in contrast to PRAM.  For now, they are effectively treated
as aliases by the mm.

Note, /proc/iomem can be consulted for differentiating legacy
Persistent RAM E820_PRAM vs standard Persistent I/O Memory
E820_PMEM.
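
To make the two definitions concrete, the sketch below maps the type
numbers given above (7 for E820_PMEM, 12 for the legacy E820_PRAM) to
the names this patch reports for them.  It is a userspace illustration
only; the SKETCH_* macros and sketch_* functions are hypothetical and
every other type is lumped under "reserved" for brevity.

```c
#include <stdbool.h>

/* Type numbers as stated in the patch description. */
#define SKETCH_E820_PMEM 7	/* ACPI 6.0 persistent memory */
#define SKETCH_E820_PRAM 12	/* pre-ACPI-6.0 legacy persistent RAM */

/* Name reported for an e820 type, following the strings in this patch. */
const char *sketch_e820_name(unsigned int type)
{
	switch (type) {
	case SKETCH_E820_PRAM:
		return "Persistent RAM";
	case SKETCH_E820_PMEM:
		return "Persistent I/O Memory";
	default:
		return "reserved";
	}
}

/* Both types are treated as persistent memory by the mm for now. */
bool sketch_e820_is_persistent(unsigned int type)
{
	return type == SKETCH_E820_PMEM || type == SKETCH_E820_PRAM;
}
```

The distinct names are what let a reader of /proc/iomem tell the two
apart even though the mm treats them as aliases.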

Cc: Boaz Harrosh b...@plexistor.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Christoph Hellwig h...@lst.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Borislav Petkov b...@alien8.de
Cc: H. Peter Anvin h...@zytor.com
Cc: Jens Axboe ax...@fb.com
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Matthew Wilcox wi...@linux.intel.com
Cc: Thomas Gleixner t...@linutronix.de
Acked-by: Andy Lutomirski l...@amacapital.net
Reviewed-by: Ross Zwisler ross.zwis...@linux.intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 arch/arm64/kernel/efi.c  |1 +
 arch/ia64/kernel/efi.c   |4 
 arch/x86/boot/compressed/eboot.c |4 
 arch/x86/include/uapi/asm/e820.h |1 +
 arch/x86/kernel/e820.c   |   26 +++---
 arch/x86/platform/efi/efi.c  |3 +++
 include/linux/efi.h  |3 ++-
 7 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index ab21e0d58278..9d4aa18f2a82 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -158,6 +158,7 @@ static __init int is_reserve_region(efi_memory_desc_t *md)
case EFI_BOOT_SERVICES_CODE:
case EFI_BOOT_SERVICES_DATA:
case EFI_CONVENTIONAL_MEMORY:
+   case EFI_PERSISTENT_MEMORY:
return 0;
default:
break;
diff --git a/arch/ia64/kernel/efi.c b/arch/ia64/kernel/efi.c
index c52d7540dc05..9028bc268cd7 100644
--- a/arch/ia64/kernel/efi.c
+++ b/arch/ia64/kernel/efi.c
@@ -1223,6 +1223,10 @@ efi_initialize_iomem_resources(struct resource 
*code_resource,
flags |= IORESOURCE_DISABLED;
break;
 
+   case EFI_PERSISTENT_MEMORY:
+   name = "persistent";
+   break;
+
case EFI_RESERVED_TYPE:
case EFI_RUNTIME_SERVICES_CODE:
case EFI_RUNTIME_SERVICES_DATA:
diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c
index ef17683484e9..dde5bf7726f4 100644
--- a/arch/x86/boot/compressed/eboot.c
+++ b/arch/x86/boot/compressed/eboot.c
@@ -1222,6 +1222,10 @@ static efi_status_t setup_e820(struct boot_params 
*params,
e820_type = E820_NVS;
break;
 
+   case EFI_PERSISTENT_MEMORY:
+   e820_type = E820_PMEM;
+   break;
+
default:
continue;
}
diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
index 960a8a9dc4ab..0f457e6eab18 100644
--- a/arch/x86/include/uapi/asm/e820.h
+++ b/arch/x86/include/uapi/asm/e820.h
@@ -32,6 +32,7 @@
 #define E820_ACPI  3
 #define E820_NVS   4
 #define E820_UNUSABLE  5
+#define E820_PMEM  7
 
 /*
  * This is a non-standardized way to represent ADR or NVDIMM regions that
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 11cc7d54ec3f..d38b53a7e9b2 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -149,6 +149,7 @@ static void __init e820_print_type(u32 type)
case E820_UNUSABLE:
printk(KERN_CONT "unusable");
break;
+   case E820_PMEM:
case E820_PRAM:
printk(KERN_CONT "persistent (type %u)", type);
break;
@@ -919,10 +920,31 @@ static inline const char *e820_type_to_string(int 
e820_type)
case E820_NVS:  return "ACPI Non-volatile Storage";
case E820_UNUSABLE: return "Unusable memory";
case E820_PRAM: return "Persistent RAM";
+   case E820_PMEM: return "Persistent I/O Memory";
default:return "reserved";
}
 }
 
+static bool do_mark_busy(u32 type, struct resource *res)
+{
+   /* this is the legacy bios/dos rom-shadow + mmio region */
+   if (res->start < (1ULL<<20))
+   return true;
+
+   /*
+* Treat persistent memory like device memory, i.e. reserve it
+* for exclusive use of a driver
+*/
+   switch (type) {
+   case E820_RESERVED:
+   case

[PATCH v2 13/20] libnd: namespace indices: read and validate

2015-04-28 Thread Dan Williams
The on-media label format consists of two index blocks followed by an array
of labels.  None of these structures are ever updated in place.  A
sequence number tracks the current active index and the next one to
write, while labels are written to free slots.

    +------------+
    |            |
    |  nsindex0  |
    |            |
    +------------+
    |            |
    |  nsindex1  |
    |            |
    +------------+
    |   label0   |
    +------------+
    |   label1   |
    +------------+
    |            |
     ....nslot...
    |            |
    +------------+
    |   labelN   |
    +------------+

After reading valid labels, store the dpa ranges they claim into
per-dimm resource trees.
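
Choosing the active index block comes down to comparing the sequence
numbers of the two blocks.  The description above does not say how the
number wraps, so the following sketch assumes a 2-bit cyclic counter
(1 -> 2 -> 3 -> 1, with 0 meaning "never written").  The sketch_*
names are hypothetical and this is an illustration of the idea, not
the code in label.c.

```c
#include <stdint.h>

#define SKETCH_SEQ_MASK 0x3u

/* Advance a 2-bit sequence: 1 -> 2 -> 3 -> 1; 0 (invalid) stays 0. */
uint32_t sketch_inc_seq(uint32_t seq)
{
	static const uint32_t next[] = { 0, 2, 3, 1 };

	return next[seq & SKETCH_SEQ_MASK];
}

/*
 * Return the newer of two sequence numbers.  A value of 0 marks an
 * index block that was never written, so the other block wins.
 */
uint32_t sketch_best_seq(uint32_t a, uint32_t b)
{
	a &= SKETCH_SEQ_MASK;
	b &= SKETCH_SEQ_MASK;

	if (a == 0 || a == b)
		return b;
	if (b == 0)
		return a;
	if (sketch_inc_seq(a) == b)
		return b;	/* b directly follows a in the cycle */
	return a;
}
```

Because the blocks are never updated in place, a writer fills the
inactive block with the next sequence number; a torn write leaves the
older block as the winner of this comparison.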

Cc: Neil Brown ne...@suse.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Makefile|1 
 drivers/block/nd/dimm.c  |   26 +++-
 drivers/block/nd/dimm_devs.c |6 +
 drivers/block/nd/label.c |  291 ++
 drivers/block/nd/label.h |  129 +++
 drivers/block/nd/nd.h|   45 ++
 include/uapi/linux/ndctl.h   |1 
 7 files changed, 495 insertions(+), 4 deletions(-)
 create mode 100644 drivers/block/nd/label.c
 create mode 100644 drivers/block/nd/label.h

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index ebb212af9f15..d588f691163c 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -31,3 +31,4 @@ libnd-y += dimm.o
 libnd-y += region_devs.o
 libnd-y += region.o
 libnd-y += namespace_devs.o
+libnd-y += label.o
diff --git a/drivers/block/nd/dimm.c b/drivers/block/nd/dimm.c
index 6b7d2842509c..5477176c5de0 100644
--- a/drivers/block/nd/dimm.c
+++ b/drivers/block/nd/dimm.c
@@ -18,6 +18,7 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 #include <linux/nd.h>
+#include "label.h"
 #include "nd.h"
 
 static void free_data(struct nd_dimm_drvdata *ndd)
@@ -42,7 +43,12 @@ static int nd_dimm_probe(struct device *dev)
return -ENOMEM;
 
dev_set_drvdata(dev, ndd);
-ndd->dev = dev;
+   ndd->dpa.name = dev_name(dev);
+   ndd->ns_current = -1;
+   ndd->ns_next = -1;
+   ndd->dpa.start = 0;
+   ndd->dpa.end = -1;
+   ndd->dev = dev;
 
rc = nd_dimm_init_nsarea(ndd);
if (rc)
@@ -54,18 +60,34 @@ static int nd_dimm_probe(struct device *dev)
 
dev_dbg(dev, "config data size: %d\n", ndd->nsarea.config_size);
 
+   nd_bus_lock(dev);
+   ndd->ns_current = nd_label_validate(ndd);
+   ndd->ns_next = nd_label_next_nsindex(ndd->ns_current);
+   nd_label_copy(ndd, to_next_namespace_index(ndd),
+   to_current_namespace_index(ndd));
+   rc = nd_label_reserve_dpa(ndd);
+   nd_bus_unlock(dev);
+
+   if (rc)
+   goto err;
+
return 0;
 
  err:
free_data(ndd);
return rc;
-
 }
 
 static int nd_dimm_remove(struct device *dev)
 {
struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);
+   struct resource *res, *_r;
 
+   nd_bus_lock(dev);
+   dev_set_drvdata(dev, NULL);
+   for_each_dpa_resource_safe(ndd, res, _r)
+   __release_region(&ndd->dpa, res->start, resource_size(res));
+   nd_bus_unlock(dev);
free_data(ndd);
 
return 0;
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 8981adc59ba4..3fbd0d0502eb 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -92,8 +92,12 @@ int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd)
if (ndd->data)
return 0;
 
-   if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0)
+   if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0
+   || ndd->nsarea.config_size < ND_LABEL_MIN_SIZE) {
+   dev_dbg(ndd->dev, "failed to init config data area: (%d:%d)\n",
+   ndd->nsarea.max_xfer, ndd->nsarea.config_size);
return -ENXIO;
+   }
 
ndd->data = kmalloc(ndd->nsarea.config_size, GFP_KERNEL);
if (!ndd->data)
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
new file mode 100644
index 000000000000..e791ea8bbdde
--- /dev/null
+++ b/drivers/block/nd/label.c
@@ -0,0 +1,291 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/device.h>
+#include <linux/ndctl.h>
+#include <linux/io.h>
+#include <linux/nd.h>
+#include "nd-private.h"
+#include "label.h"
+#include "nd.h"
+
+#include <asm-generic/io-64-nonatomic-lo-hi.h>
+
+static u32 best_seq(u32

[PATCH v2 18/20] libnd: infrastructure for btt devices

2015-04-28 Thread Dan Williams
Block devices from an nd bus, in addition to accepting struct bio
based requests, also have the capability to perform byte-aligned
accesses.  By default only the bio/block interface is used.  However, if
another driver can make effective use of the byte-aligned capability it
can claim/disable the block interface and use the byte-aligned nd_io
interface.

The BTT driver is the intended first consumer of this mechanism to allow
layering atomic sector update guarantees on top of nd_io capable
nd-bus-block-devices.

Cc: Greg KH gre...@linuxfoundation.org
Cc: Neil Brown ne...@suse.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Kconfig  |3 
 drivers/block/nd/Makefile |1 
 drivers/block/nd/btt.h|   45 
 drivers/block/nd/btt_devs.c   |  442 +
 drivers/block/nd/bus.c|  128 
 drivers/block/nd/core.c   |   79 +++
 drivers/block/nd/nd-private.h |   28 +++
 drivers/block/nd/nd.h |   94 +
 drivers/block/nd/pmem.c   |   29 +++
 include/uapi/linux/ndctl.h|2 
 10 files changed, 847 insertions(+), 4 deletions(-)
 create mode 100644 drivers/block/nd/btt.h
 create mode 100644 drivers/block/nd/btt_devs.c

diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index c5eaf195734d..15896db4de37 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -95,4 +95,7 @@ config BLK_DEV_PMEM
 
  Say Y if you want to use a NVDIMM described by NFIT
 
+config ND_BTT_DEVS
+   def_bool y
+
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index d588f691163c..0c6d64b7a69d 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -32,3 +32,4 @@ libnd-y += region_devs.o
 libnd-y += region.o
 libnd-y += namespace_devs.o
 libnd-y += label.o
+libnd-$(CONFIG_ND_BTT_DEVS) += btt_devs.o
diff --git a/drivers/block/nd/btt.h b/drivers/block/nd/btt.h
new file mode 100644
index ..e8f6d8e0ddd3
--- /dev/null
+++ b/drivers/block/nd/btt.h
@@ -0,0 +1,45 @@
+/*
+ * Block Translation Table library
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_BTT_H
+#define _LINUX_BTT_H
+
+#include <linux/types.h>
+
+#define BTT_SIG_LEN 16
+#define BTT_SIG "BTT_ARENA_INFO\0"
+
+struct btt_sb {
+   u8 signature[BTT_SIG_LEN];
+   u8 uuid[16];
+   u8 parent_uuid[16];
+   __le32 flags;
+   __le16 version_major;
+   __le16 version_minor;
+   __le32 external_lbasize;
+   __le32 external_nlba;
+   __le32 internal_lbasize;
+   __le32 internal_nlba;
+   __le32 nfree;
+   __le32 infosize;
+   __le64 nextoff;
+   __le64 dataoff;
+   __le64 mapoff;
+   __le64 logoff;
+   __le64 info2off;
+   u8 padding[3968];
+   __le64 checksum;
+};
+
+#endif
diff --git a/drivers/block/nd/btt_devs.c b/drivers/block/nd/btt_devs.c
new file mode 100644
index ..e6f0b8b999d8
--- /dev/null
+++ b/drivers/block/nd/btt_devs.c
@@ -0,0 +1,442 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/device.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include "nd-private.h"
+#include "btt.h"
+#include "nd.h"
+
+static DEFINE_IDA(btt_ida);
+
+static void nd_btt_release(struct device *dev)
+{
+   struct nd_btt *nd_btt = to_nd_btt(dev);
+
+   dev_dbg(dev, "%s\n", __func__);
+   WARN_ON(nd_btt->backing_dev);
+   ndio_del_claim(nd_btt->ndio_claim);
+   ida_simple_remove(&btt_ida, nd_btt->id);
+   kfree(nd_btt->uuid);
+   kfree(nd_btt);
+}
+
+static struct device_type nd_btt_device_type = {
+   .name = "nd_btt",
+   .release = nd_btt_release,
+};
+
+bool is_nd_btt(struct device *dev)
+{
+   return dev->type == &nd_btt_device_type;
+}
+
+struct nd_btt *to_nd_btt(struct device *dev)
+{
+   struct nd_btt *nd_btt = container_of(dev, struct nd_btt, dev);
+
+   WARN_ON(!is_nd_btt(dev));
+   return nd_btt;
+}
+EXPORT_SYMBOL(to_nd_btt);
+
+static

[PATCH v2 17/20] libnd: write blk label set

2015-04-28 Thread Dan Williams
After 'uuid', 'size', 'sector_size', and optionally 'alt_name' have been
set to valid values the labels on the dimm can be updated.  The
difference with the pmem case is that blk namespaces are limited to one
dimm and can cover discontiguous ranges in dpa space.

Also, after allocating label slots, it is useful for userspace to know
how many slots are left.  Export this information in sysfs.
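
The value exported through 'available_slots' in this patch is one less than
the raw free count, clamped at zero.  A minimal userspace sketch of that
arithmetic follows; the function name available_slots() is made up for
illustration, and the excerpt does not say why one slot is held back, so no
reason is assumed here.

```c
#include <stdint.h>

/*
 * Mirror of the arithmetic in available_slots_show(): report one fewer
 * than the raw free count, clamping at zero instead of wrapping around.
 */
static uint32_t available_slots(uint32_t nfree)
{
	/* For an unsigned value, nfree - 1 > nfree holds only when nfree == 0. */
	if ((uint32_t)(nfree - 1) > nfree)
		return 0;
	return nfree - 1;
}
```

The wraparound comparison is equivalent to testing nfree == 0; the kernel
code additionally emits a one-time warning in that case.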

Cc: Greg KH gre...@linuxfoundation.org
Cc: Neil Brown ne...@suse.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/bus.c|4 
 drivers/block/nd/dimm_devs.c  |   25 +++
 drivers/block/nd/label.c  |  297 +++--
 drivers/block/nd/label.h  |5 +
 drivers/block/nd/namespace_devs.c |   57 +++
 drivers/block/nd/nd-private.h |1 
 6 files changed, 367 insertions(+), 22 deletions(-)

diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 819259e92468..6c272f245f4e 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -136,6 +136,10 @@ static void nd_async_device_unregister(void *d, 
async_cookie_t cookie)
 {
struct device *dev = d;
 
+   /* flush bus operations before delete */
+   nd_bus_lock(dev);
+   nd_bus_unlock(dev);
+
device_unregister(dev);
put_device(dev);
 }
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 358b2a06d680..4b225c8b7d0a 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -19,6 +19,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include "nd-private.h"
+#include "label.h"
 #include "nd.h"
 
 static DEFINE_IDA(dimm_ida);
@@ -262,9 +263,33 @@ static ssize_t state_show(struct device *dev, struct 
device_attribute *attr,
 }
 static DEVICE_ATTR_RO(state);
 
+static ssize_t available_slots_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);
+   ssize_t rc;
+   u32 nfree;
+
+   if (!ndd)
+   return -ENXIO;
+
+   nd_bus_lock(dev);
+   nfree = nd_label_nfree(ndd);
+   if (nfree - 1 > nfree) {
+   dev_WARN_ONCE(dev, 1, "we ate our last label?\n");
+   nfree = 0;
+   } else
+   nfree--;
+   rc = sprintf(buf, "%d\n", nfree);
+   nd_bus_unlock(dev);
+   return rc;
+}
+static DEVICE_ATTR_RO(available_slots);
+
 static struct attribute *nd_dimm_attributes[] = {
&dev_attr_state.attr,
&dev_attr_commands.attr,
+   &dev_attr_available_slots.attr,
NULL,
 };
 
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
index 78898b642191..069c26d50ed1 100644
--- a/drivers/block/nd/label.c
+++ b/drivers/block/nd/label.c
@@ -58,7 +58,7 @@ size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd)
return ndd->nsindex_size;
 }
 
-static int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd)
+int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd)
 {
return ndd->nsarea.config_size / 129;
 }
@@ -416,7 +416,7 @@ u32 nd_label_nfree(struct nd_dimm_drvdata *ndd)
WARN_ON(!is_nd_bus_locked(ndd->dev));
 
if (!preamble_next(ndd, &nsindex, &free, &nslot))
-   return 0;
+   return nd_dimm_num_label_slots(ndd);
 
return bitmap_weight(free, nslot);
 }
@@ -553,22 +553,270 @@ static int __pmem_label_update(struct nd_region 
*nd_region,
return 0;
 }
 
-static int init_labels(struct nd_mapping *nd_mapping)
+static void del_label(struct nd_mapping *nd_mapping, int l)
+{
+   struct nd_namespace_label __iomem *next_label, __iomem *nd_label;
+   struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+   unsigned int slot;
+   int j;
+
+   nd_label = nd_get_label(nd_mapping->labels, l);
+   slot = to_slot(ndd, nd_label);
+   dev_vdbg(ndd->dev, "%s: clear: %d\n", __func__, slot);
+
+   for (j = l; (next_label = nd_get_label(nd_mapping->labels, j + 1)); j++)
+   nd_set_label(nd_mapping->labels, next_label, j);
+   nd_set_label(nd_mapping->labels, NULL, j);
+}
+
+static bool is_old_resource(struct resource *res, struct resource **list, int n)
 {
int i;
+
+   if (res->flags & DPA_RESOURCE_ADJUSTED)
+   return false;
+   for (i = 0; i < n; i++)
+   if (res == list[i])
+   return true;
+   return false;
+}
+
+static struct resource *to_resource(struct nd_dimm_drvdata *ndd,
+   struct nd_namespace_label __iomem *nd_label)
+{
+   struct resource *res;
+
+   for_each_dpa_resource(ndd, res) {
+   if (res->start != readq(&nd_label->dpa))
+   continue;
+   if (resource_size(res) != readq(&nd_label->rawsize))
+   continue;
+   return res;
+   }
+
+   return NULL;
+}
+
+/*
+ * 1/ Account all the labels that can be freed after this update
+ * 2/ Allocate and write the label

[PATCH v2 15/20] libnd: blk labels and namespace instantiation

2015-04-28 Thread Dan Williams
A blk label set describes a namespace comprised of one or more
discontiguous dpa ranges on a single dimm.  They may alias with one or
more pmem interleave sets that include the given dimm.

This is the runtime/volatile configuration infrastructure for sysfs
manipulation of 'alt_name', 'uuid', 'size', and 'sector_size'.  A later
patch will make these settings persistent by writing back the label(s).
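
The 'sector_size' handling added in core.c boils down to validating a value
against a zero-terminated list and printing that list with the current
choice in brackets.  A userspace sketch of that logic is below.  The
function names and the example list of sizes are hypothetical, and the
kernel's additional -EBUSY check for a bound driver is left out.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical zero-terminated list of supported sector sizes. */
static const unsigned long example_supported[] = { 512, 520, 4096, 0 };

/* Accept lbasize only if it appears in the zero-terminated list. */
static int sector_size_store(unsigned long lbasize, unsigned long *current,
			     const unsigned long *supported)
{
	size_t i;

	for (i = 0; supported[i]; i++)
		if (supported[i] == lbasize) {
			*current = lbasize;
			return 0;
		}
	return -1; /* the kernel helper returns -EINVAL here */
}

/* Format the list with the current selection in brackets. */
static size_t sector_size_show(unsigned long current,
			       const unsigned long *supported,
			       char *buf, size_t len)
{
	size_t n = 0, i;
	int w;

	if (len == 0)
		return 0;
	buf[0] = '\0';
	for (i = 0; supported[i] && n < len; i++) {
		if (current == supported[i])
			w = snprintf(buf + n, len - n, "[%lu] ", supported[i]);
		else
			w = snprintf(buf + n, len - n, "%lu ", supported[i]);
		if (w < 0)
			return n;
		n += (size_t)w;
	}
	if (n < len) {
		w = snprintf(buf + n, len - n, "\n");
		if (w > 0)
			n += (size_t)w;
	}
	/* On truncation report the length actually stored in buf. */
	return n < len ? n : len - 1;
}
```

An unsupported value leaves the current setting untouched, which matches
the behaviour of the helper in the patch.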

Unlike pmem namespaces, multiple blk namespaces can be created per
region.  Once a blk namespace has been created a new seed device
(unconfigured child of a parent blk region) is instantiated.  As long as
a region has 'available_size' != 0 new child namespaces may be created.
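
The 'available_size' accounting for a BLK region amounts to subtracting the
busy dpa ranges that overlap the mapping from the mapping's size.  The
following is a simplified userspace sketch of that idea, not the branch
structure of nd_blk_available_dpa() itself.  The type and function names are
made up, ranges are inclusive, and the busy ranges are assumed not to
overlap one another.

```c
#include <stddef.h>
#include <stdint.h>

struct dpa_range {
	uint64_t start;
	uint64_t end; /* inclusive */
};

/* Number of bytes shared by two inclusive ranges (0 if disjoint). */
static uint64_t dpa_overlap(struct dpa_range a, struct dpa_range b)
{
	uint64_t lo = a.start > b.start ? a.start : b.start;
	uint64_t hi = a.end < b.end ? a.end : b.end;

	return lo <= hi ? hi - lo + 1 : 0;
}

/* Free space in 'map' after subtracting every overlapping busy range. */
static uint64_t blk_available(struct dpa_range map,
			      const struct dpa_range *busy, size_t n)
{
	uint64_t size = map.end - map.start + 1, used = 0;
	size_t i;

	for (i = 0; i < n; i++)
		used += dpa_overlap(map, busy[i]);
	return used < size ? size - used : 0;
}
```

A busy range that covers the whole mapping (the "total eclipse" case in the
patch) drives the result to zero.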

Cc: Greg KH gre...@linuxfoundation.org
Cc: Neil Brown ne...@suse.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/core.c   |   40 +++
 drivers/block/nd/dimm_devs.c  |   35 +++
 drivers/block/nd/libnd.h  |3 
 drivers/block/nd/namespace_devs.c |  502 ++---
 drivers/block/nd/nd-private.h |8 +
 drivers/block/nd/nd.h |5 
 drivers/block/nd/region_devs.c|   15 +
 include/linux/nd.h|   25 ++
 8 files changed, 589 insertions(+), 44 deletions(-)

diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index cf64b7a50d3a..3ec38289be58 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -170,6 +170,46 @@ int nd_uuid_store(struct device *dev, u8 **uuid_out, const 
char *buf,
return 0;
 }
 
+ssize_t nd_sector_size_show(unsigned long current_lbasize,
+   const unsigned long *supported, char *buf)
+{
+   ssize_t len = 0;
+   int i;
+
+   for (i = 0; supported[i]; i++)
+   if (current_lbasize == supported[i])
+   len += sprintf(buf + len, "[%ld] ", supported[i]);
+   else
+   len += sprintf(buf + len, "%ld ", supported[i]);
+   len += sprintf(buf + len, "\n");
+   return len;
+}
+
+ssize_t nd_sector_size_store(struct device *dev, const char *buf,
+   unsigned long *current_lbasize, const unsigned long *supported)
+{
+   unsigned long lbasize;
+   int rc, i;
+
+   if (dev->driver)
+   return -EBUSY;
+
+   rc = kstrtoul(buf, 0, &lbasize);
+   if (rc)
+   return rc;
+
+   for (i = 0; supported[i]; i++)
+   if (lbasize == supported[i])
+   break;
+
+   if (supported[i]) {
+   *current_lbasize = lbasize;
+   return 0;
+   } else {
+   return -EINVAL;
+   }
+}
+
 static ssize_t commands_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index b242d3ae6d12..4aa5654354ac 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -256,6 +256,41 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void 
*provider_data,
 EXPORT_SYMBOL_GPL(nd_dimm_create);
 
 /**
+ * nd_blk_available_dpa - account the unused dpa of BLK region
+ * @nd_mapping: container of dpa-resource-root + labels
+ *
+ * Unlike PMEM, BLK namespaces can occupy discontiguous DPA ranges.
+ */
+resource_size_t nd_blk_available_dpa(struct nd_mapping *nd_mapping)
+{
+   struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+   resource_size_t map_end, busy = 0, available;
+   struct resource *res;
+
+   if (!ndd)
+   return 0;
+
+   map_end = nd_mapping->start + nd_mapping->size - 1;
+   for_each_dpa_resource(ndd, res)
+   if (res->start >= nd_mapping->start && res->start < map_end) {
+   resource_size_t end = min(map_end, res->end);
+
+   busy += end - res->start + 1;
+   } else if (res->end >= nd_mapping->start && res->end <= map_end) {
+   busy += res->end - nd_mapping->start;
+   } else if (nd_mapping->start > res->start
+&& nd_mapping->start < res->end) {
+   /* total eclipse of the BLK region mapping */
+   busy += nd_mapping->size;
+   }
+
+   available = map_end - nd_mapping->start + 1;
+   if (busy < available)
+   return available - busy;
+   return 0;
+}
+
+/**
  * nd_pmem_available_dpa - for the given dimm+region account unallocated dpa
  * @nd_mapping: container of dpa-resource-root + labels
  * @nd_region: constrain available space check to this reference region
diff --git a/drivers/block/nd/libnd.h b/drivers/block/nd/libnd.h
index 832dcfebbb49..3f6b5e09cd67 100644
--- a/drivers/block/nd/libnd.h
+++ b/drivers/block/nd/libnd.h
@@ -26,6 +26,9 @@ enum {
ND_CMD_MAX_ENVELOPE = 16,
ND_CMD_ARS_QUERY_MAX = SZ_4K,
ND_MAX_MAPPINGS = 32,
+
+   /* mark newly adjusted resources as requiring a label update */
+   DPA_RESOURCE_ADJUSTED = 1 << 0,
 };
 
 extern

[PATCH v2 12/20] libnd, nd_acpi: add interleave-set state-tracking infrastructure

2015-04-28 Thread Dan Williams
On platforms that have firmware support for reading/writing per-dimm
label space, a portion of the dimm may be accessible via an interleave
set PMEM mapping in addition to the dimm's BLK (block-data-window
aperture(s)) interface.  A label, stored in a configuration data
region on the dimm, disambiguates which dimm addresses are accessed
through which exclusive interface.

Add infrastructure that allows the kernel to block modifications to a
label in the set while any member dimm is active.  Note that this is
meant only for enforcing no modifications of active labels via the
coarse ioctl command.  Adding/deleting namespaces from an active
interleave set will only be possible via sysfs.

Another aspect of tracking interleave sets is tracking their integrity
when DIMMs in a set are physically re-ordered.  For this purpose we
generate an interleave-set cookie that can be recorded in a label and
validated against the current configuration.  It is the bus provider
implementation's responsibility to calculate the interleave set cookie
and attach it to a given region.
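
To make the cookie idea concrete: the nd_acpi code in this patch collects a
(region_spa_offset, serial_number) pair per mapping and sorts the pairs by
offset.  The excerpt is cut off before the cookie itself is computed, so the
checksum below is an assumption, a simple Fletcher-style sum used purely as
a stand-in.  The struct and function names are likewise illustrative.  The
sketch also sorts numerically, whereas the patch compares the raw u64 with
memcmp().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct set_map {
	uint64_t region_spa_offset;
	uint32_t serial_number;
};

/* Order mappings by their offset within the interleave set. */
static int cmp_map(const void *a, const void *b)
{
	const struct set_map *m0 = a, *m1 = b;

	return (m0->region_spa_offset > m1->region_spa_offset)
		- (m0->region_spa_offset < m1->region_spa_offset);
}

/* One step of a Fletcher-style checksum (stand-in, see lead-in). */
static void mix(uint32_t word, uint32_t *lo, uint64_t *hi)
{
	*lo += word;
	*hi += *lo;
}

/* Sort, then checksum, so enumeration order does not affect the result. */
static uint64_t interleave_set_cookie(struct set_map *maps, size_t n)
{
	uint32_t lo = 0;
	uint64_t hi = 0;
	size_t i;

	qsort(maps, n, sizeof(*maps), cmp_map);
	for (i = 0; i < n; i++) {
		mix((uint32_t)maps[i].region_spa_offset, &lo, &hi);
		mix((uint32_t)(maps[i].region_spa_offset >> 32), &lo, &hi);
		mix(maps[i].serial_number, &lo, &hi);
	}
	return (hi << 32) | lo;
}
```

Because the pairs are sorted first, the order in which mappings are
enumerated does not matter, but moving a DIMM (serial number) to a different
offset in the set changes the cookie.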

Cc: Neil Brown ne...@suse.de
Cc: linux-a...@vger.kernel.org
Cc: Greg KH gre...@linuxfoundation.org
Cc: Robert Moore robert.mo...@intel.com
Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/acpi.c|   90 
 drivers/block/nd/bus.c |   41 ++
 drivers/block/nd/core.c|   47 +
 drivers/block/nd/dimm_devs.c   |   19 
 drivers/block/nd/libnd.h   |6 +++
 drivers/block/nd/nd-private.h  |   11 -
 drivers/block/nd/nd.h  |4 ++
 drivers/block/nd/region_devs.c |   85 ++
 8 files changed, 299 insertions(+), 4 deletions(-)

diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index c3dda74f73d7..d34cefe38e2f 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -15,6 +15,7 @@
 #include <linux/ndctl.h>
 #include <linux/list.h>
 #include <linux/acpi.h>
+#include <linux/sort.h>
 #include "acpi_nfit.h"
 #include "libnd.h"
 
@@ -779,6 +780,90 @@ static const struct attribute_group 
*nd_acpi_region_attribute_groups[] = {
NULL,
 };
 
+/* enough info to uniquely specify an interleave set */
+struct nfit_set_info {
+   struct nfit_set_info_map {
+   u64 region_spa_offset;
+   u32 serial_number;
+   u32 pad;
+   } mapping[0];
+};
+
+static size_t sizeof_nfit_set_info(int num_mappings)
+{
+   return sizeof(struct nfit_set_info)
+   + num_mappings * sizeof(struct nfit_set_info_map);
+}
+
+static int cmp_map(const void *m0, const void *m1)
+{
+   const struct nfit_set_info_map *map0 = m0;
+   const struct nfit_set_info_map *map1 = m1;
+
+   return memcmp(&map0->region_spa_offset, &map1->region_spa_offset,
+   sizeof(u64));
+}
+
+/* Retrieve the nth entry referencing this spa */
+static struct acpi_nfit_memdev *memdev_from_spa(
+   struct acpi_nfit_desc *acpi_desc, u16 spa_index, int n)
+{
+struct nfit_memdev *nfit_memdev;
+
+list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list)
+if (nfit_memdev->memdev->spa_index == spa_index)
+if (n-- == 0)
+return nfit_memdev->memdev;
+return NULL;
+}
+
+static int nd_acpi_init_interleave_set(struct acpi_nfit_desc *acpi_desc,
+   struct nd_region_desc *ndr_desc, struct acpi_nfit_spa *spa)
+{
+   u16 num_mappings = ndr_desc->num_mappings;
+   int i, spa_type = nfit_spa_type(spa);
+   struct device *dev = acpi_desc->dev;
+   struct nd_interleave_set *nd_set;
+   struct nfit_set_info *info;
+
+   if (spa_type == NFIT_SPA_PM || spa_type == NFIT_SPA_VOLATILE)
+   /* pass */;
+   else
+   return 0;
+
+   nd_set = devm_kzalloc(dev, sizeof(*nd_set), GFP_KERNEL);
+   if (!nd_set)
+   return -ENOMEM;
+
+   info = devm_kzalloc(dev, sizeof_nfit_set_info(num_mappings), GFP_KERNEL);
+   if (!info)
+   return -ENOMEM;
+   for (i = 0; i < num_mappings; i++) {
+   struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
+   struct nfit_set_info_map *map = &info->mapping[i];
+   struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+   struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+   struct acpi_nfit_memdev *memdev = memdev_from_spa(acpi_desc,
+   spa->spa_index, i);
+
+   if (!memdev || !nfit_mem->dcr) {
+   dev_err(dev, "%s: failed to find DCR\n", __func__);
+   return -ENODEV;
+   }
+
+   map->region_spa_offset = memdev->region_spa_offset;
+   map->serial_number = nfit_mem->dcr->serial_number;
+   }
+
+   sort(info-mapping[0

[PATCH v2 19/20] nd_btt: atomic sector updates

2015-04-28 Thread Dan Williams
From: Vishal Verma vishal.l.ve...@linux.intel.com

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the ->rw_bytes() capability
of provided nd namespace devices.

The BTT works as a stacked block device, and reserves a chunk of space
from the backing device for its accounting metadata.  BLK namespaces may
mandate use of a BTT and expect the bus to initialize a BTT if not
already present.  Otherwise if a BTT is desired for other namespaces (or
partitions of a namespace) a BTT may be manually configured.
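
For readers new to the idea, the way an indirection table yields untorn
sector writes can be shown with a toy in-memory model.  This is a sketch of
the general technique only: all names and sizes are invented, and a real
implementation additionally has to make the map update itself power-fail
safe, which this model ignores.

```c
#include <stdint.h>
#include <string.h>

#define TOY_SECTOR 512
#define TOY_NLBA   4	/* externally visible sectors (toy value) */

struct toy_btt {
	uint8_t  blocks[TOY_NLBA + 1][TOY_SECTOR]; /* one spare block */
	uint32_t map[TOY_NLBA];	/* external LBA -> internal block */
	uint32_t free_block;	/* spare block, not referenced by map */
};

static void toy_btt_init(struct toy_btt *b)
{
	uint32_t i;

	memset(b, 0, sizeof(*b));
	for (i = 0; i < TOY_NLBA; i++)
		b->map[i] = i;
	b->free_block = TOY_NLBA;
}

/* Step 1: stage new data in the spare block; readers still see old data. */
static void toy_btt_stage(struct toy_btt *b, const uint8_t *data)
{
	memcpy(b->blocks[b->free_block], data, TOY_SECTOR);
}

/* Step 2: publish by swapping one map entry; old block becomes the spare. */
static void toy_btt_commit(struct toy_btt *b, uint32_t lba)
{
	uint32_t old = b->map[lba];

	b->map[lba] = b->free_block;
	b->free_block = old;
}

static const uint8_t *toy_btt_read(const struct toy_btt *b, uint32_t lba)
{
	return b->blocks[b->map[lba]];
}
```

If power is lost between the two steps, the map still points at the old
block, so a reader sees either the complete old sector or the complete new
one, never a mix.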

Cc: Andy Lutomirski l...@amacapital.net
Cc: Boaz Harrosh b...@plexistor.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Jens Axboe ax...@fb.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Christoph Hellwig h...@lst.de
Cc: Neil Brown ne...@suse.de
Cc: Jeff Moyer jmo...@redhat.com
Cc: Dave Chinner da...@fromorbit.com
Cc: Greg KH gre...@linuxfoundation.org
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma vishal.l.ve...@linux.intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 Documentation/blockdev/btt.txt |  273 
 drivers/block/nd/Kconfig   |   20 +
 drivers/block/nd/Makefile  |3 
 drivers/block/nd/acpi.c|1 
 drivers/block/nd/btt.c | 1423 
 drivers/block/nd/btt.h |  140 
 drivers/block/nd/btt_devs.c|3 
 drivers/block/nd/libnd.h   |1 
 drivers/block/nd/nd-private.h  |1 
 drivers/block/nd/nd.h  |   10 
 drivers/block/nd/region.c  |   67 ++
 drivers/block/nd/region_devs.c |   10 
 12 files changed, 1948 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/blockdev/btt.txt
 create mode 100644 drivers/block/nd/btt.c

diff --git a/Documentation/blockdev/btt.txt b/Documentation/blockdev/btt.txt
new file mode 100644
index ..95134d5ec4a0
--- /dev/null
+++ b/Documentation/blockdev/btt.txt
@@ -0,0 +1,273 @@
+BTT - Block Translation Table
+=============================
+
+
+1. Introduction
+---------------
+
+Persistent memory based storage is able to perform IO at byte (or more
+accurately, cache line) granularity. However, we often want to expose such
+storage as traditional block devices. The block drivers for persistent memory
+will do exactly this. However, they do not provide any atomicity guarantees.
+Traditional SSDs typically provide protection against torn sectors in hardware,
+using stored energy in capacitors to complete in-flight block writes, or perhaps
+in firmware. We don't have this luxury with persistent memory - if a write is in
+progress, and we experience a power failure, the block will contain a mix of old
+and new data. Applications may not be prepared to handle such a scenario.
+
+The Block Translation Table (BTT) provides atomic sector update semantics for
+persistent memory devices, so that applications that rely on sector writes not
+being torn can continue to do so. The BTT manifests itself as a stacked block
+device, and reserves a portion of the underlying storage for its metadata. At
+the heart of it, is an indirection table that re-maps all the blocks on the
+volume. It can be thought of as an extremely simple file system that only
+provides atomic sector updates.
+
+
+2. Static Layout
+----------------
+
+The underlying storage on which a BTT can be laid out is not limited in any way.
+The BTT, however, splits the available space into chunks of up to 512 GiB,
+called Arenas.
+
+Each arena follows the same layout for its metadata, and all references in an
+arena are internal to it (with the exception of one field that points to the
+next arena). The following depicts the On-disk metadata layout:
+
+
+  Backing Store     +------->  Arena
++---------------+   |   +------------------+
+|               |   |   | Arena info block |
+|    Arena 0    +---+   |       4K         |
+|     512G      |       +------------------+
+|               |       |                  |
++---------------+       |                  |
+|               |       |                  |
+|    Arena 1    |       |   Data Blocks    |
+|     512G      |       |                  |
+|               |       |                  |
++---------------+       |                  |
+|       .       |       |                  |
+|       .       |       |                  |
+|       .       |       |                  |
+|               |       |                  |
+|               |       |                  |
++---------------+       +------------------+
+                        |                  |
+                        |     BTT Map

[PATCH v2 11/20] libnd, nd_pmem: add libnd support to the pmem driver

2015-04-28 Thread Dan Williams
nd_pmem attaches to persistent memory regions and namespaces emitted by
the nd subsystem, and, same as the original pmem driver, presents the
system-physical-address range as a block device.

The existing e820-type-12 to pmem setup is converted to a full libnd bus
that emits an nd_namespace_io device.

Cc: Andy Lutomirski l...@amacapital.net
Cc: Boaz Harrosh b...@plexistor.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Jens Axboe ax...@fb.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Christoph Hellwig h...@lst.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 arch/x86/kernel/pmem.c|2 -
 drivers/block/Kconfig |   11 -
 drivers/block/Makefile|1 
 drivers/block/nd/Kconfig  |   27 
 drivers/block/nd/Makefile |6 +++
 drivers/block/nd/e820.c   |  100 +
 drivers/block/nd/pmem.c   |   47 ++---
 7 files changed, 157 insertions(+), 37 deletions(-)
 create mode 100644 drivers/block/nd/e820.c
 rename drivers/block/{pmem.c => nd/pmem.c} (88%)

diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
index 3420c874ddc5..279328c42f87 100644
--- a/arch/x86/kernel/pmem.c
+++ b/arch/x86/kernel/pmem.c
@@ -13,7 +13,7 @@ static __init void register_pmem_device(struct resource *res)
struct platform_device *pdev;
int error;
 
-   pdev = platform_device_alloc("pmem", PLATFORM_DEVID_AUTO);
+   pdev = platform_device_alloc("e820_pmem", PLATFORM_DEVID_AUTO);
if (!pdev)
return;
 
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index dfe40e5ca9bd..1cef4ffb16c5 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -406,17 +406,6 @@ config BLK_DEV_RAM_DAX
  and will prevent RAM block device backing store memory from being
  allocated from highmem (only a problem for highmem systems).
 
-config BLK_DEV_PMEM
-   tristate "Persistent memory block device support"
-   help
- Saying Y here will allow you to use a contiguous range of reserved
- memory as one or more persistent block devices.
-
- To compile this driver as a module, choose M here: the module will be
- called 'pmem'.
-
- If unsure, say N.
-
 config CDROM_PKTCDVD
tristate "Packet writing on CD/DVD media"
depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 07a6acecf4d8..964d8eb2c16f 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,7 +14,6 @@ obj-$(CONFIG_PS3_VRAM)+= ps3vram.o
 obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)  += z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)  += brd.o
-obj-$(CONFIG_BLK_DEV_PMEM) += pmem.o
 obj-$(CONFIG_BLK_DEV_LOOP) += loop.o
 obj-$(CONFIG_BLK_CPQ_DA)   += cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index d2d84451e82c..c5eaf195734d 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -68,4 +68,31 @@ config NFIT_TEST
 
  Say N unless you are doing development of the 'nd' subsystem.
 
+config ND_E820
+   tristate "E820: Support the E820-type-12 PMEM convention"
+   depends on X86_PMEM_LEGACY
+   default m if X86_PMEM_LEGACY
+   select LIBND
+   help
+ Prior to ACPI 6 some platforms advertised persistent memory
+ via type-12 e820 memory ranges.  Create a libnd bus and
+ attach an instance of the pmem driver to these ranges.
+
+config BLK_DEV_PMEM
+   tristate "PMEM: Persistent memory block device support"
+   depends on LIBND
+   default LIBND
+   help
+ Memory ranges for PMEM are described by either an NFIT
+ (NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
+ non-standard OEM-specific E820 memory type (type-12, see
+ CONFIG_X86_PMEM_LEGACY), or it is manually specified by the
+ 'memmap=nn[KMG]!ss[KMG]' kernel command line (see
+ Documentation/kernel-parameters.txt).  This driver converts
+ these persistent memory ranges into block devices that are
+ capable of DAX (direct-access) file system mappings.  See
+ Documentation/blockdev/nd.txt for more details.
+
+ Say Y if you want to use a NVDIMM described by NFIT
+
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 0fb0891e1817..ebb212af9f15 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -14,10 +14,16 @@ endif
 
 obj-$(CONFIG_LIBND) += libnd.o
 obj-$(CONFIG_ND_ACPI) += nd_acpi.o
+obj-$(CONFIG_ND_E820) += nd_e820.o
 obj-$(CONFIG_NFIT_TEST) += test/
+obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 
 nd_acpi-y := acpi.o
 
+nd_e820-y := e820.o
+
+nd_pmem-y := pmem.o
+
 libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
diff --git a/drivers/block/nd/e820.c b/drivers/block/nd/e820.c
new file mode 100644
index ..f4db8c54248e
--- /dev/null
+++ b/drivers/block

[PATCH v2 20/20] libnd, nd_acpi, nd_blk: driver for BLK-mode access persistent memory

2015-04-28 Thread Dan Williams
From: Ross Zwisler ross.zwis...@linux.intel.com

The libnd implementation handles allocating dimm address space (DPA)
between PMEM and BLK mode interfaces.  After DPA has been allocated from
a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
as a struct bio based block device. Unlike PMEM, BLK is required to
handle platform specific details like mmio register formats and memory
controller interleave.  For this reason the libnd generic nd_blk driver
calls back into the bus provider to carry out the I/O.
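
The split between the generic driver and the bus provider described above
can be pictured as a function-pointer callback.  The sketch below is a
userspace illustration under assumed names (blk_region_sketch, do_io,
ram_provider are all hypothetical and not taken from the patch); plain
memory stands in for the mmio apertures.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical provider interface; names are illustrative only. */
struct blk_region_sketch {
	void *provider_data;
	int (*do_io)(struct blk_region_sketch *r, uint64_t dpa,
		     void *buf, size_t len, int write);
};

/* Generic side: knows nothing about apertures or interleave. */
static int generic_blk_rw(struct blk_region_sketch *r, uint64_t dpa,
			  void *buf, size_t len, int write)
{
	if (!r->do_io)
		return -1;
	return r->do_io(r, dpa, buf, len, write);
}

/* Example provider backed by plain memory instead of mmio windows. */
struct ram_provider {
	uint8_t media[4096];
};

static int ram_do_io(struct blk_region_sketch *r, uint64_t dpa,
		     void *buf, size_t len, int write)
{
	struct ram_provider *p = r->provider_data;

	/* Reject accesses that would run past the end of the media. */
	if (dpa > sizeof(p->media) || len > sizeof(p->media) - dpa)
		return -1;
	if (write)
		memcpy(p->media + dpa, buf, len);
	else
		memcpy(buf, p->media + dpa, len);
	return 0;
}
```

Keeping the platform-specific register and interleave handling behind one
callback is what lets the generic driver stay independent of the NFIT
details.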

This initial implementation handles the BLK interface defined by the
ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
DCR (dimm control region), BDW (block data window), IDT (interleave
descriptor) NFIT structures and the hardware register format.
[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Andy Lutomirski l...@amacapital.net
Cc: Boaz Harrosh b...@plexistor.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Jens Axboe ax...@fb.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Christoph Hellwig h...@lst.de
Signed-off-by: Ross Zwisler ross.zwis...@linux.intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Kconfig  |   12 +
 drivers/block/nd/Makefile |3 
 drivers/block/nd/acpi.c   |  422 +++--
 drivers/block/nd/acpi_nfit.h  |   47 
 drivers/block/nd/blk.c|  264 +++
 drivers/block/nd/libnd.h  |   11 +
 drivers/block/nd/namespace_devs.c |   47 
 drivers/block/nd/nd-private.h |3 
 drivers/block/nd/nd.h |   16 +
 drivers/block/nd/region.c |8 +
 drivers/block/nd/region_devs.c|   65 +-
 drivers/block/nd/test/nfit.c  |   29 +++
 drivers/block/nd/test/nfit_test.h |2 
 13 files changed, 891 insertions(+), 38 deletions(-)
 create mode 100644 drivers/block/nd/blk.c

diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 612bf2b14283..bac4290129fc 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -95,6 +95,18 @@ config BLK_DEV_PMEM
 
  Say Y if you want to use a NVDIMM described by ACPI, E820, etc...
 
+config ND_BLK
+   tristate "BLK: Block data window (aperture) device support"
+   depends on LIBND
+   default ND_ACPI
+   help
+ This driver performs I/O using a set of mmio windows on a
+ dimm.  The set of apertures will all access the one DIMM.
+ Multiple windows allow multiple threads to have different
+ portions of the dimm open at one time.
+
+ Say Y if you want to use a NVDIMM with BLK-mode capability
+
 config ND_BTT_DEVS
bool
 
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 7d778b4523d4..ef36927618e5 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_ND_E820) += nd_e820.o
 obj-$(CONFIG_NFIT_TEST) += test/
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
+obj-$(CONFIG_ND_BLK) += nd_blk.o
 
 nd_acpi-y := acpi.o
 
@@ -27,6 +28,8 @@ nd_pmem-y := pmem.o
 
 nd_btt-y := btt.o
 
+nd_blk-y := blk.o
+
 libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index 5b9997fbc344..e4ff3a9b4fc1 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -12,12 +12,14 @@
  */
 #include <linux/list_sort.h>
 #include <linux/module.h>
+#include <linux/mutex.h>
 #include <linux/ndctl.h>
 #include <linux/list.h>
 #include <linux/acpi.h>
 #include <linux/sort.h>
 #include "acpi_nfit.h"
 #include "libnd.h"
+#include "nd.h"
 
 static bool warn_checksum;
 module_param(warn_checksum, bool, S_IRUGO|S_IWUSR);
@@ -84,7 +86,7 @@ static int nd_acpi_ctl(struct nd_bus_descriptor *nd_desc,
 
if (!adev)
return -ENOTTY;
-   dimm_name = dev_name(&adev->dev);
+   dimm_name = nd_dimm_name(nd_dimm);
cmd_name = nd_dimm_cmd_name(cmd);
dsm_mask = nfit_mem->dsm_mask;
desc = nd_cmd_dimm_desc(cmd);
@@ -301,10 +303,21 @@ static void *add_table(struct acpi_nfit_desc *acpi_desc, 
void *table, const void
bdw->dcr_index, bdw->num_bdw);
break;
}
-   /* TODO */
-   case NFIT_TABLE_IDT:
-   dev_dbg(dev, "%s: idt\n", __func__);
+   case NFIT_TABLE_IDT: {
+   struct nfit_idt *nfit_idt = devm_kzalloc(dev, sizeof(*nfit_idt),
+   GFP_KERNEL);
+   struct acpi_nfit_idt *idt = table;
+
+   if (!nfit_idt)
+   return err;
+   INIT_LIST_HEAD(&nfit_idt->list);
+   nfit_idt->idt = idt;
+   list_add_tail(&nfit_idt->list, &acpi_desc->idts);
+   dev_dbg(dev, "%s: idt index: %d num_lines: %d\n", __func__

Re: [Linux-nvdimm] [PATCH 12/21] nd_pmem: add NFIT support to the pmem driver

2015-04-28 Thread Dan Williams
On Tue, Apr 28, 2015 at 5:56 AM, Christoph Hellwig h...@infradead.org wrote:
> On Sat, Apr 18, 2015 at 12:37:09PM -0700, Dan Williams wrote:
>> At this point in the patch series I agree, but in later patches we
>> take advantage of nd bus services.  "[PATCH 15/21] nd: pmem label sets
>> and namespace instantiation" adds support for labeled pmem namespaces,
>> and in "[PATCH 19/21] nd: infrastructure for btt devices" we make pmem
>> capable of hosting btt instances.
>
> Thats fine, but still doesn't require moving it around.

I ended up not moving it in v2.  Let me know if the updated rationale
makes sense.


Re: [Linux-nvdimm] [PATCH 01/21] e820, efi: add ACPI 6.0 persistent memory types

2015-04-28 Thread Dan Williams
On Tue, Apr 28, 2015 at 5:46 AM, Christoph Hellwig h...@infradead.org wrote:
 On Fri, Apr 17, 2015 at 09:35:19PM -0400, Dan Williams wrote:
 diff --git a/arch/ia64/kernel/efi.c b/arch/ia64/kernel/efi.c
 index c52d7540dc05..cd8b7485e396 100644
 --- a/arch/ia64/kernel/efi.c
 +++ b/arch/ia64/kernel/efi.c
 @@ -1227,6 +1227,7 @@ efi_initialize_iomem_resources(struct resource *code_resource,
   case EFI_RUNTIME_SERVICES_CODE:
   case EFI_RUNTIME_SERVICES_DATA:
   case EFI_ACPI_RECLAIM_MEMORY:
 + case EFI_PERSISTENT_MEMORY:
   default:
   name = "reserved";

 You probably want "pmem" as name here..

 diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
 index 11cc7d54ec3f..410af501a941 100644
 --- a/arch/x86/kernel/e820.c
 +++ b/arch/x86/kernel/e820.c
 @@ -137,6 +137,8 @@ static void __init e820_print_type(u32 type)
   case E820_RESERVED_KERN:
   printk(KERN_CONT "usable");
   break;
 + case E820_PMEM:
 + case E820_PRAM:
   case E820_RESERVED:
   printk(KERN_CONT "reserved");
   break;
 @@ -149,9 +151,6 @@ static void __init e820_print_type(u32 type)
   case E820_UNUSABLE:
   printk(KERN_CONT "unusable");
   break;
 - case E820_PRAM:
 - printk(KERN_CONT "persistent (type %u)", type);
 - break;

 Please keep this printk, and add the new E820_PMEM case to it as well.

 +static bool do_mark_busy(u32 type, struct resource *res)
 +{
 + if (res->start < (1ULL<<20))
 + return true;
 +
 + switch (type) {
 + case E820_RESERVED:
 + case E820_PRAM:
 + case E820_PMEM:
 + return false;
 + default:
 + return true;
 + }
 +}

 Please add a comment explaining the choices once you start refactoring
 this.  Especially the address check is black magic..

Ok, I was able to incorporate all these into v2.
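
For readers following along, the decision being reviewed above can be sketched as a standalone user-space function. This is only an illustration, not the code that went into v2: the sk_ names are hypothetical, the numeric type values are assumptions that should be checked against the kernel headers, and the comment on the address check records the presumed rationale (the first 1MB being the legacy BIOS area), which the thread itself does not spell out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative e820 type values; assumed here, verify against the headers. */
enum sk_e820_type {
	SK_E820_RAM      = 1,
	SK_E820_RESERVED = 2,
	SK_E820_PMEM     = 7,
	SK_E820_PRAM     = 12,
};

/* First 1MB of the physical address space. */
#define SK_LEGACY_LIMIT (1ULL << 20)

/*
 * Return true if a firmware-described range should be marked busy up
 * front, false if it should stay claimable by a driver later on.
 */
static bool sk_do_mark_busy(uint32_t type, uint64_t start)
{
	/*
	 * Anything starting below 1MB is marked busy regardless of type
	 * (presumably because that is the legacy BIOS/ROM area).
	 */
	if (start < SK_LEGACY_LIMIT)
		return true;

	switch (type) {
	case SK_E820_RESERVED:
	case SK_E820_PRAM:
	case SK_E820_PMEM:
		/* Leave these ranges for a driver such as pmem to request. */
		return false;
	default:
		return true;
	}
}
```

With these assumptions a PMEM range at 4GB is left claimable, while the same type starting below 1MB is still marked busy.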


Re: [Linux-nvdimm] [PATCH 03/21] nd_acpi: initial core implementation and nfit skeleton

2015-04-28 Thread Dan Williams
On Tue, Apr 28, 2015 at 5:53 AM, Christoph Hellwig h...@infradead.org wrote:
 On Fri, Apr 17, 2015 at 09:35:30PM -0400, Dan Williams wrote:
 new file mode 100644
 index ..5fa74f124b3e
 --- /dev/null
 +++ b/drivers/block/nd/Kconfig
 @@ -0,0 +1,44 @@
 +config ND_ARCH_HAS_IOREMAP_CACHE
 + depends on (X86 || IA64 || ARM || ARM64 || SH || XTENSA)
 + def_bool y

 As mentioned before please either define this symbol in each
 arch Kconfig, or just ensure every architecture provides a stub.

 But more importantly it doesn't seem like you're actually using
 ioremap_cache anywhere.  Allowing a cached ioremap would be a very
 worthwhile addition to the pmem drivers once we have the proper
 memcpy functions making it safe, and is one of the high priority
 todo items for the pmem driver.

 +
 +menuconfig NFIT_DEVICES
 + bool "NVDIMM (NFIT) Support"

 Please just call all the symbols and file names nvdimm instead of nfit
 or nd to make everyone's life simpler for the generic code.  Just use the
 EFI/ACPI terminology in those parts that actually parse those tables.

Done in v2.


Re: [Linux-nvdimm] [PATCH 05/21] nfit-test: manufactured NFITs for interface development

2015-04-28 Thread Dan Williams
On Tue, Apr 28, 2015 at 5:54 AM, Christoph Hellwig h...@infradead.org wrote:
 Eww, the --wrap stuff is too ugly to live.

Since when are unit tests pretty?

 Just put the
 implementation of persistent nvdimms into qemu where it belongs.

Ugh, no, I'm not keen to introduce yet another roadblock to running
the tests and another degree of freedom for things to bit rot.  It
will never be pretty, but the implementation at least gets slightly
cleaner in v2 with the removal of the wrapping for nd_blk_do_io().
It's also worth noting that 0day is currently running our unit tests.

 Note that having a not actually persistent implementation that registers
 with the subsystems which doesn't need these hacks still sounds ok to
 me, although I suspect most users would much prefer the virtualization
 based variant.

KVM NFIT enabling is happening, but I don't think it is useful as a
unit test vehicle.


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar mi...@kernel.org wrote:

 * Dan Williams dan.j.willi...@intel.com wrote:

  Anyway, I did want to say that while I may not be convinced about
  the approach, I think the patches themselves don't look horrible.
  I actually like your __pfn_t. So while I (very obviously) have
  some doubts about this approach, it may be that the most
  convincing argument is just in the code.

 Ok, I'll keep thinking about this and come back when we have a
 better story about passing mmap'd persistent memory around in
 userspace.

 So is there anything fundamentally wrong about creating struct page
 backing at mmap() time (and making sure aliased mmaps share struct
 page arrays)?

Something like get_user_pages() triggers memory hotplug for
persistent memory, so they are actual real struct pages?  Can we do
memory hotplug at that granularity?


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 11:40 AM, Ingo Molnar mi...@kernel.org wrote:

 * Dan Williams dan.j.willi...@intel.com wrote:

 On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig h...@lst.de wrote:
  On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
  What is the primary thing that is driving this need? Do we have a very
  concrete example?
 
  FYI, I plan to implement RAID acceleration using nvdimms, and I
  plan to use pages for that.  The code just merged for 4.1 can easily
  support page backing, and I plan to use that for now.  This still
  leaves support for the gigantic intel nvdimms discovered over EFI
  out, but given that I don't have access to them, and I don't know
  of any publicly available, there's little I can do for now.  But
  adding on demand allocated struct pages for them seems like the
  easiest way forward.  Boaz already has code to allocate pages for
  them, although not on demand but at boot / plug in time.

 Hmmm, the capacities of persistent memory that would be assigned for
 a raid accelerator would be limited by diminishing returns.  I.e.
 there seems to be no point to assign more than 8GB or so to the
 cache? [...]

 Why would that be the case?

 If it's not a temporary cache but a persistent cache that hosts all
 the data even after writeback completes then going to huge sizes will
 bring similar benefits to using a large, fast SSD disk on your
 desktop... The larger, the better. And it also persists across
 reboots.

True, that's more dm-cache than RAID accelerator, but point taken.


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 10:43 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Thu, May 7, 2015 at 9:03 AM, Dan Williams dan.j.willi...@intel.com wrote:

 Ok, I'll keep thinking about this and come back when we have a better
 story about passing mmap'd persistent memory around in userspace.

 Ok. And if we do decide to go with your kind of __pfn type, I'd
 probably prefer that we encode the type in the low bits of the word
 rather than compare against PAGE_OFFSET. On some architectures
 PAGE_OFFSET is zero (admittedly probably not ones you'd care about),
 but even on x86 it's a *lot* cheaper to test the low bit than it is to
 compare against a big constant.

 We know "struct page *" is supposed to be aligned to at least
 "unsigned long", so you'd have two bits of type information (and we
 could easily make it three). With "0" being a real pointer, so that
 you can use the pointer itself without masking.

 And the "hide type in low bits of pointer" is something we've done
 quite a lot, so it's more "kernel coding style" anyway.

Ok.  Although __pfn_t also stores pfn values directly which will
consume those 2 bits so we'll need to shift pfns up when storing.
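
The encoding discussed here (type in the low bits, 0 meaning a plain pointer, raw pfns shifted up to make room) can be sketched in user space as follows. This is a rough illustration under assumptions, not the __pfn_t from the series: every sk_ name is hypothetical, struct sk_page merely stands in for struct page, and only two of the four possible type values are given a meaning.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PFN_TYPE_BITS 2
#define SK_PFN_TYPE_MASK ((uintptr_t)((1u << SK_PFN_TYPE_BITS) - 1))
#define SK_PFN_TYPE_PAGE 0u	/* value is an unmodified pointer */
#define SK_PFN_TYPE_RAW  1u	/* value is (pfn << 2) | 1 */

/* Stand-in for struct page; its alignment keeps the low two bits clear. */
struct sk_page { int refcount; };

/* Wrapping the word in a struct keeps it from mixing with plain integers. */
typedef struct { uintptr_t data; } sk_pfn_t;

/* Type 0: store the pointer as-is so it can be used without masking. */
static sk_pfn_t sk_page_to_pfn_t(struct sk_page *page)
{
	sk_pfn_t p = { .data = (uintptr_t)page };
	return p;
}

/* A raw pfn would collide with the type bits, so shift it up first. */
static sk_pfn_t sk_raw_to_pfn_t(uintptr_t pfn)
{
	sk_pfn_t p = { .data = (pfn << SK_PFN_TYPE_BITS) | SK_PFN_TYPE_RAW };
	return p;
}

/* Testing the low bits is cheaper than comparing against a big constant. */
static bool sk_pfn_t_has_page(sk_pfn_t p)
{
	return (p.data & SK_PFN_TYPE_MASK) == SK_PFN_TYPE_PAGE;
}

static struct sk_page *sk_pfn_t_to_page(sk_pfn_t p)
{
	return sk_pfn_t_has_page(p) ? (struct sk_page *)p.data : NULL;
}

/* Only meaningful for values created by sk_raw_to_pfn_t(). */
static uintptr_t sk_pfn_t_to_raw(sk_pfn_t p)
{
	return p.data >> SK_PFN_TYPE_BITS;
}
```

The cost of the scheme is visible in sk_raw_to_pfn_t(): the shift gives up the top two bits of pfn range in exchange for the tag.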


Re: [PATCH v2 01/10] arch: introduce __pfn_t for persistent memory i/o

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 7:55 AM, Stephen Rothwell s...@canb.auug.org.au wrote:
 Hi Dan,

 On Wed, 06 May 2015 16:04:59 -0400 Dan Williams dan.j.willi...@intel.com 
 wrote:

 diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
 new file mode 100644
 index ..91171e0285d9
 --- /dev/null
 +++ b/include/asm-generic/pfn.h
 @@ -0,0 +1,51 @@
 +#ifndef __ASM_PFN_H
 +#define __ASM_PFN_H
 +
 +#ifndef __pfn_to_phys
 +#define __pfn_to_phys(pfn)  ((dma_addr_t)(pfn) << PAGE_SHIFT)

 Why dma_addr_t and not phys_addr_t?  i.e. it could use a comment if it
 is correct.

Hmm, this was derived from:

#define page_to_phys(page)    ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT)

in arch/x86/include/asm/io.h

The primary user of __pfn_to_phys() is dma_map_page().  I'll add a
comment to that effect.
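
Separately from which address type is the right one, the placement of the cast in that macro matters wherever the pfn type is narrower than the address type. The sketch below shows only that point, assuming a 32-bit pfn and a 64-bit address; the sk_ names and the fixed shift of 12 are illustrative and not taken from the patch.

```c
#include <stdint.h>

#define SK_PAGE_SHIFT 12	/* assumes 4K pages */

/* Widen first, then shift: no high bits are lost. */
static uint64_t sk_pfn_to_phys(uint32_t pfn)
{
	return (uint64_t)pfn << SK_PAGE_SHIFT;
}

/* Shift in 32 bits, then widen: the top bits are gone before the cast. */
static uint64_t sk_pfn_to_phys_truncating(uint32_t pfn)
{
	return (uint64_t)(uint32_t)(pfn << SK_PAGE_SHIFT);
}
```

For the pfn of the page at 4GB (0x100000) the first helper yields 0x100000000 while the second wraps to 0.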


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 7:42 AM, Ingo Molnar mi...@kernel.org wrote:

 * Ingo Molnar mi...@kernel.org wrote:

 [...]

 For anything more complex, that maps any of this storage to
 user-space, or exposes it to higher level struct page based APIs,
 etc., where references matter and it's more of a cache with
 potentially multiple users, not an IO space, the natural API is
 struct page.

 Let me walk back on this:

 I'd say that this particular series mostly addresses the 'pfn as
 sector_t' side of the equation, where persistent memory is IO space,
 not memory space, and as such it is the more natural and thus also
 the cheaper/faster approach.

 ... but that does not appear to be the case: this series replaces a
 'struct page' interface with a pure pfn interface for the express
 purpose of being able to DMA to/from 'memory areas' that are not
 struct page backed.

 Linus probably disagrees? :-)

 [ and he'd disagree rightfully ;-) ]

 So what this patch set tries to achieve is (sector_t -> sector_t) IO
 between storage devices (i.e. a rare and somewhat weird usecase), and
 does it by squeezing one device's storage address into our formerly
 struct page backed descriptor, via a pfn.

 That looks like a layering violation and a mistake to me. If we want
 to do direct (sector_t -> sector_t) IO, with no serialization worries,
 it should have its own (simple) API - which things like hierarchical
 RAID or RDMA APIs could use.

I'm wrapped around the idea that __pfn_t *is* that simple api for the
tiered storage driver use case.  For RDMA I think we need struct page
because I assume that would be coordinated through a filesystem and
truncate() is back in play.

What does an alternative API look like?

 If what we want to do is to support say an mmap() of a file on
 persistent storage, and then read() into that file from another device
 via DMA, then I think we should have allocated struct page backing at
 mmap() time already, and all regular syscall APIs would 'just work'
 from that point on - far above what page-less, pfn-based APIs can do.

 The temporary struct page backing can then be freed at munmap() time.

Yes, passing around mmap()'d (DAX) persistent memory will need more
than a __pfn_t.


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 8:00 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Wed, May 6, 2015 at 7:36 PM, Dan Williams dan.j.willi...@intel.com wrote:

 My pet concrete example is covered by __pfn_t.  Referencing persistent
 memory in an md/dm hierarchical storage configuration.  Setting aside
 the thrash to get existing block users to do "bvec_set_page(page)"
 instead of "bvec->page = page" the onus is on that md/dm
 implementation and backing storage device driver to operate on
 __pfn_t.  That use case is simple because there is no use of page
 locking or refcounting in that path, just dma_map_page() and
 kmap_atomic().

 So clarify for me: are you trying to make the IO stack in general be
 able to use the persistent memory as a source (or destination) for IO
 to _other_ devices, or are you talking about just internally shuffling
 things around for something like RAID on top of persistent memory?

 Because I think those are two very different things.

Yes, they are, and I am referring to the former, persistent memory as
a source/destination to other devices.

 For example, one of the things I worry about is for people doing IO
 from persistent memory directly to some slow stable storage (aka
 disk). That was what I thought you were aiming for: infrastructure so
 that you can make a bio for a *disk* device contain a page list that
 is the persistent memory.

 And I think that is a very dangerous operation to do, because the
 persistent memory itself is going to have some filesystem on it, so
 anything that looks up the persistent memory pages is *not* going to
 have a stable pfn: the pfn will point to a fixed part of the
 persistent memory, but the file that was there may be deleted and the
 memory reassigned to something else.

Indeed, truncate() in the absence of struct page has been a major
hurdle for persistent memory enabling.  But it does not impact this
specific md/dm use case.  md/dm will have taken an exclusive claim on
an entire pmem block device (or partition), so there will be no
competing with a filesystem.

 That's the kind of thing that struct page helps with for normal IO
 devices. It's both a source of serialization and indirection, so that
 when somebody does a truncate() on a file, we don't end up doing IO
 to random stale locations on the disk that got reassigned to another
 file.

 So struct page is very fundamental. It's *not* just a "this is the
 physical source/drain of the data you are doing IO on".

 So if you are looking at some kind of zero-copy IO, where you can do
 IO from a filesystem on persistent storage to *another* filesystem on
 (say, a big rotational disk used for long-term storage) by just doing
 a bio that targets the disk, but has the persistent memory as the
 source memory, I really want to understand how you are going to
 serialize this.

 So *that* is what I meant by "What is the primary thing that is
 driving this need? Do we have a very concrete example?"

 I absolutely do *not* want to teach the bio subsystem to just
 randomly be able to take the source/destination of the IO as being
 some random pfn without knowing what the actual uses are and how these
 IO's are generated in the first place.

blkdev_get(FMODE_EXCL) is the protection in this case.

 I was assuming that you wanted to do something where you mmap() the
 persistent memory, and then write it out to another device (possibly
 using aio_write()). But that really does require some kind of
 serialization at a higher level, because you can't just look up the
 pfn's in the page table and assume they are stable: they are *not*
 stable.

We want to get there eventually, but this patchset does not address that case.


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 8:58 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Thu, May 7, 2015 at 8:40 AM, Dan Williams dan.j.willi...@intel.com wrote:

 blkdev_get(FMODE_EXCL) is the protection in this case.

 Ugh. That looks like a horrible nasty big hammer that will bite us
 badly some day. Since you'd have to hold it for the whole IO. But I
 guess it at least works.

Oh no, that wouldn't be per-I/O that would be permanent at
configuration set up time just like a raid member device.

Something like:
mdadm --create /dev/md0 --cache=/dev/pmem0p1 --storage=/dev/sda

 Anyway, I did want to say that while I may not be convinced about the
 approach, I think the patches themselves don't look horrible. I
 actually like your __pfn_t. So while I (very obviously) have some
 doubts about this approach, it may be that the most convincing
 argument is just in the code.

Ok, I'll keep thinking about this and come back when we have a better
story about passing mmap'd persistent memory around in userspace.


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-06 Thread Dan Williams
On Wed, May 6, 2015 at 3:10 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Wed, May 6, 2015 at 1:04 PM, Dan Williams dan.j.willi...@intel.com wrote:

 The motivation for this change is persistent memory and the desire to
 use it not only via the pmem driver, but also as a memory target for I/O
 (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.

 I detest this approach.


Hmm, yes, I can't argue against "put the onus on odd behavior where it
belongs"

 I'd much rather go exactly the other way around, and do the dynamic
 struct page instead.

 Add a flag to struct page

Ok, given I had already precluded 32-bit systems in this __pfn_t
approach we should have flag space for this on 64-bit.

 to mark it as a fake entry and teach
 page_to_pfn() to look up the actual pfn some way (that union that
 contains "index" looks like a good target to also contain 'pfn', for
 example).

 Especially if this is mainly for persistent storage, we'll never have
 issues with worrying about writing it back under memory pressure, so
 allocating a struct page for these things shouldn't be a problem.
 There's likely only a few paths that actually generate IO for those
 things.

 In other words, I'd really like our basic infrastructure to be for the
 *normal* case, and the struct page is about so much more than just
 what's the target for IO. For normal IO, struct page is also what
 serializes the IO so that you have a consistent view of the end
 result, and there's obviously the reference count there too. So I
 really *really* think that struct page is the better entity for
 describing the actual IO, because it's the common and the generic
 thing, while a pfn is not actually *enough* for IO in general, and
 you now end up having to look up the struct page for the locking and
 refcounting etc.

 If you go the other way, and instead generate a struct page from the
 pfn for the few cases that need it, you put the onus on odd behavior
 where it belongs.

 Yes, it might not be any simpler in the end, but I think it would be
 conceptually much better.

Conceptually better, but certainly more difficult to audit if the fake
struct page is initialized in a subtle way that breaks when/if it
leaks to some unwitting context.  The one benefit I may need to
concede is a mechanism to opt-in to handle these fake pages to the few
paths that know what they are doing.  That was easy with __pfn_t, but
a struct page can go silently almost anywhere.  Certainly nothing is
prepared for a given struct page pointer to change the pfn it points
to on the fly, which I think is what we would end up doing for
something like a raid cache.  Keep a pool of struct pages around and
point them at persistent memory pfns while I/O is in flight.


[PATCH v2 06/10] scatterlist: support page-less (__pfn_t only) entries

2015-05-06 Thread Dan Williams
From: Matthew Wilcox wi...@linux.intel.com

Given that an offset will never be more than PAGE_SIZE, steal the unused
bits of the offset to implement a flags field.  Move the existing "this
is a sg_chain() entry" flag to the new flags field, and add a new flag
(SG_FLAGS_PAGE) to indicate that there is a struct page backing for the
entry.

Signed-off-by: Dan Williams dan.j.willi...@intel.com
Signed-off-by: Matthew Wilcox wi...@linux.intel.com
---
 block/blk-merge.c |2 -
 drivers/dma/ste_dma40.c   |5 --
 drivers/mmc/card/queue.c  |4 +-
 include/asm-generic/scatterlist.h |9 
 include/crypto/scatterwalk.h  |   10 
 include/linux/scatterlist.h   |   91 +
 6 files changed, 105 insertions(+), 16 deletions(-)
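
As a reading aid for the layout change described in the changelog, here is a minimal user-space sketch of an entry whose flags live in the half of the old offset word that a sub-page offset never needs, so the low bits of the link word are left alone. It is an assumption-laden illustration, not the patch's code: only SG_FLAGS_PAGE is named in the description above, so the sk_ identifiers, the other two flag names and all bit values are made up.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_SG_FLAGS_CHAIN 0x0001	/* entry links to another list */
#define SK_SG_FLAGS_LAST  0x0002	/* entry terminates the list */
#define SK_SG_FLAGS_PAGE  0x0004	/* entry has struct page backing */

struct sk_sg_entry {
	union {
		uintptr_t pfn;			/* data entry */
		struct sk_sg_entry *next;	/* chain entry */
	};
	unsigned short offset;	/* a sub-page offset fits in 16 bits */
	unsigned short flags;	/* the bits the offset never used */
	unsigned int length;
};

static void sk_sg_mark_end(struct sk_sg_entry *sg)
{
	sg->flags |= SK_SG_FLAGS_LAST;
	sg->flags &= (unsigned short)~SK_SG_FLAGS_CHAIN;
}

static void sk_sg_unmark_end(struct sk_sg_entry *sg)
{
	sg->flags &= (unsigned short)~SK_SG_FLAGS_LAST;
}

static bool sk_sg_is_last(const struct sk_sg_entry *sg)
{
	return sg->flags & SK_SG_FLAGS_LAST;
}

static bool sk_sg_is_chain(const struct sk_sg_entry *sg)
{
	return sg->flags & SK_SG_FLAGS_CHAIN;
}

/* Turn 'prev' into a link to 'next'; the pointer is stored unmodified. */
static void sk_sg_chain(struct sk_sg_entry *prev, struct sk_sg_entry *next)
{
	prev->next = next;
	prev->offset = 0;
	prev->length = 0;
	prev->flags |= SK_SG_FLAGS_CHAIN;
	prev->flags &= (unsigned short)~SK_SG_FLAGS_LAST;
}
```

Compared with the open-coded "page_link &= ~0x02" that the hunks below replace, callers here never touch the link word to change a flag.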

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 218ad1e57a49..82a688551b72 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -267,7 +267,7 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 		if (rq->cmd_flags & REQ_WRITE)
 			memset(q->dma_drain_buffer, 0, q->dma_drain_size);
 
-		sg->page_link &= ~0x02;
+		sg_unmark_end(sg);
 		sg = sg_next(sg);
 		sg_set_page(sg, virt_to_page(q->dma_drain_buffer),
 			    q->dma_drain_size,
diff --git a/drivers/dma/ste_dma40.c b/drivers/dma/ste_dma40.c
index 3c10f034d4b9..e8c00642cacb 100644
--- a/drivers/dma/ste_dma40.c
+++ b/drivers/dma/ste_dma40.c
@@ -2562,10 +2562,7 @@ dma40_prep_dma_cyclic(struct dma_chan *chan, dma_addr_t dma_addr,
 		dma_addr += period_len;
 	}
 
-	sg[periods].offset = 0;
-	sg_dma_len(&sg[periods]) = 0;
-	sg[periods].page_link =
-		((unsigned long)sg | 0x01) & ~0x02;
+	sg_chain(sg, periods + 1, sg);
 
 	txd = d40_prep_sg(chan, sg, sg, periods, direction,
 			  DMA_PREP_INTERRUPT);
diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index 236d194c2883..127f76294e71 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -469,7 +469,7 @@ static unsigned int mmc_queue_packed_map_sg(struct mmc_queue *mq,
 			sg_set_buf(__sg, buf + offset, len);
 			offset += len;
 			remain -= len;
-			(__sg++)->page_link &= ~0x02;
+			sg_unmark_end(__sg++);
 			sg_len++;
 		} while (remain);
 	}
@@ -477,7 +477,7 @@ static unsigned int mmc_queue_packed_map_sg(struct mmc_queue *mq,
 	list_for_each_entry(req, &packed->list, queuelist) {
 		sg_len += blk_rq_map_sg(mq->queue, req, __sg);
 		__sg = sg + (sg_len - 1);
-		(__sg++)->page_link &= ~0x02;
+		sg_unmark_end(__sg++);
 	}
 	sg_mark_end(sg + (sg_len - 1));
 	return sg_len;
diff --git a/include/asm-generic/scatterlist.h b/include/asm-generic/scatterlist.h
index 5de07355fad4..959f51572a8e 100644
--- a/include/asm-generic/scatterlist.h
+++ b/include/asm-generic/scatterlist.h
@@ -7,8 +7,17 @@ struct scatterlist {
 #ifdef CONFIG_DEBUG_SG
 	unsigned long	sg_magic;
 #endif
+#ifdef CONFIG_HAVE_DMA_PFN
+	union {
+		__pfn_t		pfn;
+		struct scatterlist	*next;
+	};
+	unsigned short	offset;
+	unsigned short	sg_flags;
+#else
 	unsigned long	page_link;
 	unsigned int	offset;
+#endif
 	unsigned int	length;
 	dma_addr_t	dma_address;
 #ifdef CONFIG_NEED_SG_DMA_LENGTH
diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h
index 20e4226a2e14..7296d89a50b2 100644
--- a/include/crypto/scatterwalk.h
+++ b/include/crypto/scatterwalk.h
@@ -25,6 +25,15 @@
 #include <linux/scatterlist.h>
 #include <linux/sched.h>
 
+#ifdef CONFIG_HAVE_DMA_PFN
+/*
+ * If we're using PFNs, the architecture must also have been converted to
+ * support SG_CHAIN.  So we can use the generic code instead of custom
+ * code.
+ */
+#define scatterwalk_sg_chain(prv, num, sgl)	sg_chain(prv, num, sgl)
+#define scatterwalk_sg_next(sgl)		sg_next(sgl)
+#else
 static inline void scatterwalk_sg_chain(struct scatterlist *sg1, int num,
 					struct scatterlist *sg2)
 {
@@ -32,6 +41,7 @@ static inline void scatterwalk_sg_chain(struct scatterlist *sg1, int num,
 	sg1[num - 1].page_link &= ~0x02;
 	sg1[num - 1].page_link |= 0x01;
 }
+#endif
 
 static inline void scatterwalk_crypto_chain(struct scatterlist *head,
 					    struct scatterlist *sg,
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ed8f9e70df9b..9d423e559bdb 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -5,6 +5,7 @@
 #include <linux/bug.h>
 #include <linux/mm.h>
 
+#include <asm/page.h>
 #include <asm/types.h>
 #include <asm

[PATCH v2 08/10] x86: support kmap_atomic_pfn_t() for persistent memory

2015-05-06 Thread Dan Williams
It would be unfortunate if the kmap infrastructure escaped its current
32-bit/HIGHMEM bonds and leaked into 64-bit code.  Instead, if the user
has enabled CONFIG_PMEM_IO we direct the kmap_atomic_pfn_t()
implementation to scan a list of pre-mapped persistent memory address
ranges inserted by the pmem driver.

The __pfn_t to resource lookup is indeed inefficient walking of a linked list,
but there are two mitigating factors:

1/ The number of persistent memory ranges is bounded by the number of
   DIMMs which is on the order of 10s of DIMMs, not hundreds.

2/ The lookup yields the entire range, if it becomes inefficient to do a
   kmap_atomic_pfn_t() a PAGE_SIZE at a time the caller can take
   advantage of the fact that the lookup can be amortized for all kmap
   operations it needs to perform in a given range.

Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 arch/Kconfig |3 +
 arch/x86/Kconfig |2 +
 arch/x86/kernel/Makefile |1 
 arch/x86/kernel/kmap.c   |   95 ++
 drivers/block/pmem.c |6 +++
 include/linux/highmem.h  |   23 +++
 6 files changed, 130 insertions(+)
 create mode 100644 arch/x86/kernel/kmap.c
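
To illustrate mitigating factor 2/ from the changelog, the following user-space sketch separates the list walk from the address translation, so that a caller touching many pages in one range pays for the walk once. It is a simplified stand-in under assumptions: the sk_ names are hypothetical, the list is a plain singly linked list, and the RCU protection used by the patch is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

struct sk_range {
	struct sk_range *next;
	uint64_t start;	/* first physical address, inclusive */
	uint64_t end;	/* last physical address, inclusive */
	char *base;	/* where 'start' is already mapped */
};

/* Linear walk; fine while the number of ranges stays in the tens. */
static const struct sk_range *sk_find_range(const struct sk_range *head,
					    uint64_t addr)
{
	for (; head; head = head->next)
		if (addr >= head->start && addr <= head->end)
			return head;
	return NULL;
}

/* Pure arithmetic once the range is known; no walk needed. */
static void *sk_range_to_virt(const struct sk_range *r, uint64_t addr)
{
	return r->base + (addr - r->start);
}
```

A caller would do one sk_find_range() and then call sk_range_to_virt() per page for as long as it stays inside [start, end].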

diff --git a/arch/Kconfig b/arch/Kconfig
index f7f800860c00..69d3a3fa21af 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -206,6 +206,9 @@ config HAVE_DMA_CONTIGUOUS
 config HAVE_DMA_PFN
 	bool
 
+config HAVE_KMAP_PFN
+	bool
+
 config GENERIC_SMP_IDLE_THREAD
 	bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1fae5e842423..eddaea839500 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1434,7 +1434,9 @@ config X86_PMEM_LEGACY
 	  Say Y if unsure.
 
 config X86_PMEM_DMA
+	depends on !HIGHMEM
 	def_bool PMEM_IO
+	select HAVE_KMAP_PFN
 	select HAVE_DMA_PFN
 
 config HIGHPTE
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 9bcd0b56ca17..44c323342996 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -96,6 +96,7 @@ obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)   += pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY)  += pmem.o
+obj-$(CONFIG_X86_PMEM_DMA) += kmap.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)  += pcspeaker.o
 
diff --git a/arch/x86/kernel/kmap.c b/arch/x86/kernel/kmap.c
new file mode 100644
index ..d597c475377b
--- /dev/null
+++ b/arch/x86/kernel/kmap.c
@@ -0,0 +1,95 @@
+/*
+ * Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/rcupdate.h>
+#include <linux/rculist.h>
+#include <linux/highmem.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+
+static LIST_HEAD(ranges);
+
+struct kmap {
+	struct list_head list;
+	struct resource *res;
+	struct device *dev;
+	void *base;
+};
+
+static void teardown_kmap(void *data)
+{
+	struct kmap *kmap = data;
+
+	dev_dbg(kmap->dev, "kmap unregister %pr\n", kmap->res);
+	list_del_rcu(&kmap->list);
+	synchronize_rcu();
+	kfree(kmap);
+}
+
+int devm_register_kmap_pfn_range(struct device *dev, struct resource *res,
+		void *base)
+{
+	struct kmap *kmap = kzalloc(sizeof(*kmap), GFP_KERNEL);
+	int rc;
+
+	if (!kmap)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&kmap->list);
+	kmap->res = res;
+	kmap->base = base;
+	kmap->dev = dev;
+	rc = devm_add_action(dev, teardown_kmap, kmap);
+	if (rc) {
+		kfree(kmap);
+		return rc;
+	}
+	dev_dbg(kmap->dev, "kmap register %pr\n", kmap->res);
+	list_add_rcu(&kmap->list, &ranges);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(devm_register_kmap_pfn_range);
+
+void *kmap_atomic_pfn_t(__pfn_t pfn)
+{
+	struct page *page = __pfn_t_to_page(pfn);
+	resource_size_t addr;
+	struct kmap *kmap;
+
+	if (page)
+		return kmap_atomic(page);
+	addr = __pfn_t_to_phys(pfn);
+	rcu_read_lock();
+	list_for_each_entry_rcu(kmap, &ranges, list)
+		if (addr >= kmap->res->start && addr <= kmap->res->end)
+			return kmap->base + addr - kmap->res->start;
+
+	/* only unlock in the error case */
+	rcu_read_unlock();
+	return NULL;
+}
+EXPORT_SYMBOL(kmap_atomic_pfn_t);
+
+void kunmap_atomic_pfn_t(void *addr)
+{
+	rcu_read_unlock();
+
+	/*
+	 * If the original __pfn_t had an entry in the memmap

[PATCH v2 09/10] dax: convert to __pfn_t

2015-05-06 Thread Dan Williams
The primary source for non-page-backed page-frames to enter the system
is via the pmem driver's ->direct_access() method.  The pfns returned by
the top-level bdev_direct_access() may be passed to any other subsystem
in the kernel and those sub-systems either need to assume that the pfn
is page backed (CONFIG_PMEM_IO=n) or be prepared to handle non-page
backed case (CONFIG_PMEM_IO=y).  Currently the pfns returned by
->direct_access() are only ever used by vm_insert_mixed() which does not
care if the pfn is mapped.  As we go to add more usages of these pfns
add the type-safety of __pfn_t.

Cc: Matthew Wilcox wi...@linux.intel.com
Cc: Ross Zwisler ross.zwis...@linux.intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Jens Axboe ax...@kernel.dk
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Heiko Carstens heiko.carst...@de.ibm.com
Cc: Boaz Harrosh b...@plexistor.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 arch/powerpc/sysdev/axonram.c |4 ++--
 drivers/block/brd.c   |4 ++--
 drivers/block/pmem.c  |8 +---
 drivers/s390/block/dcssblk.c  |6 +++---
 fs/block_dev.c|2 +-
 fs/dax.c  |9 +
 include/asm-generic/pfn.h |7 +++
 include/linux/blkdev.h|4 ++--
 8 files changed, 27 insertions(+), 17 deletions(-)
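
The "type-safety" mentioned in the changelog comes from making the pfn a distinct type, so that a bare integer cannot be handed to an interface expecting one without going through a conversion helper. A minimal user-space sketch of that idea, with hypothetical sk_ names and an assumed 4K page size:

```c
#include <stdint.h>

#define SK_DAX_PAGE_SHIFT 12	/* assumes 4K pages */

/* A one-member struct is a distinct type; integers do not convert to it. */
typedef struct { uint64_t val; } sk_typed_pfn;

/* The only sanctioned way in: convert a physical address explicitly. */
static sk_typed_pfn sk_phys_to_typed_pfn(uint64_t phys)
{
	sk_typed_pfn pfn = { .val = phys >> SK_DAX_PAGE_SHIFT };
	return pfn;
}

/* And the way out: back to the page-aligned physical address. */
static uint64_t sk_typed_pfn_to_phys(sk_typed_pfn pfn)
{
	return pfn.val << SK_DAX_PAGE_SHIFT;
}
```

Passing a plain integer where sk_typed_pfn is expected fails to compile, which is the property the conversions in the hunks below rely on when they switch to phys_to_pfn_t() and page_to_pfn_t().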

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 9bb5da7f2c0c..069cb5285f18 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -141,13 +141,13 @@ axon_ram_make_request(struct request_queue *queue, struct 
bio *bio)
  */
 static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-  void **kaddr, unsigned long *pfn, long size)
+  void **kaddr, __pfn_t *pfn, long size)
 {
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
 
*kaddr = (void *)(bank->ph_addr + offset);
-   *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
+   *pfn = phys_to_pfn_t(virt_to_phys(*kaddr));
 
return bank->size - offset;
 }
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 115c6cf9cb43..57f4cd787ea2 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -371,7 +371,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t 
sector,
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
-   void **kaddr, unsigned long *pfn, long size)
+   void **kaddr, __pfn_t *pfn, long size)
 {
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
@@ -382,7 +382,7 @@ static long brd_direct_access(struct block_device *bdev, 
sector_t sector,
if (!page)
return -ENOSPC;
*kaddr = page_address(page);
-   *pfn = page_to_pfn(page);
+   *pfn = page_to_pfn_t(page);
 
/*
 * TODO: If size > PAGE_SIZE, we could look to see if the next page in
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 2a847651f8de..18edb48e405e 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -98,8 +98,8 @@ static int pmem_rw_page(struct block_device *bdev, sector_t 
sector,
return 0;
 }
 
-static long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn, long size)
+static long __maybe_unused pmem_direct_access(struct block_device *bdev,
+   sector_t sector, void **kaddr, __pfn_t *pfn, long size)
 {
struct pmem_device *pmem = bdev->bd_disk->private_data;
size_t offset = sector << 9;
@@ -108,7 +108,7 @@ static long pmem_direct_access(struct block_device *bdev, 
sector_t sector,
return -ENODEV;
 
*kaddr = pmem->virt_addr + offset;
-   *pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
+   *pfn = phys_to_pfn_t(pmem->phys_addr + offset);
 
return pmem->size - offset;
 }
@@ -116,7 +116,9 @@ static long pmem_direct_access(struct block_device *bdev, 
sector_t sector,
 static const struct block_device_operations pmem_fops = {
.owner =THIS_MODULE,
.rw_page =  pmem_rw_page,
+#if IS_ENABLED(CONFIG_PMEM_IO)
.direct_access =pmem_direct_access,
+#endif
 };
 
 static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 5da8515b8fb9..8616c1d33786 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -29,7 +29,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t 
mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
 static long dcssblk_direct_access(struct block_device *bdev

[PATCH v2 10/10] block: base support for pfn i/o

2015-05-06 Thread Dan Williams
Allow block device drivers to opt-in to receiving bio(s) where the
bio_vec(s) point to memory that is not backed by struct page entries.
When a driver opts in it asserts that it will use the __pfn_t versions of the
dma_map/kmap/scatterlist apis in its bio submission path.
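
The opt-in can be pictured with a small userspace model.  This is only a
sketch under stated assumptions, not the block layer code: model_queue,
model_bio and model_bio_add_pfn() are hypothetical stand-ins for the
request_queue flag test (blk_queue_pfn()), struct bio and bio_add_pfn()
in this patch.  The point it illustrates is that a pfn-only vector is
refused (0 bytes added) unless the queue has declared it can cope.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request_queue that may opt in to pfn i/o. */
struct model_queue {
	bool pfn_capable;	/* models the blk_queue_pfn(q) test */
};

/* Hypothetical stand-in for a bio; only the fields the sketch needs. */
struct model_bio {
	const struct model_queue *q;
	bool pfn_flag;		/* models the BIO_PFN flag */
	unsigned int bytes;	/* total payload added so far */
};

/*
 * Returns the number of bytes added, or 0 if the queue has not opted in.
 * The pfn itself is unused because the model stores no vectors.
 */
static inline unsigned int model_bio_add_pfn(struct model_bio *bio,
					     unsigned long pfn,
					     unsigned int len)
{
	(void)pfn;
	assert(bio && bio->q);

	if (!bio->q->pfn_capable)
		return 0;

	bio->pfn_flag = true;	/* mark the bio as not page backed */
	bio->bytes += len;
	return len;
}
```

A caller that gets 0 back has to fall back to a page-backed path, which
is the same contract bio_add_page() callers already handle.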

Cc: Tejun Heo t...@kernel.org
Cc: Jens Axboe ax...@kernel.dk
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 block/bio.c   |   48 ++---
 block/blk-core.c  |9 
 include/linux/blk_types.h |1 +
 include/linux/blkdev.h|2 ++
 4 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 7100fd6d5898..9c506dd6a093 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -567,6 +567,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio->bi_rw = bio_src->bi_rw;
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
+   bio->bi_flags |= bio_src->bi_flags & (1 << BIO_PFN);
 }
 EXPORT_SYMBOL(__bio_clone_fast);
 
@@ -658,6 +659,8 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t 
gfp_mask,
goto integrity_clone;
}
 
+   bio->bi_flags |= bio_src->bi_flags & (1 << BIO_PFN);
+
bio_for_each_segment(bv, bio_src, iter)
bio->bi_io_vec[bio->bi_vcnt++] = bv;
 
@@ -699,9 +702,9 @@ int bio_get_nr_vecs(struct block_device *bdev)
 }
 EXPORT_SYMBOL(bio_get_nr_vecs);
 
-static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
- *page, unsigned int len, unsigned int offset,
- unsigned int max_sectors)
+static int __bio_add_pfn(struct request_queue *q, struct bio *bio,
+   __pfn_t pfn, unsigned int len, unsigned int offset,
+   unsigned int max_sectors)
 {
int retried_segments = 0;
struct bio_vec *bvec;
@@ -723,7 +726,7 @@ static int __bio_add_page(struct request_queue *q, struct 
bio *bio, struct page
if (bio->bi_vcnt > 0) {
struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
-   if (page == bvec_page(prev) &&
+   if (pfn.pfn == prev->bv_pfn.pfn &&
offset == prev->bv_offset + prev->bv_len) {
unsigned int prev_bv_len = prev->bv_len;
prev->bv_len += len;
@@ -768,7 +771,7 @@ static int __bio_add_page(struct request_queue *q, struct 
bio *bio, struct page
 * cannot add the page
 */
bvec = &bio->bi_io_vec[bio->bi_vcnt];
-   bvec_set_page(bvec, page);
+   bvec->bv_pfn = pfn;
bvec->bv_len = len;
bvec->bv_offset = offset;
bio->bi_vcnt++;
@@ -818,7 +821,7 @@ static int __bio_add_page(struct request_queue *q, struct 
bio *bio, struct page
return len;
 
  failed:
-   bvec_set_page(bvec, NULL);
+   bvec->bv_pfn.pfn = 0;
bvec->bv_len = 0;
bvec->bv_offset = 0;
bio->bi_vcnt--;
@@ -845,7 +848,7 @@ static int __bio_add_page(struct request_queue *q, struct 
bio *bio, struct page
 int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page 
*page,
unsigned int len, unsigned int offset)
 {
-   return __bio_add_page(q, bio, page, len, offset,
+   return __bio_add_pfn(q, bio, page_to_pfn_t(page), len, offset,
  queue_max_hw_sectors(q));
 }
 EXPORT_SYMBOL(bio_add_pc_page);
@@ -872,10 +875,39 @@ int bio_add_page(struct bio *bio, struct page *page, 
unsigned int len,
if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
max_sectors = len >> 9;
 
-   return __bio_add_page(q, bio, page, len, offset, max_sectors);
+   return __bio_add_pfn(q, bio, page_to_pfn_t(page), len, offset,
+   max_sectors);
 }
 EXPORT_SYMBOL(bio_add_page);
 
+/**
+ * bio_add_pfn -   attempt to add pfn to bio
+ * @bio: destination bio
+ * @pfn: pfn to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Identical to bio_add_page() except this variant flags the bio as
+ * not having struct page backing.  A given request_queue must assert
+ * that it is prepared to handle this constraint before bio(s)
+ * flagged in this manner can be passed.
+ */
+int bio_add_pfn(struct bio *bio, __pfn_t pfn, unsigned int len,
+   unsigned int offset)
+{
+   struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+   unsigned int max_sectors;
+
+   if (!blk_queue_pfn(q))
+   return 0;
+   set_bit(BIO_PFN, &bio->bi_flags);
+   max_sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
+   if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
+   max_sectors = len >> 9;
+
+   return __bio_add_pfn(q, bio, pfn, len, offset, max_sectors);
+}
+
 struct submit_bio_ret {
struct completion event;
int error;
diff --git a/block/blk-core.c b/block/blk-core.c
index 94d2c6ccf801..4eefff363986 100644

[PATCH v2 05/10] scatterlist: use sg_phys()

2015-05-06 Thread Dan Williams
Coccinelle cleanup to replace open coded sg to physical address
translations.  This is in preparation for introducing scatterlists that
reference pfn(s) without a backing struct page.

// sg_phys.cocci: convert usage page_to_phys(sg_page(sg)) to sg_phys(sg)
// usage: make coccicheck COCCI=sg_phys.cocci MODE=patch

virtual patch
virtual report
virtual org

@@
struct scatterlist *sg;
@@

- page_to_phys(sg_page(sg)) + sg->offset
+ sg_phys(sg)

@@
struct scatterlist *sg;
@@

- page_to_phys(sg_page(sg))
+ sg_phys(sg) - sg->offset
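
Both rules rest on one identity: sg_phys() is the physical address of
the page plus the entry's offset.  The following userspace model is a
minimal sketch of that arithmetic, not the kernel helpers; model_sg,
model_page_to_phys(), model_sg_phys() and the 4K page size are
illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumed 4K pages */

/* Hypothetical stand-in for a scatterlist entry. */
struct model_sg {
	uint64_t page_pfn;	/* frame number of the page behind sg_page(sg) */
	unsigned int offset;	/* byte offset into that page */
	unsigned int length;	/* byte length of the entry */
};

/* Models page_to_phys(): physical address of the start of the page. */
static inline uint64_t model_page_to_phys(uint64_t pfn)
{
	return pfn << MODEL_PAGE_SHIFT;
}

/* Models sg_phys(): page start plus the offset within the page. */
static inline uint64_t model_sg_phys(const struct model_sg *sg)
{
	assert(sg);
	return model_page_to_phys(sg->page_pfn) + sg->offset;
}
```

The first rule replaces the open-coded sum with the helper, and the
second recovers the page-aligned address by subtracting the offset
again, so neither changes the value that is computed.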

Cc: Julia Lawall julia.law...@lip6.fr
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 arch/arm/mm/dma-mapping.c|2 +-
 arch/microblaze/kernel/dma.c |2 +-
 drivers/iommu/intel-iommu.c  |4 ++--
 drivers/iommu/iommu.c|2 +-
 drivers/staging/android/ion/ion_chunk_heap.c |4 ++--
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 09c5fe3d30c2..43cc6a8fdacc 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1502,7 +1502,7 @@ static int __map_sg_chunk(struct device *dev, struct 
scatterlist *sg,
return -ENOMEM;
 
for (count = 0, s = sg; count < (size >> PAGE_SHIFT); s = sg_next(s)) {
-   phys_addr_t phys = page_to_phys(sg_page(s));
+   phys_addr_t phys = sg_phys(s) - s->offset;
unsigned int len = PAGE_ALIGN(s->offset + s->length);
 
if (!is_coherent &&
diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
index ed7ba8a11822..dcb3c594d626 100644
--- a/arch/microblaze/kernel/dma.c
+++ b/arch/microblaze/kernel/dma.c
@@ -61,7 +61,7 @@ static int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl,
/* FIXME this part of code is untested */
for_each_sg(sgl, sg, nents, i) {
sg->dma_address = sg_phys(sg);
-   __dma_sync(page_to_phys(sg_page(sg)) + sg->offset,
+   __dma_sync(sg_phys(sg),
sg->length, direction);
}
 
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 68d43beccb7e..9b9ada71e0d3 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1998,7 +1998,7 @@ static int __domain_mapping(struct dmar_domain *domain, 
unsigned long iov_pfn,
sg_res = aligned_nrpages(sg->offset, sg->length);
sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
sg->dma_length = sg->length;
-   pteval = page_to_phys(sg_page(sg)) | prot;
+   pteval = (sg_phys(sg) - sg->offset) | prot;
phys_pfn = pteval >> VTD_PAGE_SHIFT;
}
 
@@ -3302,7 +3302,7 @@ static int intel_nontranslate_map_sg(struct device *hddev,
 
for_each_sg(sglist, sg, nelems, i) {
BUG_ON(!sg_page(sg));
-   sg->dma_address = page_to_phys(sg_page(sg)) + sg->offset;
+   sg->dma_address = sg_phys(sg);
sg->dma_length = sg->length;
}
return nelems;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d4f527e56679..59808fc9110d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1147,7 +1147,7 @@ size_t default_iommu_map_sg(struct iommu_domain *domain, 
unsigned long iova,
min_pagesz = 1 << __ffs(domain->ops->pgsize_bitmap);
 
for_each_sg(sg, s, nents, i) {
-   phys_addr_t phys = page_to_phys(sg_page(s)) + s->offset;
+   phys_addr_t phys = sg_phys(s);
 
/*
 * We are mapping on IOMMU page boundaries, so offset within
diff --git a/drivers/staging/android/ion/ion_chunk_heap.c 
b/drivers/staging/android/ion/ion_chunk_heap.c
index 3e6ec2ee6802..b7da5d142aa9 100644
--- a/drivers/staging/android/ion/ion_chunk_heap.c
+++ b/drivers/staging/android/ion/ion_chunk_heap.c
@@ -81,7 +81,7 @@ static int ion_chunk_heap_allocate(struct ion_heap *heap,
 err:
sg = table->sgl;
for (i -= 1; i >= 0; i--) {
-   gen_pool_free(chunk_heap->pool, page_to_phys(sg_page(sg)),
+   gen_pool_free(chunk_heap->pool, sg_phys(sg) - sg->offset,
  sg->length);
sg = sg_next(sg);
}
@@ -109,7 +109,7 @@ static void ion_chunk_heap_free(struct ion_buffer *buffer)
DMA_BIDIRECTIONAL);
 
for_each_sg(table->sgl, sg, table->nents, i) {
-   gen_pool_free(chunk_heap->pool, page_to_phys(sg_page(sg)),
+   gen_pool_free(chunk_heap->pool, sg_phys(sg) - sg->offset,
  sg->length);
}
chunk_heap->allocated -= allocated_size;


[PATCH v2 01/10] arch: introduce __pfn_t for persistent memory i/o

2015-05-06 Thread Dan Williams
Introduce a type that encapsulates a page-frame-number that is
optionally backed by memmap (struct page).  This type will be used in
place of 'struct page *' instances in contexts where persistent memory
is being referenced (scatterlists for drivers, biovecs for the block
layer, etc).  The operations in those i/o paths that formerly required a
'struct page *' are to be converted to use __pfn_t aware equivalent
helpers.  Otherwise, in the absence of persistent memory, there is no
functional change and __pfn_t is an alias for a normal memory page.

It turns out that while 'struct page' references are used broadly in the
kernel I/O stacks the usage of 'struct page' based capabilities is very
shallow.  It is only used for populating bio_vecs and scatterlists for
the retrieval of dma addresses, and for temporary kernel mappings
(kmap).  Aside from kmap, these usages can be trivially converted to
operate on a pfn.

Indeed, kmap_atomic() is more problematic as it uses mm infrastructure,
via struct page, to set up and track temporary kernel mappings.  It would
be unfortunate if the kmap infrastructure escaped its 32-bit/HIGHMEM
bonds and leaked into 64-bit code.  Thankfully, it seems all that is
needed here is to convert kmap_atomic() callers that want to opt in to
supporting persistent memory to use a new kmap_atomic_pfn_t(), which
re-uses the existing ioremap() mapping established by the driver for
persistent memory.

Note, that as far as conceptually understanding __pfn_t is concerned,
'persistent memory' is really any address range in host memory not
covered by memmap.  Contrast this with pure iomem that is on an mmio
mapped bus like PCI and cannot be converted to a dma_addr_t by "pfn <<
PAGE_SHIFT".
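
To make the encoding concrete, here is a small userspace model of the
idea.  It is only a sketch, not the kernel code: model_pfn_t,
MODEL_PAGE_OFFSET, MODEL_MEMMAP_BASE and the 64-byte struct page size
are illustrative assumptions.  The one property borrowed from the patch
is that a value below PAGE_OFFSET is treated as a raw pfn and anything
else as a pointer into memmap.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants only; real values are architecture specific. */
#define MODEL_PAGE_SHIFT       12
#define MODEL_PAGE_OFFSET      0xffff880000000000ULL /* assumed start of kernel addresses */
#define MODEL_MEMMAP_BASE      MODEL_PAGE_OFFSET     /* pretend memmap[] starts here */
#define MODEL_STRUCT_PAGE_SIZE 64ULL                 /* assumed sizeof(struct page) */

/* One word that holds either a raw pfn or the address of a struct page. */
typedef struct {
	uint64_t val;
} model_pfn_t;

/* A value below PAGE_OFFSET cannot be a kernel pointer, so it is a raw pfn. */
static inline int model_pfn_t_has_page(model_pfn_t p)
{
	return p.val >= MODEL_PAGE_OFFSET;
}

/* Physical address of the frame, with or without memmap backing. */
static inline uint64_t model_pfn_t_to_phys(model_pfn_t p)
{
	uint64_t pfn = p.val;

	if (model_pfn_t_has_page(p)) {
		/* a struct page address must sit on an entry boundary */
		assert((p.val - MODEL_MEMMAP_BASE) % MODEL_STRUCT_PAGE_SIZE == 0);
		pfn = (p.val - MODEL_MEMMAP_BASE) / MODEL_STRUCT_PAGE_SIZE;
	}
	return pfn << MODEL_PAGE_SHIFT;
}
```

Callers that only need a physical address never have to know which of
the two forms they were handed, which is what lets the dma-mapping and
scatterlist paths be converted mechanically.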

Cc: H. Peter Anvin h...@zytor.com
Cc: Jens Axboe ax...@kernel.dk
Cc: Tejun Heo t...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Linus Torvalds torva...@linux-foundation.org
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 include/asm-generic/memory_model.h |1 -
 include/asm-generic/pfn.h  |   51 
 include/linux/mm.h |1 +
 init/Kconfig   |   13 +
 4 files changed, 65 insertions(+), 1 deletion(-)
 create mode 100644 include/asm-generic/pfn.h

diff --git a/include/asm-generic/memory_model.h 
b/include/asm-generic/memory_model.h
index 14909b0b9cae..1b0ae21fd8ff 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -70,7 +70,6 @@
 #endif /* CONFIG_FLATMEM/DISCONTIGMEM/SPARSEMEM */
 
 #define page_to_pfn __page_to_pfn
-#define pfn_to_page __pfn_to_page
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
new file mode 100644
index ..91171e0285d9
--- /dev/null
+++ b/include/asm-generic/pfn.h
@@ -0,0 +1,51 @@
+#ifndef __ASM_PFN_H
+#define __ASM_PFN_H
+
+#ifndef __pfn_to_phys
+#define __pfn_to_phys(pfn)  ((dma_addr_t)(pfn) << PAGE_SHIFT)
+#endif
+
+static inline struct page *pfn_to_page(unsigned long pfn)
+{
+   return __pfn_to_page(pfn);
+}
+
+/*
+ * __pfn_t: encapsulates a page-frame number that is optionally backed
+ * by memmap (struct page).  This type will be used in place of a
+ * 'struct page *' instance in contexts where unmapped memory (usually
+ * persistent memory) is being referenced (scatterlists for drivers,
+ * biovecs for the block layer, etc).
+ */
+typedef struct {
+   union {
+   unsigned long pfn;
+   struct page *page;
+   };
+} __pfn_t;
+
+static inline struct page *__pfn_t_to_page(__pfn_t pfn)
+{
+#if IS_ENABLED(CONFIG_PMEM_IO)
+   if (pfn.pfn < PAGE_OFFSET)
+   return NULL;
+#endif
+   return pfn.page;
+}
+
+static inline dma_addr_t __pfn_t_to_phys(__pfn_t pfn)
+{
+#if IS_ENABLED(CONFIG_PMEM_IO)
+   if (pfn.pfn < PAGE_OFFSET)
+   return __pfn_to_phys(pfn.pfn);
+#endif
+   return __pfn_to_phys(page_to_pfn(pfn.page));
+}
+
+static inline __pfn_t page_to_pfn_t(struct page *page)
+{
+   __pfn_t pfn = { .page = page };
+
+   return pfn;
+}
+#endif /* __ASM_PFN_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9fd03a7..9d35cff41c12 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -52,6 +52,7 @@ extern int sysctl_legacy_va_layout;
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/processor.h>
+#include <asm-generic/pfn.h>
 
 #ifndef __pa_symbol
 #define __pa_symbol(x)  __pa(RELOC_HIDE((unsigned long)(x), 0))
diff --git a/init/Kconfig b/init/Kconfig
index dc24dec60232..7d2ad350fd29 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1764,6 +1764,19 @@ config PROFILING
  Say Y here to enable the extended profiling support mechanisms used
  by profilers such as OProfile.
 
+config PMEM_IO
+   default n
+   bool "Support for I/O, DAX, DMA, RDMA to unmapped (persistent) memory" if EXPERT
+   help
+ Say Y here

[PATCH v2 03/10] block: convert .bv_page to .bv_pfn bio_vec

2015-05-06 Thread Dan Williams
Carry an __pfn_t in a bio_vec rather than a 'struct page *' in support
of allowing a bio to reference unmapped (not struct page backed)
persistent memory.

This also fixes up the macros and static initializers that were not
automatically converted by the Coccinelle script that introduced the
bvec_page() and bvec_set_page() helpers.

If CONFIG_PMEM_IO=n this is functionally equivalent to the status quo as
the __pfn_t helpers can assume that a __pfn_t always has a corresponding
struct page.

Cc: Jens Axboe ax...@kernel.dk
Cc: Matthew Wilcox wi...@linux.intel.com
Cc: Dave Hansen dave.han...@linux.intel.com
Cc: Julia Lawall julia.law...@lip6.fr
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 block/blk-integrity.c |4 ++--
 block/blk-merge.c |6 +++---
 block/bounce.c|2 +-
 drivers/md/bcache/btree.c |2 +-
 include/linux/bio.h   |   24 +---
 include/linux/blk_types.h |   13 ++---
 lib/iov_iter.c|   22 +++---
 mm/page_io.c  |4 ++--
 8 files changed, 43 insertions(+), 34 deletions(-)

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 0458f31f075a..351198fbda3c 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -43,7 +43,7 @@ static const char *bi_unsupported_name = "unsupported";
  */
 int blk_rq_count_integrity_sg(struct request_queue *q, struct bio *bio)
 {
-   struct bio_vec iv, ivprv = { NULL };
+   struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
unsigned int segments = 0;
unsigned int seg_size = 0;
struct bvec_iter iter;
@@ -89,7 +89,7 @@ EXPORT_SYMBOL(blk_rq_count_integrity_sg);
 int blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio,
struct scatterlist *sglist)
 {
-   struct bio_vec iv, ivprv = { NULL };
+   struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
struct scatterlist *sg = NULL;
unsigned int segments = 0;
struct bvec_iter iter;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 47ceefacd320..218ad1e57a49 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -13,7 +13,7 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
 struct bio *bio,
 bool no_sg_merge)
 {
-   struct bio_vec bv, bvprv = { NULL };
+   struct bio_vec bv, bvprv = BIO_VEC_INIT(bvprv);
int cluster, high, highprv = 1;
unsigned int seg_size, nr_phys_segs;
struct bio *fbio, *bbio;
@@ -123,7 +123,7 @@ EXPORT_SYMBOL(blk_recount_segments);
 static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
   struct bio *nxt)
 {
-   struct bio_vec end_bv = { NULL }, nxt_bv;
+   struct bio_vec end_bv = BIO_VEC_INIT(end_bv), nxt_bv;
struct bvec_iter iter;
 
if (!blk_queue_cluster(q))
@@ -202,7 +202,7 @@ static int __blk_bios_map_sg(struct request_queue *q, 
struct bio *bio,
 struct scatterlist *sglist,
 struct scatterlist **sg)
 {
-   struct bio_vec bvec, bvprv = { NULL };
+   struct bio_vec bvec, bvprv = BIO_VEC_INIT(bvprv);
struct bvec_iter iter;
int nsegs, cluster;
 
diff --git a/block/bounce.c b/block/bounce.c
index 0390e44d6e1b..4a3098067c81 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -64,7 +64,7 @@ static void bounce_copy_vec(struct bio_vec *to, unsigned char 
*vfrom)
 #else /* CONFIG_HIGHMEM */
 
 #define bounce_copy_vec(to, vfrom) \
-   memcpy(page_address((to)->bv_page) + (to)->bv_offset, vfrom, (to)->bv_len)
+   memcpy(page_address(bvec_page(to)) + (to)->bv_offset, vfrom, (to)->bv_len)
 
 #endif /* CONFIG_HIGHMEM */
 
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2e76e8b62902..36bbe29a806b 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -426,7 +426,7 @@ static void do_btree_node_write(struct btree *b)
void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
 
bio_for_each_segment_all(bv, b->bio, j)
-   memcpy(page_address(bv->bv_page),
+   memcpy(page_address(bvec_page(bv)),
   base + j * PAGE_SIZE, PAGE_SIZE);
 
bch_submit_bbio(b->bio, b->c, &k.key, 0);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index da3a127c9958..a59d97cbfe13 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -63,8 +63,8 @@
  */
 #define __bvec_iter_bvec(bvec, iter)   (&(bvec)[(iter).bi_idx])
 
-#define bvec_iter_page(bvec, iter) \
-   (__bvec_iter_bvec((bvec), (iter))->bv_page)
+#define bvec_iter_pfn(bvec, iter)  \
+   (__bvec_iter_bvec((bvec), (iter))->bv_pfn)
 
 #define bvec_iter_len(bvec, iter)  \
min((iter

[PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-06 Thread Dan Williams
Changes since v1 [1]:

1/ added include/asm-generic/pfn.h for the __pfn_t definition and helpers.

2/ added kmap_atomic_pfn_t()

3/ rebased on v4.1-rc2

[1]: http://marc.info/?l=linux-kernel&m=142653770511970&w=2

---

A lead-in note: this looks scarier than it is.  Most of the code thrash
is automated via Coccinelle.  Also the subtle differences behind an
'unsigned long pfn' and a '__pfn_t' are mitigated by type-safety and a
Kconfig option (default disabled CONFIG_PMEM_IO) that globally controls
whether a pfn and a __pfn_t are equivalent.

The motivation for this change is persistent memory and the desire to
use it not only via the pmem driver, but also as a memory target for I/O
(DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.  Aside
from the pmem driver and DAX, persistent memory is not able to be used
in these I/O scenarios due to the lack of a backing struct page, i.e.
persistent memory is not part of the memmap.  This patchset takes the
position that the solution is to teach I/O paths that want to operate on
persistent memory to do so by referencing a __pfn_t.  The alternatives
are discussed in the changelog for [PATCH v2 01/10] arch: introduce
__pfn_t for persistent memory i/o, copied here:

Alternatives:

1/ Provide struct page coverage for persistent memory in
   DRAM.  The expectation is that persistent memory capacities make
   this untenable in the long term.

2/ Provide struct page coverage for persistent memory with
   persistent memory.  While persistent memory may have near DRAM
   performance characteristics it may not have the same
   write-endurance of DRAM.  Given the update frequency of struct
   page objects it may not be suitable for persistent memory.

3/ Dynamically allocate struct page.  This appears to be on
   the order of the complexity of converting code paths to use
   __pfn_t references instead of struct page, and the amount of
   setup required to establish a valid struct page reference is
   mostly wasted when the only usage in the block stack is to
   perform a page_to_pfn() conversion for dma-mapping.  Instances
   of kmap() / kmap_atomic() usage appear to be the only occasions
   in the block stack where struct page is non-trivially used.  A
   new kmap_atomic_pfn_t() is proposed to handle those cases.
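
As a rough sense of scale for alternatives 1/ and 2/, the memmap
overhead can be estimated as one struct page per page of capacity.  The
helper below is a back-of-the-envelope sketch; model_memmap_bytes() is a
hypothetical name, and the 4K page size and 64-byte struct page used in
the example are assumptions, not figures from this series.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of memmap needed to give every page of 'capacity' a struct page. */
static inline uint64_t model_memmap_bytes(uint64_t capacity,
					  uint64_t page_size,
					  uint64_t struct_page_size)
{
	assert(page_size != 0);
	return capacity / page_size * struct_page_size;
}
```

With those assumed sizes, model_memmap_bytes(1ULL << 40, 4096, 64)
comes to 16 GiB for 1 TiB of persistent memory, roughly 1.6% of the
capacity, which either has to come out of DRAM (1/) or be updated in
place on the persistent media (2/).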

---

Dan Williams (9):
  arch: introduce __pfn_t for persistent memory i/o
  block: add helpers for accessing a bio_vec page
  block: convert .bv_page to .bv_pfn bio_vec
  dma-mapping: allow archs to optionally specify a ->map_pfn() operation
  scatterlist: use sg_phys()
  x86: support dma_map_pfn()
  x86: support kmap_atomic_pfn_t() for persistent memory
  dax: convert to __pfn_t
  block: base support for pfn i/o

Matthew Wilcox (1):
  scatterlist: support page-less (__pfn_t only) entries


 arch/Kconfig |6 ++
 arch/arm/mm/dma-mapping.c|2 -
 arch/microblaze/kernel/dma.c |2 -
 arch/powerpc/sysdev/axonram.c|6 +-
 arch/x86/Kconfig |7 ++
 arch/x86/kernel/Makefile |1 
 arch/x86/kernel/amd_gart_64.c|   22 +-
 arch/x86/kernel/kmap.c   |   95 ++
 arch/x86/kernel/pci-nommu.c  |   22 +-
 arch/x86/kernel/pci-swiotlb.c|4 +
 arch/x86/pci/sta2x11-fixup.c |4 +
 arch/x86/xen/pci-swiotlb-xen.c   |4 +
 block/bio-integrity.c|8 +-
 block/bio.c  |   82 --
 block/blk-core.c |   13 +++-
 block/blk-integrity.c|7 +-
 block/blk-lib.c  |2 -
 block/blk-merge.c|   15 ++--
 block/bounce.c   |   26 ---
 drivers/block/aoe/aoecmd.c   |8 +-
 drivers/block/brd.c  |6 +-
 drivers/block/drbd/drbd_bitmap.c |5 +
 drivers/block/drbd/drbd_main.c   |6 +-
 drivers/block/drbd/drbd_receiver.c   |4 +
 drivers/block/drbd/drbd_worker.c |3 +
 drivers/block/floppy.c   |6 +-
 drivers/block/loop.c |   13 ++--
 drivers/block/nbd.c  |8 +-
 drivers/block/nvme-core.c|2 -
 drivers/block/pktcdvd.c  |   11 ++-
 drivers/block/pmem.c |   16 +++-
 drivers/block/ps3disk.c  |2 -
 drivers/block/ps3vram.c  |2 -
 drivers/block/rbd.c  |2 -
 drivers/block/rsxx/dma.c |2 -
 drivers/block/umem.c |2 -
 drivers/block/zram

Re: [Linux-nvdimm] [PATCH v2 08/10] x86: support kmap_atomic_pfn_t() for persistent memory

2015-05-06 Thread Dan Williams
On Wed, May 6, 2015 at 1:05 PM, Dan Williams dan.j.willi...@intel.com wrote:
 It would be unfortunate if the kmap infrastructure escaped its current
 32-bit/HIGHMEM bonds and leaked into 64-bit code.  Instead, if the user
 has enabled CONFIG_PMEM_IO we direct the kmap_atomic_pfn_t()
 implementation to scan a list of pre-mapped persistent memory address
 ranges inserted by the pmem driver.

 The __pfn_t to resource lookup is indeed inefficient walking of a linked list,
 but there are two mitigating factors:

 1/ The number of persistent memory ranges is bounded by the number of
DIMMs which is on the order of 10s of DIMMs, not hundreds.

 2/ The lookup yields the entire range, if it becomes inefficient to do a
kmap_atomic_pfn_t() a PAGE_SIZE at a time the caller can take
advantage of the fact that the lookup can be amortized for all kmap
operations it needs to perform in a given range.
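
The second point can be sketched in userspace as follows.  This is a
minimal model under stated assumptions, not the code in
arch/x86/kernel/kmap.c: model_range, model_find_range(),
model_range_addr() and the two example ranges are hypothetical, and
there is no RCU here.  It shows the list walk being done once per range
and the per-page address then being a plain offset calculation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one registered persistent memory range. */
struct model_range {
	uint64_t start, end;		/* inclusive physical bounds */
	char *base;			/* where 'start' is already mapped */
	struct model_range *next;
};

/* Linear walk of the list; returns the range containing addr, or NULL. */
static inline const struct model_range *
model_find_range(const struct model_range *head, uint64_t addr)
{
	for (; head; head = head->next)
		if (addr >= head->start && addr <= head->end)
			return head;
	return NULL;
}

/* With the range in hand, every further address is a constant-time offset. */
static inline void *model_range_addr(const struct model_range *r, uint64_t addr)
{
	assert(r && addr >= r->start && addr <= r->end);
	return r->base + (addr - r->start);
}

/* Example data: two 8K ranges backed by ordinary buffers. */
static char model_backing[2][8192];
static struct model_range model_r1 = {
	0x200000000ULL, 0x200001fffULL, model_backing[1], NULL
};
static struct model_range model_r0 = {
	0x100000000ULL, 0x100001fffULL, model_backing[0], &model_r1
};
```

A caller copying a multi-page buffer would call model_find_range() once
and then model_range_addr() for each page, instead of walking the list
a PAGE_SIZE at a time.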

 Signed-off-by: Dan Williams dan.j.willi...@intel.com
 ---
  arch/Kconfig |3 +
  arch/x86/Kconfig |2 +
  arch/x86/kernel/Makefile |1
  arch/x86/kernel/kmap.c   |   95 
 ++
  drivers/block/pmem.c |6 +++
  include/linux/highmem.h  |   23 +++
  6 files changed, 130 insertions(+)
  create mode 100644 arch/x86/kernel/kmap.c

 diff --git a/arch/Kconfig b/arch/Kconfig
 index f7f800860c00..69d3a3fa21af 100644
 --- a/arch/Kconfig
 +++ b/arch/Kconfig
 @@ -206,6 +206,9 @@ config HAVE_DMA_CONTIGUOUS
  config HAVE_DMA_PFN
 bool

 +config HAVE_KMAP_PFN
 +   bool
 +
  config GENERIC_SMP_IDLE_THREAD
 bool

 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
 index 1fae5e842423..eddaea839500 100644
 --- a/arch/x86/Kconfig
 +++ b/arch/x86/Kconfig
 @@ -1434,7 +1434,9 @@ config X86_PMEM_LEGACY
   Say Y if unsure.

  config X86_PMEM_DMA
 +   depends on !HIGHMEM
 def_bool PMEM_IO
 +   select HAVE_KMAP_PFN
 select HAVE_DMA_PFN

  config HIGHPTE
 diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
 index 9bcd0b56ca17..44c323342996 100644
 --- a/arch/x86/kernel/Makefile
 +++ b/arch/x86/kernel/Makefile
 @@ -96,6 +96,7 @@ obj-$(CONFIG_PARAVIRT)+= paravirt.o 
 paravirt_patch_$(BITS).o
  obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
  obj-$(CONFIG_PARAVIRT_CLOCK)   += pvclock.o
  obj-$(CONFIG_X86_PMEM_LEGACY)  += pmem.o
 +obj-$(CONFIG_X86_PMEM_DMA) += kmap.o

  obj-$(CONFIG_PCSPKR_PLATFORM)  += pcspeaker.o

 diff --git a/arch/x86/kernel/kmap.c b/arch/x86/kernel/kmap.c
 new file mode 100644
 index ..d597c475377b
 --- /dev/null
 +++ b/arch/x86/kernel/kmap.c
 @@ -0,0 +1,95 @@
 +/*
 + * Copyright(c) 2015 Intel Corporation. All rights reserved.
 + *
 + * This program is free software; you can redistribute it and/or modify
 + * it under the terms of version 2 of the GNU General Public License as
 + * published by the Free Software Foundation.
 + *
 + * This program is distributed in the hope that it will be useful, but
 + * WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 + * General Public License for more details.
 + */
 +#include <linux/rcupdate.h>
 +#include <linux/rculist.h>
 +#include <linux/highmem.h>
 +#include <linux/device.h>
 +#include <linux/slab.h>
 +#include <linux/mm.h>
 +
 +static LIST_HEAD(ranges);
 +
 +struct kmap {
 +   struct list_head list;
 +   struct resource *res;
 +   struct device *dev;
 +   void *base;
 +};
 +
 +static void teardown_kmap(void *data)
 +{
 +   struct kmap *kmap = data;
 +
 +   dev_dbg(kmap->dev, "kmap unregister %pr\n", kmap->res);
 +   list_del_rcu(&kmap->list);
 +   synchronize_rcu();
 +   kfree(kmap);
 +}
 +
 +int devm_register_kmap_pfn_range(struct device *dev, struct resource *res,
 +   void *base)
 +{
 +   struct kmap *kmap = kzalloc(sizeof(*kmap), GFP_KERNEL);
 +   int rc;
 +
 +   if (!kmap)
 +   return -ENOMEM;
 +
 +   INIT_LIST_HEAD(&kmap->list);
 +   kmap->res = res;
 +   kmap->base = base;
 +   kmap->dev = dev;
 +   rc = devm_add_action(dev, teardown_kmap, kmap);
 +   if (rc) {
 +   kfree(kmap);
 +   return rc;
 +   }
 +   dev_dbg(kmap->dev, "kmap register %pr\n", kmap->res);
 +   list_add_rcu(&kmap->list, &ranges);
 +   return 0;
 +}
 +EXPORT_SYMBOL_GPL(devm_register_kmap_pfn_range);
 +
 +void *kmap_atomic_pfn_t(__pfn_t pfn)
 +{
 +   struct page *page = __pfn_t_to_page(pfn);
 +   resource_size_t addr;
 +   struct kmap *kmap;
 +
 +   if (page)
 +   return kmap_atomic(page);
 +   addr = __pfn_t_to_phys(pfn);
 +   rcu_read_lock();
 +   list_for_each_entry_rcu(kmap, &ranges, list)
 +   if (addr >= kmap->res->start && addr <= kmap->res->end)
 +   return kmap->base + addr - kmap->res->start;
 +
 +   /* only unlock in the error case

[PATCH v2 07/10] x86: support dma_map_pfn()

2015-05-06 Thread Dan Williams
Fix up x86 dma_map_ops to allow pfn-only mappings.

As long as a dma_map_sg() implementation uses the generic sg_phys()
helpers it can support scatterlists that use __pfn_t instead of struct
page.

Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 arch/x86/Kconfig |5 +
 arch/x86/kernel/amd_gart_64.c|   22 +-
 arch/x86/kernel/pci-nommu.c  |   22 +-
 arch/x86/kernel/pci-swiotlb.c|4 
 arch/x86/pci/sta2x11-fixup.c |4 
 arch/x86/xen/pci-swiotlb-xen.c   |4 
 drivers/iommu/amd_iommu.c|   21 -
 drivers/iommu/intel-iommu.c  |   22 +-
 drivers/xen/swiotlb-xen.c|   29 +++--
 include/asm-generic/dma-mapping-common.h |4 ++--
 include/asm-generic/scatterlist.h|1 +
 include/linux/swiotlb.h  |4 
 lib/swiotlb.c|   20 +++-
 13 files changed, 125 insertions(+), 37 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 226d5696e1d1..1fae5e842423 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -796,6 +796,7 @@ config CALGARY_IOMMU
bool "IBM Calgary IOMMU support"
select SWIOTLB
depends on X86_64 && PCI
+   depends on !HAVE_DMA_PFN
---help---
  Support for hardware IOMMUs in IBM's xSeries x366 and x460
  systems. Needed to run systems with more than 3GB of memory
@@ -1432,6 +1433,10 @@ config X86_PMEM_LEGACY
 
  Say Y if unsure.
 
+config X86_PMEM_DMA
+   def_bool PMEM_IO
+   select HAVE_DMA_PFN
+
 config HIGHPTE
bool "Allocate 3rd-level pagetables from highmem"
depends on HIGHMEM
diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 8e3842fc8bea..8fad83c8dfd2 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -239,13 +239,13 @@ static dma_addr_t dma_map_area(struct device *dev, 
dma_addr_t phys_mem,
 }
 
 /* Map a single area into the IOMMU */
-static dma_addr_t gart_map_page(struct device *dev, struct page *page,
-   unsigned long offset, size_t size,
-   enum dma_data_direction dir,
-   struct dma_attrs *attrs)
+static dma_addr_t gart_map_pfn(struct device *dev, __pfn_t pfn,
+  unsigned long offset, size_t size,
+  enum dma_data_direction dir,
+  struct dma_attrs *attrs)
 {
unsigned long bus;
-   phys_addr_t paddr = page_to_phys(page) + offset;
+   phys_addr_t paddr = __pfn_t_to_phys(pfn) + offset;
 
if (!dev)
dev = x86_dma_fallback_dev;
@@ -259,6 +259,14 @@ static dma_addr_t gart_map_page(struct device *dev, struct page *page,
return bus;
 }
 
+static __maybe_unused dma_addr_t gart_map_page(struct device *dev,
+   struct page *page, unsigned long offset, size_t size,
+   enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+   return gart_map_pfn(dev, page_to_pfn_t(page), offset, size, dir,
+   attrs);
+}
+
 /*
  * Free a DMA mapping.
  */
@@ -699,7 +707,11 @@ static __init int init_amd_gatt(struct agp_kern_info *info)
 static struct dma_map_ops gart_dma_ops = {
.map_sg = gart_map_sg,
.unmap_sg   = gart_unmap_sg,
+#ifdef CONFIG_HAVE_DMA_PFN
+   .map_pfn= gart_map_pfn,
+#else
.map_page   = gart_map_page,
+#endif
.unmap_page = gart_unmap_page,
.alloc  = gart_alloc_coherent,
.free   = gart_free_coherent,
diff --git a/arch/x86/kernel/pci-nommu.c b/arch/x86/kernel/pci-nommu.c
index da15918d1c81..876dacfbabf6 100644
--- a/arch/x86/kernel/pci-nommu.c
+++ b/arch/x86/kernel/pci-nommu.c
@@ -25,12 +25,12 @@ check_addr(char *name, struct device *hwdev, dma_addr_t bus, size_t size)
return 1;
 }
 
-static dma_addr_t nommu_map_page(struct device *dev, struct page *page,
-unsigned long offset, size_t size,
-enum dma_data_direction dir,
-struct dma_attrs *attrs)
+static dma_addr_t nommu_map_pfn(struct device *dev, __pfn_t pfn,
+   unsigned long offset, size_t size,
+   enum dma_data_direction dir,
+   struct dma_attrs *attrs)
 {
-   dma_addr_t bus = page_to_phys(page) + offset;
+   dma_addr_t bus = __pfn_t_to_phys(pfn) + offset;
WARN_ON(size == 0);
if (!check_addr("map_single", dev, bus, size))
return DMA_ERROR_CODE;
@@ -38,6 +38,14 @@ static

[PATCH v2 02/10] block: add helpers for accessing a bio_vec page

2015-05-06 Thread Dan Williams
In preparation for converting struct bio_vec to carry a __pfn_t instead
of struct page.

This change is prompted by the desire to add in-kernel DMA support
(O_DIRECT, hierarchical storage, RDMA, etc) for persistent memory which
lacks struct page coverage.

Alternatives:

1/ Provide struct page coverage for persistent memory in DRAM.  The
   expectation is that persistent memory capacities make this untenable
   in the long term.

2/ Provide struct page coverage for persistent memory with persistent
   memory.  While persistent memory may have near DRAM performance
   characteristics it may not have the same write-endurance of DRAM.
   Given the update frequency of struct page objects it may not be
   suitable for persistent memory.

3/ Dynamically allocate struct page.  This appears to be on the order
   of the complexity of converting code paths to use __pfn_t references
   instead of struct page, and the amount of setup required to establish
   a valid struct page reference is mostly wasted when the only usage in
   the block stack is to perform a page_to_pfn() conversion for
   dma-mapping.  Instances of kmap() / kmap_atomic() usage appear to be
   the only occasions in the block stack where struct page is
   non-trivially used.  A new kmap_atomic_pfn_t() is proposed to handle
   those cases.

Generated with the following semantic patch:

// bv_page.cocci: convert usage of ->bv_page to use set/get helpers
// usage: make coccicheck COCCI=bv_page.cocci MODE=patch

virtual patch
virtual report
virtual org

@@
struct bio_vec bvec;
expression E;
type T;
@@

- bvec.bv_page = (T)E
+ bvec_set_page(&bvec, E)

@@
struct bio_vec *bvec;
expression E;
type T;
@@

- bvec->bv_page = (T)E
+ bvec_set_page(bvec, E)

@@
struct bio_vec bvec;
type T;
@@

- (T)bvec.bv_page
+ bvec_page(&bvec)

@@
struct bio_vec *bvec;
type T;
@@

- (T)bvec->bv_page
+ bvec_page(bvec)

@@
struct bio *bio;
expression E;
expression F;
type T;
@@

- bio->bi_io_vec[F].bv_page = (T)E
+ bvec_set_page(&bio->bi_io_vec[F], E)

@@
struct bio *bio;
expression E;
type T;
@@

- bio->bi_io_vec->bv_page = (T)E
+ bvec_set_page(bio->bi_io_vec, E)

@@
struct cached_dev *dc;
expression E;
type T;
@@

- dc->sb_bio.bi_io_vec->bv_page = (T)E
+ bvec_set_page(dc->sb_bio.bi_io_vec, E)

@@
struct cache *ca;
expression E;
expression F;
type T;
@@

- ca->sb_bio.bi_io_vec[F].bv_page = (T)E
+ bvec_set_page(&ca->sb_bio.bi_io_vec[F], E)

@@
struct cache *ca;
expression F;
@@

- ca->sb_bio.bi_io_vec[F].bv_page
+ bvec_page(&ca->sb_bio.bi_io_vec[F])

@@
struct cache *ca;
expression E;
expression F;
type T;
@@

- ca->sb_bio.bi_inline_vecs[F].bv_page = (T)E
+ bvec_set_page(&ca->sb_bio.bi_inline_vecs[F], E)

@@
struct cache *ca;
expression F;
@@

- ca->sb_bio.bi_inline_vecs[F].bv_page
+ bvec_page(&ca->sb_bio.bi_inline_vecs[F])


@@
struct cache *ca;
expression E;
type T;
@@

- ca->sb_bio.bi_io_vec->bv_page = (T)E
+ bvec_set_page(ca->sb_bio.bi_io_vec, E)

@@
struct bio *bio;
expression F;
@@

- bio->bi_io_vec[F].bv_page
+ bvec_page(&bio->bi_io_vec[F])

@@
struct bio bio;
expression F;
@@

- bio.bi_io_vec[F].bv_page
+ bvec_page(&bio.bi_io_vec[F])

@@
struct bio *bio;
@@

- bio->bi_io_vec->bv_page
+ bvec_page(bio->bi_io_vec)

@@
struct cached_dev *dc;
@@

- dc->sb_bio.bi_io_vec->bv_page
+ bvec_page(dc->sb_bio.bi_io_vec)


@@
struct bio bio;
@@

- bio.bi_io_vec-bv_page
+ bvec_page(bio.bi_io_vec)

@@
struct bio_integrity_payload *bip;
expression E;
type T;
@@

- bip->bip_vec->bv_page = (T)E
+ bvec_set_page(bip->bip_vec, E)

@@
struct bio_integrity_payload *bip;
@@

- bip->bip_vec->bv_page
+ bvec_page(bip->bip_vec)

@@
struct bio_integrity_payload bip;
@@

- bip.bip_vec->bv_page
+ bvec_page(bip.bip_vec)
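
For readers following the conversion: the semantic patch above only needs
two accessors, a getter and a setter that take a struct bio_vec pointer.
A minimal userspace sketch of what they might look like follows.  The
struct definitions are cut-down stand-ins rather than the kernel's, and
the helper bodies in the actual series may well differ (in particular
once bv_page is replaced by a __pfn_t).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page. */
struct page {
	int id;
};

/* Cut-down stand-in for struct bio_vec; only the usual three fields. */
struct bio_vec {
	struct page *bv_page;	/* the field the semantic patch hides */
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* Getter: replaces an open-coded read of bvec->bv_page. */
static inline struct page *bvec_page(const struct bio_vec *bvec)
{
	assert(bvec != NULL);
	return bvec->bv_page;
}

/* Setter: replaces an open-coded "bvec->bv_page = page" assignment. */
static inline void bvec_set_page(struct bio_vec *bvec, struct page *page)
{
	assert(bvec != NULL);
	bvec->bv_page = page;
}
```

With accessors like these a caller writes bvec_set_page(&bv, page) and
bvec_page(&bv) instead of touching bv_page directly, so the type of the
underlying field can change later without revisiting every user.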

Cc: Jens Axboe ax...@kernel.dk
Cc: Matthew Wilcox wi...@linux.intel.com
Cc: Ross Zwisler ross.zwis...@linux.intel.com
Cc: Neil Brown ne...@suse.de
Cc: Alasdair Kergon a...@redhat.com
Cc: Mike Snitzer snit...@redhat.com
Cc: Chris Mason c...@fb.com
Cc: Boaz Harrosh b...@plexistor.com
Cc: Theodore Ts'o ty...@mit.edu
Cc: Jan Kara j...@suse.cz
Cc: Julia Lawall julia.law...@lip6.fr
Cc: Martin K. Petersen martin.peter...@oracle.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 arch/powerpc/sysdev/axonram.c       |  2 +
 block/bio-integrity.c               |  8 ++--
 block/bio.c                         | 40 +++---
 block/blk-core.c                    |  4 +-
 block/blk-integrity.c               |  3 +-
 block/blk-lib.c                     |  2 +
 block/blk-merge.c                   |  7 ++--
 block/bounce.c                      | 24 ++---
 drivers/block/aoe/aoecmd.c          |  8 ++--
 drivers/block/brd.c                 |  2 +
 drivers/block/drbd/drbd_bitmap.c    |  5 ++-
 drivers/block/drbd/drbd_main.c      |  6 ++-
 drivers/block/drbd/drbd_receiver.c  |  4 +-
 drivers/block/drbd/drbd_worker.c    |  3 +-
 drivers/block/floppy.c              |  6

[PATCH v2 04/10] dma-mapping: allow archs to optionally specify a ->map_pfn() operation

2015-05-06 Thread Dan Williams
This is in support of enabling block device drivers to perform DMA
to/from persistent memory which may not have a backing struct page
entry.
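
As a rough illustration of the shape of this change (not the kernel code
itself), the sketch below models a no-IOMMU mapping in which the
pfn-based operation does the real work and the page-based entry point is
a thin wrapper around it.  All sketch_* names, the 4-entry page array and
the base frame number are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12		/* assumes 4 KiB pages */
#define SKETCH_BASE_PFN   100UL		/* first frame covered by the array */
#define SKETCH_NR_PAGES   4

typedef uint64_t sketch_dma_addr_t;

/* Typed wrapper around a raw page frame number. */
typedef struct {
	unsigned long pfn;
} sketch_pfn_t;

/* Stand-in for struct page plus a tiny "mem_map" describing four frames. */
struct sketch_page {
	int unused;
};
static struct sketch_page sketch_mem_map[SKETCH_NR_PAGES];

/* Convert a page descriptor to its frame number via its array index. */
static unsigned long sketch_page_to_pfn(const struct sketch_page *page)
{
	assert(page >= sketch_mem_map &&
	       page < sketch_mem_map + SKETCH_NR_PAGES);
	return SKETCH_BASE_PFN + (unsigned long)(page - sketch_mem_map);
}

/* pfn-based mapping: works even when no page descriptor exists. */
static sketch_dma_addr_t sketch_map_pfn(sketch_pfn_t pfn, size_t offset)
{
	return ((sketch_dma_addr_t)pfn.pfn << SKETCH_PAGE_SHIFT) + offset;
}

/* page-based mapping, kept for existing callers: defers to the pfn path. */
static sketch_dma_addr_t sketch_map_page(const struct sketch_page *page,
					 size_t offset)
{
	sketch_pfn_t pfn = { sketch_page_to_pfn(page) };

	return sketch_map_pfn(pfn, offset);
}
```

A frame outside the array, such as one backing persistent memory, can
still be mapped through sketch_map_pfn(), while callers that hold a page
descriptor keep using sketch_map_page() unchanged.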

Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 arch/Kconfig                             |  3 +++
 include/asm-generic/dma-mapping-common.h | 30 ++
 include/asm-generic/pfn.h                |  9 +
 include/linux/dma-debug.h                | 23 +++
 include/linux/dma-mapping.h              |  8 +++-
 lib/dma-debug.c                          | 10 ++
 6 files changed, 74 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb24997..f7f800860c00 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,6 +203,9 @@ config HAVE_DMA_ATTRS
 config HAVE_DMA_CONTIGUOUS
bool
 
+config HAVE_DMA_PFN
+   bool
+
 config GENERIC_SMP_IDLE_THREAD
bool
 
diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
index 940d5ec122c9..7305efb1bac6 100644
--- a/include/asm-generic/dma-mapping-common.h
+++ b/include/asm-generic/dma-mapping-common.h
@@ -17,9 +17,15 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
 
kmemcheck_mark_initialized(ptr, size);
BUG_ON(!valid_dma_direction(dir));
+#ifdef CONFIG_HAVE_DMA_PFN
+   addr = ops->map_pfn(dev, page_to_pfn_typed(virt_to_page(ptr)),
+                       (unsigned long)ptr & ~PAGE_MASK, size,
+                       dir, attrs);
+#else
addr = ops->map_page(dev, virt_to_page(ptr),
                     (unsigned long)ptr & ~PAGE_MASK, size,
                     dir, attrs);
+#endif
debug_dma_map_page(dev, virt_to_page(ptr),
                   (unsigned long)ptr & ~PAGE_MASK, size,
                   dir, addr, true);
@@ -73,6 +79,29 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg
ops->unmap_sg(dev, sg, nents, dir, attrs);
 }
 
+#ifdef CONFIG_HAVE_DMA_PFN
+static inline dma_addr_t dma_map_pfn(struct device *dev, __pfn_t pfn,
+ size_t offset, size_t size,
+ enum dma_data_direction dir)
+{
+   struct dma_map_ops *ops = get_dma_ops(dev);
+   dma_addr_t addr;
+
+   BUG_ON(!valid_dma_direction(dir));
+   addr = ops->map_pfn(dev, pfn, offset, size, dir, NULL);
+   debug_dma_map_pfn(dev, pfn, offset, size, dir, addr, false);
+
+   return addr;
+}
+
+static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
+ size_t offset, size_t size,
+ enum dma_data_direction dir)
+{
+   kmemcheck_mark_initialized(page_address(page) + offset, size);
+   return dma_map_pfn(dev, page_to_pfn_typed(page), offset, size, dir);
+}
+#else
 static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
  size_t offset, size_t size,
  enum dma_data_direction dir)
@@ -87,6 +116,7 @@ static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
 
return addr;
 }
+#endif /* CONFIG_HAVE_DMA_PFN */
 
 static inline void dma_unmap_page(struct device *dev, dma_addr_t addr,
  size_t size, enum dma_data_direction dir)
diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
index 91171e0285d9..c1fdf41fb726 100644
--- a/include/asm-generic/pfn.h
+++ b/include/asm-generic/pfn.h
@@ -48,4 +48,13 @@ static inline __pfn_t page_to_pfn_t(struct page *page)
 
return pfn;
 }
+
+static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
+{
+#if IS_ENABLED(CONFIG_PMEM_IO)
+   if (pfn.pfn < PAGE_OFFSET)
+   return pfn.pfn;
+#endif
+   return page_to_pfn(__pfn_t_to_page(pfn));
+}
 #endif /* __ASM_PFN_H */
diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h
index fe8cb610deac..a3b4c8c0cd68 100644
--- a/include/linux/dma-debug.h
+++ b/include/linux/dma-debug.h
@@ -34,10 +34,18 @@ extern void dma_debug_init(u32 num_entries);
 
 extern int dma_debug_resize_entries(u32 num_entries);
 
-extern void debug_dma_map_page(struct device *dev, struct page *page,
-  size_t offset, size_t size,
-  int direction, dma_addr_t dma_addr,
-  bool map_single);
+extern void debug_dma_map_pfn(struct device *dev, __pfn_t pfn, size_t offset,
+ size_t size, int direction, dma_addr_t dma_addr,
+ bool map_single);
+
+static inline void debug_dma_map_page(struct device *dev, struct page *page,
+ size_t offset, size_t size,
+ int direction, dma_addr_t dma_addr,
+ bool map_single

Re: [dm-devel] [PATCH stable] block: discard bdi_unregister() in favour of bdi_destroy()

2015-05-06 Thread Dan Williams
On Wed, Apr 29, 2015 at 5:32 PM, NeilBrown ne...@suse.de wrote:

 bdi_unregister() now contains very little functionality.

 It contains a WARN_ON if bdi->dev is NULL.  This warning is of no
 real consequence as bdi->dev isn't needed by anything else in the function,
 and it triggers if
    blk_cleanup_queue() -> bdi_destroy()
 is called before bdi_unregister, which a subsequent patch will make happen.
 So this isn't wanted.

 It also calls bdi_set_min_ratio().  This needs to be called after
 writes through the bdi have all been flushed, and before the bdi is destroyed.
 Calling it early is better than calling it late as it frees up a global
 resource.

 Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
 perfectly fits these requirements.

 So bdi_unregister can be discarded with the important content moved to
 bdi_destroy, as can the
   writeback_bdi_unregister
 event which is already not used.

 This is tagged for 'stable' as it is a pre-requisite for a subsequent
 patch which moves calls to blk_cleanup_queue() before calls to
 del_gendisk().  The commit identified as 'Fixes' removed a lot of
 other functionality from bdi_unregister(), and made a change which
 necessitated moving the blk_cleanup_queue() calls.

 Reported-by: Mike Snitzer snit...@redhat.com
 Cc: Christoph Hellwig h...@lst.de
 Cc: Peter Zijlstra pet...@infradead.org
 Cc: sta...@vger.kernel.org (v4.0)
 Fixes: c4db59d31e39ea067c32163ac961e9c80198fd37
 Signed-off-by: NeilBrown ne...@suse.de

 ---

 Hi again Jens,
  would you be able to queue this patch *before* the other one:
block: destroy bdi before blockdev is unregistered.

  If it has to come after I'll need to re-write the text a bit.
  If you could give me the commit hash to reference I'll do that.

Seems it is after:

http://git.kernel.dk/?p=linux-block.git;a=commit;h=6cd18e71

Also, we gave both patches a try internally after seeing the duplicate
sysfs warning.  You can add:

Acked-by: Dan Williams dan.j.willi...@intel.com
Tested-by: Nicholas Moulin nicholas.w.mou...@linux.intel.com

...on the re-send.

Thanks Neil!


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-06 Thread Dan Williams
On Wed, May 6, 2015 at 5:19 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Wed, May 6, 2015 at 4:47 PM, Dan Williams dan.j.willi...@intel.com wrote:

 Conceptually better, but certainly more difficult to audit if the fake
 struct page is initialized in a subtle way that breaks when/if it
 leaks to some unwitting context.

 Maybe. It could go either way, though. In particular, with the
 dynamically allocated struct page approach, if somebody uses it past
 the supposed lifetime of the use, things like poisoning the temporary
 struct page could be fairly effective. You can't really poison the
 pfn - it's just a number, and if somebody uses it later than you think
 (and you have re-used that physical memory for something else), you'll
 never ever know.

True, but there's little need to poison a __pfn_t because it's
permanent once discovered via ->direct_access() on the hosting struct
block_device.  Sure, kmap_atomic_pfn_t() may fail when the pmem driver
unbinds from a device, but the __pfn_t is still valid.  Obviously, we
can only support atomic kmap(s) with this property, and it would be
nice to fault if someone continued to use the __pfn_t after the
hosting device was disabled.  To be clear, DAX has this same problem
today.  Nothing stops whoever called ->direct_access() from continuing
to use the pfn after the backing device has been disabled.

 I'd *assume* that most users of the dynamic struct page allocation
 have very clear lifetime rules. Those things would presumably normally
 get looked-up by some extended version of get_user_pages(), and
 there's a clear use of the result, with no longer lifetime. Also, you
 do need to have some higher-level locking when you do this, to make
 sure that the persistent pages don't magically get re-assigned. We're
 presumably talking about having a filesystem in that persistent
 memory, so we cannot be doing IO to the pages (from some other source
 - whether RDMA or some special zero-copy model) while the underlying
 filesystem is reassigning the storage because somebody deleted the
 file.

 IOW, there had better be other external rules about when - and how
 long - you can use a particular persistent page. No? So the whole
 when/how to allocate the temporary 'struct page' is just another
 detail in that whole thing.

 And yes, some uses may not ever actually see that. If the whole of
 persistent memory is just assigned to a database or something, and the
 DB just wants to do a "flush this range of persistent memory to
 long-term disk storage", then there may not be much of a lifetime
 issue for the persistent memory. But even then you're going to have IO
 completion callbacks etc to let the DB know that it has hit the disk,
 so..

 What is the primary thing that is driving this need? Do we have a very
 concrete example?

My pet concrete example is covered by __pfn_t.  Referencing persistent
memory in an md/dm hierarchical storage configuration.  Setting aside
the thrash to get existing block users to do bvec_set_page(page)
instead of bvec->page = page, the onus is on that md/dm
implementation and backing storage device driver to operate on
__pfn_t.  That use case is simple because there is no use of page
locking or refcounting in that path, just dma_map_page() and
kmap_atomic().  The more difficult use case is precisely what Al
picked up on, O_DIRECT and RDMA.  This patchset does nothing to
address those use cases outside of not needing a struct page when they
eventually craft a bio.

I know Matthew Wilcox has explored the idea of get_user_sg() and let
the scatterlist hold the reference count and locks, but I'll let him
speak to that.

I still see __pfn_t as generally useful for the simple in-kernel
stacked-block-i/o use case.