Re: [Linux-nvdimm] another pmem variant
On Wed, Mar 25, 2015 at 10:04 AM, Christoph Hellwig h...@lst.de wrote:
On Wed, Mar 25, 2015 at 10:00:26AM -0700, Dan Williams wrote:

The kernel command line would simply be the standard/existing memmap= to reserve a memory range. Then, when the platform device loads, it does a request_firmware() to inject a binary table that further carves memory into ranges to which the pmem driver attaches. No need for the legacy system BIOS to be upgraded to the new way.

Ewww...

It does do the right thing in kernel space. The userspace utility creates the binary table (once) that can be compiled into the platform device driver or auto-loaded by an initrd.

The problem with a new memmap= is that it is too coarse. For example you can't do things like specify a pmem range per-NUMA node.

Sure you can as long as you know the layout. memmap= can be specified multiple times.

Again, I see absolutely zero benefit of doing crap like request_firmware() to convert interface, and I'm also tired of having this talk about code that will eventually be released and should be superior (and from all that I can guess so far will actually be far worse).

You and me both...

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [Linux-nvdimm] [PATCH 3/3] x86: add support for the non-standard protected e820 type
On Wed, Mar 25, 2015 at 1:23 PM, Ross Zwisler ross.zwis...@linux.intel.com wrote:
On Wed, 2015-03-25 at 17:04 +0100, Christoph Hellwig wrote:

Various recent bioses support NVDIMMs or ADR using a non-standard e820 memory type, and Intel supplied reference Linux code using this type to various vendors.

Wire this e820 table type up to export platform devices for the pmem driver so that we can use it in Linux, and also provide a memmap= argument to manually tag memory as protected, which can be used if the bios doesn't use the standard nonstandard interface, or we just want to test the pmem driver with regular memory.

Based on an earlier patch from Dave Jiang dave.ji...@intel.com
Signed-off-by: Christoph Hellwig h...@lst.de

[snip]

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca..93a27e4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1430,6 +1430,19 @@ config ILLEGAL_POINTER_VALUE

 source "mm/Kconfig"

+config X86_PMEM_LEGACY
+	bool "Support non-stanard NVDIMMs and ADR protected memory"
+	help
+	  Treat memory marked using the non-stard e820 type of 12 as used
+	  by the Intel Sandy Bridge-EP reference BIOS as protected memory.
+	  The kernel will the offer these regions to the pmem driver so
+	  they can be used for persistent storage.
+
+	  If you say N the kernel will treat the ADR region like an e820
+	  reserved region.
+
+	  Say Y if unsure

Would it make sense to have this default to y, or is that too strong?

We never default new enabling to y. Maybe some exceptions, but this isn't one of them in my mind.
Re: [Linux-nvdimm] [PATCH 3/3] x86: add support for the non-standard protected e820 type
On Wed, Mar 25, 2015 at 9:04 AM, Christoph Hellwig h...@lst.de wrote:

Various recent bioses support NVDIMMs or ADR using a non-standard e820 memory type, and Intel supplied reference Linux code using this type to various vendors.

Wire this e820 table type up to export platform devices for the pmem driver so that we can use it in Linux, and also provide a memmap= argument to manually tag memory as protected, which can be used if the bios doesn't use the standard nonstandard interface, or we just want to test the pmem driver with regular memory.

Based on an earlier patch from Dave Jiang dave.ji...@intel.com
Signed-off-by: Christoph Hellwig h...@lst.de
---
[..]
+static __init int register_pmem_devices(void)
+{
+	int i;
+
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+
+		if (ei->type == E820_PROTECTED_KERN) {
+			struct resource res = {
+				.flags	= IORESOURCE_MEM,
+				.start	= ei->addr,
+				.end	= ei->addr + ei->size - 1,
+			};
+			register_pmem_device(&res);
+		}
+	}
+
+	return 0;
+}

Aside from the s/E820_PROTECTED_KERN/E820_PMEM/ suggestion this looks ok to me. The vaporware new way can be a superset of this mechanism.
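
For readers without a kernel tree at hand, the loop in the quoted patch can be approximated in plain userspace C. This is only a sketch under stated assumptions: struct e820_entry, struct mem_range, E820_TYPE_PROTECTED and collect_ranges() are hypothetical stand-ins, not the kernel's definitions. The only details taken from the thread are the type value 12 and the inclusive end computed as addr + size - 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's e820 structures. */
#define E820_TYPE_RAM        1
#define E820_TYPE_PROTECTED 12  /* the non-standard type discussed in the thread */

struct e820_entry { uint64_t addr; uint64_t size; uint32_t type; };
struct mem_range  { uint64_t start; uint64_t end; /* end is inclusive */ };

/*
 * Copy every non-empty entry of the wanted type into out[] as an
 * inclusive [start, end] range. Returns the number of ranges written,
 * which never exceeds max_out.
 */
static size_t collect_ranges(const struct e820_entry *map, size_t nr,
                             uint32_t wanted,
                             struct mem_range *out, size_t max_out)
{
    size_t found = 0;

    assert(out != NULL || max_out == 0);
    for (size_t i = 0; i < nr && found < max_out; i++) {
        if (map[i].type != wanted || map[i].size == 0)
            continue;
        out[found].start = map[i].addr;
        out[found].end = map[i].addr + map[i].size - 1;
        found++;
    }
    return found;
}
```

In the patch each matching range is handed to register_pmem_device() as a struct resource; the sketch merely collects the ranges so that the arithmetic can be checked in isolation.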
Re: [Linux-nvdimm] [PATCH 4/6] SQUSHME: pmem: Micro cleaning
On Tue, Mar 31, 2015 at 8:24 AM, Boaz Harrosh b...@plexistor.com wrote:
On 03/31/2015 06:17 PM, Dan Williams wrote:
On Tue, Mar 31, 2015 at 6:27 AM, Boaz Harrosh b...@plexistor.com wrote:

Some error checks had unlikely some did not. Put unlikely on all error handling paths. (I like unlikely for error paths specially for readability)

unlikely() is not a readability hint, it's specifically for branches that profiling shows adding it makes a difference. Just delete them all until profiling shows they make a difference. They certainly don't make a difference in the slow paths.

Why?

Because the compiler and cpu already do a decent job, and if you get the frequency wrong it can hurt performance [1]. It's premature optimization to sprinkle them around, especially in slow paths.

[1]: https://lwn.net/Articles/420019/
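
For context on what is being argued about: with GCC and Clang, unlikely() is a thin wrapper around __builtin_expect(), which leaves the value of the condition unchanged and only tells the compiler which outcome to lay out as the fall-through path. A minimal userspace sketch follows; the two macros have the same shape as the kernel's, while checked_div() is a hypothetical helper made up for illustration.

```c
#include <assert.h>

/* Same shape as the kernel macros; needs a GCC-compatible compiler. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Hypothetical helper: the annotation changes code layout at most,
 * never the result. On error, *out is left untouched.
 */
static int checked_div(int a, int b, int *out)
{
    if (unlikely(b == 0))
        return -1;
    *out = a / b;
    return 0;
}
```

Since the result is identical with or without the hint, whether it helps is purely a measurement question, which is the point made in the reply above.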
Re: [Linux-nvdimm] [PATCH 4/6] SQUSHME: pmem: Micro cleaning
On Tue, Mar 31, 2015 at 6:27 AM, Boaz Harrosh b...@plexistor.com wrote:

Some error checks had unlikely some did not. Put unlikely on all error handling paths. (I like unlikely for error paths specially for readability)

unlikely() is not a readability hint, it's specifically for branches that profiling shows adding it makes a difference. Just delete them all until profiling shows they make a difference. They certainly don't make a difference in the slow paths.
Re: [Linux-nvdimm] another pmem variant V2
On Tue, Mar 31, 2015 at 10:24 AM, Christoph Hellwig h...@lst.de wrote:
On Tue, Mar 31, 2015 at 06:44:56PM +0200, Ingo Molnar wrote:

I'd be fine with that too - mind sending an updated series?

I will send an updated one tonight or early tomorrow.

Btw, do you want to keep the E820_PRAM name instead of E820_PMEM? Seems like most people either don't care or prefer E820_PMEM. I'm fine either way.

FWIW, I like the idea of having a separate E820_PRAM name for type-12 memory vs future can't yet disclose UEFI memory type. The E820_PRAM type potentially has the property of being relegated to legacy NVDIMMs. We can later add E820_PMEM as a memory type that, for example, is not automatically backed by struct page.

That said, I'm fine either way.
Re: [PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY
On Mon, 2015-02-23 at 14:33 +0200, Boaz Harrosh wrote:

There is something not very nice (Gentlemen nice) In current e820.c code.

At Multiple places for example like (@ memblock_x86_fill()) it will add the different memory resources *except the E820_RESERVED type*

Then at e820_reserve_resources() it will mark all !E820_RESERVED as busy.

This is all fine when we have only the known types one of:
  E820_RESERVED_KERN:
  E820_RAM:
  E820_ACPI:
  E820_NVS:
  E820_UNUSABLE:
  E820_RESERVED:

But if the system encounters a brand new memory type it will not add it to any memory list, But will proceed to mark it BUSY. So now any other Driver in the system that does know how to deal with this new type, is not able to call request_mem_region_exclusive() on this new type because it is hard coded BUSY even though nothing really uses it.

So make any unknown type behave like E820_RESERVED memory, it will show up as available to first caller of request_mem_region_exclusive().

I Also change the string representation of an unknown type from reserved (So to not confuse with memmap reserved region). And call it reserved-unknown

I wish I could return reserved-type-X But this is not possible because one must return a constant, code-segment, string.

(NOTE: These unknown-types where called reserved in /proc/iomem and in dmesg but behaved differently. What this patch does is name them differently but let them behave the same)

By Popular demand An Extra WARNING message is printed if an UNKNOWN is found. It will look like this:
  e820: WARNING [mem 0x1-0x1] is unknown type 12

I don't think we need to warn that an unknown range was published, just warn if it is consumed. Something like these incremental changes. I don't see the need for patch 2 or either version of patch 3.

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 1afa5518baa6..2e755a92d84f 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -134,11 +134,6 @@ static void __init __e820_add_region(struct e820map *e820x, u64 start, u64 size,
 		return;
 	}

-	if (unlikely(_is_unknown_type(type)))
-		pr_warn("e820: WARNING [mem %#010llx-%#010llx] is unknown type %d\n",
-			(unsigned long long) start,
-			(unsigned long long) (start + size - 1), type);
-
 	e820x->map[x].addr = start;
 	e820x->map[x].size = size;
 	e820x->map[x].type = type;
@@ -938,7 +933,7 @@ static inline const char *e820_type_to_string(int e820_type)
 	case E820_NVS:		return "ACPI Non-volatile Storage";
 	case E820_UNUSABLE:	return "Unusable memory";
 	case E820_RESERVED:	return "reserved";
-	default:		return "reserved-unkown";
+	default:		return iomem_unknown_resource_name;
 	}
 }

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 2c525078..d857e79b4bf2 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -194,6 +194,9 @@ extern struct resource * __request_region(struct resource *,
 					resource_size_t n,
 					const char *name, int flags);

+/* For uniquely tagging unknown memory so we can warn when it is consumed */
+extern const char iomem_unknown_resource_name[];
+
 /* Compatibility cruft */
 #define release_region(start,n)	__release_region(&ioport_resource, (start), (n))
 #define check_mem_region(start,n)	__check_region(&iomem_resource, (start), (n))
diff --git a/kernel/resource.c b/kernel/resource.c
index 0bcebffc4e77..38b36c212a48 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1040,6 +1040,8 @@ resource_size_t resource_alignment(struct resource *res)

 static DECLARE_WAIT_QUEUE_HEAD(muxed_resource_wait);

+const char iomem_unknown_resource_name[] = { "reserved-unknown" };
+
 /**
  * __request_region - create a new busy resource region
  * @parent: parent resource descriptor
@@ -1092,6 +1094,15 @@ struct resource * __request_region(struct resource *parent,
 		break;
 	}
 	write_unlock(&resource_lock);
+
+	if (res && res->parent &&
+	    res->parent->name == iomem_unknown_resource_name) {
+		add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);
+		pr_warn("request unknown region [mem %#010llx-%#010llx] %s\n",
+			res->start, res->end,
+			res->name);
+	}
+
 	return res;
 }
 EXPORT_SYMBOL(__request_region);
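
The incremental patch above relies on pointer identity rather than string comparison: a region counts as "unknown" only if its name points at the one shared array, not merely at an equal string. A small userspace sketch of that idea, assuming a hypothetical struct res_node in place of struct resource and a local unknown_resource_name in place of iomem_unknown_resource_name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One shared definition; its address, not its contents, is the tag. */
static const char unknown_resource_name[] = "reserved-unknown";

/* Hypothetical stand-in for struct resource. */
struct res_node {
    const char *name;
    const struct res_node *parent;
};

/*
 * True only when res sits directly under a region tagged with the
 * sentinel. Addresses are compared, so a look-alike string elsewhere
 * in memory does not match.
 */
static bool parent_is_unknown(const struct res_node *res)
{
    return res && res->parent &&
           res->parent->name == unknown_resource_name;
}
```

The comparison costs one pointer compare and needs no new flag bit; the price is a global symbol that every architecture carries, which is the objection raised in the follow-up message.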
Re: [PATCH 2/3 v4] resource: Add new flag IORESOURCE_MEM_WARN
On Tue, Feb 24, 2015 at 7:00 AM, Boaz Harrosh b...@plexistor.com wrote:

Resource providers set this flag if they want that request_region will print a warning in dmesg if this particular memory resource is locked by a driver. Thus acting as a Protocol Police about experimental devices that did not pass a committee approval.

The Only user of this flag is x86/kernel/e820.c that wants to WARN about UNKNOWN memory types.

NOTE: It would be preferred if I defined a general flag say IORESOURCE_WARN, where any kind of resource provider can WARN on use, but we have run out of flags in the 32bit long systems. So I defined a free bit from the resource specific flags for mem resources. This is why I need to check if this is a memory resource first so not to conflict with other resource specific flags. (Though actually no one is using this specific bit)

CC: Thomas Gleixner t...@linutronix.de
CC: Ingo Molnar mi...@redhat.com
CC: H. Peter Anvin h...@zytor.com
CC: x...@kernel.org
CC: Dan Williams dan.j.willi...@intel.com
CC: Andrew Morton a...@linux-foundation.org
CC: Bjorn Helgaas bhelg...@google.com
CC: Vivek Goyal vgo...@redhat.com
Signed-off-by: Boaz Harrosh b...@plexistor.com
---
 arch/x86/kernel/e820.c | 3 +++
 include/linux/ioport.h | 1 +
 kernel/resource.c      | 9 ++++++++-
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 1a8a1c3..105bb37 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -961,6 +961,9 @@ void __init e820_reserve_resources(void)

 		res->flags = IORESOURCE_MEM;

+		if (_is_unknown_type(e820.map[i].type))
+			res->flags |= IORESOURCE_MEM_WARN;
+
 		/*
 		 * don't register the region that could be conflicted with
 		 * pci device BAR resource and insert them later in
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 2c525022..f78972b 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -90,6 +90,7 @@ struct resource {
 #define IORESOURCE_MEM_32BIT		(3<<3)
 #define IORESOURCE_MEM_SHADOWABLE	(1<<5)	/* dup: IORESOURCE_SHADOWABLE */
 #define IORESOURCE_MEM_EXPANSIONROM	(1<<6)
+#define IORESOURCE_MEM_WARN		(1<<7)	/* WARN if requested by driver */

 /* PnP I/O specific bits (IORESOURCE_BITS) */
 #define IORESOURCE_IO_16BIT_ADDR	(1<<0)
diff --git a/kernel/resource.c b/kernel/resource.c
index 19f2357..4bab16f 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1075,8 +1075,15 @@ struct resource * __request_region(struct resource *parent,
 			break;
 		if (conflict != parent) {
 			parent = conflict;
-			if (!(conflict->flags & IORESOURCE_BUSY))
+			if (!(conflict->flags & IORESOURCE_BUSY)) {
+				if (unlikely(

No need for unlikely(), this isn't a hot path.

+				    (resource_type(conflict) == IORESOURCE_MEM) &&
+				    (conflict->flags & IORESOURCE_MEM_WARN)))
+					pr_warn("request region with unknown memory type [mem %#010llx-%#010llx] %s\n",
+						conflict->start, conflict->end,
+						conflict->name);

I think this should also dump the res->name to identify who is requesting it.
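
The commit message explains why the resource type has to be checked before the new bit: the low flag bits are bus-specific, so bit 7 only means "warn" on a memory resource. A userspace sketch of that two-step test follows. The type constants are written from memory of include/linux/ioport.h of that era and should be treated as assumptions; IORESOURCE_MEM_WARN is the bit proposed in this patch, and should_warn() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror include/linux/ioport.h; verify against a real tree. */
#define IORESOURCE_TYPE_BITS 0x00001f00UL
#define IORESOURCE_IO        0x00000100UL
#define IORESOURCE_MEM       0x00000200UL
/* Bit proposed by the patch above, taken from the memory-specific bits. */
#define IORESOURCE_MEM_WARN  (1UL << 7)

/*
 * Hypothetical helper: warn only if the resource is a memory resource
 * AND carries the memory-specific warn bit. The same bit on an I/O
 * port resource belongs to a different namespace and is ignored.
 */
static bool should_warn(unsigned long flags)
{
    return (flags & IORESOURCE_TYPE_BITS) == IORESOURCE_MEM &&
           (flags & IORESOURCE_MEM_WARN) != 0;
}
```

Checking the type first is what keeps the reused bit from colliding with whatever another bus type may later define at the same position.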
Re: [PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY
On Mon, Feb 23, 2015 at 11:59 PM, Boaz Harrosh b...@plexistor.com wrote:

No, this is a complete HACK, since when do we hard code specific (GLOBAL) ARCHs strings in common code. Please look at linux/ioport.h see the richness of options for all kind of buses and systems. The flag system works perfectly and I just continue this here.

And really DAN, you prefer a global string that's dead garbage in 99% of arches to a simple bit flag definition that costs nothing? I don't think so.

Glad we're moving ahead with the IORESOURCE_MEM_WARN solution rather than this or the 64-bit-limited IORESOURCE_WARN approach.

+	add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);

NACK!!

I disagree. Ultimately what goes into kernel/resource.c is not up to me, but firmware/driver combinations that subvert standards should be flagged by the kernel. Stepping back from the original motivation, in the general case, an unknown memory type is indiscernible from a BIOS bug. TAINT_FIRMWARE_WORKAROUND is simply a notification that firmware needs to be updated, and I believe a driver attaching to unknown memory is such an event. It does not block a user from using that memory however he or she sees fit.
Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper
On Thu, Mar 26, 2015 at 11:46 AM, Elliott, Robert (Server Storage) elli...@hp.com wrote:

-----Original Message-----
From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
Sent: Thursday, March 26, 2015 11:43 AM
To: Boaz Harrosh
Cc: Christoph Hellwig; Ingo Molnar; linux-nvd...@ml01.01.org; linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; x...@kernel.org; ross.zwis...@linux.intel.com; ax...@kernel.dk
Subject: Re: [PATCH 2/3] x86: add a is_e820_ram() helper

On Thu, Mar 26, 2015 at 05:49:38PM +0200, Boaz Harrosh wrote:

+	memmap=nn[KMG]!ss[KMG]
+			[KNL,X86] Mark specific memory as protected.
+			Region of memory to be used, from ss to ss+nn.
+			The memory region may be marked as e820 type 12 (0xc)
+			and is NVDIMM or ADR memory.
+

Do we need to escape \! this character on grub command line ? It might help to note that. I did like the original | BTW

No need to escape it on the kvm command line, which is where I tested this flag only so far. If there is a strong argument for | I'm happy to change it.

I agree with Boaz that ! is a nuisance if loading pmem as a module with modprobe from bash.

This is a core kernel command line, not a module parameter. I'm not saying that it should stay !, but modprobe will not need to deal with it.
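
To make the proposed memmap=nn[KMG]!ss[KMG] syntax concrete, here is a userspace sketch of how such an argument could be split into a size and a start address. It is not the kernel's parser: parse_size() only loosely imitates the K/M/G suffix handling of the kernel's memparse(), and parse_protected_range() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a number with an optional K/M/G suffix and advance *s past it.
 * Base 0 lets strtoull() accept decimal, octal and 0x-prefixed hex.
 */
static uint64_t parse_size(const char **s)
{
    char *end;
    uint64_t v = strtoull(*s, &end, 0);

    switch (*end) {
    case 'G': case 'g': v <<= 10; /* fall through */
    case 'M': case 'm': v <<= 10; /* fall through */
    case 'K': case 'k': v <<= 10; end++; break;
    default: break;
    }
    *s = end;
    return v;
}

/*
 * Split "nn[KMG]!ss[KMG]" into *size (nn) and *start (ss).
 * Returns false for malformed input or a zero-sized region.
 */
static bool parse_protected_range(const char *arg,
                                  uint64_t *size, uint64_t *start)
{
    const char *p = arg;
    const char *q;

    *size = parse_size(&p);
    if (p == arg || *p != '!')
        return false;
    q = ++p;
    *start = parse_size(&p);
    return p != q && *p == '\0' && *size != 0;
}
```

With this sketch, "4G!12G" describes the region from 12 GiB up to 16 GiB, matching the "from ss to ss+nn" wording in the quoted documentation. The escaping concern raised in the thread applies to interactive shells with history expansion, not to the parser itself.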
Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper
On Thu, Mar 26, 2015 at 1:01 AM, Christoph Hellwig h...@lst.de wrote:
On Wed, Mar 25, 2015 at 07:15:42PM -0700, Dan Williams wrote:

Random thought, type-12 memory happens to correspond to legacy NVDIMM systems with smaller capacities. Perhaps new NVDIMM should not be is_e820_ram() by default?

Let's look into that once we can see the spec..

Based on an earlier patch from Dave Jiang dave.ji...@intel.com.

...which was based on an earlier patch by me, it's been nearly 4 years to come full circle.

That's the attribution in the patch I have access to. I can add you to the credits if you want.

Yes, please attribute Dave and myself.

...and for the series:

Acked-by: Dan Williams dan.j.willi...@intel.com
Re: [Linux-nvdimm] [PATCH] SQUASHME: Streamline pmem.c
On Thu, Mar 26, 2015 at 10:02 AM, Boaz Harrosh b...@plexistor.com wrote:

Christoph why did you choose the fat and ugly version of pmem.c beats me.

Boaz, I am so very tired of your snide commentary. It severely detracts from the technical merit of your patches. Please stop.
Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper
On Wed, Mar 25, 2015 at 9:04 AM, Christoph Hellwig h...@lst.de wrote:

This will allow to deal with persistent memory which needs to be treated like ram in many, but not all cases.

Random thought, type-12 memory happens to correspond to legacy NVDIMM systems with smaller capacities. Perhaps new NVDIMM should not be is_e820_ram() by default?

Based on an earlier patch from Dave Jiang dave.ji...@intel.com.

...which was based on an earlier patch by me, it's been nearly 4 years to come full circle.
Re: [Linux-nvdimm] [PATCH 1/3] pmem: Initial version of persistent memory driver
On Thu, Mar 26, 2015 at 1:32 AM, Christoph Hellwig h...@lst.de wrote:
From: Ross Zwisler ross.zwis...@linux.intel.com

PMEM is a new driver that presents a reserved range of memory as a block device. This is useful for developing with NV-DIMMs, and can be used with volatile memory as a development platform.

Signed-off-by: Ross Zwisler ross.zwis...@linux.intel.com
[hch: convert to use a platform_device for discovery, fix partition support]
Signed-off-by: Christoph Hellwig h...@lst.de
Tested-by: Ross Zwisler ross.zwis...@linux.intel.com
---
 MAINTAINERS            |   6 +
 drivers/block/Kconfig  |  13 ++
 drivers/block/Makefile |   1 +
 drivers/block/pmem.c   | 373 +
 4 files changed, 393 insertions(+)
 create mode 100644 drivers/block/pmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 358eb01..efacf2b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8063,6 +8063,12 @@ S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c

+PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler ross.zwis...@linux.intel.com
+L:	linux-nvd...@lists.01.org
+S:	Supported
+F:	drivers/block/pmem.c
+
 RANDOM NUMBER DRIVER
 M:	Theodore Ts'o ty...@mit.edu
 S:	Maintained
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 1b8094d..9284aaf 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -404,6 +404,19 @@ config BLK_DEV_RAM_DAX
 	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).

+config BLK_DEV_PMEM
+	tristate "Persistent memory block device support"
+	help
+	  Saying Y here will allow you to use a contiguous range of reserved
+	  memory as one or more block devices. Memory for PMEM should be
+	  reserved using the memmap kernel parameter.
+
+	  To compile this driver as a module, choose M here: the module will be
+	  called pmem.
+
+	  Most normal users won't need this functionality, and can thus say N
+	  here.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 02b688d..9cc6c18 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_PS3_VRAM)		+= ps3vram.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
+obj-$(CONFIG_BLK_DEV_PMEM)	+= pmem.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)	+= cciss.o
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
new file mode 100644
index 000..545b13b
--- /dev/null
+++ b/drivers/block/pmem.c
@@ -0,0 +1,373 @@
+/*
+ * Persistent Memory Driver
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * This driver is heavily based on drivers/block/brd.c.
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+
+#include <asm/cacheflush.h>
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/init.h>
+#include <linux/platform_device.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/slab.h>
+
+#define SECTOR_SHIFT		9
+#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
+#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
+
+#define PMEM_MINORS		16
+
+struct pmem_device {
+	struct request_queue	*pmem_queue;
+	struct gendisk		*pmem_disk;
+
+	/* One contiguous memory region per device */
+	phys_addr_t		phys_addr;
+	void			*virt_addr;
+	size_t			size;
+};
+
+static int pmem_major;
+static atomic_t pmem_index;
+
+static int pmem_getgeo(struct block_device *bd, struct hd_geometry *geo)
+{
+	/* some standard values */
+	geo->heads = 1 << 6;
+	geo->sectors = 1 << 5;
+	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
+	return 0;
+}
+
+/*
+ * direct translation from (pmem,sector) => void*
+ * We do not require that sector be page aligned.
+ * The return value will point to the beginning of the page containing the
+ * given sector, not to the sector itself.
+ */
+static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector)
+{
+
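
The quoted patch is cut off before the body of pmem_lookup_pg_addr(), so the real implementation is not shown here. The comment above it does state the contract: the result points at the start of the page containing the sector. The following userspace sketch illustrates that arithmetic only. It assumes 4 KiB pages (PAGE_SHIFT of 12), and lookup_pg_addr() with its bounds check is hypothetical, not the driver's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SHIFT        9                    /* 512-byte sectors */
#define PAGE_SHIFT          12                   /* assumption: 4 KiB pages */
#define PAGE_SECTORS_SHIFT  (PAGE_SHIFT - SECTOR_SHIFT)
#define PAGE_SECTORS        (1 << PAGE_SECTORS_SHIFT)

/* Byte offset of the start of the page that contains `sector`. */
static uint64_t sector_to_page_offset(uint64_t sector)
{
    return (sector >> PAGE_SECTORS_SHIFT) << PAGE_SHIFT;
}

/*
 * Hypothetical lookup: pointer to the page containing `sector` inside
 * a region of `size` bytes starting at `base`, or NULL when the page
 * lies outside the region. The page index is checked before any
 * shifting so that very large sector numbers cannot overflow.
 */
static void *lookup_pg_addr(void *base, size_t size, uint64_t sector)
{
    uint64_t page_index = sector >> PAGE_SECTORS_SHIFT;

    if (page_index >= ((uint64_t)size >> PAGE_SHIFT))
        return NULL;
    return (char *)base + (size_t)(page_index << PAGE_SHIFT);
}
```

The same shifts explain the geometry in pmem_getgeo(): 64 heads times 32 sectors is 2048, or 2^11, sectors per cylinder, hence the capacity shifted right by 11.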
Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper
On Thu, Mar 26, 2015 at 8:49 AM, Boaz Harrosh b...@plexistor.com wrote:
On 03/26/2015 11:34 AM, Christoph Hellwig wrote:

+/*
+ * This is a non-standardized way to represent ADR or NVDIMM regions that
+ * persist over a reboot. The kernel will ignore their special capabilities
+ * unless the CONFIG_X86_PMEM_LEGACY option is set.
+ *
+ * Note that older platforms also used 6 for the same type of memory,
+ * but newer versions switched to 12 as 6 was assigned differently. Some
+ * time they will learn..
+ */
+#define E820_PRAM	12

Why the PRAM Name. For one 2/3 of this patch say PMEM the Kconfig to enable is _PMEM_, the driver stack that gets loaded is pmem, so PRAM is unexpected.

Also I do believe PRAM is not the correct name. Yes NvDIMMs are RAM, but there are other not RAM technologies that can be supported exactly the same way. MEM is a more general name meaning on the memory bus. I think.

I would love the consistency.

One of the nice side effects of having a PRAM name is that we can later add a UEFI PMEM type where the distinction is that PRAM is included in the system memory map by default and PMEM is analogous to IOMEM. Just a thought...
Re: [Linux-nvdimm] [PATCH 1/3] pmem: Initial version of persistent memory driver
On Thu, Mar 26, 2015 at 7:52 AM, Boaz Harrosh b...@plexistor.com wrote:
On 03/26/2015 04:12 PM, Dan Williams wrote:
On Thu, Mar 26, 2015 at 1:32 AM, Christoph Hellwig h...@lst.de wrote:
From: Ross Zwisler ross.zwis...@linux.intel.com

Dan something is Broken with you mailer program it keeps dropping the CC when sending replies. For example Both me and Ross who were on CC got dropped, Jens Axboe though got add back. Its not only this email, it is all the emails in this series, please check what is going on.

They show up in the archives: https://lists.01.org/pipermail/linux-nvdimm/2015-March/thread.html

Sometimes vger.kernel.org drops intel.com mails, it's outside my control.
Re: [PATCH 7/8] pmem: Add support for page structs
On Thu, Mar 5, 2015 at 3:59 AM, Boaz Harrosh b...@plexistor.com wrote:

One of the current shortcomings of the NVDIMM/PMEM support is that this memory does not have a page-struct(s) associated with its memory and therefor cannot be passed to a block-device or network or DMAed in any way through another device in the system.

The use of add_persistent_memory() fixes all this. After this patch an FS can do: bdev_direct_access(,pfn,);

Hmm, can we do this mapping on demand per direct access mapping rather than unconditionally for each range that pmem is handling? Going forward I don't think we want to be tied to guaranteeing that plain bdev_direct_access() always yields pfn_to_page()-capable pfns. Perhaps a DAX_MAP_PFN flag or something along those lines?
Re: [PATCH 02/21] ND NFIT-Defined/NVIDIMM Subsystem
On Mon, Apr 20, 2015 at 12:06 AM, Ingo Molnar mi...@kernel.org wrote:
* Dan Williams dan.j.willi...@intel.com wrote:

Maintainer information and documentation for drivers/block/nd/

Cc: Andy Lutomirski l...@amacapital.net
Cc: Boaz Harrosh b...@plexistor.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Jens Axboe ax...@fb.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Christoph Hellwig h...@lst.de
Cc: Neil Brown ne...@suse.de
Cc: Greg KH gre...@linuxfoundation.org
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 Documentation/blockdev/nd.txt | 867 +
 MAINTAINERS                   |  34 +-
 2 files changed, 895 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/blockdev/nd.txt

diff --git a/Documentation/blockdev/nd.txt b/Documentation/blockdev/nd.txt
new file mode 100644
index ..bcfdf21063ab
--- /dev/null
+++ b/Documentation/blockdev/nd.txt
@@ -0,0 +1,867 @@
+ The NFIT-Defined/NVDIMM Sub-system (ND)
+
+ nd - kernel abi / device-model & ndctl - userspace helper library
+ linux-nvd...@lists.01.org
+ v9: April 17th, 2015
+
+
+ Glossary
+
+ Overview
+   Supporting Documents
+   Git Trees
+
+ NFIT Terminology and NVDIMM Types
[...]
+The “NVDIMM Firmware Interface Table” (NFIT) [...]

Ok, I'll bite.

So why on earth is this whole concept and the naming itself ('drivers/block/nd/' stands for 'NFIT Defined', apparently) revolving around a specific 'firmware' mindset and revolving around specific, weirdly named, overly complicated looking firmware interfaces that come with their own new weird glossary??

There's only three core properties of NVDIMMs that this implementation cares about.

1/ directly mapped interleaved persistent memory (PMEM)
2/ indirect mmio aperture accessed (windowed) persistent memory (BLK)
3/ the possibility that those 2 access modes may alias the same on-media addresses

Most of the complexity of the implementation is dealing with aspect 3, but that complexity can be, and is, bypassed in places.

Firmware might be a discovery method - or not. A non-volatile device might be e820 enumerated, or PCI discovered - potentially with all discovery handled by the driver.

PCI attached non-volatile memory is NVMe. ND is handling address ranges that support direct cpu load store.

Why do you restrict this driver to a naming and design that is so firmware centric?

PMEM, BLK, and the fact that they may alias are the generic properties that are independent of the specification. Granted some of the NFIT terminology has leaked past the point of initial table parsing, but it's too early to start claiming restrictive design. We already support three ways of attaching PMEM with varying degrees of backing complexity, and we're more than willing to beat NFIT back where it makes sense to accommodate more non-NFIT NVDIMM implementations.

Discovery matters, but what matters _most_ to devices is actually its runtime properties and runtime implementation - and I sure hope firmware has no active role in that!

It doesn't. Once PMEM and BLK aliasing are resolved the firmware is out of the picture. In some cases this aliasing is resolved from the outset (simple memory range, type-12 etc...), the bulk of the implementation is bypassed in that case.

I really think this is backwards from the get go, it gives me a feeling of someone having spent way too much time in committee and too little time spent thinking about simple, proper kernel design and reusing existing terminology ...

The simple paths are there, in addition to support for the rest of the spec. Do we have an existing term for a dimm-relative-address in the kernel? Some of this is simply novel to the kernel.

Also:

+ nd - kernel abi / device-model & ndctl - userspace helper library

WTF is a 'kernel ABI'??

ABI like Documentation/ABI/, the sysfs layout and ioctls for passing a handful of management commands to firmware. Wherever possible all the slow path configuration is done with sysfs.
-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 02/21] ND NFIT-Defined/NVIDIMM Subsystem
On Mon, Apr 20, 2015 at 8:57 AM, Dan Williams dan.j.willi...@intel.com wrote:
On Mon, Apr 20, 2015 at 5:53 AM, Christoph Hellwig h...@lst.de wrote:

Once I'll go through this in more detail I'll comment more.

Sounds good.

Given that the ACPICA folks are going to define their own nfit.h with possibly different structure names, that damage should be limited to just acpi.c. Currently, changing nfit.h structure field names would impact multiple files. It's a straightforward rework to disentangle; I'll post patches soon.
Re: [Linux-nvdimm] [PATCH 08/21] nd: ndctl.h, the nd ioctl abi
On Tue, Apr 21, 2015 at 2:20 PM, Toshi Kani toshi.k...@hp.com wrote: On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote: Most configuration of the nd-subsystem is done via nd-sysfs. However, the NFIT specification defines a small set of messages that can be passed to the subsystem via platform-firmware-defined methods. The command set (as of the current version of the NFIT-DSM spec) is: NFIT_CMD_SMART: media health and diagnostics NFIT_CMD_GET_CONFIG_SIZE: size of the label space NFIT_CMD_GET_CONFIG_DATA: read label NFIT_CMD_SET_CONFIG_DATA: write label NFIT_CMD_VENDOR: vendor-specific command passthrough NFIT_CMD_ARS_CAP: report address-range-scrubbing capabilities NFIT_CMD_START_ARS: initiate scrubbing NFIT_CMD_QUERY_ARS: report on scrubbing state NFIT_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events Most of the commands target a specific dimm. However, the address-range-scrubbing commands target the entire NFIT-bus / platform. The 'commands' attribute of an nd-bus, or an nd-dimm enumerate the supported commands for that object. Cc: linux-a...@vger.kernel.org Cc: Robert Moore robert.mo...@intel.com Cc: Rafael J. 
Wysocki rafael.j.wyso...@intel.com Reported-by: Nicholas Moulin nicholas.w.mou...@linux.intel.com Signed-off-by: Dan Williams dan.j.willi...@intel.com --- drivers/block/nd/Kconfig | 11 + drivers/block/nd/acpi.c | 333 + drivers/block/nd/bus.c| 230 drivers/block/nd/core.c | 17 ++ drivers/block/nd/dimm_devs.c | 69 drivers/block/nd/nd-private.h | 11 + drivers/block/nd/nd.h | 21 +++ drivers/block/nd/test/nfit.c | 89 +++ include/uapi/linux/Kbuild |1 include/uapi/linux/ndctl.h| 178 ++ 10 files changed, 950 insertions(+), 10 deletions(-) create mode 100644 drivers/block/nd/nd.h create mode 100644 include/uapi/linux/ndctl.h diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig index 0106b3807202..6c15d10bf4e0 100644 --- a/drivers/block/nd/Kconfig +++ b/drivers/block/nd/Kconfig @@ -42,6 +42,17 @@ config NFIT_ACPI enables the core to craft ACPI._DSM messages for platform/dimm configuration. +config NFIT_ACPI_DEBUG + bool NFIT ACPI: Turn on extra debugging + depends on NFIT_ACPI + depends on DYNAMIC_DEBUG + default n + help + Enabling this option causes the nd_acpi driver to dump the + input and output buffers of _DSM operations on the ACPI0012 + device, which can be very verbose. Leave it disabled unless + you are debugging a hardware / firmware issue. 
+ config NFIT_TEST tristate NFIT TEST: Manufactured NFIT for interface testing depends on DMA_CMA diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c index 48db723d7a90..073ff28fdbfe 100644 --- a/drivers/block/nd/acpi.c +++ b/drivers/block/nd/acpi.c @@ -13,8 +13,10 @@ #include linux/list.h #include linux/acpi.h #include linux/mutex.h +#include linux/ndctl.h #include linux/module.h #include nfit.h +#include nd.h enum { NFIT_ACPI_NOTIFY_TABLE = 0x80, @@ -26,20 +28,330 @@ struct acpi_nfit { struct nd_bus *nd_bus; }; +static struct acpi_nfit *to_acpi_nfit(struct nfit_bus_descriptor *nfit_desc) +{ + return container_of(nfit_desc, struct acpi_nfit, nfit_desc); +} + +#define NFIT_ACPI_MAX_ELEM 4 +struct nfit_cmd_desc { + int in_num; + int out_num; + u32 in_sizes[NFIT_ACPI_MAX_ELEM]; + int out_sizes[NFIT_ACPI_MAX_ELEM]; +}; + +static const struct nfit_cmd_desc nfit_dimm_descs[] = { + [NFIT_CMD_IMPLEMENTED] = { }, + [NFIT_CMD_SMART] = { + .out_num = 2, + .out_sizes = { 4, 8, }, + }, + [NFIT_CMD_SMART_THRESHOLD] = { + .out_num = 2, + .out_sizes = { 4, 8, }, + }, + [NFIT_CMD_DIMM_FLAGS] = { + .out_num = 2, + .out_sizes = { 4, 4 }, + }, + [NFIT_CMD_GET_CONFIG_SIZE] = { + .out_num = 3, + .out_sizes = { 4, 4, 4, }, + }, + [NFIT_CMD_GET_CONFIG_DATA] = { + .in_num = 2, + .in_sizes = { 4, 4, }, + .out_num = 2, + .out_sizes = { 4, UINT_MAX, }, + }, + [NFIT_CMD_SET_CONFIG_DATA] = { + .in_num = 3, + .in_sizes = { 4, 4, UINT_MAX, }, + .out_num = 1, + .out_sizes = { 4, }, + }, + [NFIT_CMD_VENDOR] = { + .in_num = 3, + .in_sizes = { 4, 4, UINT_MAX, }, + .out_num = 3, + .out_sizes = { 4, 4, UINT_MAX, }, + }, +}; + +static const struct nfit_cmd_desc nfit_acpi_descs[] = { + [NFIT_CMD_IMPLEMENTED] = { }, + [NFIT_CMD_ARS_CAP] = { + .in_num = 2, + .in_sizes
Re: [Linux-nvdimm] [PATCH 04/21] nd: create an 'nd_bus' from an 'nfit_desc'
On Tue, Apr 21, 2015 at 12:55 PM, Toshi Kani toshi.k...@hp.com wrote:
On Tue, 2015-04-21 at 12:58 -0700, Dan Williams wrote:
On Tue, Apr 21, 2015 at 12:35 PM, Toshi Kani toshi.k...@hp.com wrote:
On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote:
 :
+
+static int nd_mem_init(struct nd_bus *nd_bus)
+{
+	struct nd_spa *nd_spa;
+
+	/*
+	 * For each SPA-DCR address range find its corresponding
+	 * MEMDEV(s). From each MEMDEV find the corresponding DCR.
+	 * Then, try to find a SPA-BDW and a corresponding BDW that
+	 * references the DCR. Throw it all into an nd_mem object.
+	 * Note, that BDWs are optional.
+	 */
+	list_for_each_entry(nd_spa, &nd_bus->spas, list) {
+		u16 spa_index = readw(&nd_spa->nfit_spa->spa_index);
+		int type = nfit_spa_type(nd_spa->nfit_spa);
+		struct nd_mem *nd_mem, *found;
+		struct nd_memdev *nd_memdev;
+		u16 dcr_index;
+
+		if (type != NFIT_SPA_DCR)
+			continue;

This function requires NFIT_SPA_DCR, SPA Range Structure with NVDIMM Control Region GUID, for initializing an nd_mem object. However, battery-backed DIMMs do not have such control region SPA. IIUC, the NFIT spec does not require NFIT_SPA_DCR. Can you change this function to work with NFIT_SPA_PM as well?

NFIT_SPA_PM ranges are handled separately from nd_mem_init(). See nd_region_create() in patch 10.

If nd_mem_init() does not initialize nd_mem objects, nd_bus_probe() in core.c fails in nd_bus_init_interleave_sets() and skips all subsequent nd_bus_xxx() calls. So, nd_region_create() won't be called. nd_bus_init_interleave_sets() fails because init_interleave_set() returns -ENODEV if (!nd_mem).

Ah, ok, your test case is specifying PMEM backed by memory device info. We have a test case for simple ranges (nfit_test1_setup()), but it doesn't hit this bug because it does not specify any memory-device tables. Thanks, will fix this in v2 of the patch set.

BTW, there are two nd_bus_probe() in bus.c and core.c, which is confusing.

Ok, will fix this as well in the v2 posting.
Re: [Linux-nvdimm] [PATCH 04/21] nd: create an 'nd_bus' from an 'nfit_desc'
On Wed, Apr 22, 2015 at 9:39 AM, Toshi Kani toshi.k...@hp.com wrote: On Tue, 2015-04-21 at 13:35 -0700, Dan Williams wrote: On Tue, Apr 21, 2015 at 12:55 PM, Toshi Kani toshi.k...@hp.com wrote: On Tue, 2015-04-21 at 12:58 -0700, Dan Williams wrote: On Tue, Apr 21, 2015 at 12:35 PM, Toshi Kani toshi.k...@hp.com wrote: On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote: : + +static int nd_mem_init(struct nd_bus *nd_bus) +{ + struct nd_spa *nd_spa; + + /* + * For each SPA-DCR address range find its corresponding + * MEMDEV(s). From each MEMDEV find the corresponding DCR. + * Then, try to find a SPA-BDW and a corresponding BDW that + * references the DCR. Throw it all into an nd_mem object. + * Note, that BDWs are optional. + */ + list_for_each_entry(nd_spa, nd_bus-spas, list) { + u16 spa_index = readw(nd_spa-nfit_spa-spa_index); + int type = nfit_spa_type(nd_spa-nfit_spa); + struct nd_mem *nd_mem, *found; + struct nd_memdev *nd_memdev; + u16 dcr_index; + + if (type != NFIT_SPA_DCR) + continue; This function requires NFIT_SPA_DCR, SPA Range Structure with NVDIMM Control Region GUID, for initializing an nd_mem object. However, battery-backed DIMMs do not have such control region SPA. IIUC, the NFIT spec does not require NFIT_SPA_DCR. Can you change this function to work with NFIT_SPA_PM as well? NFIT_SPA_PM ranges are handled separately from nd_mem_init(). See nd_region_create() in patch 10. If nd_mem_init() does not initialize nd_mem objects, nd_bus_probe() in core.c fails in nd_bus_init_interleave_sets() and skips all subsequent nd_bus_xxx() calls. So, nd_region_create() won't be called. nd_bus_init_interleave_sets() fails because init_interleave_set() returns -ENODEV if (!nd_mem). Ah, ok your test case is specifying PMEM backed by memory device info. We have a test case for simple ranges (nfit_test1_setup()), but it doesn't hit this bug because it does not specify any memory-device tables. 
Yes, we have NFIT table with SPA range (PM), memory device to SPA, and NVDIMM control region structures. With the memory device to SPA structure, this code requires full sets of information, including the namespace label data in _DSM [1], which is outside of ACPI 6.0 and is optional. Battery-backed DIMMs do not have such label data.

This is what nd_namespace_io devices are for, they do not require labels. Question, if you don't have labels and you don't have DSMs then why publish a MEMDEV table at all? Why not simply publish an anonymous range? See nfit_test1_setup().

It needs to work with NFIT table with these structures without this _DSM or with a different type of _DSM which this code may or may not need to support. It should also check Region Format Interface Code (RFIC) in the NVDIMM control region structure before assuming this _DSM is present to implement RFIC 0x0201.

Ok, I can look into adding this check, but I don't think it is necessary if you simply refrain from publishing a MEMDEV entry.
Re: [Linux-nvdimm] [PATCH 04/21] nd: create an 'nd_bus' from an 'nfit_desc'
On Wed, Apr 22, 2015 at 11:00 AM, Linda Knippers linda.knipp...@hp.com wrote: On 4/22/2015 1:03 PM, Dan Williams wrote: On Wed, Apr 22, 2015 at 9:39 AM, Toshi Kani toshi.k...@hp.com wrote: On Tue, 2015-04-21 at 13:35 -0700, Dan Williams wrote: On Tue, Apr 21, 2015 at 12:55 PM, Toshi Kani toshi.k...@hp.com wrote: On Tue, 2015-04-21 at 12:58 -0700, Dan Williams wrote: On Tue, Apr 21, 2015 at 12:35 PM, Toshi Kani toshi.k...@hp.com wrote: On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote: : + +static int nd_mem_init(struct nd_bus *nd_bus) +{ + struct nd_spa *nd_spa; + + /* + * For each SPA-DCR address range find its corresponding + * MEMDEV(s). From each MEMDEV find the corresponding DCR. + * Then, try to find a SPA-BDW and a corresponding BDW that + * references the DCR. Throw it all into an nd_mem object. + * Note, that BDWs are optional. + */ + list_for_each_entry(nd_spa, nd_bus-spas, list) { + u16 spa_index = readw(nd_spa-nfit_spa-spa_index); + int type = nfit_spa_type(nd_spa-nfit_spa); + struct nd_mem *nd_mem, *found; + struct nd_memdev *nd_memdev; + u16 dcr_index; + + if (type != NFIT_SPA_DCR) + continue; This function requires NFIT_SPA_DCR, SPA Range Structure with NVDIMM Control Region GUID, for initializing an nd_mem object. However, battery-backed DIMMs do not have such control region SPA. IIUC, the NFIT spec does not require NFIT_SPA_DCR. Can you change this function to work with NFIT_SPA_PM as well? NFIT_SPA_PM ranges are handled separately from nd_mem_init(). See nd_region_create() in patch 10. If nd_mem_init() does not initialize nd_mem objects, nd_bus_probe() in core.c fails in nd_bus_init_interleave_sets() and skips all subsequent nd_bus_xxx() calls. So, nd_region_create() won't be called. nd_bus_init_interleave_sets() fails because init_interleave_set() returns -ENODEV if (!nd_mem). Ah, ok your test case is specifying PMEM backed by memory device info. 
We have a test case for simple ranges (nfit_test1_setup()), but it doesn't hit this bug because it does not specify any memory-device tables.

Yes, we have NFIT table with SPA range (PM), memory device to SPA, and NVDIMM control region structures. With the memory device to SPA structure, this code requires full sets of information, including the namespace label data in _DSM [1], which is outside of ACPI 6.0 and is optional. Battery-backed DIMMs do not have such label data.

This is what nd_namespace_io devices are for, they do not require labels. Question, if you don't have labels and you don't have DSMs then why publish a MEMDEV table at all? Why not simply publish an anonymous range? See nfit_test1_setup().

The MEMDEV table provides useful information, and there may be _DSMs, perhaps just not the same _DSM as some other devices.

It needs to work with NFIT table with these structures without this _DSM or with a different type of _DSM which this code may or may not need to support. It should also check Region Format Interface Code (RFIC) in the NVDIMM control region structure before assuming this _DSM is present to implement RFIC 0x0201.

Ok, I can look into adding this check, but I don't think it is necessary if you simply refrain from publishing a MEMDEV entry.

But we need the MEMDEV. And as Toshi mentions, we could have other RFICs with other _DSMs than your example. That's why there is an RFIC.

Wait, point of clarification, DCRs (dimm-control-regions) have RFICs, not MEMDEVs (memory-device-to-spa-mapping). Toshi's original report was that an NFIT with a SPA+MEMDEV was failing to enable a PMEM device. That specific problem can be fixed by either deleting the MEMDEV, or adding a DCR.

Of course, if you add a DCR with a different intended DSM layout than the DSM-example-interface the driver will need to add support for handling that case.
Re: [Linux-nvdimm] [PATCH 04/21] nd: create an 'nd_bus' from an 'nfit_desc'
On Wed, Apr 22, 2015 at 11:23 AM, Toshi Kani toshi.k...@hp.com wrote:
On Wed, 2015-04-22 at 11:20 -0700, Dan Williams wrote:
On Wed, Apr 22, 2015 at 11:00 AM, Linda Knippers linda.knipp...@hp.com wrote:

Wait, point of clarification, DCRs (dimm-control-regions) have RFICs, not MEMDEVs (memory-device-to-spa-mapping). Toshi's original report was that an NFIT with a SPA+MEMDEV was failing to enable a PMEM device. That specific problem can be fixed by either deleting the MEMDEV, or adding a DCR.

By a DCR, do you mean a DCR structure or SPA with Control Region GUID?

Hmm, I meant a DCR as defined below. I agree you would not need a SPA-DCR.

Adding a DCR structure does not solve this issue since it requires SPA with Control Region GUID, which battery-backed DIMMs do not have.

I would not go that far, half of a DCR entry is relevant for any NVDIMM, and half is only relevant if a DIMM offers BLK access:

struct acpi_nfit_dcr {
	u16 type;
	u16 length;
	u16 dcr_index;
	u16 vendor_id;
	u16 device_id;
	u16 revision_id;
	u16 sub_vendor_id;
	u16 sub_device_id;
	u16 sub_revision_id;
	u8 reserved[6];
	u32 serial_number;
	u16 fic;
	/* BLK relevant fields start here */
	u16 num_bcw;
	u64 bcw_size;
	u64 cmd_offset;
	u64 cmd_size;
	u64 status_offset;
	u64 status_size;
	u16 flags;
	u8 reserved2[6];
};

Of course, if you add a DCR with a different intended DSM layout than the DSM-example-interface the driver will need to add support for handling that case.

Yes, we consider to add different _DSMs for management. We do not need the nd_acpi driver to support it now, but we need this framework to work without the DSM-example-interface present.

One possible workaround is that I could ignore MEMDEV entries that do not have a corresponding DCR. This would enable nd_namespace_io devices to be surfaced for your use case. Would that work for you? I.e. do you need the nfit_handle exposed?
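The workaround proposed at the end of this message (skip MEMDEV entries that have no corresponding DCR instead of failing the whole probe) can be sketched in userspace C. This is only an illustration under stated assumptions: the `sketch_*` structures and functions are hypothetical stand-ins, not the types or helpers from the patch set, and the real tables carry many more fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for the NFIT tables. */
struct sketch_dcr {
	uint16_t dcr_index;
};

struct sketch_memdev {
	uint32_t nfit_handle;
	uint16_t dcr_index;	/* index of the DCR this MEMDEV refers to */
};

/* Return the DCR with a matching index, or NULL if none was published. */
static const struct sketch_dcr *sketch_find_dcr(const struct sketch_dcr *dcrs,
						size_t ndcr, uint16_t index)
{
	for (size_t i = 0; i < ndcr; i++)
		if (dcrs[i].dcr_index == index)
			return &dcrs[i];
	return NULL;
}

/*
 * Count the MEMDEVs that would become dimm objects. A MEMDEV without
 * a DCR is skipped rather than treated as a fatal error, so the plain
 * PMEM range can still be surfaced by later probe steps.
 */
static size_t sketch_count_dimm_memdevs(const struct sketch_memdev *memdevs,
					size_t nmemdev,
					const struct sketch_dcr *dcrs,
					size_t ndcr)
{
	size_t kept = 0;

	for (size_t i = 0; i < nmemdev; i++) {
		if (!sketch_find_dcr(dcrs, ndcr, memdevs[i].dcr_index))
			continue;	/* skip, do not fail */
		kept++;
	}
	return kept;
}
```

The trade-off raised in the thread still applies to this sketch: a skipped MEMDEV also means its nfit_handle is not exposed.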
Re: [Linux-nvdimm] [PATCH 19/21] nd: infrastructure for btt devices
On Wed, Apr 22, 2015 at 12:12 PM, Elliott, Robert (Server Storage) elli...@hp.com wrote:
-----Original Message-----
From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of Dan Williams
Sent: Friday, April 17, 2015 8:37 PM
To: linux-nvd...@lists.01.org
Subject: [Linux-nvdimm] [PATCH 19/21] nd: infrastructure for btt devices
...
+/*
+ * btt_sb_checksum: compute checksum for btt info block
+ *
+ * Returns a fletcher64 checksum of everything in the given info block
+ * except the last field (since that's where the checksum lives).
+ */
+u64 btt_sb_checksum(struct btt_sb *btt_sb)
+{
+	u64 sum, sum_save;
+
+	sum_save = btt_sb->checksum;
+	btt_sb->checksum = 0;
+	sum = nd_fletcher64(btt_sb, sizeof(*btt_sb));
+	btt_sb->checksum = sum_save;
+	return sum;
+}
+EXPORT_SYMBOL(btt_sb_checksum);
...

Of all the functions with prototypes in nd.h, this is the only function that doesn't have a name starting with nd_. Following such a convention helps ease setting up ftrace filters.

Sure, I'll fix that up.
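The save/zero/sum/restore pattern in btt_sb_checksum() can be tried outside the kernel with a small sketch. Everything named `sketch_*` is an assumption made for illustration: the thread does not show the body of nd_fletcher64(), so the summing loop below is a generic Fletcher-style running sum over native-endian 32-bit words, and the info block layout is invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical info block; the checksum is the last field. */
struct sketch_sb {
	uint32_t data[6];
	uint64_t checksum;
};

/*
 * Fletcher-style sum over 32-bit words (assumed, not taken from the
 * patch): lo32 is the running word sum, hi32 the sum of the running
 * sums. The result packs the low 32 bits of hi32 above lo32.
 */
static uint64_t sketch_fletcher64(const void *addr, size_t len)
{
	const unsigned char *p = addr;
	uint32_t lo32 = 0;
	uint64_t hi32 = 0;

	assert(len % sizeof(uint32_t) == 0);
	for (size_t i = 0; i < len; i += sizeof(uint32_t)) {
		uint32_t word;

		memcpy(&word, p + i, sizeof(word));
		lo32 += word;
		hi32 += lo32;
	}
	return hi32 << 32 | lo32;
}

/* Checksum the block as if its checksum field were zero. */
static uint64_t sketch_sb_checksum(struct sketch_sb *sb)
{
	uint64_t sum, sum_save;

	sum_save = sb->checksum;	/* remember the stored value */
	sb->checksum = 0;		/* exclude it from the sum */
	sum = sketch_fletcher64(sb, sizeof(*sb));
	sb->checksum = sum_save;	/* leave the block unmodified */
	return sum;
}
```

Because the stored checksum is zeroed before summing and restored afterwards, a caller can validate a block by comparing the return value against `sb->checksum` without any side effect on the block.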
Re: [Linux-nvdimm] [PATCH 00/21] ND: NFIT-Defined / NVDIMM Subsystem
On Wed, Apr 22, 2015 at 12:06 PM, Elliott, Robert (Server Storage) elli...@hp.com wrote:
-----Original Message-----
From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of Dan Williams
Sent: Friday, April 17, 2015 8:35 PM
To: linux-nvd...@lists.01.org
Subject: [Linux-nvdimm] [PATCH 00/21] ND: NFIT-Defined / NVDIMM Subsystem
...
 create mode 100644 drivers/block/nd/acpi.c
 create mode 100644 drivers/block/nd/blk.c
 create mode 100644 drivers/block/nd/bus.c
 create mode 100644 drivers/block/nd/core.c
...

The kernel already has lots of files with these names:
   5 acpi.c
  10 bus.c
  66 core.c

I often use ctags like this:
  vim -t core.c
but that doesn’t immediately work with common filenames - it presents a list of all 66 files to choose from.

Also, blk.c is a name one might expect to see in the block/ directory (e.g., next to blk.h).

An nd_ prefix on all the filenames would help.

I picked up the "don't duplicate the directory name in the source file name" approach from a review comment from Linus on a SCSI driver a long time back (iirc). I'm not motivated to stop that practice now.
Re: [Linux-nvdimm] [PATCH 04/21] nd: create an 'nd_bus' from an 'nfit_desc'
On Wed, Apr 22, 2015 at 12:38 PM, Toshi Kani toshi.k...@hp.com wrote:
On Wed, 2015-04-22 at 12:28 -0700, Dan Williams wrote:
On Wed, Apr 22, 2015 at 11:23 AM, Toshi Kani toshi.k...@hp.com wrote:
On Wed, 2015-04-22 at 11:20 -0700, Dan Williams wrote:
On Wed, Apr 22, 2015 at 11:00 AM, Linda Knippers linda.knipp...@hp.com wrote:

Wait, point of clarification, DCRs (dimm-control-regions) have RFICs, not MEMDEVs (memory-device-to-spa-mapping). Toshi's original report was that an NFIT with a SPA+MEMDEV was failing to enable a PMEM device. That specific problem can be fixed by either deleting the MEMDEV, or adding a DCR.

By a DCR, do you mean a DCR structure or SPA with Control Region GUID?

Hmm, I meant a DCR as defined below. I agree you would not need a SPA-DCR.

Adding a DCR structure does not solve this issue since it requires SPA with Control Region GUID, which battery-backed DIMMs do not have.

I would not go that far, half of a DCR entry is relevant for any NVDIMM, and half is only relevant if a DIMM offers BLK access:

struct acpi_nfit_dcr {
	u16 type;
	u16 length;
	u16 dcr_index;
	u16 vendor_id;
	u16 device_id;
	u16 revision_id;
	u16 sub_vendor_id;
	u16 sub_device_id;
	u16 sub_revision_id;
	u8 reserved[6];
	u32 serial_number;
	u16 fic;
	/* BLK relevant fields start here */
	u16 num_bcw;
	u64 bcw_size;
	u64 cmd_offset;
	u64 cmd_size;
	u64 status_offset;
	u64 status_size;
	u16 flags;
	u8 reserved2[6];
};

Yes, we do have a DCR entry. But we do not have a SPA-DCR.

Got it, will fix.
Re: [Linux-nvdimm] [PATCH 05/21] nfit-test: manufactured NFITs for interface development
On Fri, Apr 24, 2015 at 2:59 PM, Linda Knippers linda.knipp...@hp.com wrote:
On 4/24/2015 5:50 PM, Dan Williams wrote:
On Fri, Apr 24, 2015 at 2:47 PM, Linda Knippers linda.knipp...@hp.com wrote:
On 4/17/2015 9:35 PM, Dan Williams wrote:
 :
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 5fa74f124b3e..0106b3807202 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -41,4 +41,24 @@ config NFIT_ACPI
 	  register the platform-global NFIT blob with the core. Also
 	  enables the core to craft ACPI._DSM messages for platform/dimm
 	  configuration.
+
+config NFIT_TEST
+	tristate "NFIT TEST: Manufactured NFIT for interface testing"
+	depends on DMA_CMA
+	depends on ND_CORE=m
+	depends on m
+	help
+	  For development purposes register a manufactured
+	  NFIT table to verify the resulting device model topology.
+	  Note, this module arranges for ioremap_cache() to be
+	  overridden locally to allow simulation of system-memory as an
+	  io-memory-resource.
+
+	  Note, this test expects to be able to find at least
+	  256MB of CMA space (CONFIG_CMA_SIZE_MBYTES) or it will fail to

It seems to actually be wanting >= 584MB.

Ah, true, this Kconfig text is stale. Will fix.

Thanks. One more question...

+#ifdef CONFIG_CMA_SIZE_MBYTES
+#define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
+#else
+#define CMA_SIZE_MBYTES 0
+#endif
+
+static __init int nfit_test_init(void)
+{
+	int rc, i;
+
+	if (CMA_SIZE_MBYTES < 584) {
+		pr_err("need CONFIG_CMA_SIZE_MBYTES >= 584 to load\n");
+		return -EINVAL;
+	}
+

Since the kernel takes a cma= boot parameter, it would be nice if this check is against what the kernel is using rather than the config option. Is that possible?

Yeah, that would be more friendly. I also think we can reduce the BLK aperture sizes. Since those don't need to be DAX capable they can come from vmalloc memory rather than CMA. I'll take a look.
Re: [Linux-nvdimm] [PATCH 08/21] nd: ndctl.h, the nd ioctl abi
On Fri, Apr 24, 2015 at 8:56 AM, Toshi Kani toshi.k...@hp.com wrote:
On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote:

Most configuration of the nd-subsystem is done via nd-sysfs. However, the NFIT specification defines a small set of messages that can be passed to the subsystem via platform-firmware-defined methods. The command set (as of the current version of the NFIT-DSM spec) is:

    NFIT_CMD_SMART: media health and diagnostics
    NFIT_CMD_GET_CONFIG_SIZE: size of the label space
    NFIT_CMD_GET_CONFIG_DATA: read label
    NFIT_CMD_SET_CONFIG_DATA: write label
    NFIT_CMD_VENDOR: vendor-specific command passthrough
    NFIT_CMD_ARS_CAP: report address-range-scrubbing capabilities
    NFIT_CMD_START_ARS: initiate scrubbing
    NFIT_CMD_QUERY_ARS: report on scrubbing state
    NFIT_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

nd/bus.c provides two features, 1) the top level ND bus driver which is the central part of the ND, and 2) the ioctl interface specific to the example-DSM-interface. I think the example-DSM-specific part should be put into an example-DSM-support module, so that the ND can support other _DSMs as necessary. Also, _DSM needs to be handled as optional.

I don't think it needs to be separated, they'll both end up using the same infrastructure just with different UUIDs on the ACPI device interface or different format-interface-codes. A firmware implementation is also free to disable individual DSMs (see nd_acpi_add_dimm).

That said, you're right, we do need a fix to allow PMEM from DIMMs without DSMs to activate.
Re: [Linux-nvdimm] [PATCH 08/21] nd: ndctl.h, the nd ioctl abi
On Fri, Apr 24, 2015 at 9:09 AM, Toshi Kani toshi.k...@hp.com wrote:
On Fri, 2015-04-24 at 09:56 -0600, Toshi Kani wrote:
On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote:

Most configuration of the nd-subsystem is done via nd-sysfs. However, the NFIT specification defines a small set of messages that can be passed to the subsystem via platform-firmware-defined methods. The command set (as of the current version of the NFIT-DSM spec) is:

    NFIT_CMD_SMART: media health and diagnostics
    NFIT_CMD_GET_CONFIG_SIZE: size of the label space
    NFIT_CMD_GET_CONFIG_DATA: read label
    NFIT_CMD_SET_CONFIG_DATA: write label
    NFIT_CMD_VENDOR: vendor-specific command passthrough
    NFIT_CMD_ARS_CAP: report address-range-scrubbing capabilities
    NFIT_CMD_START_ARS: initiate scrubbing
    NFIT_CMD_QUERY_ARS: report on scrubbing state
    NFIT_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

nd/bus.c provides two features, 1) the top level ND bus driver which is the central part of the ND, and 2) the ioctl interface specific to the example-DSM-interface. I think the example-DSM-specific part should be put into an example-DSM-support module, so that the ND can support other _DSMs as necessary. Also, _DSM needs to be handled as optional.

And the same for nd/acpi.c, which is 1) the ACPI0012 handler, and 2) the example-DSM-support module. I think they need to be separated.

Ok, send me a patch as I'm not sure what type of separation you are proposing.
Re: [Linux-nvdimm] [PATCH 08/21] nd: ndctl.h, the nd ioctl abi
On Fri, Apr 24, 2015 at 10:18 AM, Toshi Kani toshi.k...@hp.com wrote:
On Fri, 2015-04-24 at 09:25 -0700, Dan Williams wrote:
On Fri, Apr 24, 2015 at 8:56 AM, Toshi Kani toshi.k...@hp.com wrote:
On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote:

Most configuration of the nd-subsystem is done via nd-sysfs. However, the NFIT specification defines a small set of messages that can be passed to the subsystem via platform-firmware-defined methods. The command set (as of the current version of the NFIT-DSM spec) is:

    NFIT_CMD_SMART: media health and diagnostics
    NFIT_CMD_GET_CONFIG_SIZE: size of the label space
    NFIT_CMD_GET_CONFIG_DATA: read label
    NFIT_CMD_SET_CONFIG_DATA: write label
    NFIT_CMD_VENDOR: vendor-specific command passthrough
    NFIT_CMD_ARS_CAP: report address-range-scrubbing capabilities
    NFIT_CMD_START_ARS: initiate scrubbing
    NFIT_CMD_QUERY_ARS: report on scrubbing state
    NFIT_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

nd/bus.c provides two features, 1) the top level ND bus driver which is the central part of the ND, and 2) the ioctl interface specific to the example-DSM-interface. I think the example-DSM-specific part should be put into an example-DSM-support module, so that the ND can support other _DSMs as necessary. Also, _DSM needs to be handled as optional.

I don't think it needs to be separated, they'll both end up using the same infrastructure just with different UUIDs on the ACPI device interface or different format-interface-codes. A firmware implementation is also free to disable individual DSMs (see nd_acpi_add_dimm).

Well, ioctl cmd# is essentially func# of the _DSM, and each cmd structure needs to match with its _DSM output data structure. So, I do not think these cmds will work for other _DSMs. That said, the ND is complex enough already, and we should not make it more complicated for the initial version... So, how about changing the name of /dev/ndctl0 to indicate RFIC 0x0201, ex. /dev/nd0201ctl0? That should allow separate ioctl()s for other RFICs. The code can be updated when other _DSM actually needs to be supported by the ND.

No, all you need is unique command names (see libndctl ndctl_{bus|dimm}_is_cmd_supported()) and then translate the ND cmd number to the firmware function number in the provider. It just so happens that for this first set of commands the ND cmd number matches the ACPI device function number in the DSM-interface-example, but there is no reason that need always be the case.
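The translation step described in the reply above (unique ND command numbers, mapped to firmware function numbers inside each provider) might look roughly like the following. This is a sketch under assumptions: the `SKETCH_*` command numbers, the firmware function values, and the lookup helper are all invented for illustration and do not come from ndctl.h or any _DSM specification.

```c
#include <stddef.h>

/* Hypothetical ND command numbers, as seen by userspace. */
enum sketch_nd_cmd {
	SKETCH_CMD_SMART = 1,
	SKETCH_CMD_GET_CONFIG_SIZE,
	SKETCH_CMD_GET_CONFIG_DATA,
	SKETCH_CMD_MAX,
};

/*
 * Hypothetical provider table: this firmware uses its own function
 * numbers and does not implement GET_CONFIG_SIZE (entry left at 0).
 */
static const int sketch_fw_func[SKETCH_CMD_MAX] = {
	[SKETCH_CMD_SMART] = 0x11,
	[SKETCH_CMD_GET_CONFIG_DATA] = 0x13,
};

/* Map an ND command to a firmware function, or -1 if unsupported. */
static int sketch_translate(int nd_cmd)
{
	if (nd_cmd <= 0 || nd_cmd >= SKETCH_CMD_MAX)
		return -1;
	if (sketch_fw_func[nd_cmd] == 0)
		return -1;
	return sketch_fw_func[nd_cmd];
}
```

With a table like this per provider, the userspace command names stay stable while each firmware interface is free to number its functions differently; an identity mapping is just the special case where both numberings happen to agree.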
Re: [PATCH 01/21] e820, efi: add ACPI 6.0 persistent memory types
On Sun, Apr 19, 2015 at 12:46 AM, Boaz Harrosh b...@plexistor.com wrote: On 04/18/2015 04:35 AM, Dan Williams wrote: [..] diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index 11cc7d54ec3f..410af501a941 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -137,6 +137,8 @@ static void __init e820_print_type(u32 type) case E820_RESERVED_KERN: printk(KERN_CONT usable); break; + case E820_PMEM: + case E820_PRAM: NACK! This is the most important print in the system and it is a pure user Interface. It has no effect what so ever on functionality It is to Inform the user through dmesg what is the content of the table. It still describes how the memory is used which is reserved for a driver, I don't see how increasing the verbosity here improves debug given the alternatives, see below... case E820_RESERVED: printk(KERN_CONT reserved); break; @@ -149,9 +151,6 @@ static void __init e820_print_type(u32 type) case E820_UNUSABLE: printk(KERN_CONT unusable); break; + case E820_PMEM: - case E820_PRAM: - printk(KERN_CONT persistent (type %u), type); - break; Just add the new (7) entry here please. Here Christoph has bike shed it for you. /proc/iomem has these details to differentiate PRAM and PMEM as well as show which driver(s)/device(s) have claimed the range(s). default: printk(KERN_CONT type %u, type); break; Here is where you can see undefined/unknown types. @@ -919,10 +918,26 @@ static inline const char *e820_type_to_string(int e820_type) case E820_NVS: return ACPI Non-volatile Storage; case E820_UNUSABLE: return Unusable memory; case E820_PRAM: return Persistent RAM; + case E820_PMEM: return Persistent I/O Memory; default:return reserved; } } +static bool do_mark_busy(u32 type, struct resource *res) +{ + if (res-start (1ULL20)) + return true; + + switch (type) { + case E820_RESERVED: + case E820_PRAM: + case E820_PMEM: + return false; + default: + return true; + } Sigh. Again an unknown type comes out busy. Busy means resource used. 
> It does *not* mean unknown type. It just forces researchers to ignore the return value of request_region. And not be protected by double lock. It does not really prevent anything.

You're free to submit a standalone patch to change this policy... see the new OEM-reserved memory types in ACPI 6. That said, I think we're better off with the current policy. If unknown memory types were treated as permanently-busy back when we initially started experimenting with NVDIMM support (2010) then I doubt the e820-type-12 prototype would ever have escaped the lab. We could have avoided a good amount of confusion.
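The busy-marking policy debated above can be modeled in userspace. This is only a sketch: mark_busy_model() is a hypothetical stand-in for the quoted do_mark_busy(), it takes the range start directly instead of a struct resource, and the "<" and "<<" operators are reconstructed on the assumption that the archive stripped them from the quoted hunk. The values 7 and 12 are the type numbers named in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* e820 type numbers; 7 (PMEM) and 12 (PRAM) are the ones named in this thread. */
enum {
	E820_RAM      = 1,
	E820_RESERVED = 2,
	E820_ACPI     = 3,
	E820_NVS      = 4,
	E820_UNUSABLE = 5,
	E820_PMEM     = 7,	/* ACPI 6.0 persistent memory */
	E820_PRAM     = 12,	/* legacy "type 12" persistent RAM */
};

/*
 * Hypothetical userspace model of the quoted policy:
 *  - anything starting below 1 MiB is always marked busy;
 *  - reserved and persistent ranges are left non-busy so a driver can
 *    later claim them;
 *  - every other type, including unknown ones, is marked busy.
 */
bool mark_busy_model(uint32_t type, uint64_t start)
{
	if (start < (1ULL << 20))
		return true;

	switch (type) {
	case E820_RESERVED:
	case E820_PRAM:
	case E820_PMEM:
		return false;
	default:
		return true;
	}
}
```

Under this model an unknown type such as 99 comes out busy, which is the behaviour Boaz objects to and Dan defends.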
Re: [PATCH 02/21] ND NFIT-Defined/NVIDIMM Subsystem
On Mon, Apr 20, 2015 at 5:53 AM, Christoph Hellwig h...@lst.de wrote: [I haven't much time to look through the patches, so only high level hand wavey comments for now, sorry..] On Mon, Apr 20, 2015 at 01:14:42AM -0700, Dan Williams wrote: So why on earth is this whole concept and the naming itself ('drivers/block/nd/' stands for 'NFIT Defined', apparently) revolving around a specific 'firmware' mindset and revolving around specific, weirdly named, overly complicated looking firmware interfaces that come with their own new weird glossary?? There's only three core properties of NVDIMMs that this implementation cares about. 1/ directly mapped interleaved persistent memory (PMEM) 2/ indirect mmio aperture accessed (windowed) persistent memory (BLK) 3/ the possibility that those 2 access modes may alias the same on-media addresses Most of complexity of the implementation is dealing with aspect 3, but that complexity can and is bypassed in places. Firmware might be a discovery method - or not. A non-volatile device might be e820 enumerated, or PCI discovered - potentially with all discovery handled by the driver. PCI attached non-volatile memory is NVMe. ND is handling address ranges that support direct cpu load store. But those can't be attached in all kinds of different ways. It's not like this is a new thing - they've been used in Storage OEM systems for a long time, both on Intel platforms and other CPUs. And the current pmem.c can also handle cases like a PCI card exposing a large mmio region that can be used as persistent memory. So a big vote from me into naming this the pmem subsystem and trying to have names not too tied to one specific firmware interface. While I understand a kernel developer's natural aversion to anything committee defined, NFIT does seem be a superset of all the base mechanisms needed to describe NVDIMM resources. Also, it's worth noting that meaning of 'N' in ND is purposefully vague. 
The whole point of listing it as Nfit-Defined / NvDimm Subsystem was to indicate that ND is generic and could also refer generally to Non-volatile-Devices. What's missing, in my opinion, is an existing NVDIMM platform that would like to leverage some of the base enabling that this sub-system provides and will never have an NFIT capability. In the absence of alternative concerns/implementations we reached for NFIT terminology out of convenience, but I'm all up for deprecating NFIT-Defined as one of the meanings of 'ND'.

> Once I go through this in more detail I'll comment more.

Sounds good.
Re: [Linux-nvdimm] [PATCH 04/21] nd: create an 'nd_bus' from an 'nfit_desc'
On Tue, Apr 21, 2015 at 12:35 PM, Toshi Kani toshi.k...@hp.com wrote:
> On Fri, 2015-04-17 at 21:35 -0400, Dan Williams wrote:
>  :
>> +
>> +static int nd_mem_init(struct nd_bus *nd_bus)
>> +{
>> +        struct nd_spa *nd_spa;
>> +
>> +        /*
>> +         * For each SPA-DCR address range find its corresponding
>> +         * MEMDEV(s). From each MEMDEV find the corresponding DCR.
>> +         * Then, try to find a SPA-BDW and a corresponding BDW that
>> +         * references the DCR. Throw it all into an nd_mem object.
>> +         * Note, that BDWs are optional.
>> +         */
>> +        list_for_each_entry(nd_spa, &nd_bus->spas, list) {
>> +                u16 spa_index = readw(&nd_spa->nfit_spa->spa_index);
>> +                int type = nfit_spa_type(nd_spa->nfit_spa);
>> +                struct nd_mem *nd_mem, *found;
>> +                struct nd_memdev *nd_memdev;
>> +                u16 dcr_index;
>> +
>> +                if (type != NFIT_SPA_DCR)
>> +                        continue;
>
> This function requires NFIT_SPA_DCR, SPA Range Structure with NVDIMM Control Region GUID, for initializing an nd_mem object. However, battery-backed DIMMs do not have such control region SPA. IIUC, the NFIT spec does not require NFIT_SPA_DCR. Can you change this function to work with NFIT_SPA_PM as well?

NFIT_SPA_PM ranges are handled separately from nd_mem_init(). See nd_region_create() in patch 10.
Re: [Linux-nvdimm] [PATCH 05/21] nfit-test: manufactured NFITs for interface development
On Fri, Apr 24, 2015 at 2:47 PM, Linda Knippers linda.knipp...@hp.com wrote:
> On 4/17/2015 9:35 PM, Dan Williams wrote:
>  :
>> diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
>> index 5fa74f124b3e..0106b3807202 100644
>> --- a/drivers/block/nd/Kconfig
>> +++ b/drivers/block/nd/Kconfig
>> @@ -41,4 +41,24 @@ config NFIT_ACPI
>>           register the platform-global NFIT blob with the core. Also
>>           enables the core to craft ACPI._DSM messages for platform/dimm
>>           configuration.
>> +
>> +config NFIT_TEST
>> +        tristate "NFIT TEST: Manufactured NFIT for interface testing"
>> +        depends on DMA_CMA
>> +        depends on ND_CORE=m
>> +        depends on m
>> +        help
>> +          For development purposes register a manufactured
>> +          NFIT table to verify the resulting device model topology.
>> +          Note, this module arranges for ioremap_cache() to be
>> +          overridden locally to allow simulation of system-memory as an
>> +          io-memory-resource.
>> +
>> +          Note, this test expects to be able to find at least
>> +          256MB of CMA space (CONFIG_CMA_SIZE_MBYTES) or it will fail to
>
> It seems to actually be wanting >= 584MB.

Ah, true, this Kconfig text is stale. Will fix.
Re: [PATCH v2 00/20] libnd: non-volatile memory device support
On Tue, Apr 28, 2015 at 5:25 PM, Rafael J. Wysocki r...@rjwysocki.net wrote: On Tuesday, April 28, 2015 02:24:12 PM Dan Williams wrote: Changes since v1 [1]: Incorporates feedback received prior to April 24. 1/ Ingo said [2]: So why on earth is this whole concept and the naming itself ('drivers/block/nd/' stands for 'NFIT Defined', apparently) revolving around a specific 'firmware' mindset and revolving around specific, weirdly named, overly complicated looking firmware interfaces that come with their own new weird glossary?? Indeed, we of course consulted the NFIT specification to determine the shape of the sub-system, but then let its terms and data structures permeate too deep into the implementation. That is fixed now with all NFIT specifics factored out into acpi.c. The NFIT is no longer required reading to review libnd. Only three concepts are needed: i/ PMEM - contiguous memory range where cpu stores are persistent once they are flushed through the memory controller. ii/ BLK - mmio apertures (sliding windows) that can be programmed to access an aperture's-worth of persistent media at a time. iii/ DPA - dimm-physical-address, address space local to a dimm. A dimm may provide both PMEM-mode and BLK-mode access to a range of DPA. libnd manages allocation of DPA to either PMEM or BLK-namespaces to resolve this aliasing. The v1..v2 diffstat below shows the migration of nfit-specifics to acpi.c and the new state of libnd being nfit-free. nd now only refers to non-volatile devices. Note, reworked documentation will return once the review has settled. 
Documentation/blockdev/nd.txt | 867 - MAINTAINERS | 34 +- arch/ia64/kernel/efi.c|5 +- arch/x86/kernel/e820.c| 11 +- arch/x86/kernel/pmem.c|2 +- drivers/block/Makefile|2 +- drivers/block/nd/Kconfig | 135 ++-- drivers/block/nd/Makefile | 32 +- drivers/block/nd/acpi.c | 1506 +++-- drivers/block/nd/acpi_nfit.h | 321 drivers/block/nd/blk.c| 27 +- drivers/block/nd/btt.c|6 +- drivers/block/nd/btt_devs.c |8 +- drivers/block/nd/bus.c| 337 + drivers/block/nd/core.c | 574 +- drivers/block/nd/dimm.c | 11 - drivers/block/nd/dimm_devs.c | 292 ++- drivers/block/nd/e820.c | 100 +++ drivers/block/nd/libnd.h | 122 +++ drivers/block/nd/namespace_devs.c | 10 +- drivers/block/nd/nd-private.h | 107 +-- drivers/block/nd/nd.h | 91 +-- drivers/block/nd/nfit.h | 238 -- drivers/block/nd/pmem.c | 56 +- drivers/block/nd/region.c | 78 +- drivers/block/nd/region_devs.c| 783 +++ drivers/block/nd/test/iomap.c | 86 +-- drivers/block/nd/test/nfit.c | 1115 +++ drivers/block/nd/test/nfit_test.h | 15 +- include/uapi/linux/ndctl.h| 130 ++-- 30 files changed, 3166 insertions(+), 3935 deletions(-) delete mode 100644 Documentation/blockdev/nd.txt create mode 100644 drivers/block/nd/acpi_nfit.h create mode 100644 drivers/block/nd/e820.c create mode 100644 drivers/block/nd/libnd.h delete mode 100644 drivers/block/nd/nfit.h [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-April/000484.html [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-April/000520.html 2/ Christoph asked the pmem ida conversion to be moved to its own patch (done), and to consider leaving the current pmem.c in drivers/block/. Instead, I converted the e820-type-12 enabling to be the first non-ACPI-NFIT based consumer of libnd. The new nd_e820 driver simply registers e820-type-12 ranges as libnd PMEM regions. Among other things this conversion enables BTT for these ranges. The alternative is to move drivers/block/nd/nd.h internals out to include/linux/ which I think is worse. 
3/ Toshi reported that the NFIT parsing fails to handle the case of a PMEM range with a single-dimm (non-aliasing) interleave description. Support for this case was added and is tested by default by the nfit_test.1 configuration. 4/ Toshi reported that we should not be treating a missing _STA property as a dimm disabled by firmware case. (fixed). 5/ Christoph noted that ND_ARCH_HAS_IOREMAP_CACHE needs to be moved to arch code. It is gone for now and we'll revisit when adding cached mappings back to the PMEM driver. 6/ Toshi mentioned that the presence of two different nd_bus_probe() functions was confusing. (cleaned up
Re: [PATCH v2 00/20] libnd: non-volatile memory device support
On Tue, Apr 28, 2015 at 1:52 PM, Andy Lutomirski l...@amacapital.net wrote: On Tue, Apr 28, 2015 at 11:24 AM, Dan Williams dan.j.willi...@intel.com wrote: Changes since v1 [1]: Incorporates feedback received prior to April 24. 1/ Ingo said [2]: So why on earth is this whole concept and the naming itself ('drivers/block/nd/' stands for 'NFIT Defined', apparently) revolving around a specific 'firmware' mindset and revolving around specific, weirdly named, overly complicated looking firmware interfaces that come with their own new weird glossary?? Indeed, we of course consulted the NFIT specification to determine the shape of the sub-system, but then let its terms and data structures permeate too deep into the implementation. That is fixed now with all NFIT specifics factored out into acpi.c. The NFIT is no longer required reading to review libnd. Only three concepts are needed: i/ PMEM - contiguous memory range where cpu stores are persistent once they are flushed through the memory controller. ii/ BLK - mmio apertures (sliding windows) that can be programmed to access an aperture's-worth of persistent media at a time. iii/ DPA - dimm-physical-address, address space local to a dimm. A dimm may provide both PMEM-mode and BLK-mode access to a range of DPA. libnd manages allocation of DPA to either PMEM or BLK-namespaces to resolve this aliasing. Mostly for my understanding: is there a name for address relative to the address lines on the DIMM? That is, a DIMM that exposes 8 GB of apparent physical memory, possibly interleaved, broken up, or weirdly remapped by the memory controller, would still have addresses between 0 and 8 GB. Some of those might be PMEM windows, some might be MMIO, some might be BLK apertures, etc. IIUC DPA refers to actual addressable storage, not this type of address? No, DPA is exactly as you describe above. 
You can't directly access it except through a PMEM mapping (possibly interleaved with DPA from other DIMMs) or a BLK aperture (mmio window into DPA).
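To make the "PMEM mapping interleaved with DPA from other DIMMs" idea concrete, here is a minimal sketch of how an offset into an interleaved PMEM range could resolve to a (DIMM, DPA) pair. Everything in it is an assumption for illustration: spa_offset_to_dpa(), struct dimm_addr, and the simple round-robin layout are hypothetical and are not the NFIT interleave format or libnd code.

```c
#include <stdint.h>

/* Hypothetical result type: which DIMM, and where in its local address space. */
struct dimm_addr {
	unsigned int dimm;	/* index of the DIMM within the interleave set */
	uint64_t dpa;		/* dimm-physical-address on that DIMM */
};

/*
 * Assumed layout: `ways` DIMMs interleaved round-robin in `line`-byte
 * chunks, with the region starting at `dpa_base` on every DIMM.
 * `offset` is relative to the start of the PMEM range.
 */
struct dimm_addr spa_offset_to_dpa(uint64_t offset, unsigned int ways,
				   uint64_t line, uint64_t dpa_base)
{
	uint64_t line_index = offset / line;	/* which chunk of the range */
	struct dimm_addr out;

	out.dimm = (unsigned int)(line_index % ways);
	out.dpa = dpa_base + (line_index / ways) * line + offset % line;
	return out;
}
```

With two DIMMs and 4 KiB chunks, consecutive 4 KiB pieces of the range alternate between DIMM 0 and DIMM 1, while each DIMM's DPA advances by 4 KiB only every second chunk.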
Re: [PATCH v2 00/20] libnd: non-volatile memory device support
On Tue, Apr 28, 2015 at 2:06 PM, Andy Lutomirski l...@amacapital.net wrote: On Tue, Apr 28, 2015 at 1:59 PM, Dan Williams dan.j.willi...@intel.com wrote: On Tue, Apr 28, 2015 at 1:52 PM, Andy Lutomirski l...@amacapital.net wrote: On Tue, Apr 28, 2015 at 11:24 AM, Dan Williams dan.j.willi...@intel.com wrote: Changes since v1 [1]: Incorporates feedback received prior to April 24. 1/ Ingo said [2]: So why on earth is this whole concept and the naming itself ('drivers/block/nd/' stands for 'NFIT Defined', apparently) revolving around a specific 'firmware' mindset and revolving around specific, weirdly named, overly complicated looking firmware interfaces that come with their own new weird glossary?? Indeed, we of course consulted the NFIT specification to determine the shape of the sub-system, but then let its terms and data structures permeate too deep into the implementation. That is fixed now with all NFIT specifics factored out into acpi.c. The NFIT is no longer required reading to review libnd. Only three concepts are needed: i/ PMEM - contiguous memory range where cpu stores are persistent once they are flushed through the memory controller. ii/ BLK - mmio apertures (sliding windows) that can be programmed to access an aperture's-worth of persistent media at a time. iii/ DPA - dimm-physical-address, address space local to a dimm. A dimm may provide both PMEM-mode and BLK-mode access to a range of DPA. libnd manages allocation of DPA to either PMEM or BLK-namespaces to resolve this aliasing. Mostly for my understanding: is there a name for address relative to the address lines on the DIMM? That is, a DIMM that exposes 8 GB of apparent physical memory, possibly interleaved, broken up, or weirdly remapped by the memory controller, would still have addresses between 0 and 8 GB. Some of those might be PMEM windows, some might be MMIO, some might be BLK apertures, etc. IIUC DPA refers to actual addressable storage, not this type of address? 
>> No, DPA is exactly as you describe above. You can't directly access it except through a PMEM mapping (possibly interleaved with DPA from other DIMMs) or a BLK aperture (mmio window into DPA).
>
> So the thing I'm describing has no name, then? Oh, well.

What? The thing you are describing *is* DPA.
Re: [PATCH v2 20/20] libnd, nd_acpi, nd_blk: driver for BLK-mode access persistent memory
On Tue, Apr 28, 2015 at 2:10 PM, Andy Lutomirski l...@amacapital.net wrote: On Tue, Apr 28, 2015 at 11:26 AM, Dan Williams dan.j.willi...@intel.com wrote: From: Ross Zwisler ross.zwis...@linux.intel.com The libnd implementation handles allocating dimm address space (DPA) between PMEM and BLK mode interfaces. After DPA has been allocated from a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O as a struct bio based block device. Unlike PMEM, BLK is required to handle platform specific details like mmio register formats and memory controller interleave. For this reason the libnd generic nd_blk driver calls back into the bus provider to carry out the I/O. This initial implementation handles the BLK interface defined by the ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from DCR (dimm control region), BDW (block data window), IDT (interleave descriptor) NFIT structures and the hardware register format. [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf Cc: Andy Lutomirski l...@amacapital.net Cc: Boaz Harrosh b...@plexistor.com Cc: H. 
Peter Anvin h...@zytor.com Cc: Jens Axboe ax...@fb.com Cc: Ingo Molnar mi...@kernel.org Cc: Christoph Hellwig h...@lst.de Signed-off-by: Ross Zwisler ross.zwis...@linux.intel.com Signed-off-by: Dan Williams dan.j.willi...@intel.com --- drivers/block/nd/Kconfig | 12 + drivers/block/nd/Makefile |3 drivers/block/nd/acpi.c | 422 +++-- drivers/block/nd/acpi_nfit.h | 47 drivers/block/nd/blk.c| 264 +++ drivers/block/nd/libnd.h | 11 + drivers/block/nd/namespace_devs.c | 47 drivers/block/nd/nd-private.h |3 drivers/block/nd/nd.h | 16 + drivers/block/nd/region.c |8 + drivers/block/nd/region_devs.c| 65 +- drivers/block/nd/test/nfit.c | 29 +++ drivers/block/nd/test/nfit_test.h |2 13 files changed, 891 insertions(+), 38 deletions(-) create mode 100644 drivers/block/nd/blk.c diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig index 612bf2b14283..bac4290129fc 100644 --- a/drivers/block/nd/Kconfig +++ b/drivers/block/nd/Kconfig @@ -95,6 +95,18 @@ config BLK_DEV_PMEM Say Y if you want to use a NVDIMM described by ACPI, E820, etc... +config ND_BLK + tristate BLK: Block data window (aperture) device support + depends on LIBND + default ND_ACPI + help + This driver performs I/O using a set of mmio windows on a + dimm. The set of apertures will all access the one DIMM. + Multiple windows allow multiple threads to have a different + portions of the dimm open at one time. + + Say Y if you want to use a NVDIMM with BLK-mode capability + This describes how it works, not what it is. How about: This driver exposes NVDIMM BLK regions as block devices. BLK regions are regions of NVDIMM storage that are sector-addressable, not byte-addressible, and do not support DAX. They *are* byte-addressable albeit through an indirection window. The indirection windows are too small for DAX to be a viable access mode. 
Re: [PATCH v2 01/20] e820, efi: add ACPI 6.0 persistent memory types
On Tue, Apr 28, 2015 at 1:49 PM, Andy Lutomirski l...@amacapital.net wrote:
> On Tue, Apr 28, 2015 at 11:24 AM, Dan Williams dan.j.willi...@intel.com wrote:
>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>> index 11cc7d54ec3f..d38b53a7e9b2 100644
>> --- a/arch/x86/kernel/e820.c
>> +++ b/arch/x86/kernel/e820.c
>> @@ -149,6 +149,7 @@ static void __init e820_print_type(u32 type)
>>          case E820_UNUSABLE:
>>                  printk(KERN_CONT "unusable");
>>                  break;
>> +        case E820_PMEM:
>>          case E820_PRAM:
>>                  printk(KERN_CONT "persistent (type %u)", type);
>>                  break;
>
> I'd kind of like to make it more clear what's going on here. It doesn't help that the spec chose poor names. How about "NVDIMM physical aperture" for E820_PMEM and "legacy persistent RAM" for E820_PRAM?

The term "aperture" to me implies this BLK (mmio-windowed) mode of accessing persistent media that the NFIT specification introduces. In fact, those ranges are mapped E820_RESERVED. E820_PMEM really is a memory range that happens to be persistent.

> Otherwise this looks generally sensible, although I don't really understand why e820_type_to_string and e820_print_type are different.

e820_type_to_string() appears in /proc/iomem and seems to afford being more descriptive than e820_print_type() that just scrolls by in dmesg, but I'm just guessing.
Re: [Linux-nvdimm] [PATCH v2 00/20] libnd: non-volatile memory device support
On Tue, Apr 28, 2015 at 2:24 PM, Elliott, Robert (Server Storage) elli...@hp.com wrote: -Original Message- From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of Dan Williams Sent: Tuesday, April 28, 2015 1:24 PM To: linux-nvd...@lists.01.org Cc: Neil Brown; Dave Chinner; H. Peter Anvin; Christoph Hellwig; Rafael J. Wysocki; Robert Moore; Ingo Molnar; linux-a...@vger.kernel.org; Jens Axboe; Borislav Petkov; Thomas Gleixner; Greg KH; linux-kernel@vger.kernel.org; Andy Lutomirski; Andrew Morton; Linus Torvalds Subject: [Linux-nvdimm] [PATCH v2 00/20] libnd: non-volatile memory device support Changes since v1 [1]: Incorporates feedback received prior to April 24. Here are some comments on the sysfs properties reported for a pmem device. They are based on v1, but I don't think v2 changes anything. 1. This confuses lsblk (part of util-linux): /sys/block/pmem0/device/type:4 lsblk shows: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT pmem0 251:00 8G 0 worm pmem1 251:16 0 8G 0 worm pmem2 251:32 0 8G 0 worm pmem3 251:48 0 8G 0 worm pmem4 251:64 0 8G 0 worm pmem5 251:80 0 8G 0 worm pmem6 251:96 0 8G 0 worm pmem7 251:112 0 8G 0 worm lsblk's blkdev_scsi_type_to_name() considers 4 to mean SCSI_TYPE_WORM (write once read many ... used for certain optical and tape drives). Why is lsblk assuming these are scsi devices? I'll need to go check that out. I'm not sure what nd and pmem are doing to result in that value. That is their libnd specific device type number from include/uapi/ndctl.h. 4 == ND_DEVICE_NAMESPACE_IO. lsblk has no business interpreting this as something SCSI specific. 2. To avoid confusing software trying to detect fast storage vs. slow storage devices via sysfs, this value should be 0: /sys/block/pmem0/queue/rotational:1 That can be done by adding this shortly after the blk_alloc_queue call: queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem-pmem_queue); Yeah, good catch. 3. Is there any reason to have a 512 KiB limit on the transfer length? 
> /sys/block/pmem0/queue/max_hw_sectors_kb:512
>
> That is from:
>     blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);

I'd only change this from the default if performance testing showed it made a non-trivial difference.

> 4. These are read-writeable, but IOs never reach a queue, so the queue size is irrelevant and merging never happens:
> /sys/block/pmem0/queue/nomerges:0
> /sys/block/pmem0/queue/nr_requests:128
>
> Consider making them both read-only with:
> * nomerges set to 2 (no merging happening)
> * nr_requests as small as the block layer allows to avoid wasting memory.
>
> 5. No scatter-gather lists are created by the driver, so these read-only fields are meaningless:
> /sys/block/pmem0/queue/max_segments:128
> /sys/block/pmem0/queue/max_segment_size:65536
>
> Is there a better way to report them as irrelevant?

Again it comes back to the question of whether these default settings are actively harmful.

> 6. There is no completion processing, so the read-writeable cpu affinity is not used:
> /sys/block/pmem0/queue/rq_affinity:0
>
> Consider making it read-only and set to 2, meaning the completions always run on the requesting CPU.

There are no completions with pmem, the entire I/O path is synchronous. Ideally, this attribute would disappear for a pmem queue, not be set to 2.

> 7. With mmap() allowing less than logical block sized accesses to the device, this could be considered misleading:
> /sys/block/pmem0/queue/physical_block_size:512

I don't see how it is misleading. If you access it as a block device the block size is 512. If the application is mmap() + DAX aware it knows that the physical_block_size is being bypassed.

> Perhaps that needs to be 1 byte or a cacheline size (64 bytes on x86) to indicate that direct partial logical block accesses are possible.

No, because that breaks the definition of a block device. Through the bdev interface it's always accessed a block at a time.

> The btt driver could report 512 as one indication it is different. I wouldn't be surprised if smaller values than the logical block size confused some software, though.

Precisely why we shouldn't go there with pmem.
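For reference, the 512 reported in max_hw_sectors_kb in item 3 follows directly from the 1024 passed to blk_queue_max_hw_sectors(), because block-layer sectors are 512 bytes. A minimal sketch of that arithmetic; sectors_to_kib() is a hypothetical helper written for this note, not a kernel function.

```c
#include <stdint.h>

/* Block-layer sectors are 512 bytes, i.e. 1 << 9. */
#define SECTOR_SHIFT_BITS 9

/* Convert a sector count into KiB: sectors * 512 / 1024. */
uint64_t sectors_to_kib(uint64_t sectors)
{
	return (sectors << SECTOR_SHIFT_BITS) >> 10;
}
```

So 1024 sectors is 512 KiB, which matches the sysfs value Robert quotes.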
Re: [Linux-nvdimm] [PATCH v2 08/20] libnd, nd_acpi: regions (block-data-window, persistent memory, volatile memory)
On Wed, Apr 29, 2015 at 8:53 AM, Elliott, Robert (Server Storage) elli...@hp.com wrote:
> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of Dan Williams
> Sent: Tuesday, April 28, 2015 1:25 PM
> Subject: [Linux-nvdimm] [PATCH v2 08/20] libnd, nd_acpi: regions (block-data-window, persistent memory, volatile memory)
>
>> A region device represents the maximum capacity of a BLK range (mmio block-data-window(s)), or a PMEM range (DAX-capable persistent memory or volatile memory), without regard for aliasing. Aliasing, in the dimm-local address space (DPA), is resolved by metadata on a dimm to designate which exclusive interface will access the aliased DPA ranges. Support for the per-dimm metadata/label arrives in a subsequent patch.
>>
>> The name format of region devices is regionN where, like dimms, N is a global ida index assigned at discovery time. This id is not reliable across reboots nor in the presence of hotplug. Look to attributes of the region or static id-data of the sub-namespace to generate a persistent name.
> ...
>> +++ b/drivers/block/nd/region_devs.c
> ...
>> +static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
>> +                struct nd_region_desc *ndr_desc, struct device_type *dev_type)
>> +{
>> +        struct nd_region *nd_region;
>> +        struct device *dev;
>> +        u16 i;
>> +
>> +        for (i = 0; i < ndr_desc->num_mappings; i++) {
>> +                struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
>> +                struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
>> +
>> +                if ((nd_mapping->start | nd_mapping->size) % SZ_4K) {
>> +                        dev_err(&nd_bus->dev, "%pf: %s mapping%d is not 4K aligned\n",
>> +                                        __builtin_return_address(0),
>
> Please use KiB rather than the unclear K.

Ok.

> Same comment for a dev_dbg print in patch 14.

It's a debug statement, but ok.

[..]
> Could this include "nd" in the name, like "ndregion%d"?
>
> The other dev_set_name calls in this patch set use:
>     btt%d
>     ndbus%d
>     nmem%d
>     namespace%d.%d
>
> which are a bit more distinctive.

They sit on an nd bus and don't have global device nodes, so I don't see a need to make them any more distinctive.
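The alignment test in the quoted nd_region_create() hunk relies on a small trick: OR-ing start and size together and checking the result once catches a misaligned start or a misaligned size. A standalone sketch of the same check, with mapping_is_4k_aligned() as a hypothetical name and SZ_4K defined locally instead of coming from the kernel headers:

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K 0x1000ULL	/* 4 KiB, defined locally for this sketch */

/*
 * True only if both start and size are multiples of 4 KiB.
 * Any low bit set in either value survives the OR, so a single
 * modulo covers both operands.
 */
bool mapping_is_4k_aligned(uint64_t start, uint64_t size)
{
	return ((start | size) % SZ_4K) == 0;
}
```

A mapping at 0x1000 with size 0x2800 fails the check because of the size, and one at 0x800 with size 0x1000 fails because of the start.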
Re: [GIT] Networking
On Wed, 2015-04-29 at 17:17 +0200, D.S. Ljungmark wrote: On 29/04/15 16:51, Denys Vlasenko wrote: On Wed, Apr 1, 2015 at 9:48 PM, David Miller da...@davemloft.net wrote: D.S. Ljungmark (1): ipv6: Don't reduce hop limit for an interface https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6fd99094de2b83d1d4c8457f2c83483b2828e75a I was testing this change and apparently it doesn't close the hole. The python script I use to send RAs: #!/usr/bin/env python import sys import time import scapy.all from scapy.layers.inet6 import * ip = IPv6() # ip.dst = 'ff02::1' ip.dst = sys.argv[1] icmp = ICMPv6ND_RA() icmp.chlim = 1 for x in range(10): send(ip/icmp) time.sleep(1) # ./ipv6-hop-limit.py fe80::21e:37ff:fed0:5006 . Sent 1 packets. ...10 times... Sent 1 packets. After I do this, on the targeted machine I check hop_limits: # for f in /proc/sys/net/ipv6/conf/*/hop_limit; do echo -n $f:; cat $f; done /proc/sys/net/ipv6/conf/all/hop_limit:64 /proc/sys/net/ipv6/conf/default/hop_limit:64 /proc/sys/net/ipv6/conf/enp0s25/hop_limit:1 === THIS /proc/sys/net/ipv6/conf/lo/hop_limit:64 /proc/sys/net/ipv6/conf/wlp3s0/hop_limit:64 As you see, the interface which received RAs still lowered its hop_limit to 1. I take it means that the bug is still present (right? I'm not a network guy...). It might not be present in the _kernel_. Do you run NetworkManager on your system? If so, see below. I triple-checked that I do run the kernel with the fix. Further investigation shows that the code touched by the fix is not even reached, hop_limit is changed elsewhere. I'm willing to test additional patches. NetworkManager had it's own re-implementation of the bug. 
> NetworkManager had its own re-implementation of the bug. It got fixed with NetworkManager commit:
>
> commit bdaaf9849b0cacf131b71fa2ae168f5db796874f
> Author: Thomas Haller thal...@redhat.com
> Date: Wed Apr 8 15:54:30 2015 +0200
>
>     platform: don't accept lowering IPv6 hop-limit from RA (CVE-2015-2924)
>
> Before that commit, NetworkManager would take the RA packet, extract the hop limit, and write it to the sysctl itself.

Yup, we basically followed the original kernel logic here, so we needed to patch it in NM as well. It's been backported to NM 0.9.10, 1.0, and obviously is in git master.

Dan
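The policy named in both commit subjects ("don't reduce" / "don't accept lowering" the hop limit from an RA) can be sketched as a pure function. This is an illustration of the rule as described in this thread, not the text of either patch; apply_ra_hop_limit() is a hypothetical name, and treating an advertised value of 0 as "unspecified" follows the Router Advertisement definition in RFC 4861.

```c
#include <stdint.h>

/*
 * Return the hop limit an interface should use after receiving a
 * Router Advertisement, under a "never lower it" policy:
 *  - advertised == 0 means the router left it unspecified, so keep ours;
 *  - a higher advertised value is accepted;
 *  - an equal or lower advertised value is ignored.
 */
uint8_t apply_ra_hop_limit(uint8_t current, uint8_t advertised)
{
	if (advertised == 0)
		return current;
	if (advertised > current)
		return advertised;
	return current;
}
```

With this rule, the test script's RA carrying chlim = 1 would leave an interface at its default of 64 instead of dropping it to 1.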
Re: [Linux-nvdimm] [PATCH v2 11/20] libnd, nd_pmem: add libnd support to the pmem driver
On Tue, Apr 28, 2015 at 3:58 PM, Andy Lutomirski l...@amacapital.net wrote:
> On Tue, Apr 28, 2015 at 3:21 PM, Phil Pokorny ppoko...@penguincomputing.com wrote:
>> On Tue, Apr 28, 2015 at 2:04 PM, Andy Lutomirski l...@amacapital.net wrote:
>>> On Tue, Apr 28, 2015 at 11:25 AM, Dan Williams dan.j.willi...@intel.com wrote:
[..]
> This is such a mess that I think this driver should maybe flat-out refuse to load in this type of configuration without some scary module option.
>
> I have some NVDIMMs that report as type 12 but need two extra out-of-tree drivers to work safely. First, they need i2c_imc or the equivalent (I'll try to resubmit that soon). Second, they need secret magic NDAed register poking. The latter is very problematic.
>
> At the very least, I think we should discourage people who don't really know what they're doing from using this driver without care.

The benefit of the type-12 experiment having not made it very far out of the lab is that it may be feasible to whitelist known platforms where we believe ADR is available. Otherwise, the presence of the NFIT asserts platform persistent memory support.
Re: [PATCH v2 20/20] libnd, nd_acpi, nd_blk: driver for BLK-mode access persistent memory
On Tue, Apr 28, 2015 at 4:06 PM, Andy Lutomirski l...@amacapital.net wrote: On Tue, Apr 28, 2015 at 3:30 PM, Dan Williams dan.j.willi...@intel.com wrote: On Tue, Apr 28, 2015 at 2:10 PM, Andy Lutomirski l...@amacapital.net wrote: On Tue, Apr 28, 2015 at 11:26 AM, Dan Williams dan.j.willi...@intel.com wrote: From: Ross Zwisler ross.zwis...@linux.intel.com The libnd implementation handles allocating dimm address space (DPA) between PMEM and BLK mode interfaces. After DPA has been allocated from a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O as a struct bio based block device. Unlike PMEM, BLK is required to handle platform specific details like mmio register formats and memory controller interleave. For this reason the libnd generic nd_blk driver calls back into the bus provider to carry out the I/O. This initial implementation handles the BLK interface defined by the ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from DCR (dimm control region), BDW (block data window), IDT (interleave descriptor) NFIT structures and the hardware register format. [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf Cc: Andy Lutomirski l...@amacapital.net Cc: Boaz Harrosh b...@plexistor.com Cc: H. 
Peter Anvin h...@zytor.com Cc: Jens Axboe ax...@fb.com Cc: Ingo Molnar mi...@kernel.org Cc: Christoph Hellwig h...@lst.de Signed-off-by: Ross Zwisler ross.zwis...@linux.intel.com Signed-off-by: Dan Williams dan.j.willi...@intel.com --- drivers/block/nd/Kconfig | 12 + drivers/block/nd/Makefile |3 drivers/block/nd/acpi.c | 422 +++-- drivers/block/nd/acpi_nfit.h | 47 drivers/block/nd/blk.c| 264 +++ drivers/block/nd/libnd.h | 11 + drivers/block/nd/namespace_devs.c | 47 drivers/block/nd/nd-private.h |3 drivers/block/nd/nd.h | 16 + drivers/block/nd/region.c |8 + drivers/block/nd/region_devs.c| 65 +- drivers/block/nd/test/nfit.c | 29 +++ drivers/block/nd/test/nfit_test.h |2 13 files changed, 891 insertions(+), 38 deletions(-) create mode 100644 drivers/block/nd/blk.c diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig index 612bf2b14283..bac4290129fc 100644 --- a/drivers/block/nd/Kconfig +++ b/drivers/block/nd/Kconfig @@ -95,6 +95,18 @@ config BLK_DEV_PMEM Say Y if you want to use a NVDIMM described by ACPI, E820, etc... +config ND_BLK + tristate BLK: Block data window (aperture) device support + depends on LIBND + default ND_ACPI + help + This driver performs I/O using a set of mmio windows on a + dimm. The set of apertures will all access the one DIMM. + Multiple windows allow multiple threads to have a different + portions of the dimm open at one time. + + Say Y if you want to use a NVDIMM with BLK-mode capability + This describes how it works, not what it is. How about: This driver exposes NVDIMM BLK regions as block devices. BLK regions are regions of NVDIMM storage that are sector-addressable, not byte-addressible, and do not support DAX. They *are* byte-addressable albeit through an indirection window. The indirection windows are too small for DAX to be a viable access mode. Right, I was assuming incorrectly that the sector-atomic thing was a necessary part of BLK, or at least of this implementation. 
Anyway, I think my point stands: let's describe what these drivers do from a user's perspective, not how they work.

Agreed. How about:

Support NVDIMMs, or other devices, that implement a BLK-mode access capability. BLK-mode access uses memory-mapped-i/o apertures to access persistent media.

Say Y if your platform firmware emits an ACPI.NFIT table (CONFIG_ND_ACPI), or otherwise exposes BLK-mode capabilities.
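To make "byte-addressable albeit through an indirection window" concrete, here is a toy Python model of windowed access. It is only a sketch under stated assumptions: the class name `ApertureDevice`, the single-window layout, and the `window_base` field standing in for a control register are all hypothetical and are not taken from the nd_blk driver or the NFIT register format.

```python
class ApertureDevice:
    """Toy model of windowed (BLK-style) access; not the nd_blk driver."""

    def __init__(self, media: bytearray, window_size: int):
        self.media = media              # stands in for the persistent media
        self.window_size = window_size  # bytes visible through the aperture
        self.window_base = 0            # media offset the aperture maps now

    def _select(self, offset: int) -> int:
        # "Program" the hypothetical control register so the window covers
        # `offset`, then return the position of `offset` inside the window.
        self.window_base = (offset // self.window_size) * self.window_size
        return offset - self.window_base

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        while length > 0:
            pos = self._select(offset)
            chunk = min(length, self.window_size - pos)
            start = self.window_base + pos
            out += self.media[start:start + chunk]
            offset += chunk
            length -= chunk
        return bytes(out)

    def write(self, offset: int, data: bytes) -> None:
        done = 0
        while done < len(data):
            pos = self._select(offset + done)
            chunk = min(len(data) - done, self.window_size - pos)
            start = self.window_base + pos
            self.media[start:start + chunk] = data[done:done + chunk]
            done += chunk
```

Every access first has to move the window, which is why a small aperture suits block I/O but not a DAX-style direct mapping.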
Re: [GIT] Networking
On Wed, 2015-04-29 at 18:55 +0200, D.S. Ljungmark wrote:
On 29/04/15 18:50, Dan Williams wrote:
On Wed, 2015-04-29 at 17:17 +0200, D.S. Ljungmark wrote:
On 29/04/15 16:51, Denys Vlasenko wrote:
On Wed, Apr 1, 2015 at 9:48 PM, David Miller da...@davemloft.net wrote:

D.S. Ljungmark (1): ipv6: Don't reduce hop limit for an interface

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6fd99094de2b83d1d4c8457f2c83483b2828e75a

I was testing this change and apparently it doesn't close the hole. The python script I use to send RAs:

    #!/usr/bin/env python
    import sys
    import time
    import scapy.all
    from scapy.layers.inet6 import *
    ip = IPv6()
    # ip.dst = 'ff02::1'
    ip.dst = sys.argv[1]
    icmp = ICMPv6ND_RA()
    icmp.chlim = 1
    for x in range(10):
        send(ip/icmp)
        time.sleep(1)

    # ./ipv6-hop-limit.py fe80::21e:37ff:fed0:5006
    .
    Sent 1 packets.
    ...10 times...
    Sent 1 packets.

After I do this, on the targeted machine I check hop_limits:

    # for f in /proc/sys/net/ipv6/conf/*/hop_limit; do echo -n $f:; cat $f; done
    /proc/sys/net/ipv6/conf/all/hop_limit:64
    /proc/sys/net/ipv6/conf/default/hop_limit:64
    /proc/sys/net/ipv6/conf/enp0s25/hop_limit:1 === THIS
    /proc/sys/net/ipv6/conf/lo/hop_limit:64
    /proc/sys/net/ipv6/conf/wlp3s0/hop_limit:64

As you see, the interface which received RAs still lowered its hop_limit to 1. I take it this means that the bug is still present (right? I'm not a network guy...).

It might not be present in the _kernel_. Do you run NetworkManager on your system? If so, see below.

I triple-checked that I do run the kernel with the fix. Further investigation shows that the code touched by the fix is not even reached, hop_limit is changed elsewhere. I'm willing to test additional patches.

NetworkManager had its own re-implementation of the bug. It got fixed with NetworkManager commit:

    commit bdaaf9849b0cacf131b71fa2ae168f5db796874f
    Author: Thomas Haller thal...@redhat.com
    Date: Wed Apr 8 15:54:30 2015 +0200

    platform: don't accept lowering IPv6 hop-limit from RA (CVE-2015-2924)

Before that commit, NetworkManager would take the RA packet, extract the hop limit, and write it to the sysctl itself.

Yup, we basically followed the original kernel logic here, so we needed to patch it in NM as well. It's been backported to NM 0.9.10, 1.0, and obviously is in git master.

Are there any release announcements for NetworkManager? Or a place to link for official releases/homepage?

The mailing list: https://mail.gnome.org/mailman/listinfo/networkmanager-list
The project site: https://wiki.gnome.org/Projects/NetworkManager

Dan
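Going by the two commit titles quoted in this thread, the rule is that a Router Advertisement may raise an interface's hop limit but must not lower it. A minimal Python sketch of that check follows. The function name `apply_ra_hop_limit` is hypothetical (neither the kernel nor NetworkManager exposes such a helper), and treating an advertised value of 0 as "unspecified" follows RFC 4861 rather than anything stated in the thread.

```python
def apply_ra_hop_limit(current: int, advertised: int) -> int:
    """Return the hop limit to use after seeing a Router Advertisement."""
    if advertised == 0:
        # 0 means "unspecified by this router": keep the current value.
        return current
    if advertised < current:
        # Ignore attempts to lower the hop limit (the CVE-2015-2924 case).
        return current
    return advertised
```

With this rule the test script above, which advertises a hop limit of 1, would leave an interface sitting at 64 unchanged. The check has to live in every component that writes the sysctl, which is why fixing the kernel alone did not help on a NetworkManager system.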
Re: [Linux-nvdimm] [PATCH v2 10/20] pmem: use ida
On Wed, Apr 29, 2015 at 11:25 AM, Toshi Kani toshi.k...@hp.com wrote:

Hi Dan,

Thanks for the update. This version of the patchset enumerates our NFIT table properly. :-)

On Tue, 2015-04-28 at 14:25 -0400, Dan Williams wrote:

In preparation for the pmem driver attaching to pmem-namespaces emitted by libnd, convert it to use an ida instead of an always increasing atomic index. This provides a bit of stability to pmem device names in the presence of driver re-bind events.
 :
@@ -122,20 +123,26 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
 {
 	struct pmem_device *pmem;
 	struct gendisk *disk;
-	int idx, err;
+	int err;

 	err = -ENOMEM;
 	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
 	if (!pmem)
 		goto out;

+	pmem->id = ida_simple_get(&pmem_ida, 0, 0, GFP_KERNEL);

nd_pmem_probe() is called asynchronously via async_schedule_domain(). We have seen a case where the region#-pmem# binding becomes inconsistent across a reboot when there are 8 NVDIMM cards (reported by Robert Elliott). This leads the user to access the wrong device. I think the pmem id needs to be assigned before async_schedule_domain(), and cascaded to nd_pmem_probe().

I'll take a look at making this better, but it will never be bulletproof. For the same reason that root=UUID=uuid is preferred over root=/dev/sda, userspace should never rely on consistent pmem device names from boot to boot.
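The difference between an always-increasing index and an ida can be shown with a small userspace model. This is a sketch only: `LowestFreeIds` is a hypothetical stand-in that mimics the "lowest free id" behaviour of ida_simple_get() with no upper bound, and it says nothing about the kernel's actual data structure.

```python
import itertools


class LowestFreeIds:
    """Toy stand-in for an ida: hands out the lowest unused id >= 0."""

    def __init__(self):
        self._used = set()

    def get(self) -> int:
        # Scan upward from 0 and take the first id that is not in use.
        new_id = next(i for i in itertools.count() if i not in self._used)
        self._used.add(new_id)
        return new_id

    def remove(self, id_: int) -> None:
        # Releasing an id makes it available to the next get().
        self._used.discard(id_)


ids = LowestFreeIds()
first, second = ids.get(), ids.get()  # think pmem0 and pmem1
ids.remove(first)                     # driver unbinds from the first device
rebound = ids.get()                   # re-bind reuses 0; a counter would give 2
```

This gives stable names across a re-bind, but it does not address the ordering concern raised above: if several probes run asynchronously, whichever calls `get()` first still receives id 0.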
Re: [Linux-nvdimm] [PATCH v2 10/20] pmem: use ida
On Wed, Apr 29, 2015 at 1:49 PM, Linda Knippers linda.knipp...@hp.com wrote:
On 4/29/2015 2:53 PM, Toshi Kani wrote:

What's the right answer for this in the long run?

Short term, /dev/disk/by-uuid to take a stable identifier from the contents of the device. Longer term, teach udev to populate /dev/disk/by-id with stable names for libnd devices. The trick is identifiers for interleaved PMEM ranges comprised of multiple physical devices. I'm thinking something like /dev/disk/by-id/nd-set_cookie
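For the short-term suggestion, userspace would look a device up by its stable symlink instead of by its /dev/pmemX name. A minimal Python sketch, assuming the usual udev layout in which entries under /dev/disk/by-uuid are symlinks to device nodes; the helper name `find_by_uuid` is hypothetical:

```python
import os


def find_by_uuid(uuid, by_uuid_dir="/dev/disk/by-uuid"):
    """Return the device node a by-uuid symlink points at, or None."""
    link = os.path.join(by_uuid_dir, uuid)
    if not os.path.islink(link):
        # No such identifier is currently populated.
        return None
    # Resolve the symlink to whatever name the kernel picked this boot.
    return os.path.realpath(link)
```

The caller never hard-codes the kernel-assigned name, so it keeps working if the same media shows up under a different pmem index after a reboot.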
Re: [PATCH v2 02/20] libnd, nd_acpi: initial libnd infrastructure and NFIT support
On Thu, Apr 30, 2015 at 6:21 PM, Rafael J. Wysocki r...@rjwysocki.net wrote: On Thursday, April 30, 2015 05:39:06 PM Dan Williams wrote: On Thu, Apr 30, 2015 at 4:23 PM, Rafael J. Wysocki r...@rjwysocki.net wrote: [..] +if ND_DEVICES + +config LIBND + tristate LIBND: libnd device driver support + help + Platform agnostic device model for a libnd bus. Publishes + resources for a PMEM (persistent-memory) driver and/or BLK + (sliding mmio window(s)) driver to attach. Exposes a device + topology under a ndX bus device, a /dev/ndctlX bus-ioctl + message passing interface, and a /dev/nmemX dimm-ioctl + message interface for each memory device registered on the + bus. instance. A userspace library ndctl provides an API + to enumerate/manage this subsystem. + +config ND_ACPI + tristate ACPI: NFIT to libnd bus support + select LIBND + depends on ACPI + help + Infrastructure to probe ACPI 6 compliant platforms for + NVDIMMs (NFIT) and register a libnd device tree. In + addition to storage devices this also enables libnd craft + ACPI._DSM messages for platform/dimm configuration. I'm wondering if the two CONFIG options above really need to be user-selectable? For example, what reason people (who've already selected ND_DEVICES) may have for not selecting ND_ACPI if ACPI is set? Later on in the series we introduce ND_E820 which supports creating a libnd-bus from e820-type-12 memory ranges on pre-NFIT systems. I'm also considering a configfs defined libnd-bus because e820 types are not nearly enough information to safely define nvdimm resources outside of NFIT. I hope these are not mutually exclusive with ND_ACPI? Otherwise distros will have problems with supporting them in one kernel. You can have ND_E820 support and ND_ACPI support in the same system. Likely an NFIT enabled system will never have e820-type-12 ranges, but if a user messes up and uses the new memmap=ss!nn command line to overlap NFIT-defined memory then the request_mem_region() calls in the driver will collide. 
First to load wins in that scenario. If ND_E820 and ND_ACPI aren't mutually exclusive, I still don't see a good enough reason for asking users about ND_ACPI. Why would I ever say No here if I said Yes or Module to ND_DEVICES? I agree that if the user selects ND_DEVICES then ND_ACPI should probably default on, but otherwise turning it off is a useful option. If you know your system is pre-ACPI-6 then why bother including support? + +endif diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile new file mode 100644 index ..944b5947c0cb --- /dev/null +++ b/drivers/block/nd/Makefile @@ -0,0 +1,6 @@ +obj-$(CONFIG_LIBND) += libnd.o +obj-$(CONFIG_ND_ACPI) += nd_acpi.o + +nd_acpi-y := acpi.o + +libnd-y := core.o OK, so it looks like no modules, just built-in code, right? Um, no, both CONFIG_ND_ACPI and CONFIG_LIBND can be =m. OK [cut] +static int nd_acpi_remove(struct acpi_device *adev) +{ + struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(adev-dev); + + nd_bus_unregister(acpi_desc-nd_bus); + return 0; +} + +static void nd_acpi_notify(struct acpi_device *adev, u32 event) +{ + /* TODO: handle ACPI_NOTIFY_BUS_CHECK notification */ + dev_dbg(adev-dev, %s: event: %d\n, __func__, event); +} + +static const struct acpi_device_id nd_acpi_ids[] = { + { ACPI0012, 0 }, + { , 0 }, +}; +MODULE_DEVICE_TABLE(acpi, nd_acpi_ids); + +static struct acpi_driver nd_acpi_driver = { + .name = KBUILD_MODNAME, + .ids = nd_acpi_ids, + .flags = ACPI_DRIVER_ALL_NOTIFY_EVENTS, + .ops = { + .add = nd_acpi_add, + .remove = nd_acpi_remove, + .notify = nd_acpi_notify + }, +}; Since this is going to be non-modular built-in code, please use an ACPI scan handler instead of using a driver here. acpi_memhotplug.c does that, you can use it as an example, but I guess you don't need to enable hotplug for it to start with. No, you misunderstood, this will certainly be modular and loaded on-demand. OK So please drop the .notify thing at least for now. It most likely doesn't do what you need anyway. 
The .notify handler will eventually be filled in to handle hot-add of NFIT structures, but yes I'll drop it for now.
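The "first to load wins" outcome described earlier in this message, where a memmap= range overlaps NFIT-defined memory, follows from exclusive range claims. A toy Python model of that behaviour is below. It is a sketch, not the kernel resource tree: `RegionTable` and its `request` method are hypothetical names, and the only property borrowed from request_mem_region() is that a conflicting claim fails.

```python
class RegionTable:
    """Toy model of exclusive physical-address range claims."""

    def __init__(self):
        self._claims = []  # list of (start, end_exclusive, owner)

    def request(self, start: int, size: int, owner: str) -> bool:
        end = start + size
        for s, e, _ in self._claims:
            # Two half-open ranges overlap if each starts before the
            # other one ends.
            if start < e and s < end:
                return False
        self._claims.append((start, end, owner))
        return True


table = RegionTable()
e820_ok = table.request(0x100000000, 0x40000000, "nd_e820")  # loads first
acpi_ok = table.request(0x120000000, 0x1000, "nd_acpi")      # overlaps, fails
```

Whichever provider registers its range first keeps it, and the later one sees its claim rejected.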
Re: [PATCH v2 02/20] libnd, nd_acpi: initial libnd infrastructure and NFIT support
On Thu, Apr 30, 2015 at 4:23 PM, Rafael J. Wysocki r...@rjwysocki.net wrote: On Tuesday, April 28, 2015 02:24:23 PM Dan Williams wrote: 1/ Autodetect an NFIT table for the ACPI namespace device with _HID of ACPI0012 2/ libnd bus registration The NFIT provided by ACPI is one possible method by which platforms will discover NVDIMM resources. However, the intent of the nd_bus_descriptor abstraction is to abstract provider specific details, leaving libnd to be independent of the specific NVDIMM resource discovery mechanism. This flexibility is later exploited later to implement custom-defined nd buses. Cc: linux-a...@vger.kernel.org Cc: Robert Moore robert.mo...@intel.com Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com Signed-off-by: Dan Williams dan.j.willi...@intel.com --- drivers/block/Kconfig |2 drivers/block/Makefile|1 drivers/block/nd/Kconfig | 40 +++ drivers/block/nd/Makefile |6 + drivers/block/nd/acpi.c | 475 + drivers/block/nd/acpi_nfit.h | 254 ++ drivers/block/nd/core.c | 67 ++ drivers/block/nd/libnd.h | 33 +++ drivers/block/nd/nd-private.h | 23 ++ 9 files changed, 901 insertions(+) create mode 100644 drivers/block/nd/Kconfig create mode 100644 drivers/block/nd/Makefile create mode 100644 drivers/block/nd/acpi.c create mode 100644 drivers/block/nd/acpi_nfit.h create mode 100644 drivers/block/nd/core.c create mode 100644 drivers/block/nd/libnd.h create mode 100644 drivers/block/nd/nd-private.h diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index eb1fed5bd516..dfe40e5ca9bd 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -321,6 +321,8 @@ config BLK_DEV_NVME To compile this driver as a module, choose M here: the module will be called nvme. 
+source drivers/block/nd/Kconfig + config BLK_DEV_SKD tristate STEC S1120 Block Driver depends on PCI diff --git a/drivers/block/Makefile b/drivers/block/Makefile index 9cc6c18a1c7e..07a6acecf4d8 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -24,6 +24,7 @@ obj-$(CONFIG_CDROM_PKTCDVD) += pktcdvd.o obj-$(CONFIG_MG_DISK)+= mg_disk.o obj-$(CONFIG_SUNVDC) += sunvdc.o obj-$(CONFIG_BLK_DEV_NVME) += nvme.o +obj-$(CONFIG_ND_DEVICES) += nd/ obj-$(CONFIG_BLK_DEV_SKD)+= skd.o obj-$(CONFIG_BLK_DEV_OSD)+= osdblk.o diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig new file mode 100644 index ..6d5d6b732f82 --- /dev/null +++ b/drivers/block/nd/Kconfig @@ -0,0 +1,40 @@ +menuconfig ND_DEVICES + bool NVDIMM Support + depends on PHYS_ADDR_T_64BIT + help + Generic support for non-volatile memory devices including + ACPI-6-NFIT defined resources. On platforms that define an + NFIT, or otherwise can discover NVDIMM resources, a libnd + bus is registered to advertise PMEM (persistent memory) + namespaces (/dev/pmemX) and BLK (sliding mmio window(s)) + namespaces (/dev/ndX). A PMEM namespace refers to a memory + resource that may span multiple DIMMs and support DAX (see + CONFIG_DAX). A BLK namespace refers to an NVDIMM control + region which exposes an mmio register set for windowed + access mode to non-volatile memory. + +if ND_DEVICES + +config LIBND + tristate LIBND: libnd device driver support + help + Platform agnostic device model for a libnd bus. Publishes + resources for a PMEM (persistent-memory) driver and/or BLK + (sliding mmio window(s)) driver to attach. Exposes a device + topology under a ndX bus device, a /dev/ndctlX bus-ioctl + message passing interface, and a /dev/nmemX dimm-ioctl + message interface for each memory device registered on the + bus. instance. A userspace library ndctl provides an API + to enumerate/manage this subsystem. 
+ +config ND_ACPI + tristate ACPI: NFIT to libnd bus support + select LIBND + depends on ACPI + help + Infrastructure to probe ACPI 6 compliant platforms for + NVDIMMs (NFIT) and register a libnd device tree. In + addition to storage devices this also enables libnd craft + ACPI._DSM messages for platform/dimm configuration. I'm wondering if the two CONFIG options above really need to be user-selectable? For example, what reason people (who've already selected ND_DEVICES) may have for not selecting ND_ACPI if ACPI is set? Later on in the series we introduce ND_E820 which supports creating a libnd-bus from e820-type-12 memory ranges on pre-NFIT systems. I'm also considering a configfs defined libnd-bus because e820 types are not nearly enough information to safely define nvdimm resources outside of NFIT. + +endif diff --git a/drivers/block/nd/Makefile b
Re: [Linux-nvdimm] [PATCH v2 05/20] libnd, nd_acpi: dimm/memory-devices
On Fri, May 1, 2015 at 12:15 PM, Toshi Kani toshi.k...@hp.com wrote: On Fri, 2015-05-01 at 11:43 -0700, Dan Williams wrote: On Fri, May 1, 2015 at 11:19 AM, Toshi Kani toshi.k...@hp.com wrote: On Fri, 2015-05-01 at 11:22 -0700, Dan Williams wrote: On Fri, May 1, 2015 at 10:48 AM, Toshi Kani toshi.k...@hp.com wrote: On Tue, 2015-04-28 at 14:24 -0400, Dan Williams wrote: Register the memory devices described in the nfit as libnd 'dimm' devices on an nd bus. The kernel assigned device id for dimms is dynamic. If userspace needs a more static identifier it should consult a provider-specific attribute. In the case where NFIT is the provider, the 'nmemX/nfit/handle' or 'nmemX/nfit/serial' attributes may be used for this purpose. : + +static int nd_acpi_register_dimms(struct acpi_nfit_desc *acpi_desc) +{ + struct nfit_mem *nfit_mem; + + list_for_each_entry(nfit_mem, acpi_desc-dimms, list) { + struct nd_dimm *nd_dimm; + unsigned long flags = 0; + u32 nfit_handle; + + nfit_handle = __to_nfit_memdev(nfit_mem)-nfit_handle; + nd_dimm = nd_acpi_dimm_by_handle(acpi_desc, nfit_handle); + if (nd_dimm) { + /* + * If for some reason we find multiple DCRs the + * first one wins + */ + dev_err(acpi_desc-dev, duplicate DCR detected: %s\n, + nd_dimm_name(nd_dimm)); + continue; + } + + if (nfit_mem-bdw nfit_mem-memdev_pmem) + flags |= NDD_ALIASING; Does this check work for a NVDIMM card which has multiple pmem regions with label info, but does not have any bdw region configured? If you have multiple pmem regions then you don't have aliasing and don't need a label. You'll get an nd_namespace_io per region. The code assumes that namespace_pmem (NDD_ALIASING) and namespace_blk have label info. There may be an NVDIMM card with a single blk region without label info. I'd really like to suggest that labels are only for resolving aliasing and that if you have a BLK-only NVDIMM you'll get an automatic namespace created the same as a PMEM-only. 
Partitioning is always there to provide sub-divisions of a namespace. The only reason to support multiple BLK-namespaces per-region is to give each a different sector size. I may eventually need to relent on this position, but I'd really like to understand the use case for requiring labels when aliasing is not present as it seems like a waste to me.

By looking at the callers of is_namespace_pmem() and is_namespace_blk(), such as nd_namespace_label_update(), I am concerned that the namespace types are also used for indicating the presence of a label. Is it OK for nd_namespace_label_update() to do nothing when there is no aliasing?

Did you forget to answer this question? I am not asking to have a label. I am asking if the namespace types can handle it correctly. Restating the nd_namespace_label_update() example:
- namespace_io case: Skip, but a label may still exist. Correct?
- namespace_blk case: Proceed, but blk does not require a label.

Ah, ok. This is handled by nd_namespace_attr_visible(): only labelled namespaces have writable sysfs attributes. This would need to be extended for a label-less BLK namespace type.
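The position argued in this thread can be restated as a small decision rule: a dimm is "aliasing" only when it exposes both a block data window and a PMEM mapping, and labels are needed only to resolve that aliasing. The Python sketch below models that stated intent, not the merged code; the `Dimm` class and both helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dimm:
    has_bdw: bool          # exposes a block data window (BLK access)
    has_pmem_memdev: bool  # is also mapped into a PMEM range


def is_aliasing(dimm: Dimm) -> bool:
    # Mirrors the quoted check: both bdw and memdev_pmem must be present.
    return dimm.has_bdw and dimm.has_pmem_memdev


def needs_label(dimm: Dimm) -> bool:
    # Per the position above, labels exist only to resolve aliasing;
    # single-interface dimms get a namespace straight from the region.
    return is_aliasing(dimm)
```

Under this rule a PMEM-only dimm yields an nd_namespace_io-style namespace with no label, and a BLK-only dimm would do the same once the label-less BLK case mentioned above is supported.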
Re: [Linux-nvdimm] [PATCH v2 05/20] libnd, nd_acpi: dimm/memory-devices
On Fri, May 1, 2015 at 10:48 AM, Toshi Kani toshi.k...@hp.com wrote: On Tue, 2015-04-28 at 14:24 -0400, Dan Williams wrote: Register the memory devices described in the nfit as libnd 'dimm' devices on an nd bus. The kernel assigned device id for dimms is dynamic. If userspace needs a more static identifier it should consult a provider-specific attribute. In the case where NFIT is the provider, the 'nmemX/nfit/handle' or 'nmemX/nfit/serial' attributes may be used for this purpose. : + +static int nd_acpi_register_dimms(struct acpi_nfit_desc *acpi_desc) +{ + struct nfit_mem *nfit_mem; + + list_for_each_entry(nfit_mem, acpi_desc-dimms, list) { + struct nd_dimm *nd_dimm; + unsigned long flags = 0; + u32 nfit_handle; + + nfit_handle = __to_nfit_memdev(nfit_mem)-nfit_handle; + nd_dimm = nd_acpi_dimm_by_handle(acpi_desc, nfit_handle); + if (nd_dimm) { + /* + * If for some reason we find multiple DCRs the + * first one wins + */ + dev_err(acpi_desc-dev, duplicate DCR detected: %s\n, + nd_dimm_name(nd_dimm)); + continue; + } + + if (nfit_mem-bdw nfit_mem-memdev_pmem) + flags |= NDD_ALIASING; Does this check work for a NVDIMM card which has multiple pmem regions with label info, but does not have any bdw region configured? If you have multiple pmem regions then you don't have aliasing and don't need a label. You'll get an nd_namespace_io per region. The code assumes that namespace_pmem (NDD_ALIASING) and namespace_blk have label info. There may be an NVDIMM card with a single blk region without label info. I'd really like to suggest that labels are only for resolving aliasing and that if you have a BLK-only NVDIMM you'll get an automatic namespace created the same as a PMEM-only. Partitioning is always there to provide sub-divisions of a namespace. The only reason to support multiple BLK-namespaces per-region is to give each a different sector size. 
I may eventually need to relent on this position, but I'd really like to understand the use case for requiring labels when aliasing is not present as it seems like a waste to me.

Instead of using the namespace types to assume the label info, how about adding a flag to indicate the presence of the label info? This avoids the separation of namespace_io and namespace_pmem for the same pmem driver.

To what benefit?
Re: [Linux-nvdimm] [PATCH v2 05/20] libnd, nd_acpi: dimm/memory-devices
On Fri, May 1, 2015 at 11:19 AM, Toshi Kani toshi.k...@hp.com wrote: On Fri, 2015-05-01 at 11:22 -0700, Dan Williams wrote: On Fri, May 1, 2015 at 10:48 AM, Toshi Kani toshi.k...@hp.com wrote: On Tue, 2015-04-28 at 14:24 -0400, Dan Williams wrote: Register the memory devices described in the nfit as libnd 'dimm' devices on an nd bus. The kernel assigned device id for dimms is dynamic. If userspace needs a more static identifier it should consult a provider-specific attribute. In the case where NFIT is the provider, the 'nmemX/nfit/handle' or 'nmemX/nfit/serial' attributes may be used for this purpose. : + +static int nd_acpi_register_dimms(struct acpi_nfit_desc *acpi_desc) +{ + struct nfit_mem *nfit_mem; + + list_for_each_entry(nfit_mem, acpi_desc-dimms, list) { + struct nd_dimm *nd_dimm; + unsigned long flags = 0; + u32 nfit_handle; + + nfit_handle = __to_nfit_memdev(nfit_mem)-nfit_handle; + nd_dimm = nd_acpi_dimm_by_handle(acpi_desc, nfit_handle); + if (nd_dimm) { + /* + * If for some reason we find multiple DCRs the + * first one wins + */ + dev_err(acpi_desc-dev, duplicate DCR detected: %s\n, + nd_dimm_name(nd_dimm)); + continue; + } + + if (nfit_mem-bdw nfit_mem-memdev_pmem) + flags |= NDD_ALIASING; Does this check work for a NVDIMM card which has multiple pmem regions with label info, but does not have any bdw region configured? If you have multiple pmem regions then you don't have aliasing and don't need a label. You'll get an nd_namespace_io per region. The code assumes that namespace_pmem (NDD_ALIASING) and namespace_blk have label info. There may be an NVDIMM card with a single blk region without label info. I'd really like to suggest that labels are only for resolving aliasing and that if you have a BLK-only NVDIMM you'll get an automatic namespace created the same as a PMEM-only. Partitioning is always there to provide sub-divisions of a namespace. The only reason to support multiple BLK-namespaces per-region is to give each a different sector size. 
I may eventually need to relent on this position, but I'd really like to understand the use case for requiring labels when aliasing is not present as it seems like a waste to me.

By looking at the callers of is_namespace_pmem() and is_namespace_blk(), such as nd_namespace_label_update(), I am concerned that the namespace types are also used for indicating the presence of a label. Is it OK for nd_namespace_label_update() to do nothing when there is no aliasing?

Instead of using the namespace types to assume the label info, how about adding a flag to indicate the presence of the label info? This avoids the separation of namespace_io and namespace_pmem for the same pmem driver.

To what benefit?

Why do they need to be separated? Having alias or not should not make the pmem namespace different.

The intent is to maximize the number of devices that can be immediately attached to nd_pmem and nd_blk without user intervention. nd_namespace_io is a pmem namespace where the boundaries are 100% described by the NFIT / parent-region.
[PATCH v2 06/20] libnd: ndctl.h, the nd ioctl abi
Most configuration of the nd-subsystem is done via nd-sysfs attributes. However, some nd buses, particularly the ACPI.NFIT bus, define a small set of messages that can be passed to the platform. For convenience we derive the initial nd-ioctl-command formats directly from the NFIT DSM formats.

    ND_CMD_SMART: media health and diagnostics
    ND_CMD_GET_CONFIG_SIZE: size of the label space
    ND_CMD_GET_CONFIG_DATA: read label space
    ND_CMD_SET_CONFIG_DATA: write label space
    ND_CMD_VENDOR: vendor-specific command passthrough
    ND_CMD_ARS_CAP: report address-range-scrubbing capabilities
    ND_CMD_START_ARS: initiate scrubbing
    ND_CMD_QUERY_ARS: report on scrubbing state
    ND_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

If a platform later defines different commands than this set it is straightforward to extend support to those formats.

Most of the commands target a specific dimm. However, the address-range-scrubbing commands target the bus. The 'commands' attribute in sysfs of an nd-bus, or an nd-dimm, enumerates the supported commands for that object.

Cc: linux-a...@vger.kernel.org
Cc: Robert Moore robert.mo...@intel.com
Cc: Rafael J. 
Wysocki rafael.j.wyso...@intel.com Reported-by: Nicholas Moulin nicholas.w.mou...@linux.intel.com Signed-off-by: Dan Williams dan.j.willi...@intel.com --- drivers/block/nd/Kconfig | 12 ++ drivers/block/nd/acpi.c | 237 ++ drivers/block/nd/acpi_nfit.h |3 drivers/block/nd/bus.c| 324 - drivers/block/nd/core.c | 16 ++ drivers/block/nd/dimm_devs.c | 38 - drivers/block/nd/libnd.h | 25 +++ drivers/block/nd/nd-private.h |3 drivers/block/nd/test/nfit.c | 78 ++ include/uapi/linux/Kbuild |1 include/uapi/linux/ndctl.h| 178 +++ 11 files changed, 903 insertions(+), 12 deletions(-) create mode 100644 include/uapi/linux/ndctl.h diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig index 09f0135147ca..d2d84451e82c 100644 --- a/drivers/block/nd/Kconfig +++ b/drivers/block/nd/Kconfig @@ -37,6 +37,18 @@ config ND_ACPI addition to storage devices this also enables libnd craft ACPI._DSM messages for platform/dimm configuration. +config ND_ACPI_DEBUG + bool ACPI: Extra nd_acpi debugging + depends on ND_ACPI + depends on DYNAMIC_DEBUG + default n + help + Enabling this option causes the nd_acpi driver to dump the + input and output buffers of _DSM operations on the ACPI0012 + device and its children. This can be very verbose, so leave + it disabled unless you are debugging a hardware / firmware + issue. 
+ config NFIT_TEST tristate NFIT TEST: Manufactured NFIT for interface testing depends on DMA_CMA diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c index af6684341c9b..c46e166695f7 100644 --- a/drivers/block/nd/acpi.c +++ b/drivers/block/nd/acpi.c @@ -12,6 +12,7 @@ */ #include linux/list_sort.h #include linux/module.h +#include linux/ndctl.h #include linux/list.h #include linux/acpi.h #include acpi_nfit.h @@ -25,11 +26,160 @@ enum { NFIT_ACPI_NOTIFY_TABLE = 0x80, }; +static u8 nd_acpi_uuids[2][16]; /* initialized at nd_acpi_init */ + +static u8 *nd_acpi_bus_uuid(void) +{ + return nd_acpi_uuids[0]; +} + +static u8 *nd_acpi_dimm_uuid(void) +{ + return nd_acpi_uuids[1]; +} + +static struct acpi_nfit_desc *to_acpi_nfit_desc(struct nd_bus_descriptor *nd_desc) +{ + return container_of(nd_desc, struct acpi_nfit_desc, nd_desc); +} + +static struct acpi_device *to_acpi_dev(struct acpi_nfit_desc *acpi_desc) +{ + struct nd_bus_descriptor *nd_desc = acpi_desc-nd_desc; + + /* +* If provider == 'ACPI.NFIT' we can assume 'dev' is a struct +* acpi_device. +*/ + if (!nd_desc-provider_name + || strcmp(nd_desc-provider_name, ACPI.NFIT) != 0) + return NULL; + + return to_acpi_device(acpi_desc-dev); +} + static int nd_acpi_ctl(struct nd_bus_descriptor *nd_desc, struct nd_dimm *nd_dimm, unsigned int cmd, void *buf, unsigned int buf_len) { - return -ENOTTY; + struct acpi_nfit_desc *acpi_desc = to_acpi_nfit_desc(nd_desc); + const struct nd_cmd_desc const *desc = NULL; + union acpi_object in_obj, in_buf, *out_obj; + struct device *dev = acpi_desc-dev; + const char *cmd_name, *dimm_name; + unsigned long dsm_mask; + acpi_handle handle; + u32 offset; + int rc, i; + u8 *uuid; + + if (nd_dimm) { + struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm); + struct acpi_device *adev = nfit_mem-adev; + + if (!adev) + return -ENOTTY; + dimm_name = dev_name(adev-dev
[PATCH v2 09/20] libnd: support for legacy (non-aliasing) nvdimms
The libnd region driver is an intermediary driver that translates non-volatile regions into namespace sub-devices that are surfaced by persistent memory block-device drivers (PMEM and BLK). ACPI 6 introduces the concept that a given nvdimm may offer multiple access modes to its media through either direct PMEM load/store access, or windowed BLK mode. Existing nvdimms mostly implement a PMEM interface, some offer a BLK-like mode, but never both. If an nvdimm is single interfaced, then there is no need for dimm metadata labels. For these devices we can take the region boundaries directly to create a child namespace device (nd_namespace_io). Signed-off-by: Dan Williams dan.j.willi...@intel.com --- drivers/block/nd/Makefile |2 + drivers/block/nd/acpi.c |1 drivers/block/nd/bus.c| 26 + drivers/block/nd/core.c | 13 +++- drivers/block/nd/dimm.c |2 - drivers/block/nd/libnd.h |6 +- drivers/block/nd/namespace_devs.c | 111 + drivers/block/nd/nd-private.h |9 ++- drivers/block/nd/nd.h |8 +++ drivers/block/nd/region.c | 88 + drivers/block/nd/region_devs.c| 61 include/linux/nd.h| 10 +++ include/uapi/linux/ndctl.h| 10 +++ 13 files changed, 338 insertions(+), 9 deletions(-) create mode 100644 drivers/block/nd/namespace_devs.c create mode 100644 drivers/block/nd/region.c diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile index 6010469c4d4c..0fb0891e1817 100644 --- a/drivers/block/nd/Makefile +++ b/drivers/block/nd/Makefile @@ -23,3 +23,5 @@ libnd-y += bus.o libnd-y += dimm_devs.o libnd-y += dimm.o libnd-y += region_devs.o +libnd-y += region.o +libnd-y += namespace_devs.o diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c index 41d0bb732b3e..c3dda74f73d7 100644 --- a/drivers/block/nd/acpi.c +++ b/drivers/block/nd/acpi.c @@ -774,6 +774,7 @@ static struct attribute_group nd_acpi_region_attribute_group = { static const struct attribute_group *nd_acpi_region_attribute_groups[] = { nd_region_attribute_group, nd_mapping_attribute_group, + 
nd_device_attribute_group, nd_acpi_region_attribute_group, NULL, }; diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c index 7bd79f30e5e7..46568d182559 100644 --- a/drivers/block/nd/bus.c +++ b/drivers/block/nd/bus.c @@ -13,6 +13,7 @@ #define pr_fmt(fmt) KBUILD_MODNAME : fmt #include linux/vmalloc.h #include linux/uaccess.h +#include linux/module.h #include linux/fcntl.h #include linux/async.h #include linux/ndctl.h @@ -33,6 +34,12 @@ static int to_nd_device_type(struct device *dev) { if (is_nd_dimm(dev)) return ND_DEVICE_DIMM; + else if (is_nd_pmem(dev)) + return ND_DEVICE_REGION_PMEM; + else if (is_nd_blk(dev)) + return ND_DEVICE_REGION_BLK; + else if (is_nd_pmem(dev-parent) || is_nd_blk(dev-parent)) + return nd_region_to_namespace_type(to_nd_region(dev-parent)); return 0; } @@ -50,27 +57,46 @@ static int nd_bus_match(struct device *dev, struct device_driver *drv) return test_bit(to_nd_device_type(dev), nd_drv-type); } +static struct module *to_bus_provider(struct device *dev) +{ + /* pin bus providers while regions are enabled */ + if (is_nd_pmem(dev) || is_nd_blk(dev)) { + struct nd_bus *nd_bus = walk_to_nd_bus(dev); + + return nd_bus-module; + } + return NULL; +} + static int nd_bus_probe(struct device *dev) { struct nd_device_driver *nd_drv = to_nd_device_driver(dev-driver); + struct module *provider = to_bus_provider(dev); struct nd_bus *nd_bus = walk_to_nd_bus(dev); int rc; + if (!try_module_get(provider)) + return -ENXIO; + rc = nd_drv-probe(dev); dev_dbg(nd_bus-dev, %s.probe(%s) = %d\n, dev-driver-name, dev_name(dev), rc); + if (rc != 0) + module_put(provider); return rc; } static int nd_bus_remove(struct device *dev) { struct nd_device_driver *nd_drv = to_nd_device_driver(dev-driver); + struct module *provider = to_bus_provider(dev); struct nd_bus *nd_bus = walk_to_nd_bus(dev); int rc; rc = nd_drv-remove(dev); dev_dbg(nd_bus-dev, %s.remove(%s) = %d\n, dev-driver-name, dev_name(dev), rc); + module_put(provider); return rc; } diff --git 
a/drivers/block/nd/core.c b/drivers/block/nd/core.c index d8d1c9cb3f16..646e424ae36c 100644 --- a/drivers/block/nd/core.c +++ b/drivers/block/nd/core.c @@ -133,8 +133,8 @@ struct attribute_group nd_bus_attribute_group = { }; EXPORT_SYMBOL_GPL(nd_bus_attribute_group); -struct nd_bus *nd_bus_register(struct device *parent
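The bus.c hunk above classifies a device by kind (dimm, PMEM region, BLK region, or child of a region) and then lets a driver bind when the resulting type number is set in the driver's type mask. A minimal user-space sketch of that idea follows, assuming made-up type codes and a simplified device struct. Every toy_ name is hypothetical and is not part of libnd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device kinds standing in for is_nd_dimm()/is_nd_pmem()/is_nd_blk(). */
enum toy_kind {
	TOY_KIND_OTHER,
	TOY_KIND_DIMM,
	TOY_KIND_PMEM_REGION,
	TOY_KIND_BLK_REGION,
};

/* Hypothetical type codes; 0 means "no libnd type", mirroring the "return 0" fallthrough. */
enum {
	TOY_TYPE_DIMM = 1,
	TOY_TYPE_REGION_PMEM,
	TOY_TYPE_REGION_BLK,
	TOY_TYPE_NAMESPACE_IO,
};

struct toy_device {
	enum toy_kind kind;
	const struct toy_device *parent;
};

static bool toy_is_region(const struct toy_device *dev)
{
	return dev && (dev->kind == TOY_KIND_PMEM_REGION ||
			dev->kind == TOY_KIND_BLK_REGION);
}

/* Map a device to its type code; children of a region become namespaces. */
static int toy_device_type(const struct toy_device *dev)
{
	assert(dev != NULL);
	if (dev->kind == TOY_KIND_DIMM)
		return TOY_TYPE_DIMM;
	if (dev->kind == TOY_KIND_PMEM_REGION)
		return TOY_TYPE_REGION_PMEM;
	if (dev->kind == TOY_KIND_BLK_REGION)
		return TOY_TYPE_REGION_BLK;
	if (toy_is_region(dev->parent))
		return TOY_TYPE_NAMESPACE_IO;
	return 0;
}

/* A driver matches when the bit for the device's type is set in its mask. */
static bool toy_bus_match(const struct toy_device *dev, unsigned long drv_type_mask)
{
	return (drv_type_mask >> toy_device_type(dev)) & 1UL;
}
```

In this sketch a driver whose mask is `1UL << TOY_TYPE_NAMESPACE_IO` binds to a child of a region but not to the region itself, which is the separation the patch relies on.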
[PATCH v2 04/20] libnd: ndctl class device, and nd bus attributes
This is the position (device topology) independent method to find all
the libnd buses in the system. The expectation is that there will only
ever be one nd bus discovered via /sys/class/nd/ndctl0. However, we
allow for the possibility of multiple buses and they will be listed in
discovery order as ndctl0...ndctlN. This character device hosts the
ioctl for passing control messages (inspired by the ACPI-NFIT DSM
interface commands). Note, nd_ioctl() and the backing ->ndctl()
implementation are defined in a subsequent patch.

Cc: Neil Brown ne...@suse.de
Cc: Greg KH gre...@linuxfoundation.org
Cc: linux-a...@vger.kernel.org
Cc: Robert Moore robert.mo...@intel.com
Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Makefile     |  1
 drivers/block/nd/acpi.c       | 29 ++
 drivers/block/nd/acpi_nfit.h  |  5 ++
 drivers/block/nd/bus.c        | 83 +++
 drivers/block/nd/core.c       | 87 -
 drivers/block/nd/libnd.h      |  5 ++
 drivers/block/nd/nd-private.h |  6 +++
 drivers/block/nd/test/nfit.c  |  3 +
 8 files changed, 217 insertions(+), 2 deletions(-)
 create mode 100644 drivers/block/nd/bus.c

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index cf064db92589..7defe18ed009 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_NFIT_TEST) += test/
 nd_acpi-y := acpi.o
 
 libnd-y := core.o
+libnd-y += bus.o
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index 54344ef9c837..dd8505f766ed 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -341,6 +341,34 @@ static int nfit_mem_init(struct acpi_nfit_desc *acpi_desc)
 	return 0;
 }
 
+static ssize_t revision_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_bus *nd_bus = to_nd_bus(dev);
+	struct nd_bus_descriptor *nd_desc = to_nd_desc(nd_bus);
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+
+	return sprintf(buf, "%d\n", acpi_desc->nfit->revision);
+}
+static DEVICE_ATTR_RO(revision);
+
+static struct attribute *nd_acpi_attributes[] = {
+	&dev_attr_revision.attr,
+	NULL,
+};
+
+static struct attribute_group nd_acpi_attribute_group = {
+	.name = "nfit",
+	.attrs = nd_acpi_attributes,
+};
+
+const struct attribute_group *nd_acpi_attribute_groups[] = {
+	&nd_bus_attribute_group,
+	&nd_acpi_attribute_group,
+	NULL,
+};
+EXPORT_SYMBOL_GPL(nd_acpi_attribute_groups);
+
 int nd_acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 {
 	struct device *dev = acpi_desc->dev;
@@ -408,6 +436,7 @@ static int nd_acpi_add(struct acpi_device *adev)
 	nd_desc = &acpi_desc->nd_desc;
 	nd_desc->provider_name = "ACPI.NFIT";
 	nd_desc->ndctl = nd_acpi_ctl;
+	nd_desc->attr_groups = nd_acpi_attribute_groups;
 
 	acpi_desc->nd_bus = nd_bus_register(dev, nd_desc);
 	if (!acpi_desc->nd_bus)
diff --git a/drivers/block/nd/acpi_nfit.h b/drivers/block/nd/acpi_nfit.h
index a26f69e32244..b65745ca3cbc 100644
--- a/drivers/block/nd/acpi_nfit.h
+++ b/drivers/block/nd/acpi_nfit.h
@@ -261,5 +261,10 @@ static inline struct acpi_nfit_memdev *__to_nfit_memdev(struct nfit_mem *nfit_me
 	return nfit_mem->memdev_pmem;
 }
 
+static inline struct acpi_nfit_desc *to_acpi_desc(struct nd_bus_descriptor *nd_desc)
+{
+	return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
+}
+
 int nd_acpi_nfit_init(struct acpi_nfit_desc *nfit, acpi_size sz);
 #endif /* __NFIT_H__ */
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
new file mode 100644
index ..635f2e926426
--- /dev/null
+++ b/drivers/block/nd/bus.c
@@ -0,0 +1,83 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/uaccess.h>
+#include <linux/fcntl.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/io.h>
+#include "nd-private.h"
+
+static int nd_bus_major;
+static struct class *nd_class;
+
+int nd_bus_create_ndctl(struct nd_bus *nd_bus)
+{
+	dev_t devt = MKDEV(nd_bus_major, nd_bus->id);
+	struct device *dev;
+
+	dev = device_create(nd_class, &nd_bus->dev, devt, nd_bus, "ndctl%d",
+			nd_bus->id);
+
+	if (IS_ERR(dev)) {
+		dev_dbg(&nd_bus->dev, "failed to register ndctl%d: %ld\n
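The to_acpi_desc() helper in this patch recovers the enclosing acpi_nfit_desc from a pointer to its embedded nd_bus_descriptor via container_of(). For readers unfamiliar with the idiom, here is a rough user-space sketch. The toy_ structs and the toy_container_of macro are illustrative stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* User-space approximation of the kernel's container_of() (no type checking). */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical generic descriptor handed to the core bus code. */
struct toy_bus_descriptor {
	const char *provider_name;
};

/* Hypothetical provider-private wrapper that embeds the generic descriptor. */
struct toy_acpi_desc {
	int revision;
	struct toy_bus_descriptor nd_desc;
};

/* Step back from the embedded member to the structure that contains it. */
static struct toy_acpi_desc *toy_to_acpi_desc(struct toy_bus_descriptor *nd_desc)
{
	assert(nd_desc != NULL);
	return toy_container_of(nd_desc, struct toy_acpi_desc, nd_desc);
}

/* What a revision_show()-style accessor boils down to in this model. */
static int toy_revision(struct toy_bus_descriptor *nd_desc)
{
	return toy_to_acpi_desc(nd_desc)->revision;
}
```

The design point is that the core only ever sees the generic descriptor, while the provider can still reach its private state without a separate lookup table.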
[PATCH v2 16/20] libnd: write pmem label set
After 'uuid', 'size', and optionally 'alt_name' have been set to valid values the labels on the dimms can be updated. Write procedure is: 1/ Allocate and write new labels in the next index 2/ Free the old labels in the working copy 3/ Write the bitmap and the label space on the dimm 4/ Write the index to make the update valid Label ranges directly mirror the dpa resource values for the given label_id of the namespace. Cc: Greg KH gre...@linuxfoundation.org Cc: Neil Brown ne...@suse.de Signed-off-by: Dan Williams dan.j.willi...@intel.com --- drivers/block/nd/dimm_devs.c | 49 ++ drivers/block/nd/label.c | 327 + drivers/block/nd/label.h |6 + drivers/block/nd/namespace_devs.c | 82 - drivers/block/nd/nd.h |3 5 files changed, 453 insertions(+), 14 deletions(-) diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c index 4aa5654354ac..358b2a06d680 100644 --- a/drivers/block/nd/dimm_devs.c +++ b/drivers/block/nd/dimm_devs.c @@ -132,6 +132,55 @@ int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd) return rc; } +int nd_dimm_set_config_data(struct nd_dimm_drvdata *ndd, size_t offset, + void *buf, size_t len) +{ + int rc = validate_dimm(ndd); + size_t max_cmd_size, buf_offset; + struct nd_cmd_set_config_hdr *cmd; + struct nd_bus *nd_bus = walk_to_nd_bus(ndd-dev); + struct nd_bus_descriptor *nd_desc = nd_bus-nd_desc; + + if (rc) + return rc; + + if (!ndd-data) + return -ENXIO; + + if (offset + len ndd-nsarea.config_size) + return -ENXIO; + + max_cmd_size = min_t(u32, PAGE_SIZE, len); + max_cmd_size = min_t(u32, max_cmd_size, ndd-nsarea.max_xfer); + cmd = kzalloc(max_cmd_size + sizeof(*cmd) + sizeof(u32), GFP_KERNEL); + if (!cmd) + return -ENOMEM; + + for (buf_offset = 0; len; len -= cmd-in_length, + buf_offset += cmd-in_length) { + size_t cmd_size; + u32 *status; + + cmd-in_offset = offset + buf_offset; + cmd-in_length = min(max_cmd_size, len); + memcpy(cmd-in_buf, buf + buf_offset, cmd-in_length); + + /* status is output in the last 4-bytes of the 
command buffer */ + cmd_size = sizeof(*cmd) + cmd-in_length + sizeof(u32); + status = ((void *) cmd) + cmd_size - sizeof(u32); + + rc = nd_desc-ndctl(nd_desc, to_nd_dimm(ndd-dev), + ND_CMD_SET_CONFIG_DATA, cmd, cmd_size); + if (rc || *status) { + rc = rc ? rc : -ENXIO; + break; + } + } + kfree(cmd); + + return rc; +} + static void nd_dimm_release(struct device *dev) { struct nd_dimm *nd_dimm = to_nd_dimm(dev); diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c index b55fa2a6f872..78898b642191 100644 --- a/drivers/block/nd/label.c +++ b/drivers/block/nd/label.c @@ -12,6 +12,7 @@ */ #include linux/device.h #include linux/ndctl.h +#include linux/slab.h #include linux/io.h #include linux/nd.h #include nd-private.h @@ -57,6 +58,11 @@ size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd) return ndd-nsindex_size; } +static int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd) +{ + return ndd-nsarea.config_size / 129; +} + int nd_label_validate(struct nd_dimm_drvdata *ndd) { /* @@ -202,23 +208,30 @@ static struct nd_namespace_label __iomem *nd_label_base(struct nd_dimm_drvdata * return base + 2 * sizeof_namespace_index(ndd); } +static int to_slot(struct nd_dimm_drvdata *ndd, + struct nd_namespace_label __iomem *nd_label) +{ + return nd_label - nd_label_base(ndd); +} + #define for_each_clear_bit_le(bit, addr, size) \ for ((bit) = find_next_zero_bit_le((addr), (size), 0); \ (bit) (size);\ (bit) = find_next_zero_bit_le((addr), (size), (bit) + 1)) /** - * preamble_current - common variable initialization for nd_label_* routines + * preamble_index - common variable initialization for nd_label_* routines * @nd_dimm: dimm container for the relevant label set + * @idx: namespace_index index * @nsindex: on return set to the currently active namespace index * @free: on return set to the free label bitmap in the index * @nslot: on return set to the number of slots in the label space */ -static bool preamble_current(struct nd_dimm_drvdata *ndd, +static bool 
preamble_index(struct nd_dimm_drvdata *ndd, int idx, struct nd_namespace_index **nsindex, unsigned long **free, u32 *nslot) { - *nsindex = to_current_namespace_index(ndd); + *nsindex = to_namespace_index(ndd, idx); if (*nsindex == NULL) return
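nd_dimm_set_config_data() above pushes a label-area update to the dimm in chunks no larger than the dimm's max_xfer (and PAGE_SIZE), after rejecting writes that would run past config_size. A simplified user-space sketch of that chunking loop follows. The callback type, the toy_dimm struct, and the -1 error value are assumptions made for illustration; the real code issues ND_CMD_SET_CONFIG_DATA through ->ndctl() and checks a status word.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transfer hook: writes one chunk, returns 0 on success. */
typedef int (*toy_xfer_fn)(size_t offset, const unsigned char *data,
		size_t len, void *ctx);

/* Split [offset, offset + len) into transfers of at most max_xfer bytes. */
static int toy_set_config_data(size_t offset, const unsigned char *buf,
		size_t len, size_t config_size, size_t max_xfer,
		toy_xfer_fn xfer, void *ctx)
{
	size_t buf_offset = 0;

	assert(max_xfer > 0);
	if (offset > config_size || len > config_size - offset)
		return -1; /* write would run past the config area */

	while (len) {
		size_t chunk = len < max_xfer ? len : max_xfer;
		int rc = xfer(offset + buf_offset, buf + buf_offset, chunk, ctx);

		if (rc)
			return rc; /* stop at the first failed transfer */
		buf_offset += chunk;
		len -= chunk;
	}
	return 0;
}

/* Hypothetical backing store that records how many transfers it received. */
struct toy_dimm {
	unsigned char area[64];
	int calls;
};

static int toy_dimm_write(size_t offset, const unsigned char *data,
		size_t len, void *ctx)
{
	struct toy_dimm *dimm = ctx;

	memcpy(dimm->area + offset, data, len);
	dimm->calls++;
	return 0;
}
```

With a 10-byte payload and a 4-byte transfer limit, this model issues three transfers (4, 4, and 2 bytes).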
[PATCH v2 08/20] libnd, nd_acpi: regions (block-data-window, persistent memory, volatile memory)
A region device represents the maximum capacity of a BLK range (mmio
block-data-window(s)), or a PMEM range (DAX-capable persistent memory or
volatile memory), without regard for aliasing. Aliasing, in the
dimm-local address space (DPA), is resolved by metadata on a dimm to
designate which exclusive interface will access the aliased DPA ranges.
Support for the per-dimm metadata/label arrives in a subsequent patch.

The name format of region devices is regionN where, like dimms, N is a
global ida index assigned at discovery time. This id is not reliable
across reboots nor in the presence of hotplug. Look to attributes of the
region or static id-data of the sub-namespace to generate a persistent
name.

Regions have 2 generic attributes, "size" and "mappings", where:

- size: the block-data-window accessible capacity or the span of the
  spa-range in the case of pm.

- mappingN: a tuple describing a dimm's contribution to the region's
  capacity in the format (nmemX,dpa,size). For a PMEM-region there will
  be at least one mapping per dimm in the interleave set. For a
  BLK-region there is only mapping0 listing the starting dimm offset of
  the block-data-window and the available capacity of that window
  (matches size above).

The max number of mappings per region is hard coded per the constraints
of sysfs attribute groups. That said, the number of mappings per region
should never exceed the maximum number of possible dimms in the system.
If the current number turns out to not be enough then the "mappings"
attribute clarifies how many there are supposed to be. "32 should be
enough for anybody."

Cc: Neil Brown ne...@suse.de
Cc: linux-a...@vger.kernel.org
Cc: Greg KH gre...@linuxfoundation.org
Cc: Robert Moore robert.mo...@intel.com
Cc: Rafael J.
Wysocki rafael.j.wyso...@intel.com Signed-off-by: Dan Williams dan.j.willi...@intel.com --- drivers/block/nd/Makefile |1 drivers/block/nd/acpi.c| 130 ++ drivers/block/nd/libnd.h | 25 +++ drivers/block/nd/nd-private.h |3 drivers/block/nd/nd.h | 11 + drivers/block/nd/region_devs.c | 294 6 files changed, 463 insertions(+), 1 deletion(-) create mode 100644 drivers/block/nd/region_devs.c diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile index 842ba13253fd..6010469c4d4c 100644 --- a/drivers/block/nd/Makefile +++ b/drivers/block/nd/Makefile @@ -22,3 +22,4 @@ libnd-y := core.o libnd-y += bus.o libnd-y += dimm_devs.o libnd-y += dimm.o +libnd-y += region_devs.o diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c index bb0c2c764e78..41d0bb732b3e 100644 --- a/drivers/block/nd/acpi.c +++ b/drivers/block/nd/acpi.c @@ -751,12 +751,136 @@ static void nd_acpi_init_dsms(struct acpi_nfit_desc *acpi_desc) set_bit(i, nd_desc-dsm_mask); } +static ssize_t spa_index_show(struct device *dev, +struct device_attribute *attr, char *buf) +{ +struct nd_region *nd_region = to_nd_region(dev); +struct nfit_spa *nfit_spa = nd_region_provider_data(nd_region); + +return sprintf(buf, %d\n, nfit_spa-spa-spa_index); +} +static DEVICE_ATTR_RO(spa_index); + +static struct attribute *nd_acpi_region_attributes[] = { + dev_attr_spa_index.attr, + NULL, +}; + +static struct attribute_group nd_acpi_region_attribute_group = { + .name = nfit, + .attrs = nd_acpi_region_attributes, +}; + +static const struct attribute_group *nd_acpi_region_attribute_groups[] = { + nd_region_attribute_group, + nd_mapping_attribute_group, + nd_acpi_region_attribute_group, + NULL, +}; + +static int nd_acpi_register_region(struct acpi_nfit_desc *acpi_desc, + struct nfit_spa *nfit_spa) +{ + static struct nd_mapping nd_mappings[ND_MAX_MAPPINGS]; + struct acpi_nfit_spa *spa = nfit_spa-spa; + struct nfit_memdev *nfit_memdev; + struct nd_region_desc ndr_desc; + int spa_type, count = 0; + struct resource res; + 
u16 spa_index; + + spa_type = nfit_spa_type(spa); + spa_index = spa-spa_index; + if (spa_index == 0) { + dev_dbg(acpi_desc-dev, %s: detected invalid spa index\n, + __func__); + return 0; + } + + memset(res, 0, sizeof(res)); + memset(nd_mappings, 0, sizeof(nd_mappings)); + memset(ndr_desc, 0, sizeof(ndr_desc)); + res.start = spa-spa_base; + res.end = res.start + spa-spa_length - 1; + ndr_desc.res = res; + ndr_desc.provider_data = nfit_spa; + ndr_desc.attr_groups = nd_acpi_region_attribute_groups; + list_for_each_entry(nfit_memdev, acpi_desc-memdevs, list) { + struct acpi_nfit_memdev *memdev = nfit_memdev-memdev; + struct nd_mapping *nd_mapping; + struct nd_dimm *nd_dimm; + + if (memdev-spa_index != spa_index
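nd_acpi_register_region() above turns an NFIT system-physical-address range (base plus length) into a struct resource whose end is inclusive, hence the "- 1". A small user-space sketch of that conversion and of recovering the size again follows. The toy_resource type is a hypothetical stand-in for the kernel's struct resource.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct resource: [start, end], end inclusive. */
struct toy_resource {
	uint64_t start;
	uint64_t end;
};

/* Build an inclusive range from a base address and a non-zero length. */
static struct toy_resource toy_spa_to_resource(uint64_t spa_base, uint64_t spa_length)
{
	struct toy_resource res;

	assert(spa_length > 0);
	res.start = spa_base;
	res.end = spa_base + spa_length - 1;
	return res;
}

/* Inverse of the above, analogous in spirit to resource_size(). */
static uint64_t toy_resource_size(const struct toy_resource *res)
{
	return res->end - res->start + 1;
}
```

For example, a 1 GiB range based at 4 GiB ends at 0x13fffffff in this model, and its size round-trips to 0x40000000.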
[PATCH v2 14/20] libnd: pmem label sets and namespace instantiation.
A complete label set is a PMEM-label per dimm where all the UUIDs match and the interleave set cookie matches an active interleave set. Present a sysfs ABI for manipulation of a PMEM-namespace's 'alt_name', 'uuid', and 'size' attributes. A later patch will make these settings persistent by writing back the label. Note that PMEM allocations grow forwards from the start of an interleave set (lowest dimm-physical-address (DPA)). BLK-namespaces that alias with a PMEM interleave set will grow allocations backward from the highest DPA. Cc: Greg KH gre...@linuxfoundation.org Cc: Neil Brown ne...@suse.de Signed-off-by: Dan Williams dan.j.willi...@intel.com --- drivers/block/nd/bus.c|6 drivers/block/nd/core.c | 64 ++ drivers/block/nd/dimm.c |2 drivers/block/nd/dimm_devs.c | 127 + drivers/block/nd/label.c | 54 ++ drivers/block/nd/label.h |3 drivers/block/nd/libnd.h |2 drivers/block/nd/namespace_devs.c | 1024 + drivers/block/nd/nd-private.h | 14 + drivers/block/nd/nd.h | 33 + drivers/block/nd/pmem.c | 22 + drivers/block/nd/region_devs.c| 145 + include/linux/nd.h| 24 + include/uapi/linux/ndctl.h|4 14 files changed, 1512 insertions(+), 12 deletions(-) diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c index 8afb8d4a7e81..819259e92468 100644 --- a/drivers/block/nd/bus.c +++ b/drivers/block/nd/bus.c @@ -364,8 +364,10 @@ u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd, } EXPORT_SYMBOL_GPL(nd_cmd_out_size); -static void wait_nd_bus_probe_idle(struct nd_bus *nd_bus) +void wait_nd_bus_probe_idle(struct device *dev) { + struct nd_bus *nd_bus = walk_to_nd_bus(dev); + do { if (nd_bus-probe_active == 0) break; @@ -384,7 +386,7 @@ static int nd_cmd_clear_to_send(struct nd_dimm *nd_dimm, unsigned int cmd) return 0; nd_bus = walk_to_nd_bus(nd_dimm-dev); - wait_nd_bus_probe_idle(nd_bus); + wait_nd_bus_probe_idle(nd_bus-dev); if (atomic_read(nd_dimm-busy)) return -EBUSY; diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c index 603970d0ef3a..cf64b7a50d3a 100644 --- 
a/drivers/block/nd/core.c +++ b/drivers/block/nd/core.c @@ -13,6 +13,7 @@ #include linux/export.h #include linux/module.h #include linux/device.h +#include linux/ctype.h #include linux/ndctl.h #include linux/mutex.h #include linux/slab.h @@ -106,6 +107,69 @@ struct nd_bus *walk_to_nd_bus(struct device *nd_dev) return NULL; } +static bool is_uuid_sep(char sep) +{ + if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0') + return true; + return false; +} + +static int nd_uuid_parse(struct device *dev, u8 *uuid_out, const char *buf, + size_t len) +{ + const char *str = buf; + u8 uuid[16]; + int i; + + for (i = 0; i 16; i++) { + if (!isxdigit(str[0]) || !isxdigit(str[1])) { + dev_dbg(dev, %s: pos: %d buf[%zd]: %c buf[%zd]: %c\n, + __func__, i, str - buf, str[0], + str + 1 - buf, str[1]); + return -EINVAL; + } + + uuid[i] = (hex_to_bin(str[0]) 4) | hex_to_bin(str[1]); + str += 2; + if (is_uuid_sep(*str)) + str++; + } + + memcpy(uuid_out, uuid, sizeof(uuid)); + return 0; +} + +/** + * nd_uuid_store: common implementation for writing 'uuid' sysfs attributes + * @dev: container device for the uuid property + * @uuid_out: uuid buffer to replace + * @buf: raw sysfs buffer to parse + * + * Enforce that uuids can only be changed while the device is disabled + * (driver detached) + * LOCKING: expects device_lock() is held on entry + */ +int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf, + size_t len) +{ + u8 uuid[16]; + int rc; + + if (dev-driver) + return -EBUSY; + + rc = nd_uuid_parse(dev, uuid, buf, len); + if (rc) + return rc; + + kfree(*uuid_out); + *uuid_out = kmemdup(uuid, sizeof(uuid), GFP_KERNEL); + if (!(*uuid_out)) + return -ENOMEM; + + return 0; +} + static ssize_t commands_show(struct device *dev, struct device_attribute *attr, char *buf) { diff --git a/drivers/block/nd/dimm.c b/drivers/block/nd/dimm.c index 5477176c5de0..e2f964308672 100644 --- a/drivers/block/nd/dimm.c +++ b/drivers/block/nd/dimm.c @@ -86,7 +86,7 @@ static int 
nd_dimm_remove(struct device *dev) nd_bus_lock(dev); dev_set_drvdata(dev, NULL); for_each_dpa_resource_safe(ndd, res, _r) - __release_region(ndd-dpa, res-start, resource_size(res
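nd_uuid_parse() in the core.c hunk above reads 16 bytes as hex digit pairs and tolerates an optional '-', ':' or newline after each pair. A user-space approximation is sketched below. It deliberately differs from the patch in one respect, noted in the comments: it does not step over a terminating NUL, so short input is rejected without reading past the end of the string. The toy_ names and the -1 return value are illustrative only.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Separators that may follow a byte; unlike the patch, NUL is not skipped. */
static int toy_is_uuid_sep(char sep)
{
	return sep == '\n' || sep == '-' || sep == ':';
}

/* Convert one hex digit (already validated with isxdigit) to its value. */
static int toy_hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	return tolower((unsigned char)c) - 'a' + 10;
}

/* Parse 16 hex byte pairs from buf into out; returns 0 on success, -1 on bad input. */
static int toy_uuid_parse(uint8_t out[16], const char *buf)
{
	const char *str = buf;
	uint8_t uuid[16];

	assert(out != NULL && buf != NULL);
	for (int i = 0; i < 16; i++) {
		if (!isxdigit((unsigned char)str[0]) ||
				!isxdigit((unsigned char)str[1]))
			return -1;
		uuid[i] = (uint8_t)((toy_hex_val(str[0]) << 4) | toy_hex_val(str[1]));
		str += 2;
		if (toy_is_uuid_sep(*str))
			str++;
	}

	/* Only publish the result once the whole string has been validated. */
	memcpy(out, uuid, sizeof(uuid));
	return 0;
}
```

As in the patch, the output buffer is left untouched when parsing fails, which is what lets nd_uuid_store() keep the old uuid on error.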
[PATCH v2 10/20] pmem: use ida
In preparation for the pmem driver attaching to pmem-namespaces emitted
by libnd, convert it to use an ida instead of an always increasing
atomic index. This provides a bit of stability to pmem device names in
the presence of driver re-bind events.

Cc: Christoph Hellwig h...@lst.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/pmem.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index eabf4a8d0085..e3cf9142b172 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -34,10 +34,11 @@ struct pmem_device {
 	phys_addr_t		phys_addr;
 	void			*virt_addr;
 	size_t			size;
+	int			id;
 };
 
 static int pmem_major;
-static atomic_t pmem_index;
+static DEFINE_IDA(pmem_ida);
 
 static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 			unsigned int len, unsigned int off, int rw,
@@ -122,20 +123,26 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
 {
 	struct pmem_device *pmem;
 	struct gendisk *disk;
-	int idx, err;
+	int err;
 
 	err = -ENOMEM;
 	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
 	if (!pmem)
 		goto out;
 
+	pmem->id = ida_simple_get(&pmem_ida, 0, 0, GFP_KERNEL);
+	if (pmem->id < 0) {
+		err = pmem->id;
+		goto out_free_dev;
+	}
+
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
 
 	err = -EINVAL;
 	if (!request_mem_region(pmem->phys_addr, pmem->size, "pmem")) {
 		dev_warn(dev, "could not reserve region [0x%pa:0x%zx]\n", &pmem->phys_addr, pmem->size);
-		goto out_free_dev;
+		goto out_free_ida;
 	}
 
 	/*
@@ -159,15 +166,13 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
 	if (!disk)
 		goto out_free_queue;
 
-	idx = atomic_inc_return(&pmem_index) - 1;
-
 	disk->major		= pmem_major;
-	disk->first_minor	= PMEM_MINORS * idx;
+	disk->first_minor	= PMEM_MINORS * pmem->id;
 	disk->fops		= &pmem_fops;
 	disk->private_data	= pmem;
 	disk->queue		= pmem->pmem_queue;
 	disk->flags		= GENHD_FL_EXT_DEVT;
-	sprintf(disk->disk_name, "pmem%d", idx);
+	sprintf(disk->disk_name, "pmem%d", pmem->id);
 	disk->driverfs_dev = dev;
 	set_capacity(disk, pmem->size >> 9);
 	pmem->pmem_disk = disk;
@@ -182,6 +187,8 @@ out_unmap:
 	iounmap(pmem->virt_addr);
 out_release_region:
 	release_mem_region(pmem->phys_addr, pmem->size);
+out_free_ida:
+	ida_simple_remove(&pmem_ida, pmem->id);
 out_free_dev:
 	kfree(pmem);
 out:
@@ -195,6 +202,7 @@ static void pmem_free(struct pmem_device *pmem)
 	blk_cleanup_queue(pmem->pmem_queue);
 	iounmap(pmem->virt_addr);
 	release_mem_region(pmem->phys_addr, pmem->size);
+	ida_simple_remove(&pmem_ida, pmem->id);
 	kfree(pmem);
 }
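The naming stability this patch is after comes from the allocator handing back the lowest free id, so a device that is unbound and re-bound tends to get its old number back, whereas an ever-increasing counter never reuses one. A toy user-space model of the two schemes is sketched below. The 64-id bitmap and all toy_ names are assumptions for illustration and say nothing about how the kernel ida is implemented internally.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an ida: one bit per id, at most 64 ids. */
struct toy_ida {
	uint64_t used;
};

/* Return the lowest free id, or -1 when all 64 ids are taken. */
static int toy_ida_get(struct toy_ida *ida)
{
	for (int id = 0; id < 64; id++) {
		uint64_t bit = UINT64_C(1) << id;

		if (!(ida->used & bit)) {
			ida->used |= bit;
			return id;
		}
	}
	return -1;
}

/* Release an id so that a later toy_ida_get() can hand it out again. */
static void toy_ida_put(struct toy_ida *ida, int id)
{
	assert(id >= 0 && id < 64);
	ida->used &= ~(UINT64_C(1) << id);
}

/* The scheme being replaced: a counter that only ever grows. */
static int toy_counter_next(int *counter)
{
	return (*counter)++;
}
```

In this model, releasing id 0 and allocating again yields 0 once more (so the disk would still be called pmem0), while the counter scheme would move on to the next unused number.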
[PATCH v2 00/20] libnd: non-volatile memory device support
to 128M as only the simulated DAX regions need CMA. The rest can use
vmalloc().

---

Available here:
  git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm nd-v2

---

Dan Williams (18):
      e820, efi: add ACPI 6.0 persistent memory types
      libnd, nd_acpi: initial libnd infrastructure and NFIT support
      nd_acpi, nfit-test: manufactured NFITs for interface development
      libnd: ndctl class device, and nd bus attributes
      libnd, nd_acpi: dimm/memory-devices
      libnd: ndctl.h, the nd ioctl abi
      libnd, nd_dimm: dimm driver and base libnd device-driver infrastructure
      libnd, nd_acpi: regions (block-data-window, persistent memory, volatile memory)
      libnd: support for legacy (non-aliasing) nvdimms
      pmem: use ida
      libnd, nd_pmem: add libnd support to the pmem driver
      libnd, nd_acpi: add interleave-set state-tracking infrastructure
      libnd: namespace indices: read and validate
      libnd: pmem label sets and namespace instantiation.
      libnd: blk labels and namespace instantiation
      libnd: write pmem label set
      libnd: write blk label set
      libnd: infrastructure for btt devices

Ross Zwisler (1):
      libnd, nd_acpi, nd_blk: driver for BLK-mode access persistent memory

Vishal Verma (1):
      nd_btt: atomic sector updates

 Documentation/blockdev/btt.txt    |  273 ++
 arch/arm64/kernel/efi.c           |    1
 arch/ia64/kernel/efi.c            |    4
 arch/x86/boot/compressed/eboot.c  |    4
 arch/x86/include/uapi/asm/e820.h  |    1
 arch/x86/kernel/e820.c            |   26 +
 arch/x86/kernel/pmem.c            |    2
 arch/x86/platform/efi/efi.c       |    3
 drivers/block/Kconfig             |   13
 drivers/block/Makefile            |    2
 drivers/block/nd/Kconfig          |  129 +++
 drivers/block/nd/Makefile         |   41 +
 drivers/block/nd/acpi.c           | 1505 +
 drivers/block/nd/acpi_nfit.h      |  321 +++
 drivers/block/nd/blk.c            |  264 ++
 drivers/block/nd/btt.c            | 1423 +++
 drivers/block/nd/btt.h            |  185
 drivers/block/nd/btt_devs.c       |  443 ++
 drivers/block/nd/bus.c            |  770 +
 drivers/block/nd/core.c           |  471 ++
 drivers/block/nd/dimm.c           |  115 +++
 drivers/block/nd/dimm_devs.c      |  507 +++
 drivers/block/nd/e820.c           |  100 ++
 drivers/block/nd/label.c          |  925
 drivers/block/nd/label.h          |  143 +++
 drivers/block/nd/libnd.h          |  122 +++
 drivers/block/nd/namespace_devs.c | 1701 +
 drivers/block/nd/nd-private.h     |  114 ++
 drivers/block/nd/nd.h             |  261 ++
 drivers/block/nd/pmem.c           |  114 ++
 drivers/block/nd/region.c         |  159 +++
 drivers/block/nd/region_devs.c    |  637 ++
 drivers/block/nd/test/Makefile    |    5
 drivers/block/nd/test/iomap.c     |  151 +++
 drivers/block/nd/test/nfit.c      | 1131 +
 drivers/block/nd/test/nfit_test.h |   26 +
 include/linux/efi.h               |    3
 include/linux/nd.h                |   98 ++
 include/uapi/linux/Kbuild         |    1
 include/uapi/linux/ndctl.h        |  199
 40 files changed, 12345 insertions(+), 48 deletions(-)
 create mode 100644 Documentation/blockdev/btt.txt
 create mode 100644 drivers/block/nd/Kconfig
 create mode 100644 drivers/block/nd/Makefile
 create mode 100644 drivers/block/nd/acpi.c
 create mode 100644 drivers/block/nd/acpi_nfit.h
 create mode 100644 drivers/block/nd/blk.c
 create mode 100644 drivers/block/nd/btt.c
 create mode 100644 drivers/block/nd/btt.h
 create mode 100644 drivers/block/nd/btt_devs.c
 create mode 100644 drivers/block/nd/bus.c
 create mode 100644 drivers/block/nd/core.c
 create mode 100644 drivers/block/nd/dimm.c
 create mode 100644 drivers/block/nd/dimm_devs.c
 create mode 100644 drivers/block/nd/e820.c
 create mode 100644 drivers/block/nd/label.c
 create mode 100644 drivers/block/nd/label.h
 create mode 100644 drivers/block/nd/libnd.h
 create mode 100644 drivers/block/nd/namespace_devs.c
 create mode 100644 drivers/block/nd/nd-private.h
 create mode 100644 drivers/block/nd/nd.h
 rename drivers/block/{pmem.c => nd/pmem.c} (68%)
 create mode 100644 drivers/block/nd/region.c
 create mode 100644 drivers/block/nd/region_devs.c
 create mode 100644 drivers/block/nd/test/Makefile
 create mode 100644 drivers/block/nd/test/iomap.c
 create mode 100644 drivers/block/nd/test/nfit.c
 create mode 100644 drivers/block/nd/test/nfit_test.h
 create mode 100644 include/linux/nd.h
 create mode 100644 include/uapi/linux/ndctl.h
[PATCH v2 03/20] nd_acpi, nfit-test: manufactured NFITs for interface development
Manually create and register NFITs to describe 2 topologies. Topology1
is an advanced plausible configuration for BLK/PMEM aliased NVDIMMs.
Topology2 is an example configuration for current platforms that only
ship with a persistent address range.

Kernel provider "nfit_test.0" produces an NFIT with the following
attributes:

                              (a)      (b)                   DIMM   BLK-REGION
           +-------------------+--------+--------+--------+
 +------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
 | imc0 +--+- - - region0- - - +--------+        +--------+
 +--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
    |      +-------------------+--------v        v--------+
 +--+---+                               |                 |
 | cpu0 |                                     region1
 +--+---+                               |                 |
    |      +----------------------------^        ^--------+
 +--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
 | imc1 +--+----------------------------|        +--------+
 +------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
           +----------------------------+--------+--------+

*) In this layout we have four dimms and two memory controllers in one
   socket. Each unique interface ("block" or "pmem") to DPA space is
   identified by a region device with a dynamically assigned id.

*) The first portion of dimm0 and dimm1 are interleaved as REGION0. A
   single "pmem" namespace is created in the REGION0-"spa"-range that
   spans dimm0 and dimm1 with a user-specified name of "pm0.0". Some of
   that interleaved "spa" range is reclaimed as "bdw" accessed space
   starting at offset (a) into each dimm. In that reclaimed space we
   create two "bdw" "namespaces" from REGION2 and REGION3 where "blk2.0"
   and "blk3.0" are just human readable names that could be set to any
   user-desired name in the label.

*) In the last portion of dimm0 and dimm1 we have an interleaved "spa"
   range, REGION1, that spans those two dimms as well as dimm2 and
   dimm3. Some of REGION1 is allocated to a "pmem" namespace named
   "pm1.0"; the rest is reclaimed in 4 "bdw" namespaces (for each dimm
   in the interleave set), "blk2.1", "blk3.1", "blk4.0", and "blk5.0".

*) The portion of dimm2 and dimm3 that do not participate in the REGION1
   interleaved "spa" range (i.e. the DPA address below offset (b)) are
   also included in the "blk4.0" and "blk5.0" namespaces. Note that this
   example shows that "bdw" namespaces don't need to be contiguous in
   DPA-space.

Kernel provider nfit_test.1 produces an NFIT with the following attributes: region2 +-+ |-| || pm2.0 || |-| +-+ *) Describes a simple system-physical-address range with no backing dimm or interleave description. Cc: linux-a...@vger.kernel.org Cc: Robert Moore robert.mo...@intel.com Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com Signed-off-by: Dan Williams dan.j.willi...@intel.com --- drivers/block/nd/Kconfig | 19 + drivers/block/nd/Makefile | 15 + drivers/block/nd/acpi.c |3 drivers/block/nd/acpi_nfit.h | 11 drivers/block/nd/test/Makefile|5 drivers/block/nd/test/iomap.c | 151 + drivers/block/nd/test/nfit.c | 1025 + drivers/block/nd/test/nfit_test.h | 26 + 8 files changed, 1254 insertions(+), 1 deletion(-) create mode 100644 drivers/block/nd/test/Makefile create mode 100644 drivers/block/nd/test/iomap.c create mode 100644 drivers/block/nd/test/nfit.c create mode 100644 drivers/block/nd/test/nfit_test.h diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig index 6d5d6b732f82..09f0135147ca 100644 --- a/drivers/block/nd/Kconfig +++ b/drivers/block/nd/Kconfig @@ -37,4 +37,23 @@ config ND_ACPI addition to storage devices this also enables libnd craft ACPI._DSM messages for platform/dimm configuration. +config NFIT_TEST + tristate NFIT TEST: Manufactured NFIT for interface testing + depends on DMA_CMA + depends on LIBND=m + depends on ND_ACPI + depends on m + help + For development purposes register a manufactured + NFIT table to verify the resulting device model topology. + Note, this module arranges for ioremap_cache() to be + overridden locally to allow simulation of system-memory as an + io-memory-resource. + + Note, this test expects to be able to find at least 256MB of + CMA space (CONFIG_CMA_SIZE_MBYTES, cma=) or it will fail to + load. + + Say N unless you are doing development of the 'nd' subsystem. 
+ endif diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile index 944b5947c0cb..cf064db92589 100644 --- a/drivers/block/nd/Makefile +++ b/drivers/block/nd
[PATCH v2 05/20] libnd, nd_acpi: dimm/memory-devices
Register the memory devices described in the nfit as libnd 'dimm'
devices on an nd bus.

The kernel-assigned device id for dimms is dynamic. If userspace needs
a more static identifier it should consult a provider-specific
attribute. In the case where NFIT is the provider, the
'nmemX/nfit/handle' or 'nmemX/nfit/serial' attributes may be used for
this purpose.

Cc: Neil Brown ne...@suse.de
Cc: linux-a...@vger.kernel.org
Cc: Greg KH gre...@linuxfoundation.org
Cc: Robert Moore robert.mo...@intel.com
Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Makefile     |    1
 drivers/block/nd/acpi.c       |  160 +
 drivers/block/nd/acpi_nfit.h  |    1
 drivers/block/nd/bus.c        |   14 +++-
 drivers/block/nd/core.c       |   29 +++
 drivers/block/nd/dimm_devs.c  |   92
 drivers/block/nd/libnd.h      |   11 +++
 drivers/block/nd/nd-private.h |   12 +++
 8 files changed, 318 insertions(+), 2 deletions(-)
 create mode 100644 drivers/block/nd/dimm_devs.c

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 7defe18ed009..35e4c1a7a8ff 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -20,3 +20,4 @@ nd_acpi-y := acpi.o
 
 libnd-y := core.o
 libnd-y += bus.o
+libnd-y += dimm_devs.o
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index dd8505f766ed..af6684341c9b 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -369,6 +369,164 @@ const struct attribute_group *nd_acpi_attribute_groups[] = {
 };
 EXPORT_SYMBOL_GPL(nd_acpi_attribute_groups);
 
+static struct acpi_nfit_memdev *to_nfit_memdev(struct device *dev)
+{
+	struct nd_dimm *nd_dimm = to_nd_dimm(dev);
+	struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+
+	return __to_nfit_memdev(nfit_mem);
+}
+
+static struct acpi_nfit_dcr *to_nfit_dcr(struct device *dev)
+{
+	struct nd_dimm *nd_dimm = to_nd_dimm(dev);
+	struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+
+	return nfit_mem->dcr;
+}
+
+static ssize_t handle_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_memdev *memdev = to_nfit_memdev(dev);
+
+	return sprintf(buf, "%#x\n", memdev->nfit_handle);
+}
+static DEVICE_ATTR_RO(handle);
+
+static ssize_t phys_id_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_memdev *memdev = to_nfit_memdev(dev);
+
+	return sprintf(buf, "%#x\n", memdev->phys_id);
+}
+static DEVICE_ATTR_RO(phys_id);
+
+static ssize_t vendor_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_dcr *dcr = to_nfit_dcr(dev);
+
+	return sprintf(buf, "%#x\n", dcr->vendor_id);
+}
+static DEVICE_ATTR_RO(vendor);
+
+static ssize_t rev_id_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_dcr *dcr = to_nfit_dcr(dev);
+
+	return sprintf(buf, "%#x\n", dcr->revision_id);
+}
+static DEVICE_ATTR_RO(rev_id);
+
+static ssize_t device_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_dcr *dcr = to_nfit_dcr(dev);
+
+	return sprintf(buf, "%#x\n", dcr->device_id);
+}
+static DEVICE_ATTR_RO(device);
+
+static ssize_t format_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_dcr *dcr = to_nfit_dcr(dev);
+
+	return sprintf(buf, "%#x\n", dcr->fic);
+}
+static DEVICE_ATTR_RO(format);
+
+static ssize_t serial_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_dcr *dcr = to_nfit_dcr(dev);
+
+	return sprintf(buf, "%#x\n", dcr->serial_number);
+}
+static DEVICE_ATTR_RO(serial);
+
+static struct attribute *nd_acpi_dimm_attributes[] = {
+	&dev_attr_handle.attr,
+	&dev_attr_phys_id.attr,
+	&dev_attr_vendor.attr,
+	&dev_attr_device.attr,
+	&dev_attr_format.attr,
+	&dev_attr_serial.attr,
+	&dev_attr_rev_id.attr,
+	NULL,
+};
+
+static umode_t nd_acpi_dimm_attr_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+
+	if (to_nfit_dcr(dev))
+		return a->mode;
+	else
+		return 0;
+}
+
+static struct attribute_group nd_acpi_dimm_attribute_group = {
+	.name = "nfit",
+	.attrs = nd_acpi_dimm_attributes,
+	.is_visible = nd_acpi_dimm_attr_visible,
+};
+
+static const struct attribute_group *nd_acpi_dimm_attribute_groups[] = {
+	&nd_acpi_dimm_attribute_group,
+	NULL,
+};
+
+static struct nd_dimm *nd_acpi_dimm_by_handle(struct acpi_nfit_desc *acpi_desc,
+		u32 nfit_handle)
+{
+	struct nfit_mem *nfit_mem;
+
+	list_for_each_entry
[PATCH v2 07/20] libnd, nd_dimm: dimm driver and base libnd device-driver infrastructure
* Implement the device-model infrastructure for loading modules and
  attaching drivers to nd devices. This is a simple association of an
  nd-device-type number with a driver that has a bitmask of supported
  device types. To facilitate userspace bind/unbind operations,
  'modalias' and 'devtype', which also appear in the uevent, are added
  as generic sysfs attributes for all nd devices. The reason for the
  device-type number is to support sub-types within a given parent
  devtype, be it a vendor-specific sub-type or otherwise.

* The first consumer of this infrastructure is the driver for dimm
  devices. It simply uses control messages to retrieve and store the
  configuration-data image (label set) from each dimm.

Note: nd_device_register() arranges for asynchronous registration of
      nd bus devices by default.

Cc: Greg KH gre...@linuxfoundation.org
Cc: Neil Brown ne...@suse.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Makefile     |    1
 drivers/block/nd/acpi.c       |   13 ++-
 drivers/block/nd/bus.c        |  168 +
 drivers/block/nd/core.c       |   43 ++
 drivers/block/nd/dimm.c       |   93 +++
 drivers/block/nd/dimm_devs.c  |  136 -
 drivers/block/nd/libnd.h      |    2
 drivers/block/nd/nd-private.h |    8 +
 drivers/block/nd/nd.h         |   34
 include/linux/nd.h            |   39 ++
 include/uapi/linux/ndctl.h    |    6 +
 11 files changed, 528 insertions(+), 15 deletions(-)
 create mode 100644 drivers/block/nd/dimm.c
 create mode 100644 drivers/block/nd/nd.h
 create mode 100644 include/linux/nd.h

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 35e4c1a7a8ff..842ba13253fd 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -21,3 +21,4 @@ nd_acpi-y := acpi.o
 libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
+libnd-y += dimm.o
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index c46e166695f7..bb0c2c764e78 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -22,6 +22,10 @@ static bool warn_checksum;
 module_param(warn_checksum, bool, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(warn_checksum, "Turn checksum errors into warnings");
 
+static bool force_enable_dimms;
+module_param(force_enable_dimms, bool, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
+
 enum {
 	NFIT_ACPI_NOTIFY_TABLE = 0x80,
 };
@@ -627,6 +631,7 @@ static struct attribute_group nd_acpi_dimm_attribute_group = {
 
 static const struct attribute_group *nd_acpi_dimm_attribute_groups[] = {
 	&nd_dimm_attribute_group,
+	&nd_device_attribute_group,
 	&nd_acpi_dimm_attribute_group,
 	NULL,
 };
@@ -663,7 +668,7 @@ static int nd_acpi_add_dimm(struct acpi_nfit_desc *acpi_desc,
 	if (!adev_dimm) {
 		dev_err(dev, "no ACPI.NFIT device with _ADR %#x, disabling...\n",
 				nfit_handle);
-		return -ENODEV;
+		return force_enable_dimms ? 0 : -ENODEV;
 	}
 
 	status = acpi_evaluate_integer(adev_dimm->handle, "_STA", NULL, &sta);
@@ -684,12 +689,13 @@ static int nd_acpi_add_dimm(struct acpi_nfit_desc *acpi_desc,
 		if (acpi_check_dsm(adev_dimm->handle, uuid, 1, 1ULL << i))
 			set_bit(i, &nfit_mem->dsm_mask);
 
-	return rc;
+	return force_enable_dimms ? 0 : rc;
 }
 
 static int nd_acpi_register_dimms(struct acpi_nfit_desc *acpi_desc)
 {
 	struct nfit_mem *nfit_mem;
+	int dimm_count = 0;
 
 	list_for_each_entry(nfit_mem, &acpi_desc->dimms, list) {
 		struct nd_dimm *nd_dimm;
@@ -723,9 +729,10 @@ static int nd_acpi_register_dimms(struct acpi_nfit_desc *acpi_desc)
 			return -ENOMEM;
 
 		nfit_mem->nd_dimm = nd_dimm;
+		dimm_count++;
 	}
 
-	return 0;
+	return nd_bus_validate_dimm_count(acpi_desc->nd_bus, dimm_count);
 }
 
 static void nd_acpi_init_dsms(struct acpi_nfit_desc *acpi_desc)
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index a271e01af4a9..7bd79f30e5e7 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -16,19 +16,183 @@
 #include <linux/fcntl.h>
 #include <linux/async.h>
 #include <linux/ndctl.h>
+#include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/io.h>
 #include <linux/mm.h>
+#include <linux/nd.h>
 #include "nd-private.h"
+#include "nd.h"
 
 int nd_dimm_major;
 static int nd_bus_major;
 static struct class *nd_class;
 
-struct bus_type nd_bus_type = {
+static int to_nd_device_type(struct device *dev)
+{
+	if (is_nd_dimm(dev))
+		return ND_DEVICE_DIMM;
+
+	return 0;
+}
+
+static int nd_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
+{
+	return add_uevent_var(env, "MODALIAS=" ND_DEVICE_MODALIAS_FMT
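[Editorial sketch] The changelog's "device-type number matched against a driver's bitmask of supported types" can be illustrated in user space roughly as follows. This is only a sketch under assumptions: `sketch_driver`, `sketch_match()`, `SKETCH_TYPE_BIT()` and the `SKETCH_DEVICE_*` values are hypothetical names invented for the illustration, not identifiers or values from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device-type numbers; 0 means "no known type". */
enum {
	SKETCH_DEVICE_DIMM = 1,
	SKETCH_DEVICE_REGION = 2,
};

/* One bit per supported device-type number. */
#define SKETCH_TYPE_BIT(type) (1UL << (type))

/* Hypothetical driver record carrying the supported-type bitmask. */
struct sketch_driver {
	const char *name;
	unsigned long type_mask;
};

/* A driver matches a device when the bit for the device's type is set. */
bool sketch_match(const struct sketch_driver *drv, int dev_type)
{
	assert(drv != NULL);

	/* Type 0 and out-of-range types never match any driver. */
	if (dev_type <= 0 || dev_type >= (int)(sizeof(drv->type_mask) * 8))
		return false;
	return (drv->type_mask & SKETCH_TYPE_BIT(dev_type)) != 0;
}
```

A driver that handles several sub-types would simply OR several `SKETCH_TYPE_BIT()` values into its mask, which is presumably the flexibility the changelog refers to when it mentions vendor-specific sub-types.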
[PATCH v2 02/20] libnd, nd_acpi: initial libnd infrastructure and NFIT support
1/ Autodetect an NFIT table for the ACPI namespace device with _HID of
   ACPI0012

2/ libnd bus registration

The NFIT provided by ACPI is one possible method by which platforms
will discover NVDIMM resources. However, the intent of the
nd_bus_descriptor abstraction is to hide provider-specific details,
leaving libnd independent of the specific NVDIMM resource discovery
mechanism. This flexibility is exploited later to implement
custom-defined nd buses.

Cc: linux-a...@vger.kernel.org
Cc: Robert Moore robert.mo...@intel.com
Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/Kconfig         |    2
 drivers/block/Makefile        |    1
 drivers/block/nd/Kconfig      |   40 +++
 drivers/block/nd/Makefile     |    6 +
 drivers/block/nd/acpi.c       |  475 +
 drivers/block/nd/acpi_nfit.h  |  254 ++
 drivers/block/nd/core.c       |   67 ++
 drivers/block/nd/libnd.h      |   33 +++
 drivers/block/nd/nd-private.h |   23 ++
 9 files changed, 901 insertions(+)
 create mode 100644 drivers/block/nd/Kconfig
 create mode 100644 drivers/block/nd/Makefile
 create mode 100644 drivers/block/nd/acpi.c
 create mode 100644 drivers/block/nd/acpi_nfit.h
 create mode 100644 drivers/block/nd/core.c
 create mode 100644 drivers/block/nd/libnd.h
 create mode 100644 drivers/block/nd/nd-private.h

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index eb1fed5bd516..dfe40e5ca9bd 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -321,6 +321,8 @@ config BLK_DEV_NVME
 	  To compile this driver as a module, choose M here: the
 	  module will be called nvme.
 
+source "drivers/block/nd/Kconfig"
+
 config BLK_DEV_SKD
 	tristate "STEC S1120 Block Driver"
 	depends on PCI
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 9cc6c18a1c7e..07a6acecf4d8 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_CDROM_PKTCDVD) += pktcdvd.o
 obj-$(CONFIG_MG_DISK) += mg_disk.o
 obj-$(CONFIG_SUNVDC) += sunvdc.o
 obj-$(CONFIG_BLK_DEV_NVME) += nvme.o
+obj-$(CONFIG_ND_DEVICES) += nd/
 obj-$(CONFIG_BLK_DEV_SKD) += skd.o
 obj-$(CONFIG_BLK_DEV_OSD) += osdblk.o
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
new file mode 100644
index 000000000000..6d5d6b732f82
--- /dev/null
+++ b/drivers/block/nd/Kconfig
@@ -0,0 +1,40 @@
+menuconfig ND_DEVICES
+	bool "NVDIMM Support"
+	depends on PHYS_ADDR_T_64BIT
+	help
+	  Generic support for non-volatile memory devices including
+	  ACPI-6-NFIT defined resources. On platforms that define an
+	  NFIT, or otherwise can discover NVDIMM resources, a libnd
+	  bus is registered to advertise PMEM (persistent memory)
+	  namespaces (/dev/pmemX) and BLK (sliding mmio window(s))
+	  namespaces (/dev/ndX). A PMEM namespace refers to a memory
+	  resource that may span multiple DIMMs and support DAX (see
+	  CONFIG_DAX). A BLK namespace refers to an NVDIMM control
+	  region which exposes an mmio register set for windowed
+	  access mode to non-volatile memory.
+
+if ND_DEVICES
+
+config LIBND
+	tristate "LIBND: libnd device driver support"
+	help
+	  Platform agnostic device model for a libnd bus. Publishes
+	  resources for a PMEM (persistent-memory) driver and/or BLK
+	  (sliding mmio window(s)) driver to attach. Exposes a device
+	  topology under a ndX bus device, a /dev/ndctlX bus-ioctl
+	  message passing interface, and a /dev/nmemX dimm-ioctl
+	  message interface for each memory device registered on the
+	  bus. instance. A userspace library ndctl provides an API
+	  to enumerate/manage this subsystem.
+
+config ND_ACPI
+	tristate "ACPI: NFIT to libnd bus support"
+	select LIBND
+	depends on ACPI
+	help
+	  Infrastructure to probe ACPI 6 compliant platforms for
+	  NVDIMMs (NFIT) and register a libnd device tree. In
+	  addition to storage devices this also enables libnd craft
+	  ACPI._DSM messages for platform/dimm configuration.
+
+endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
new file mode 100644
index 000000000000..944b5947c0cb
--- /dev/null
+++ b/drivers/block/nd/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_LIBND) += libnd.o
+obj-$(CONFIG_ND_ACPI) += nd_acpi.o
+
+nd_acpi-y := acpi.o
+
+libnd-y := core.o
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
new file mode 100644
index 000000000000..9f0b24390d1b
--- /dev/null
+++ b/drivers/block/nd/acpi.c
@@ -0,0 +1,475 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
[PATCH v2 01/20] e820, efi: add ACPI 6.0 persistent memory types
ACPI 6.0 formalizes e820-type-7 and efi-type-14 as persistent memory.
Mark it reserved and allow it to be claimed by a persistent memory
device driver.

This definition is in addition to the Linux kernel's existing type-12
definition that was recently added in support of shipping platforms
with NVDIMM support that predate ACPI 6.0 (which now classifies type-12
as OEM reserved). We may choose to exploit this wealth of definitions
for NVDIMMs to differentiate E820_PRAM (type-12) from E820_PMEM
(type-7). One potential differentiation is that PMEM is not backed by
struct page by default in contrast to PRAM. For now, they are
effectively treated as aliases by the mm.

Note, /proc/iomem can be consulted for differentiating legacy
"Persistent RAM" E820_PRAM vs standard "Persistent I/O Memory"
E820_PMEM.

Cc: Boaz Harrosh b...@plexistor.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Christoph Hellwig h...@lst.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Borislav Petkov b...@alien8.de
Cc: H. Peter Anvin h...@zytor.com
Cc: Jens Axboe ax...@fb.com
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Matthew Wilcox wi...@linux.intel.com
Cc: Thomas Gleixner t...@linutronix.de
Acked-by: Andy Lutomirski l...@amacapital.net
Reviewed-by: Ross Zwisler ross.zwis...@linux.intel.com
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 arch/arm64/kernel/efi.c          |    1 +
 arch/ia64/kernel/efi.c           |    4
 arch/x86/boot/compressed/eboot.c |    4
 arch/x86/include/uapi/asm/e820.h |    1 +
 arch/x86/kernel/e820.c           |   26 +++---
 arch/x86/platform/efi/efi.c      |    3 +++
 include/linux/efi.h              |    3 ++-
 7 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index ab21e0d58278..9d4aa18f2a82 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -158,6 +158,7 @@ static __init int is_reserve_region(efi_memory_desc_t *md)
 	case EFI_BOOT_SERVICES_CODE:
 	case EFI_BOOT_SERVICES_DATA:
 	case EFI_CONVENTIONAL_MEMORY:
+	case EFI_PERSISTENT_MEMORY:
 		return 0;
 	default:
 		break;
diff --git a/arch/ia64/kernel/efi.c b/arch/ia64/kernel/efi.c
index c52d7540dc05..9028bc268cd7 100644
--- a/arch/ia64/kernel/efi.c
+++ b/arch/ia64/kernel/efi.c
@@ -1223,6 +1223,10 @@ efi_initialize_iomem_resources(struct resource *code_resource,
 				flags |= IORESOURCE_DISABLED;
 				break;
 
+			case EFI_PERSISTENT_MEMORY:
+				name = "persistent";
+				break;
+
 			case EFI_RESERVED_TYPE:
 			case EFI_RUNTIME_SERVICES_CODE:
 			case EFI_RUNTIME_SERVICES_DATA:
diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c
index ef17683484e9..dde5bf7726f4 100644
--- a/arch/x86/boot/compressed/eboot.c
+++ b/arch/x86/boot/compressed/eboot.c
@@ -1222,6 +1222,10 @@ static efi_status_t setup_e820(struct boot_params *params,
 			e820_type = E820_NVS;
 			break;
 
+		case EFI_PERSISTENT_MEMORY:
+			e820_type = E820_PMEM;
+			break;
+
 		default:
 			continue;
 		}
diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
index 960a8a9dc4ab..0f457e6eab18 100644
--- a/arch/x86/include/uapi/asm/e820.h
+++ b/arch/x86/include/uapi/asm/e820.h
@@ -32,6 +32,7 @@
 #define E820_ACPI	3
 #define E820_NVS	4
 #define E820_UNUSABLE	5
+#define E820_PMEM	7
 
 /*
  * This is a non-standardized way to represent ADR or NVDIMM regions that
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 11cc7d54ec3f..d38b53a7e9b2 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -149,6 +149,7 @@ static void __init e820_print_type(u32 type)
 	case E820_UNUSABLE:
 		printk(KERN_CONT "unusable");
 		break;
+	case E820_PMEM:
 	case E820_PRAM:
 		printk(KERN_CONT "persistent (type %u)", type);
 		break;
@@ -919,10 +920,31 @@ static inline const char *e820_type_to_string(int e820_type)
 	case E820_NVS:	return "ACPI Non-volatile Storage";
 	case E820_UNUSABLE:	return "Unusable memory";
 	case E820_PRAM: return "Persistent RAM";
+	case E820_PMEM: return "Persistent I/O Memory";
 	default:	return "reserved";
 	}
 }
 
+static bool do_mark_busy(u32 type, struct resource *res)
+{
+	/* this is the legacy bios/dos rom-shadow + mmio region */
+	if (res->start < (1ULL<<20))
+		return true;
+
+	/*
+	 * Treat persistent memory like device memory, i.e. reserve it
+	 * for exclusive use of a driver
+	 */
+	switch (type) {
+	case E820_RESERVED:
+	case
[PATCH v2 13/20] libnd: namespace indices: read and validate
The on-media label format consists of two index blocks followed by an
array of labels. None of these structures are ever updated in place. A
sequence number tracks the current active index and the next one to
write, while labels are written to free slots.

    +------------+
    |            |
    |  nsindex0  |
    |            |
    +------------+
    |            |
    |  nsindex1  |
    |            |
    +------------+
    |   label0   |
    +------------+
    |   label1   |
    +------------+
    |            |
     ....nslot...
    |            |
    +------------+
    |   labelN   |
    +------------+

After reading valid labels, store the dpa ranges they claim into
per-dimm resource trees.

Cc: Neil Brown ne...@suse.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Makefile    |    1
 drivers/block/nd/dimm.c      |   26 +++-
 drivers/block/nd/dimm_devs.c |    6 +
 drivers/block/nd/label.c     |  291 ++
 drivers/block/nd/label.h     |  129 +++
 drivers/block/nd/nd.h        |   45 ++
 include/uapi/linux/ndctl.h   |    1
 7 files changed, 495 insertions(+), 4 deletions(-)
 create mode 100644 drivers/block/nd/label.c
 create mode 100644 drivers/block/nd/label.h

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index ebb212af9f15..d588f691163c 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -31,3 +31,4 @@ libnd-y += dimm.o
 libnd-y += region_devs.o
 libnd-y += region.o
 libnd-y += namespace_devs.o
+libnd-y += label.o
diff --git a/drivers/block/nd/dimm.c b/drivers/block/nd/dimm.c
index 6b7d2842509c..5477176c5de0 100644
--- a/drivers/block/nd/dimm.c
+++ b/drivers/block/nd/dimm.c
@@ -18,6 +18,7 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 #include <linux/nd.h>
+#include "label.h"
 #include "nd.h"
 
 static void free_data(struct nd_dimm_drvdata *ndd)
@@ -42,7 +43,12 @@ static int nd_dimm_probe(struct device *dev)
 		return -ENOMEM;
 
 	dev_set_drvdata(dev, ndd);
-        ndd->dev = dev;
+	ndd->dpa.name = dev_name(dev);
+	ndd->ns_current = -1;
+	ndd->ns_next = -1;
+	ndd->dpa.start = 0;
+	ndd->dpa.end = -1;
+	ndd->dev = dev;
 
 	rc = nd_dimm_init_nsarea(ndd);
 	if (rc)
@@ -54,18 +60,34 @@ static int nd_dimm_probe(struct device *dev)
 
 	dev_dbg(dev, "config data size: %d\n", ndd->nsarea.config_size);
 
+	nd_bus_lock(dev);
+	ndd->ns_current = nd_label_validate(ndd);
+	ndd->ns_next = nd_label_next_nsindex(ndd->ns_current);
+	nd_label_copy(ndd, to_next_namespace_index(ndd),
+			to_current_namespace_index(ndd));
+	rc = nd_label_reserve_dpa(ndd);
+	nd_bus_unlock(dev);
+
+	if (rc)
+		goto err;
+
 	return 0;
 
  err:
 	free_data(ndd);
 	return rc;
-
 }
 
 static int nd_dimm_remove(struct device *dev)
 {
 	struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);
+	struct resource *res, *_r;
 
+	nd_bus_lock(dev);
+	dev_set_drvdata(dev, NULL);
+	for_each_dpa_resource_safe(ndd, res, _r)
+		__release_region(&ndd->dpa, res->start, resource_size(res));
+	nd_bus_unlock(dev);
 	free_data(ndd);
 
 	return 0;
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 8981adc59ba4..3fbd0d0502eb 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -92,8 +92,12 @@ int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd)
 	if (ndd->data)
 		return 0;
 
-	if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0)
+	if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0
+			|| ndd->nsarea.config_size < ND_LABEL_MIN_SIZE) {
+		dev_dbg(ndd->dev, "failed to init config data area: (%d:%d)\n",
+				ndd->nsarea.max_xfer, ndd->nsarea.config_size);
 		return -ENXIO;
+	}
 
 	ndd->data = kmalloc(ndd->nsarea.config_size, GFP_KERNEL);
 	if (!ndd->data)
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
new file mode 100644
index 000000000000..e791ea8bbdde
--- /dev/null
+++ b/drivers/block/nd/label.c
@@ -0,0 +1,291 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/device.h>
+#include <linux/ndctl.h>
+#include <linux/io.h>
+#include <linux/nd.h>
+#include "nd-private.h"
+#include "label.h"
+#include "nd.h"
+
+#include <asm-generic/io-64-nonatomic-lo-hi.h>
+
+static u32 best_seq(u32
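[Editorial sketch] The "sequence number tracks the current active index" rule from the changelog can be illustrated in user space as below. The excerpt is cut off before the body of best_seq(), so the encoding used here is an assumption: a 2-bit sequence that cycles 1 -> 2 -> 3 -> 1, with 0 meaning "never written / invalid". All `sketch_` names are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SEQ_MASK 0x3u

/* Assumed cycle: 1 -> 2 -> 3 -> 1; 0 is never a valid sequence number. */
uint32_t sketch_inc_seq(uint32_t seq)
{
	static const uint32_t next[] = { 0, 2, 3, 1 };

	assert(seq <= SKETCH_SEQ_MASK);	/* callers mask before calling */
	return next[seq];
}

/* Return whichever of two sequence numbers is newer in the cycle. */
uint32_t sketch_best_seq(uint32_t a, uint32_t b)
{
	a &= SKETCH_SEQ_MASK;
	b &= SKETCH_SEQ_MASK;

	if (a == 0 || a == b)
		return b;
	if (b == 0)
		return a;
	/* b is newer only if it is the direct successor of a */
	return sketch_inc_seq(a) == b ? b : a;
}

/* Pick the current index block (0 or 1), or -1 if neither is valid. */
int sketch_current_index(uint32_t seq0, uint32_t seq1)
{
	uint32_t best = sketch_best_seq(seq0, seq1);

	if (best == 0)
		return -1;
	return best == (seq0 & SKETCH_SEQ_MASK) ? 0 : 1;
}
```

Because neither index block is updated in place, a writer in this scheme would fill in the non-current block with the incremented sequence number; a torn write then leaves the older block as the valid one.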
[PATCH v2 18/20] libnd: infrastructure for btt devices
Block devices from an nd bus, in addition to accepting "struct bio"
based requests, also have the capability to perform byte-aligned
accesses. By default only the bio/block interface is used. However, if
another driver can make effective use of the byte-aligned capability it
can claim/disable the block interface and use the byte-aligned "nd_io"
interface.

The BTT driver is the intended first consumer of this mechanism to
allow layering atomic sector update guarantees on top of nd_io capable
nd-bus-block-devices.

Cc: Greg KH gre...@linuxfoundation.org
Cc: Neil Brown ne...@suse.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/Kconfig      |    3
 drivers/block/nd/Makefile     |    1
 drivers/block/nd/btt.h        |   45
 drivers/block/nd/btt_devs.c   |  442 +
 drivers/block/nd/bus.c        |  128
 drivers/block/nd/core.c       |   79 +++
 drivers/block/nd/nd-private.h |   28 +++
 drivers/block/nd/nd.h         |   94 +
 drivers/block/nd/pmem.c       |   29 +++
 include/uapi/linux/ndctl.h    |    2
 10 files changed, 847 insertions(+), 4 deletions(-)
 create mode 100644 drivers/block/nd/btt.h
 create mode 100644 drivers/block/nd/btt_devs.c

diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index c5eaf195734d..15896db4de37 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -95,4 +95,7 @@ config BLK_DEV_PMEM
 
 	  Say Y if you want to use a NVDIMM described by NFIT
 
+config ND_BTT_DEVS
+	def_bool y
+
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index d588f691163c..0c6d64b7a69d 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -32,3 +32,4 @@ libnd-y += region_devs.o
 libnd-y += region.o
 libnd-y += namespace_devs.o
 libnd-y += label.o
+libnd-$(CONFIG_ND_BTT_DEVS) += btt_devs.o
diff --git a/drivers/block/nd/btt.h b/drivers/block/nd/btt.h
new file mode 100644
index 000000000000..e8f6d8e0ddd3
--- /dev/null
+++ b/drivers/block/nd/btt.h
@@ -0,0 +1,45 @@
+/*
+ * Block Translation Table library
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_BTT_H
+#define _LINUX_BTT_H
+
+#include <linux/types.h>
+
+#define BTT_SIG_LEN 16
+#define BTT_SIG "BTT_ARENA_INFO\0"
+
+struct btt_sb {
+	u8 signature[BTT_SIG_LEN];
+	u8 uuid[16];
+	u8 parent_uuid[16];
+	__le32 flags;
+	__le16 version_major;
+	__le16 version_minor;
+	__le32 external_lbasize;
+	__le32 external_nlba;
+	__le32 internal_lbasize;
+	__le32 internal_nlba;
+	__le32 nfree;
+	__le32 infosize;
+	__le64 nextoff;
+	__le64 dataoff;
+	__le64 mapoff;
+	__le64 logoff;
+	__le64 info2off;
+	u8 padding[3968];
+	__le64 checksum;
+};
+
+#endif
diff --git a/drivers/block/nd/btt_devs.c b/drivers/block/nd/btt_devs.c
new file mode 100644
index 000000000000..e6f0b8b999d8
--- /dev/null
+++ b/drivers/block/nd/btt_devs.c
@@ -0,0 +1,442 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/device.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include "nd-private.h"
+#include "btt.h"
+#include "nd.h"
+
+static DEFINE_IDA(btt_ida);
+
+static void nd_btt_release(struct device *dev)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	dev_dbg(dev, "%s\n", __func__);
+	WARN_ON(nd_btt->backing_dev);
+	ndio_del_claim(nd_btt->ndio_claim);
+	ida_simple_remove(&btt_ida, nd_btt->id);
+	kfree(nd_btt->uuid);
+	kfree(nd_btt);
+}
+
+static struct device_type nd_btt_device_type = {
+	.name = "nd_btt",
+	.release = nd_btt_release,
+};
+
+bool is_nd_btt(struct device *dev)
+{
+	return dev->type == &nd_btt_device_type;
+}
+
+struct nd_btt *to_nd_btt(struct device *dev)
+{
+	struct nd_btt *nd_btt = container_of(dev, struct nd_btt, dev);
+
+	WARN_ON(!is_nd_btt(dev));
+	return nd_btt;
+}
+EXPORT_SYMBOL(to_nd_btt);
+
+static
[PATCH v2 17/20] libnd: write blk label set
After 'uuid', 'size', 'sector_size', and optionally 'alt_name' have
been set to valid values the labels on the dimm can be updated. The
difference with the pmem case is that blk namespaces are limited to one
dimm and can cover discontiguous ranges in dpa space.

Also, after allocating label slots, it is useful for userspace to know
how many slots are left. Export this information in sysfs.

Cc: Greg KH gre...@linuxfoundation.org
Cc: Neil Brown ne...@suse.de
Signed-off-by: Dan Williams dan.j.willi...@intel.com
---
 drivers/block/nd/bus.c            |    4
 drivers/block/nd/dimm_devs.c      |   25 +++
 drivers/block/nd/label.c          |  297 +++--
 drivers/block/nd/label.h          |    5 +
 drivers/block/nd/namespace_devs.c |   57 +++
 drivers/block/nd/nd-private.h     |    1
 6 files changed, 367 insertions(+), 22 deletions(-)

diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 819259e92468..6c272f245f4e 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -136,6 +136,10 @@ static void nd_async_device_unregister(void *d, async_cookie_t cookie)
 {
 	struct device *dev = d;
 
+	/* flush bus operations before delete */
+	nd_bus_lock(dev);
+	nd_bus_unlock(dev);
+
 	device_unregister(dev);
 	put_device(dev);
 }
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 358b2a06d680..4b225c8b7d0a 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -19,6 +19,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include "nd-private.h"
+#include "label.h"
 #include "nd.h"
 
 static DEFINE_IDA(dimm_ida);
@@ -262,9 +263,33 @@ static ssize_t state_show(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RO(state);
 
+static ssize_t available_slots_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);
+	ssize_t rc;
+	u32 nfree;
+
+	if (!ndd)
+		return -ENXIO;
+
+	nd_bus_lock(dev);
+	nfree = nd_label_nfree(ndd);
+	if (nfree - 1 > nfree) {
+		dev_WARN_ONCE(dev, 1, "we ate our last label?\n");
+		nfree = 0;
+	} else
+		nfree--;
+	rc = sprintf(buf, "%d\n", nfree);
+	nd_bus_unlock(dev);
+	return rc;
+}
+static DEVICE_ATTR_RO(available_slots);
+
 static struct attribute *nd_dimm_attributes[] = {
 	&dev_attr_state.attr,
 	&dev_attr_commands.attr,
+	&dev_attr_available_slots.attr,
 	NULL,
 };
 
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
index 78898b642191..069c26d50ed1 100644
--- a/drivers/block/nd/label.c
+++ b/drivers/block/nd/label.c
@@ -58,7 +58,7 @@ size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd)
 	return ndd->nsindex_size;
 }
 
-static int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd)
+int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd)
 {
 	return ndd->nsarea.config_size / 129;
 }
@@ -416,7 +416,7 @@ u32 nd_label_nfree(struct nd_dimm_drvdata *ndd)
 	WARN_ON(!is_nd_bus_locked(ndd->dev));
 
 	if (!preamble_next(ndd, &nsindex, &free, &nslot))
-		return 0;
+		return nd_dimm_num_label_slots(ndd);
 
 	return bitmap_weight(free, nslot);
 }
@@ -553,22 +553,270 @@ static int __pmem_label_update(struct nd_region *nd_region,
 	return 0;
 }
 
-static int init_labels(struct nd_mapping *nd_mapping)
+static void del_label(struct nd_mapping *nd_mapping, int l)
+{
+	struct nd_namespace_label __iomem *next_label, __iomem *nd_label;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	unsigned int slot;
+	int j;
+
+	nd_label = nd_get_label(nd_mapping->labels, l);
+	slot = to_slot(ndd, nd_label);
+	dev_vdbg(ndd->dev, "%s: clear: %d\n", __func__, slot);
+
+	for (j = l; (next_label = nd_get_label(nd_mapping->labels, j + 1)); j++)
+		nd_set_label(nd_mapping->labels, next_label, j);
+	nd_set_label(nd_mapping->labels, NULL, j);
+}
+
+static bool is_old_resource(struct resource *res, struct resource **list, int n)
 {
 	int i;
+
+	if (res->flags & DPA_RESOURCE_ADJUSTED)
+		return false;
+	for (i = 0; i < n; i++)
+		if (res == list[i])
+			return true;
+	return false;
+}
+
+static struct resource *to_resource(struct nd_dimm_drvdata *ndd,
+		struct nd_namespace_label __iomem *nd_label)
+{
+	struct resource *res;
+
+	for_each_dpa_resource(ndd, res) {
+		if (res->start != readq(&nd_label->dpa))
+			continue;
+		if (resource_size(res) != readq(&nd_label->rawsize))
+			continue;
+		return res;
+	}
+
+	return NULL;
+}
+
+/*
+ * 1/ Account all the labels that can be freed after this update
+ * 2/ Allocate and write the label
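[Editorial sketch] The `nfree - 1 > nfree` test in available_slots_show() relies on unsigned wrap-around, which is easy to misread. A user-space illustration follows; `sketch_available_slots()` and its `nslot` parameter are hypothetical names added for the sketch, and the sanity assertion is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Report one fewer than the raw free-slot count, as the sysfs handler
 * above does. With unsigned arithmetic 0 - 1 wraps to UINT32_MAX, so
 * "nfree - 1 > nfree" is true only when nfree is 0.
 */
uint32_t sketch_available_slots(uint32_t nfree, uint32_t nslot)
{
	assert(nfree <= nslot);	/* cannot have more free slots than slots */

	if (nfree - 1 > nfree)
		return 0;	/* nothing free: do not report a wrapped value */
	return nfree - 1;
}
```

So a dimm with a single free slot reports 0 available, and a dimm with no free slot also reports 0 (the kernel version additionally emits a one-time warning in that case).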
[PATCH v2 15/20] libnd: blk labels and namespace instantiation
A blk label set describes a namespace comprised of one or more
discontiguous dpa ranges on a single dimm. They may alias with one or
more pmem interleave sets that include the given dimm.

This is the runtime/volatile configuration infrastructure for sysfs
manipulation of 'alt_name', 'uuid', 'size', and 'sector_size'. A later
patch will make these settings persistent by writing back the label(s).

Unlike pmem namespaces, multiple blk namespaces can be created per
region. Once a blk namespace has been created a new seed device
(unconfigured child of a parent blk region) is instantiated. As long as
a region has 'available_size' != 0 new child namespaces may be created.

Cc: Greg KH <gre...@linuxfoundation.org>
Cc: Neil Brown <ne...@suse.de>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/block/nd/core.c           |  40 +++
 drivers/block/nd/dimm_devs.c      |  35 +++
 drivers/block/nd/libnd.h          |   3
 drivers/block/nd/namespace_devs.c | 502 ++---
 drivers/block/nd/nd-private.h     |   8 +
 drivers/block/nd/nd.h             |   5
 drivers/block/nd/region_devs.c    |  15 +
 include/linux/nd.h                |  25 ++
 8 files changed, 589 insertions(+), 44 deletions(-)

diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index cf64b7a50d3a..3ec38289be58 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -170,6 +170,46 @@ int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf,
 	return 0;
 }
 
+ssize_t nd_sector_size_show(unsigned long current_lbasize,
+		const unsigned long *supported, char *buf)
+{
+	ssize_t len = 0;
+	int i;
+
+	for (i = 0; supported[i]; i++)
+		if (current_lbasize == supported[i])
+			len += sprintf(buf + len, "[%ld] ", supported[i]);
+		else
+			len += sprintf(buf + len, "%ld ", supported[i]);
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+
+ssize_t nd_sector_size_store(struct device *dev, const char *buf,
+		unsigned long *current_lbasize, const unsigned long *supported)
+{
+	unsigned long lbasize;
+	int rc, i;
+
+	if (dev->driver)
+		return -EBUSY;
+
+	rc = kstrtoul(buf, 0, &lbasize);
+	if (rc)
+		return rc;
+
+	for (i = 0; supported[i]; i++)
+		if (lbasize == supported[i])
+			break;
+
+	if (supported[i]) {
+		*current_lbasize = lbasize;
+		return 0;
+	} else {
+		return -EINVAL;
+	}
+}
+
 static ssize_t commands_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index b242d3ae6d12..4aa5654354ac 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -256,6 +256,41 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 EXPORT_SYMBOL_GPL(nd_dimm_create);
 
 /**
+ * nd_blk_available_dpa - account the unused dpa of BLK region
+ * @nd_mapping: container of dpa-resource-root + labels
+ *
+ * Unlike PMEM, BLK namespaces can occupy discontiguous DPA ranges.
+ */
+resource_size_t nd_blk_available_dpa(struct nd_mapping *nd_mapping)
+{
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	resource_size_t map_end, busy = 0, available;
+	struct resource *res;
+
+	if (!ndd)
+		return 0;
+
+	map_end = nd_mapping->start + nd_mapping->size - 1;
+	for_each_dpa_resource(ndd, res)
+		if (res->start >= nd_mapping->start && res->start < map_end) {
+			resource_size_t end = min(map_end, res->end);
+
+			busy += end - res->start + 1;
+		} else if (res->end >= nd_mapping->start && res->end <= map_end) {
+			busy += res->end - nd_mapping->start;
+		} else if (nd_mapping->start > res->start
+				&& nd_mapping->start < res->end) {
+			/* total eclipse of the BLK region mapping */
+			busy += nd_mapping->size;
+		}
+
+	available = map_end - nd_mapping->start + 1;
+	if (busy < available)
+		return available - busy;
+	return 0;
+}
+
+/**
  * nd_pmem_available_dpa - for the given dimm+region account unallocated dpa
  * @nd_mapping: container of dpa-resource-root + labels
  * @nd_region: constrain available space check to this reference region
diff --git a/drivers/block/nd/libnd.h b/drivers/block/nd/libnd.h
index 832dcfebbb49..3f6b5e09cd67 100644
--- a/drivers/block/nd/libnd.h
+++ b/drivers/block/nd/libnd.h
@@ -26,6 +26,9 @@ enum {
 	ND_CMD_MAX_ENVELOPE = 16,
 	ND_CMD_ARS_QUERY_MAX = SZ_4K,
 	ND_MAX_MAPPINGS = 32,
+
+	/* mark newly adjusted resources as requiring a label update */
+	DPA_RESOURCE_ADJUSTED = 1 << 0,
 };
 
 extern
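[Editorial sketch, not part of the posted patch.] The sector-size handling
in the patch above boils down to matching a requested value against a
zero-terminated list of supported sizes. The user-space approximation below
shows that pattern under stated assumptions: the list contents
(example_supported) and the function names are hypothetical, strtoul()
stands in for the kernel's kstrtoul(), and the dev->driver busy check is
left out.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical list for illustration; a zero entry terminates it. */
static const unsigned long example_supported[] = { 512, 520, 528, 4096, 0 };

/* Print every supported size, bracketing the one currently selected. */
static int sector_size_show(unsigned long current,
		const unsigned long *supported, char *buf, size_t buflen)
{
	size_t len = 0;
	int i, n;

	for (i = 0; supported[i]; i++) {
		if (current == supported[i])
			n = snprintf(buf + len, buflen - len, "[%lu] ",
					supported[i]);
		else
			n = snprintf(buf + len, buflen - len, "%lu ",
					supported[i]);
		if (n < 0 || (size_t)n >= buflen - len)
			return -ENOSPC;
		len += (size_t)n;
	}
	n = snprintf(buf + len, buflen - len, "\n");
	if (n < 0 || (size_t)n >= buflen - len)
		return -ENOSPC;
	return (int)(len + (size_t)n);
}

/* Accept the new size only if it appears in the supported list. */
static int sector_size_store(const char *buf, unsigned long *current,
		const unsigned long *supported)
{
	unsigned long lbasize;
	char *end;
	int i;

	errno = 0;
	lbasize = strtoul(buf, &end, 0);
	if (errno || end == buf)
		return -EINVAL;

	for (i = 0; supported[i]; i++)
		if (lbasize == supported[i]) {
			*current = lbasize;
			return 0;
		}
	return -EINVAL;	/* also the result for 0, the list terminator */
}
```

A rejected value leaves the current setting untouched, which is the same
behaviour the patch gets by only assigning *current_lbasize on a match.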
[PATCH v2 12/20] libnd, nd_acpi: add interleave-set state-tracking infrastructure
On platforms that have firmware support for reading/writing per-dimm
label space, a portion of the dimm may be accessible via an interleave
set PMEM mapping in addition to the dimm's BLK (block-data-window
aperture(s)) interface. A label, stored in a configuration data region
on the dimm, disambiguates which dimm addresses are accessed through
which exclusive interface.

Add infrastructure that allows the kernel to block modifications to a
label in the set while any member dimm is active. Note that this is
meant only for enforcing no modifications of active labels via the
coarse ioctl command. Adding/deleting namespaces from an active
interleave set will only be possible via sysfs.

Another aspect of tracking interleave sets is tracking their integrity
when DIMMs in a set are physically re-ordered. For this purpose we
generate an interleave-set cookie that can be recorded in a label and
validated against the current configuration. It is the bus provider
implementation's responsibility to calculate the interleave set cookie
and attach it to a given region.

Cc: Neil Brown <ne...@suse.de>
Cc: <linux-a...@vger.kernel.org>
Cc: Greg KH <gre...@linuxfoundation.org>
Cc: Robert Moore <robert.mo...@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/block/nd/acpi.c        | 90
 drivers/block/nd/bus.c         | 41 ++
 drivers/block/nd/core.c        | 47 +
 drivers/block/nd/dimm_devs.c   | 19
 drivers/block/nd/libnd.h       |  6 +++
 drivers/block/nd/nd-private.h  | 11 -
 drivers/block/nd/nd.h          |  4 ++
 drivers/block/nd/region_devs.c | 85 ++
 8 files changed, 299 insertions(+), 4 deletions(-)

diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index c3dda74f73d7..d34cefe38e2f 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -15,6 +15,7 @@
 #include <linux/ndctl.h>
 #include <linux/list.h>
 #include <linux/acpi.h>
+#include <linux/sort.h>
 #include "acpi_nfit.h"
 #include "libnd.h"
 
@@ -779,6 +780,90 @@ static const struct attribute_group *nd_acpi_region_attribute_groups[] = {
 	NULL,
 };
 
+/* enough info to uniquely specify an interleave set */
+struct nfit_set_info {
+	struct nfit_set_info_map {
+		u64 region_spa_offset;
+		u32 serial_number;
+		u32 pad;
+	} mapping[0];
+};
+
+static size_t sizeof_nfit_set_info(int num_mappings)
+{
+	return sizeof(struct nfit_set_info)
+		+ num_mappings * sizeof(struct nfit_set_info_map);
+}
+
+static int cmp_map(const void *m0, const void *m1)
+{
+	const struct nfit_set_info_map *map0 = m0;
+	const struct nfit_set_info_map *map1 = m1;
+
+	return memcmp(&map0->region_spa_offset, &map1->region_spa_offset,
+			sizeof(u64));
+}
+
+/* Retrieve the nth entry referencing this spa */
+static struct acpi_nfit_memdev *memdev_from_spa(
+		struct acpi_nfit_desc *acpi_desc, u16 spa_index, int n)
+{
+	struct nfit_memdev *nfit_memdev;
+
+	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list)
+		if (nfit_memdev->memdev->spa_index == spa_index)
+			if (n-- == 0)
+				return nfit_memdev->memdev;
+	return NULL;
+}
+
+static int nd_acpi_init_interleave_set(struct acpi_nfit_desc *acpi_desc,
+		struct nd_region_desc *ndr_desc, struct acpi_nfit_spa *spa)
+{
+	u16 num_mappings = ndr_desc->num_mappings;
+	int i, spa_type = nfit_spa_type(spa);
+	struct device *dev = acpi_desc->dev;
+	struct nd_interleave_set *nd_set;
+	struct nfit_set_info *info;
+
+	if (spa_type == NFIT_SPA_PM || spa_type == NFIT_SPA_VOLATILE)
+		/* pass */;
+	else
+		return 0;
+
+	nd_set = devm_kzalloc(dev, sizeof(*nd_set), GFP_KERNEL);
+	if (!nd_set)
+		return -ENOMEM;
+
+	info = devm_kzalloc(dev, sizeof_nfit_set_info(num_mappings), GFP_KERNEL);
+	if (!info)
+		return -ENOMEM;
+	for (i = 0; i < num_mappings; i++) {
+		struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
+		struct nfit_set_info_map *map = &info->mapping[i];
+		struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+		struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+		struct acpi_nfit_memdev *memdev = memdev_from_spa(acpi_desc,
+				spa->spa_index, i);
+
+		if (!memdev || !nfit_mem->dcr) {
+			dev_err(dev, "%s: failed to find DCR\n", __func__);
+			return -ENODEV;
+		}
+
+		map->region_spa_offset = memdev->region_spa_offset;
+		map->serial_number = nfit_mem->dcr->serial_number;
+	}
+
+	sort(&info->mapping[0
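[Editorial sketch, not part of the posted patch.] The quoted diff is cut
off just as the mappings are sorted, so the cookie computation itself is
not visible here. The user-space sketch below only illustrates the idea
described in the changelog: collect (region offset, serial number) pairs,
put them in a canonical order, and reduce them to one 64-bit value that
changes when a DIMM ends up at a different position. The Fletcher-style
checksum is an assumption chosen for illustration, the names are
hypothetical, and the comparison is numeric rather than the memcmp() used
by cmp_map() in the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors the fields of nfit_set_info_map from the patch. */
struct set_info_map {
	uint64_t region_spa_offset;
	uint32_t serial_number;
	uint32_t pad;
};

/* Order mappings by their offset within the interleave set. */
static int cmp_map(const void *a, const void *b)
{
	const struct set_info_map *m0 = a;
	const struct set_info_map *m1 = b;

	if (m0->region_spa_offset < m1->region_spa_offset)
		return -1;
	if (m0->region_spa_offset > m1->region_spa_offset)
		return 1;
	return 0;
}

/*
 * Sort the mappings, then fold them into a 64-bit cookie. The running-sum
 * checksum is a stand-in; the real algorithm is not shown in the excerpt.
 */
static uint64_t interleave_set_cookie(struct set_info_map *maps, size_t n)
{
	uint32_t lo = 0;
	uint64_t hi = 0;
	size_t i, w;

	qsort(maps, n, sizeof(*maps), cmp_map);
	for (i = 0; i < n; i++) {
		uint32_t words[3] = {
			(uint32_t)maps[i].region_spa_offset,
			(uint32_t)(maps[i].region_spa_offset >> 32),
			maps[i].serial_number,
		};

		for (w = 0; w < 3; w++) {
			lo += words[w];
			hi += lo;	/* position-sensitive term */
		}
	}
	return (hi << 32) | lo;
}
```

Because the records are sorted first, the order in which the mappings are
discovered does not matter; because the running sum is position-sensitive,
swapping which serial number sits at which offset produces a different
cookie, which is the property needed to detect physically re-ordered DIMMs.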
[PATCH v2 19/20] nd_btt: atomic sector updates
From: Vishal Verma <vishal.l.ve...@linux.intel.com>

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the ->rw_bytes() capability
of provided nd namespace devices.

The BTT works as a stacked block device, and reserves a chunk of space
from the backing device for its accounting metadata. BLK namespaces may
mandate use of a BTT and expect the bus to initialize a BTT if not
already present. Otherwise if a BTT is desired for other namespaces (or
partitions of a namespace) a BTT may be manually configured.

Cc: Andy Lutomirski <l...@amacapital.net>
Cc: Boaz Harrosh <b...@plexistor.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Jens Axboe <ax...@fb.com>
Cc: Ingo Molnar <mi...@kernel.org>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Neil Brown <ne...@suse.de>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Dave Chinner <da...@fromorbit.com>
Cc: Greg KH <gre...@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma <vishal.l.ve...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 Documentation/blockdev/btt.txt |  273
 drivers/block/nd/Kconfig       |   20 +
 drivers/block/nd/Makefile      |    3
 drivers/block/nd/acpi.c        |    1
 drivers/block/nd/btt.c         | 1423
 drivers/block/nd/btt.h         |  140
 drivers/block/nd/btt_devs.c    |    3
 drivers/block/nd/libnd.h       |    1
 drivers/block/nd/nd-private.h  |    1
 drivers/block/nd/nd.h          |   10
 drivers/block/nd/region.c      |   67 ++
 drivers/block/nd/region_devs.c |   10
 12 files changed, 1948 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/blockdev/btt.txt
 create mode 100644 drivers/block/nd/btt.c

diff --git a/Documentation/blockdev/btt.txt b/Documentation/blockdev/btt.txt
new file mode 100644
index 000000000000..95134d5ec4a0
--- /dev/null
+++ b/Documentation/blockdev/btt.txt
@@ -0,0 +1,273 @@
+BTT - Block Translation Table
+=============================
+
+
+1. Introduction
+---------------
+
+Persistent memory based storage is able to perform IO at byte (or more
+accurately, cache line) granularity. However, we often want to expose such
+storage as traditional block devices. The block drivers for persistent memory
+will do exactly this. However, they do not provide any atomicity guarantees.
+Traditional SSDs typically provide protection against torn sectors in hardware,
+using stored energy in capacitors to complete in-flight block writes, or perhaps
+in firmware. We don't have this luxury with persistent memory - if a write is in
+progress, and we experience a power failure, the block will contain a mix of old
+and new data. Applications may not be prepared to handle such a scenario.
+
+The Block Translation Table (BTT) provides atomic sector update semantics for
+persistent memory devices, so that applications that rely on sector writes not
+being torn can continue to do so. The BTT manifests itself as a stacked block
+device, and reserves a portion of the underlying storage for its metadata. At
+the heart of it, is an indirection table that re-maps all the blocks on the
+volume. It can be thought of as an extremely simple file system that only
+provides atomic sector updates.
+
+
+2. Static Layout
+----------------
+
+The underlying storage on which a BTT can be laid out is not limited in any way.
+The BTT, however, splits the available space into chunks of up to 512 GiB,
+called "Arenas".
+
+Each arena follows the same layout for its metadata, and all references in an
+arena are internal to it (with the exception of one field that points to the
+next arena). The following depicts the "On-disk" metadata layout:
+
+
+  Backing Store     +------->  Arena
++---------------+   |   +------------------+
+|               |   |   | Arena info block |
+|    Arena 0    +---+   |        4K        |
+|     512G      |       +------------------+
+|               |       |                  |
++---------------+       |                  |
+|               |       |                  |
+|    Arena 1    |       |   Data Blocks    |
+|     512G      |       |                  |
+|               |       |                  |
++---------------+       |                  |
+|       .       |       |                  |
+|       .       |       |                  |
+|       .       |       |                  |
+|               |       |                  |
+|               |       |                  |
++---------------+       +------------------+
+                        |                  |
+                        |     BTT Map
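[Editorial sketch, not part of the posted patch.] The introduction above
describes the core of the BTT as an indirection table that re-maps blocks
so that sector updates are never torn. The toy model below illustrates only
that idea: new data is staged in a spare block, and the write becomes
visible through a single map-entry update. Everything here is a
hypothetical simplification kept in ordinary memory; the on-media metadata
of the real BTT (the excerpt is cut off before it is described) and any
concurrency handling are not modelled.

```c
#include <stdint.h>
#include <string.h>

#define TOY_SECTOR	512
#define TOY_NLBA	4	/* externally visible sectors */

struct toy_btt {
	uint32_t map[TOY_NLBA];			/* external LBA -> internal block */
	uint32_t free_block;			/* the one spare internal block */
	uint8_t data[TOY_NLBA + 1][TOY_SECTOR];	/* one block more than LBAs */
};

static void toy_btt_init(struct toy_btt *btt)
{
	uint32_t i;

	memset(btt, 0, sizeof(*btt));
	for (i = 0; i < TOY_NLBA; i++)
		btt->map[i] = i;		/* identity mapping to start */
	btt->free_block = TOY_NLBA;
}

/* Step 1: copy the new sector into the spare block; readers see no change. */
static uint32_t toy_btt_stage(struct toy_btt *btt, const void *buf)
{
	memcpy(btt->data[btt->free_block], buf, TOY_SECTOR);
	return btt->free_block;
}

/* Step 2: publish the staged block with one map update, recycle the old one. */
static void toy_btt_commit(struct toy_btt *btt, uint32_t lba, uint32_t staged)
{
	uint32_t old = btt->map[lba];

	btt->map[lba] = staged;
	btt->free_block = old;
}

static const uint8_t *toy_btt_read(const struct toy_btt *btt, uint32_t lba)
{
	return btt->data[btt->map[lba]];
}
```

If the model is interrupted between stage and commit, a read of the same
LBA still returns the complete old sector, since the partially or fully
written new data sits in a block that no map entry points to.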
[PATCH v2 11/20] libnd, nd_pmem: add libnd support to the pmem driver
nd_pmem attaches to persistent memory regions and namespaces emitted by
the nd subsystem, and, same as the original pmem driver, presents the
system-physical-address range as a block device.

The existing e820-type-12 to pmem setup is converted to a full libnd bus
that emits an nd_namespace_io device.

Cc: Andy Lutomirski <l...@amacapital.net>
Cc: Boaz Harrosh <b...@plexistor.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Jens Axboe <ax...@fb.com>
Cc: Ingo Molnar <mi...@kernel.org>
Cc: Christoph Hellwig <h...@lst.de>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/x86/kernel/pmem.c    |   2 -
 drivers/block/Kconfig     |  11 -
 drivers/block/Makefile    |   1
 drivers/block/nd/Kconfig  |  27
 drivers/block/nd/Makefile |   6 +++
 drivers/block/nd/e820.c   | 100 +
 drivers/block/nd/pmem.c   |  47 ++---
 7 files changed, 157 insertions(+), 37 deletions(-)
 create mode 100644 drivers/block/nd/e820.c
 rename drivers/block/{pmem.c => nd/pmem.c} (88%)

diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
index 3420c874ddc5..279328c42f87 100644
--- a/arch/x86/kernel/pmem.c
+++ b/arch/x86/kernel/pmem.c
@@ -13,7 +13,7 @@ static __init void register_pmem_device(struct resource *res)
 	struct platform_device *pdev;
 	int error;
 
-	pdev = platform_device_alloc("pmem", PLATFORM_DEVID_AUTO);
+	pdev = platform_device_alloc("e820_pmem", PLATFORM_DEVID_AUTO);
 	if (!pdev)
 		return;
 
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index dfe40e5ca9bd..1cef4ffb16c5 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -406,17 +406,6 @@ config BLK_DEV_RAM_DAX
 	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
-config BLK_DEV_PMEM
-	tristate "Persistent memory block device support"
-	help
-	  Saying Y here will allow you to use a contiguous range of reserved
-	  memory as one or more persistent block devices.
-
-	  To compile this driver as a module, choose M here: the module will be
-	  called 'pmem'.
-
-	  If unsure, say N.
-
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 07a6acecf4d8..964d8eb2c16f 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,7 +14,6 @@ obj-$(CONFIG_PS3_VRAM)+= ps3vram.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
-obj-$(CONFIG_BLK_DEV_PMEM)	+= pmem.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)	+= cciss.o
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index d2d84451e82c..c5eaf195734d 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -68,4 +68,31 @@ config NFIT_TEST
 
 	  Say N unless you are doing development of the 'nd' subsystem.
 
+config ND_E820
+	tristate "E820: Support the E820-type-12 PMEM convention"
+	depends on X86_PMEM_LEGACY
+	default m if X86_PMEM_LEGACY
+	select LIBND
+	help
+	  Prior to ACPI 6 some platforms advertised persistent memory
+	  via type-12 e820 memory ranges. Create a libnd bus and
+	  attach an instance of the pmem driver to these ranges.
+
+config BLK_DEV_PMEM
+	tristate "PMEM: Persistent memory block device support"
+	depends on LIBND
+	default LIBND
+	help
+	  Memory ranges for PMEM are described by either an NFIT
+	  (NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
+	  non-standard OEM-specific E820 memory type (type-12, see
+	  CONFIG_X86_PMEM_LEGACY), or it is manually specified by the
+	  'memmap=nn[KMG]!ss[KMG]' kernel command line (see
+	  Documentation/kernel-parameters.txt). This driver converts
+	  these persistent memory ranges into block devices that are
+	  capable of DAX (direct-access) file system mappings. See
+	  Documentation/blockdev/nd.txt for more details.
+
+	  Say Y if you want to use a NVDIMM described by NFIT
+
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 0fb0891e1817..ebb212af9f15 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -14,10 +14,16 @@ endif
 
 obj-$(CONFIG_LIBND) += libnd.o
 obj-$(CONFIG_ND_ACPI) += nd_acpi.o
+obj-$(CONFIG_ND_E820) += nd_e820.o
 obj-$(CONFIG_NFIT_TEST) += test/
+obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 
 nd_acpi-y := acpi.o
 
+nd_e820-y := e820.o
+
+nd_pmem-y := pmem.o
+
 libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
diff --git a/drivers/block/nd/e820.c b/drivers/block/nd/e820.c
new file mode 100644
index 000000000000..f4db8c54248e
--- /dev/null
+++ b/drivers/block
[PATCH v2 20/20] libnd, nd_acpi, nd_blk: driver for BLK-mode access persistent memory
From: Ross Zwisler <ross.zwis...@linux.intel.com>

The libnd implementation handles allocating dimm address space (DPA)
between PMEM and BLK mode interfaces. After DPA has been allocated from
a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
as a struct bio based block device. Unlike PMEM, BLK is required to
handle platform specific details like mmio register formats and memory
controller interleave. For this reason the libnd generic nd_blk driver
calls back into the bus provider to carry out the I/O.

This initial implementation handles the BLK interface defined by the
ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
DCR (dimm control region), BDW (block data window), IDT (interleave
descriptor) NFIT structures and the hardware register format.

[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Andy Lutomirski <l...@amacapital.net>
Cc: Boaz Harrosh <b...@plexistor.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Jens Axboe <ax...@fb.com>
Cc: Ingo Molnar <mi...@kernel.org>
Cc: Christoph Hellwig <h...@lst.de>
Signed-off-by: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/block/nd/Kconfig          |  12 +
 drivers/block/nd/Makefile         |   3
 drivers/block/nd/acpi.c           | 422 +++--
 drivers/block/nd/acpi_nfit.h      |  47
 drivers/block/nd/blk.c            | 264 +++
 drivers/block/nd/libnd.h          |  11 +
 drivers/block/nd/namespace_devs.c |  47
 drivers/block/nd/nd-private.h     |   3
 drivers/block/nd/nd.h             |  16 +
 drivers/block/nd/region.c         |   8 +
 drivers/block/nd/region_devs.c    |  65 +-
 drivers/block/nd/test/nfit.c      |  29 +++
 drivers/block/nd/test/nfit_test.h |   2
 13 files changed, 891 insertions(+), 38 deletions(-)
 create mode 100644 drivers/block/nd/blk.c

diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 612bf2b14283..bac4290129fc 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -95,6 +95,18 @@ config BLK_DEV_PMEM
 
 	  Say Y if you want to use a NVDIMM described by ACPI, E820, etc...
 
+config ND_BLK
+	tristate "BLK: Block data window (aperture) device support"
+	depends on LIBND
+	default ND_ACPI
+	help
+	  This driver performs I/O using a set of mmio windows on a
+	  dimm. The set of apertures will all access the one DIMM.
+	  Multiple windows allow multiple threads to have different
+	  portions of the dimm open at one time.
+
+	  Say Y if you want to use a NVDIMM with BLK-mode capability
+
 config ND_BTT_DEVS
 	bool
 
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 7d778b4523d4..ef36927618e5 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_ND_E820) += nd_e820.o
 obj-$(CONFIG_NFIT_TEST) += test/
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
+obj-$(CONFIG_ND_BLK) += nd_blk.o
 
 nd_acpi-y := acpi.o
 
@@ -27,6 +28,8 @@ nd_pmem-y := pmem.o
 
 nd_btt-y := btt.o
 
+nd_blk-y := blk.o
+
 libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index 5b9997fbc344..e4ff3a9b4fc1 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -12,12 +12,14 @@
  */
 #include <linux/list_sort.h>
 #include <linux/module.h>
+#include <linux/mutex.h>
 #include <linux/ndctl.h>
 #include <linux/list.h>
 #include <linux/acpi.h>
 #include <linux/sort.h>
 #include "acpi_nfit.h"
 #include "libnd.h"
+#include "nd.h"
 
 static bool warn_checksum;
 module_param(warn_checksum, bool, S_IRUGO|S_IWUSR);
@@ -84,7 +86,7 @@ static int nd_acpi_ctl(struct nd_bus_descriptor *nd_desc,
 	if (!adev)
 		return -ENOTTY;
 
-	dimm_name = dev_name(&adev->dev);
+	dimm_name = nd_dimm_name(nd_dimm);
 	cmd_name = nd_dimm_cmd_name(cmd);
 	dsm_mask = nfit_mem->dsm_mask;
 	desc = nd_cmd_dimm_desc(cmd);
@@ -301,10 +303,21 @@ static void *add_table(struct acpi_nfit_desc *acpi_desc, void *table, const void
 			bdw->dcr_index, bdw->num_bdw);
 		break;
 	}
-	/* TODO */
-	case NFIT_TABLE_IDT:
-		dev_dbg(dev, "%s: idt\n", __func__);
+	case NFIT_TABLE_IDT: {
+		struct nfit_idt *nfit_idt = devm_kzalloc(dev, sizeof(*nfit_idt),
+				GFP_KERNEL);
+		struct acpi_nfit_idt *idt = table;
+
+		if (!nfit_idt)
+			return err;
+		INIT_LIST_HEAD(&nfit_idt->list);
+		nfit_idt->idt = idt;
+		list_add_tail(&nfit_idt->list, &acpi_desc->idts);
+		dev_dbg(dev, "%s: idt index: %d num_lines: %d\n", __func__
Re: [Linux-nvdimm] [PATCH 12/21] nd_pmem: add NFIT support to the pmem driver
On Tue, Apr 28, 2015 at 5:56 AM, Christoph Hellwig <h...@infradead.org> wrote:
> On Sat, Apr 18, 2015 at 12:37:09PM -0700, Dan Williams wrote:
>> At this point in the patch series I agree, but in later patches we
>> take advantage of nd bus services. "[PATCH 15/21] nd: pmem label sets
>> and namespace instantiation" adds support for labeled pmem namespaces,
>> and in "[PATCH 19/21] nd: infrastructure for btt devices" we make pmem
>> capable of hosting btt instances.
>
> That's fine, but still doesn't require moving it around.

I ended up not moving it in v2. Let me know if the updated rationale
makes sense.
Re: [Linux-nvdimm] [PATCH 01/21] e820, efi: add ACPI 6.0 persistent memory types
On Tue, Apr 28, 2015 at 5:46 AM, Christoph Hellwig <h...@infradead.org> wrote:
> On Fri, Apr 17, 2015 at 09:35:19PM -0400, Dan Williams wrote:
>> diff --git a/arch/ia64/kernel/efi.c b/arch/ia64/kernel/efi.c
>> index c52d7540dc05..cd8b7485e396 100644
>> --- a/arch/ia64/kernel/efi.c
>> +++ b/arch/ia64/kernel/efi.c
>> @@ -1227,6 +1227,7 @@ efi_initialize_iomem_resources(struct resource *code_resource,
>>  		case EFI_RUNTIME_SERVICES_CODE:
>>  		case EFI_RUNTIME_SERVICES_DATA:
>>  		case EFI_ACPI_RECLAIM_MEMORY:
>> +		case EFI_PERSISTENT_MEMORY:
>>  		default:
>>  			name = "reserved";
>
> You probably want "pmem" as name here..
>
>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>> index 11cc7d54ec3f..410af501a941 100644
>> --- a/arch/x86/kernel/e820.c
>> +++ b/arch/x86/kernel/e820.c
>> @@ -137,6 +137,8 @@ static void __init e820_print_type(u32 type)
>>  	case E820_RESERVED_KERN:
>>  		printk(KERN_CONT "usable");
>>  		break;
>> +	case E820_PMEM:
>> +	case E820_PRAM:
>>  	case E820_RESERVED:
>>  		printk(KERN_CONT "reserved");
>>  		break;
>> @@ -149,9 +151,6 @@ static void __init e820_print_type(u32 type)
>>  	case E820_UNUSABLE:
>>  		printk(KERN_CONT "unusable");
>>  		break;
>> -	case E820_PRAM:
>> -		printk(KERN_CONT "persistent (type %u)", type);
>> -		break;
>
> Please keep this printk, and add the new E820_PMEM case to it as well.
>
>> +static bool do_mark_busy(u32 type, struct resource *res)
>> +{
>> +	if (res->start < (1ULL<<20))
>> +		return true;
>> +
>> +	switch (type) {
>> +	case E820_RESERVED:
>> +	case E820_PRAM:
>> +	case E820_PMEM:
>> +		return false;
>> +	default:
>> +		return true;
>> +	}
>> +}
>
> Please add a comment explaining the choices once you start refactoring
> this. Especially the address check is black magic..

Ok, I was able to incorporate all these into v2.
Re: [Linux-nvdimm] [PATCH 03/21] nd_acpi: initial core implementation and nfit skeleton
On Tue, Apr 28, 2015 at 5:53 AM, Christoph Hellwig <h...@infradead.org> wrote:
> On Fri, Apr 17, 2015 at 09:35:30PM -0400, Dan Williams wrote:
>> new file mode 100644
>> index 000000000000..5fa74f124b3e
>> --- /dev/null
>> +++ b/drivers/block/nd/Kconfig
>> @@ -0,0 +1,44 @@
>> +config ND_ARCH_HAS_IOREMAP_CACHE
>> +	depends on (X86 || IA64 || ARM || ARM64 || SH || XTENSA)
>> +	def_bool y
>
> As mentioned before please either define this symbol in each arch
> Kconfig, or just ensure every architecture provides a stub.
>
> But more importantly it doesn't seem like you're actually using
> ioremap_cache anywhere.
>
> Allowing a cached ioremap would be a very worthwhile addition to the
> pmem drivers once we have the proper memcpy functions making it safe,
> and is one of the high priority todo items for the pmem driver.
>
>> +
>> +menuconfig NFIT_DEVICES
>> +	bool "NVDIMM (NFIT) Support"
>
> Please just call all the symbols and file names nvdimm instead of nfit
> or nd to make everyone's life simpler for the generic code. Just use
> the EFI/ACPI terminology in those parts that actually parse those
> tables.

Done in v2.
Re: [Linux-nvdimm] [PATCH 05/21] nfit-test: manufactured NFITs for interface development
On Tue, Apr 28, 2015 at 5:54 AM, Christoph Hellwig <h...@infradead.org> wrote:
> Eww, the --wrap stuff is too ugly to live.

Since when are unit tests pretty?

> Just put the implementation of persistent nvdimms into qemu where it
> belongs.

Ugh, no, I'm not keen to introduce yet another roadblock to running the
tests and another degree of freedom for things to bit rot. It will never
be pretty, but the implementation at least gets slightly cleaner in v2
with the removal of the wrapping for nd_blk_do_io(). It's also worth
noting that 0day is currently running our unit tests.

> Note that having a not actually persistent implementation that
> registers with the subsystems which doesn't need these hacks still
> sounds ok to me, although I suspect most users would much prefer the
> virtualization based variant.

KVM NFIT enabling is happening, but I don't think it is useful as a unit
test vehicle.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar <mi...@kernel.org> wrote:
> * Dan Williams <dan.j.willi...@intel.com> wrote:
>>> Anyway, I did want to say that while I may not be convinced about
>>> the approach, I think the patches themselves don't look horrible.
>>> I actually like your "__pfn_t". So while I (very obviously) have
>>> some doubts about this approach, it may be that the most convincing
>>> argument is just in the code.
>>
>> Ok, I'll keep thinking about this and come back when we have a better
>> story about passing mmap'd persistent memory around in userspace.
>
> So is there anything fundamentally wrong about creating struct page
> backing at mmap() time (and making sure aliased mmaps share struct
> page arrays)?
>
> Something like get_user_pages() triggers memory hotplug for persistent
> memory, so they are actual real struct pages?

Can we do memory hotplug at that granularity?
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 11:40 AM, Ingo Molnar <mi...@kernel.org> wrote:
> * Dan Williams <dan.j.willi...@intel.com> wrote:
>> On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig <h...@lst.de> wrote:
>>> On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
>>>> What is the primary thing that is driving this need? Do we have a
>>>> very concrete example?
>>>
>>> FYI, I plan to implement RAID acceleration using nvdimms, and I plan
>>> to use pages for that. The code just merged for 4.1 can easily
>>> support page backing, and I plan to use that for now. This still
>>> leaves support for the gigantic intel nvdimms discovered over EFI
>>> out, but given that I don't have access to them, and I don't know of
>>> any publicly available, there's little I can do for now. But adding
>>> on-demand allocated struct pages for them seems like the easiest way
>>> forward. Boaz already has code to allocate pages for them, although
>>> not on demand but at boot / plug in time.
>>
>> Hmmm, the capacities of persistent memory that would be assigned for
>> a raid accelerator would be limited by diminishing returns. I.e.
>> there seems to be no point to assign more than 8GB or so to the
>> cache? [...]
>
> Why would that be the case?
>
> If it's not a temporary cache but a persistent cache that hosts all
> the data even after writeback completes then going to huge sizes will
> bring similar benefits to using a large, fast SSD disk on your
> desktop... The larger, the better. And it also persists across
> reboots.

True, that's more dm-cache than RAID accelerator, but point taken.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 10:43 AM, Linus Torvalds <torva...@linux-foundation.org> wrote:
> On Thu, May 7, 2015 at 9:03 AM, Dan Williams <dan.j.willi...@intel.com> wrote:
>> Ok, I'll keep thinking about this and come back when we have a better
>> story about passing mmap'd persistent memory around in userspace.
>
> Ok. And if we do decide to go with your kind of "__pfn" type, I'd
> probably prefer that we encode the type in the low bits of the word
> rather than compare against PAGE_OFFSET. On some architectures
> PAGE_OFFSET is zero (admittedly probably not ones you'd care about),
> but even on x86 it's a *lot* cheaper to test the low bit than it is to
> compare against a big constant.
>
> We know "struct page *" is supposed to be aligned to at least
> "unsigned long", so you'd have two bits of type information (and we
> could easily make it three). With "0" being a real pointer, so that
> you can use the pointer itself without masking.
>
> And the "hide type in low bits of pointer" is something we've done
> quite a lot, so it's more "kernel coding style" anyway.

Ok. Although __pfn_t also stores pfn values directly which will consume
those 2 bits so we'll need to shift pfns up when storing.
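[Editorial sketch, not part of the thread.] The encoding discussed in the
message above, a type tag in the low bits with 0 meaning "plain pointer"
and pfn values shifted up to make room for the tag, can be illustrated in
user space as follows. The type name, tag values and helper names are all
hypothetical; nothing here is taken from the __pfn_t patches themselves.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: the low two bits hold the type, 0 = plain pointer. */
#define TAG_BITS	2
#define TAG_MASK	((uintptr_t)((1u << TAG_BITS) - 1))
#define TAG_PTR		0u	/* word is an unmodified, aligned pointer */
#define TAG_PFN		1u	/* word is a pfn shifted up by TAG_BITS */

typedef struct { uintptr_t val; } tagged_pfn_t;

/* Pointers aligned to at least 4 bytes already have zero low bits. */
static tagged_pfn_t tagged_from_ptr(void *p)
{
	tagged_pfn_t t = { (uintptr_t)p };

	return t;
}

/* A raw pfn has to be shifted up so the tag does not clobber its low bits. */
static tagged_pfn_t tagged_from_pfn(uintptr_t pfn)
{
	tagged_pfn_t t = { (pfn << TAG_BITS) | TAG_PFN };

	return t;
}

/* Testing the low bits is a cheap mask-and-compare. */
static bool tagged_is_ptr(tagged_pfn_t t)
{
	return (t.val & TAG_MASK) == TAG_PTR;
}

/* With tag 0 the stored word is the pointer itself; no masking needed. */
static void *tagged_to_ptr(tagged_pfn_t t)
{
	return (void *)t.val;
}

static uintptr_t tagged_to_pfn(tagged_pfn_t t)
{
	return t.val >> TAG_BITS;
}
```

The cost of the shift is that the top TAG_BITS bits of the word are no
longer available for pfn values, which is the trade-off the reply points
at.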
Re: [PATCH v2 01/10] arch: introduce __pfn_t for persistent memory i/o
On Thu, May 7, 2015 at 7:55 AM, Stephen Rothwell <s...@canb.auug.org.au> wrote:
> Hi Dan,
>
> On Wed, 06 May 2015 16:04:59 -0400 Dan Williams <dan.j.willi...@intel.com> wrote:
>> diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
>> new file mode 100644
>> index 000000000000..91171e0285d9
>> --- /dev/null
>> +++ b/include/asm-generic/pfn.h
>> @@ -0,0 +1,51 @@
>> +#ifndef __ASM_PFN_H
>> +#define __ASM_PFN_H
>> +
>> +#ifndef __pfn_to_phys
>> +#define __pfn_to_phys(pfn) ((dma_addr_t)(pfn) << PAGE_SHIFT)
>
> Why dma_addr_t and not phys_addr_t? i.e. it could use a comment if it
> is correct.

Hmm, this was derived from:

#define page_to_phys(page)	((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT)

in arch/x86/include/asm/io.h

The primary user of __pfn_to_phys() is dma_map_page(). I'll add a
comment to that effect.
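[Editorial sketch, not part of the thread.] Independent of which address
type is the right one, the macro above casts the pfn before shifting. The
user-space example below shows why the order matters when the pfn is held
in a 32-bit type and the address type is 64 bits wide. It does not settle
the dma_addr_t versus phys_addr_t question; the 4 KiB page size, the type
name and the macro names are assumptions made for illustration.

```c
#include <stdint.h>

#define EX_PAGE_SHIFT	12		/* assumed 4 KiB pages */

typedef uint64_t ex_dma_addr_t;		/* stand-in for a 64-bit dma_addr_t */

/* Cast first, then shift: the shift is carried out in 64 bits. */
#define ex_pfn_to_phys(pfn)	((ex_dma_addr_t)(pfn) << EX_PAGE_SHIFT)

/* Shift first, then cast: a 32-bit pfn wraps before it is widened. */
#define ex_pfn_to_phys_bad(pfn)	((ex_dma_addr_t)((pfn) << EX_PAGE_SHIFT))
```

With a 32-bit pfn of 0x100000 (the first page at 4 GiB), the first macro
yields 0x100000000 while the second wraps to 0.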
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 7:42 AM, Ingo Molnar <mi...@kernel.org> wrote:
> * Ingo Molnar <mi...@kernel.org> wrote:
>> [...]
>>
>> For anything more complex, that maps any of this storage to
>> user-space, or exposes it to higher level struct page based APIs,
>> etc., where references matter and it's more of a cache with
>> potentially multiple users, not an IO space, the natural API is
>> struct page.
>
> Let me walk back on this:
>
>> I'd say that this particular series mostly addresses the 'pfn as
>> sector_t' side of the equation, where persistent memory is IO space,
>> not memory space, and as such it is the more natural and thus also
>> the cheaper/faster approach.
>
> ... but that does not appear to be the case: this series replaces a
> 'struct page' interface with a pure pfn interface for the express
> purpose of being able to DMA to/from 'memory areas' that are not
> struct page backed.
>
> Linus probably disagrees? :-)
>
> [ and he'd disagree rightfully ;-) ]
>
> So what this patch set tries to achieve is (sector_t -> sector_t) IO
> between storage devices (i.e. a rare and somewhat weird usecase), and
> does it by squeezing one device's storage address into our formerly
> struct page backed descriptor, via a pfn.
>
> That looks like a layering violation and a mistake to me. If we want
> to do direct (sector_t -> sector_t) IO, with no serialization worries,
> it should have its own (simple) API - which things like hierarchical
> RAID or RDMA APIs could use.

I'm wrapped around the idea that __pfn_t *is* that simple api for the
tiered storage driver use case. For RDMA I think we need struct page
because I assume that would be coordinated through a filesystem and
truncate() is back in play.

What does an alternative API look like?

> If what we want to do is to support say an mmap() of a file on
> persistent storage, and then read() into that file from another device
> via DMA, then I think we should have allocated struct page backing at
> mmap() time already, and all regular syscall APIs would 'just work'
> from that point on - far above what page-less, pfn-based APIs can do.
>
> The temporary struct page backing can then be freed at munmap() time.

Yes, passing around mmap()'d (DAX) persistent memory will need more than
a __pfn_t.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 8:00 AM, Linus Torvalds <torva...@linux-foundation.org> wrote:
> On Wed, May 6, 2015 at 7:36 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
>> My pet concrete example is covered by __pfn_t. Referencing
>> persistent memory in an md/dm hierarchical storage configuration.
>> Setting aside the thrash to get existing block users to do
>> "bvec_set_page(page)" instead of "bvec->page = page" the onus is on
>> that md/dm implementation and backing storage device driver to
>> operate on __pfn_t. That use case is simple because there is no use
>> of page locking or refcounting in that path, just dma_map_page() and
>> kmap_atomic().
>
> So clarify for me: are you trying to make the IO stack in general be
> able to use the persistent memory as a source (or destination) for IO
> to _other_ devices, or are you talking about just internally shuffling
> things around for something like RAID on top of persistent memory?
>
> Because I think those are two very different things.

Yes, they are, and I am referring to the former, persistent memory as a
source/destination to other devices.

> For example, one of the things I worry about is for people doing IO
> from persistent memory directly to some "slow stable storage" (aka
> disk). That was what I thought you were aiming for: infrastructure so
> that you can make a bio for a *disk* device contain a page list that
> is the persistent memory.
>
> And I think that is a very dangerous operation to do, because the
> persistent memory itself is going to have some filesystem on it, so
> anything that looks up the persistent memory pages is *not* going to
> have a stable pfn: the pfn will point to a fixed part of the
> persistent memory, but the file that was there may be deleted and the
> memory reassigned to something else.

Indeed, truncate() in the absence of struct page has been a major hurdle
for persistent memory enabling. But it does not impact this specific
md/dm use case. md/dm will have taken an exclusive claim on an entire
pmem block device (or partition), so there will be no competing with a
filesystem.

> That's the kind of thing that "struct page" helps with for normal IO
> devices. It's both a source of serialization and indirection, so that
> when somebody does a "truncate()" on a file, we don't end up doing IO
> to random stale locations on the disk that got reassigned to another
> file.
>
> So "struct page" is very fundamental. It's *not* just a "this is the
> physical source/drain of the data you are doing IO on".
>
> So if you are looking at some kind of "zero-copy IO", where you can do
> IO from a filesystem on persistent storage to *another* filesystem on
> (say, a big rotational disk used for long-term storage) by just doing
> a bio that targets the disk, but has the persistent memory as the
> source memory, I really want to understand how you are going to
> serialize this.
>
> So *that* is what I meant by "What is the primary thing that is
> driving this need? Do we have a very concrete example?"
>
> I absolutely do *not* want to teach the bio subsystem to just randomly
> be able to take the source/destination of the IO as being some random
> pfn without knowing what the actual uses are and how these IO's are
> generated in the first place.

blkdev_get(FMODE_EXCL) is the protection in this case.

> I was assuming that you wanted to do something where you mmap() the
> persistent memory, and then write it out to another device (possibly
> using aio_write()). But that really does require some kind of
> serialization at a higher level, because you can't just look up the
> pfn's in the page table and assume they are stable: they are *not*
> stable.

We want to get there eventually, but this patchset does not address that
case.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 8:58 AM, Linus Torvalds <torva...@linux-foundation.org> wrote:
> On Thu, May 7, 2015 at 8:40 AM, Dan Williams <dan.j.willi...@intel.com> wrote:
>>
>> blkdev_get(FMODE_EXCL) is the protection in this case.
>
> Ugh. That looks like a horrible nasty big hammer that will bite us
> badly some day. Since you'd have to hold it for the whole IO. But I
> guess it at least works.

Oh no, that wouldn't be per-I/O, that would be permanent at
configuration setup time, just like a raid member device.

Something like:

    mdadm --create /dev/md0 --cache=/dev/pmem0p1 --storage=/dev/sda

> Anyway, I did want to say that while I may not be convinced about the
> approach, I think the patches themselves don't look horrible. I
> actually like your "__pfn_t". So while I (very obviously) have some
> doubts about this approach, it may be that the most convincing
> argument is just in the code.

Ok, I'll keep thinking about this and come back when we have a better
story about passing mmap'd persistent memory around in userspace.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 6, 2015 at 3:10 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
> On Wed, May 6, 2015 at 1:04 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
>>
>> The motivation for this change is persistent memory and the desire to
>> use it not only via the pmem driver, but also as a memory target for I/O
>> (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.
>
> I detest this approach.

Hmm, yes, I can't argue against "put the onus on odd behavior where it
belongs".

> I'd much rather go exactly the other way around, and do the dynamic
> "struct page" instead.
>
> Add a flag to "struct page"

Ok, given I had already precluded 32-bit systems in this __pfn_t
approach we should have flag space for this on 64-bit.

> to mark it as a fake entry and teach
> "page_to_pfn()" to look up the actual pfn some way (that union that
> contains "index" looks like a good target to also contain 'pfn', for
> example).
>
> Especially if this is mainly for persistent storage, we'll never have
> issues with worrying about writing it back under memory pressure, so
> allocating a "struct page" for these things shouldn't be a problem.
> There's likely only a few paths that actually generate IO for those
> things.
>
> In other words, I'd really like our basic infrastructure to be for the
> *normal* case, and the "struct page" is about so much more than just
> "what's the target for IO". For normal IO, "struct page" is also what
> serializes the IO so that you have a consistent view of the end
> result, and there's obviously the reference count there too. So I
> really *really* think that "struct page" is the better entity for
> describing the actual IO, because it's the common and the generic
> thing, while a "pfn" is not actually *enough* for IO in general, and
> you now end up having to look up the "struct page" for the locking and
> refcounting etc.
>
> If you go the other way, and instead generate a "struct page" from the
> pfn for the few cases that need it, you put the onus on odd behavior
> where it belongs.
>
> Yes, it might not be any simpler in the end, but I think it would be
> conceptually much better.

Conceptually better, but certainly more difficult to audit if the fake
struct page is initialized in a subtle way that breaks when/if it
leaks to some unwitting context. The one benefit I may need to
concede is a mechanism to opt-in to handle these fake pages to the few
paths that know what they are doing. That was easy with __pfn_t, but
a struct page can go silently almost anywhere.

Certainly nothing is prepared for a given struct page pointer to
change the pfn it points to on the fly, which I think is what we would
end up doing for something like a raid cache. Keep a pool of struct
pages around and point them at persistent memory pfns while I/O is in
flight.
[PATCH v2 06/10] scatterlist: support page-less (__pfn_t only) entries
From: Matthew Wilcox <wi...@linux.intel.com>

Given that an offset will never be more than PAGE_SIZE, steal the unused
bits of the offset to implement a flags field. Move the existing "this
is a sg_chain() entry" flag to the new flags field, and add a new flag
(SG_FLAGS_PAGE) to indicate that there is a struct page backing for the
entry.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
Signed-off-by: Matthew Wilcox <wi...@linux.intel.com>
---
 block/blk-merge.c                 |  2
 drivers/dma/ste_dma40.c           |  5
 drivers/mmc/card/queue.c          |  4
 include/asm-generic/scatterlist.h |  9
 include/crypto/scatterwalk.h      | 10
 include/linux/scatterlist.h       | 91
 6 files changed, 105 insertions(+), 16 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 218ad1e57a49..82a688551b72 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -267,7 +267,7 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 		if (rq->cmd_flags & REQ_WRITE)
 			memset(q->dma_drain_buffer, 0, q->dma_drain_size);
 
-		sg->page_link &= ~0x02;
+		sg_unmark_end(sg);
 		sg = sg_next(sg);
 		sg_set_page(sg, virt_to_page(q->dma_drain_buffer),
 			    q->dma_drain_size,
diff --git a/drivers/dma/ste_dma40.c b/drivers/dma/ste_dma40.c
index 3c10f034d4b9..e8c00642cacb 100644
--- a/drivers/dma/ste_dma40.c
+++ b/drivers/dma/ste_dma40.c
@@ -2562,10 +2562,7 @@ dma40_prep_dma_cyclic(struct dma_chan *chan, dma_addr_t dma_addr,
 		dma_addr += period_len;
 	}
 
-	sg[periods].offset = 0;
-	sg_dma_len(&sg[periods]) = 0;
-	sg[periods].page_link =
-		((unsigned long)sg | 0x01) & ~0x02;
+	sg_chain(sg, periods + 1, sg);
 
 	txd = d40_prep_sg(chan, sg, sg, periods, direction,
 			  DMA_PREP_INTERRUPT);
diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index 236d194c2883..127f76294e71 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -469,7 +469,7 @@ static unsigned int mmc_queue_packed_map_sg(struct mmc_queue *mq,
 			sg_set_buf(__sg, buf + offset, len);
 			offset += len;
 			remain -= len;
-			(__sg++)->page_link &= ~0x02;
+			sg_unmark_end(__sg++);
 			sg_len++;
 		} while (remain);
 	}
@@ -477,7 +477,7 @@ static unsigned int mmc_queue_packed_map_sg(struct mmc_queue *mq,
 	list_for_each_entry(req, &packed->list, queuelist) {
 		sg_len += blk_rq_map_sg(mq->queue, req, __sg);
 		__sg = sg + (sg_len - 1);
-		(__sg++)->page_link &= ~0x02;
+		sg_unmark_end(__sg++);
 	}
 	sg_mark_end(sg + (sg_len - 1));
 	return sg_len;
diff --git a/include/asm-generic/scatterlist.h b/include/asm-generic/scatterlist.h
index 5de07355fad4..959f51572a8e 100644
--- a/include/asm-generic/scatterlist.h
+++ b/include/asm-generic/scatterlist.h
@@ -7,8 +7,17 @@ struct scatterlist {
 #ifdef CONFIG_DEBUG_SG
 	unsigned long	sg_magic;
 #endif
+#ifdef CONFIG_HAVE_DMA_PFN
+	union {
+		__pfn_t		pfn;
+		struct scatterlist	*next;
+	};
+	unsigned short	offset;
+	unsigned short	sg_flags;
+#else
 	unsigned long	page_link;
 	unsigned int	offset;
+#endif
 	unsigned int	length;
 	dma_addr_t	dma_address;
 #ifdef CONFIG_NEED_SG_DMA_LENGTH
diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h
index 20e4226a2e14..7296d89a50b2 100644
--- a/include/crypto/scatterwalk.h
+++ b/include/crypto/scatterwalk.h
@@ -25,6 +25,15 @@
 #include <linux/scatterlist.h>
 #include <linux/sched.h>
 
+#ifdef CONFIG_HAVE_DMA_PFN
+/*
+ * If we're using PFNs, the architecture must also have been converted to
+ * support SG_CHAIN.  So we can use the generic code instead of custom
+ * code.
+ */
+#define scatterwalk_sg_chain(prv, num, sgl)	sg_chain(prv, num, sgl)
+#define scatterwalk_sg_next(sgl)		sg_next(sgl)
+#else
 static inline void scatterwalk_sg_chain(struct scatterlist *sg1, int num,
 					struct scatterlist *sg2)
 {
@@ -32,6 +41,7 @@ static inline void scatterwalk_sg_chain(struct scatterlist *sg1, int num,
 	sg1[num - 1].page_link &= ~0x02;
 	sg1[num - 1].page_link |= 0x01;
 }
+#endif
 
 static inline void scatterwalk_crypto_chain(struct scatterlist *head,
 					    struct scatterlist *sg,
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ed8f9e70df9b..9d423e559bdb 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -5,6 +5,7 @@
 #include <linux/bug.h>
 #include <linux/mm.h>
 
+#include <asm/page.h>
 #include <asm/types.h>
 #include <asm
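To illustrate the layout change described in the changelog above, here is a minimal userspace sketch of a scatterlist entry whose offset is narrowed to 16 bits so that the other 16 bits can carry flags. It is not the kernel code: every `demo_`/`DEMO_` name, the flag values, and the 4 KiB page size are assumptions made for illustration (the patch itself only names SG_FLAGS_PAGE in the changelog).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values; the real flag encoding is not shown in the excerpt. */
#define DEMO_PAGE_SIZE      4096u
#define DEMO_SG_FLAGS_CHAIN 0x1u	/* entry is a link to another list */
#define DEMO_SG_FLAGS_LAST  0x2u	/* entry terminates the list */
#define DEMO_SG_FLAGS_PAGE  0x4u	/* entry is backed by a struct page */

struct demo_sg {
	union {
		unsigned long pfn;	/* stands in for __pfn_t */
		struct demo_sg *next;	/* valid only when CHAIN is set */
	};
	uint16_t offset;	/* always below the page size, so 16 bits suffice */
	uint16_t sg_flags;	/* the bits "stolen" from the old 32-bit offset */
	unsigned int length;
};

/* Point an entry at a raw pfn: not a chain link and not page backed. */
static void demo_sg_set_pfn(struct demo_sg *sg, unsigned long pfn,
			    unsigned int len, unsigned int offset)
{
	assert(offset < DEMO_PAGE_SIZE);
	sg->pfn = pfn;
	sg->offset = (uint16_t)offset;
	sg->length = len;
	sg->sg_flags &= (uint16_t)~(DEMO_SG_FLAGS_CHAIN | DEMO_SG_FLAGS_PAGE);
}

static void demo_sg_mark_end(struct demo_sg *sg)
{
	sg->sg_flags |= DEMO_SG_FLAGS_LAST;
}

static void demo_sg_unmark_end(struct demo_sg *sg)
{
	sg->sg_flags &= (uint16_t)~DEMO_SG_FLAGS_LAST;
}

/* Turn the last slot of prv[] into a link to sgl. */
static void demo_sg_chain(struct demo_sg *prv, unsigned int prv_nents,
			  struct demo_sg *sgl)
{
	struct demo_sg *link = &prv[prv_nents - 1];

	link->next = sgl;
	link->offset = 0;
	link->length = 0;
	link->sg_flags = DEMO_SG_FLAGS_CHAIN;
}

/* Advance to the next data entry, following a chain link if present. */
static struct demo_sg *demo_sg_next(struct demo_sg *sg)
{
	if (sg->sg_flags & DEMO_SG_FLAGS_LAST)
		return NULL;
	sg++;
	if (sg->sg_flags & DEMO_SG_FLAGS_CHAIN)
		sg = sg->next;
	return sg;
}
```

The point of the sketch is that callers go through helpers such as demo_sg_unmark_end() instead of masking magic bits in a pointer, which is the same cleanup the hunks above apply to blk-merge.c and the mmc queue code.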
[PATCH v2 08/10] x86: support kmap_atomic_pfn_t() for persistent memory
It would be unfortunate if the kmap infrastructure escaped its current
32-bit/HIGHMEM bonds and leaked into 64-bit code. Instead, if the user
has enabled CONFIG_PMEM_IO we direct the kmap_atomic_pfn_t()
implementation to scan a list of pre-mapped persistent memory address
ranges inserted by the pmem driver.

The __pfn_t to resource lookup is indeed inefficient walking of a linked
list, but there are two mitigating factors:

1/ The number of persistent memory ranges is bounded by the number of
   DIMMs which is on the order of 10s of DIMMs, not hundreds.

2/ The lookup yields the entire range, if it becomes inefficient to do a
   kmap_atomic_pfn_t() a PAGE_SIZE at a time the caller can take
   advantage of the fact that the lookup can be amortized for all kmap
   operations it needs to perform in a given range.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/Kconfig             |  3
 arch/x86/Kconfig         |  2
 arch/x86/kernel/Makefile |  1
 arch/x86/kernel/kmap.c   | 95
 drivers/block/pmem.c     |  6
 include/linux/highmem.h  | 23
 6 files changed, 130 insertions(+)
 create mode 100644 arch/x86/kernel/kmap.c

diff --git a/arch/Kconfig b/arch/Kconfig
index f7f800860c00..69d3a3fa21af 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -206,6 +206,9 @@ config HAVE_DMA_CONTIGUOUS
 config HAVE_DMA_PFN
 	bool
 
+config HAVE_KMAP_PFN
+	bool
+
 config GENERIC_SMP_IDLE_THREAD
 	bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1fae5e842423..eddaea839500 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1434,7 +1434,9 @@ config X86_PMEM_LEGACY
 	  Say Y if unsure.
 
 config X86_PMEM_DMA
+	depends on !HIGHMEM
 	def_bool PMEM_IO
+	select HAVE_KMAP_PFN
 	select HAVE_DMA_PFN
 
 config HIGHPTE
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 9bcd0b56ca17..44c323342996 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -96,6 +96,7 @@ obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY)	+= pmem.o
+obj-$(CONFIG_X86_PMEM_DMA)	+= kmap.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
 
diff --git a/arch/x86/kernel/kmap.c b/arch/x86/kernel/kmap.c
new file mode 100644
index 000000000000..d597c475377b
--- /dev/null
+++ b/arch/x86/kernel/kmap.c
@@ -0,0 +1,95 @@
+/*
+ * Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/rcupdate.h>
+#include <linux/rculist.h>
+#include <linux/highmem.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+
+static LIST_HEAD(ranges);
+
+struct kmap {
+	struct list_head list;
+	struct resource *res;
+	struct device *dev;
+	void *base;
+};
+
+static void teardown_kmap(void *data)
+{
+	struct kmap *kmap = data;
+
+	dev_dbg(kmap->dev, "kmap unregister %pr\n", kmap->res);
+	list_del_rcu(&kmap->list);
+	synchronize_rcu();
+	kfree(kmap);
+}
+
+int devm_register_kmap_pfn_range(struct device *dev, struct resource *res,
+		void *base)
+{
+	struct kmap *kmap = kzalloc(sizeof(*kmap), GFP_KERNEL);
+	int rc;
+
+	if (!kmap)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&kmap->list);
+	kmap->res = res;
+	kmap->base = base;
+	kmap->dev = dev;
+	rc = devm_add_action(dev, teardown_kmap, kmap);
+	if (rc) {
+		kfree(kmap);
+		return rc;
+	}
+	dev_dbg(kmap->dev, "kmap register %pr\n", kmap->res);
+	list_add_rcu(&kmap->list, &ranges);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(devm_register_kmap_pfn_range);
+
+void *kmap_atomic_pfn_t(__pfn_t pfn)
+{
+	struct page *page = __pfn_t_to_page(pfn);
+	resource_size_t addr;
+	struct kmap *kmap;
+
+	if (page)
+		return kmap_atomic(page);
+	addr = __pfn_t_to_phys(pfn);
+	rcu_read_lock();
+	list_for_each_entry_rcu(kmap, &ranges, list)
+		if (addr >= kmap->res->start && addr <= kmap->res->end)
+			return kmap->base + addr - kmap->res->start;
+
+	/* only unlock in the error case */
+	rcu_read_unlock();
+	return NULL;
+}
+EXPORT_SYMBOL(kmap_atomic_pfn_t);
+
+void kunmap_atomic_pfn_t(void *addr)
+{
+	rcu_read_unlock();
+
+	/*
+	 * If the original __pfn_t had an entry in the memmap
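The range lookup that the changelog describes, including the second mitigating factor (find the range once, then translate many addresses within it), can be sketched in plain userspace C as follows. This is only an illustration under stated assumptions: the `demo_` names are hypothetical, a singly linked list replaces the kernel's RCU-protected list, and no locking is modelled at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One pre-mapped physical range, analogous to an entry on the "ranges" list. */
struct demo_range {
	uint64_t start;		/* first physical address, inclusive */
	uint64_t end;		/* last physical address, inclusive */
	char *base;		/* where the whole range is already mapped */
	struct demo_range *next;
};

/* Linear walk: acceptable when the number of ranges is small. */
static struct demo_range *demo_find_range(struct demo_range *head,
					  uint64_t phys)
{
	struct demo_range *r;

	for (r = head; r; r = r->next)
		if (phys >= r->start && phys <= r->end)
			return r;
	return NULL;
}

/* Translate within a range that the caller has already looked up. */
static void *demo_range_to_virt(const struct demo_range *r, uint64_t phys)
{
	assert(phys >= r->start && phys <= r->end);
	return r->base + (phys - r->start);
}

/* One-shot helper: lookup plus translation, NULL if nothing covers phys. */
static void *demo_kmap_phys(struct demo_range *head, uint64_t phys)
{
	struct demo_range *r = demo_find_range(head, phys);

	return r ? demo_range_to_virt(r, phys) : NULL;
}
```

A caller that needs many mappings in one range could call demo_find_range() once and then demo_range_to_virt() per page, which is the amortization the changelog alludes to.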
[PATCH v2 09/10] dax: convert to __pfn_t
The primary source for non-page-backed page-frames to enter the system
is via the pmem driver's ->direct_access() method. The pfns returned by
the top-level bdev_direct_access() may be passed to any other subsystem
in the kernel and those sub-systems either need to assume that the pfn
is page backed (CONFIG_PMEM_IO=n) or be prepared to handle non-page
backed case (CONFIG_PMEM_IO=y). Currently the pfns returned by
->direct_access() are only ever used by vm_insert_mixed() which does not
care if the pfn is mapped. As we go to add more usages of these pfns
add the type-safety of __pfn_t.

Cc: Matthew Wilcox <wi...@linux.intel.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Jens Axboe <ax...@kernel.dk>
Cc: Martin Schwidefsky <schwidef...@de.ibm.com>
Cc: Heiko Carstens <heiko.carst...@de.ibm.com>
Cc: Boaz Harrosh <b...@plexistor.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/powerpc/sysdev/axonram.c |  4
 drivers/block/brd.c           |  4
 drivers/block/pmem.c          |  8
 drivers/s390/block/dcssblk.c  |  6
 fs/block_dev.c                |  2
 fs/dax.c                      |  9
 include/asm-generic/pfn.h     |  7
 include/linux/blkdev.h        |  4
 8 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 9bb5da7f2c0c..069cb5285f18 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -141,13 +141,13 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  */
 static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void **kaddr, unsigned long *pfn, long size)
+		       void **kaddr, __pfn_t *pfn, long size)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
 	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
 
 	*kaddr = (void *)(bank->ph_addr + offset);
-	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
+	*pfn = phys_to_pfn_t(virt_to_phys(*kaddr));
 
 	return bank->size - offset;
 }
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 115c6cf9cb43..57f4cd787ea2 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -371,7 +371,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
-			void **kaddr, unsigned long *pfn, long size)
+			void **kaddr, __pfn_t *pfn, long size)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
@@ -382,7 +382,7 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	if (!page)
 		return -ENOSPC;
 	*kaddr = page_address(page);
-	*pfn = page_to_pfn(page);
+	*pfn = page_to_pfn_t(page);
 
 	/*
 	 * TODO: If size > PAGE_SIZE, we could look to see if the next page in
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 2a847651f8de..18edb48e405e 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -98,8 +98,8 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	return 0;
 }
 
-static long pmem_direct_access(struct block_device *bdev, sector_t sector,
-			      void **kaddr, unsigned long *pfn, long size)
+static long __maybe_unused pmem_direct_access(struct block_device *bdev,
+		sector_t sector, void **kaddr, __pfn_t *pfn, long size)
 {
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
 	size_t offset = sector << 9;
@@ -108,7 +108,7 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
 		return -ENODEV;
 
 	*kaddr = pmem->virt_addr + offset;
-	*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
+	*pfn = phys_to_pfn_t(pmem->phys_addr + offset);
 
 	return pmem->size - offset;
 }
@@ -116,7 +116,9 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
 static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		pmem_rw_page,
+#if IS_ENABLED(CONFIG_PMEM_IO)
 	.direct_access =	pmem_direct_access,
+#endif
 };
 
 static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 5da8515b8fb9..8616c1d33786 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -29,7 +29,7 @@
 static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
 static long dcssblk_direct_access(struct block_device *bdev
[PATCH v2 10/10] block: base support for pfn i/o
Allow block device drivers to opt-in to receiving bio(s) where the
bio_vec(s) point to memory that is not backed by struct page entries.
When a driver opts in it asserts that it will use the __pfn_t versions of
the dma_map/kmap/scatterlist apis in its bio submission path.

Cc: Tejun Heo <t...@kernel.org>
Cc: Jens Axboe <ax...@kernel.dk>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 block/bio.c               | 48
 block/blk-core.c          |  9
 include/linux/blk_types.h |  1
 include/linux/blkdev.h    |  2
 4 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 7100fd6d5898..9c506dd6a093 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -567,6 +567,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 	bio->bi_rw = bio_src->bi_rw;
 	bio->bi_iter = bio_src->bi_iter;
 	bio->bi_io_vec = bio_src->bi_io_vec;
+	bio->bi_flags |= bio_src->bi_flags & (1 << BIO_PFN);
 }
 EXPORT_SYMBOL(__bio_clone_fast);
 
@@ -658,6 +659,8 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 		goto integrity_clone;
 	}
 
+	bio->bi_flags |= bio_src->bi_flags & (1 << BIO_PFN);
+
 	bio_for_each_segment(bv, bio_src, iter)
 		bio->bi_io_vec[bio->bi_vcnt++] = bv;
 
@@ -699,9 +702,9 @@ int bio_get_nr_vecs(struct block_device *bdev)
 }
 EXPORT_SYMBOL(bio_get_nr_vecs);
 
-static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
-			  *page, unsigned int len, unsigned int offset,
-			  unsigned int max_sectors)
+static int __bio_add_pfn(struct request_queue *q, struct bio *bio,
+		__pfn_t pfn, unsigned int len, unsigned int offset,
+		unsigned int max_sectors)
 {
 	int retried_segments = 0;
 	struct bio_vec *bvec;
@@ -723,7 +726,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
-		if (page == bvec_page(prev) &&
+		if (pfn.pfn == prev->bv_pfn.pfn &&
 		    offset == prev->bv_offset + prev->bv_len) {
 			unsigned int prev_bv_len = prev->bv_len;
 			prev->bv_len += len;
@@ -768,7 +771,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	 * cannot add the page
 	 */
 	bvec = &bio->bi_io_vec[bio->bi_vcnt];
-	bvec_set_page(bvec, page);
+	bvec->bv_pfn = pfn;
 	bvec->bv_len = len;
 	bvec->bv_offset = offset;
 	bio->bi_vcnt++;
@@ -818,7 +821,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	return len;
 
  failed:
-	bvec_set_page(bvec, NULL);
+	bvec->bv_pfn.pfn = 0;
 	bvec->bv_len = 0;
 	bvec->bv_offset = 0;
 	bio->bi_vcnt--;
@@ -845,7 +848,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page,
 		    unsigned int len, unsigned int offset)
 {
-	return __bio_add_page(q, bio, page, len, offset,
+	return __bio_add_pfn(q, bio, page_to_pfn_t(page), len, offset,
 			      queue_max_hw_sectors(q));
 }
 EXPORT_SYMBOL(bio_add_pc_page);
@@ -872,10 +875,39 @@ int bio_add_page(struct bio *bio, struct page *page, unsigned int len,
 	if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
 		max_sectors = len >> 9;
 
-	return __bio_add_page(q, bio, page, len, offset, max_sectors);
+	return __bio_add_pfn(q, bio, page_to_pfn_t(page), len, offset,
+			max_sectors);
 }
 EXPORT_SYMBOL(bio_add_page);
 
+/**
+ *	bio_add_pfn -	attempt to add pfn to bio
+ *	@bio: destination bio
+ *	@pfn: pfn to add
+ *	@len: vec entry length
+ *	@offset: vec entry offset
+ *
+ *	Identical to bio_add_page() except this variant flags the bio as
+ *	not have struct page backing.  A given request_queue must assert
+ *	that it is prepared to handle this constraint before bio(s)
+ *	flagged in the manner can be passed.
+ */
+int bio_add_pfn(struct bio *bio, __pfn_t pfn, unsigned int len,
+		unsigned int offset)
+{
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+	unsigned int max_sectors;
+
+	if (!blk_queue_pfn(q))
+		return 0;
+	set_bit(BIO_PFN, &bio->bi_flags);
+	max_sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
+	if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
+		max_sectors = len >> 9;
+
+	return __bio_add_pfn(q, bio, pfn, len, offset, max_sectors);
+}
+
 struct submit_bio_ret {
 	struct completion event;
 	int error;
diff --git a/block/blk-core.c b/block/blk-core.c
index 94d2c6ccf801..4eefff363986 100644
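The opt-in protocol in this patch has two halves: a queue-level capability that the driver sets, and a per-bio flag that is set when a pfn is added and propagated on clone. A minimal userspace sketch of that gating follows. All `demo_`/`DEMO_` names and flag values are hypothetical stand-ins, and the segment merging and size limits of the real __bio_add_pfn() are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_QUEUE_FLAG_PFN (1u << 0)	/* driver asserts it can handle pfn-only bios */
#define DEMO_BIO_PFN        (1u << 0)	/* bio carries entries without struct page */

struct demo_queue {
	unsigned int flags;
};

struct demo_bio {
	struct demo_queue *q;
	unsigned int flags;
	unsigned int size;	/* bytes queued so far */
};

static bool demo_queue_pfn(const struct demo_queue *q)
{
	return (q->flags & DEMO_QUEUE_FLAG_PFN) != 0;
}

/*
 * Returns the number of bytes added, or 0 if the target queue has not
 * opted in (the same "0 means nothing was added" convention as the
 * bio_add_pfn() shown above).
 */
static unsigned int demo_bio_add_pfn(struct demo_bio *bio, unsigned long pfn,
				     unsigned int len)
{
	(void)pfn;	/* a real implementation would record the vector */
	if (!demo_queue_pfn(bio->q))
		return 0;
	bio->flags |= DEMO_BIO_PFN;
	bio->size += len;
	return len;
}

/* Clones must inherit the marker so that lower layers still see it. */
static void demo_bio_clone_flags(struct demo_bio *dst,
				 const struct demo_bio *src)
{
	dst->flags |= src->flags & DEMO_BIO_PFN;
}
```

Refusing at add time, instead of at submission time, means a bio that reaches a queue without the capability can never contain a page-less entry in the first place.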
[PATCH v2 05/10] scatterlist: use sg_phys()
Coccinelle cleanup to replace open coded sg to physical address
translations. This is in preparation for introducing scatterlists that
reference pfn(s) without a backing struct page.

// sg_phys.cocci: convert usage page_to_phys(sg_page(sg)) to sg_phys(sg)
// usage: make coccicheck COCCI=sg_phys.cocci MODE=patch

virtual patch
virtual report
virtual org

@@
struct scatterlist *sg;
@@

- page_to_phys(sg_page(sg)) + sg->offset
+ sg_phys(sg)

@@
struct scatterlist *sg;
@@

- page_to_phys(sg_page(sg))
+ sg_phys(sg) - sg->offset

Cc: Julia Lawall <julia.law...@lip6.fr>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/arm/mm/dma-mapping.c                    | 2
 arch/microblaze/kernel/dma.c                 | 2
 drivers/iommu/intel-iommu.c                  | 4
 drivers/iommu/iommu.c                        | 2
 drivers/staging/android/ion/ion_chunk_heap.c | 4
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 09c5fe3d30c2..43cc6a8fdacc 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1502,7 +1502,7 @@ static int __map_sg_chunk(struct device *dev, struct scatterlist *sg,
 		return -ENOMEM;
 
 	for (count = 0, s = sg; count < (size >> PAGE_SHIFT); s = sg_next(s)) {
-		phys_addr_t phys = page_to_phys(sg_page(s));
+		phys_addr_t phys = sg_phys(s) - s->offset;
 		unsigned int len = PAGE_ALIGN(s->offset + s->length);
 
 		if (!is_coherent &&
diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
index ed7ba8a11822..dcb3c594d626 100644
--- a/arch/microblaze/kernel/dma.c
+++ b/arch/microblaze/kernel/dma.c
@@ -61,7 +61,7 @@ static int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
 	/* FIXME this part of code is untested */
 	for_each_sg(sgl, sg, nents, i) {
 		sg->dma_address = sg_phys(sg);
-		__dma_sync(page_to_phys(sg_page(sg)) + sg->offset,
+		__dma_sync(sg_phys(sg),
 							sg->length, direction);
 	}
 
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 68d43beccb7e..9b9ada71e0d3 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1998,7 +1998,7 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 			sg_res = aligned_nrpages(sg->offset, sg->length);
 			sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
 			sg->dma_length = sg->length;
-			pteval = page_to_phys(sg_page(sg)) | prot;
+			pteval = (sg_phys(sg) - sg->offset) | prot;
 			phys_pfn = pteval >> VTD_PAGE_SHIFT;
 		}
 
@@ -3302,7 +3302,7 @@ static int intel_nontranslate_map_sg(struct device *hddev,
 
 	for_each_sg(sglist, sg, nelems, i) {
 		BUG_ON(!sg_page(sg));
-		sg->dma_address = page_to_phys(sg_page(sg)) + sg->offset;
+		sg->dma_address = sg_phys(sg);
 		sg->dma_length = sg->length;
 	}
 	return nelems;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d4f527e56679..59808fc9110d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1147,7 +1147,7 @@ size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 	min_pagesz = 1 << __ffs(domain->ops->pgsize_bitmap);
 
 	for_each_sg(sg, s, nents, i) {
-		phys_addr_t phys = page_to_phys(sg_page(s)) + s->offset;
+		phys_addr_t phys = sg_phys(s);
 
 		/*
 		 * We are mapping on IOMMU page boundaries, so offset within
diff --git a/drivers/staging/android/ion/ion_chunk_heap.c b/drivers/staging/android/ion/ion_chunk_heap.c
index 3e6ec2ee6802..b7da5d142aa9 100644
--- a/drivers/staging/android/ion/ion_chunk_heap.c
+++ b/drivers/staging/android/ion/ion_chunk_heap.c
@@ -81,7 +81,7 @@ static int ion_chunk_heap_allocate(struct ion_heap *heap,
 err:
 	sg = table->sgl;
 	for (i -= 1; i >= 0; i--) {
-		gen_pool_free(chunk_heap->pool, page_to_phys(sg_page(sg)),
+		gen_pool_free(chunk_heap->pool, sg_phys(sg) - sg->offset,
 			      sg->length);
 		sg = sg_next(sg);
 	}
@@ -109,7 +109,7 @@ static void ion_chunk_heap_free(struct ion_buffer *buffer)
 							DMA_BIDIRECTIONAL);
 
 	for_each_sg(table->sgl, sg, table->nents, i) {
-		gen_pool_free(chunk_heap->pool, page_to_phys(sg_page(sg)),
+		gen_pool_free(chunk_heap->pool, sg_phys(sg) - sg->offset,
 			      sg->length);
 	}
 	chunk_heap->allocated -= allocated_size;
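The two semantic-patch rules above rest on one identity: the physical address of an entry is the page's physical address plus the offset, so the page's address can be recovered by subtracting the offset again. A small userspace check of that arithmetic is sketched below. The `demo_` names and the 4 KiB page shift are assumptions for illustration; the entry stores a raw page frame number where the kernel structure stores a page link.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumed 4 KiB pages */

struct demo_sgent {
	unsigned long pfn;	/* page frame the entry starts in */
	unsigned int offset;	/* byte offset into that page */
	unsigned int length;
};

/* Counterpart of page_to_phys(sg_page(sg)): start of the page. */
static uint64_t demo_page_phys(const struct demo_sgent *sg)
{
	return (uint64_t)sg->pfn << DEMO_PAGE_SHIFT;
}

/* Counterpart of sg_phys(sg): start of the data inside the page. */
static uint64_t demo_sg_phys(const struct demo_sgent *sg)
{
	return demo_page_phys(sg) + sg->offset;
}
```

Rule one replaces the open-coded sum with the helper; rule two covers callers that want the page-aligned address and therefore need `sg_phys(sg) - sg->offset`.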
[PATCH v2 01/10] arch: introduce __pfn_t for persistent memory i/o
Introduce a type that encapsulates a page-frame-number that is
optionally backed by memmap (struct page). This type will be used in
place of 'struct page *' instances in contexts where persistent memory
is being referenced (scatterlists for drivers, biovecs for the block
layer, etc). The operations in those i/o paths that formerly required a
'struct page *' are to be converted to use __pfn_t aware equivalent
helpers. Otherwise, in the absence of persistent memory, there is no
functional change and __pfn_t is an alias for a normal memory page.

It turns out that while 'struct page' references are used broadly in the
kernel I/O stacks the usage of 'struct page' based capabilities is very
shallow. It is only used for populating bio_vecs and scatterlists for
the retrieval of dma addresses, and for temporary kernel mappings
(kmap). Aside from kmap, these usages can be trivially converted to
operate on a pfn.

Indeed, kmap_atomic() is more problematic as it uses mm infrastructure,
via struct page, to setup and track temporary kernel mappings. It would
be unfortunate if the kmap infrastructure escaped its 32-bit/HIGHMEM
bonds and leaked into 64-bit code. Thankfully, it seems all that is
needed here is to convert kmap_atomic() callers, that want to opt-in to
supporting persistent memory, to use a new kmap_atomic_pfn_t(). Where
kmap_atomic_pfn_t() is enabled to re-use the existing ioremap() mapping
established by the driver for persistent memory.

Note, that as far as conceptually understanding __pfn_t is concerned,
'persistent memory' is really any address range in host memory not
covered by memmap. Contrast this with pure iomem that is on an mmio
mapped bus like PCI and cannot be converted to a dma_addr_t by "pfn <<
PAGE_SHIFT".

Cc: H. Peter Anvin <h...@zytor.com>
Cc: Jens Axboe <ax...@kernel.dk>
Cc: Tejun Heo <t...@kernel.org>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 include/asm-generic/memory_model.h |  1
 include/asm-generic/pfn.h          | 51
 include/linux/mm.h                 |  1
 init/Kconfig                       | 13
 4 files changed, 65 insertions(+), 1 deletion(-)
 create mode 100644 include/asm-generic/pfn.h

diff --git a/include/asm-generic/memory_model.h b/include/asm-generic/memory_model.h
index 14909b0b9cae..1b0ae21fd8ff 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -70,7 +70,6 @@
 #endif /* CONFIG_FLATMEM/DISCONTIGMEM/SPARSEMEM */
 
 #define page_to_pfn __page_to_pfn
-#define pfn_to_page __pfn_to_page
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
new file mode 100644
index 000000000000..91171e0285d9
--- /dev/null
+++ b/include/asm-generic/pfn.h
@@ -0,0 +1,51 @@
+#ifndef __ASM_PFN_H
+#define __ASM_PFN_H
+
+#ifndef __pfn_to_phys
+#define __pfn_to_phys(pfn)	((dma_addr_t)(pfn) << PAGE_SHIFT)
+#endif
+
+static inline struct page *pfn_to_page(unsigned long pfn)
+{
+	return __pfn_to_page(pfn);
+}
+
+/*
+ * __pfn_t: encapsulates a page-frame number that is optionally backed
+ * by memmap (struct page).  This type will be used in place of a
+ * 'struct page *' instance in contexts where unmapped memory (usually
+ * persistent memory) is being referenced (scatterlists for drivers,
+ * biovecs for the block layer, etc).
+ */
+typedef struct {
+	union {
+		unsigned long pfn;
+		struct page *page;
+	};
+} __pfn_t;
+
+static inline struct page *__pfn_t_to_page(__pfn_t pfn)
+{
+#if IS_ENABLED(CONFIG_PMEM_IO)
+	if (pfn.pfn < PAGE_OFFSET)
+		return NULL;
+#endif
+	return pfn.page;
+}
+
+static inline dma_addr_t __pfn_t_to_phys(__pfn_t pfn)
+{
+#if IS_ENABLED(CONFIG_PMEM_IO)
+	if (pfn.pfn < PAGE_OFFSET)
+		return __pfn_to_phys(pfn.pfn);
+#endif
+	return __pfn_to_phys(page_to_pfn(pfn.page));
+}
+
+static inline __pfn_t page_to_pfn_t(struct page *page)
+{
+	__pfn_t pfn = { .page = page };
+
+	return pfn;
+}
+#endif /* __ASM_PFN_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9fd03a7..9d35cff41c12 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -52,6 +52,7 @@ extern int sysctl_legacy_va_layout;
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/processor.h>
+#include <asm-generic/pfn.h>
 
 #ifndef __pa_symbol
 #define __pa_symbol(x)  __pa(RELOC_HIDE((unsigned long)(x), 0))
diff --git a/init/Kconfig b/init/Kconfig
index dc24dec60232..7d2ad350fd29 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1764,6 +1764,19 @@ config PROFILING
 	  Say Y here to enable the extended profiling support mechanisms used
 	  by profilers such as OProfile.
 
+config PMEM_IO
+	default n
+	bool "Support for I/O, DAX, DMA, RDMA to unmapped (persistent) memory" if EXPERT
+	help
+	  Say Y here
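The encoding trick in pfn.h appears to be that a single word holds either a raw page frame number or a kernel pointer to a struct page, and the two are told apart by comparing against PAGE_OFFSET (pfns are small numbers, kernel pointers are not). The userspace model below sketches that discrimination. It is an assumption-heavy illustration: pointers are modelled as 64-bit integers, and the `DEMO_` constants (page offset, memmap base, struct page size) are invented values, not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT       12
#define DEMO_PAGE_OFFSET      UINT64_C(0xffff880000000000)	/* assumed start of kernel addresses */
#define DEMO_MEMMAP_BASE      UINT64_C(0xffffea0000000000)	/* assumed start of the struct page array */
#define DEMO_STRUCT_PAGE_SIZE UINT64_C(64)			/* assumed sizeof(struct page) */

/* One word: a pfn if below DEMO_PAGE_OFFSET, else a "struct page" address. */
typedef struct {
	uint64_t val;
} demo_pfn_t;

static demo_pfn_t demo_page_to_pfn_t(uint64_t page_addr)
{
	demo_pfn_t p = { page_addr };

	assert(page_addr >= DEMO_PAGE_OFFSET);
	return p;
}

static demo_pfn_t demo_raw_pfn_to_pfn_t(uint64_t pfn)
{
	demo_pfn_t p = { pfn };

	assert(pfn < DEMO_PAGE_OFFSET);
	return p;
}

/* Returns the modelled struct page address, or 0 when there is none. */
static uint64_t demo_pfn_t_to_page(demo_pfn_t p)
{
	return p.val < DEMO_PAGE_OFFSET ? 0 : p.val;
}

static uint64_t demo_pfn_t_to_phys(demo_pfn_t p)
{
	if (p.val < DEMO_PAGE_OFFSET)
		return p.val << DEMO_PAGE_SHIFT;
	/* page backed: derive the pfn from the position in the memmap array */
	return ((p.val - DEMO_MEMMAP_BASE) / DEMO_STRUCT_PAGE_SIZE)
		<< DEMO_PAGE_SHIFT;
}
```

Because the page-backed form is just the pointer itself, the common case costs nothing extra when CONFIG_PMEM_IO is disabled, which matches the changelog's claim of no functional change in that configuration.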
[PATCH v2 03/10] block: convert .bv_page to .bv_pfn bio_vec
Carry an __pfn_t in a bio_vec rather than a 'struct page *' in support of
allowing a bio to reference unmapped (not struct page backed) persistent
memory.  This also fixes up the macros and static initializers that were
not automatically converted by the Coccinelle script that introduced the
bvec_page() and bvec_set_page() helpers.

If CONFIG_PMEM_IO=n this is functionally equivalent to the status quo as
the __pfn_t helpers can assume that a __pfn_t always has a corresponding
struct page.

Cc: Jens Axboe <ax...@kernel.dk>
Cc: Matthew Wilcox <wi...@linux.intel.com>
Cc: Dave Hansen <dave.han...@linux.intel.com>
Cc: Julia Lawall <julia.law...@lip6.fr>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 block/blk-integrity.c     |    4 ++--
 block/blk-merge.c         |    6 +++---
 block/bounce.c            |    2 +-
 drivers/md/bcache/btree.c |    2 +-
 include/linux/bio.h       |   24 +---
 include/linux/blk_types.h |   13 ++---
 lib/iov_iter.c            |   22 +++---
 mm/page_io.c              |    4 ++--
 8 files changed, 43 insertions(+), 34 deletions(-)

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 0458f31f075a..351198fbda3c 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -43,7 +43,7 @@ static const char *bi_unsupported_name = "unsupported";
  */
 int blk_rq_count_integrity_sg(struct request_queue *q, struct bio *bio)
 {
-	struct bio_vec iv, ivprv = { NULL };
+	struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
 	unsigned int segments = 0;
 	unsigned int seg_size = 0;
 	struct bvec_iter iter;
@@ -89,7 +89,7 @@ EXPORT_SYMBOL(blk_rq_count_integrity_sg);
 int blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio,
 			    struct scatterlist *sglist)
 {
-	struct bio_vec iv, ivprv = { NULL };
+	struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
 	struct scatterlist *sg = NULL;
 	unsigned int segments = 0;
 	struct bvec_iter iter;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 47ceefacd320..218ad1e57a49 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -13,7 +13,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     struct bio *bio,
 					     bool no_sg_merge)
 {
-	struct bio_vec bv, bvprv = { NULL };
+	struct bio_vec bv, bvprv = BIO_VEC_INIT(bvprv);
 	int cluster, high, highprv = 1;
 	unsigned int seg_size, nr_phys_segs;
 	struct bio *fbio, *bbio;
@@ -123,7 +123,7 @@ EXPORT_SYMBOL(blk_recount_segments);
 static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
 				   struct bio *nxt)
 {
-	struct bio_vec end_bv = { NULL }, nxt_bv;
+	struct bio_vec end_bv = BIO_VEC_INIT(end_bv), nxt_bv;
 	struct bvec_iter iter;

 	if (!blk_queue_cluster(q))
@@ -202,7 +202,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 			     struct scatterlist *sglist,
 			     struct scatterlist **sg)
 {
-	struct bio_vec bvec, bvprv = { NULL };
+	struct bio_vec bvec, bvprv = BIO_VEC_INIT(bvprv);
 	struct bvec_iter iter;
 	int nsegs, cluster;

diff --git a/block/bounce.c b/block/bounce.c
index 0390e44d6e1b..4a3098067c81 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -64,7 +64,7 @@ static void bounce_copy_vec(struct bio_vec *to, unsigned char *vfrom)
 #else /* CONFIG_HIGHMEM */

 #define bounce_copy_vec(to, vfrom)	\
-	memcpy(page_address((to)->bv_page) + (to)->bv_offset, vfrom, (to)->bv_len)
+	memcpy(page_address(bvec_page(to)) + (to)->bv_offset, vfrom, (to)->bv_len)

 #endif /* CONFIG_HIGHMEM */

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2e76e8b62902..36bbe29a806b 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -426,7 +426,7 @@ static void do_btree_node_write(struct btree *b)
 		void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));

 		bio_for_each_segment_all(bv, b->bio, j)
-			memcpy(page_address(bv->bv_page),
+			memcpy(page_address(bvec_page(bv)),
 			       base + j * PAGE_SIZE, PAGE_SIZE);

 		bch_submit_bbio(b->bio, b->c, &k.key, 0);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index da3a127c9958..a59d97cbfe13 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -63,8 +63,8 @@
  */
 #define __bvec_iter_bvec(bvec, iter)	(&(bvec)[(iter).bi_idx])

-#define bvec_iter_page(bvec, iter)				\
-	(__bvec_iter_bvec((bvec), (iter))->bv_page)
+#define bvec_iter_pfn(bvec, iter)				\
+	(__bvec_iter_bvec((bvec), (iter))->bv_pfn)

 #define bvec_iter_len(bvec, iter)				\
 	min((iter
[PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
Changes since v1 [1]:

1/ added include/asm-generic/pfn.h for the __pfn_t definition and helpers.

2/ added kmap_atomic_pfn_t()

3/ rebased on v4.1-rc2

[1]: http://marc.info/?l=linux-kernel&m=142653770511970&w=2

---

A lead in note, this looks scarier than it is.  Most of the code thrash
is automated via Coccinelle.  Also the subtle differences behind an
'unsigned long pfn' and a '__pfn_t' are mitigated by type-safety and a
Kconfig option (default disabled CONFIG_PMEM_IO) that globally controls
whether a pfn and a __pfn_t are equivalent.

The motivation for this change is persistent memory and the desire to
use it not only via the pmem driver, but also as a memory target for I/O
(DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.  Aside
from the pmem driver and DAX, persistent memory is not able to be used
in these I/O scenarios due to the lack of a backing struct page, i.e.
persistent memory is not part of the memmap.  This patchset takes the
position that the solution is to teach I/O paths that want to operate on
persistent memory to do so by referencing a __pfn_t.  The alternatives
are discussed in the changelog for "[PATCH v2 01/10] arch: introduce
__pfn_t for persistent memory i/o", copied here:

    Alternatives:

    1/ Provide struct page coverage for persistent memory in DRAM.  The
       expectation is that persistent memory capacities make this
       untenable in the long term.

    2/ Provide struct page coverage for persistent memory with
       persistent memory.  While persistent memory may have near DRAM
       performance characteristics it may not have the same
       write-endurance of DRAM.  Given the update frequency of struct
       page objects it may not be suitable for persistent memory.

    3/ Dynamically allocate struct page.  This appears to be on the
       order of the complexity of converting code paths to use __pfn_t
       references instead of struct page, and the amount of setup
       required to establish a valid struct page reference is mostly
       wasted when the only usage in the block stack is to perform a
       page_to_pfn() conversion for dma-mapping.  Instances of kmap() /
       kmap_atomic() usage appear to be the only occasions in the block
       stack where struct page is non-trivially used.  A new
       kmap_atomic_pfn_t() is proposed to handle those cases.

---

Dan Williams (9):
      arch: introduce __pfn_t for persistent memory i/o
      block: add helpers for accessing a bio_vec page
      block: convert .bv_page to .bv_pfn bio_vec
      dma-mapping: allow archs to optionally specify a ->map_pfn() operation
      scatterlist: use sg_phys()
      x86: support dma_map_pfn()
      x86: support kmap_atomic_pfn_t() for persistent memory
      dax: convert to __pfn_t
      block: base support for pfn i/o

Matthew Wilcox (1):
      scatterlist: support page-less (__pfn_t only) entries

 arch/Kconfig                       |    6 ++
 arch/arm/mm/dma-mapping.c          |    2 -
 arch/microblaze/kernel/dma.c       |    2 -
 arch/powerpc/sysdev/axonram.c      |    6 +-
 arch/x86/Kconfig                   |    7 ++
 arch/x86/kernel/Makefile           |    1
 arch/x86/kernel/amd_gart_64.c      |   22 +-
 arch/x86/kernel/kmap.c             |   95 ++
 arch/x86/kernel/pci-nommu.c        |   22 +-
 arch/x86/kernel/pci-swiotlb.c      |    4 +
 arch/x86/pci/sta2x11-fixup.c       |    4 +
 arch/x86/xen/pci-swiotlb-xen.c     |    4 +
 block/bio-integrity.c              |    8 +-
 block/bio.c                        |   82 --
 block/blk-core.c                   |   13 +++-
 block/blk-integrity.c              |    7 +-
 block/blk-lib.c                    |    2 -
 block/blk-merge.c                  |   15 ++--
 block/bounce.c                     |   26 ---
 drivers/block/aoe/aoecmd.c         |    8 +-
 drivers/block/brd.c                |    6 +-
 drivers/block/drbd/drbd_bitmap.c   |    5 +
 drivers/block/drbd/drbd_main.c     |    6 +-
 drivers/block/drbd/drbd_receiver.c |    4 +
 drivers/block/drbd/drbd_worker.c   |    3 +
 drivers/block/floppy.c             |    6 +-
 drivers/block/loop.c               |   13 ++--
 drivers/block/nbd.c                |    8 +-
 drivers/block/nvme-core.c          |    2 -
 drivers/block/pktcdvd.c            |   11 ++-
 drivers/block/pmem.c               |   16 +++-
 drivers/block/ps3disk.c            |    2 -
 drivers/block/ps3vram.c            |    2 -
 drivers/block/rbd.c                |    2 -
 drivers/block/rsxx/dma.c           |    2 -
 drivers/block/umem.c               |    2 -
 drivers/block/zram
Re: [Linux-nvdimm] [PATCH v2 08/10] x86: support kmap_atomic_pfn_t() for persistent memory
On Wed, May 6, 2015 at 1:05 PM, Dan Williams <dan.j.willi...@intel.com> wrote:

It would be unfortunate if the kmap infrastructure escaped its current
32-bit/HIGHMEM bonds and leaked into 64-bit code.  Instead, if the user
has enabled CONFIG_PMEM_IO we direct the kmap_atomic_pfn_t()
implementation to scan a list of pre-mapped persistent memory address
ranges inserted by the pmem driver.

The __pfn_t to resource lookup is indeed inefficient walking of a linked
list, but there are two mitigating factors:

1/ The number of persistent memory ranges is bounded by the number of
   DIMMs which is on the order of 10s of DIMMs, not hundreds.

2/ The lookup yields the entire range, if it becomes inefficient to do a
   kmap_atomic_pfn_t() a PAGE_SIZE at a time the caller can take
   advantage of the fact that the lookup can be amortized for all kmap
   operations it needs to perform in a given range.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/Kconfig             |    3 +
 arch/x86/Kconfig         |    2 +
 arch/x86/kernel/Makefile |    1
 arch/x86/kernel/kmap.c   |   95 ++
 drivers/block/pmem.c     |    6 +++
 include/linux/highmem.h  |   23 +++
 6 files changed, 130 insertions(+)
 create mode 100644 arch/x86/kernel/kmap.c

diff --git a/arch/Kconfig b/arch/Kconfig
index f7f800860c00..69d3a3fa21af 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -206,6 +206,9 @@ config HAVE_DMA_CONTIGUOUS
 config HAVE_DMA_PFN
 	bool

+config HAVE_KMAP_PFN
+	bool
+
 config GENERIC_SMP_IDLE_THREAD
 	bool

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1fae5e842423..eddaea839500 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1434,7 +1434,9 @@ config X86_PMEM_LEGACY
 	  Say Y if unsure.

 config X86_PMEM_DMA
+	depends on !HIGHMEM
 	def_bool PMEM_IO
+	select HAVE_KMAP_PFN
 	select HAVE_DMA_PFN

 config HIGHPTE
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 9bcd0b56ca17..44c323342996 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -96,6 +96,7 @@ obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY)	+= pmem.o
+obj-$(CONFIG_X86_PMEM_DMA)	+= kmap.o

 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o

diff --git a/arch/x86/kernel/kmap.c b/arch/x86/kernel/kmap.c
new file mode 100644
index 000000000000..d597c475377b
--- /dev/null
+++ b/arch/x86/kernel/kmap.c
@@ -0,0 +1,95 @@
+/*
+ * Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/rcupdate.h>
+#include <linux/rculist.h>
+#include <linux/highmem.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+
+static LIST_HEAD(ranges);
+
+struct kmap {
+	struct list_head list;
+	struct resource *res;
+	struct device *dev;
+	void *base;
+};
+
+static void teardown_kmap(void *data)
+{
+	struct kmap *kmap = data;
+
+	dev_dbg(kmap->dev, "kmap unregister %pr\n", kmap->res);
+	list_del_rcu(&kmap->list);
+	synchronize_rcu();
+	kfree(kmap);
+}
+
+int devm_register_kmap_pfn_range(struct device *dev, struct resource *res,
+		void *base)
+{
+	struct kmap *kmap = kzalloc(sizeof(*kmap), GFP_KERNEL);
+	int rc;
+
+	if (!kmap)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&kmap->list);
+	kmap->res = res;
+	kmap->base = base;
+	kmap->dev = dev;
+	rc = devm_add_action(dev, teardown_kmap, kmap);
+	if (rc) {
+		kfree(kmap);
+		return rc;
+	}
+	dev_dbg(kmap->dev, "kmap register %pr\n", kmap->res);
+	list_add_rcu(&kmap->list, &ranges);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(devm_register_kmap_pfn_range);
+
+void *kmap_atomic_pfn_t(__pfn_t pfn)
+{
+	struct page *page = __pfn_t_to_page(pfn);
+	resource_size_t addr;
+	struct kmap *kmap;
+
+	if (page)
+		return kmap_atomic(page);
+	addr = __pfn_t_to_phys(pfn);
+	rcu_read_lock();
+	list_for_each_entry_rcu(kmap, &ranges, list)
+		if (addr >= kmap->res->start && addr <= kmap->res->end)
+			return kmap->base + addr - kmap->res->start;
+
+	/* only unlock in the error case
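The range lookup and the amortization argument in the changelog above can be sketched in plain userspace C. This is only a model under stated assumptions: struct model_range, model_lookup() and model_range_contains() are hypothetical names, a singly linked list stands in for the kernel list, and there is no RCU or locking at all, so it shows the address arithmetic and nothing about lifetime rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one pre-mapped persistent memory range. */
struct model_range {
	struct model_range *next;
	uint64_t start;       /* first physical address covered */
	uint64_t end;         /* last physical address covered (inclusive) */
	unsigned char *base;  /* where 'start' is mapped */
};

/* Non-zero when 'addr' falls inside 'r'. */
static inline int model_range_contains(const struct model_range *r,
				       uint64_t addr)
{
	assert(r != NULL);
	return addr >= r->start && addr <= r->end;
}

/*
 * Walk the list and translate a physical address to a mapped pointer.
 * On success the matching range is reported through 'hit' (if non-NULL)
 * so a caller can reuse it for neighbouring addresses instead of
 * walking the list once per page.
 */
static inline unsigned char *model_lookup(const struct model_range *head,
					  uint64_t addr,
					  const struct model_range **hit)
{
	const struct model_range *r;

	for (r = head; r; r = r->next) {
		if (model_range_contains(r, addr)) {
			if (hit)
				*hit = r;
			return r->base + (addr - r->start);
		}
	}
	return NULL; /* no registered range covers this address */
}
```

A caller that needs many mappings in one range could call model_lookup() once, keep the returned range, and compute later addresses with model_range_contains() plus pointer arithmetic, which is the amortization the second mitigating factor describes.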
[PATCH v2 07/10] x86: support dma_map_pfn()
Fix up x86 dma_map_ops to allow pfn-only mappings.

As long as a dma_map_sg() implementation uses the generic sg_phys()
helpers it can support scatterlists that use __pfn_t instead of struct
page.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/x86/Kconfig                         |    5 +
 arch/x86/kernel/amd_gart_64.c            |   22 +-
 arch/x86/kernel/pci-nommu.c              |   22 +-
 arch/x86/kernel/pci-swiotlb.c            |    4
 arch/x86/pci/sta2x11-fixup.c             |    4
 arch/x86/xen/pci-swiotlb-xen.c           |    4
 drivers/iommu/amd_iommu.c                |   21 -
 drivers/iommu/intel-iommu.c              |   22 +-
 drivers/xen/swiotlb-xen.c                |   29 +++--
 include/asm-generic/dma-mapping-common.h |    4 ++--
 include/asm-generic/scatterlist.h        |    1 +
 include/linux/swiotlb.h                  |    4
 lib/swiotlb.c                            |   20 +++-
 13 files changed, 125 insertions(+), 37 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 226d5696e1d1..1fae5e842423 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -796,6 +796,7 @@ config CALGARY_IOMMU
 	bool "IBM Calgary IOMMU support"
 	select SWIOTLB
 	depends on X86_64 && PCI
+	depends on !HAVE_DMA_PFN
 	---help---
 	  Support for hardware IOMMUs in IBM's xSeries x366 and x460
 	  systems. Needed to run systems with more than 3GB of memory
@@ -1432,6 +1433,10 @@ config X86_PMEM_LEGACY

 	  Say Y if unsure.

+config X86_PMEM_DMA
+	def_bool PMEM_IO
+	select HAVE_DMA_PFN
+
 config HIGHPTE
 	bool "Allocate 3rd-level pagetables from highmem"
 	depends on HIGHMEM
diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 8e3842fc8bea..8fad83c8dfd2 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -239,13 +239,13 @@ static dma_addr_t dma_map_area(struct device *dev, dma_addr_t phys_mem,
 }

 /* Map a single area into the IOMMU */
-static dma_addr_t gart_map_page(struct device *dev, struct page *page,
-				unsigned long offset, size_t size,
-				enum dma_data_direction dir,
-				struct dma_attrs *attrs)
+static dma_addr_t gart_map_pfn(struct device *dev, __pfn_t pfn,
+			       unsigned long offset, size_t size,
+			       enum dma_data_direction dir,
+			       struct dma_attrs *attrs)
 {
 	unsigned long bus;
-	phys_addr_t paddr = page_to_phys(page) + offset;
+	phys_addr_t paddr = __pfn_t_to_phys(pfn) + offset;

 	if (!dev)
 		dev = x86_dma_fallback_dev;
@@ -259,6 +259,14 @@ static dma_addr_t gart_map_page(struct device *dev, struct page *page,
 	return bus;
 }

+static __maybe_unused dma_addr_t gart_map_page(struct device *dev,
+		struct page *page, unsigned long offset, size_t size,
+		enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+	return gart_map_pfn(dev, page_to_pfn_t(page), offset, size, dir,
+			attrs);
+}
+
 /*
  * Free a DMA mapping.
  */
@@ -699,7 +707,11 @@ static __init int init_amd_gatt(struct agp_kern_info *info)
 static struct dma_map_ops gart_dma_ops = {
 	.map_sg				= gart_map_sg,
 	.unmap_sg			= gart_unmap_sg,
+#ifdef CONFIG_HAVE_DMA_PFN
+	.map_pfn			= gart_map_pfn,
+#else
 	.map_page			= gart_map_page,
+#endif
 	.unmap_page			= gart_unmap_page,
 	.alloc				= gart_alloc_coherent,
 	.free				= gart_free_coherent,
diff --git a/arch/x86/kernel/pci-nommu.c b/arch/x86/kernel/pci-nommu.c
index da15918d1c81..876dacfbabf6 100644
--- a/arch/x86/kernel/pci-nommu.c
+++ b/arch/x86/kernel/pci-nommu.c
@@ -25,12 +25,12 @@ check_addr(char *name, struct device *hwdev, dma_addr_t bus, size_t size)
 	return 1;
 }

-static dma_addr_t nommu_map_page(struct device *dev, struct page *page,
-				 unsigned long offset, size_t size,
-				 enum dma_data_direction dir,
-				 struct dma_attrs *attrs)
+static dma_addr_t nommu_map_pfn(struct device *dev, __pfn_t pfn,
+				unsigned long offset, size_t size,
+				enum dma_data_direction dir,
+				struct dma_attrs *attrs)
 {
-	dma_addr_t bus = page_to_phys(page) + offset;
+	dma_addr_t bus = __pfn_t_to_phys(pfn) + offset;
 	WARN_ON(size == 0);
 	if (!check_addr("map_single", dev, bus, size))
 		return DMA_ERROR_CODE;
@@ -38,6 +38,14 @@ static
[PATCH v2 02/10] block: add helpers for accessing a bio_vec page
In preparation for converting struct bio_vec to carry a __pfn_t instead
of struct page.

This change is prompted by the desire to add in-kernel DMA support
(O_DIRECT, hierarchical storage, RDMA, etc) for persistent memory which
lacks struct page coverage.

Alternatives:

1/ Provide struct page coverage for persistent memory in DRAM.  The
   expectation is that persistent memory capacities make this untenable
   in the long term.

2/ Provide struct page coverage for persistent memory with persistent
   memory.  While persistent memory may have near DRAM performance
   characteristics it may not have the same write-endurance of DRAM.
   Given the update frequency of struct page objects it may not be
   suitable for persistent memory.

3/ Dynamically allocate struct page.  This appears to be on the order
   of the complexity of converting code paths to use __pfn_t references
   instead of struct page, and the amount of setup required to establish
   a valid struct page reference is mostly wasted when the only usage in
   the block stack is to perform a page_to_pfn() conversion for
   dma-mapping.  Instances of kmap() / kmap_atomic() usage appear to be
   the only occasions in the block stack where struct page is
   non-trivially used.  A new kmap_atomic_pfn_t() is proposed to handle
   those cases.

Generated with the following semantic patch:

// bv_page.cocci: convert usage of ->bv_page to use set/get helpers
// usage: make coccicheck COCCI=bv_page.cocci MODE=patch

virtual patch
virtual report
virtual org

@@
struct bio_vec bvec;
expression E;
type T;
@@

- bvec.bv_page = (T)E
+ bvec_set_page(&bvec, E)

@@
struct bio_vec *bvec;
expression E;
type T;
@@

- bvec->bv_page = (T)E
+ bvec_set_page(bvec, E)

@@
struct bio_vec bvec;
type T;
@@

- (T)bvec.bv_page
+ bvec_page(&bvec)

@@
struct bio_vec *bvec;
type T;
@@

- (T)bvec->bv_page
+ bvec_page(bvec)

@@
struct bio *bio;
expression E;
expression F;
type T;
@@

- bio->bi_io_vec[F].bv_page = (T)E
+ bvec_set_page(&bio->bi_io_vec[F], E)

@@
struct bio *bio;
expression E;
type T;
@@

- bio->bi_io_vec->bv_page = (T)E
+ bvec_set_page(bio->bi_io_vec, E)

@@
struct cached_dev *dc;
expression E;
type T;
@@

- dc->sb_bio.bi_io_vec->bv_page = (T)E
+ bvec_set_page(dc->sb_bio.bi_io_vec, E)

@@
struct cache *ca;
expression E;
expression F;
type T;
@@

- ca->sb_bio.bi_io_vec[F].bv_page = (T)E
+ bvec_set_page(&ca->sb_bio.bi_io_vec[F], E)

@@
struct cache *ca;
expression F;
@@

- ca->sb_bio.bi_io_vec[F].bv_page
+ bvec_page(&ca->sb_bio.bi_io_vec[F])

@@
struct cache *ca;
expression E;
expression F;
type T;
@@

- ca->sb_bio.bi_inline_vecs[F].bv_page = (T)E
+ bvec_set_page(&ca->sb_bio.bi_inline_vecs[F], E)

@@
struct cache *ca;
expression F;
@@

- ca->sb_bio.bi_inline_vecs[F].bv_page
+ bvec_page(&ca->sb_bio.bi_inline_vecs[F])

@@
struct cache *ca;
expression E;
type T;
@@

- ca->sb_bio.bi_io_vec->bv_page = (T)E
+ bvec_set_page(ca->sb_bio.bi_io_vec, E)

@@
struct bio *bio;
expression F;
@@

- bio->bi_io_vec[F].bv_page
+ bvec_page(&bio->bi_io_vec[F])

@@
struct bio bio;
expression F;
@@

- bio.bi_io_vec[F].bv_page
+ bvec_page(&bio.bi_io_vec[F])

@@
struct bio *bio;
@@

- bio->bi_io_vec->bv_page
+ bvec_page(bio->bi_io_vec)

@@
struct cached_dev *dc;
@@

- dc->sb_bio.bi_io_vec->bv_page
+ bvec_page(dc->sb_bio->bi_io_vec)

@@
struct bio bio;
@@

- bio.bi_io_vec->bv_page
+ bvec_page(bio.bi_io_vec)

@@
struct bio_integrity_payload *bip;
expression E;
type T;
@@

- bip->bip_vec->bv_page = (T)E
+ bvec_set_page(bip->bip_vec, E)

@@
struct bio_integrity_payload *bip;
@@

- bip->bip_vec->bv_page
+ bvec_page(bip->bip_vec)

@@
struct bio_integrity_payload bip;
@@

- bip.bip_vec->bv_page
+ bvec_page(bip.bip_vec)

Cc: Jens Axboe <ax...@kernel.dk>
Cc: Matthew Wilcox <wi...@linux.intel.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Cc: Neil Brown <ne...@suse.de>
Cc: Alasdair Kergon <a...@redhat.com>
Cc: Mike Snitzer <snit...@redhat.com>
Cc: Chris Mason <c...@fb.com>
Cc: Boaz Harrosh <b...@plexistor.com>
Cc: Theodore Ts'o <ty...@mit.edu>
Cc: Jan Kara <j...@suse.cz>
Cc: Julia Lawall <julia.law...@lip6.fr>
Cc: Martin K. Petersen <martin.peter...@oracle.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/powerpc/sysdev/axonram.c      |    2 +
 block/bio-integrity.c              |    8 ++--
 block/bio.c                        |   40 +++---
 block/blk-core.c                   |    4 +-
 block/blk-integrity.c              |    3 +-
 block/blk-lib.c                    |    2 +
 block/blk-merge.c                  |    7 ++--
 block/bounce.c                     |   24 ++---
 drivers/block/aoe/aoecmd.c         |    8 ++--
 drivers/block/brd.c                |    2 +
 drivers/block/drbd/drbd_bitmap.c   |    5 ++-
 drivers/block/drbd/drbd_main.c     |    6 ++-
 drivers/block/drbd/drbd_receiver.c |    4 +-
 drivers/block/drbd/drbd_worker.c   |    3 +-
 drivers/block/floppy.c             |    6
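To illustrate why the semantic patch funnels every ->bv_page access through bvec_page() and bvec_set_page(), here is a small userspace sketch. The helper bodies are not shown in this excerpt, so the definitions below are an assumption about their obvious shape, and struct model_page, struct model_bio_vec, model_bvec_page() and model_bvec_set_page() are hypothetical names used only for this model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page and struct bio_vec. */
struct model_page {
	int id;
};

struct model_bio_vec {
	struct model_page *bv_page; /* the field callers should stop touching */
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* Getter: the only place that knows how the page is stored. */
static inline struct model_page *model_bvec_page(const struct model_bio_vec *bvec)
{
	assert(bvec != NULL);
	return bvec->bv_page;
}

/* Setter: likewise hides the storage from callers. */
static inline void model_bvec_set_page(struct model_bio_vec *bvec,
				       struct model_page *page)
{
	assert(bvec != NULL);
	bvec->bv_page = page;
}
```

Once callers only use the two helpers, the storage behind them can change (for example from a page pointer to a __pfn_t, as the follow-on patch in this series does) without another tree-wide edit of the call sites.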
[PATCH v2 04/10] dma-mapping: allow archs to optionally specify a ->map_pfn() operation
This is in support of enabling block device drivers to perform DMA
to/from persistent memory which may not have a backing struct page
entry.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/Kconfig                             |    3 +++
 include/asm-generic/dma-mapping-common.h |   30 ++
 include/asm-generic/pfn.h                |    9 +
 include/linux/dma-debug.h                |   23 +++
 include/linux/dma-mapping.h              |    8 +++-
 lib/dma-debug.c                          |   10 ++
 6 files changed, 74 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb24997..f7f800860c00 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,6 +203,9 @@ config HAVE_DMA_ATTRS
 config HAVE_DMA_CONTIGUOUS
 	bool

+config HAVE_DMA_PFN
+	bool
+
 config GENERIC_SMP_IDLE_THREAD
 	bool

diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
index 940d5ec122c9..7305efb1bac6 100644
--- a/include/asm-generic/dma-mapping-common.h
+++ b/include/asm-generic/dma-mapping-common.h
@@ -17,9 +17,15 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,

 	kmemcheck_mark_initialized(ptr, size);
 	BUG_ON(!valid_dma_direction(dir));
+#ifdef CONFIG_HAVE_DMA_PFN
+	addr = ops->map_pfn(dev, page_to_pfn_typed(virt_to_page(ptr)),
+			     (unsigned long)ptr & ~PAGE_MASK, size,
+			     dir, attrs);
+#else
 	addr = ops->map_page(dev, virt_to_page(ptr),
 			     (unsigned long)ptr & ~PAGE_MASK, size,
 			     dir, attrs);
+#endif
 	debug_dma_map_page(dev, virt_to_page(ptr),
 			   (unsigned long)ptr & ~PAGE_MASK, size,
 			   dir, addr, true);
@@ -73,6 +79,29 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg
 	ops->unmap_sg(dev, sg, nents, dir, attrs);
 }

+#ifdef CONFIG_HAVE_DMA_PFN
+static inline dma_addr_t dma_map_pfn(struct device *dev, __pfn_t pfn,
+				      size_t offset, size_t size,
+				      enum dma_data_direction dir)
+{
+	struct dma_map_ops *ops = get_dma_ops(dev);
+	dma_addr_t addr;
+
+	BUG_ON(!valid_dma_direction(dir));
+	addr = ops->map_pfn(dev, pfn, offset, size, dir, NULL);
+	debug_dma_map_pfn(dev, pfn, offset, size, dir, addr, false);
+
+	return addr;
+}
+
+static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
+				      size_t offset, size_t size,
+				      enum dma_data_direction dir)
+{
+	kmemcheck_mark_initialized(page_address(page) + offset, size);
+	return dma_map_pfn(dev, page_to_pfn_typed(page), offset, size, dir);
+}
+#else
 static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
 				      size_t offset, size_t size,
 				      enum dma_data_direction dir)
@@ -87,6 +116,7 @@ static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,

 	return addr;
 }
+#endif /* CONFIG_HAVE_DMA_PFN */

 static inline void dma_unmap_page(struct device *dev, dma_addr_t addr,
 				  size_t size, enum dma_data_direction dir)
diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
index 91171e0285d9..c1fdf41fb726 100644
--- a/include/asm-generic/pfn.h
+++ b/include/asm-generic/pfn.h
@@ -48,4 +48,13 @@ static inline __pfn_t page_to_pfn_t(struct page *page)

 	return pfn;
 }
+
+static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
+{
+#if IS_ENABLED(CONFIG_PMEM_IO)
+	if (pfn.pfn < PAGE_OFFSET)
+		return pfn.pfn;
+#endif
+	return page_to_pfn(__pfn_t_to_page(pfn));
+}
 #endif /* __ASM_PFN_H */
diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h
index fe8cb610deac..a3b4c8c0cd68 100644
--- a/include/linux/dma-debug.h
+++ b/include/linux/dma-debug.h
@@ -34,10 +34,18 @@ extern void dma_debug_init(u32 num_entries);

 extern int dma_debug_resize_entries(u32 num_entries);

-extern void debug_dma_map_page(struct device *dev, struct page *page,
-			       size_t offset, size_t size,
-			       int direction, dma_addr_t dma_addr,
-			       bool map_single);
+extern void debug_dma_map_pfn(struct device *dev, __pfn_t pfn, size_t offset,
+			      size_t size, int direction, dma_addr_t dma_addr,
+			      bool map_single);
+
+static inline void debug_dma_map_page(struct device *dev, struct page *page,
+				       size_t offset, size_t size,
+				       int direction, dma_addr_t dma_addr,
+				       bool map_single
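The layering in this patch, where the page-based entry point becomes a thin wrapper over a pfn-based operation, can be sketched outside the kernel as follows. All names here (model_pfn_t, struct model_dma_ops, model_nommu_map_pfn(), model_dma_map_pfn(), model_dma_map_page()) are hypothetical, the "bus address" is simply the physical address as in a no-IOMMU setup, and a 4 KiB page size is assumed; debug hooks, direction and attrs are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumed 4 KiB pages */

/* Hypothetical pfn wrapper; the real __pfn_t can also hold a page pointer. */
typedef struct {
	uint64_t pfn;
} model_pfn_t;

/* Hypothetical page descriptor that simply remembers its own pfn. */
struct model_page {
	uint64_t pfn;
};

/* Ops table with a single pfn-based mapping operation. */
struct model_dma_ops {
	uint64_t (*map_pfn)(model_pfn_t pfn, size_t offset, size_t size);
};

/* Identity mapping: bus address == physical address. */
static inline uint64_t model_nommu_map_pfn(model_pfn_t pfn, size_t offset,
					   size_t size)
{
	(void)size; /* a real implementation would check the range */
	return (pfn.pfn << MODEL_PAGE_SHIFT) + offset;
}

/* Primary entry point: works with or without a backing page. */
static inline uint64_t model_dma_map_pfn(const struct model_dma_ops *ops,
					 model_pfn_t pfn, size_t offset,
					 size_t size)
{
	assert(ops != NULL && ops->map_pfn != NULL);
	return ops->map_pfn(pfn, offset, size);
}

/* Legacy entry point: converts the page and defers to the pfn path. */
static inline uint64_t model_dma_map_page(const struct model_dma_ops *ops,
					  const struct model_page *page,
					  size_t offset, size_t size)
{
	model_pfn_t pfn = { page->pfn };

	return model_dma_map_pfn(ops, pfn, offset, size);
}
```

With this shape, existing page-based callers keep working unchanged while pfn-only memory reaches the same implementation through model_dma_map_pfn().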
Re: [dm-devel] [PATCH stable] block: discard bdi_unregister() in favour of bdi_destroy()
On Wed, Apr 29, 2015 at 5:32 PM, NeilBrown <ne...@suse.de> wrote:
>
> bdi_unregister() now contains very little functionality.
>
> It contains a WARN_ON if bdi->dev is NULL.  This warning is of no
> real consequence as bdi->dev isn't needed by anything else in the
> function, and it triggers if
>    blk_cleanup_queue() -> bdi_destroy()
> is called before bdi_unregister, which a subsequent patch will make
> happen.  So this isn't wanted.
>
> It also calls bdi_set_min_ratio().  This needs to be called after
> writes through the bdi have all been flushed, and before the bdi is
> destroyed.  Calling it early is better than calling it late as it
> frees up a global resource.
>
> Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
> perfectly fits these requirements.
>
> So bdi_unregister can be discarded with the important content moved
> to bdi_destroy, as can the writeback_bdi_unregister event which is
> already not used.
>
> This is tagged for 'stable' as it is a pre-requisite for a subsequent
> patch which moves calls to blk_cleanup_queue() before calls to
> del_gendisk().  The commit identified as 'Fixes' removed a lot of
> other functionality from bdi_unregister(), and made a change which
> necessitated moving the blk_cleanup_queue() calls.
>
> Reported-by: Mike Snitzer <snit...@redhat.com>
> Cc: Christoph Hellwig <h...@lst.de>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: sta...@vger.kernel.org (v4.0)
> Fixes: c4db59d31e39ea067c32163ac961e9c80198fd37
> Signed-off-by: NeilBrown <ne...@suse.de>
>
> ---
>
> Hi again Jens,
>  would you be able to queue this patch *before* the other one:
>     "block: destroy bdi before blockdev is unregistered."
>  If it has to come after I'll need to re-write the text a bit.
>  If you could give me the commit hash to reference I'll do that.

Seems it is after:

http://git.kernel.dk/?p=linux-block.git;a=commit;h=6cd18e71

Also, we gave both patches a try internally after seeing the duplicate
sysfs warning.  You can add:

Acked-by: Dan Williams <dan.j.willi...@intel.com>
Tested-by: Nicholas Moulin <nicholas.w.mou...@linux.intel.com>

...on the re-send.

Thanks Neil!
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 6, 2015 at 5:19 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
> On Wed, May 6, 2015 at 4:47 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
>>
>> Conceptually better, but certainly more difficult to audit if the fake
>> struct page is initialized in a subtle way that breaks when/if it leaks
>> to some unwitting context.
>
> Maybe. It could go either way, though. In particular, with the
> dynamically allocated struct page approach, if somebody uses it past
> the supposed lifetime of the use, things like poisoning the temporary
> struct page could be fairly effective. You can't really poison the
> pfn - it's just a number, and if somebody uses it later than you think
> (and you have re-used that physical memory for something else), you'll
> never ever know.

True, but there's little need to poison a __pfn_t because it's
permanent once discovered via ->direct_access() on the hosting struct
block_device.  Sure, kmap_atomic_pfn_t() may fail when the pmem driver
unbinds from a device, but the __pfn_t is still valid.  Obviously, we
can only support atomic kmap(s) with this property, and it would be
nice to fault if someone continued to use the __pfn_t after the
hosting device was disabled.  To be clear, DAX has this same problem
today.  Nothing stops whomever called ->direct_access() to continue
using the pfn after the backing device has been disabled.

> I'd *assume* that most users of the dynamic struct page allocation
> have very clear lifetime rules. Those things would presumably normally
> get looked-up by some extended version of get_user_pages(), and
> there's a clear use of the result, with no longer lifetime. Also, you
> do need to have some higher-level locking when you do this, to make
> sure that the persistent pages don't magically get re-assigned. We're
> presumably talking about having a filesystem in that persistent
> memory, so we cannot be doing IO to the pages (from some other source
> - whether RDMA or some special zero-copy model) while the underlying
> filesystem is reassigning the storage because somebody deleted the
> file.
>
> IOW, there had better be other external rules about when - and how
> long - you can use a particular persistent page. No? So the whole
> when/how to allocate the temporary 'struct page' is just another
> detail in that whole thing.
>
> And yes, some uses may not ever actually see that. If the whole of
> persistent memory is just assigned to a database or something, and the
> DB just wants to do a "flush this range of persistent memory to
> long-term disk storage", then there may not be much of a lifetime
> issue for the persistent memory. But even then you're going to have IO
> completion callbacks etc to let the DB know that it has hit the disk,
> so..
>
> What is the primary thing that is driving this need? Do we have a very
> concrete example?

My pet concrete example is covered by __pfn_t.  Referencing persistent
memory in an md/dm hierarchical storage configuration.  Setting aside
the thrash to get existing block users to do bvec_set_page(page)
instead of bvec->page = page the onus is on that md/dm
implementation and backing storage device driver to operate on
__pfn_t.  That use case is simple because there is no use of page
locking or refcounting in that path, just dma_map_page() and
kmap_atomic().  The more difficult use case is precisely what Al
picked up on, O_DIRECT and RDMA.  This patchset does nothing to
address those use cases outside of not needing a struct page when they
eventually craft a bio.

I know Matthew Wilcox has explored the idea of get_user_sg() and let
the scatterlist hold the reference count and locks, but I'll let him
speak to that.

I still see __pfn_t as generally useful for the simple in-kernel
stacked-block-i/o use case.