Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Dan Williams
On Tue, May 8, 2018 at 3:32 PM, Alex Williamson
 wrote:
> On Tue, 8 May 2018 16:10:19 -0600
> Logan Gunthorpe  wrote:
>
>> On 08/05/18 04:03 PM, Alex Williamson wrote:
>> > If IOMMU grouping implies device assignment (because nobody else uses
>> > it to the same extent as device assignment) then the build-time option
>> > falls to pieces, we need a single kernel that can do both.  I think we
>> > need to get more clever about allowing the user to specify exactly at
>> > which points in the topology they want to disable isolation.  Thanks,
>>
>>
>> Yeah, so based on the discussion I'm leaning toward just having a
>> command line option that takes a list of BDFs and disables ACS for them.
>> (Essentially as Dan has suggested.) This avoids the shotgun.
>>
>> Then, the pci_p2pdma_distance command needs to check that ACS is
>> disabled for all bridges between the two devices. If this is not the
>> case, it returns -1. Future work can check if the EP has ATS support, in
>> which case it has to check for the ACS direct translated bit.
>>
>> A user then needs to either disable the IOMMU and/or add the command
>> line option to disable ACS for the specific downstream ports in the PCI
>> hierarchy. This means the IOMMU groups will be less granular but
>> presumably the person adding the command line argument understands this.
>>
>> We may also want to do some work so that there's informative dmesgs on
>> which BDFs need to be specified on the command line so it's not so
>> difficult for the user to figure out.
>
> I'd advise caution with a user supplied BDF approach, we have no
> guaranteed persistence for a device's PCI address.  Adding a device
> might renumber the buses, replacing a device with one that consumes
> more/less bus numbers can renumber the buses, motherboard firmware
> updates could renumber the buses, pci=assign-buses can renumber the
> buses, etc.  This is why the VT-d spec makes use of device paths when
> describing PCI hierarchies, firmware can't know what bus number will be
> assigned to a device, but it does know the base bus number and the path
> of devfns needed to get to it.  I don't know how we come up with an
> option that's easy enough for a user to understand, but reasonably
> robust against hardware changes.  Thanks,

True, but at the same time this feature is for "users with custom
hardware designed for purpose", I assume they would be willing to take
on the bus renumbering risk. It's already the case that
/sys/bus/pci/drivers/<driver>/bind takes a BDF, which is why it seemed
reasonable to make a similar interface for the command line. Ideally we
could later get something into ACPI or other platform firmware to arrange
for bridges to disable ACS by default if we see p2p becoming a common
off-the-shelf feature, i.e. a BIOS switch to enable p2p in a
given PCI-E sub-domain.
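
To make that concrete, here is a minimal sketch of the bridge-walk check
described above; the helper name and return convention are illustrative
assumptions, not the actual pci_p2pdma code:

#include <linux/pci.h>

/* Hedged sketch: walk every bridge above @dev and report whether ACS
 * request/completion redirect would break a direct peer-to-peer path.
 * pci_p2pdma_distance() would run this check on both legs of the path
 * and return -1 if either leg fails.
 */
static bool p2p_path_acs_clear(struct pci_dev *dev)
{
        struct pci_dev *bridge;

        for (bridge = pci_upstream_bridge(dev); bridge;
             bridge = pci_upstream_bridge(bridge)) {
                /* redirect enabled => TLPs detour through the root complex */
                if (pci_acs_enabled(bridge, PCI_ACS_RR | PCI_ACS_CR))
                        return false;
        }

        return true;
}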


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Dan Williams
On Mon, Apr 23, 2018 at 4:30 PM, Logan Gunthorpe  wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
>
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch hierarchy will be in the same IOMMU
> group, which implies that individual devices behind any switch
> hierarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
>
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.

>
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/pci/Kconfig|  9 +
>  drivers/pci/p2pdma.c   | 45 ++---
>  drivers/pci/pci.c  |  6 ++
>  include/linux/pci-p2pdma.h |  5 +
>  4 files changed, 50 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>   transactions must be between devices behind the same root port.
>   (Typically behind a network of PCIe switches).
>
> + Enabling this option will also disable ACS on all ports behind
> + any PCIe switch. This effectively puts all devices behind any
> + switch hierarchy into the same IOMMU group, which implies that
> + individual devices behind any switch will not be able to be
> + assigned to separate VMs because there is no isolation between
> + them. Additionally, any malicious PCIe devices will be able to
> + DMA to memory exposed by other EPs in the same domain as TLPs
> + will not be checked by the IOMMU.
> +
>   If unsure, say N.

It seems unwieldy that this is a compile time option and not a runtime
option. Can't we have a kernel command line option to opt-in to this
behavior rather than require a wholly separate kernel image?

Why is this text added in a follow-on patch and not the patch that
introduced the config option?

I'm also wondering if that command line option can take a 'bus device
function' address of a switch to limit the scope of where ACS is
disabled.
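
For illustration, a rough sketch of what such a command-line interface
might look like; the parameter name, list format, and lookup table are
assumptions made up for this example, not an existing kernel option:

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/string.h>

#define P2P_ACS_MAX 16

/* BDFs named on the (hypothetical) "pci_p2p_acs=" parameter */
static struct { unsigned int seg, bus, dev, fn; } p2p_acs_list[P2P_ACS_MAX];
static int p2p_acs_count;

static int __init p2p_acs_setup(char *str)
{
        char *bdf;

        /* accept a comma-separated list of DDDD:BB:DD.F entries */
        while ((bdf = strsep(&str, ",")) && p2p_acs_count < P2P_ACS_MAX) {
                unsigned int seg, bus, dev, fn;

                if (sscanf(bdf, "%x:%x:%x.%x", &seg, &bus, &dev, &fn) != 4)
                        continue;       /* skip malformed entries */
                p2p_acs_list[p2p_acs_count].seg = seg;
                p2p_acs_list[p2p_acs_count].bus = bus;
                p2p_acs_list[p2p_acs_count].dev = dev;
                p2p_acs_list[p2p_acs_count].fn = fn;
                p2p_acs_count++;
        }
        return 1;
}
__setup("pci_p2p_acs=", p2p_acs_setup);

Enumeration-time code could then match downstream ports against this
list and leave ACS disabled only on those ports.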


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Dan Williams
On Wed, Mar 14, 2018 at 12:34 PM, Stephen  Bates  wrote:
>> P2P over PCI/PCI-X is quite common in devices like raid controllers.
>
> Hi Dan
>
> Do you mean between PCIe devices below the RAID controller? Isn't it pretty 
> novel to be able to support PCIe EPs below a RAID controller (as opposed to 
> SCSI based devices)?

I'm thinking of the classic I/O offload card where there's an NTB to
an internal PCI bus that has a storage controller and raid offload
engines.


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Dan Williams
On Wed, Mar 14, 2018 at 12:03 PM, Logan Gunthorpe  wrote:
>
>
> On 14/03/18 12:51 PM, Bjorn Helgaas wrote:
>> You are focused on PCIe systems, and in those systems, most topologies
>> do have an upstream switch, which means two upstream bridges.  I'm
>> trying to remove that assumption because I don't think there's a
>> requirement for it in the spec.  Enforcing this assumption complicates
>> the code and makes it harder to understand because the reader says
>> "huh, I know peer-to-peer DMA should work inside any PCI hierarchy*,
>> so why do we need these two bridges?"
>
> Yes, as I've said, we focused on being behind a single PCIe Switch
> because it's easier and vaguely safer (we *know* switches will work but
> other types of topology we have to assume will work based on the spec).
> Also, I have my doubts that anyone will ever have a use for this with
> non-PCIe devices.

P2P over PCI/PCI-X is quite common in devices like raid controllers.
It would be useful if those configurations were not left behind so
that Linux could feasibly deploy offload code to a controller in the
PCI domain.


Re: [PATCH v2 02/10] PCI/P2PDMA: Add sysfs group to display p2pmem stats

2018-03-01 Thread Dan Williams
On Thu, Mar 1, 2018 at 4:15 PM, Logan Gunthorpe  wrote:
>
>
> On 01/03/18 10:44 AM, Bjorn Helgaas wrote:
>>
>> I think these two statements are out of order, since the attributes
>> dereference pdev->p2pdma.  And it looks like you set "error"
>> unnecessarily, since you return immediately looking at it.
>
>
> Per the previous series, sysfs_create_group is must_check for some reason. I
> had a printk there but you didn't think it was necessary. So assigning it to
> error is the only way to squash the warning.

Why not fail the setup if the sysfs_create_group() fails? Sure, it may
not be strictly required for userspace to have access to these
attributes, but it seems hostile that userspace can't make assumptions
about the presence of the "p2pmem" directory relative to the
capability being set up.
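
A minimal sketch of the alternative being suggested here, i.e.
propagating the sysfs_create_group() error instead of squashing it; the
function and attribute-group names are hypothetical stand-ins for the
driver's real ones:

#include <linux/pci.h>
#include <linux/sysfs.h>

/* hypothetical stand-in for the driver's real attribute group */
static struct attribute *p2pmem_attrs[] = {
        NULL,
};

static const struct attribute_group p2pmem_group = {
        .name  = "p2pmem",
        .attrs = p2pmem_attrs,
};

static int pci_p2pdma_sysfs_setup(struct pci_dev *pdev)
{
        int error;

        error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
        if (error)
                return error;   /* fail the setup rather than warn and continue */

        return 0;
}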


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Dan Williams
On Thu, Mar 1, 2018 at 12:34 PM, Benjamin Herrenschmidt
<b...@au1.ibm.com> wrote:
> On Thu, 2018-03-01 at 11:21 -0800, Dan Williams wrote:
>> On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt
>> <b...@au1.ibm.com> wrote:
>> > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
>> > > On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
>> > > > Hi Everyone,
>> > >
>> > >
>> > > So Oliver (CC) was having issues getting any of that to work for us.
>> > >
>> > > The problem is that according to him (I didn't double check the latest
>> > > patches) you effectively hotplug the PCIe memory into the system when
>> > > creating struct pages.
>> > >
>> > > This cannot possibly work for us. First we cannot map PCIe memory as
>> > > cacheable. (Note that doing so is a bad idea if you are behind a PLX
>> > > switch anyway since you'd have to manage cache coherency in SW).
>> >
>> > Note: I think the above means it won't work behind a switch on x86
>> > either, will it ?
>>
>> The devm_memremap_pages() infrastructure allows placing the memmap in
>> "System-RAM" even if the hotplugged range is in PCI space. So, even if
>> it is an issue on some configurations, it's just a simple adjustment
>> to where the memmap is placed.
>
> But what happens with that PCI memory? Is it effectively turned into
> normal memory (i.e., usable for normal allocations, potentially used to
> populate user pages etc...) or is it kept aside?
>
> Also on ppc64, the physical addresses of PCIe are so far apart
> that there's no way we can map them into the linear mapping at the
> normal offset of PAGE_OFFSET + (pfn << PAGE_SHIFT), so things like
> page_address or virt_to_page cannot work as-is on PCIe addresses.

Ah ok, I'd need to look at the details. I had been assuming that
sparse-vmemmap could handle such a situation, but that could indeed be
a broken assumption.


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Dan Williams
On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt
 wrote:
> On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
>> On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
>> > Hi Everyone,
>>
>>
>> So Oliver (CC) was having issues getting any of that to work for us.
>>
>> The problem is that according to him (I didn't double check the latest
>> patches) you effectively hotplug the PCIe memory into the system when
>> creating struct pages.
>>
>> This cannot possibly work for us. First we cannot map PCIe memory as
>> cacheable. (Note that doing so is a bad idea if you are behind a PLX
>> switch anyway since you'd have to manage cache coherency in SW).
>
> Note: I think the above means it won't work behind a switch on x86
> either, will it ?

The devm_memremap_pages() infrastructure allows placing the memmap in
"System-RAM" even if the hotplugged range is in PCI space. So, even if
it is an issue on some configurations, it's just a simple adjustment
to where the memmap is placed.
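
For context, a hedged sketch of the call being discussed, using the
four-argument devm_memremap_pages() signature from roughly this era
(later kernels reworked it around struct dev_pagemap); passing a NULL
vmem_altmap is what places the memmap in regular System-RAM rather than
in the hotplugged PCI range itself:

#include <linux/memremap.h>
#include <linux/percpu-refcount.h>

/* Sketch only: map a PCI BAR range as ZONE_DEVICE pages while keeping
 * the struct page array (the memmap) in ordinary System-RAM.
 */
static void *p2p_map_resource(struct device *dev, struct resource *res,
                              struct percpu_ref *ref)
{
        /* altmap == NULL: memmap allocated from System-RAM */
        return devm_memremap_pages(dev, res, ref, NULL);
}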


Re: [PATCH] brd: fix overflow in __brd_direct_access

2017-09-13 Thread Dan Williams
On Wed, Sep 13, 2017 at 6:17 AM, Mikulas Patocka <mpato...@redhat.com> wrote:
> The code in __brd_direct_access multiplies the pgoff variable by page size
> and divides it by 512. It can cause overflow on 32-bit architectures. The
> overflow happens if we create ramdisk larger than 4G and use it as a
> sparse device.
>
> This patch replaces multiplication and division with multiplication by the
> number of sectors per page.
>
> Signed-off-by: Mikulas Patocka <mpato...@redhat.com>
> Fixes: 1647b9b959c7 ("brd: add dax_operations support")
> Cc: sta...@vger.kernel.org  # 4.12+
>
> ---
>  drivers/block/brd.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-4.13/drivers/block/brd.c
> ===
> --- linux-4.13.orig/drivers/block/brd.c
> +++ linux-4.13/drivers/block/brd.c
> @@ -339,7 +339,7 @@ static long __brd_direct_access(struct b
>
> if (!brd)
> return -ENODEV;
> -   page = brd_insert_page(brd, PFN_PHYS(pgoff) / 512);
> +   page = brd_insert_page(brd, (sector_t)pgoff << PAGE_SECTORS_SHIFT);

Looks good to me, you can add:

Reviewed-by: Dan Williams <dan.j.willi...@intel.com>
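
As a worked example of the overflow being fixed, the following
standalone program mirrors the arithmetic on a typical 32-bit build
(4K pages, 32-bit unsigned arithmetic standing in for a 32-bit
phys_addr_t):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT         12
#define PAGE_SECTORS_SHIFT (PAGE_SHIFT - 9)   /* 512-byte sectors per page */

int main(void)
{
        uint32_t pgoff = 0x100000;  /* page offset at exactly 4 GiB */

        /* old expression: the shift wraps in 32 bits before the division */
        uint32_t overflowed = (uint32_t)(pgoff << PAGE_SHIFT) / 512;

        /* fixed expression: widen first, then shift by sectors-per-page */
        uint64_t correct = (uint64_t)pgoff << PAGE_SECTORS_SHIFT;

        printf("old: %u, new: %llu\n", overflowed,
               (unsigned long long)correct);   /* old: 0, new: 8388608 */
        return 0;
}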


Re: [PATCH 5/5] testb: badblock support

2017-08-05 Thread Dan Williams
On Sat, Aug 5, 2017 at 8:51 AM, Shaohua Li  wrote:
> From: Shaohua Li 
>
> Sometimes a disk could have broken tracks where the data is inaccessible,
> but data in other parts can be accessed in the normal way. MD RAID supports
> such disks. But we don't have a good way to test it, because we can't
> control which part of a physical disk is bad. For a virtual disk, this
> can be easily controlled.
>
> This patch adds a new 'badblock' attribute. Configure it in this way:
> echo "+1-100" > xxx/badblock, this will mark sectors [1-100] as bad
> blocks.
> echo "-20-30" > xxx/badblock, this will mark sectors [20-30] as good

Did you happen to overlook block/badblocks.c, or did you find it unsuitable?


Re: [resend PATCH v2 11/33] dm: add dax_device and dax_operations support

2017-07-29 Thread Dan Williams
On Fri, Jul 28, 2017 at 9:17 AM, Bart Van Assche <bart.vanass...@wdc.com> wrote:
> On Mon, 2017-04-17 at 12:09 -0700, Dan Williams wrote:
>> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
>> index b7767da50c26..1de8372d9459 100644
>> --- a/drivers/md/Kconfig
>> +++ b/drivers/md/Kconfig
>> @@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
>>  config BLK_DEV_DM
>>   tristate "Device mapper support"
>>   select BLK_DEV_DM_BUILTIN
>> + select DAX
>>   ---help---
>> Device-mapper is a low level volume manager.  It works by allowing
>> people to specify mappings for ranges of logical sectors.  Various
>
> (replying to an e-mail of three months ago)
>
> Hello Dan,
>
> While building a v4.12 kernel I noticed that enabling device mapper support
> now unconditionally enables DAX. I think there are plenty of systems that use
> dm but do not need DAX. Have you considered to rework this such that instead
> of dm selecting DAX that DAX support is only enabled in dm if CONFIG_DAX is
> enabled?
>

I'd rather flip this around and add a CONFIG_DM_DAX that gates whether
DM enables / links to the DAX core. I'll take a look at a patch.


Re: [PATCH] brd: fix brd_rw_page() vs copy_to_brd_setup errors

2017-07-26 Thread Dan Williams
On Wed, Jul 26, 2017 at 2:32 PM, Ross Zwisler
<ross.zwis...@linux.intel.com> wrote:
> On Wed, Jul 26, 2017 at 01:12:28PM -0700, Christoph Hellwig wrote:
>> On Tue, Jul 25, 2017 at 06:02:29PM -0700, Dan Williams wrote:
>> > As is done in zram_rw_page, pmem_rw_page, and btt_rw_page, don't
>> > call page_endio in the error case since do_mpage_readpage and
>> > __mpage_writepage will resubmit on error. Calling page_endio in the
>> > error case leads to double completion.
>> >
>> > Cc: Jens Axboe <ax...@kernel.dk>
>> > Cc: Matthew Wilcox <mawil...@microsoft.com>
>> > Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
>> > Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
>> > ---
>> > Noticed this while looking at unrelated brd code...
>>
>> And the real question would be:  where would we see any real life impact
>> of just removing brd_rw_page?
>
> I've got patches ready that remove rw_page from brd, btt and pmem.  I'll send
> out once I'm done regression testing.

That would leave zram_rw_page(), is there a compelling reason to keep
that and the related infrastructure?


[PATCH] brd: fix brd_rw_page() vs copy_to_brd_setup errors

2017-07-25 Thread Dan Williams
As is done in zram_rw_page, pmem_rw_page, and btt_rw_page, don't
call page_endio in the error case since do_mpage_readpage and
__mpage_writepage will resubmit on error. Calling page_endio in the
error case leads to double completion.

Cc: Jens Axboe <ax...@kernel.dk>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
Noticed this while looking at unrelated brd code...

 drivers/block/brd.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 104b71c0490d..055255ea131d 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -327,7 +327,13 @@ static int brd_rw_page(struct block_device *bdev, sector_t 
sector,
 {
struct brd_device *brd = bdev->bd_disk->private_data;
int err = brd_do_bvec(brd, page, PAGE_SIZE, 0, is_write, sector);
-   page_endio(page, is_write, err);
+
+   /*
+* In the error case we expect the upper layer to retry, so we
+* can't trigger page_endio yet.
+*/
+   if (err == 0)
+   page_endio(page, is_write, 0);
return err;
 }
 



Re: [PATCH 1/2] block, dax: move "select DAX" from BLOCK to FS_DAX

2017-05-09 Thread Dan Williams
On Tue, May 9, 2017 at 12:57 AM, Geert Uytterhoeven
<ge...@linux-m68k.org> wrote:
> Hi Dan,
>
> On Tue, May 9, 2017 at 12:36 AM, kbuild test robot <l...@intel.com> wrote:
>> [auto build test ERROR on linus/master]
>> [also build test ERROR on next-20170508]
>> [cannot apply to v4.11]
>> [if your patch is applied to the wrong git tree, please drop us a note to 
>> help improve the system]
>>
>> url:
>> https://github.com/0day-ci/linux/commits/Dan-Williams/block-dax-move-select-DAX-from-BLOCK-to-FS_DAX/20170509-051522
>> config: parisc-c3000_defconfig (attached as .config)
>> compiler: hppa-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
>> reproduce:
>> wget 
>> https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
>> ~/bin/make.cross
>> chmod +x ~/bin/make.cross
>> # save the attached .config to linux build tree
>> make.cross ARCH=parisc
>>
>> All errors (new ones prefixed by >>):
>>
>>fs/built-in.o: In function `bdev_dax_supported':
>>>> (.text.bdev_dax_supported+0x4c): undefined reference to `dax_get_by_host'
>>fs/built-in.o: In function `bdev_dax_supported':
>>>> (.text.bdev_dax_supported+0x5c): undefined reference to `dax_read_lock'
>>fs/built-in.o: In function `bdev_dax_supported':
>>>> (.text.bdev_dax_supported+0x7c): undefined reference to `dax_direct_access'
>>fs/built-in.o: In function `bdev_dax_supported':
>>>> (.text.bdev_dax_supported+0x88): undefined reference to `dax_read_unlock'
>>fs/built-in.o: In function `bdev_dax_supported':
>>>> (.text.bdev_dax_supported+0x90): undefined reference to `put_dax'
>
> I ran into the same issue if CONFIG_DAX=m (it's still selected by some other
> modular symbol). #if IS_ENABLED(CONFIG_DAX) is true in the modular case, so
> the dummies provided by include/linux/dax.h are not used.
>
> However, while changing it to #ifdef CONFIG_DAX allows vmlinux to build, it
> leads to other issues as DAX is compiled as a module:
>
> drivers/dax/super.c:35: error: redefinition of ‘dax_read_lock’
> include/linux/dax.h:30: error: previous definition of
> ‘dax_read_lock’ was here
>
> Yes, calling into optional modular code from builtin code in fs/blockdev.c is
> tricky ;-(  Perhaps you can make bdev_dax_supported() a small wrapper
> that calls into the real code through a function pointer, when the DAX module
> is available?

In fact, that is close to what I did for v2, and it passed a full run
through the kbuild robot. I'll send it out shortly.
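
A small illustration of the IS_ENABLED() vs. #ifdef distinction Geert
points out; this is illustrative only, not a proposed patch:

#include <linux/kconfig.h>

/*
 * IS_ENABLED(CONFIG_DAX) evaluates to 1 for both CONFIG_DAX=y and
 * CONFIG_DAX=m, so built-in code guarded by it still emits calls to
 * symbols that only exist in the (not yet loaded) dax module, and the
 * vmlinux link fails.  A plain #ifdef only matches the built-in case.
 */
#if IS_ENABLED(CONFIG_DAX)      /* true for =y and =m */
int dax_read_lock(void);
#else
static inline int dax_read_lock(void) { return 0; }
#endif

#ifdef CONFIG_DAX               /* true only for =y */
/* built-in-only code would go here */
#endif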


[PATCH 2/2] device-dax: kill NR_DEV_DAX

2017-05-08 Thread Dan Williams
There is no point in asking how many device-dax instances the kernel should
support. Since we are already using a dynamic major number, just allow
the max number of minors by default and be done. This also fixes the
fact that the proposed max for the NR_DEV_DAX range was larger than what
could be supported by alloc_chrdev_region().

Fixes: ba09c01d2fa8 ("dax: convert to the cdev api")
Reported-by: Geert Uytterhoeven <ge...@linux-m68k.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/dax/Kconfig |5 -
 drivers/dax/super.c |   11 +++
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 6b7e20eae16c..b79aa8f7a497 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -28,9 +28,4 @@ config DEV_DAX_PMEM
 
  Say Y if unsure
 
-config NR_DEV_DAX
-   int "Maximum number of Device-DAX instances"
-   default 32768
-   range 256 2147483647
-
 endif
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 465dcd7317d5..f2d4f20eb2b8 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -20,10 +20,6 @@
 #include 
 #include 
 
-static int nr_dax = CONFIG_NR_DEV_DAX;
-module_param(nr_dax, int, S_IRUGO);
-MODULE_PARM_DESC(nr_dax, "max number of dax device instances");
-
 static dev_t dax_devt;
 DEFINE_STATIC_SRCU(dax_srcu);
 static struct vfsmount *dax_mnt;
@@ -261,7 +257,7 @@ struct dax_device *alloc_dax(void *private, const char 
*__host,
if (__host && !host)
return NULL;
 
-   minor = ida_simple_get(&dax_minor_ida, 0, nr_dax, GFP_KERNEL);
+   minor = ida_simple_get(&dax_minor_ida, 0, MINORMASK+1, GFP_KERNEL);
if (minor < 0)
goto err_minor;
 
@@ -405,8 +401,7 @@ static int __init dax_fs_init(void)
if (rc)
return rc;
 
-   nr_dax = max(nr_dax, 256);
-   rc = alloc_chrdev_region(&dax_devt, 0, nr_dax, "dax");
+   rc = alloc_chrdev_region(&dax_devt, 0, MINORMASK+1, "dax");
if (rc)
__dax_fs_exit();
return rc;
@@ -414,7 +409,7 @@ static int __init dax_fs_init(void)
 
 static void __exit dax_fs_exit(void)
 {
-   unregister_chrdev_region(dax_devt, nr_dax);
+   unregister_chrdev_region(dax_devt, MINORMASK+1);
ida_destroy(&dax_minor_ida);
__dax_fs_exit();
 }



[PATCH 1/2] block, dax: move "select DAX" from BLOCK to FS_DAX

2017-05-08 Thread Dan Williams
For configurations that do not enable DAX filesystems or drivers, do not
require the DAX core to be built.

The only core block routine that calls a DAX routine is
bdev_dax_supported(), which now fails by default as expected if FS_DAX=n,
or no DAX-capable drivers are configured.

Reported-by: Geert Uytterhoeven <ge...@linux-m68k.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 block/Kconfig   |1 -
 fs/Kconfig  |1 +
 include/linux/dax.h |   37 ++---
 3 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index 93da7fc3f254..e9f780f815f5 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -6,7 +6,6 @@ menuconfig BLOCK
default y
select SBITMAP
select SRCU
-   select DAX
help
 Provide block layer support for the kernel.
 
diff --git a/fs/Kconfig b/fs/Kconfig
index 83eab52fb3f6..b0e42b6a96b9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -39,6 +39,7 @@ config FS_DAX
depends on MMU
depends on !(ARM || MIPS || SPARC)
select FS_IOMAP
+   select DAX
help
  Direct Access (DAX) can be used on memory-backed block devices.
  If the block device supports DAX and the filesystem supports DAX,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..e3fa0ad5a12d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -18,17 +18,48 @@ struct dax_operations {
void **, pfn_t *);
 };
 
+#if IS_ENABLED(CONFIG_DAX)
 int dax_read_lock(void);
 void dax_read_unlock(int id);
 struct dax_device *dax_get_by_host(const char *host);
+void put_dax(struct dax_device *dax_dev);
+long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long 
nr_pages,
+   void **kaddr, pfn_t *pfn);
+#else
+static inline int dax_read_lock(void)
+{
+   /* all dax operations inside this lock will fail below */
+   return 0;
+}
+
+static inline void dax_read_unlock(int id)
+{
+}
+
+static inline struct dax_device *dax_get_by_host(const char *host)
+{
+   return NULL;
+}
+
+static inline void put_dax(struct dax_device *dax_dev)
+{
+}
+
+static inline long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
+{
+   /* avoid 'uninitialized variable' warnings for @kaddr and @pfn */
+   *pfn = (pfn_t) { .val = ULLONG_MAX };
+   *kaddr = (void *) ULONG_MAX;
+
+   return -EOPNOTSUPP;
+}
+#endif
 struct dax_device *alloc_dax(void *private, const char *host,
const struct dax_operations *ops);
-void put_dax(struct dax_device *dax_dev);
 bool dax_alive(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
-long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long 
nr_pages,
-   void **kaddr, pfn_t *pfn);
 
 /*
  * We use lowest available bit in exceptional entry for locking, one bit for



Re: [PATCH] brd: fix uninitialized use of brd->dax_dev

2017-05-03 Thread Dan Williams
On Wed, May 3, 2017 at 5:56 AM, Gerald Schaefer
 wrote:
> commit 1647b9b9 "brd: add dax_operations support" introduced the allocation
> and freeing of a dax_device, but the allocated dax_device is not stored
> into the brd_device, so brd_del_one() will eventually operate on an
> uninitialized brd->dax_dev.
>
> Fix this by storing the allocated dax_device to brd->dax_dev.
>
> Signed-off-by: Gerald Schaefer 

Thanks Gerald, I added this to the dax_device conversion for 4.12.


[PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations

2017-04-28 Thread Dan Williams
The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes
are not cached. It is sufficient for the writes to be flushed to a
cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
userspace to call fsync() to ensure data-writes have reached a
power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
REQ_FLUSH to the pmem driver which will turn around and fence previous
writes with an "sfence".

Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
that guarantee that the destination buffer is not dirty in the cpu cache
on completion. The new copy_from_iter_wt and sub-routines will be used
to replace the "pmem api" (include/linux/pmem.h +
arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
otherwise.

This is meant to satisfy the concern from Linus that if a driver wants
to do something beyond the normal nocache semantics it should be
something private to that driver [1], and Al's concern that anything
uaccess related belongs with the rest of the uaccess code [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html

Cc: <x...@kernel.org>
Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: Al Viro <v...@zeniv.linux.org.uk>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
Changes since the initial RFC:
* s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
  etc. (Ingo)

 arch/x86/Kconfig  |1 
 arch/x86/include/asm/string_64.h  |5 +
 arch/x86/include/asm/uaccess_64.h |   11 +++
 arch/x86/lib/usercopy_64.c|  128 +
 drivers/acpi/nfit/core.c  |3 -
 drivers/nvdimm/claim.c|2 -
 drivers/nvdimm/pmem.c |   13 +++-
 drivers/nvdimm/region_devs.c  |4 +
 include/linux/dax.h   |3 +
 include/linux/string.h|6 ++
 include/linux/uio.h   |   15 
 lib/Kconfig   |3 +
 lib/iov_iter.c|   21 ++
 13 files changed, 208 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d50fdff77ee..398117923b1c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
select ARCH_HAS_KCOV            if X86_64
select ARCH_HAS_MMIO_FLUSH
select ARCH_HAS_PMEM_API        if X86_64
+   select ARCH_HAS_UACCESS_WT  if X86_64
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 733bae07fb29..dfbd66b11c72 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -109,6 +109,11 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
return 0;
 }
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+#define __HAVE_ARCH_MEMCPY_WT 1
+void memcpy_wt(void *dst, const void *src, size_t cnt);
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/include/asm/uaccess_64.h 
b/arch/x86/include/asm/uaccess_64.h
index c5504b9a472e..07ded30c7e89 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -171,6 +171,10 @@ unsigned long raw_copy_in_user(void __user *dst, const 
void __user *src, unsigne
 extern long __copy_user_nocache(void *dst, const void __user *src,
unsigned size, int zerorest);
 
+extern long __copy_user_wt(void *dst, const void __user *src, unsigned size);
+extern void memcpy_page_wt(char *to, struct page *page, size_t offset,
+  size_t len);
+
 static inline int
 __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
  unsigned size)
@@ -179,6 +183,13 @@ __copy_from_user_inatomic_nocache(void *dst, const void 
__user *src,
return __copy_user_nocache(dst, src, size, 0);
 }
 
+static inline int
+__copy_from_user_inatomic_wt(void *dst, const void __user *src, unsigned size)
+{
+   kasan_check_write(dst, size);
+   return __copy_user_wt(dst, src, size);
+}
+
 unsigned long
 copy_user_handle_tail(char *to, char *from, unsigned len);
 
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3b7c40a2e3e1..0aeff66a02
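
A hedged sketch of the call pattern the changelog above describes: use
the new write-through copy when the architecture provides it, otherwise
fall back to the nocache variant. The _wt names are the ones proposed in
this patch and the fallback placement is an assumption for illustration:

#include <linux/uio.h>

static size_t pmem_copy_from_iter(void *pmem_addr, size_t bytes,
                                  struct iov_iter *i)
{
#ifdef CONFIG_ARCH_HAS_UACCESS_WT
        /* destination is guaranteed not to be dirty in the cpu cache */
        return copy_from_iter_wt(pmem_addr, bytes, i);
#else
        /* plain nocache semantics, as the changelog's fallback describes */
        return copy_from_iter_nocache(pmem_addr, bytes, i);
#endif
}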

[PATCH] block: hide badblocks attribute by default

2017-04-27 Thread Dan Williams
Commit 99e6608c9e74 "block: Add badblock management for gendisks"
allowed for drivers like pmem and software-raid to advertise a list of
bad media areas. However, it inadvertently added a 'badblocks' to all
block devices. Lets clean this up by having the 'badblocks' attribute
not be visible when the driver has not populated a 'struct badblocks'
instance in the gendisk.

Cc: Jens Axboe <ax...@fb.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Martin K. Petersen <martin.peter...@oracle.com>
Reported-by: Vishal Verma <vishal.l.ve...@intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 block/genhd.c |   11 +++
 1 file changed, 11 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index a9c516a8b37d..12acd48e1210 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1060,8 +1060,19 @@ static struct attribute *disk_attrs[] = {
NULL
 };
 
+static umode_t disk_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+   struct device *dev = container_of(kobj, typeof(*dev), kobj);
+   struct gendisk *disk = dev_to_disk(dev);
+
+   if (a == &dev_attr_badblocks.attr && !disk->bb)
+   return 0;
+   return a->mode;
+}
+
 static struct attribute_group disk_attr_group = {
.attrs = disk_attrs,
+   .is_visible = disk_visible,
 };
 
 static const struct attribute_group *disk_attr_groups[] = {



[RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem

2017-04-26 Thread Dan Williams
The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes
are not cached. It is sufficient for the writes to be flushed to a
cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
userspace to call fsync() to ensure data-writes have reached a
power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
REQ_FLUSH to the pmem driver which will turn around and fence previous
writes with an "sfence".

Implement a __copy_from_user_inatomic_writethru, memcpy_page_writethru,
and memcpy_writethru, that guarantee that the destination buffer is not
dirty in the cpu cache on completion. The new copy_from_iter_writethru
and sub-routines will be used to replace the "pmem api"
(include/linux/pmem.h + arch/x86/include/asm/pmem.h). The availability
of copy_from_iter_writethru() and memcpy_writethru() are gated by the
CONFIG_ARCH_HAS_UACCESS_WRITETHRU config symbol, and fallback to
copy_from_iter_nocache() and plain memcpy() otherwise.

This is meant to satisfy the concern from Linus that if a driver wants
to do something beyond the normal nocache semantics it should be
something private to that driver [1], and Al's concern that anything
uaccess related belongs with the rest of the uaccess code [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html

Cc: <x...@kernel.org>
Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: Al Viro <v...@zeniv.linux.org.uk>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---

This patch is based on a merge of vfs.git/for-next and
nvdimm.git/libnvdimm-for-next.

 arch/x86/Kconfig  |1 
 arch/x86/include/asm/string_64.h  |5 +
 arch/x86/include/asm/uaccess_64.h |   13 
 arch/x86/lib/usercopy_64.c|  128 +
 drivers/acpi/nfit/core.c  |2 -
 drivers/nvdimm/claim.c|2 -
 drivers/nvdimm/pmem.c |   13 +++-
 drivers/nvdimm/region_devs.c  |2 -
 include/linux/dax.h   |3 +
 include/linux/string.h|6 ++
 include/linux/uio.h   |   15 
 lib/Kconfig   |3 +
 lib/iov_iter.c|   22 ++
 13 files changed, 210 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d50fdff77ee..bd3ff407d707 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
select ARCH_HAS_KCOV            if X86_64
select ARCH_HAS_MMIO_FLUSH
select ARCH_HAS_PMEM_API        if X86_64
+   select ARCH_HAS_UACCESS_WRITETHRU   if X86_64
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 733bae07fb29..60173bc51603 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -109,6 +109,11 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
return 0;
 }
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+#define __HAVE_ARCH_MEMCPY_WRITETHRU 1
+void memcpy_writethru(void *dst, const void *src, size_t cnt);
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/include/asm/uaccess_64.h 
b/arch/x86/include/asm/uaccess_64.h
index c5504b9a472e..748e8a50e4b3 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -171,6 +171,11 @@ unsigned long raw_copy_in_user(void __user *dst, const 
void __user *src, unsigne
 extern long __copy_user_nocache(void *dst, const void __user *src,
unsigned size, int zerorest);
 
+extern long __copy_user_writethru(void *dst, const void __user *src,
+ unsigned size);
+extern void memcpy_page_writethru(char *to, struct page *page, size_t offset,
+ size_t len);
+
 static inline int
 __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
  unsigned size)
@@ -179,6 +184,14 @@ __copy_from_user_inatomic_nocache(void *dst, const void 
__user *src,
return __copy_user_nocache(dst, src, size, 0);
 }
 
+static inline int
+__copy_from_user_inatomic_writethru(void *dst, const void __user *src,
+ unsigned size)
+{
+   kasan_check_write(dst, size);
+   return __copy_user_writethru(dst, src, size);
+}
+
 unsigned long
 copy_user

Re: [resend PATCH v2 00/33] dax: introduce dax_operations

2017-04-25 Thread Dan Williams
On Fri, Apr 21, 2017 at 6:06 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
> [ adding akpm, sfr, and jens ]
>
> I applied this series and pushed it out for the nvdimm.git branch that
> gets auto pulled into -next. The set is still awaiting acks from
> device-mapper, ext4, xfs, and vfs (for the copy_from_iter_ops, patch
> 29/33). If those come next week perhaps this can be merged for 4.12,
> but if not this will need to wait until 4.13.
>
> There are some minor collisions with Al's copy_from_user rework, the
> new dax tracepoints, and the removal of discard support from the brd
> driver. A sample merge is available here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/log/?h=libnvdimm-for-4.12-merge
>
> If it causes any other problems just drop and I'll retry for 4.13.

Al has nak'd the uaccess related changes, and I'll need to rework
those patches to move the pmem routines into lib/iov_iter.c directly.
That doesn't affect the dax_device and dax_operations work, so I'm
still looking to move forward with that change.  That reduces the set
targeting 4.12 to just the first 18 patches from this series:

Dan Williams (18):
  device-dax: rename 'dax_dev' to 'dev_dax'
  dax: refactor dax-fs into a generic provider of 'struct
dax_device' instances
  dax: add a facility to lookup a dax device by 'host' device name
  dax: introduce dax_operations
  pmem: add dax_operations support
  axon_ram: add dax_operations support
  brd: add dax_operations support
  dcssblk: add dax_operations support
  block: kill bdev_dax_capable()
  dax: introduce dax_direct_access()
  dm: add dax_device and dax_operations support
  dm: teach dm-targets to use a dax_device + dax_operations
  ext2, ext4, xfs: retrieve dax_device for iomap operations
  Revert "block: use DAX for partition table reads"
  filesystem-dax: convert to dax_direct_access()
  block, dax: convert bdev_dax_supported() to dax_direct_access()
  block: remove block_device_operations ->direct_access()
  x86, dax, pmem: remove indirection around memcpy_from_pmem()


Re: [resend PATCH v2 00/33] dax: introduce dax_operations

2017-04-21 Thread Dan Williams
[ adding akpm, sfr, and jens ]

I applied this series and pushed it out for the nvdimm.git branch that
gets auto pulled into -next. The set is still awaiting acks from
device-mapper, ext4, xfs, and vfs (for the copy_from_iter_ops, patch
29/33). If those come next week perhaps this can be merged for 4.12,
but if not this will need to wait until 4.13.

There are some minor collisions with Al's copy_from_user rework, the
new dax tracepoints, and the removal of discard support from the brd
driver. A sample merge is available here:

https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/log/?h=libnvdimm-for-4.12-merge

If it causes any other problems just drop and I'll retry for 4.13.

On Mon, Apr 17, 2017 at 12:08 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
> [ resend to add dm-devel, linux-block, and fs-devel, apologies for the
> duplicates ]
>
> Changes since v1 [1] and the dax-fs RFC [2]:
> * rename struct dax_inode to struct dax_device (Christoph)
> * rewrite arch_memcpy_to_pmem() in C with inline asm
> * use QUEUE_FLAG_WC to gate dax cache management (Jeff)
> * add device-mapper plumbing for the ->copy_from_iter() and ->flush()
>   dax_operations
> * kill struct blk_dax_ctl and bdev_direct_access (Christoph)
> * cleanup the ->direct_access() calling convention to be page based
>   (Christoph)
> * introduce dax_get_by_host() and don't pollute struct super_block with
>   dax_device details (Christoph)
>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008586.html
> [2]: https://lwn.net/Articles/713064/
>
> ---
> A few months back, in the course of reviewing the memcpy_nocache()
> proposal from Brian, Linus proposed that the pmem specific
> memcpy_to_pmem() routine be moved to be implemented at the driver level
> [3]:
>
>"Quite frankly, the whole 'memcpy_nocache()' idea or (ab-)using
> copy_user_nocache() just needs to die. It's idiotic.
>
> As you point out, it's also fundamentally buggy crap.
>
> Throw it away. There is no possible way this is ever valid or
> portable. We're not going to lie and claim that it is.
>
> If some driver ends up using 'movnt' by hand, that is up to that
> *driver*. But no way in hell should we care about this one whit in
> the sense of ."
>
> This feedback also dovetails with another fs/dax.c design wart of being
> hard coded to assume the backing device is pmem. We call the pmem
> specific copy, clear, and flush routines even if the backing device
> driver is one of the other 3 dax drivers (axonram, dcssblk, or brd).
> There is no reason to spend cpu cycles flushing the cache after writing
> to brd, for example, since it is using volatile memory for storage.
>
> Moreover, the pmem driver might be fronting a volatile memory range
> published by the ACPI NFIT, or the platform might have arranged to flush
> cpu caches on power fail. This latter capability is a feature that has
> appeared in embedded storage appliances (pre-ACPI-NFIT nvdimm
> platforms).
>
> So, this series:
>
> 1/ moves what was previously named "the pmem api" out of the global
>namespace and into drivers that need to be concerned with
>architecture specific persistent memory considerations.
>
> 2/ arranges for dax to stop abusing __copy_user_nocache() and implements
>a libnvdimm-local memcpy that uses 'movnt' on x86_64. This might be
>expanded in the future to use 'movntdqa' if the copy size is above
>some threshold, or expanded with support for other architectures [4].
>
> 3/ makes cache maintenance optional by arranging for dax to call driver
>specific copy and flush operations only if the driver publishes them.
>
> 4/ allows filesytem-dax cache management to be controlled by the block
>device write-cache queue flag. The pmem driver is updated to clear
>that flag by default when pmem is driving volatile memory.
>
> [3]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
> [4]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009478.html
>
> These patches have been through a round of build regression fixes
> notified by the 0day robot. All review welcome, but the patches that
> need extra attention are the device-mapper and uio changes
> (copy_from_iter_ops).
>
> This series is based on a merge of char-misc-next (for cdev api reworks)
> and libnvdimm-fixes (dax locking and __copy_user_nocache fixes).
>
> ---
>
> Dan Williams (33):
>   device-dax: rename 'dax_dev' to 'dev_dax'
>   dax: refactor dax-fs into a generic provider of 'struct dax_device' 
> instances
>   dax: add a facility to lookup a dax device by 'host' device name
>   dax: introduce dax_operations
>   pmem: add da
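
To illustrate point 3 of the cover letter, cache maintenance becomes
optional because dax only calls an operation the driver actually
published. The sketch below uses an invented ops structure to show the
idea; it is not the series' actual dax_operations definition:

#include <linux/types.h>

struct dax_operations_sketch {
        long (*direct_access)(void *private, unsigned long pgoff,
                              long nr_pages, void **kaddr);
        void (*flush)(void *private, void *addr, size_t size);
};

static void dax_flush_sketch(const struct dax_operations_sketch *ops,
                             void *private, void *addr, size_t size)
{
        /* a driver for volatile media (e.g. brd) leaves ->flush NULL */
        if (ops->flush)
                ops->flush(private, addr, size);
        /* else: backing store is volatile or flushed by the platform */
}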

Re: [PATCH] block: get rid of blk_integrity_revalidate()

2017-04-21 Thread Dan Williams
On Wed, Apr 19, 2017 at 7:04 PM, Martin K. Petersen
<martin.peter...@oracle.com> wrote:
> Ilya Dryomov <idryo...@gmail.com> writes:
>
> Ilya,
>
>> Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
>> introduced blk_integrity_revalidate(), which seems to assume ownership
>> of the stable pages flag and unilaterally clears it if no blk_integrity
>> profile is registered:
>>
>> if (bi->profile)
>> disk->queue->backing_dev_info->capabilities |=
>> BDI_CAP_STABLE_WRITES;
>> else
>> disk->queue->backing_dev_info->capabilities &=
>> ~BDI_CAP_STABLE_WRITES;
>>
>> It's called from revalidate_disk() and rescan_partitions(), making it
>> impossible to enable stable pages for drivers that support partitions
>> and don't use blk_integrity: while the call in revalidate_disk() can be
>> trivially worked around (see zram, which doesn't support partitions and
>> hence gets away with zram_revalidate_disk()), rescan_partitions() can
>> be triggered from userspace at any time.  This breaks rbd, where the
>> ceph messenger is responsible for generating/verifying CRCs.
>>
>> Since blk_integrity_{un,}register() "must" be used for (un)registering
>> the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
>> setting there.  This way drivers that call blk_integrity_register() and
>> use integrity infrastructure won't interfere with drivers that don't
>> but still want stable pages.
>
> I seem to recall that the reason for the revalidate hook was that either
> NVMe or nvdimm had to register an integrity profile prior to the actual
> format being known.
>
> So while I am OK with the change from a SCSI perspective, I think we
> need Keith and Dan to ack it.

Looks good to me,

Tested-by: Dan Williams <dan.j.willi...@intel.com>


Re: [resend PATCH v2 11/33] dm: add dax_device and dax_operations support

2017-04-20 Thread Dan Williams
On Mon, Apr 17, 2017 at 12:09 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
> Allocate a dax_device to represent the capacity of a device-mapper
> instance. Provide a ->direct_access() method via the new dax_operations
> indirection that mirrors the functionality of the current direct_access
> support via block_device_operations.  Once fs/dax.c has been converted
> to use dax_operations the old dm_blk_direct_access() will be removed.
>
> A new helper dm_dax_get_live_target() is introduced to separate some of
> the dm-specifics from the direct_access implementation.
>
> This enabling is only for the top-level dm representation to upper
> layers. Converting target direct_access implementations is deferred to a
> separate patch.
>
> Cc: Toshi Kani <toshi.k...@hpe.com>
> Cc: Mike Snitzer <snit...@redhat.com>

Hi Mike,

Any concerns with these dax_device and dax_operations changes to
device-mapper for the upcoming merge window?


> Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
> ---
>  drivers/md/Kconfig|1
>  drivers/md/dm-core.h  |1
>  drivers/md/dm.c   |   84 
> ++---
>  include/linux/device-mapper.h |1
>  4 files changed, 73 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index b7767da50c26..1de8372d9459 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
>  config BLK_DEV_DM
> tristate "Device mapper support"
> select BLK_DEV_DM_BUILTIN
> +   select DAX
> ---help---
>   Device-mapper is a low level volume manager.  It works by allowing
>   people to specify mappings for ranges of logical sectors.  Various
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index 136fda3ff9e5..538630190f66 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -58,6 +58,7 @@ struct mapped_device {
> struct target_type *immutable_target_type;
>
> struct gendisk *disk;
> +   struct dax_device *dax_dev;
> char name[16];
>
> void *interface_ptr;
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index dfb75979e455..bd56dfe43a99 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/dax.h>
>  #include 
>  #include 
>  #include 
> @@ -908,31 +909,68 @@ int dm_set_target_max_io_len(struct dm_target *ti, 
> sector_t len)
>  }
>  EXPORT_SYMBOL_GPL(dm_set_target_max_io_len);
>
> -static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
> -void **kaddr, pfn_t *pfn, long size)
> +static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
> +   sector_t sector, int *srcu_idx)
>  {
> -   struct mapped_device *md = bdev->bd_disk->private_data;
> struct dm_table *map;
> struct dm_target *ti;
> -   int srcu_idx;
> -   long len, ret = -EIO;
>
> -   map = dm_get_live_table(md, &srcu_idx);
> +   map = dm_get_live_table(md, srcu_idx);
> if (!map)
> -   goto out;
> +   return NULL;
>
> ti = dm_table_find_target(map, sector);
> if (!dm_target_is_valid(ti))
> -   goto out;
> +   return NULL;
>
> -   len = max_io_len(sector, ti) << SECTOR_SHIFT;
> -   size = min(len, size);
> +   return ti;
> +}
>
> -   if (ti->type->direct_access)
> -   ret = ti->type->direct_access(ti, sector, kaddr, pfn, size);
> -out:
> +static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> +   long nr_pages, void **kaddr, pfn_t *pfn)
> +{
> +   struct mapped_device *md = dax_get_private(dax_dev);
> +   sector_t sector = pgoff * PAGE_SECTORS;
> +   struct dm_target *ti;
> +   long len, ret = -EIO;
> +   int srcu_idx;
> +
> +   ti = dm_dax_get_live_target(md, sector, &srcu_idx);
> +
> +   if (!ti)
> +   goto out;
> +   if (!ti->type->direct_access)
> +   goto out;
> +   len = max_io_len(sector, ti) / PAGE_SECTORS;
> +   if (len < 1)
> +   goto out;
> +   nr_pages = min(len, nr_pages);
> +   if (ti->type->direct_access) {
> +   ret = ti->type->direct_access(ti, sector, kaddr, pfn,
> +   nr_pages * PAGE_SIZE);
> +   /*
> +* FIXME: convert ti->type->direct_access to return
> +* nr_pa

Re: [PATCH v3] axon_ram: add dax_operations support

2017-04-19 Thread Dan Williams
On Wed, Apr 19, 2017 at 8:01 PM, kbuild test robot <l...@intel.com> wrote:
> Hi Dan,
>
> [auto build test ERROR on powerpc/next]
> [also build test ERROR on v4.11-rc7 next-20170419]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
>
>
> url:
> https://github.com/0day-ci/linux/commits/Dan-Williams/axon_ram-add-dax_operations-support/20170420-091615
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next

Hi kbuild team, yes this is the wrong base. It's part of a larger
series [1] and I'm just re-sending a few select patches with updates
from review, rather than the full 33 patch series. Any better way to
send individual updates to a patch in a series without re-sending the
whole series?

[1]: https://lkml.org/lkml/2017/4/14/495


[PATCH v3] axon_ram: add dax_operations support

2017-04-19 Thread Dan Williams
Setup a dax_device to have the same lifetime as the axon_ram block
device and add a ->direct_access() method that is equivalent to
axon_ram_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old axon_ram_direct_access() will be removed.

Reported-by: Gerald Schaefer <gerald.schae...@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
Changes since v2:
* fix return code in the alloc_dax() failure case (Gerald)

 arch/powerpc/platforms/Kconfig |1 +
 arch/powerpc/sysdev/axonram.c  |   48 +++-
 2 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 7e3a2ebba29b..33244e3d9375 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -284,6 +284,7 @@ config CPM2
 config AXON_RAM
tristate "Axon DDR2 memory device driver"
depends on PPC_IBM_CELL_BLADE && BLOCK
+   select DAX
default m
help
  It registers one block device per Axon's DDR2 memory bank found
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index f523ac883150..171ba86a3494 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -25,6 +25,7 @@
 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 #include 
 #include 
@@ -62,6 +63,7 @@ static int azfs_major, azfs_minor;
 struct axon_ram_bank {
struct platform_device  *device;
struct gendisk  *disk;
+   struct dax_device   *dax_dev;
unsigned intirq_id;
unsigned long   ph_addr;
unsigned long   io_addr;
@@ -137,25 +139,47 @@ axon_ram_make_request(struct request_queue *queue, struct 
bio *bio)
return BLK_QC_T_NONE;
 }
 
+static long
+__axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long 
nr_pages,
+  void **kaddr, pfn_t *pfn)
+{
+   resource_size_t offset = pgoff * PAGE_SIZE;
+
+   *kaddr = (void *) bank->io_addr + offset;
+   *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+   return (bank->size - offset) / PAGE_SIZE;
+}
+
 /**
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
 static long
-axon_ram_direct_access(struct block_device *device, sector_t sector,
+axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
   void **kaddr, pfn_t *pfn, long size)
 {
struct axon_ram_bank *bank = device->bd_disk->private_data;
-   loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
 
-   *kaddr = (void *) bank->io_addr + offset;
-   *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
-   return bank->size - offset;
+   return __axon_ram_direct_access(bank, (sector * 512) / PAGE_SIZE,
+   size / PAGE_SIZE, kaddr, pfn) * PAGE_SIZE;
 }
 
 static const struct block_device_operations axon_ram_devops = {
.owner  = THIS_MODULE,
-   .direct_access  = axon_ram_direct_access
+   .direct_access  = axon_ram_blk_direct_access
+};
+
+static long
+axon_ram_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long 
nr_pages,
+  void **kaddr, pfn_t *pfn)
+{
+   struct axon_ram_bank *bank = dax_get_private(dax_dev);
+
+   return __axon_ram_direct_access(bank, pgoff, nr_pages, kaddr, pfn);
+}
+
+static const struct dax_operations axon_ram_dax_ops = {
+   .direct_access = axon_ram_dax_direct_access,
 };
 
 /**
@@ -219,6 +243,7 @@ static int axon_ram_probe(struct platform_device *device)
goto failed;
}
 
+
bank->disk->major = azfs_major;
bank->disk->first_minor = azfs_minor;
bank->disk->fops = &axon_ram_devops;
@@ -227,6 +252,13 @@ static int axon_ram_probe(struct platform_device *device)
sprintf(bank->disk->disk_name, "%s%d",
AXON_RAM_DEVICE_NAME, axon_ram_bank_id);
 
+   bank->dax_dev = alloc_dax(bank, bank->disk->disk_name,
+   &axon_ram_dax_ops);
+   if (!bank->dax_dev) {
+   rc = -ENOMEM;
+   goto failed;
+   }
+
bank->disk->queue = blk_alloc_queue(GFP_KERNEL);
if (bank->disk->queue == NULL) {
dev_err(&device->dev, "Cannot register disk queue\n");
@@ -278,6 +310,8 @@ static int axon_ram_probe(struct platform_device *device)
del_gendisk(bank->disk);
put_disk(bank->disk);
}
+   kill_dax(bank->dax_dev);
+   put_dax(bank->dax_dev);
device->dev.platform_data = NULL;
if (bank->io_addr != 0)
iounmap((void __iomem *) bank->

[PATCH v3] dax: add a facility to lookup a dax device by 'host' device name

2017-04-19 Thread Dan Williams
For the current block_device based filesystem-dax path, we need a way
for it to lookup the dax_device associated with a block_device. Add a
'host' property of a dax_device that can be used for this purpose. It is
a free form string, but for a dax_device associated with a block device
it is the bdev name.

This is a stop-gap until filesystems are able to mount on a dax-inode
directly.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
Changes since v2:
* fixed slab corruption due to non-null dax_dev->host at
  dax_i_callback() time

 drivers/dax/dax.h|2 +
 drivers/dax/device.c |2 +
 drivers/dax/super.c  |   87 --
 include/linux/dax.h  |1 +
 4 files changed, 86 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index 2472d9da96db..246a24d68d4c 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,7 +13,7 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_device;
-struct dax_device *alloc_dax(void *private);
+struct dax_device *alloc_dax(void *private, const char *host);
 void put_dax(struct dax_device *dax_dev);
 bool dax_alive(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 19a42edbfa03..db68f4fa8ce0 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -645,7 +645,7 @@ struct dev_dax *devm_create_dev_dax(struct dax_region 
*dax_region,
goto err_id;
}
 
-   dax_dev = alloc_dax(dev_dax);
+   dax_dev = alloc_dax(dev_dax, NULL);
if (!dax_dev)
goto err_dax;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c9f85f1c086e..8d446674c1da 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -30,6 +30,10 @@ static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
 
+#define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
+static struct hlist_head dax_host_list[DAX_HASH_SIZE];
+static DEFINE_SPINLOCK(dax_host_lock);
+
 int dax_read_lock(void)
 {
return srcu_read_lock(&dax_srcu);
@@ -46,12 +50,15 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
  * struct dax_device - anchor object for dax services
  * @inode: core vfs
  * @cdev: optional character interface for "device dax"
+ * @host: optional name for lookups where the device path is not available
  * @private: dax driver private data
  * @alive: !alive + rcu grace period == no new operations / mappings
  */
 struct dax_device {
+   struct hlist_node list;
struct inode inode;
struct cdev cdev;
+   const char *host;
void *private;
bool alive;
 };
@@ -63,6 +70,11 @@ bool dax_alive(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(dax_alive);
 
+static int dax_host_hash(const char *host)
+{
+   return hashlen_hash(hashlen_string("DAX", host)) % DAX_HASH_SIZE;
+}
+
 /*
  * Note, rcu is not protecting the liveness of dax_dev, rcu is ensuring
  * that any fault handlers or operations that might have seen
@@ -75,7 +87,13 @@ void kill_dax(struct dax_device *dax_dev)
return;
 
dax_dev->alive = false;
+
synchronize_srcu(&dax_srcu);
+
+   spin_lock(&dax_host_lock);
+   hlist_del_init(&dax_dev->list);
+   spin_unlock(&dax_host_lock);
+
dax_dev->private = NULL;
 }
 EXPORT_SYMBOL_GPL(kill_dax);
@@ -98,6 +116,8 @@ static void dax_i_callback(struct rcu_head *head)
struct inode *inode = container_of(head, struct inode, i_rcu);
struct dax_device *dax_dev = to_dax_dev(inode);
 
+   kfree(dax_dev->host);
+   dax_dev->host = NULL;
ida_simple_remove(&dax_minor_ida, MINOR(inode->i_rdev));
kmem_cache_free(dax_cache, dax_dev);
 }
@@ -169,26 +189,53 @@ static struct dax_device *dax_dev_get(dev_t devt)
return dax_dev;
 }
 
-struct dax_device *alloc_dax(void *private)
+static void dax_add_host(struct dax_device *dax_dev, const char *host)
+{
+   int hash;
+
+   /*
+* Unconditionally init dax_dev since it's coming from a
+* non-zeroed slab cache
+*/
+   INIT_HLIST_NODE(&dax_dev->list);
+   dax_dev->host = host;
+   if (!host)
+   return;
+
+   hash = dax_host_hash(host);
+   spin_lock(&dax_host_lock);
+   hlist_add_head(&dax_dev->list, &dax_host_list[hash]);
+   spin_unlock(&dax_host_lock);
+}
+
+struct dax_device *alloc_dax(void *private, const char *__host)
 {
struct dax_device *dax_dev;
+   const char *host;
dev_t devt;
int minor;
 
+   host = kstrdup(__host, GFP_KERNEL);
+   if (__host && !host)
+   return NULL;
+
minor = ida_simple_get(&dax_minor_ida, 0, nr_dax, GFP_KERNEL);
if (minor < 0)
-   return NULL;
+   goto err_minor;
 
devt = MKDEV(MAJOR(dax_devt), minor);
dax_dev = dax_

Re: [resend PATCH v2 08/33] dcssblk: add dax_operations support

2017-04-19 Thread Dan Williams
On Wed, Apr 19, 2017 at 8:31 AM, Gerald Schaefer
<gerald.schae...@de.ibm.com> wrote:
> On Mon, 17 Apr 2017 12:09:32 -0700
> Dan Williams <dan.j.willi...@intel.com> wrote:
>
>> Setup a dax_dev to have the same lifetime as the dcssblk block device
>> and add a ->direct_access() method that is equivalent to
>> dcssblk_direct_access(). Once fs/dax.c has been converted to use
>> dax_operations the old dcssblk_direct_access() will be removed.
>>
>> Cc: Gerald Schaefer <gerald.schae...@de.ibm.com>
>> Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
>> ---
>>  drivers/s390/block/Kconfig   |1 +
>>  drivers/s390/block/dcssblk.c |   54 
>> +++---
>>  2 files changed, 46 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
>> index 4a3b62326183..0acb8c2f9475 100644
>> --- a/drivers/s390/block/Kconfig
>> +++ b/drivers/s390/block/Kconfig
>> @@ -14,6 +14,7 @@ config BLK_DEV_XPRAM
>>
>>  config DCSSBLK
>>   def_tristate m
>> + select DAX
>>   prompt "DCSSBLK support"
>>   depends on S390 && BLOCK
>>   help
>> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
>> index 415d10a67b7a..682a9eb4934d 100644
>> --- a/drivers/s390/block/dcssblk.c
>> +++ b/drivers/s390/block/dcssblk.c
>> @@ -18,6 +18,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include <linux/dax.h>
>>  #include 
>>  #include 
>>
>> @@ -30,8 +31,10 @@ static int dcssblk_open(struct block_device *bdev, 
>> fmode_t mode);
>>  static void dcssblk_release(struct gendisk *disk, fmode_t mode);
>>  static blk_qc_t dcssblk_make_request(struct request_queue *q,
>>   struct bio *bio);
>> -static long dcssblk_direct_access(struct block_device *bdev, sector_t 
>> secnum,
>> +static long dcssblk_blk_direct_access(struct block_device *bdev, sector_t 
>> secnum,
>>void **kaddr, pfn_t *pfn, long size);
>> +static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t 
>> pgoff,
>> + long nr_pages, void **kaddr, pfn_t *pfn);
>>
>>  static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
>>
>> @@ -40,7 +43,11 @@ static const struct block_device_operations 
>> dcssblk_devops = {
>>   .owner  = THIS_MODULE,
>>   .open   = dcssblk_open,
>>   .release= dcssblk_release,
>> - .direct_access  = dcssblk_direct_access,
>> + .direct_access  = dcssblk_blk_direct_access,
>> +};
>> +
>> +static const struct dax_operations dcssblk_dax_ops = {
>> + .direct_access = dcssblk_dax_direct_access,
>>  };
>>
>>  struct dcssblk_dev_info {
>> @@ -57,6 +64,7 @@ struct dcssblk_dev_info {
>>   struct request_queue *dcssblk_queue;
>>   int num_of_segments;
>>   struct list_head seg_list;
>> + struct dax_device *dax_dev;
>>  };
>>
>>  struct segment_info {
>> @@ -389,6 +397,8 @@ dcssblk_shared_store(struct device *dev, struct 
>> device_attribute *attr, const ch
>>   }
>>   list_del(&dev_info->lh);
>>
>> + kill_dax(dev_info->dax_dev);
>> + put_dax(dev_info->dax_dev);
>>   del_gendisk(dev_info->gd);
>>   blk_cleanup_queue(dev_info->dcssblk_queue);
>>   dev_info->gd->queue = NULL;
>> @@ -525,6 +535,7 @@ dcssblk_add_store(struct device *dev, struct 
>> device_attribute *attr, const char
>>   int rc, i, j, num_of_segments;
>>   struct dcssblk_dev_info *dev_info;
>>   struct segment_info *seg_info, *temp;
>> + struct dax_device *dax_dev;
>>   char *local_buf;
>>   unsigned long seg_byte_size;
>>
>> @@ -654,6 +665,11 @@ dcssblk_add_store(struct device *dev, struct 
>> device_attribute *attr, const char
>>   if (rc)
>>   goto put_dev;
>>
>> + dax_dev = alloc_dax(dev_info, dev_info->gd->disk_name,
>> + &dcssblk_dax_ops);
>> + if (!dax_dev)
>> + goto put_dev;
>> +
>
> The returned dax_dev should be stored into dev_info->dax_dev, for later use
> by kill/put_dax(). This can also be done directly, so that we don't need the
> local dax_dev variable here.
>
> Also, in the error case, a proper rc should be set before going to put_dev,
> probably -ENOMEM.
>
> I took a quick look at the patches for the other affected drivers, and it
> looks like axonram also has the "missing rc" issue, and brd the "missing
> brd->dax_dev init" issue, pmem seems to be fine.

Thank you for taking a look. I'll get this fixed up.
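
For reference, the fixed hunk would look roughly like this (a sketch against
the code quoted above; it stores the result directly and sets a proper error
code before bailing out, as suggested):

	dev_info->dax_dev = alloc_dax(dev_info, dev_info->gd->disk_name,
			&dcssblk_dax_ops);
	if (!dev_info->dax_dev) {
		rc = -ENOMEM;
		goto put_dev;
	}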


[resend PATCH v2 02/33] dax: refactor dax-fs into a generic provider of 'struct dax_device' instances

2017-04-17 Thread Dan Williams
We want dax capable drivers to be able to publish a set of dax
operations [1]. However, we do not want to further abuse block_devices
to advertise these operations. Instead we will attach these operations
to a dax device and add a lookup mechanism to go from block device path
to a dax device. A dax capable driver like pmem or brd is responsible
for registering a dax device, alongside a block device, and then a dax
capable filesystem is responsible for retrieving the dax device by path
name if it wants to call dax_operations.

For now, we refactor the dax pseudo-fs to be a generic facility, rather
than an implementation detail, of the device-dax use case, where a "dax
device" is just an inode + dax infrastructure, and "Device DAX" is a
mapping service layered on top of that base 'struct dax_device'.
"Filesystem DAX" is then a mapping service that layers a filesystem on
top of that same base device. Filesystem DAX is associated with a
block_device for now, but perhaps directly to a dax device in the
future, or for new pmem-only filesystems.

[1]: https://lkml.org/lkml/2017/1/19/880
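
As a rough sketch of the provider-side contract this refactor establishes
(the foo_* names are made up; at this point in the series alloc_dax() takes
only the driver's private data, and the declarations live in drivers/dax/dax.h
before moving to linux/dax.h later on, so the include below is an assumption):

#include <linux/genhd.h>
#include <linux/dax.h>

struct foo_device {
	struct gendisk *disk;
	struct dax_device *dax_dev;
};

static int foo_probe(struct foo_device *foo)
{
	/* register a dax_device alongside the driver's block device */
	foo->dax_dev = alloc_dax(foo);
	if (!foo->dax_dev)
		return -ENOMEM;
	return 0;
}

static void foo_remove(struct foo_device *foo)
{
	/* fail new operations, wait out in-flight ones, then drop the ref */
	kill_dax(foo->dax_dev);
	put_dax(foo->dax_dev);
}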

Suggested-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/Makefile|2 
 drivers/dax/Kconfig |   10 +
 drivers/dax/Makefile|5 +
 drivers/dax/dax.h   |   20 +--
 drivers/dax/device-dax.h|   25 
 drivers/dax/device.c|  241 ++
 drivers/dax/pmem.c  |2 
 drivers/dax/super.c |  303 +++
 include/linux/dax.h |3 
 tools/testing/nvdimm/Kbuild |   10 +
 10 files changed, 404 insertions(+), 217 deletions(-)
 create mode 100644 drivers/dax/device-dax.h
 rename drivers/dax/{dax.c => device.c} (77%)
 create mode 100644 drivers/dax/super.c

diff --git a/drivers/Makefile b/drivers/Makefile
index 2eced9afba53..0442e982cf35 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -71,7 +71,7 @@ obj-$(CONFIG_PARPORT) += parport/
 obj-$(CONFIG_NVM)  += lightnvm/
 obj-y  += base/ block/ misc/ mfd/ nfc/
 obj-$(CONFIG_LIBNVDIMM)+= nvdimm/
-obj-$(CONFIG_DEV_DAX)  += dax/
+obj-$(CONFIG_DAX)  += dax/
 obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
 obj-$(CONFIG_NUBUS)+= nubus/
 obj-y  += macintosh/
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 9e95bf94eb13..b7053eafd88e 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,8 +1,13 @@
-menuconfig DEV_DAX
+menuconfig DAX
tristate "DAX: direct access to differentiated memory"
+   select SRCU
default m if NVDIMM_DAX
+
+if DAX
+
+config DEV_DAX
+   tristate "Device DAX: direct access mapping device"
depends on TRANSPARENT_HUGEPAGE
-   select SRCU
help
  Support raw access to differentiated (persistence, bandwidth,
  latency...) memory via an mmap(2) capable character
@@ -11,7 +16,6 @@ menuconfig DEV_DAX
  baseline memory pool.  Mappings of a /dev/daxX.Y device impose
  restrictions that make the mapping behavior deterministic.
 
-if DEV_DAX
 
 config DEV_DAX_PMEM
tristate "PMEM DAX: direct access to persistent memory"
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 27c54e38478a..dc7422530462 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -1,4 +1,7 @@
-obj-$(CONFIG_DEV_DAX) += dax.o
+obj-$(CONFIG_DAX) += dax.o
+obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
 
+dax-y := super.o
 dax_pmem-y := pmem.o
+device_dax-y := device.o
diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index ea176d875d60..2472d9da96db 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ * Copyright(c) 2016 - 2017 Intel Corporation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of version 2 of the GNU General Public License as
@@ -12,14 +12,12 @@
  */
 #ifndef __DAX_H__
 #define __DAX_H__
-struct device;
-struct dev_dax;
-struct resource;
-struct dax_region;
-void dax_region_put(struct dax_region *dax_region);
-struct dax_region *alloc_dax_region(struct device *parent,
-   int region_id, struct resource *res, unsigned int align,
-   void *addr, unsigned long flags);
-struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
-   struct resource *res, int count);
+struct dax_device;
+struct dax_device *alloc_dax(void *private);
+void put_dax(struct dax_device *dax_dev);
+bool dax_alive(struct dax_device *dax_dev);
+void kill_dax(struct dax_device *dax_dev);
+struct dax_device *inode_dax(struct inode *inode);
+struct inode *dax_inode(struct dax_device *dax_dev);
+void

[resend PATCH v2 01/33] device-dax: rename 'dax_dev' to 'dev_dax'

2017-04-17 Thread Dan Williams
In preparation for introducing a struct dax_device type to the kernel
global type namespace, rename dax_dev to dev_dax. A 'dax_device'
instance will be a generic device-driver object for any provider of dax
functionality. A 'dev_dax' object is a device-dax-driver local /
internal instance.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/dax/dax.c  |  206 ++--
 drivers/dax/dax.h  |4 +
 drivers/dax/pmem.c |8 +-
 3 files changed, 109 insertions(+), 109 deletions(-)

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 352cc54056ce..376fdd353aea 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -57,7 +57,7 @@ struct dax_region {
 };
 
 /**
- * struct dax_dev - subdivision of a dax region
+ * struct dev_dax - instance data for a subdivision of a dax region
  * @region - parent region
  * @dev - device backing the character device
  * @cdev - core chardev data
@@ -66,7 +66,7 @@ struct dax_region {
  * @num_resources - number of physical address extents in this device
  * @res - array of physical address ranges
  */
-struct dax_dev {
+struct dev_dax {
struct dax_region *region;
struct inode *inode;
struct device dev;
@@ -323,47 +323,47 @@ struct dax_region *alloc_dax_region(struct device 
*parent, int region_id,
 }
 EXPORT_SYMBOL_GPL(alloc_dax_region);
 
-static struct dax_dev *to_dax_dev(struct device *dev)
+static struct dev_dax *to_dev_dax(struct device *dev)
 {
-   return container_of(dev, struct dax_dev, dev);
+   return container_of(dev, struct dev_dax, dev);
 }
 
 static ssize_t size_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
-   struct dax_dev *dax_dev = to_dax_dev(dev);
+   struct dev_dax *dev_dax = to_dev_dax(dev);
unsigned long long size = 0;
int i;
 
-   for (i = 0; i < dax_dev->num_resources; i++)
-   size += resource_size(&dax_dev->res[i]);
+   for (i = 0; i < dev_dax->num_resources; i++)
+   size += resource_size(&dev_dax->res[i]);
 
return sprintf(buf, "%llu\n", size);
 }
 static DEVICE_ATTR_RO(size);
 
-static struct attribute *dax_device_attributes[] = {
+static struct attribute *dev_dax_attributes[] = {
&dev_attr_size.attr,
NULL,
 };
 
-static const struct attribute_group dax_device_attribute_group = {
-   .attrs = dax_device_attributes,
+static const struct attribute_group dev_dax_attribute_group = {
+   .attrs = dev_dax_attributes,
 };
 
 static const struct attribute_group *dax_attribute_groups[] = {
-   &dax_device_attribute_group,
+   &dev_dax_attribute_group,
NULL,
 };
 
-static int check_vma(struct dax_dev *dax_dev, struct vm_area_struct *vma,
+static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
const char *func)
 {
-   struct dax_region *dax_region = dax_dev->region;
-   struct device *dev = &dax_dev->dev;
+   struct dax_region *dax_region = dev_dax->region;
+   struct device *dev = &dev_dax->dev;
unsigned long mask;
 
-   if (!dax_dev->alive)
+   if (!dev_dax->alive)
return -ENXIO;
 
/* prevent private mappings from being established */
@@ -397,23 +397,23 @@ static int check_vma(struct dax_dev *dax_dev, struct 
vm_area_struct *vma,
return 0;
 }
 
-static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
+static phys_addr_t pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
unsigned long size)
 {
struct resource *res;
phys_addr_t phys;
int i;
 
-   for (i = 0; i < dax_dev->num_resources; i++) {
-   res = &dax_dev->res[i];
+   for (i = 0; i < dev_dax->num_resources; i++) {
+   res = &dev_dax->res[i];
phys = pgoff * PAGE_SIZE + res->start;
if (phys >= res->start && phys <= res->end)
break;
pgoff -= PHYS_PFN(resource_size(res));
}
 
-   if (i < dax_dev->num_resources) {
-   res = &dax_dev->res[i];
+   if (i < dev_dax->num_resources) {
+   res = &dev_dax->res[i];
if (phys + size - 1 <= res->end)
return phys;
}
@@ -421,19 +421,19 @@ static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, 
pgoff_t pgoff,
return -1;
 }
 
-static int __dax_dev_pte_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
+static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 {
-   struct device *dev = &dax_dev->dev;
+   struct device *dev = &dev_dax->dev;
struct dax_region *dax_region;
int rc = VM_FAULT_SIGBUS;
phys_addr_t phys;
pfn_t pfn;
unsigned int fault_size = PAGE_SIZE;
 
-   if (check_vma(dax_dev, vmf->vma, __func__))
+   

[resend PATCH v2 03/33] dax: add a facility to lookup a dax device by 'host' device name

2017-04-17 Thread Dan Williams
For the current block_device based filesystem-dax path, we need a way
for it to look up the dax_device associated with a block_device. Add a
'host' property of a dax_device that can be used for this purpose. It is
a free form string, but for a dax_device associated with a block device
it is the bdev name.

This is a stop-gap until filesystems are able to mount on a dax-inode
directly.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/dax/dax.h|2 +
 drivers/dax/device.c |2 +
 drivers/dax/super.c  |   83 --
 include/linux/dax.h  |1 +
 4 files changed, 82 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index 2472d9da96db..246a24d68d4c 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,7 +13,7 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_device;
-struct dax_device *alloc_dax(void *private);
+struct dax_device *alloc_dax(void *private, const char *host);
 void put_dax(struct dax_device *dax_dev);
 bool dax_alive(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 19a42edbfa03..db68f4fa8ce0 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -645,7 +645,7 @@ struct dev_dax *devm_create_dev_dax(struct dax_region 
*dax_region,
goto err_id;
}
 
-   dax_dev = alloc_dax(dev_dax);
+   dax_dev = alloc_dax(dev_dax, NULL);
if (!dax_dev)
goto err_dax;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c9f85f1c086e..bb22956a106b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -30,6 +30,10 @@ static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
 
+#define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
+static struct hlist_head dax_host_list[DAX_HASH_SIZE];
+static DEFINE_SPINLOCK(dax_host_lock);
+
 int dax_read_lock(void)
 {
return srcu_read_lock(&dax_srcu);
@@ -46,12 +50,15 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
  * struct dax_device - anchor object for dax services
  * @inode: core vfs
  * @cdev: optional character interface for "device dax"
+ * @host: optional name for lookups where the device path is not available
  * @private: dax driver private data
  * @alive: !alive + rcu grace period == no new operations / mappings
  */
 struct dax_device {
+   struct hlist_node list;
struct inode inode;
struct cdev cdev;
+   const char *host;
void *private;
bool alive;
 };
@@ -63,6 +70,11 @@ bool dax_alive(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(dax_alive);
 
+static int dax_host_hash(const char *host)
+{
+   return hashlen_hash(hashlen_string("DAX", host)) % DAX_HASH_SIZE;
+}
+
 /*
  * Note, rcu is not protecting the liveness of dax_dev, rcu is ensuring
  * that any fault handlers or operations that might have seen
@@ -75,6 +87,12 @@ void kill_dax(struct dax_device *dax_dev)
return;
 
dax_dev->alive = false;
+
+   spin_lock(&dax_host_lock);
+   if (!hlist_unhashed(&dax_dev->list))
+   hlist_del_init(&dax_dev->list);
+   spin_unlock(&dax_host_lock);
+
synchronize_srcu(&dax_srcu);
dax_dev->private = NULL;
 }
@@ -98,6 +116,8 @@ static void dax_i_callback(struct rcu_head *head)
struct inode *inode = container_of(head, struct inode, i_rcu);
struct dax_device *dax_dev = to_dax_dev(inode);
 
+   kfree(dax_dev->host);
+   dax_dev->host = NULL;
ida_simple_remove(&dax_minor_ida, MINOR(inode->i_rdev));
kmem_cache_free(dax_cache, dax_dev);
 }
@@ -169,26 +189,49 @@ static struct dax_device *dax_dev_get(dev_t devt)
return dax_dev;
 }
 
-struct dax_device *alloc_dax(void *private)
+static void dax_add_host(struct dax_device *dax_dev, const char *host)
+{
+   int hash;
+
+   INIT_HLIST_NODE(&dax_dev->list);
+   if (!host)
+   return;
+
+   dax_dev->host = host;
+   hash = dax_host_hash(host);
+   spin_lock(&dax_host_lock);
+   hlist_add_head(&dax_dev->list, &dax_host_list[hash]);
+   spin_unlock(&dax_host_lock);
+}
+
+struct dax_device *alloc_dax(void *private, const char *__host)
 {
struct dax_device *dax_dev;
+   const char *host;
dev_t devt;
int minor;
 
+   host = kstrdup(__host, GFP_KERNEL);
+   if (__host && !host)
+   return NULL;
+
minor = ida_simple_get(&dax_minor_ida, 0, nr_dax, GFP_KERNEL);
if (minor < 0)
-   return NULL;
+   goto err_minor;
 
devt = MKDEV(MAJOR(dax_devt), minor);
dax_dev = dax_dev_get(devt);
if (!dax_dev)
-   goto err_inode;
+   goto err_dev;
 
+   dax_add_host(dax_dev, host);
dax_dev->private = private;
return dax_dev;

[resend PATCH v2 07/33] brd: add dax_operations support

2017-04-17 Thread Dan Williams
Setup a dax_inode to have the same lifetime as the brd block device and
add a ->direct_access() method that is equivalent to
brd_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old brd_direct_access() will be removed.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/block/Kconfig |1 +
 drivers/block/brd.c   |   65 +
 2 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index f744de7a0f9b..e66956fc2c88 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -339,6 +339,7 @@ config BLK_DEV_SX8
 
 config BLK_DEV_RAM
tristate "RAM block device support"
+   select DAX if BLK_DEV_RAM_DAX
---help---
  Saying Y here will allow you to use a portion of your RAM memory as
  a block device, so that you can make file systems on it, read and
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..60f3193c9ce2 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -21,6 +21,7 @@
 #include 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 #include 
+#include <linux/dax.h>
 #endif
 
 #include 
@@ -41,6 +42,9 @@ struct brd_device {
 
struct request_queue*brd_queue;
struct gendisk  *brd_disk;
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   struct dax_device   *dax_dev;
+#endif
struct list_headbrd_list;
 
/*
@@ -375,30 +379,53 @@ static int brd_rw_page(struct block_device *bdev, 
sector_t sector,
 }
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
-static long brd_direct_access(struct block_device *bdev, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
+static long __brd_direct_access(struct brd_device *brd, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
 {
-   struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
 
if (!brd)
return -ENODEV;
-   page = brd_insert_page(brd, sector);
+   page = brd_insert_page(brd, PFN_PHYS(pgoff) / 512);
if (!page)
return -ENOSPC;
*kaddr = page_address(page);
*pfn = page_to_pfn_t(page);
 
-   return PAGE_SIZE;
+   return 1;
+}
+
+static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   struct brd_device *brd = bdev->bd_disk->private_data;
+   long nr_pages = __brd_direct_access(brd, PHYS_PFN(sector * 512),
+   PHYS_PFN(size), kaddr, pfn);
+
+   if (nr_pages < 0)
+   return nr_pages;
+   return nr_pages * PAGE_SIZE;
+}
+
+static long brd_dax_direct_access(struct dax_device *dax_dev,
+   pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
+{
+   struct brd_device *brd = dax_get_private(dax_dev);
+
+   return __brd_direct_access(brd, pgoff, nr_pages, kaddr, pfn);
 }
+
+static const struct dax_operations brd_dax_ops = {
+   .direct_access = brd_dax_direct_access,
+};
 #else
-#define brd_direct_access NULL
+#define brd_blk_direct_access NULL
 #endif
 
 static const struct block_device_operations brd_fops = {
.owner =THIS_MODULE,
.rw_page =  brd_rw_page,
-   .direct_access =brd_direct_access,
+   .direct_access =brd_blk_direct_access,
 };
 
 /*
@@ -441,7 +468,9 @@ static struct brd_device *brd_alloc(int i)
 {
struct brd_device *brd;
struct gendisk *disk;
-
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   struct dax_device *dax_dev;
+#endif
brd = kzalloc(sizeof(*brd), GFP_KERNEL);
if (!brd)
goto out;
@@ -469,9 +498,6 @@ static struct brd_device *brd_alloc(int i)
blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
brd->brd_queue->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
-#ifdef CONFIG_BLK_DEV_RAM_DAX
-   queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
-#endif
disk = brd->brd_disk = alloc_disk(max_part);
if (!disk)
goto out_free_queue;
@@ -484,8 +510,21 @@ static struct brd_device *brd_alloc(int i)
sprintf(disk->disk_name, "ram%d", i);
set_capacity(disk, rd_size * 2);
 
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
+   dax_dev = alloc_dax(brd, disk->disk_name, &brd_dax_ops);
+   if (!dax_dev)
+   goto out_free_inode;
+#endif
+
+
return brd;
 
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+out_free_inode:
+   kill_dax(dax_dev);
+   put_dax(dax_dev);
+#endif
 out_free_queue:
blk_cleanup_queue(brd->brd_queue);
 out_free_dev:
@@ -525,6 +564,10 @@ static struct brd_device *brd_init_one(int i, bool *new)
 static void brd_del_one(struct brd_device *brd)
 {
  

[resend PATCH v2 06/33] axon_ram: add dax_operations support

2017-04-17 Thread Dan Williams
Setup a dax_device to have the same lifetime as the axon_ram block
device and add a ->direct_access() method that is equivalent to
axon_ram_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old axon_ram_direct_access() will be removed.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/powerpc/platforms/Kconfig |1 +
 arch/powerpc/sysdev/axonram.c  |   48 +++-
 2 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 7e3a2ebba29b..33244e3d9375 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -284,6 +284,7 @@ config CPM2
 config AXON_RAM
tristate "Axon DDR2 memory device driver"
depends on PPC_IBM_CELL_BLADE && BLOCK
+   select DAX
default m
help
  It registers one block device per Axon's DDR2 memory bank found
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index f523ac883150..ad857d5e81b1 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -25,6 +25,7 @@
 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 #include 
 #include 
@@ -62,6 +63,7 @@ static int azfs_major, azfs_minor;
 struct axon_ram_bank {
struct platform_device  *device;
struct gendisk  *disk;
+   struct dax_device   *dax_dev;
unsigned intirq_id;
unsigned long   ph_addr;
unsigned long   io_addr;
@@ -137,25 +139,47 @@ axon_ram_make_request(struct request_queue *queue, struct 
bio *bio)
return BLK_QC_T_NONE;
 }
 
+static long
+__axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long 
nr_pages,
+  void **kaddr, pfn_t *pfn)
+{
+   resource_size_t offset = pgoff * PAGE_SIZE;
+
+   *kaddr = (void *) bank->io_addr + offset;
+   *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+   return (bank->size - offset) / PAGE_SIZE;
+}
+
 /**
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
 static long
-axon_ram_direct_access(struct block_device *device, sector_t sector,
+axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
   void **kaddr, pfn_t *pfn, long size)
 {
struct axon_ram_bank *bank = device->bd_disk->private_data;
-   loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
 
-   *kaddr = (void *) bank->io_addr + offset;
-   *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
-   return bank->size - offset;
+   return __axon_ram_direct_access(bank, (sector * 512) / PAGE_SIZE,
+   size / PAGE_SIZE, kaddr, pfn) * PAGE_SIZE;
 }
 
 static const struct block_device_operations axon_ram_devops = {
.owner  = THIS_MODULE,
-   .direct_access  = axon_ram_direct_access
+   .direct_access  = axon_ram_blk_direct_access
+};
+
+static long
+axon_ram_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long 
nr_pages,
+  void **kaddr, pfn_t *pfn)
+{
+   struct axon_ram_bank *bank = dax_get_private(dax_dev);
+
+   return __axon_ram_direct_access(bank, pgoff, nr_pages, kaddr, pfn);
+}
+
+static const struct dax_operations axon_ram_dax_ops = {
+   .direct_access = axon_ram_dax_direct_access,
 };
 
 /**
@@ -219,6 +243,7 @@ static int axon_ram_probe(struct platform_device *device)
goto failed;
}
 
+
bank->disk->major = azfs_major;
bank->disk->first_minor = azfs_minor;
bank->disk->fops = &axon_ram_devops;
@@ -227,6 +252,11 @@ static int axon_ram_probe(struct platform_device *device)
sprintf(bank->disk->disk_name, "%s%d",
AXON_RAM_DEVICE_NAME, axon_ram_bank_id);
 
+   bank->dax_dev = alloc_dax(bank, bank->disk->disk_name,
+   &axon_ram_dax_ops);
+   if (!bank->dax_dev)
+   goto failed;
+
bank->disk->queue = blk_alloc_queue(GFP_KERNEL);
if (bank->disk->queue == NULL) {
dev_err(&device->dev, "Cannot register disk queue\n");
@@ -278,6 +308,10 @@ static int axon_ram_probe(struct platform_device *device)
del_gendisk(bank->disk);
put_disk(bank->disk);
}
+   if (bank->dax_dev) {
+   kill_dax(bank->dax_dev);
+   put_dax(bank->dax_dev);
+   }
device->dev.platform_data = NULL;
if (bank->io_addr != 0)
iounmap((void __iomem *) bank->io_addr);
@@ -300,6 +334,8 @@ axon_ram_remove(struct platform_device *device)
 
device_remove_file(&

[resend PATCH v2 05/33] pmem: add dax_operations support

2017-04-17 Thread Dan Williams
Setup a dax_device to have the same lifetime as the pmem block device
and add a ->direct_access() method that is equivalent to
pmem_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old pmem_direct_access() will be removed.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/dax/dax.h   |7 
 drivers/nvdimm/Kconfig  |1 +
 drivers/nvdimm/pmem.c   |   61 +++
 drivers/nvdimm/pmem.h   |7 +++-
 include/linux/dax.h |6 
 tools/testing/nvdimm/pmem-dax.c |   21 ++---
 6 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index 617bbc24be2b..f9e5feea742c 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,13 +13,6 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_device;
-struct dax_operations;
-struct dax_device *alloc_dax(void *private, const char *host,
-   const struct dax_operations *ops);
-void put_dax(struct dax_device *dax_dev);
-bool dax_alive(struct dax_device *dax_dev);
-void kill_dax(struct dax_device *dax_dev);
 struct dax_device *inode_dax(struct inode *inode);
 struct inode *dax_inode(struct dax_device *dax_dev);
-void *dax_get_private(struct dax_device *dax_dev);
 #endif /* __DAX_H__ */
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 59e750183b7f..5bdd499b5f4f 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -20,6 +20,7 @@ if LIBNVDIMM
 config BLK_DEV_PMEM
tristate "PMEM: Persistent memory block device support"
default LIBNVDIMM
+   select DAX
select ND_BTT if BTT
select ND_PFN if NVDIMM_PFN
help
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5b536be5a12e..fbbcf8154eec 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 #include "pmem.h"
 #include "pfn.h"
@@ -199,13 +200,13 @@ static int pmem_rw_page(struct block_device *bdev, 
sector_t sector,
 }
 
 /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
-__weak long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
+__weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
 {
-   struct pmem_device *pmem = bdev->bd_queue->queuedata;
-   resource_size_t offset = sector * 512 + pmem->data_offset;
+   resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
 
-   if (unlikely(is_bad_pmem(&pmem->bb, sector, size)))
+   if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
+   PFN_PHYS(nr_pages))))
return -EIO;
*kaddr = pmem->virt_addr + offset;
*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
@@ -215,26 +216,51 @@ __weak long pmem_direct_access(struct block_device *bdev, 
sector_t sector,
 * requested range.
 */
if (unlikely(pmem->bb.count))
-   return size;
-   return pmem->size - pmem->pfn_pad - offset;
+   return nr_pages;
+   return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
+}
+
+static long pmem_blk_direct_access(struct block_device *bdev, sector_t sector,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   struct pmem_device *pmem = bdev->bd_queue->queuedata;
+
+   return __pmem_direct_access(pmem, PHYS_PFN(sector * 512),
+   PHYS_PFN(size), kaddr, pfn);
 }
 
 static const struct block_device_operations pmem_fops = {
.owner =THIS_MODULE,
.rw_page =  pmem_rw_page,
-   .direct_access =pmem_direct_access,
+   .direct_access =pmem_blk_direct_access,
.revalidate_disk =  nvdimm_revalidate_disk,
 };
 
+static long pmem_dax_direct_access(struct dax_device *dax_dev,
+   pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
+{
+   struct pmem_device *pmem = dax_get_private(dax_dev);
+
+   return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
+}
+
+static const struct dax_operations pmem_dax_ops = {
+   .direct_access = pmem_dax_direct_access,
+};
+
 static void pmem_release_queue(void *q)
 {
blk_cleanup_queue(q);
 }
 
-static void pmem_release_disk(void *disk)
+static void pmem_release_disk(void *__pmem)
 {
-   del_gendisk(disk);
-   put_disk(disk);
+   struct pmem_device *pmem = __pmem;
+
+   kill_dax(pmem->dax_dev);
+   put_dax(pmem->dax_dev);
+   del_gendisk(pmem->disk);
+   put_disk(pmem->disk);
 }
 
 static int pmem_attach_disk(struct device *dev,
@@ -245,6 +271,7 @@ static int pmem_attach_disk(struct device *dev,
struct vm

[resend PATCH v2 13/33] ext2, ext4, xfs: retrieve dax_device for iomap operations

2017-04-17 Thread Dan Williams
In preparation for converting fs/dax.c to use dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the
dax_device associated with a given block_device.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/ext2/inode.c   |9 -
 fs/ext4/inode.c   |9 -
 fs/xfs/xfs_iomap.c|   10 ++
 include/linux/iomap.h |1 +
 4 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 128cce540645..4c9d2d44e879 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -799,6 +799,7 @@ int ext2_get_block(struct inode *inode, sector_t iblock,
 static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
 {
+   struct block_device *bdev;
unsigned int blkbits = inode->i_blkbits;
unsigned long first_block = offset >> blkbits;
unsigned long max_blocks = (length + (1 << blkbits) - 1) >> blkbits;
@@ -812,8 +813,13 @@ static int ext2_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
return ret;
 
iomap->flags = 0;
-   iomap->bdev = inode->i_sb->s_bdev;
+   bdev = inode->i_sb->s_bdev;
+   iomap->bdev = bdev;
iomap->offset = (u64)first_block << blkbits;
+   if (blk_queue_dax(bdev->bd_queue))
+   iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+   else
+   iomap->dax_dev = NULL;
 
if (ret == 0) {
iomap->type = IOMAP_HOLE;
@@ -835,6 +841,7 @@ static int
 ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
 {
+   put_dax(iomap->dax_dev);
if (iomap->type == IOMAP_MAPPED &&
written < length &&
(flags & IOMAP_WRITE))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4247d8d25687..2cb2634daa99 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3305,6 +3305,7 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
 static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
 {
+   struct block_device *bdev;
unsigned int blkbits = inode->i_blkbits;
unsigned long first_block = offset >> blkbits;
unsigned long last_block = (offset + length - 1) >> blkbits;
@@ -3373,7 +3374,12 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
}
 
iomap->flags = 0;
-   iomap->bdev = inode->i_sb->s_bdev;
+   bdev = inode->i_sb->s_bdev;
+   iomap->bdev = bdev;
+   if (blk_queue_dax(bdev->bd_queue))
+   iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+   else
+   iomap->dax_dev = NULL;
iomap->offset = first_block << blkbits;
 
if (ret == 0) {
@@ -3406,6 +3412,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t 
offset, loff_t length,
int blkbits = inode->i_blkbits;
bool truncate = false;
 
+   put_dax(iomap->dax_dev);
if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
return 0;
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 288ee5b840d7..4b47403f8089 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -976,6 +976,7 @@ xfs_file_iomap_begin(
int nimaps = 1, error = 0;
boolshared = false, trimmed = false;
unsignedlockmode;
+   struct block_device *bdev;
 
if (XFS_FORCED_SHUTDOWN(mp))
return -EIO;
@@ -1063,6 +1064,14 @@ xfs_file_iomap_begin(
}
 
xfs_bmbt_to_iomap(ip, iomap, &imap);
+
+   /* optionally associate a dax device with the iomap bdev */
+   bdev = iomap->bdev;
+   if (blk_queue_dax(bdev->bd_queue))
+   iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+   else
+   iomap->dax_dev = NULL;
+
if (shared)
iomap->flags |= IOMAP_F_SHARED;
return 0;
@@ -1140,6 +1149,7 @@ xfs_file_iomap_end(
unsignedflags,
struct iomap*iomap)
 {
+   put_dax(iomap->dax_dev);
if ((flags & IOMAP_WRITE) && iomap->type == IOMAP_DELALLOC)
return xfs_file_iomap_end_delalloc(XFS_I(inode), offset,
length, written, iomap);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7291810067eb..f753e788da31 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -41,6 +41,7 @@ struct iomap {
u16 type;   /* type of mapping */
u16 flags;  /* flags for mapping */
struct block_device *bdev;  /* block device for I/O */
+   struct dax_device   *dax_dev; /* dax_dev for dax operations */
 };
 
 /*



[resend PATCH v2 10/33] dax: introduce dax_direct_access()

2017-04-17 Thread Dan Williams
Replace bdev_direct_access() with dax_direct_access() that uses
dax_device and dax_operations instead of a block_device and
block_device_operations for dax. Once all consumers of the old api have
been converted bdev_direct_access() will be deleted.

Given that block device partitioning decisions can cause dax page
alignment constraints to be violated this also introduces the
bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
to the dax_device and also checks for page alignment.
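
Sketch of the resulting calling convention, mirroring how the dm and
filesystem conversions later in the series combine the two helpers
(example_dax_begin() and its argument list are hypothetical):

#include <linux/blkdev.h>
#include <linux/dax.h>
#include <linux/pfn_t.h>

static long example_dax_begin(struct block_device *bdev,
		struct dax_device *dax_dev, sector_t sector, long nr_pages,
		void **kaddr, pfn_t *pfn)
{
	pgoff_t pgoff;
	long rc;

	/*
	 * Translate a (possibly partition-relative) sector to a pgoff
	 * relative to the whole dax_device, rejecting unaligned requests.
	 */
	rc = bdev_dax_pgoff(bdev, sector, nr_pages * PAGE_SIZE, &pgoff);
	if (rc)
		return rc;

	/* returns the number of pages mapped at @pgoff, or a negative errno */
	return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
}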

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 block/Kconfig  |1 +
 drivers/dax/super.c|   39 +++
 fs/block_dev.c |   14 ++
 include/linux/blkdev.h |1 +
 include/linux/dax.h|2 ++
 5 files changed, 57 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index e9f780f815f5..93da7fc3f254 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -6,6 +6,7 @@ menuconfig BLOCK
default y
select SBITMAP
select SRCU
+   select DAX
help
 Provide block layer support for the kernel.
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 45ccfc043da8..23ce3ab49f10 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -65,6 +65,45 @@ struct dax_device {
const struct dax_operations *ops;
 };
 
+/**
+ * dax_direct_access() - translate a device pgoff to an absolute pfn
+ * @dax_dev: a dax_device instance representing the logical memory range
+ * @pgoff: offset in pages from the start of the device to translate
+ * @nr_pages: number of consecutive pages caller can handle relative to @pfn
+ * @kaddr: output parameter that returns a virtual address mapping of pfn
+ * @pfn: output parameter that returns an absolute pfn translation of @pgoff
+ *
+ * Return: negative errno if an error occurs, otherwise the number of
+ * pages accessible at the device relative @pgoff.
+ */
+long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long 
nr_pages,
+   void **kaddr, pfn_t *pfn)
+{
+   long avail;
+
+   /*
+* The device driver is allowed to sleep, in order to make the
+* memory directly accessible.
+*/
+   might_sleep();
+
+   if (!dax_dev)
+   return -EOPNOTSUPP;
+
+   if (!dax_alive(dax_dev))
+   return -ENXIO;
+
+   if (nr_pages < 0)
+   return nr_pages;
+
+   avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
+   kaddr, pfn);
+   if (!avail)
+   return -ERANGE;
+   return min(avail, nr_pages);
+}
+EXPORT_SYMBOL_GPL(dax_direct_access);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
lockdep_assert_held(&dax_srcu);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7f40ea2f0875..2f7885712575 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 #include 
 #include 
@@ -762,6 +763,19 @@ long bdev_direct_access(struct block_device *bdev, struct 
blk_dax_ctl *dax)
 }
 EXPORT_SYMBOL_GPL(bdev_direct_access);
 
+int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
+   pgoff_t *pgoff)
+{
+   phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+
+   if (pgoff)
+   *pgoff = PHYS_PFN(phys_off);
+   if (phys_off % PAGE_SIZE || size % PAGE_SIZE)
+   return -EINVAL;
+   return 0;
+}
+EXPORT_SYMBOL(bdev_dax_pgoff);
+
 /**
  * bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f72708399b83..612c497d1461 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1958,6 +1958,7 @@ extern int bdev_write_page(struct block_device *, 
sector_t, struct page *,
struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 extern int bdev_dax_supported(struct super_block *, int);
+int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 39a0312c45c3..7e62e280c11f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -27,6 +27,8 @@ void put_dax(struct dax_device *dax_dev);
 bool dax_alive(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
+long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long 
nr_pages,
+   void **kaddr, pfn_t *pfn);
 
 /*
  * We use lowest available bit in exceptional entry for locking, one bit for



[resend PATCH v2 12/33] dm: teach dm-targets to use a dax_device + dax_operations

2017-04-17 Thread Dan Williams
Arrange for dm to lookup the dax services available from member devices.
Update the dax-capable targets, linear and stripe, to route dax
operations to the underlying device. Changes the target-internal
->direct_access() method to more closely align with the dax_operations
->direct_access() calling convention.
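
(The dm-core side that actually captures a member device's dax_device lives
in the dm.c / device-mapper.h hunks truncated below; conceptually it amounts
to something like the following when a table device is opened and closed,
with the exact function and field names here assumed for the sketch:)

	/* on open: remember the member device's dax_device by host name */
	dd->dm_dev.bdev = bdev;
	dd->dm_dev.dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);

	/* on close: drop the reference again */
	put_dax(dd->dm_dev.dax_dev);
	dd->dm_dev.dax_dev = NULL;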

Cc: Toshi Kani <toshi.k...@hpe.com>
Cc: Mike Snitzer <snit...@redhat.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/md/dm-linear.c|   27 +--
 drivers/md/dm-snap.c  |6 +++---
 drivers/md/dm-stripe.c|   29 ++---
 drivers/md/dm-target.c|6 +++---
 drivers/md/dm.c   |   16 ++--
 include/linux/device-mapper.h |7 ---
 6 files changed, 43 insertions(+), 48 deletions(-)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 4788b0b989a9..c5a52f4dae81 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 #include 
 
@@ -141,22 +142,20 @@ static int linear_iterate_devices(struct dm_target *ti,
return fn(ti, lc->dev, lc->start, ti->len, data);
 }
 
-static long linear_direct_access(struct dm_target *ti, sector_t sector,
-void **kaddr, pfn_t *pfn, long size)
+static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
 {
+   long ret;
struct linear_c *lc = ti->private;
struct block_device *bdev = lc->dev->bdev;
-   struct blk_dax_ctl dax = {
-   .sector = linear_map_sector(ti, sector),
-   .size = size,
-   };
-   long ret;
-
-   ret = bdev_direct_access(bdev, &dax);
-   *kaddr = dax.addr;
-   *pfn = dax.pfn;
-
-   return ret;
+   struct dax_device *dax_dev = lc->dev->dax_dev;
+   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+   dev_sector = linear_map_sector(ti, sector);
+   ret = bdev_dax_pgoff(bdev, dev_sector, nr_pages * PAGE_SIZE, &pgoff);
+   if (ret)
+   return ret;
+   return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
 }
 
 static struct target_type linear_target = {
@@ -169,7 +168,7 @@ static struct target_type linear_target = {
.status = linear_status,
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
-   .direct_access = linear_direct_access,
+   .direct_access = linear_dax_direct_access,
 };
 
 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index c65feeada864..e152d9817c81 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2302,8 +2302,8 @@ static int origin_map(struct dm_target *ti, struct bio 
*bio)
return do_origin(o->dev, bio);
 }
 
-static long origin_direct_access(struct dm_target *ti, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
+static long origin_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
 {
DMWARN("device does not support dax.");
return -EIO;
@@ -2368,7 +2368,7 @@ static struct target_type origin_target = {
.postsuspend = origin_postsuspend,
.status  = origin_status,
.iterate_devices = origin_iterate_devices,
-   .direct_access = origin_direct_access,
+   .direct_access = origin_dax_direct_access,
 };
 
 static struct target_type snapshot_target = {
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 28193a57bf47..cb4b1e9e16ab 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 #include 
 
@@ -308,27 +309,25 @@ static int stripe_map(struct dm_target *ti, struct bio 
*bio)
return DM_MAPIO_REMAPPED;
 }
 
-static long stripe_direct_access(struct dm_target *ti, sector_t sector,
-void **kaddr, pfn_t *pfn, long size)
+static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
 {
+   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
struct stripe_c *sc = ti->private;
-   uint32_t stripe;
+   struct dax_device *dax_dev;
struct block_device *bdev;
-   struct blk_dax_ctl dax = {
-   .size = size,
-   };
+   uint32_t stripe;
long ret;
 
-   stripe_map_sector(sc, sector, &stripe, &dax.sector);
-
-   dax.sector += sc->stripe[stripe].physical_start;
+   stripe_map_sector(sc, sector, &stripe, &dev_sector);
+   dev_sector += sc->stripe[stripe].physical_start;
+   dax_dev = sc->stripe[stripe].dev->dax_dev;
bdev = sc->stripe[stripe].dev->bdev;
 
-   ret = bdev_direct_access(bdev, &dax);
-   *kaddr = dax.addr;
-   *

[resend PATCH v2 14/33] Revert "block: use DAX for partition table reads"

2017-04-17 Thread Dan Williams
commit d1a5f2b4d8a1 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.

Now that we are moving ->direct_access() from a block_device operation
to a dax_inode operation we would need block devices to map and carry
their own dax_inode reference.

Unless / until we decide to revive dax mapping of raw block devices
through the dax_inode scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
223757016837 ("block_dev: remove DAX leftovers").

Cc: Jeff Moyer <jmo...@redhat.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 block/partition-generic.c |   17 ++---
 fs/dax.c  |   20 
 include/linux/dax.h   |6 --
 3 files changed, 2 insertions(+), 41 deletions(-)

diff --git a/block/partition-generic.c b/block/partition-generic.c
index 7afb9907821f..5dfac337b0f2 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -16,7 +16,6 @@
 #include 
 #include 
 #include 
-#include <linux/dax.h>
 #include 
 
 #include "partitions/check.h"
@@ -631,24 +630,12 @@ int invalidate_partitions(struct gendisk *disk, struct 
block_device *bdev)
return 0;
 }
 
-static struct page *read_pagecache_sector(struct block_device *bdev, sector_t 
n)
-{
-   struct address_space *mapping = bdev->bd_inode->i_mapping;
-
-   return read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)),
-NULL);
-}
-
 unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector 
*p)
 {
+   struct address_space *mapping = bdev->bd_inode->i_mapping;
struct page *page;
 
-   /* don't populate page cache for dax capable devices */
-   if (IS_DAX(bdev->bd_inode))
-   page = read_dax_sector(bdev, n);
-   else
-   page = read_pagecache_sector(bdev, n);
-
+   page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)), NULL);
if (!IS_ERR(page)) {
if (PageError(page))
goto fail;
diff --git a/fs/dax.c b/fs/dax.c
index de622d4282a6..b78a6947c4f5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -101,26 +101,6 @@ static int dax_is_empty_entry(void *entry)
return (unsigned long)entry & RADIX_DAX_EMPTY;
 }
 
-struct page *read_dax_sector(struct block_device *bdev, sector_t n)
-{
-   struct page *page = alloc_pages(GFP_KERNEL, 0);
-   struct blk_dax_ctl dax = {
-   .size = PAGE_SIZE,
-   .sector = n & ~((((int) PAGE_SIZE) / 512) - 1),
-   };
-   long rc;
-
-   if (!page)
-   return ERR_PTR(-ENOMEM);
-
-   rc = dax_map_atomic(bdev, &dax);
-   if (rc < 0)
-   return ERR_PTR(rc);
-   memcpy_from_pmem(page_address(page), dax.addr, PAGE_SIZE);
-   dax_unmap_atomic(bdev, &dax);
-   return page;
-}
-
 /*
  * DAX radix tree locking
  */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7e62e280c11f..0d0d890f9186 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -70,15 +70,9 @@ void dax_wake_mapping_entry_waiter(struct address_space 
*mapping,
pgoff_t index, void *entry, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
-struct page *read_dax_sector(struct block_device *bdev, sector_t n);
 int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
unsigned int offset, unsigned int length);
 #else
-static inline struct page *read_dax_sector(struct block_device *bdev,
-   sector_t n)
-{
-   return ERR_PTR(-ENXIO);
-}
 static inline int __dax_zero_page_range(struct block_device *bdev,
sector_t sector, unsigned int offset, unsigned int length)
 {



[resend PATCH v2 18/33] x86, dax, pmem: remove indirection around memcpy_from_pmem()

2017-04-17 Thread Dan Williams
memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper
serves no real benefit aside from affording a more generic function name
than the x86-specific 'mcsafe'. However this would not be the first time
that x86 terminology leaked into the global namespace. For lack of a
better name, just use memcpy_mcsafe() directly.

This conversion also catches a place where we should have been using
plain memcpy, acpi_nfit_blk_single_io().

Cc: <x...@kernel.org>
Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/x86/include/asm/pmem.h  |5 -
 arch/x86/include/asm/string_64.h |1 +
 drivers/acpi/nfit/core.c |3 +--
 drivers/nvdimm/claim.c   |2 +-
 drivers/nvdimm/pmem.c|2 +-
 include/linux/pmem.h |   23 ---
 include/linux/string.h   |8 
 7 files changed, 12 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 529bb4a6487a..d5a22bac9988 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,11 +44,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void 
*src, size_t n)
BUG();
 }
 
-static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
-{
-   return memcpy_mcsafe(dst, src, n);
-}
-
 /**
  * arch_wb_cache_pmem - write back a cache range with CLWB
  * @vaddr: virtual start address
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index a164862d77e3..733bae07fb29 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -79,6 +79,7 @@ int strcmp(const char *cs, const char *ct);
 #define memset(s, c, n) __memset(s, c, n)
 #endif
 
+#define __HAVE_ARCH_MEMCPY_MCSAFE 1
 __must_check int memcpy_mcsafe_unrolled(void *dst, const void *src, size_t 
cnt);
 DECLARE_STATIC_KEY_FALSE(mcsafe_key);
 
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index c8ea9d698cd0..d0c07b2344e4 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1783,8 +1783,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk 
*nfit_blk,
mmio_flush_range((void __force *)
mmio->addr.aperture + offset, c);
 
-   memcpy_from_pmem(iobuf + copied,
-   mmio->addr.aperture + offset, c);
+   memcpy(iobuf + copied, mmio->addr.aperture + offset, c);
}
 
copied += c;
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index ca6d572c48fc..3a35e8028b9c 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -239,7 +239,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
if (rw == READ) {
if (unlikely(is_bad_pmem(&nsio->bb, sector, sz_align)))
return -EIO;
-   return memcpy_from_pmem(buf, nsio->addr + offset, size);
+   return memcpy_mcsafe(buf, nsio->addr + offset, size);
}
 
if (unlikely(is_bad_pmem(&nsio->bb, sector, sz_align))) {
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 85b85633d674..3b3dab73d741 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -89,7 +89,7 @@ static int read_pmem(struct page *page, unsigned int off,
int rc;
void *mem = kmap_atomic(page);
 
-   rc = memcpy_from_pmem(mem + off, pmem_addr, len);
+   rc = memcpy_mcsafe(mem + off, pmem_addr, len);
kunmap_atomic(mem);
if (rc)
return -EIO;
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index e856c2cb0fe8..71ecf3d46aac 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,12 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void 
*src, size_t n)
BUG();
 }
 
-static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
-{
-   BUG();
-   return -EFAULT;
-}
-
 static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
struct iov_iter *i)
 {
@@ -65,23 +59,6 @@ static inline bool arch_has_pmem_api(void)
return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API);
 }
 
-/*
- * memcpy_from_pmem - read from persistent memory with error handling
- * @dst: destination buffer
- * @src: source buffer
- * @size: transfer length
- *
- * Returns 0 on success negative error code on failure.
- */
-static inline int memcpy_from_pmem(void *dst, void const *src, size_t size

[resend PATCH v2 17/33] block: remove block_device_operations ->direct_access()

2017-04-17 Thread Dan Williams
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/powerpc/sysdev/axonram.c |   23 -
 drivers/block/brd.c   |   15 --
 drivers/md/dm.c   |   13 
 drivers/nvdimm/pmem.c |   10 -
 drivers/s390/block/dcssblk.c  |   16 ---
 fs/block_dev.c|   45 -
 include/linux/blkdev.h|   17 ---
 7 files changed, 4 insertions(+), 135 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ad857d5e81b1..83eb56ff1d2c 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,6 +139,10 @@ axon_ram_make_request(struct request_queue *queue, struct 
bio *bio)
return BLK_QC_T_NONE;
 }
 
+static const struct block_device_operations axon_ram_devops = {
+   .owner  = THIS_MODULE,
+};
+
 static long
 __axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long 
nr_pages,
   void **kaddr, pfn_t *pfn)
@@ -150,25 +154,6 @@ __axon_ram_direct_access(struct axon_ram_bank *bank, 
pgoff_t pgoff, long nr_page
return (bank->size - offset) / PAGE_SIZE;
 }
 
-/**
- * axon_ram_direct_access - direct_access() method for block device
- * @device, @sector, @data: see block_device_operations method
- */
-static long
-axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
-  void **kaddr, pfn_t *pfn, long size)
-{
-   struct axon_ram_bank *bank = device->bd_disk->private_data;
-
-   return __axon_ram_direct_access(bank, (sector * 512) / PAGE_SIZE,
-   size / PAGE_SIZE, kaddr, pfn) * PAGE_SIZE;
-}
-
-static const struct block_device_operations axon_ram_devops = {
-   .owner  = THIS_MODULE,
-   .direct_access  = axon_ram_blk_direct_access
-};
-
 static long
 axon_ram_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long 
nr_pages,
   void **kaddr, pfn_t *pfn)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 60f3193c9ce2..bfa4ed2c75ef 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -395,18 +395,6 @@ static long __brd_direct_access(struct brd_device *brd, 
pgoff_t pgoff,
return 1;
 }
 
-static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
-{
-   struct brd_device *brd = bdev->bd_disk->private_data;
-   long nr_pages = __brd_direct_access(brd, PHYS_PFN(sector * 512),
-   PHYS_PFN(size), kaddr, pfn);
-
-   if (nr_pages < 0)
-   return nr_pages;
-   return nr_pages * PAGE_SIZE;
-}
-
 static long brd_dax_direct_access(struct dax_device *dax_dev,
pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
 {
@@ -418,14 +406,11 @@ static long brd_dax_direct_access(struct dax_device 
*dax_dev,
 static const struct dax_operations brd_dax_ops = {
.direct_access = brd_dax_direct_access,
 };
-#else
-#define brd_blk_direct_access NULL
 #endif
 
 static const struct block_device_operations brd_fops = {
.owner =THIS_MODULE,
.rw_page =  brd_rw_page,
-   .direct_access =brd_blk_direct_access,
 };
 
 /*
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ef4c6f8cad47..79d5f5fd823e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -957,18 +957,6 @@ static long dm_dax_direct_access(struct dax_device 
*dax_dev, pgoff_t pgoff,
return ret;
 }
 
-static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
-{
-   struct mapped_device *md = bdev->bd_disk->private_data;
-   struct dax_device *dax_dev = md->dax_dev;
-   long nr_pages = size / PAGE_SIZE;
-
-   nr_pages = dm_dax_direct_access(dax_dev, sector / PAGE_SECTORS,
-   nr_pages, kaddr, pfn);
-   return nr_pages < 0 ? nr_pages : nr_pages * PAGE_SIZE;
-}
-
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
  * allowed for all bio types except REQ_PREFLUSH.
@@ -2823,7 +2811,6 @@ static const struct block_device_operations dm_blk_dops = 
{
.open = dm_blk_open,
.release = dm_blk_close,
.ioctl = dm_blk_ioctl,
-   .direct_access = dm_blk_direct_access,
.getgeo = dm_blk_getgeo,
	.pr_ops = &dm_pr_ops,
.owner = THIS_MODULE
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fbbcf8154eec..85b85633d674 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -220,19 +220,9 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
return

[resend PATCH v2 19/33] dax, pmem: introduce 'copy_from_iter' dax operation

2017-04-17 Thread Dan Williams
The direct-I/O write path for a pmem device must ensure that data is
flushed to a power-fail safe zone when the operation is complete.
However, other dax capable block devices, like brd, do not have this
requirement.  Introduce a 'copy_from_iter' dax operation so that pmem
can inject cache management without imposing this overhead on other dax
capable block_device drivers.

This is also a first step of moving all architecture-specific
pmem-operations to the pmem driver.

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Al Viro <v...@zeniv.linux.org.uk>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/nvdimm/pmem.c |   43 +++
 include/linux/dax.h   |3 +++
 2 files changed, 46 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..e501df4ab4b4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -220,6 +220,48 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
 }
 
+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+   void *addr, size_t bytes, struct iov_iter *i)
+{
+   size_t len;
+
+   /* TODO: skip the write-back by always using non-temporal stores */
+   len = copy_from_iter_nocache(addr, bytes, i);
+
+   /*
+* In the iovec case on x86_64 copy_from_iter_nocache() uses
+* non-temporal stores for the bulk of the transfer, but we need
+* to manually flush if the transfer is unaligned. A cached
+* memory copy is used when destination or size is not naturally
+* aligned. That is:
+*   - Require 8-byte alignment when size is 8 bytes or larger.
+*   - Require 4-byte alignment when size is 4 bytes.
+*
+* In the non-iovec case the entire destination needs to be
+* flushed.
+*/
+   if (iter_is_iovec(i)) {
+   unsigned long flushed, dest = (unsigned long) addr;
+
+   if (bytes < 8) {
+   if (!IS_ALIGNED(dest, 4) || (bytes != 4))
+   wb_cache_pmem(addr, 1);
+   } else {
+   if (!IS_ALIGNED(dest, 8)) {
+   dest = ALIGN(dest, 
boot_cpu_data.x86_clflush_size);
+   wb_cache_pmem(addr, 1);
+   }
+
+   flushed = dest - (unsigned long) addr;
+   if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
+   wb_cache_pmem(addr + bytes - 1, 1);
+   }
+   } else
+   wb_cache_pmem(addr, bytes);
+
+   return len;
+}
+
 static const struct block_device_operations pmem_fops = {
.owner =THIS_MODULE,
.rw_page =  pmem_rw_page,
@@ -236,6 +278,7 @@ static long pmem_dax_direct_access(struct dax_device 
*dax_dev,
 
 static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_dax_direct_access,
+   .copy_from_iter = pmem_copy_from_iter,
 };
 
 static void pmem_release_queue(void *q)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
 */
long (*direct_access)(struct dax_device *, pgoff_t, long,
void **, pfn_t *);
+   /* copy_from_iter: dax-driver override for default copy_from_iter */
+   size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+   struct iov_iter *);
 };
 
 int dax_read_lock(void);
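
As a rough illustration of the alignment rules described in the comment
above, the following standalone C sketch (not kernel code; it assumes a
64-byte cache line where the real path reads boot_cpu_data.x86_clflush_size)
reports which manual write-backs pmem_copy_from_iter() would issue for a
given destination and length in the iovec case:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define CLSIZE 64UL
#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)
#define ALIGN_UP(x, a)   (((x) + (a) - 1) & ~((a) - 1))

static void report(uintptr_t dest, size_t bytes)
{
	bool head = false, tail = false;

	if (bytes < 8) {
		/* cached copy unless exactly 4 bytes at a 4-byte boundary */
		head = !IS_ALIGNED(dest, 4) || bytes != 4;
	} else {
		uintptr_t flushed_to = dest;

		if (!IS_ALIGNED(dest, 8)) {
			/* unaligned head is copied through the cache */
			head = true;
			flushed_to = ALIGN_UP(dest, CLSIZE);
		}
		size_t flushed = flushed_to - dest;
		if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
			tail = true;	/* unaligned tail also copied cached */
	}
	printf("dest=%#lx bytes=%zu -> head wb: %s, tail wb: %s\n",
	       (unsigned long)dest, bytes, head ? "yes" : "no",
	       tail ? "yes" : "no");
}

int main(void)
{
	report(0x1000, 4096);	/* fully aligned: no manual write-back */
	report(0x1003, 4);	/* 4 bytes, misaligned: head write-back */
	report(0x1004, 7);	/* < 8 bytes, not 4: head write-back */
	report(0x1004, 4100);	/* misaligned start: head write-back */
	return 0;
}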



[resend PATCH v2 23/33] dm: add ->flush() dax operation support

2017-04-17 Thread Dan Williams
Allow device-mapper to route flush operations to the
per-target implementation. In order for the device stacking to work we
need a dax_dev and a pgoff relative to that device. This gives each
layer of the stack the information it needs to look up the operation
pointer for the next level.

This conceptually allows for an array of mixed device drivers with
varying flush implementations.

Cc: Toshi Kani <toshi.k...@hpe.com>
Cc: Mike Snitzer <snit...@redhat.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/dax/super.c   |   11 +++
 drivers/md/dm-linear.c|   15 +++
 drivers/md/dm-stripe.c|   20 
 drivers/md/dm.c   |   19 +++
 include/linux/dax.h   |2 ++
 include/linux/device-mapper.h |3 +++
 6 files changed, 70 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 73f0da8e5d27..1253c05a2e53 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -117,6 +117,17 @@ size_t dax_copy_from_iter(struct dax_device *dax_dev, 
pgoff_t pgoff, void *addr,
 }
 EXPORT_SYMBOL_GPL(dax_copy_from_iter);
 
+void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+   size_t size)
+{
+   if (!dax_alive(dax_dev))
+   return;
+
+   if (dax_dev->ops->flush)
+   dax_dev->ops->flush(dax_dev, pgoff, addr, size);
+}
+EXPORT_SYMBOL_GPL(dax_flush);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
	lockdep_assert_held(&dax_srcu);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 5fe44a0ddfab..70d8439a1b63 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -172,6 +172,20 @@ static size_t linear_dax_copy_from_iter(struct dm_target 
*ti, pgoff_t pgoff,
return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
 }
 
+static void linear_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
+   size_t size)
+{
+   struct linear_c *lc = ti->private;
+   struct block_device *bdev = lc->dev->bdev;
+   struct dax_device *dax_dev = lc->dev->dax_dev;
+   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+   dev_sector = linear_map_sector(ti, sector);
+   if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
+   return;
+   dax_flush(dax_dev, pgoff, addr, size);
+}
+
 static struct target_type linear_target = {
.name   = "linear",
.version = {1, 3, 0},
@@ -184,6 +198,7 @@ static struct target_type linear_target = {
.iterate_devices = linear_iterate_devices,
.direct_access = linear_dax_direct_access,
.dax_copy_from_iter = linear_dax_copy_from_iter,
+   .dax_flush = linear_dax_flush,
 };
 
 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 4f45d23249b2..829fd438318d 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -349,6 +349,25 @@ static size_t stripe_dax_copy_from_iter(struct dm_target 
*ti, pgoff_t pgoff,
return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
 }
 
+static void stripe_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
+   size_t size)
+{
+   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+   struct stripe_c *sc = ti->private;
+   struct dax_device *dax_dev;
+   struct block_device *bdev;
+   uint32_t stripe;
+
+   stripe_map_sector(sc, sector, &stripe, &dev_sector);
+   dev_sector += sc->stripe[stripe].physical_start;
+   dax_dev = sc->stripe[stripe].dev->dax_dev;
+   bdev = sc->stripe[stripe].dev->bdev;
+
+   if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
+   return;
+   dax_flush(dax_dev, pgoff, addr, size);
+}
+
 /*
  * Stripe status:
  *
@@ -468,6 +487,7 @@ static struct target_type stripe_target = {
.io_hints = stripe_io_hints,
.direct_access = stripe_dax_direct_access,
.dax_copy_from_iter = stripe_dax_copy_from_iter,
+   .dax_flush = stripe_dax_flush,
 };
 
 int __init dm_stripe_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8c8579efcba2..6a97711cdbdf 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -982,6 +982,24 @@ static size_t dm_dax_copy_from_iter(struct dax_device 
*dax_dev, pgoff_t pgoff,
return ret;
 }
 
+static void dm_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+   size_t size)
+{
+   struct mapped_device *md = dax_get_private(dax_dev);
+   sector_t sector = pgoff * PAGE_SECTORS;
+   struct dm_target *ti;
+   int srcu_idx;
+
+   ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+   if (!ti)
+   goto out;
+   if (ti->type->dax_flush)
+   ti->type->dax_flush(ti, pgoff, addr, size);
+ out:
+   dm_put_live_table(md, srcu_idx);
+}
+
 /*
  * A target may call dm_acce
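
A rough sketch of the pgoff/sector arithmetic a stacked target performs when
forwarding a flush, in plain userspace C with invented offsets (4K pages,
8 sectors per page); it mirrors the shape of linear_dax_flush() plus
bdev_dax_pgoff() above but is not the in-kernel helpers:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE    4096ULL
#define PAGE_SECTORS (PAGE_SIZE / 512)

/* hypothetical linear-target layout: target begins at 'ti_begin' on the
 * upper device and maps to 'dev_start' on the lower device, which sits on
 * a partition starting at 'part_start' sectors into the whole disk */
static const uint64_t ti_begin = 0;
static const uint64_t dev_start = 2048;
static const uint64_t part_start = 8192;

static int remap_pgoff(uint64_t pgoff, uint64_t *dev_pgoff)
{
	uint64_t sector = pgoff * PAGE_SECTORS;		/* upper-device offset */
	uint64_t dev_sector = dev_start + (sector - ti_begin);
	uint64_t abs_sector = part_start + dev_sector;	/* whole-disk view */

	if (abs_sector % PAGE_SECTORS)			/* must stay page aligned */
		return -1;
	*dev_pgoff = abs_sector / PAGE_SECTORS;
	return 0;
}

int main(void)
{
	uint64_t dev_pgoff;

	if (remap_pgoff(10, &dev_pgoff) == 0)
		printf("upper pgoff 10 -> lower pgoff %llu\n",
		       (unsigned long long)dev_pgoff);
	return 0;
}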

[resend PATCH v2 22/33] dax, pmem: introduce an optional 'flush' dax_operation

2017-04-17 Thread Dan Williams
Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so add a dax operation
to allow pmem to take this extra action, but skip it for other dax
capable devices that do not provide a flush routine.

An example of this differentiation is a volatile ram disk, where there is
no expectation of persistence. In fact, the pmem driver itself might front
such an address range when it is specified by the NFIT. So, this "no flush"
property might be something passed down by the bus / libnvdimm.

Cc: Christoph Hellwig <h...@lst.de>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/nvdimm/pmem.c |   11 +++
 include/linux/dax.h   |2 ++
 2 files changed, 13 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e501df4ab4b4..822b85fb3365 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -276,9 +276,20 @@ static long pmem_dax_direct_access(struct dax_device 
*dax_dev,
return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
+   void *addr, size_t size)
+{
+   /*
+* TODO: move arch specific cache management into the driver
+* directly.
+*/
+   wb_cache_pmem(addr, size);
+}
+
 static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_dax_direct_access,
.copy_from_iter = pmem_copy_from_iter,
+   .flush = pmem_dax_flush,
 };
 
 static void pmem_release_queue(void *q)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index cd8561bb21f3..c88bbcba26d9 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -19,6 +19,8 @@ struct dax_operations {
/* copy_from_iter: dax-driver override for default copy_from_iter */
size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
struct iov_iter *);
+   /* flush: optional driver-specific cache management after writes */
+   void (*flush)(struct dax_device *, pgoff_t, void *, size_t);
 };
 
 int dax_read_lock(void);
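
The "optional operation" dispatch this enables can be illustrated with a toy
userspace sketch; the struct and function names below are invented for
illustration and are not kernel APIs. A driver that needs no cache
management simply leaves the method NULL and the wrapper becomes a nop:

#include <stdio.h>
#include <stddef.h>

struct demo_ops {
	void (*flush)(void *addr, size_t size);	/* optional */
};

static void demo_flush(const struct demo_ops *ops, void *addr, size_t size)
{
	if (ops->flush)
		ops->flush(addr, size);	/* pmem-like device */
	/* else: volatile-backed device, nothing to do */
}

static void noisy_flush(void *addr, size_t size)
{
	printf("flushing %zu bytes at %p\n", size, addr);
}

int main(void)
{
	char buf[64];
	struct demo_ops pmem_like = { .flush = noisy_flush };
	struct demo_ops brd_like = { .flush = NULL };

	demo_flush(&pmem_like, buf, sizeof(buf));	/* prints */
	demo_flush(&brd_like, buf, sizeof(buf));	/* silent */
	return 0;
}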



[resend PATCH v2 24/33] filesystem-dax: convert to dax_flush()

2017-04-17 Thread Dan Williams
Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so the dax_flush()
helper skips cache management work when the underlying driver does not
specify a flush method.

We still do all the dirty tracking since the radix entry will already be
there for locking purposes. However, the work to clean the entry will be
a nop for some dax drivers.

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/dax.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 11b9909c91df..edbf988de86c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -798,7 +798,7 @@ static int dax_writeback_one(struct block_device *bdev,
}
 
dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-   wb_cache_pmem(kaddr, size);
+   dax_flush(dax_dev, pgoff, kaddr, size);
/*
 * After we have flushed the cache, we can clear the dirty tag. There
 * cannot be new dirty data in the pfn after the flush has completed as



[resend PATCH v2 21/33] filesystem-dax: convert to dax_copy_from_iter()

2017-04-17 Thread Dan Williams
Now that all possible providers of the dax_operations copy_from_iter
method are implemented, switch filesystem-dax to call the driver rather
than copy_from_iter_pmem.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/x86/include/asm/pmem.h |   50 ---
 fs/dax.c|3 ++-
 include/linux/pmem.h|   24 -
 3 files changed, 2 insertions(+), 75 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d5a22bac9988..60e8edbe0205 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -66,56 +66,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t 
size)
 }
 
 /**
- * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
- * @addr:  PMEM destination address
- * @bytes: number of bytes to copy
- * @i: iterator with source data
- *
- * Copy data from the iterator 'i' to the PMEM buffer starting at 'addr'.
- */
-static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
-   struct iov_iter *i)
-{
-   size_t len;
-
-   /* TODO: skip the write-back by always using non-temporal stores */
-   len = copy_from_iter_nocache(addr, bytes, i);
-
-   /*
-* In the iovec case on x86_64 copy_from_iter_nocache() uses
-* non-temporal stores for the bulk of the transfer, but we need
-* to manually flush if the transfer is unaligned. A cached
-* memory copy is used when destination or size is not naturally
-* aligned. That is:
-*   - Require 8-byte alignment when size is 8 bytes or larger.
-*   - Require 4-byte alignment when size is 4 bytes.
-*
-* In the non-iovec case the entire destination needs to be
-* flushed.
-*/
-   if (iter_is_iovec(i)) {
-   unsigned long flushed, dest = (unsigned long) addr;
-
-   if (bytes < 8) {
-   if (!IS_ALIGNED(dest, 4) || (bytes != 4))
-   arch_wb_cache_pmem(addr, 1);
-   } else {
-   if (!IS_ALIGNED(dest, 8)) {
-   dest = ALIGN(dest, 
boot_cpu_data.x86_clflush_size);
-   arch_wb_cache_pmem(addr, 1);
-   }
-
-   flushed = dest - (unsigned long) addr;
-   if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
-   arch_wb_cache_pmem(addr + bytes - 1, 1);
-   }
-   } else
-   arch_wb_cache_pmem(addr, bytes);
-
-   return len;
-}
-
-/**
  * arch_clear_pmem - zero a PMEM memory range
  * @addr:  virtual start address
  * @size:  number of bytes to zero
diff --git a/fs/dax.c b/fs/dax.c
index ce9dc9c3e829..11b9909c91df 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1061,7 +1061,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t 
length, void *data,
map_len = end - pos;
 
if (iov_iter_rw(iter) == WRITE)
-   map_len = copy_from_iter_pmem(kaddr, map_len, iter);
+   map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr,
+   map_len, iter);
else
map_len = copy_to_iter(kaddr, map_len, iter);
if (map_len <= 0) {
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 71ecf3d46aac..9d542a5600e4 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,13 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void 
*src, size_t n)
BUG();
 }
 
-static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
-   struct iov_iter *i)
-{
-   BUG();
-   return 0;
-}
-
 static inline void arch_clear_pmem(void *addr, size_t size)
 {
BUG();
@@ -80,23 +73,6 @@ static inline void memcpy_to_pmem(void *dst, const void 
*src, size_t n)
 }
 
 /**
- * copy_from_iter_pmem - copy data from an iterator to PMEM
- * @addr:  PMEM destination address
- * @bytes: number of bytes to copy
- * @i: iterator with source data
- *
- * Copy data from the iterator 'i' to the PMEM buffer starting at 'addr'.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline size_t copy_from_iter_pmem(void *addr, size_t bytes,
-   struct iov_iter *i)
-{
-   if (arch_has_pmem_api())
-   return arch_copy_from_iter_pmem(addr, bytes, i);
-   return copy_from_iter_nocache(addr, bytes, i);
-}
-
-/**
  * clear_pmem - zero a PMEM memory range
  * @addr:  virtual start address
  * @size:  number of bytes to zero



[resend PATCH v2 26/33] x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm

2017-04-17 Thread Dan Williams
With all calls to this routine re-directed through the pmem driver, we
can kill the pmem api indirection. arch_wb_cache_pmem() is now
optionally supplied by an arch specific extension to libnvdimm.  Same as
before, pmem flushing is only defined for x86_64, but it is
straightforward to add other archs in the future.

Cc: <x...@kernel.org>
Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Oliver O'Halloran <ooh...@gmail.com>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/x86/include/asm/pmem.h |   21 -
 drivers/nvdimm/Makefile |1 +
 drivers/nvdimm/pmem.c   |   14 +-
 drivers/nvdimm/pmem.h   |8 
 drivers/nvdimm/x86.c|   36 
 include/linux/pmem.h|   19 ---
 tools/testing/nvdimm/Kbuild |1 +
 7 files changed, 51 insertions(+), 49 deletions(-)
 create mode 100644 drivers/nvdimm/x86.c

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index f4c119d253f3..4759a179aa52 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,27 +44,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void 
*src, size_t n)
BUG();
 }
 
-/**
- * arch_wb_cache_pmem - write back a cache range with CLWB
- * @vaddr: virtual start address
- * @size:  number of bytes to write back
- *
- * Write back a cache range using the CLWB (cache line write back)
- * instruction. Note that @size is internally rounded up to be cache
- * line size aligned.
- */
-static inline void arch_wb_cache_pmem(void *addr, size_t size)
-{
-   u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
-   unsigned long clflush_mask = x86_clflush_size - 1;
-   void *vend = addr + size;
-   void *p;
-
-   for (p = (void *)((unsigned long)addr & ~clflush_mask);
-p < vend; p += x86_clflush_size)
-   clwb(p);
-}
-
 static inline void arch_invalidate_pmem(void *addr, size_t size)
 {
clflush_cache_range(addr, size);
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 909554c3f955..9eafb1dd2876 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -24,3 +24,4 @@ libnvdimm-$(CONFIG_ND_CLAIM) += claim.o
 libnvdimm-$(CONFIG_BTT) += btt_devs.o
 libnvdimm-$(CONFIG_NVDIMM_PFN) += pfn_devs.o
 libnvdimm-$(CONFIG_NVDIMM_DAX) += dax_devs.o
+libnvdimm-$(CONFIG_X86_64) += x86.o
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 822b85fb3365..c77a3a757729 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -245,19 +245,19 @@ static size_t pmem_copy_from_iter(struct dax_device 
*dax_dev, pgoff_t pgoff,
 
if (bytes < 8) {
if (!IS_ALIGNED(dest, 4) || (bytes != 4))
-   wb_cache_pmem(addr, 1);
+   arch_wb_cache_pmem(addr, 1);
} else {
if (!IS_ALIGNED(dest, 8)) {
dest = ALIGN(dest, 
boot_cpu_data.x86_clflush_size);
-   wb_cache_pmem(addr, 1);
+   arch_wb_cache_pmem(addr, 1);
}
 
flushed = dest - (unsigned long) addr;
if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
-   wb_cache_pmem(addr + bytes - 1, 1);
+   arch_wb_cache_pmem(addr + bytes - 1, 1);
}
} else
-   wb_cache_pmem(addr, bytes);
+   arch_wb_cache_pmem(addr, bytes);
 
return len;
 }
@@ -279,11 +279,7 @@ static long pmem_dax_direct_access(struct dax_device 
*dax_dev,
 static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t size)
 {
-   /*
-* TODO: move arch specific cache management into the driver
-* directly.
-*/
-   wb_cache_pmem(addr, size);
+   arch_wb_cache_pmem(addr, size);
 }
 
 static const struct dax_operations pmem_dax_ops = {
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 7f4dbd72a90a..c4b3371c7f88 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -5,6 +5,14 @@
 #include 
 #include 
 
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+void arch_wb_cache_pmem(void *addr, size_t size);
+#else
+static inline void arch_wb_cache_pmem(void *addr, size_t size)
+{
+}
+#endif
+
 /* this definition is in it's own header for tools/testing/nvdimm to consume */
 struct pmem_device {
/* One contiguous memory region per device */
diff --git a/
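
The cache-line walk that arch_wb_cache_pmem() performs can be restated as
plain arithmetic; the sketch below assumes a 64-byte line where the real
routine reads boot_cpu_data.x86_clflush_size and issues clwb on each line:

#include <stdio.h>
#include <stdint.h>

#define CLSIZE 64UL

static unsigned long lines_written_back(uintptr_t addr, size_t size)
{
	uintptr_t start = addr & ~(CLSIZE - 1);	/* round down to a line */
	uintptr_t end = addr + size;
	unsigned long n = 0;

	for (uintptr_t p = start; p < end; p += CLSIZE)
		n++;	/* one clwb per line in the real routine */
	return n;
}

int main(void)
{
	printf("%lu\n", lines_written_back(0x1000, 4096));	/* 64 lines */
	printf("%lu\n", lines_written_back(0x103f, 2));		/* straddles: 2 */
	return 0;
}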

[resend PATCH v2 27/33] x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm

2017-04-17 Thread Dan Williams
Kill this globally defined wrapper and move it to libnvdimm so that we can
ultimately remove the public pmem api.

Cc: <x...@kernel.org>
Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/x86/include/asm/pmem.h |4 
 drivers/nvdimm/claim.c  |3 ++-
 drivers/nvdimm/pmem.c   |2 +-
 drivers/nvdimm/pmem.h   |4 
 drivers/nvdimm/x86.c|6 ++
 include/linux/pmem.h|   19 ---
 6 files changed, 13 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 4759a179aa52..ded2541a7ba9 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,9 +44,5 @@ static inline void arch_memcpy_to_pmem(void *dst, const void 
*src, size_t n)
BUG();
 }
 
-static inline void arch_invalidate_pmem(void *addr, size_t size)
-{
-   clflush_cache_range(addr, size);
-}
 #endif /* CONFIG_ARCH_HAS_PMEM_API */
 #endif /* __ASM_X86_PMEM_H__ */
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..1e13a196ce4b 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include "nd-core.h"
+#include "pmem.h"
 #include "pfn.h"
 #include "btt.h"
 #include "nd.h"
@@ -261,7 +262,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
cleared /= 512;
	badblocks_clear(&nsio->bb, sector, cleared);
}
-   invalidate_pmem(nsio->addr + offset, size);
+   arch_invalidate_pmem(nsio->addr + offset, size);
} else
rc = -EIO;
}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c77a3a757729..769a510c20e8 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -69,7 +69,7 @@ static int pmem_clear_poison(struct pmem_device *pmem, 
phys_addr_t offset,
	badblocks_clear(&pmem->bb, sector, cleared);
}
 
-   invalidate_pmem(pmem->virt_addr + offset, len);
+   arch_invalidate_pmem(pmem->virt_addr + offset, len);
 
return rc;
 }
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index c4b3371c7f88..5900c1b7 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -7,10 +7,14 @@
 
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
+void arch_invalidate_pmem(void *addr, size_t size);
 #else
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
 }
+static inline void arch_invalidate_pmem(void *addr, size_t size)
+{
+}
 #endif
 
 /* this definition is in it's own header for tools/testing/nvdimm to consume */
diff --git a/drivers/nvdimm/x86.c b/drivers/nvdimm/x86.c
index 79d7267da4d2..07478ed7ce97 100644
--- a/drivers/nvdimm/x86.c
+++ b/drivers/nvdimm/x86.c
@@ -34,3 +34,9 @@ void arch_wb_cache_pmem(void *addr, size_t size)
clwb(p);
 }
 EXPORT_SYMBOL_GPL(arch_wb_cache_pmem);
+
+void arch_invalidate_pmem(void *addr, size_t size)
+{
+   clflush_cache_range(addr, size);
+}
+EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 33ae761f010a..559c00848583 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -30,11 +30,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void 
*src, size_t n)
 {
BUG();
 }
-
-static inline void arch_invalidate_pmem(void *addr, size_t size)
-{
-   BUG();
-}
 #endif
 
 static inline bool arch_has_pmem_api(void)
@@ -61,18 +56,4 @@ static inline void memcpy_to_pmem(void *dst, const void 
*src, size_t n)
else
memcpy(dst, src, n);
 }
-
-/**
- * invalidate_pmem - flush a pmem range from the cache hierarchy
- * @addr:  virtual start address
- * @size:  bytes to invalidate (internally aligned to cache line size)
- *
- * For platforms that support clearing poison this flushes any poisoned
- * ranges out of the cache
- */
-static inline void invalidate_pmem(void *addr, size_t size)
-{
-   if (arch_has_pmem_api())
-   arch_invalidate_pmem(addr, size);
-}
 #endif /* __PMEM_H__ */



[resend PATCH v2 32/33] filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC

2017-04-17 Thread Dan Williams
Some platforms arrange for cpu caches to be flushed on power-fail. On
those platforms there is no requirement that the kernel track and flush
potentially dirty cache lines. Given that we still insert entries into
the radix for locking purposes this patch only disables the cache flush
loop, not the dirty tracking.

Userspace can override the default cache setting via the block device
queue "write_cache" attribute in sysfs.

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/dax.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f37ed21e4093..5b7ee1bc74d0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -797,7 +797,8 @@ static int dax_writeback_one(struct block_device *bdev,
}
 
dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-   dax_flush(dax_dev, pgoff, kaddr, size);
+   if (test_bit(QUEUE_FLAG_WC, &bdev->bd_queue->queue_flags))
+   dax_flush(dax_dev, pgoff, kaddr, size);
/*
 * After we have flushed the cache, we can clear the dirty tag. There
 * cannot be new dirty data in the pfn after the flush has completed as
@@ -982,7 +983,8 @@ int __dax_zero_page_range(struct block_device *bdev,
return rc;
}
memset(kaddr + offset, 0, size);
-   dax_flush(dax_dev, pgoff, kaddr + offset, size);
+   if (test_bit(QUEUE_FLAG_WC, &bdev->bd_queue->queue_flags))
+   dax_flush(dax_dev, pgoff, kaddr + offset, size);
dax_read_unlock(id);
}
return 0;
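
For reference, the attribute can be inspected from userspace; the sketch
below assumes the conventional /sys/block/<disk>/queue/write_cache path and
a device named pmem0, both of which are illustrative:

#include <stdio.h>

int main(void)
{
	char buf[32] = "";
	FILE *f = fopen("/sys/block/pmem0/queue/write_cache", "r");

	if (!f) {
		perror("write_cache");
		return 1;
	}
	/* "write back" -> fsync-time cache flushing stays on,
	 * "write through" -> the dax_flush() loop above is skipped */
	if (fgets(buf, sizeof(buf), f))
		printf("pmem0 write cache mode: %s", buf);
	fclose(f);
	return 0;
}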



[resend PATCH v2 29/33] uio, libnvdimm, pmem: implement cache bypass for all copy_from_iter() operations

2017-04-17 Thread Dan Williams
Introduce copy_from_iter_ops() to enable passing custom sub-routines to
iterate_and_advance(). Define pmem operations that guarantee cache
bypass to supplement the existing usage of __copy_from_iter_nocache()
backed by arch_wb_cache_pmem().

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Toshi Kani <toshi.k...@hpe.com>
Cc: Al Viro <v...@zeniv.linux.org.uk>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/nvdimm/Kconfig |1 +
 drivers/nvdimm/pmem.c  |   38 +-
 drivers/nvdimm/pmem.h  |7 +++
 drivers/nvdimm/x86.c   |   48 
 include/linux/uio.h|4 
 lib/Kconfig|3 +++
 lib/iov_iter.c |   25 +
 7 files changed, 89 insertions(+), 37 deletions(-)

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 4d45196d6f94..28002298cdc8 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -38,6 +38,7 @@ config BLK_DEV_PMEM
 
 config ARCH_HAS_PMEM_API
depends on X86_64
+   select COPY_FROM_ITER_OPS
def_bool y
 
 config ND_BLK
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 329895ca88e1..b000c6db5731 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -223,43 +223,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
 static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i)
 {
-   size_t len;
-
-   /* TODO: skip the write-back by always using non-temporal stores */
-   len = copy_from_iter_nocache(addr, bytes, i);
-
-   /*
-* In the iovec case on x86_64 copy_from_iter_nocache() uses
-* non-temporal stores for the bulk of the transfer, but we need
-* to manually flush if the transfer is unaligned. A cached
-* memory copy is used when destination or size is not naturally
-* aligned. That is:
-*   - Require 8-byte alignment when size is 8 bytes or larger.
-*   - Require 4-byte alignment when size is 4 bytes.
-*
-* In the non-iovec case the entire destination needs to be
-* flushed.
-*/
-   if (iter_is_iovec(i)) {
-   unsigned long flushed, dest = (unsigned long) addr;
-
-   if (bytes < 8) {
-   if (!IS_ALIGNED(dest, 4) || (bytes != 4))
-   arch_wb_cache_pmem(addr, 1);
-   } else {
-   if (!IS_ALIGNED(dest, 8)) {
-   dest = ALIGN(dest, 
boot_cpu_data.x86_clflush_size);
-   arch_wb_cache_pmem(addr, 1);
-   }
-
-   flushed = dest - (unsigned long) addr;
-   if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
-   arch_wb_cache_pmem(addr + bytes - 1, 1);
-   }
-   } else
-   arch_wb_cache_pmem(addr, bytes);
-
-   return len;
+   return arch_copy_from_iter_pmem(addr, bytes, i);
 }
 
 static const struct block_device_operations pmem_fops = {
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 5900c1b7..574b63fb5376 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -3,11 +3,13 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void arch_invalidate_pmem(void *addr, size_t size);
+size_t arch_copy_from_iter_pmem(void *addr, size_t bytes, struct iov_iter *i);
 #else
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
@@ -15,6 +17,11 @@ static inline void arch_wb_cache_pmem(void *addr, size_t 
size)
 static inline void arch_invalidate_pmem(void *addr, size_t size)
 {
 }
+static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
+   struct iov_iter *i)
+{
+   return copy_from_iter_nocache(addr, bytes, i);
+}
 #endif
 
 /* this definition is in it's own header for tools/testing/nvdimm to consume */
diff --git a/drivers/nvdimm/x86.c b/drivers/nvdimm/x86.c
index d99b452332a9..bc145d760d43 100644
--- a/drivers/nvdimm/x86.c
+++ b/drivers/nvdimm/x86.c
@@ -10,6 +10,9 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -105,3 +108,48 @@ void arch_memcpy_to_pmem(void *_dst, void *_src, unsigned 
size)
}
 }
 EXPORT_SYMBOL_GPL(arch_memcpy_to_pmem);
+
+static int pmem_from_user(void *dst, const void __user *src, unsigned size)
+{
+  
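
The callback-table idea behind copy_from_iter_ops() can be sketched loosely
in userspace C; the two-callback table and its names below are invented for
illustration, standing in for the per-segment-type routines the real
interface takes:

#include <stdio.h>
#include <string.h>

struct copy_ops {
	size_t (*from_user)(void *dst, const void *src, size_t n);
	size_t (*from_kernel)(void *dst, const void *src, size_t n);
};

static size_t plain_copy(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return n;
}

/* stand-in for a non-temporal / cache-bypassing copy */
static size_t nt_copy(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);	/* pretend these were movnt stores */
	return n;
}

static size_t copy_with_ops(void *dst, const void *src, size_t n,
			    int src_is_user, const struct copy_ops *ops)
{
	return src_is_user ? ops->from_user(dst, src, n)
			   : ops->from_kernel(dst, src, n);
}

static const struct copy_ops default_ops = {
	.from_user = plain_copy,
	.from_kernel = plain_copy,
};

static const struct copy_ops pmem_ops = {
	.from_user = nt_copy,
	.from_kernel = nt_copy,
};

int main(void)
{
	char dst[16];

	copy_with_ops(dst, "cached", 7, 1, &default_ops);
	puts(dst);
	copy_with_ops(dst, "bypass", 7, 0, &pmem_ops);
	puts(dst);
	return 0;
}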

[resend PATCH v2 28/33] x86, libnvdimm, dax: stop abusing __copy_user_nocache

2017-04-17 Thread Dan Williams
The pmem and nd_blk drivers both have need to copy data through the cpu
cache to persistent memory. To date they have been abusing
__copy_user_nocache through the memcpy_to_pmem abstraction, but this has
several problems:

* __copy_user_nocache does not guarantee that it will always avoid the
  cache. While we have fixed the cases where pmem usage might
  trigger that behavior, it is a fragile assumption that burdens the
  uaccess.h implementation with worrying about the distinction between
  'nocache' and the stricter write-through semantic needed by pmem.
  Quoting Linus: "Quite frankly, the whole "memcpy_nocache()" idea or
  (ab-)using copy_user_nocache() just needs to die. ... If some driver
  ends up using "movnt" by hand, that is up to that *driver*."

* It implements SMAP (supervisor mode access protection) which is only
  meant for user copies.

* It expects faults. For in-kernel copies, faults are fatal and we
  should not be coding for exception handling in that case.

__arch_memcpy_to_pmem() is effectively a copy of __copy_user_nocache()
minus SMAP, unaligned support, and exception handling. The configuration
symbol ARCH_HAS_PMEM_API is also moved local to libnvdimm to be next to
the implementation.

Cc: <x...@kernel.org>
Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Toshi Kani <toshi.k...@hpe.com>
Cc: Tony Luck <tony.l...@intel.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: Al Viro <v...@zeniv.linux.org.uk>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Oliver O'Halloran <ooh...@gmail.com>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 MAINTAINERS |2 -
 arch/x86/Kconfig|1 -
 arch/x86/include/asm/pmem.h |   48 -
 drivers/acpi/nfit/core.c|3 +-
 drivers/nvdimm/Kconfig  |4 ++
 drivers/nvdimm/claim.c  |4 +-
 drivers/nvdimm/namespace_devs.c |1 -
 drivers/nvdimm/pmem.c   |4 +-
 drivers/nvdimm/region_devs.c|1 -
 drivers/nvdimm/x86.c|   65 +++
 fs/dax.c|1 -
 include/linux/libnvdimm.h   |9 +
 include/linux/pmem.h|   59 ---
 lib/Kconfig |3 --
 14 files changed, 83 insertions(+), 122 deletions(-)
 delete mode 100644 arch/x86/include/asm/pmem.h
 delete mode 100644 include/linux/pmem.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 819d5e8b668a..1c4da1bebd7c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7458,8 +7458,6 @@ L:linux-nvd...@lists.01.org
 Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
 S: Supported
 F: drivers/nvdimm/pmem.c
-F: include/linux/pmem.h
-F: arch/*/include/asm/pmem.h
 
 LIGHTNVM PLATFORM SUPPORT
 M: Matias Bjorling <m...@lightnvm.io>
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..d377da696903 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -53,7 +53,6 @@ config X86
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOVif X86_64
select ARCH_HAS_MMIO_FLUSH
-   select ARCH_HAS_PMEM_APIif X86_64
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
deleted file mode 100644
index ded2541a7ba9..
--- a/arch/x86/include/asm/pmem.h
+++ /dev/null
@@ -1,48 +0,0 @@
-/*
- * Copyright(c) 2015 Intel Corporation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of version 2 of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- */
-#ifndef __ASM_X86_PMEM_H__
-#define __ASM_X86_PMEM_H__
-
-#include 
-#include 
-#include 
-#include 
-
-#ifdef CONFIG_ARCH_HAS_PMEM_API
-/**
- * arch_memcpy_to_pmem - copy data to persistent memory
- * @dst: destination buffer for the copy
- * @src: source buffer for the copy
- * @n: length of the copy in bytes
- *
- * Copy data to persistent memory media via non-temporal stores so that
- * a subsequent pmem driver flush operation will drain posted write queues.
- */
-static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
-{
-   int rem;
-
-   /*
-* 
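
For illustration only, a minimal userspace example of the non-temporal store
pattern the driver-local copy wants (movnti followed by sfence), without
__copy_user_nocache's SMAP and fault handling; it assumes x86_64 with SSE2,
an 8-byte aligned destination and a length that is a multiple of 8, and it
is not the kernel routine:

#include <emmintrin.h>
#include <stdio.h>
#include <string.h>

static void nt_copy64(void *dst, const void *src, size_t n)
{
	long long *d = dst;
	const long long *s = src;

	for (size_t i = 0; i < n / 8; i++)
		_mm_stream_si64(&d[i], s[i]);	/* movnti: bypasses the cache */
	_mm_sfence();				/* drain the write-combining buffers */
}

int main(void)
{
	long long src[8] = {0}, dst[8] = {0};

	memcpy(src, "non-temporal copy demo", 23);
	nt_copy64(dst, src, sizeof(src));
	puts((char *)dst);
	return 0;
}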

[resend PATCH v2 31/33] libnvdimm, nfit: enable support for volatile ranges

2017-04-17 Thread Dan Williams
Allow volatile nfit ranges to participate in all the same infrastructure
provided for persistent memory regions. The resulting namespace
device will still be called "pmem", but the parent region type will be
"nd_volatile". This is in preparation for disabling the dax ->flush()
operation in the pmem driver when it is hosted on a volatile range.

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/acpi/nfit/core.c|9 -
 drivers/nvdimm/bus.c|   10 +-
 drivers/nvdimm/core.c   |2 +-
 drivers/nvdimm/dax_devs.c   |2 +-
 drivers/nvdimm/dimm_devs.c  |2 +-
 drivers/nvdimm/namespace_devs.c |8 
 drivers/nvdimm/nd-core.h|9 +
 drivers/nvdimm/pfn_devs.c   |4 ++--
 drivers/nvdimm/region_devs.c|   27 ++-
 9 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 8b4c6212737c..6ac31846c4df 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2162,6 +2162,13 @@ static bool nfit_spa_is_virtual(struct 
acpi_nfit_system_address *spa)
nfit_spa_type(spa) == NFIT_SPA_PCD);
 }
 
+static bool nfit_spa_is_volatile(struct acpi_nfit_system_address *spa)
+{
+   return (nfit_spa_type(spa) == NFIT_SPA_VDISK ||
+   nfit_spa_type(spa) == NFIT_SPA_VCD   ||
+   nfit_spa_type(spa) == NFIT_SPA_VOLATILE);
+}
+
 static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
struct nfit_spa *nfit_spa)
 {
@@ -2236,7 +2243,7 @@ static int acpi_nfit_register_region(struct 
acpi_nfit_desc *acpi_desc,
ndr_desc);
if (!nfit_spa->nd_region)
rc = -ENOMEM;
-   } else if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE) {
+   } else if (nfit_spa_is_volatile(spa)) {
nfit_spa->nd_region = nvdimm_volatile_region_create(nvdimm_bus,
ndr_desc);
if (!nfit_spa->nd_region)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 351bac8f6503..d4173fbdba28 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -37,13 +37,13 @@ static int to_nd_device_type(struct device *dev)
 {
if (is_nvdimm(dev))
return ND_DEVICE_DIMM;
-   else if (is_nd_pmem(dev))
+   else if (is_memory(dev))
return ND_DEVICE_REGION_PMEM;
else if (is_nd_blk(dev))
return ND_DEVICE_REGION_BLK;
else if (is_nd_dax(dev))
return ND_DEVICE_DAX_PMEM;
-   else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
+   else if (is_nd_region(dev->parent))
return nd_region_to_nstype(to_nd_region(dev->parent));
 
return 0;
@@ -55,7 +55,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct 
kobj_uevent_env *env)
 * Ensure that region devices always have their numa node set as
 * early as possible.
 */
-   if (is_nd_pmem(dev) || is_nd_blk(dev))
+   if (is_nd_region(dev))
set_dev_node(dev, to_nd_region(dev)->numa_node);
return add_uevent_var(env, "MODALIAS=" ND_DEVICE_MODALIAS_FMT,
to_nd_device_type(dev));
@@ -64,7 +64,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct 
kobj_uevent_env *env)
 static struct module *to_bus_provider(struct device *dev)
 {
/* pin bus providers while regions are enabled */
-   if (is_nd_pmem(dev) || is_nd_blk(dev)) {
+   if (is_nd_region(dev)) {
struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
 
return nvdimm_bus->nd_desc->module;
@@ -771,7 +771,7 @@ void wait_nvdimm_bus_probe_idle(struct device *dev)
 
 static int pmem_active(struct device *dev, void *data)
 {
-   if (is_nd_pmem(dev) && dev->driver)
+   if (is_memory(dev) && dev->driver)
return -EBUSY;
return 0;
 }
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 9303cfeb8bee..875ef4cecb35 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -504,7 +504,7 @@ void nvdimm_badblocks_populate(struct nd_region *nd_region,
struct nvdimm_bus *nvdimm_bus;
struct list_head *poison_list;
 
-   if (!is_nd_pmem(&nd_region->dev)) {
+   if (!is_memory(&nd_region->dev)) {
	dev_WARN_ONCE(&nd_region->dev, 1,
"%s only valid for pmem regions\n", __func__);
return;
diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
index 45fa82cae87c..6a92b84
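
Restating nfit_spa_is_volatile() from the hunk above as a standalone check
(the enum values are placeholders, not the ACPI-defined numbers):
virtual-disk, virtual-CD and volatile SPA ranges all map to the nd_volatile
region type.

#include <stdbool.h>
#include <stdio.h>

enum spa_type { SPA_PM, SPA_VOLATILE, SPA_VDISK, SPA_VCD, SPA_PDISK, SPA_PCD };

static bool spa_is_volatile(enum spa_type t)
{
	return t == SPA_VDISK || t == SPA_VCD || t == SPA_VOLATILE;
}

int main(void)
{
	printf("VDISK volatile? %d\n", spa_is_volatile(SPA_VDISK));
	printf("PM volatile?    %d\n", spa_is_volatile(SPA_PM));
	return 0;
}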

[resend PATCH v2 30/33] libnvdimm, pmem: fix persistence warning

2017-04-17 Thread Dan Williams
The pmem driver assumes that if platform firmware describes the memory
devices associated with a persistent memory range, and
CONFIG_ARCH_HAS_PMEM_API=y, then it has all the mechanisms necessary to
flush data to a power-fail safe zone. We warn if the firmware does not
describe memory devices, but we also need to warn if the architecture
does not claim pmem support.

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/nvdimm/region_devs.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 307a48060aa3..5976f6c0407f 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -970,8 +970,9 @@ int nvdimm_has_flush(struct nd_region *nd_region)
	struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
int i;
 
-   /* no nvdimm == flushing capability unknown */
-   if (nd_region->ndr_mappings == 0)
+   /* no nvdimm or pmem api == flushing capability unknown */
+   if (nd_region->ndr_mappings == 0
+   || !IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API))
return -ENXIO;
 
for (i = 0; i < nd_region->ndr_mappings; i++)



[resend PATCH v2 33/33] libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region

2017-04-17 Thread Dan Williams
The pmem driver attaches to both persistent and volatile memory ranges
advertised by the ACPI NFIT. When the region is volatile it is redundant
to spend cycles flushing caches at fsync(). Check if the hosting region
is volatile and do not set QUEUE_FLAG_WC if it is.

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/nvdimm/pmem.c|9 +++--
 drivers/nvdimm/region_devs.c |6 ++
 include/linux/libnvdimm.h|1 +
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b000c6db5731..42876a75dab8 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -275,6 +275,7 @@ static int pmem_attach_disk(struct device *dev,
struct vmem_altmap __altmap, *altmap = NULL;
struct resource *res = >res;
struct nd_pfn *nd_pfn = NULL;
+   int has_flush, fua = 0, wbc;
struct dax_device *dax_dev;
int nid = dev_to_node(dev);
struct nd_pfn_sb *pfn_sb;
@@ -302,8 +303,12 @@ static int pmem_attach_disk(struct device *dev,
dev_set_drvdata(dev, pmem);
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
-   if (nvdimm_has_flush(nd_region) < 0)
+   has_flush = nvdimm_has_flush(nd_region);
+   if (has_flush < 0)
dev_warn(dev, "unable to guarantee persistence of writes\n");
+   else
+   fua = has_flush;
+   wbc = nvdimm_has_cache(nd_region);
 
if (!devm_request_mem_region(dev, res->start, resource_size(res),
dev_name(>dev))) {
@@ -344,7 +349,7 @@ static int pmem_attach_disk(struct device *dev,
return PTR_ERR(addr);
pmem->virt_addr = addr;
 
-   blk_queue_write_cache(q, true, true);
+   blk_queue_write_cache(q, wbc, fua);
blk_queue_make_request(q, pmem_make_request);
blk_queue_physical_block_size(q, PAGE_SIZE);
blk_queue_max_hw_sectors(q, UINT_MAX);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 2df259010720..a085f7094b76 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -989,6 +989,12 @@ int nvdimm_has_flush(struct nd_region *nd_region)
 }
 EXPORT_SYMBOL_GPL(nvdimm_has_flush);
 
+int nvdimm_has_cache(struct nd_region *nd_region)
+{
+   return is_nd_pmem(_region->dev);
+}
+EXPORT_SYMBOL_GPL(nvdimm_has_cache);
+
 void __exit nd_region_devs_exit(void)
 {
	ida_destroy(&region_ida);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index a98004745768..b733030107bb 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -162,6 +162,7 @@ void nd_region_release_lane(struct nd_region *nd_region, 
unsigned int lane);
 u64 nd_fletcher64(void *addr, size_t len, bool le);
 void nvdimm_flush(struct nd_region *nd_region);
 int nvdimm_has_flush(struct nd_region *nd_region);
+int nvdimm_has_cache(struct nd_region *nd_region);
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_memcpy_to_pmem(void *dst, void *src, unsigned size);
 #define ARCH_MEMREMAP_PMEM MEMREMAP_WB
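
The resulting write-cache/FUA decision can be condensed into a small table;
the sketch below is a plain C restatement of the pmem_attach_disk() logic
above, not kernel code:

#include <stdio.h>
#include <stdbool.h>

static void queue_flags(bool region_is_pmem, int has_flush,
			bool *wc, bool *fua)
{
	*fua = has_flush > 0;		/* flush hints present */
	*wc = region_is_pmem;		/* volatile regions skip the write cache */
	if (has_flush < 0)
		fprintf(stderr, "unable to guarantee persistence of writes\n");
}

int main(void)
{
	bool wc, fua;

	queue_flags(true, 1, &wc, &fua);	/* pmem + flush hints */
	printf("pmem/hints: wc=%d fua=%d\n", wc, fua);
	queue_flags(false, 0, &wc, &fua);	/* volatile range */
	printf("volatile:   wc=%d fua=%d\n", wc, fua);
	return 0;
}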



[resend PATCH v2 25/33] x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush

2017-04-17 Thread Dan Williams
The clear_pmem() helper simply combines a memset() with a cache flush.
Now that the flush routine is optionally provided by the dax device
driver we can avoid unnecessary cache management on dax devices fronting
volatile memory.

With clear_pmem() gone we can follow on with a patch to make pmem cache
management completely defined within the pmem driver.

Cc: <x...@kernel.org>
Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/x86/include/asm/pmem.h |   13 -
 fs/dax.c|3 ++-
 include/linux/pmem.h|   21 -
 3 files changed, 2 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 60e8edbe0205..f4c119d253f3 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -65,19 +65,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t 
size)
clwb(p);
 }
 
-/**
- * arch_clear_pmem - zero a PMEM memory range
- * @addr:  virtual start address
- * @size:  number of bytes to zero
- *
- * Write zeros into the memory range starting at 'addr' for 'size' bytes.
- */
-static inline void arch_clear_pmem(void *addr, size_t size)
-{
-   memset(addr, 0, size);
-   arch_wb_cache_pmem(addr, size);
-}
-
 static inline void arch_invalidate_pmem(void *addr, size_t size)
 {
clflush_cache_range(addr, size);
diff --git a/fs/dax.c b/fs/dax.c
index edbf988de86c..edee7e8298bc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -982,7 +982,8 @@ int __dax_zero_page_range(struct block_device *bdev,
dax_read_unlock(id);
return rc;
}
-   clear_pmem(kaddr + offset, size);
+   memset(kaddr + offset, 0, size);
+   dax_flush(dax_dev, pgoff, kaddr + offset, size);
dax_read_unlock(id);
}
return 0;
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 9d542a5600e4..772bd02a5b52 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,11 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void 
*src, size_t n)
BUG();
 }
 
-static inline void arch_clear_pmem(void *addr, size_t size)
-{
-   BUG();
-}
-
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
BUG();
@@ -73,22 +68,6 @@ static inline void memcpy_to_pmem(void *dst, const void 
*src, size_t n)
 }
 
 /**
- * clear_pmem - zero a PMEM memory range
- * @addr:  virtual start address
- * @size:  number of bytes to zero
- *
- * Write zeros into the memory range starting at 'addr' for 'size' bytes.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline void clear_pmem(void *addr, size_t size)
-{
-   if (arch_has_pmem_api())
-   arch_clear_pmem(addr, size);
-   else
-   memset(addr, 0, size);
-}
-
-/**
  * invalidate_pmem - flush a pmem range from the cache hierarchy
  * @addr:  virtual start address
  * @size:  bytes to invalidate (internally aligned to cache line size)



[resend PATCH v2 15/33] filesystem-dax: convert to dax_direct_access()

2017-04-17 Thread Dan Williams
Now that a dax_device is plumbed through all dax-capable drivers we can
switch from block_device_operations to dax_operations for invoking
->direct_access.

This also lets us kill off some usages of struct blk_dax_ctl on the way
to its eventual removal.

Suggested-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/dax.c|  277 +--
 fs/iomap.c  |3 -
 include/linux/dax.h |6 +
 3 files changed, 162 insertions(+), 124 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b78a6947c4f5..ce9dc9c3e829 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -55,32 +55,6 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
-{
-   struct request_queue *q = bdev->bd_queue;
-   long rc = -EIO;
-
-   dax->addr = ERR_PTR(-EIO);
-   if (blk_queue_enter(q, true) != 0)
-   return rc;
-
-   rc = bdev_direct_access(bdev, dax);
-   if (rc < 0) {
-   dax->addr = ERR_PTR(rc);
-   blk_queue_exit(q);
-   return rc;
-   }
-   return rc;
-}
-
-static void dax_unmap_atomic(struct block_device *bdev,
-   const struct blk_dax_ctl *dax)
-{
-   if (IS_ERR(dax->addr))
-   return;
-   blk_queue_exit(bdev->bd_queue);
-}
-
 static int dax_is_pmd_entry(void *entry)
 {
return (unsigned long)entry & RADIX_DAX_PMD;
@@ -553,21 +527,30 @@ static int dax_load_hole(struct address_space *mapping, 
void **entry,
return ret;
 }
 
-static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t 
size,
-   struct page *to, unsigned long vaddr)
+static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
+   sector_t sector, size_t size, struct page *to,
+   unsigned long vaddr)
 {
-   struct blk_dax_ctl dax = {
-   .sector = sector,
-   .size = size,
-   };
-   void *vto;
-
-   if (dax_map_atomic(bdev, &dax) < 0)
-   return PTR_ERR(dax.addr);
+   void *vto, *kaddr;
+   pgoff_t pgoff;
+   pfn_t pfn;
+   long rc;
+   int id;
+
+   rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+   if (rc)
+   return rc;
+
+   id = dax_read_lock();
+   rc = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), &kaddr, &pfn);
+   if (rc < 0) {
+   dax_read_unlock(id);
+   return rc;
+   }
vto = kmap_atomic(to);
-   copy_user_page(vto, (void __force *)dax.addr, vaddr, to);
+   copy_user_page(vto, (void __force *)kaddr, vaddr, to);
kunmap_atomic(vto);
-   dax_unmap_atomic(bdev, &dax);
+   dax_read_unlock(id);
return 0;
 }
 
@@ -735,12 +718,16 @@ static void dax_mapping_entry_mkclean(struct 
address_space *mapping,
 }
 
 static int dax_writeback_one(struct block_device *bdev,
-   struct address_space *mapping, pgoff_t index, void *entry)
+   struct dax_device *dax_dev, struct address_space *mapping,
+   pgoff_t index, void *entry)
 {
	struct radix_tree_root *page_tree = &mapping->page_tree;
-   struct blk_dax_ctl dax;
-   void *entry2, **slot;
-   int ret = 0;
+   void *entry2, **slot, *kaddr;
+   long ret = 0, id;
+   sector_t sector;
+   pgoff_t pgoff;
+   size_t size;
+   pfn_t pfn;
 
/*
 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -789,26 +776,29 @@ static int dax_writeback_one(struct block_device *bdev,
 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
 * worry about partial PMD writebacks.
 */
-   dax.sector = dax_radix_sector(entry);
-   dax.size = PAGE_SIZE << dax_radix_order(entry);
+   sector = dax_radix_sector(entry);
+   size = PAGE_SIZE << dax_radix_order(entry);
+
+   id = dax_read_lock();
+   ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+   if (ret)
+   goto dax_unlock;
 
/*
-* We cannot hold tree_lock while calling dax_map_atomic() because it
-* eventually calls cond_resched().
+* dax_direct_access() may sleep, so cannot hold tree_lock over
+* its invocation.
 */
-   ret = dax_map_atomic(bdev, &dax);
-   if (ret < 0) {
-   put_locked_mapping_entry(mapping, index, entry);
-   return ret;
-   }
+   ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
+   if (ret < 0)
+   goto dax_unlock;
 
-   if (WARN_ON_ONCE(ret < dax.size)) {
+   if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
ret = -EIO;
-   goto unmap;
+   goto dax_unlock;
}
 
-   dax_mapping_entry_mkclean(mapping, i
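
The sector-to-pgoff translation that bdev_dax_pgoff() performs for these
converted paths can be restated as standalone arithmetic; the partition
start below is an invented example value and the check mirrors, rather than
reproduces, the in-kernel helper:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE    4096ULL
#define SECTOR_SIZE  512ULL
#define PAGE_SECTORS (PAGE_SIZE / SECTOR_SIZE)

static int sector_to_pgoff(uint64_t part_start, uint64_t sector,
			   uint64_t size, uint64_t *pgoff)
{
	uint64_t abs = part_start + sector;

	/* the whole request must be page aligned to hand out a pfn */
	if ((abs % PAGE_SECTORS) || (size % PAGE_SIZE))
		return -1;
	*pgoff = abs / PAGE_SECTORS;
	return 0;
}

int main(void)
{
	uint64_t pgoff;

	if (sector_to_pgoff(2048, 8, PAGE_SIZE, &pgoff) == 0)
		printf("pgoff = %llu\n", (unsigned long long)pgoff);
	return 0;
}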

[resend PATCH v2 11/33] dm: add dax_device and dax_operations support

2017-04-17 Thread Dan Williams
Allocate a dax_device to represent the capacity of a device-mapper
instance. Provide a ->direct_access() method via the new dax_operations
indirection that mirrors the functionality of the current direct_access
support via block_device_operations.  Once fs/dax.c has been converted
to use dax_operations the old dm_blk_direct_access() will be removed.

A new helper dm_dax_get_live_target() is introduced to separate some of
the dm-specifics from the direct_access implementation.

This enabling is only for the top-level dm representation to upper
layers. Converting target direct_access implementations is deferred to a
separate patch.

Cc: Toshi Kani <toshi.k...@hpe.com>
Cc: Mike Snitzer <snit...@redhat.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/md/Kconfig|1 
 drivers/md/dm-core.h  |1 
 drivers/md/dm.c   |   84 ++---
 include/linux/device-mapper.h |1 
 4 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index b7767da50c26..1de8372d9459 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
 config BLK_DEV_DM
tristate "Device mapper support"
select BLK_DEV_DM_BUILTIN
+   select DAX
---help---
  Device-mapper is a low level volume manager.  It works by allowing
  people to specify mappings for ranges of logical sectors.  Various
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 136fda3ff9e5..538630190f66 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -58,6 +58,7 @@ struct mapped_device {
struct target_type *immutable_target_type;
 
struct gendisk *disk;
+   struct dax_device *dax_dev;
char name[16];
 
void *interface_ptr;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb75979e455..bd56dfe43a99 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -908,31 +909,68 @@ int dm_set_target_max_io_len(struct dm_target *ti, 
sector_t len)
 }
 EXPORT_SYMBOL_GPL(dm_set_target_max_io_len);
 
-static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
-void **kaddr, pfn_t *pfn, long size)
+static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
+   sector_t sector, int *srcu_idx)
 {
-   struct mapped_device *md = bdev->bd_disk->private_data;
struct dm_table *map;
struct dm_target *ti;
-   int srcu_idx;
-   long len, ret = -EIO;
 
-   map = dm_get_live_table(md, &srcu_idx);
+   map = dm_get_live_table(md, srcu_idx);
if (!map)
-   goto out;
+   return NULL;
 
ti = dm_table_find_target(map, sector);
if (!dm_target_is_valid(ti))
-   goto out;
+   return NULL;
 
-   len = max_io_len(sector, ti) << SECTOR_SHIFT;
-   size = min(len, size);
+   return ti;
+}
 
-   if (ti->type->direct_access)
-   ret = ti->type->direct_access(ti, sector, kaddr, pfn, size);
-out:
+static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
+{
+   struct mapped_device *md = dax_get_private(dax_dev);
+   sector_t sector = pgoff * PAGE_SECTORS;
+   struct dm_target *ti;
+   long len, ret = -EIO;
+   int srcu_idx;
+
+   ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+   if (!ti)
+   goto out;
+   if (!ti->type->direct_access)
+   goto out;
+   len = max_io_len(sector, ti) / PAGE_SECTORS;
+   if (len < 1)
+   goto out;
+   nr_pages = min(len, nr_pages);
+   if (ti->type->direct_access) {
+   ret = ti->type->direct_access(ti, sector, kaddr, pfn,
+   nr_pages * PAGE_SIZE);
+   /*
+* FIXME: convert ti->type->direct_access to return
+* nr_pages directly.
+*/
+   if (ret >= 0)
+   ret /= PAGE_SIZE;
+   }
+ out:
dm_put_live_table(md, srcu_idx);
-   return min(ret, size);
+
+   return ret;
+}
+
+static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   struct mapped_device *md = bdev->bd_disk->private_data;
+   struct dax_device *dax_dev = md->dax_dev;
+   long nr_pages = size / PAGE_SIZE;
+
+   nr_pages = dm_dax_direct_access(dax_dev, sector / PAGE_SECTORS,
+   nr_pages, kaddr, pfn);
+   return nr_pages < 0 ? nr_pages : nr_pages * PAGE_SIZE;
 }
 
 /*
@@ -1437,6 +1475,7 @@ static int next_free_minor(int *minor)
 }
 
 static

[resend PATCH v2 09/33] block: kill bdev_dax_capable()

2017-04-17 Thread Dan Williams
This is leftover dead code that has since been replaced by
bdev_dax_supported().

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/block_dev.c |   24 
 include/linux/blkdev.h |1 -
 2 files changed, 25 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..7f40ea2f0875 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -807,30 +807,6 @@ int bdev_dax_supported(struct super_block *sb, int 
blocksize)
 }
 EXPORT_SYMBOL_GPL(bdev_dax_supported);
 
-/**
- * bdev_dax_capable() - Return if the raw device is capable for dax
- * @bdev: The device for raw block device access
- */
-bool bdev_dax_capable(struct block_device *bdev)
-{
-   struct blk_dax_ctl dax = {
-   .size = PAGE_SIZE,
-   };
-
-   if (!IS_ENABLED(CONFIG_FS_DAX))
-   return false;
-
-   dax.sector = 0;
-   if (bdev_direct_access(bdev, &dax) < 0)
-   return false;
-
-   dax.sector = bdev->bd_part->nr_sects - (PAGE_SIZE / 512);
-   if (bdev_direct_access(bdev, &dax) < 0)
-   return false;
-
-   return true;
-}
-
 /*
  * pseudo-fs
  */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5a7da607ca04..f72708399b83 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1958,7 +1958,6 @@ extern int bdev_write_page(struct block_device *, 
sector_t, struct page *,
struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 extern int bdev_dax_supported(struct super_block *, int);
-extern bool bdev_dax_capable(struct block_device *);
 #else /* CONFIG_BLOCK */
 
 struct block_device;



[resend PATCH v2 04/33] dax: introduce dax_operations

2017-04-17 Thread Dan Williams
Track a set of dax_operations per dax_device that can be set at
alloc_dax() time. These operations will be used to stop the abuse of
block_device_operations for communicating dax capabilities to
filesystems. It will also be used to replace the "pmem api" and move
pmem-specific cache maintenance, and other dax-driver-specific
filesystem-dax operations, to dax device methods. In particular this
allows us to stop abusing __copy_user_nocache(), via memcpy_to_pmem(),
with a driver specific replacement.

This is a standalone introduction of the operations. Follow on patches
convert each dax-driver and teach fs/dax.c to use ->direct_access() from
dax_operations instead of block_device_operations.

Suggested-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/dax/dax.h|4 +++-
 drivers/dax/device.c |6 +-
 drivers/dax/super.c  |6 +-
 include/linux/dax.h  |   10 ++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index 246a24d68d4c..617bbc24be2b 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,7 +13,9 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_device;
-struct dax_device *alloc_dax(void *private, const char *host);
+struct dax_operations;
+struct dax_device *alloc_dax(void *private, const char *host,
+   const struct dax_operations *ops);
 void put_dax(struct dax_device *dax_dev);
 bool dax_alive(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index db68f4fa8ce0..a0db055054a4 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -645,7 +645,11 @@ struct dev_dax *devm_create_dev_dax(struct dax_region 
*dax_region,
goto err_id;
}
 
-   dax_dev = alloc_dax(dev_dax, NULL);
+   /*
+* No 'host' or dax_operations since there is no access to this
+* device outside of mmap of the resulting character device.
+*/
+   dax_dev = alloc_dax(dev_dax, NULL, NULL);
if (!dax_dev)
goto err_dax;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index bb22956a106b..45ccfc043da8 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 
 static int nr_dax = CONFIG_NR_DEV_DAX;
@@ -61,6 +62,7 @@ struct dax_device {
const char *host;
void *private;
bool alive;
+   const struct dax_operations *ops;
 };
 
 bool dax_alive(struct dax_device *dax_dev)
@@ -204,7 +206,8 @@ static void dax_add_host(struct dax_device *dax_dev, const 
char *host)
spin_unlock(&dax_host_lock);
 }
 
-struct dax_device *alloc_dax(void *private, const char *__host)
+struct dax_device *alloc_dax(void *private, const char *__host,
+   const struct dax_operations *ops)
 {
struct dax_device *dax_dev;
const char *host;
@@ -225,6 +228,7 @@ struct dax_device *alloc_dax(void *private, const char 
*__host)
goto err_dev;
 
dax_add_host(dax_dev, host);
+   dax_dev->ops = ops;
dax_dev->private = private;
return dax_dev;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9b2d5ba10d7d..74ebb92b625a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -7,6 +7,16 @@
 #include 
 
 struct iomap_ops;
+struct dax_device;
+struct dax_operations {
+   /*
+* direct_access: translate a device-relative
+* logical-page-offset into an absolute physical pfn. Return the
+* number of pages available for DAX at that pfn.
+*/
+   long (*direct_access)(struct dax_device *, pgoff_t, long,
+   void **, pfn_t *);
+};
 
 int dax_read_lock(void);
 void dax_read_unlock(int id);



[resend PATCH v2 08/33] dcssblk: add dax_operations support

2017-04-17 Thread Dan Williams
Setup a dax_dev to have the same lifetime as the dcssblk block device
and add a ->direct_access() method that is equivalent to
dcssblk_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old dcssblk_direct_access() will be removed.

Cc: Gerald Schaefer <gerald.schae...@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/s390/block/Kconfig   |1 +
 drivers/s390/block/dcssblk.c |   54 +++---
 2 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index 4a3b62326183..0acb8c2f9475 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -14,6 +14,7 @@ config BLK_DEV_XPRAM
 
 config DCSSBLK
def_tristate m
+   select DAX
prompt "DCSSBLK support"
depends on S390 && BLOCK
help
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 415d10a67b7a..682a9eb4934d 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 #include 
 
@@ -30,8 +31,10 @@ static int dcssblk_open(struct block_device *bdev, fmode_t 
mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static blk_qc_t dcssblk_make_request(struct request_queue *q,
struct bio *bio);
-static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+static long dcssblk_blk_direct_access(struct block_device *bdev, sector_t 
secnum,
 void **kaddr, pfn_t *pfn, long size);
+static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t 
pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -40,7 +43,11 @@ static const struct block_device_operations dcssblk_devops = 
{
.owner  = THIS_MODULE,
.open   = dcssblk_open,
.release= dcssblk_release,
-   .direct_access  = dcssblk_direct_access,
+   .direct_access  = dcssblk_blk_direct_access,
+};
+
+static const struct dax_operations dcssblk_dax_ops = {
+   .direct_access = dcssblk_dax_direct_access,
 };
 
 struct dcssblk_dev_info {
@@ -57,6 +64,7 @@ struct dcssblk_dev_info {
struct request_queue *dcssblk_queue;
int num_of_segments;
struct list_head seg_list;
+   struct dax_device *dax_dev;
 };
 
 struct segment_info {
@@ -389,6 +397,8 @@ dcssblk_shared_store(struct device *dev, struct 
device_attribute *attr, const ch
}
list_del(&dev_info->lh);
 
+   kill_dax(dev_info->dax_dev);
+   put_dax(dev_info->dax_dev);
del_gendisk(dev_info->gd);
blk_cleanup_queue(dev_info->dcssblk_queue);
dev_info->gd->queue = NULL;
@@ -525,6 +535,7 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
+   struct dax_device *dax_dev;
char *local_buf;
unsigned long seg_byte_size;
 
@@ -654,6 +665,11 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
if (rc)
goto put_dev;
 
+   dax_dev = alloc_dax(dev_info, dev_info->gd->disk_name,
+   &dcssblk_dax_ops);
+   if (!dax_dev)
+   goto put_dev;
+
get_device(&dev_info->dev);
device_add_disk(&dev_info->dev, dev_info->gd);
 
@@ -752,6 +768,8 @@ dcssblk_remove_store(struct device *dev, struct 
device_attribute *attr, const ch
}
 
list_del(&dev_info->lh);
+   kill_dax(dev_info->dax_dev);
+   put_dax(dev_info->dax_dev);
del_gendisk(dev_info->gd);
blk_cleanup_queue(dev_info->dcssblk_queue);
dev_info->gd->queue = NULL;
@@ -883,21 +901,39 @@ dcssblk_make_request(struct request_queue *q, struct bio 
*bio)
 }
 
 static long
-dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
+__dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
+{
+   resource_size_t offset = pgoff * PAGE_SIZE;
+   unsigned long dev_sz;
+
+   dev_sz = dev_info->end - dev_info->start + 1;
+   *kaddr = (void *) dev_info->start + offset;
+   *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
+
+   return (dev_sz - offset) / PAGE_SIZE;
+}
+
+static long
+dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
void **kaddr, pfn_t *pfn, long size)
 {
struct dcssblk_dev_info *dev_info;
-   unsigned long offset, dev_sz;
 
dev_info = bdev->bd_disk->private_data;
if (!dev_info)
 

[resend PATCH v2 00/33] dax: introduce dax_operations

2017-04-17 Thread Dan Williams
[ resend to add dm-devel, linux-block, and fs-devel, apologies for the
duplicates ]

Changes since v1 [1] and the dax-fs RFC [2]:
* rename struct dax_inode to struct dax_device (Christoph)
* rewrite arch_memcpy_to_pmem() in C with inline asm
* use QUEUE_FLAG_WC to gate dax cache management (Jeff)
* add device-mapper plumbing for the ->copy_from_iter() and ->flush()
  dax_operations
* kill struct blk_dax_ctl and bdev_direct_access (Christoph)
* cleanup the ->direct_access() calling convention to be page based
  (Christoph)
* introduce dax_get_by_host() and don't pollute struct super_block with
  dax_device details (Christoph)

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008586.html
[2]: https://lwn.net/Articles/713064/

---
A few months back, in the course of reviewing the memcpy_nocache()
proposal from Brian, Linus proposed that the pmem specific
memcpy_to_pmem() routine be moved to be implemented at the driver level
[3]:

   "Quite frankly, the whole 'memcpy_nocache()' idea or (ab-)using
copy_user_nocache() just needs to die. It's idiotic.

As you point out, it's also fundamentally buggy crap.

Throw it away. There is no possible way this is ever valid or
portable. We're not going to lie and claim that it is.

If some driver ends up using 'movnt' by hand, that is up to that
*driver*. But no way in hell should we care about this one whit in
the sense of ."

This feedback also dovetails with another fs/dax.c design wart of being
hard coded to assume the backing device is pmem. We call the pmem
specific copy, clear, and flush routines even if the backing device
driver is one of the other 3 dax drivers (axonram, dcssblk, or brd).
There is no reason to spend cpu cycles flushing the cache after writing
to brd, for example, since it is using volatile memory for storage.

Moreover, the pmem driver might be fronting a volatile memory range
published by the ACPI NFIT, or the platform might have arranged to flush
cpu caches on power fail. This latter capability is a feature that has
appeared in embedded storage appliances (pre-ACPI-NFIT nvdimm
platforms).

So, this series:

1/ moves what was previously named "the pmem api" out of the global
   namespace and into drivers that need to be concerned with
   architecture specific persistent memory considerations.

2/ arranges for dax to stop abusing __copy_user_nocache() and implements
   a libnvdimm-local memcpy that uses 'movnt' on x86_64. This might be
   expanded in the future to use 'movntdqa' if the copy size is above
   some threshold, or expanded with support for other architectures [4].

3/ makes cache maintenance optional by arranging for dax to call driver
   specific copy and flush operations only if the driver publishes them
   (see the sketch after the links below).

4/ allows filesytem-dax cache management to be controlled by the block
   device write-cache queue flag. The pmem driver is updated to clear
   that flag by default when pmem is driving volatile memory.

[3]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[4]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009478.html
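
A minimal sketch of the "optional ops" dispatch from point 3/ above. This is
illustration only: the wrapper name and signature here are hypothetical, and
the ->flush() operation itself is only introduced by a later patch in this
series.

static void dax_flush_sketch(struct dax_device *dax_dev, void *addr,
		size_t size)
{
	const struct dax_operations *ops = dax_dev->ops;

	/* only drivers that published a flush op pay for cache maintenance */
	if (ops && ops->flush)
		ops->flush(dax_dev, addr, size);
	/* else: volatile backing store (e.g. brd), nothing to do */
}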

These patches have been through a round of build regression fixes
notified by the 0day robot. All review welcome, but the patches that
need extra attention are the device-mapper and uio changes
(copy_from_iter_ops).

This series is based on a merge of char-misc-next (for cdev api reworks)
and libnvdimm-fixes (dax locking and __copy_user_nocache fixes).

---

Dan Williams (33):
  device-dax: rename 'dax_dev' to 'dev_dax'
  dax: refactor dax-fs into a generic provider of 'struct dax_device' 
instances
  dax: add a facility to lookup a dax device by 'host' device name
  dax: introduce dax_operations
  pmem: add dax_operations support
  axon_ram: add dax_operations support
  brd: add dax_operations support
  dcssblk: add dax_operations support
  block: kill bdev_dax_capable()
  dax: introduce dax_direct_access()
  dm: add dax_device and dax_operations support
  dm: teach dm-targets to use a dax_device + dax_operations
  ext2, ext4, xfs: retrieve dax_device for iomap operations
  Revert "block: use DAX for partition table reads"
  filesystem-dax: convert to dax_direct_access()
  block, dax: convert bdev_dax_supported() to dax_direct_access()
  block: remove block_device_operations ->direct_access()
  x86, dax, pmem: remove indirection around memcpy_from_pmem()
  dax, pmem: introduce 'copy_from_iter' dax operation
  dm: add ->copy_from_iter() dax operation support
  filesystem-dax: convert to dax_copy_from_iter()
  dax, pmem: introduce an optional 'flush' dax_operation
  dm: add ->flush() dax operation support
  filesystem-dax: convert to dax_flush()
  x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush
  x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm
  

Re: [PATCH 0/11 v4] block: Fix block device shutdown related races

2017-03-13 Thread Dan Williams
On Mon, Mar 13, 2017 at 8:13 AM, Jan Kara  wrote:
> Hello,
>
> this is a series with the remaining patches (on top of 4.11-rc2) to fix 
> several
> different races and issues I've found when testing device shutdown and reuse.
> The first two patches fix possible (theoretical) problems when opening of a
> block device races with shutdown of a gendisk structure. Patches 3-9 fix oops
> that is triggered by __blkdev_put() calling inode_detach_wb() too early (the
> problem reported by Thiago). Patches 10 and 11 fix oops due to a bug in 
> gendisk
> code where get_gendisk() can return already freed gendisk structure (again
> triggered by Omar's stress test).
>
> People, please have a look at patches. They are mostly simple however the
> interactions are rather complex so I may have missed something. Also I'm
> happy for any additional testing these patches can get - I've stressed them
> with Omar's script, tested memcg writeback, tested static (not udev managed)
> device inodes.

Passes testing with the libnvdimm unit tests that have been tripped up
by block-unplug bugs in the past.


Re: [PATCH 4/4] Revert "scsi, block: fix duplicate bdi name registration crashes"

2017-03-08 Thread Dan Williams
On Wed, Mar 8, 2017 at 8:48 AM, Jan Kara <j...@suse.cz> wrote:
> This reverts commit 0dba1314d4f81115dce711292ec7981d17231064. It causes
> leaking of device numbers for SCSI when SCSI registers multiple gendisks
> for one request_queue in succession. It can be easily reproduced using
> Omar's script [1] on kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE.
> Furthermore the protection provided by this commit is not needed anymore
> as the problem it was fixing got also fixed by commit 165a5e22fafb
> "block: Move bdi_unregister() to del_gendisk()".
>
> [1]: http://marc.info/?l=linux-block=148554717109098=2
>
> Signed-off-by: Jan Kara <j...@suse.cz>

Acked-by: Dan Williams <dan.j.willi...@intel.com>


Re: [PATCH 0/13 v2] block: Fix block device shutdown related races

2017-02-21 Thread Dan Williams
On Tue, Feb 21, 2017 at 9:19 AM, Jan Kara  wrote:
> On Tue 21-02-17 18:09:45, Jan Kara wrote:
>> Hello,
>>
>> this is a second revision of the patch set to fix several different races and
>> issues I've found when testing device shutdown and reuse. The first three
>> patches are fixes to problems in my previous series fixing BDI lifetime 
>> issues.
>> Patch 4 fixes issues with reuse of BDI name with scsi devices. With it I 
>> cannot
>> reproduce the BDI name reuse issues using Omar's stress test using scsi_debug
>> so it can be used as a replacement of Dan's patches. Patches 5-11 fix oops 
>> that
>> is triggered by __blkdev_put() calling inode_detach_wb() too early (the 
>> problem
>> reported by Thiago). Patches 12 and 13 fix oops due to a bug in gendisk code
>> where get_gendisk() can return already freed gendisk structure (again 
>> triggered
>> by Omar's stress test).
>>
>> People, please have a look at patches. They are mostly simple however the
>> interactions are rather complex so I may have missed something. Also I'm
>> happy for any additional testing these patches can get - I've stressed them
>> with Omar's script, tested memcg writeback, tested static (not udev managed)
>> device inodes.
>>
>> Jens, I think at least patches 1-3 should go in together with my fixes you
>> already have in your tree (or shortly after them). It is up to you whether
>> you decide to delay my first fixes or pick these up quickly. Patch 4 is
>> (IMHO a cleaner) replacement of Dan's patches so consider whether you want
>> to use it instead of those patches.

FWIW, I wholeheartedly agree with replacing my band-aid with this deeper fix.


Re: [lkp-robot] [scsi, block] 0dba1314d4: WARNING:at_fs/sysfs/dir.c:#sysfs_warn_dup

2017-02-09 Thread Dan Williams
On Wed, Feb 8, 2017 at 4:08 PM, James Bottomley
<james.bottom...@hansenpartnership.com> wrote:
> On Mon, 2017-02-06 at 21:42 -0800, Dan Williams wrote:
[..]
>> ...but it reproduces on current mainline with the same config. I
>> haven't spotted what makes scsi_debug behave like this.
>
> Looking at the config, it's a static debug with report luns enabled.
>  Is it as simple as the fact that we probe lun 0 manually to see if the
> target exists, but then we don't account for the fact that we already
> did this, so if it turns up again in the report lun scan, we'll probe
> it again leading to a double add.  If that theory is correct, this may
> be the fix (compile tested only).
>
> James
>
> ---
>
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 6f7128f..ba4be08 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -1441,6 +1441,10 @@ static int scsi_report_lun_scan(struct scsi_target 
> *starget, int bflags,
> for (lunp = &lun_data[1]; lunp <= &lun_data[num_luns]; lunp++) {
> lun = scsilun_to_int(lunp);
>
> +   if (lun == 0)
> +   /* already scanned LUN 0 */
> +   continue;
> +
> if (lun > sdev->host->max_lun) {
> sdev_printk(KERN_WARNING, sdev,
> "lun%llu has a LUN larger than"

I gave this a shot on top of linux-next, but still hit the failure.
Log attached.




Re: [lkp-robot] [scsi, block] 0dba1314d4: WARNING:at_fs/sysfs/dir.c:#sysfs_warn_dup

2017-02-06 Thread Dan Williams
On Mon, Feb 6, 2017 at 8:09 PM, Jens Axboe <ax...@fb.com> wrote:
> On 02/06/2017 05:14 PM, James Bottomley wrote:
>> On Sun, 2017-02-05 at 21:13 -0800, Dan Williams wrote:
>>> On Sun, Feb 5, 2017 at 1:13 AM, Christoph Hellwig <h...@lst.de> wrote:
>>>> Dan,
>>>>
>>>> can you please quote your emails?  I can't find any content
>>>> inbetween all these quotes.
>>>
>>> Sorry, I'm using gmail, but I'll switch to attaching the logs.
>>>
>>> So with help from Xiaolong I was able to reproduce this, and it does
>>> not appear to be a regression. We simply change the failure output of
>>> an existing bug. Attached is a log of the same test on v4.10-rc7
>>> (i.e. without the recent block/scsi fixes), and it shows sda being
>>> registered twice.
>>>
>>> "[6.647077] kobject (d5078ca4): tried to init an initialized
>>> object, something is seriously wrong."
>>>
>>> The change that "scsi, block: fix duplicate bdi name registration
>>> crashes" makes is to properly try to register sdb since the sda devt
>>> is still alive. However that's not a fix because we've managed to
>>> call blk_register_queue() twice on the same queue.
>>
>> OK, time to involve others: linux-scsi and linux-block cc'd and I've
>> inserted the log below.
>>
>> James
>>
>> ---
>>
>> [5.969672] scsi host0: scsi_debug: version 1.86 [20160430]
>> [5.969672]   dev_size_mb=8, opts=0x0, submit_queues=1, statistics=0
>> [5.971895] scsi 0:0:0:0: Direct-Access Linuxscsi_debug   
>> 0186 PQ: 0 ANSI: 7
>> [6.006983] sd 0:0:0:0: [sda] 16384 512-byte logical blocks: (8.39 
>> MB/8.00 MiB)
>> [6.026965] sd 0:0:0:0: [sda] Write Protect is off
>> [6.027870] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
>> [6.066962] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, 
>> supports DPO and FUA
>> [6.486962] sd 0:0:0:0: [sda] Attached SCSI disk
>> [6.488377] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>> [6.489455] sd 0:0:0:0: Attached scsi generic sg0 type 0
>> [6.526982] sd 0:0:0:0: [sda] 16384 512-byte logical blocks: (8.39 
>> MB/8.00 MiB)
>> [6.546964] sd 0:0:0:0: [sda] Write Protect is off
>> [6.547873] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
>> [6.586963] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, 
>> supports DPO and FUA
>> [6.647077] kobject (d5078ca4): tried to init an initialized object, 
>> something is seriously wrong.
>
> So sda is probed twice, and hilarity ensues when we try to register it
> twice.  I can't reproduce this, using scsi_debug and with scsi_async
> enabled.
>
> This is running linux-next? What's your .config?
>

The original failure report is here:

http://marc.info/?l=linux-kernel=148619222300774=2

...but it reproduces on current mainline with the same config. I
haven't spotted what makes scsi_debug behave like this.


Re: [PATCH v3] scsi, block: fix duplicate bdi name registration crashes

2017-02-01 Thread Dan Williams
On Wed, Feb 1, 2017 at 2:35 PM, Jens Axboe <ax...@kernel.dk> wrote:
> On 02/01/2017 02:05 PM, Dan Williams wrote:
>> Warnings of the following form occur because scsi reuses a devt number
>> while the block layer still has it referenced as the name of the bdi
>> [1]:
>>
>>  WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
>>  sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
>>  [..]
>>  Call Trace:
>>   dump_stack+0x86/0xc3
>>   __warn+0xcb/0xf0
>>   warn_slowpath_fmt+0x5f/0x80
>>   ? kernfs_path_from_node+0x4f/0x60
>>   sysfs_warn_dup+0x62/0x80
>>   sysfs_create_dir_ns+0x77/0x90
>>   kobject_add_internal+0xb2/0x350
>>   kobject_add+0x75/0xd0
>>   device_add+0x15a/0x650
>>   device_create_groups_vargs+0xe0/0xf0
>>   device_create_vargs+0x1c/0x20
>>   bdi_register+0x90/0x240
>>   ? lockdep_init_map+0x57/0x200
>>   bdi_register_owner+0x36/0x60
>>   device_add_disk+0x1bb/0x4e0
>>   ? __pm_runtime_use_autosuspend+0x5c/0x70
>>   sd_probe_async+0x10d/0x1c0
>>   async_run_entry_fn+0x39/0x170
>>
>> This is a brute-force fix to pass the devt release information from
>> sd_probe() to the locations where we register the bdi,
>> device_add_disk(), and unregister the bdi, blk_cleanup_queue().
>>
>> Thanks to Omar for the quick reproducer script [2]. This patch survives
>> where an unmodified kernel fails in a few seconds.
>
> What is the patch against? Doesn't seem to apply cleanly for me on
> master, nor the 4.11 block tree.
>

I built it on top of Jan's bdi fixes series [1]. I can rebase to
block/for-next, just let me know which patches you want to take first.

[1]: http://marc.info/?l=linux-block=148586843819160=2


[PATCH v2] scsi, block: fix duplicate bdi name registration crashes

2017-02-01 Thread Dan Williams
Warnings of the following form occur because scsi reuses a devt number
while the block layer still has it referenced as the name of the bdi
[1]:

 WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
 sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
 [..]
 Call Trace:
  dump_stack+0x86/0xc3
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  ? kernfs_path_from_node+0x4f/0x60
  sysfs_warn_dup+0x62/0x80
  sysfs_create_dir_ns+0x77/0x90
  kobject_add_internal+0xb2/0x350
  kobject_add+0x75/0xd0
  device_add+0x15a/0x650
  device_create_groups_vargs+0xe0/0xf0
  device_create_vargs+0x1c/0x20
  bdi_register+0x90/0x240
  ? lockdep_init_map+0x57/0x200
  bdi_register_owner+0x36/0x60
  device_add_disk+0x1bb/0x4e0
  ? __pm_runtime_use_autosuspend+0x5c/0x70
  sd_probe_async+0x10d/0x1c0
  async_run_entry_fn+0x39/0x170

This is a brute-force fix to pass the devt release information from
sd_probe() to the locations where we register the bdi,
device_add_disk(), and unregister the bdi, blk_cleanup_queue().

Thanks to Omar for the quick reproducer script [2]. This patch survives
where an unmodified kernel fails in a few seconds.

[1]: https://marc.info/?l=linux-scsi=147116857810716=4
[2]: http://marc.info/?l=linux-block=148554717109098=2

Cc: James Bottomley <james.bottom...@hansenpartnership.com>
Cc: Bart Van Assche <bart.vanass...@sandisk.com>
Cc: "Martin K. Petersen" <martin.peter...@oracle.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Jens Axboe <ax...@kernel.dk>
Cc: Jan Kara <j...@suse.cz>
Reported-by: Omar Sandoval <osan...@osandov.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
Changes in v2:
 * rebased on top of Jan's bdi lifetime series
 * replace kref_{get,put}() with atomic_{inc,dec_and_test} (Christoph)

 block/blk-core.c   |1 +
 block/genhd.c  |7 +++
 drivers/scsi/sd.c  |   41 +
 include/linux/blkdev.h |1 +
 include/linux/genhd.h  |   17 +
 5 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 84fabb51714a..0cd6b3c4b41c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -595,6 +595,7 @@ void blk_cleanup_queue(struct request_queue *q)
spin_unlock_irq(lock);
 
bdi_unregister(q->backing_dev_info);
+   put_disk_devt(q->disk_devt);
 
/* @q is and will stay empty, shutdown and put */
blk_put_queue(q);
diff --git a/block/genhd.c b/block/genhd.c
index d9ccd42f3675..124499db04d6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -612,6 +612,13 @@ void device_add_disk(struct device *parent, struct gendisk 
*disk)
 
disk_alloc_events(disk);
 
+   /*
+* Take a reference on the devt and assign it to queue since it
+* must not be reallocated while the bdi is registered
+*/
+   disk->queue->disk_devt = disk->disk_devt;
+   get_disk_devt(disk->disk_devt);
+
/* Register BDI before referencing it from bdev */
bdi = disk->queue->backing_dev_info;
bdi_register_owner(bdi, disk_to_dev(disk));
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0b09638fa39b..102111e730ce 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3067,6 +3067,23 @@ static void sd_probe_async(void *data, async_cookie_t 
cookie)
put_device(&sdkp->dev);
 }
 
+struct sd_devt {
+   int idx;
+   struct disk_devt disk_devt;
+};
+
+void sd_devt_release(struct disk_devt *disk_devt)
+{
+   struct sd_devt *sd_devt = container_of(disk_devt, struct sd_devt,
+   disk_devt);
+
+   spin_lock(&sd_index_lock);
+   ida_remove(&sd_index_ida, sd_devt->idx);
+   spin_unlock(&sd_index_lock);
+
+   kfree(sd_devt);
+}
+
 /**
  * sd_probe - called during driver initialization and whenever a
  * new scsi device is attached to the system. It is called once
@@ -3088,6 +3105,7 @@ static void sd_probe_async(void *data, async_cookie_t 
cookie)
 static int sd_probe(struct device *dev)
 {
struct scsi_device *sdp = to_scsi_device(dev);
+   struct sd_devt *sd_devt;
struct scsi_disk *sdkp;
struct gendisk *gd;
int index;
@@ -3113,9 +3131,13 @@ static int sd_probe(struct device *dev)
if (!sdkp)
goto out;
 
+   sd_devt = kzalloc(sizeof(*sd_devt), GFP_KERNEL);
+   if (!sd_devt)
+   goto out_free;
+
gd = alloc_disk(SD_MINORS);
if (!gd)
-   goto out_free;
+   goto out_free_devt;
 
do {
if (!ida_pre_get(&sd_index_ida, GFP_KERNEL))
@@ -3131,6 +3153,11 @@ static int sd_probe(struct device *dev)
goto out_put;
}
 
+   atomic_set(&sd_devt->disk_devt.count, 1);
+   sd_devt->disk_devt.release = sd_devt_release;
+   sd_devt->idx = index;
+   gd->disk_devt = &sd_devt->disk_de

Re: [PATCH 4/4] block: Make blk_get_backing_dev_info() safe without open bdev

2017-02-01 Thread Dan Williams
On Tue, Jan 31, 2017 at 4:54 AM, Jan Kara <j...@suse.cz> wrote:
> Currenly blk_get_backing_dev_info() is not safe to be called when the
> block device is not open as bdev->bd_disk is NULL in that case. However
> inode_to_bdi() uses this function and may be call called from flusher
> worker or other writeback related functions without bdev being open
> which leads to crashes such as:
>
> [113031.075540] Unable to handle kernel paging request for data at address 
> 0x
> [113031.075614] Faulting instruction address: 0xc03692e0
> 0:mon> t
> [c000fb65f900] c036cb6c writeback_sb_inodes+0x30c/0x590
> [c000fb65fa10] c036ced4 __writeback_inodes_wb+0xe4/0x150
> [c000fb65fa70] c036d33c wb_writeback+0x30c/0x450
> [c000fb65fb40] c036e198 wb_workfn+0x268/0x580
> [c000fb65fc50] c00f3470 process_one_work+0x1e0/0x590
> [c000fb65fce0] c00f38c8 worker_thread+0xa8/0x660
> [c000fb65fd80] c00fc4b0 kthread+0x110/0x130
> [c000fb65fe30] c00098f0 ret_from_kernel_thread+0x5c/0x6c
> --- Exception: 0  at 
> 0:mon> e
> cpu 0x0: Vector: 300 (Data Access) at [c000fb65f620]
> pc: c03692e0: locked_inode_to_wb_and_lock_list+0x50/0x290
> lr: c036cb6c: writeback_sb_inodes+0x30c/0x590
> sp: c000fb65f8a0
>msr: 80010280b033
>dar: 0
>  dsisr: 4000
>   current = 0xc001d69be400
>   paca= 0xc348   softe: 0irq_happened: 0x01
> pid   = 18689, comm = kworker/u16:10
>
> Fix the problem by grabbing reference to bdi on first open of the block
> device and drop the reference only once the inode is evicted from
> memory. This pins struct backing_dev_info in memory and thus fixes the
> crashes.
>
> Reported-by: Dan Williams <dan.j.willi...@intel.com>
> Reported-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
> Signed-off-by: Jan Kara <j...@suse.cz>

Tested-by: Dan Williams <dan.j.willi...@intel.com>


Re: [RFC PATCH 10/17] block: introduce bdev_dax_direct_access()

2017-02-01 Thread Dan Williams
On Wed, Feb 1, 2017 at 12:10 AM, Christoph Hellwig <h...@lst.de> wrote:
> On Mon, Jan 30, 2017 at 10:16:29AM -0800, Dan Williams wrote:
>> Ok, now that dax_map_atomic() is gone, it's much easier to remove
>> struct blk_dax_ctl.
>>
>> We can also move the partition alignment checks to be a one-time check
>> at bdev_dax_capable() time and kill bdev_dax_direct_access() in favor
>> of calling dax_direct_access() directly.
>
> Yes, please.
>
>> >> + if ((sector + DIV_ROUND_UP(dax->size, 512))
>> >> + > part_nr_sects_read(bdev->bd_part))
>> >> + return -ERANGE;
>> >> + sector += get_start_sect(bdev);
>> >> + return dax_direct_access(dax_inode, sector * 512, &dax->addr,
>> >> + &dax->pfn, dax->size);
>> >
>> > And please switch to using bytes as the granularity given that we're
>> > deadling with byte addressable memory.
>>
>> dax_direct_access() does take a byte aligned physical address, but it
>> needs to be at least page aligned since we are returning a pfn_t...
>>
>> Hmm, perhaps the input should be raw page frame number. We could
>> reduce one of the arguments by making the current 'pfn_t *' parameter
>> an in/out-parameter.
>
> In/Out parameters are always a bit problematic in terms of API clarity.
> And updating a device-relative address with an absolute physical one
> sounds like an odd API for sure.

Yes, it does, and I thought better of it shortly after sending that. How about:

long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
unsigned long nr_pages, void **kaddr, pfn_t *pfn)
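
For illustration, a filesystem-side caller under that page-based convention
might look like the sketch below (hypothetical helper name, error handling
trimmed; dax_read_lock()/dax_read_unlock() are the existing dax-liveness
guards):

static long sketch_probe_dax(struct dax_device *dax_dev, pgoff_t pgoff)
{
	void *kaddr;
	pfn_t pfn;
	long avail;
	int id;

	id = dax_read_lock();
	/* ask for 1 page; a positive return is the number of pages mapped */
	avail = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
	dax_read_unlock(id);

	return avail;	/* > 0: pages available at pgoff, < 0: -errno */
}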


Re: [RFC PATCH 13/17] fs: update mount_bdev() to lookup dax infrastructure

2017-02-01 Thread Dan Williams
On Wed, Feb 1, 2017 at 12:08 AM, Christoph Hellwig <h...@lst.de> wrote:
> On Mon, Jan 30, 2017 at 10:29:12AM -0800, Dan Williams wrote:
>> On Mon, Jan 30, 2017 at 4:26 AM, Christoph Hellwig <h...@lst.de> wrote:
>> > On Sat, Jan 28, 2017 at 12:37:14AM -0800, Dan Williams wrote:
>> >> This is in preparation for removing the ->direct_access() method from
>> >> block_device_operations.
>> >
>> > I don't think mount_bdev has any business knowing about DAX.
>> > Just call dax_get_by_host manually from the affected file systems for
>> > now, and in the future we can have a pure-DAX mount_dax helper.
>>
>> Ok, since we already need dax_get_by_host() in the blkdev_writepages()
>> path I can sprinkle a few more of those calls and leave mount_bdev
>> alone.
>
> Huh?  I thought we stopped using DAX I/O for the block device nodes
> a while ago?

Oh, yeah, you're right. The blkdev_writepages() call to
dax_writeback_mapping_range() is likely leftover dead code. I'll clean
it up.


Re: [RFC PATCH] scsi, block: fix duplicate bdi name registration crashes

2017-01-30 Thread Dan Williams
On Mon, Jan 30, 2017 at 4:24 AM, Christoph Hellwig  wrote:
> Hi Dan,
>
> this looks mostly fine to me.  A few code comments below, but except
> for this there is another issue with it:  We still have drivers
> that share a single request_queue for multiple gendisks, so I wonder

scsi drivers or others? If those drivers can switch to dynamically
allocated devt (GENHD_FL_EXT_DEVT), then they don't need this fix.
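
For reference, a sketch (hypothetical driver code, not from any patch in this
thread) of what opting into dynamic devt allocation looks like; the number's
lifetime is then managed by the block core rather than by a driver-private
ida:

static int example_add_disk(struct device *parent, struct request_queue *q)
{
	struct gendisk *gd = alloc_disk(0);	/* minors == 0: no static range */

	if (!gd)
		return -ENOMEM;
	gd->flags |= GENHD_FL_EXT_DEVT;		/* devt from the extended space */
	gd->queue = q;
	sprintf(gd->disk_name, "example0");	/* hypothetical disk name */
	device_add_disk(parent, gd);
	return 0;
}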


Re: [RFC PATCH] scsi, block: fix duplicate bdi name registration crashes

2017-01-30 Thread Dan Williams
On Mon, Jan 30, 2017 at 4:24 AM, Christoph Hellwig  wrote:
> Hi Dan,
>
> this looks mostly fine to me.  A few code comments below, but except
> for this there is another issue with it:  We still have drivers
> that share a single request_queue for multiple gendisks, so I wonder
>
> Also I think you probably want one patch for the block framework,
> and one to switch SCSI over to it.
>
>> +struct disk_devt {
>> + struct kref kref;
>> + void (*release)(struct kref *);
>> +};
>> +
>> +static inline void put_disk_devt(struct disk_devt *disk_devt)
>> +{
>> + if (disk_devt)
>> + kref_put(&disk_devt->kref, disk_devt->release);
>> +}
>> +
>> +static inline void get_disk_devt(struct disk_devt *disk_devt)
>> +{
>> + if (disk_devt)
>> + kref_get(&disk_devt->kref);
>> +}
>
> Given that we have a user-supplied release callack I'd much rather get
> rid of the kref here, use a normal atomic_t and pass the disk_devt
> structure to the release callback then a kref.

I'm missing something... kref is just:

struct kref {
atomic_t refcount;
};

...so what do we gain by open coding kref_get() and kref_put()?
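
For comparison, the open-coded variant Christoph is suggesting (and which v2
of this patch elsewhere in this thread ends up using) would look roughly like
this sketch:

struct disk_devt {
	atomic_t count;
	void (*release)(struct disk_devt *disk_devt);
};

static inline void get_disk_devt(struct disk_devt *disk_devt)
{
	if (disk_devt)
		atomic_inc(&disk_devt->count);
}

static inline void put_disk_devt(struct disk_devt *disk_devt)
{
	/* the release callback gets the disk_devt itself, not an embedded kref */
	if (disk_devt && atomic_dec_and_test(&disk_devt->count))
		disk_devt->release(disk_devt);
}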


Re: [RFC PATCH 13/17] fs: update mount_bdev() to lookup dax infrastructure

2017-01-30 Thread Dan Williams
On Mon, Jan 30, 2017 at 4:26 AM, Christoph Hellwig <h...@lst.de> wrote:
> On Sat, Jan 28, 2017 at 12:37:14AM -0800, Dan Williams wrote:
>> This is in preparation for removing the ->direct_access() method from
>> block_device_operations.
>
> I don't think mount_bdev has any business knowing about DAX.
> Just call dax_get_by_host manually from the affected file systems for
> now, and in the future we can have a pure-DAX mount_dax helper.

Ok, since we already need dax_get_by_host() in the blkdev_writepages()
path I can sprinkle a few more of those calls and leave mount_bdev
alone.


Re: [RFC PATCH 10/17] block: introduce bdev_dax_direct_access()

2017-01-30 Thread Dan Williams
On Mon, Jan 30, 2017 at 4:32 AM, Christoph Hellwig <h...@lst.de> wrote:
> On Sat, Jan 28, 2017 at 12:36:58AM -0800, Dan Williams wrote:
>> Provide a replacement for bdev_direct_access() that uses
>> dax_operations.direct_access() instead of
>> block_device_operations.direct_access(). Once all consumers of the old
>> api have been converted bdev_direct_access() will be deleted.
>>
>> Given that block device partitioning decisions can cause dax page
>> alignment constraints to be violated we still need to validate the
>> block_device before calling the dax ->direct_access method.
>>
>> Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
>> ---
>>  block/Kconfig  |1 +
>>  drivers/dax/super.c|   33 +
>>  fs/block_dev.c |   28 
>>  include/linux/blkdev.h |3 +++
>>  include/linux/dax.h|2 ++
>>  5 files changed, 67 insertions(+)
>>
>> diff --git a/block/Kconfig b/block/Kconfig
>> index 8bf114a3858a..9be785173280 100644
>> --- a/block/Kconfig
>> +++ b/block/Kconfig
>> @@ -6,6 +6,7 @@ menuconfig BLOCK
>> default y
>> select SBITMAP
>> select SRCU
>> +   select DAX
>> help
>>Provide block layer support for the kernel.
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index eb844ffea3cf..ab5b082df5dd 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -65,6 +65,39 @@ struct dax_inode {
>>   const struct dax_operations *ops;
>>  };
>>
>> +long dax_direct_access(struct dax_inode *dax_inode, phys_addr_t dev_addr,
>> + void **kaddr, pfn_t *pfn, long size)
>> +{
>> + long avail;
>> +
>> + /*
>> +  * The device driver is allowed to sleep, in order to make the
>> +  * memory directly accessible.
>> +  */
>> + might_sleep();
>> +
>> + if (!dax_inode)
>> + return -EOPNOTSUPP;
>> +
>> + if (!dax_inode_alive(dax_inode))
>> + return -ENXIO;
>> +
>> + if (size < 0)
>> + return size;
>> +
>> + if (dev_addr % PAGE_SIZE)
>> + return -EINVAL;
>> +
>> + avail = dax_inode->ops->direct_access(dax_inode, dev_addr, kaddr, pfn,
>> + size);
>> + if (!avail)
>> + return -ERANGE;
>> + if (avail > 0 && avail & ~PAGE_MASK)
>> + return -ENXIO;
>> + return min(avail, size);
>> +}
>> +EXPORT_SYMBOL_GPL(dax_direct_access);
>> +
>>  bool dax_inode_alive(struct dax_inode *dax_inode)
>>  {
>>   lockdep_assert_held(_srcu);
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index edb1d2b16b8f..bf4b51a3a412 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -18,6 +18,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -763,6 +764,33 @@ long bdev_direct_access(struct block_device *bdev, 
>> struct blk_dax_ctl *dax)
>>  EXPORT_SYMBOL_GPL(bdev_direct_access);
>>
>>  /**
>> + * bdev_dax_direct_access() - bdev-sector to pfn_t and kernel virtual 
>> address
>> + * @bdev: host block device for @dax_inode
>> + * @dax_inode: interface data and operations for a memory device
>> + * @dax: control and output parameters for ->direct_access
>> + *
>> + * Return: negative errno if an error occurs, otherwise the number of bytes
>> + * accessible at this address.
>> + *
>> + * Locking: must be called with dax_read_lock() held
>> + */
>> +long bdev_dax_direct_access(struct block_device *bdev,
>> + struct dax_inode *dax_inode, struct blk_dax_ctl *dax)
>> +{
>> + sector_t sector = dax->sector;
>> +
>> + if (!blk_queue_dax(bdev->bd_queue))
>> + return -EOPNOTSUPP;
>
> I don't think this should take a bdev - the caller should know if
> it has a dax_inode.  Also if you touch this anyway can we kill
> the annoying struct blk_dax_ctl calling convention?  Passing the
> four arguments explicitly is just a lot more readable and understandable.

Ok, now that dax_map_atomic() is gone, it's much easier to remove
struct blk_dax_ctl.

We can also move the partition alignment checks to be a one-time check
at bdev_dax_capable() time and kill bdev_dax_direct_access() in favor
of calling dax_direct_access() directly.
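
A sketch of what such a one-time check might look like (hypothetical helper
name; only the page-alignment test is shown):

static bool bdev_dax_aligned(struct block_device *bdev)
{
	/* partition start must be page aligned for DAX */
	if ((get_start_sect(bdev) * 512) % PAGE_SIZE)
		return false;
	/* ...and so must the partition size */
	if ((part_nr_sects_read(bdev->bd_part) * 512) % PAGE_SIZE)
		return false;
	return true;
}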

>> + if ((sector

Re: [PATCH 0/4 RFC] BDI lifetime fix

2017-01-30 Thread Dan Williams
On Mon, Jan 30, 2017 at 9:19 AM, Jan Kara <j...@suse.cz> wrote:
> On Thu 26-01-17 22:15:06, Dan Williams wrote:
>> On Thu, Jan 26, 2017 at 9:45 AM, Jan Kara <j...@suse.cz> wrote:
>> > Hello,
>> >
>> > this patch series attempts to solve the problems with the life time of a
>> > backing_dev_info structure. Currently it lives inside request_queue 
>> > structure
>> > and thus it gets destroyed as soon as request queue goes away. However
>> > the block device inode still stays around and thus inode_to_bdi() call on
>> > that inode (e.g. from flusher worker) may happen after request queue has 
>> > been
>> > destroyed resulting in oops.
>> >
>> > This patch set tries to solve these problems by making backing_dev_info
>> > independent structure referenced from block device inode. That makes sure
>> > inode_to_bdi() cannot ever oops. The patches are lightly tested for now
>> > (they boot, basic tests with adding & removing loop devices seem to do what
>> > I'd expect them to do ;). If someone is able to reproduce crashes on bdi
>> > when device goes away, please test these patches.
>>
>> This survives a several runs of the libnvdimm unit tests which stress
>> del_gendisk() and blk_cleanup_queue(). I'll keep testing since the
>> failure was intermittent, but this is looking good.
>
> Thanks for testing!
>
>> > I'd also appreciate if people had a look whether the approach I took looks
>> > sensible.
>>
>> Looks sensible, just the kref comment.
>>
>> I also don't see a need to try to tag on the bdi device name reuse
>> into this series. I'm wondering if we can handle that separately with
>> device_rename(bdi->dev, ...) when we know scsi is done with the old
>> bdi but it has not finished being deleted
>
> Do you mean I should not speak about it in the changelog? The problems I
> have are not as much with reusing device *name* here (and resulting sysfs
> conflicts) but rather a major:minor number pair which results in reusing
> block device inode and we are not prepared for that since the bdi
> associated with that inode may be already unregistered and reusing it would
> be difficult.

No, sorry for the confusion, changelog is fine.

I think others were expecting that this series addressed the
"duplicate bdi device" warnings, and I was clarifying that this series
is for something else.


Re: [RFC PATCH 01/17] dax: refactor dax-fs into a generic provider of dax inodes

2017-01-30 Thread Dan Williams
On Mon, Jan 30, 2017 at 4:28 AM, Christoph Hellwig  wrote:
> I really don't like the dax_inode name.  Why not something like
> dax_device or dax_region?

Fair enough, I'll switch struct dax_inode to dax_device and switch the
existing struct dax_dev to dax_info.


Re: [RFC PATCH] scsi, block: fix duplicate bdi name registration crashes

2017-01-29 Thread Dan Williams
On Sun, Jan 29, 2017 at 11:22 PM, Omar Sandoval <osan...@osandov.com> wrote:
> On Mon, Jan 30, 2017 at 08:05:52AM +0100, Hannes Reinecke wrote:
>> On 01/29/2017 05:58 AM, Dan Williams wrote:
>> > Warnings of the following form occur because scsi reuses a devt number
>> > while the block layer still has it referenced as the name of the bdi
>> > [1]:
>> >
>> >  WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
>> >  sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
>> >  [..]
>> >  Call Trace:
>> >   dump_stack+0x86/0xc3
>> >   __warn+0xcb/0xf0
>> >   warn_slowpath_fmt+0x5f/0x80
>> >   ? kernfs_path_from_node+0x4f/0x60
>> >   sysfs_warn_dup+0x62/0x80
>> >   sysfs_create_dir_ns+0x77/0x90
>> >   kobject_add_internal+0xb2/0x350
>> >   kobject_add+0x75/0xd0
>> >   device_add+0x15a/0x650
>> >   device_create_groups_vargs+0xe0/0xf0
>> >   device_create_vargs+0x1c/0x20
>> >   bdi_register+0x90/0x240
>> >   ? lockdep_init_map+0x57/0x200
>> >   bdi_register_owner+0x36/0x60
>> >   device_add_disk+0x1bb/0x4e0
>> >   ? __pm_runtime_use_autosuspend+0x5c/0x70
>> >   sd_probe_async+0x10d/0x1c0
>> >   async_run_entry_fn+0x39/0x170
>> >
>> > This is a brute-force fix to pass the devt release information from
>> > sd_probe() to the locations where we register the bdi,
>> > device_add_disk(), and unregister the bdi, blk_cleanup_queue().
>> >
>> > Thanks to Omar for the quick reproducer script [2]. This patch survives
>> > where an unmodified kernel fails in a few seconds.
>> >
>> > [1]: https://marc.info/?l=linux-scsi=147116857810716=4
>> > [2]: http://marc.info/?l=linux-block=148554717109098=2
>> >
>> > Cc: James Bottomley <james.bottom...@hansenpartnership.com>
>> > Cc: Bart Van Assche <bart.vanass...@sandisk.com>
>> > Cc: "Martin K. Petersen" <martin.peter...@oracle.com>
>> > Cc: Christoph Hellwig <h...@lst.de>
>> > Cc: Jens Axboe <ax...@kernel.dk>
>> > Reported-by: Omar Sandoval <osan...@osandov.com>
>> > Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
>> > ---
>> >  block/blk-core.c   |1 +
>> >  block/genhd.c  |7 +++
>> >  drivers/scsi/sd.c  |   41 +
>> >  include/linux/blkdev.h |1 +
>> >  include/linux/genhd.h  |   17 +
>> >  5 files changed, 59 insertions(+), 8 deletions(-)
>> >
>> Please check the patchset from Jan Kara (cf 'BDI lifetime fix' on
>> linux-block), which attempts to solve the same problem.
>
> Hi, Hannes,
>
> It's not the same problem. Jan's series fixes a bdi vs. inode lifetime
> issue, this patch is for a bdi vs devt lifetime issue. Jan's series
> doesn't fix the crashes caused by my reproducer script.

Correct. In fact I was running Jan's patches in my baseline kernel
that fails almost immediately.


[RFC PATCH] scsi, block: fix duplicate bdi name registration crashes

2017-01-28 Thread Dan Williams
Warnings of the following form occur because scsi reuses a devt number
while the block layer still has it referenced as the name of the bdi
[1]:

 WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
 sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
 [..]
 Call Trace:
  dump_stack+0x86/0xc3
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  ? kernfs_path_from_node+0x4f/0x60
  sysfs_warn_dup+0x62/0x80
  sysfs_create_dir_ns+0x77/0x90
  kobject_add_internal+0xb2/0x350
  kobject_add+0x75/0xd0
  device_add+0x15a/0x650
  device_create_groups_vargs+0xe0/0xf0
  device_create_vargs+0x1c/0x20
  bdi_register+0x90/0x240
  ? lockdep_init_map+0x57/0x200
  bdi_register_owner+0x36/0x60
  device_add_disk+0x1bb/0x4e0
  ? __pm_runtime_use_autosuspend+0x5c/0x70
  sd_probe_async+0x10d/0x1c0
  async_run_entry_fn+0x39/0x170

This is a brute-force fix to pass the devt release information from
sd_probe() to the locations where we register the bdi,
device_add_disk(), and unregister the bdi, blk_cleanup_queue().

Thanks to Omar for the quick reproducer script [2]. This patch survives
where an unmodified kernel fails in a few seconds.

[1]: https://marc.info/?l=linux-scsi=147116857810716=4
[2]: http://marc.info/?l=linux-block=148554717109098=2

Cc: James Bottomley <james.bottom...@hansenpartnership.com>
Cc: Bart Van Assche <bart.vanass...@sandisk.com>
Cc: "Martin K. Petersen" <martin.peter...@oracle.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Jens Axboe <ax...@kernel.dk>
Reported-by: Omar Sandoval <osan...@osandov.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 block/blk-core.c   |1 +
 block/genhd.c  |7 +++
 drivers/scsi/sd.c  |   41 +
 include/linux/blkdev.h |1 +
 include/linux/genhd.h  |   17 +
 5 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c58b64..950cea1e202e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -597,6 +597,7 @@ void blk_cleanup_queue(struct request_queue *q)
spin_unlock_irq(lock);
 
bdi_unregister(&q->backing_dev_info);
+   put_disk_devt(q->disk_devt);
 
/* @q is and will stay empty, shutdown and put */
blk_put_queue(q);
diff --git a/block/genhd.c b/block/genhd.c
index fcd6d4fae657..eb8009e928f5 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -612,6 +612,13 @@ void device_add_disk(struct device *parent, struct gendisk 
*disk)
 
disk_alloc_events(disk);
 
+   /*
+* Take a reference on the devt and assign it to queue since it
+* must not be reallocated while the bdi is registered
+*/
+   disk->queue->disk_devt = disk->disk_devt;
+   get_disk_devt(disk->disk_devt);
+
/* Register BDI before referencing it from bdev */
bdi = &disk->queue->backing_dev_info;
bdi_register_owner(bdi, disk_to_dev(disk));
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0b09638fa39b..09405351577c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3067,6 +3067,23 @@ static void sd_probe_async(void *data, async_cookie_t 
cookie)
put_device(&sdkp->dev);
 }
 
+struct sd_devt {
+   int idx;
+   struct disk_devt disk_devt;
+};
+
+void sd_devt_release(struct kref *kref)
+{
+   struct sd_devt *sd_devt = container_of(kref, struct sd_devt,
+   disk_devt.kref);
+
+   spin_lock(&sd_index_lock);
+   ida_remove(&sd_index_ida, sd_devt->idx);
+   spin_unlock(&sd_index_lock);
+
+   kfree(sd_devt);
+}
+
 /**
  * sd_probe - called during driver initialization and whenever a
  * new scsi device is attached to the system. It is called once
@@ -3088,6 +3105,7 @@ static void sd_probe_async(void *data, async_cookie_t 
cookie)
 static int sd_probe(struct device *dev)
 {
struct scsi_device *sdp = to_scsi_device(dev);
+   struct sd_devt *sd_devt;
struct scsi_disk *sdkp;
struct gendisk *gd;
int index;
@@ -3113,9 +3131,13 @@ static int sd_probe(struct device *dev)
if (!sdkp)
goto out;
 
+   sd_devt = kzalloc(sizeof(*sd_devt), GFP_KERNEL);
+   if (!sd_devt)
+   goto out_free;
+
gd = alloc_disk(SD_MINORS);
if (!gd)
-   goto out_free;
+   goto out_free_devt;
 
do {
if (!ida_pre_get(&sd_index_ida, GFP_KERNEL))
@@ -3131,6 +3153,11 @@ static int sd_probe(struct device *dev)
goto out_put;
}
 
+   kref_init(&sd_devt->disk_devt.kref);
+   sd_devt->disk_devt.release = sd_devt_release;
+   sd_devt->idx = index;
+   gd->disk_devt = &sd_devt->disk_devt;
+
error = sd_format_disk_name("sd", index, gd->disk_name, DISK_NAME_LEN);
if (error) {
sdev_printk(KERN_WARNING, sdp, "SCSI dis

[RFC PATCH 04/17] dax: introduce dax_operations

2017-01-28 Thread Dan Williams
Track a set of dax_operations per dax_inode that can be set at
alloc_dax_inode() time. These operations will be used to stop the abuse
of block_device_operations for communicating dax capabilities to
filesystems. It will also be used to replace the "pmem api" and move
pmem-specific cache maintenance, and other dax-driver-specific
filesystem-dax operations, to dax inode methods. In particular this
allows us to stop abusing __copy_user_nocache(), via memcpy_to_pmem(),
with a driver specific replacement.

This is a standalone introduction of the operations. Follow on patches
convert each dax-driver and teach fs/dax.c to use ->direct_access() from
dax_operations instead of block_device_operations.

Suggested-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/dax/dax.h|4 +++-
 drivers/dax/device.c |6 +-
 drivers/dax/super.c  |6 +-
 include/linux/dax.h  |5 +
 4 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index f33c16ed2ec6..aeb1d49aafb8 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,7 +13,9 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_inode;
-struct dax_inode *alloc_dax_inode(void *private, const char *host);
+struct dax_operations;
+struct dax_inode *alloc_dax_inode(void *private, const char *host,
+   const struct dax_operations *ops);
 void put_dax_inode(struct dax_inode *dax_inode);
 bool dax_inode_alive(struct dax_inode *dax_inode);
 void kill_dax_inode(struct dax_inode *dax_inode);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 6d0a3241a608..c3d9405ec285 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -560,7 +560,11 @@ struct dax_dev *devm_create_dax_dev(struct dax_region 
*dax_region,
goto err_id;
}
 
-   dax_inode = alloc_dax_inode(dax_dev, NULL);
+   /*
+* No 'host' or dax_operations since there is no access to this
+* device outside of mmap of the resulting character device.
+*/
+   dax_inode = alloc_dax_inode(dax_dev, NULL, NULL);
if (!dax_inode)
goto err_inode;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 7ac048f94b2b..eb844ffea3cf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 
 static int nr_dax = CONFIG_NR_DEV_DAX;
@@ -61,6 +62,7 @@ struct dax_inode {
const char *host;
void *private;
bool alive;
+   const struct dax_operations *ops;
 };
 
 bool dax_inode_alive(struct dax_inode *dax_inode)
@@ -204,7 +206,8 @@ static void dax_add_host(struct dax_inode *dax_inode, const 
char *host)
spin_unlock(&dax_host_lock);
 }
 
-struct dax_inode *alloc_dax_inode(void *private, const char *__host)
+struct dax_inode *alloc_dax_inode(void *private, const char *__host,
+   const struct dax_operations *ops)
 {
struct dax_inode *dax_inode;
const char *host;
@@ -225,6 +228,7 @@ struct dax_inode *alloc_dax_inode(void *private, const char 
*__host)
goto err_inode;
 
dax_add_host(dax_inode, host);
+   dax_inode->ops = ops;
dax_inode->private = private;
return dax_inode;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8fe19230e118..def9a9d118c9 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -7,6 +7,11 @@
 #include 
 
 struct iomap_ops;
+struct dax_inode;
+struct dax_operations {
+   long (*direct_access)(struct dax_inode *, phys_addr_t, void **,
+   pfn_t *, long);
+};
 
 int dax_read_lock(void);
 void dax_read_unlock(int id);



[RFC PATCH 13/17] fs: update mount_bdev() to lookup dax infrastructure

2017-01-28 Thread Dan Williams
This is in preparation for removing the ->direct_access() method from
block_device_operations.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/block_dev.c |6 --
 fs/super.c |   32 +---
 include/linux/fs.h |1 +
 3 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index bf4b51a3a412..a73f2388c515 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -806,14 +806,16 @@ int bdev_dax_supported(struct super_block *sb, int 
blocksize)
.sector = 0,
.size = PAGE_SIZE,
};
-   int err;
+   int err, id;
 
if (blocksize != PAGE_SIZE) {
vfs_msg(sb, KERN_ERR, "error: unsupported blocksize for dax");
return -EINVAL;
}
 
-   err = bdev_direct_access(sb->s_bdev, );
+   id = dax_read_lock();
+	err = bdev_dax_direct_access(sb->s_bdev, sb->s_dax, &dax);
+   dax_read_unlock(id);
if (err < 0) {
switch (err) {
case -EOPNOTSUPP:
diff --git a/fs/super.c b/fs/super.c
index ea662b0e5e78..5e64d11c46c1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include/* for the emergency remount stuff */
+#include <linux/dax.h>
 #include 
 #include 
 #include 
@@ -1038,9 +1039,17 @@ struct dentry *mount_ns(struct file_system_type *fs_type,
 EXPORT_SYMBOL(mount_ns);
 
 #ifdef CONFIG_BLOCK
+struct mount_bdev_data {
+   struct block_device *bdev;
+   struct dax_inode *dax_inode;
+};
+
 static int set_bdev_super(struct super_block *s, void *data)
 {
-   s->s_bdev = data;
+   struct mount_bdev_data *mb_data = data;
+
+   s->s_bdev = mb_data->bdev;
+   s->s_dax = mb_data->dax_inode;
s->s_dev = s->s_bdev->bd_dev;
 
/*
@@ -1053,14 +1062,18 @@ static int set_bdev_super(struct super_block *s, void 
*data)
 
 static int test_bdev_super(struct super_block *s, void *data)
 {
-   return (void *)s->s_bdev == data;
+   struct mount_bdev_data *mb_data = data;
+
+   return s->s_bdev == mb_data->bdev;
 }
 
 struct dentry *mount_bdev(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data,
int (*fill_super)(struct super_block *, void *, int))
 {
+   struct mount_bdev_data mb_data;
struct block_device *bdev;
+   struct dax_inode *dax_inode;
struct super_block *s;
fmode_t mode = FMODE_READ | FMODE_EXCL;
int error = 0;
@@ -1072,6 +1085,11 @@ struct dentry *mount_bdev(struct file_system_type 
*fs_type,
if (IS_ERR(bdev))
return ERR_CAST(bdev);
 
+   if (IS_ENABLED(CONFIG_FS_DAX))
+   dax_inode = dax_get_by_host(bdev->bd_disk->disk_name);
+   else
+   dax_inode = NULL;
+
/*
 * once the super is inserted into the list by sget, s_umount
 * will protect the lockfs code from trying to start a snapshot
@@ -1083,8 +1101,13 @@ struct dentry *mount_bdev(struct file_system_type 
*fs_type,
error = -EBUSY;
goto error_bdev;
}
+
+   mb_data = (struct mount_bdev_data) {
+   .bdev = bdev,
+   .dax_inode = dax_inode,
+   };
s = sget(fs_type, test_bdev_super, set_bdev_super, flags | MS_NOSEC,
-bdev);
+		 &mb_data);
	mutex_unlock(&bdev->bd_fsfreeze_mutex);
if (IS_ERR(s))
goto error_s;
@@ -1126,6 +1149,7 @@ struct dentry *mount_bdev(struct file_system_type 
*fs_type,
error = PTR_ERR(s);
 error_bdev:
blkdev_put(bdev, mode);
+   put_dax_inode(dax_inode);
 error:
return ERR_PTR(error);
 }
@@ -1133,6 +1157,7 @@ EXPORT_SYMBOL(mount_bdev);
 
 void kill_block_super(struct super_block *sb)
 {
+   struct dax_inode *dax_inode = sb->s_dax;
struct block_device *bdev = sb->s_bdev;
fmode_t mode = sb->s_mode;
 
@@ -1141,6 +1166,7 @@ void kill_block_super(struct super_block *sb)
sync_blockdev(bdev);
WARN_ON_ONCE(!(mode & FMODE_EXCL));
blkdev_put(bdev, mode | FMODE_EXCL);
+   put_dax_inode(dax_inode);
 }
 
 EXPORT_SYMBOL(kill_block_super);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c930cbc19342..fdad43169146 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1313,6 +1313,7 @@ struct super_block {
struct hlist_bl_heads_anon; /* anonymous dentries for (nfs) 
exporting */
struct list_heads_mounts;   /* list of mounts; _not_ for fs 
use */
struct block_device *s_bdev;
+   struct dax_inode*s_dax;
struct backing_dev_info *s_bdi;
struct mtd_info *s_mtd;
struct hlist_node   s_instances;

--
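For reference, with sb->s_dax populated by mount_bdev(), a filesystem's
fill_super can gate a dax mount option on the looked-up dax_inode. A
minimal sketch, assuming a hypothetical foofs_fill_super() and
foofs_wants_dax() option check (neither is part of this series):

static int foofs_fill_super(struct super_block *sb, void *data, int silent)
{
	/* mount_bdev() has already set sb->s_bdev and (possibly NULL) sb->s_dax */
	if (foofs_wants_dax(data)) {			/* hypothetical mount-option check */
		if (!sb->s_dax)
			return -EOPNOTSUPP;		/* no dax_inode registered for this bdev */
		if (bdev_dax_supported(sb, PAGE_SIZE) < 0)
			return -EINVAL;			/* alignment / blocksize checks failed */
	}
	/* remainder of the superblock setup is unchanged */
	return 0;
}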


[RFC PATCH 15/17] Revert "block: use DAX for partition table reads"

2017-01-28 Thread Dan Williams
commit d1a5f2b4d8a1 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.

Now that we are moving ->direct_access() from a block_device operation
to a dax_inode operation we would need block devices to map and carry
their own dax_inode reference.

Unless / until we decide to revive dax mapping of raw block devices
through the dax_inode scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
223757016837 ("block_dev: remove DAX leftovers").

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 block/partition-generic.c |   17 ++---
 fs/dax.c  |   20 
 include/linux/dax.h   |6 --
 3 files changed, 2 insertions(+), 41 deletions(-)

diff --git a/block/partition-generic.c b/block/partition-generic.c
index 7afb9907821f..5dfac337b0f2 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -16,7 +16,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include "partitions/check.h"
@@ -631,24 +630,12 @@ int invalidate_partitions(struct gendisk *disk, struct 
block_device *bdev)
return 0;
 }
 
-static struct page *read_pagecache_sector(struct block_device *bdev, sector_t 
n)
-{
-   struct address_space *mapping = bdev->bd_inode->i_mapping;
-
-   return read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)),
-NULL);
-}
-
 unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector 
*p)
 {
+   struct address_space *mapping = bdev->bd_inode->i_mapping;
struct page *page;
 
-   /* don't populate page cache for dax capable devices */
-   if (IS_DAX(bdev->bd_inode))
-   page = read_dax_sector(bdev, n);
-   else
-   page = read_pagecache_sector(bdev, n);
-
+   page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)), NULL);
if (!IS_ERR(page)) {
if (PageError(page))
goto fail;
diff --git a/fs/dax.c b/fs/dax.c
index ddcddfeaa03b..a990211c8a3d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -97,26 +97,6 @@ static int dax_is_empty_entry(void *entry)
return (unsigned long)entry & RADIX_DAX_EMPTY;
 }
 
-struct page *read_dax_sector(struct block_device *bdev, sector_t n)
-{
-   struct page *page = alloc_pages(GFP_KERNEL, 0);
-   struct blk_dax_ctl dax = {
-   .size = PAGE_SIZE,
-	.sector = n & ~((((int) PAGE_SIZE) / 512) - 1),
-   };
-   long rc;
-
-   if (!page)
-   return ERR_PTR(-ENOMEM);
-
-	rc = dax_map_atomic(bdev, &dax);
-   if (rc < 0)
-   return ERR_PTR(rc);
-   memcpy_from_pmem(page_address(page), dax.addr, PAGE_SIZE);
-	dax_unmap_atomic(bdev, &dax);
-   return page;
-}
-
 /*
  * DAX radix tree locking
  */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 2ef8e18e2587..10b742af3d56 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -65,15 +65,9 @@ void dax_wake_mapping_entry_waiter(struct address_space 
*mapping,
pgoff_t index, void *entry, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
-struct page *read_dax_sector(struct block_device *bdev, sector_t n);
 int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
unsigned int offset, unsigned int length);
 #else
-static inline struct page *read_dax_sector(struct block_device *bdev,
-   sector_t n)
-{
-   return ERR_PTR(-ENXIO);
-}
 static inline int __dax_zero_page_range(struct block_device *bdev,
sector_t sector, unsigned int offset, unsigned int length)
 {

--


[RFC PATCH 10/17] block: introduce bdev_dax_direct_access()

2017-01-28 Thread Dan Williams
Provide a replacement for bdev_direct_access() that uses
dax_operations.direct_access() instead of
block_device_operations.direct_access(). Once all consumers of the old
API have been converted, bdev_direct_access() will be deleted.

Given that block device partitioning decisions can cause dax page
alignment constraints to be violated, we still need to validate the
block_device before calling the dax ->direct_access() method.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 block/Kconfig  |1 +
 drivers/dax/super.c|   33 +
 fs/block_dev.c |   28 
 include/linux/blkdev.h |3 +++
 include/linux/dax.h|2 ++
 5 files changed, 67 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index 8bf114a3858a..9be785173280 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -6,6 +6,7 @@ menuconfig BLOCK
default y
select SBITMAP
select SRCU
+   select DAX
help
 Provide block layer support for the kernel.
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index eb844ffea3cf..ab5b082df5dd 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -65,6 +65,39 @@ struct dax_inode {
const struct dax_operations *ops;
 };
 
+long dax_direct_access(struct dax_inode *dax_inode, phys_addr_t dev_addr,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   long avail;
+
+   /*
+* The device driver is allowed to sleep, in order to make the
+* memory directly accessible.
+*/
+   might_sleep();
+
+   if (!dax_inode)
+   return -EOPNOTSUPP;
+
+   if (!dax_inode_alive(dax_inode))
+   return -ENXIO;
+
+   if (size < 0)
+   return size;
+
+   if (dev_addr % PAGE_SIZE)
+   return -EINVAL;
+
+   avail = dax_inode->ops->direct_access(dax_inode, dev_addr, kaddr, pfn,
+   size);
+   if (!avail)
+   return -ERANGE;
+   if (avail > 0 && avail & ~PAGE_MASK)
+   return -ENXIO;
+   return min(avail, size);
+}
+EXPORT_SYMBOL_GPL(dax_direct_access);
+
 bool dax_inode_alive(struct dax_inode *dax_inode)
 {
lockdep_assert_held(_srcu);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index edb1d2b16b8f..bf4b51a3a412 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 #include 
 #include 
@@ -763,6 +764,33 @@ long bdev_direct_access(struct block_device *bdev, struct 
blk_dax_ctl *dax)
 EXPORT_SYMBOL_GPL(bdev_direct_access);
 
 /**
+ * bdev_dax_direct_access() - bdev-sector to pfn_t and kernel virtual address
+ * @bdev: host block device for @dax_inode
+ * @dax_inode: interface data and operations for a memory device
+ * @dax: control and output parameters for ->direct_access
+ *
+ * Return: negative errno if an error occurs, otherwise the number of bytes
+ * accessible at this address.
+ *
+ * Locking: must be called with dax_read_lock() held
+ */
+long bdev_dax_direct_access(struct block_device *bdev,
+   struct dax_inode *dax_inode, struct blk_dax_ctl *dax)
+{
+   sector_t sector = dax->sector;
+
+   if (!blk_queue_dax(bdev->bd_queue))
+   return -EOPNOTSUPP;
+   if ((sector + DIV_ROUND_UP(dax->size, 512))
+   > part_nr_sects_read(bdev->bd_part))
+   return -ERANGE;
+   sector += get_start_sect(bdev);
+	return dax_direct_access(dax_inode, sector * 512, &dax->addr,
+			&dax->pfn, dax->size);
+}
+EXPORT_SYMBOL_GPL(bdev_dax_direct_access);
+
+/**
  * bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
  * @blocksize: The block size of the device
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5e7706f7d533..3b3c5ce376fd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1903,6 +1903,9 @@ extern int bdev_read_page(struct block_device *, 
sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
+struct dax_inode;
+extern long bdev_dax_direct_access(struct block_device *bdev,
+   struct dax_inode *dax_inode, struct blk_dax_ctl *dax);
 extern int bdev_dax_supported(struct super_block *, int);
 #else /* CONFIG_BLOCK */
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 5aa620e8e5a2..2ef8e18e2587 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -22,6 +22,8 @@ void *dax_inode_get_private(struct dax_inode *dax_inode);
 void put_dax_inode(struct dax_inode *dax_inode);
 bool dax_inode_alive(struct dax_inode *dax_inode);
 void kill_dax_inode(struct dax_inode *dax_inode);
+long dax_dir
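The calling convention, roughly: a sketch of a consumer that already
holds a block_device, its dax_inode, a sector and a destination buffer
(all assumed here), mirroring how fs/dax.c is converted later in the
series:

	struct blk_dax_ctl dax = {
		.sector = sector,
		.size = PAGE_SIZE,
	};
	long rc;
	int id;

	id = dax_read_lock();			/* pin the dax_inode against unregistration */
	rc = bdev_dax_direct_access(bdev, dax_inode, &dax);
	if (rc >= 0)
		memcpy(buf, dax.addr, PAGE_SIZE);	/* dax.addr and dax.pfn are valid here */
	dax_read_unlock(id);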

[RFC PATCH 05/17] pmem: add dax_operations support

2017-01-28 Thread Dan Williams
Setup a dax_inode to have the same lifetime as the pmem block device and
add a ->direct_access() method that is equivalent to
pmem_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old pmem_direct_access() will be removed.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/dax/dax.h   |7 -
 drivers/nvdimm/Kconfig  |1 +
 drivers/nvdimm/pmem.c   |   55 +++
 drivers/nvdimm/pmem.h   |7 -
 include/linux/dax.h |6 
 tools/testing/nvdimm/pmem-dax.c |   12 -
 6 files changed, 61 insertions(+), 27 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index aeb1d49aafb8..b4c686d2d446 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,15 +13,8 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_inode;
-struct dax_operations;
-struct dax_inode *alloc_dax_inode(void *private, const char *host,
-   const struct dax_operations *ops);
-void put_dax_inode(struct dax_inode *dax_inode);
-bool dax_inode_alive(struct dax_inode *dax_inode);
-void kill_dax_inode(struct dax_inode *dax_inode);
 struct dax_inode *inode_to_dax_inode(struct inode *inode);
 struct inode *dax_inode_to_inode(struct dax_inode *dax_inode);
-void *dax_inode_get_private(struct dax_inode *dax_inode);
 int dax_inode_register(struct dax_inode *dax_inode,
const struct file_operations *fops, struct module *owner,
struct kobject *parent);
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 59e750183b7f..5bdd499b5f4f 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -20,6 +20,7 @@ if LIBNVDIMM
 config BLK_DEV_PMEM
tristate "PMEM: Persistent memory block device support"
default LIBNVDIMM
+   select DAX
select ND_BTT if BTT
select ND_PFN if NVDIMM_PFN
help
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5b536be5a12e..d3d7de645e20 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 #include "pmem.h"
 #include "pfn.h"
@@ -199,13 +200,12 @@ static int pmem_rw_page(struct block_device *bdev, 
sector_t sector,
 }
 
 /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
-__weak long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
+__weak long __pmem_direct_access(struct pmem_device *pmem, phys_addr_t 
dev_addr,
+   void **kaddr, pfn_t *pfn, long size)
 {
-   struct pmem_device *pmem = bdev->bd_queue->queuedata;
-   resource_size_t offset = sector * 512 + pmem->data_offset;
+   resource_size_t offset = dev_addr + pmem->data_offset;
 
-	if (unlikely(is_bad_pmem(&pmem->bb, sector, size)))
+	if (unlikely(is_bad_pmem(&pmem->bb, dev_addr / 512, size)))
return -EIO;
*kaddr = pmem->virt_addr + offset;
*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
@@ -219,22 +219,46 @@ __weak long pmem_direct_access(struct block_device *bdev, 
sector_t sector,
return pmem->size - pmem->pfn_pad - offset;
 }
 
+static long pmem_blk_direct_access(struct block_device *bdev, sector_t sector,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   struct pmem_device *pmem = bdev->bd_queue->queuedata;
+
+   return __pmem_direct_access(pmem, sector * 512, kaddr, pfn, size);
+}
+
 static const struct block_device_operations pmem_fops = {
.owner =THIS_MODULE,
.rw_page =  pmem_rw_page,
-   .direct_access =pmem_direct_access,
+   .direct_access =pmem_blk_direct_access,
.revalidate_disk =  nvdimm_revalidate_disk,
 };
 
+static long pmem_dax_direct_access(struct dax_inode *dax_inode,
+   phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size)
+{
+   struct pmem_device *pmem = dax_inode_get_private(dax_inode);
+
+   return __pmem_direct_access(pmem, dev_addr, kaddr, pfn, size);
+}
+
+static const struct dax_operations pmem_dax_ops = {
+   .direct_access = pmem_dax_direct_access,
+};
+
 static void pmem_release_queue(void *q)
 {
blk_cleanup_queue(q);
 }
 
-static void pmem_release_disk(void *disk)
+static void pmem_release_disk(void *__pmem)
 {
-   del_gendisk(disk);
-   put_disk(disk);
+   struct pmem_device *pmem = __pmem;
+
+   kill_dax_inode(pmem->dax_inode);
+   put_dax_inode(pmem->dax_inode);
+   del_gendisk(pmem->disk);
+   put_disk(pmem->disk);
 }
 
 static int pmem_attach_disk(struct device *dev,
@@ -245,6 +269,7 @@ static int pmem_attach_disk(struct device *dev,
struct vmem_altmap __altmap, *altmap = NULL;
struct resource *res = >res;
struct nd_pfn *nd_pfn = 
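The remainder of pmem_attach_disk() is cut off above; the shape of the
registration it adds is roughly the following sketch (the locals
dax_inode, disk, dev and rc are assumed from the surrounding function,
and the error handling here is illustrative only):

	dax_inode = alloc_dax_inode(pmem, disk->disk_name, &pmem_dax_ops);
	if (!dax_inode)
		return -ENOMEM;
	pmem->dax_inode = dax_inode;
	pmem->disk = disk;

	/* teardown runs through pmem_release_disk() above: kill, then put, the dax_inode */
	rc = devm_add_action_or_reset(dev, pmem_release_disk, pmem);
	if (rc)
		return rc;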

[RFC PATCH 09/17] block: kill bdev_dax_capable()

2017-01-28 Thread Dan Williams
This is leftover dead code that has since been replaced by
bdev_dax_supported().

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/block_dev.c |   24 
 include/linux/blkdev.h |1 -
 2 files changed, 25 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 601b71b76d7f..edb1d2b16b8f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -807,30 +807,6 @@ int bdev_dax_supported(struct super_block *sb, int 
blocksize)
 }
 EXPORT_SYMBOL_GPL(bdev_dax_supported);
 
-/**
- * bdev_dax_capable() - Return if the raw device is capable for dax
- * @bdev: The device for raw block device access
- */
-bool bdev_dax_capable(struct block_device *bdev)
-{
-   struct blk_dax_ctl dax = {
-   .size = PAGE_SIZE,
-   };
-
-   if (!IS_ENABLED(CONFIG_FS_DAX))
-   return false;
-
-   dax.sector = 0;
-	if (bdev_direct_access(bdev, &dax) < 0)
-   return false;
-
-   dax.sector = bdev->bd_part->nr_sects - (PAGE_SIZE / 512);
-	if (bdev_direct_access(bdev, &dax) < 0)
-   return false;
-
-   return true;
-}
-
 /*
  * pseudo-fs
  */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3c0ff78b1219..5e7706f7d533 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1904,7 +1904,6 @@ extern int bdev_write_page(struct block_device *, 
sector_t, struct page *,
struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 extern int bdev_dax_supported(struct super_block *, int);
-extern bool bdev_dax_capable(struct block_device *);
 #else /* CONFIG_BLOCK */
 
 struct block_device;

--


[RFC PATCH 08/17] dcssblk: add dax_operations support

2017-01-28 Thread Dan Williams
Setup a dax_inode to have the same lifetime as the dcssblk block device
and add a ->direct_access() method that is equivalent to
dcssblk_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old dcssblk_direct_access() will be removed.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/s390/block/Kconfig   |1 +
 drivers/s390/block/dcssblk.c |   53 +++---
 2 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index 4a3b62326183..0acb8c2f9475 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -14,6 +14,7 @@ config BLK_DEV_XPRAM
 
 config DCSSBLK
def_tristate m
+   select DAX
prompt "DCSSBLK support"
depends on S390 && BLOCK
help
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 9d66b4fb174b..67b0885b4d12 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 #include 
 
@@ -30,8 +31,10 @@ static int dcssblk_open(struct block_device *bdev, fmode_t 
mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static blk_qc_t dcssblk_make_request(struct request_queue *q,
struct bio *bio);
-static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+static long dcssblk_blk_direct_access(struct block_device *bdev, sector_t 
secnum,
 void **kaddr, pfn_t *pfn, long size);
+static long dcssblk_dax_direct_access(struct dax_inode *dax_inode,
+   phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -40,7 +43,11 @@ static const struct block_device_operations dcssblk_devops = 
{
.owner  = THIS_MODULE,
.open   = dcssblk_open,
.release= dcssblk_release,
-   .direct_access  = dcssblk_direct_access,
+   .direct_access  = dcssblk_blk_direct_access,
+};
+
+static const struct dax_operations dcssblk_dax_ops = {
+   .direct_access = dcssblk_dax_direct_access,
 };
 
 struct dcssblk_dev_info {
@@ -57,6 +64,7 @@ struct dcssblk_dev_info {
struct request_queue *dcssblk_queue;
int num_of_segments;
struct list_head seg_list;
+   struct dax_inode *dax_inode;
 };
 
 struct segment_info {
@@ -389,6 +397,8 @@ dcssblk_shared_store(struct device *dev, struct 
device_attribute *attr, const ch
}
	list_del(&dev_info->lh);
 
+   kill_dax_inode(dev_info->dax_inode);
+   put_dax_inode(dev_info->dax_inode);
del_gendisk(dev_info->gd);
blk_cleanup_queue(dev_info->dcssblk_queue);
dev_info->gd->queue = NULL;
@@ -525,6 +535,7 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
+   struct dax_inode *dax_inode;
char *local_buf;
unsigned long seg_byte_size;
 
@@ -654,6 +665,11 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
if (rc)
goto put_dev;
 
+   dax_inode = alloc_dax_inode(dev_info, dev_info->gd->disk_name,
+			&dcssblk_dax_ops);
+   if (!dax_inode)
+   goto put_dev;
+
	get_device(&dev_info->dev);
	device_add_disk(&dev_info->dev, dev_info->gd);
 
@@ -752,6 +768,8 @@ dcssblk_remove_store(struct device *dev, struct 
device_attribute *attr, const ch
}
 
	list_del(&dev_info->lh);
+   kill_dax_inode(dev_info->dax_inode);
+   put_dax_inode(dev_info->dax_inode);
del_gendisk(dev_info->gd);
blk_cleanup_queue(dev_info->dcssblk_queue);
dev_info->gd->queue = NULL;
@@ -883,21 +901,38 @@ dcssblk_make_request(struct request_queue *q, struct bio 
*bio)
 }
 
 static long
-dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
+__dcssblk_direct_access(struct dcssblk_dev_info *dev_info, phys_addr_t offset,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   unsigned long dev_sz;
+
+   dev_sz = dev_info->end - dev_info->start;
+   *kaddr = (void *) dev_info->start + offset;
+   *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
+
+   return dev_sz - offset;
+}
+
+static long
+dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
void **kaddr, pfn_t *pfn, long size)
 {
struct dcssblk_dev_info *dev_info;
-   unsigned long offset, dev_sz;
 
dev_info = bdev->bd_disk->private_data;
if (!dev_info)
return -ENODEV;
-   dev_sz = dev_info->end - dev_info->start;
-  

[RFC PATCH 07/17] brd: add dax_operations support

2017-01-28 Thread Dan Williams
Setup a dax_inode to have the same lifetime as the brd block device and
add a ->direct_access() method that is equivalent to
brd_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old brd_direct_access() will be removed.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/block/Kconfig |1 +
 drivers/block/brd.c   |   57 +
 2 files changed, 49 insertions(+), 9 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 223ff2fcae7e..604b51a884b6 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -337,6 +337,7 @@ config BLK_DEV_SX8
 
 config BLK_DEV_RAM
tristate "RAM block device support"
+   select DAX if BLK_DEV_RAM_DAX
---help---
  Saying Y here will allow you to use a portion of your RAM memory as
  a block device, so that you can make file systems on it, read and
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..1279df4dc07c 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -21,6 +21,7 @@
 #include 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 #include 
+#include <linux/dax.h>
 #endif
 
 #include 
@@ -41,6 +42,9 @@ struct brd_device {
 
struct request_queue*brd_queue;
struct gendisk  *brd_disk;
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   struct dax_inode*dax_inode;
+#endif
struct list_headbrd_list;
 
/*
@@ -375,15 +379,14 @@ static int brd_rw_page(struct block_device *bdev, 
sector_t sector,
 }
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
-static long brd_direct_access(struct block_device *bdev, sector_t sector,
+static long __brd_direct_access(struct brd_device *brd, phys_addr_t dev_addr,
void **kaddr, pfn_t *pfn, long size)
 {
-   struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
 
if (!brd)
return -ENODEV;
-   page = brd_insert_page(brd, sector);
+   page = brd_insert_page(brd, dev_addr / 512);
if (!page)
return -ENOSPC;
*kaddr = page_address(page);
@@ -391,14 +394,34 @@ static long brd_direct_access(struct block_device *bdev, 
sector_t sector,
 
return PAGE_SIZE;
 }
+
+static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   struct brd_device *brd = bdev->bd_disk->private_data;
+
+   return __brd_direct_access(brd, sector * 512, kaddr, pfn, size);
+}
+
+static long brd_dax_direct_access(struct dax_inode *dax_inode,
+   phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size)
+{
+   struct brd_device *brd = dax_inode_get_private(dax_inode);
+
+   return __brd_direct_access(brd, dev_addr, kaddr, pfn, size);
+}
+
+static const struct dax_operations brd_dax_ops = {
+   .direct_access = brd_dax_direct_access,
+};
 #else
-#define brd_direct_access NULL
+#define brd_blk_direct_access NULL
 #endif
 
 static const struct block_device_operations brd_fops = {
.owner =THIS_MODULE,
.rw_page =  brd_rw_page,
-   .direct_access =brd_direct_access,
+   .direct_access =brd_blk_direct_access,
 };
 
 /*
@@ -441,7 +464,9 @@ static struct brd_device *brd_alloc(int i)
 {
struct brd_device *brd;
struct gendisk *disk;
-
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   struct dax_inode *dax_inode;
+#endif
brd = kzalloc(sizeof(*brd), GFP_KERNEL);
if (!brd)
goto out;
@@ -469,9 +494,6 @@ static struct brd_device *brd_alloc(int i)
blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
brd->brd_queue->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
-#ifdef CONFIG_BLK_DEV_RAM_DAX
-   queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
-#endif
disk = brd->brd_disk = alloc_disk(max_part);
if (!disk)
goto out_free_queue;
@@ -484,8 +506,21 @@ static struct brd_device *brd_alloc(int i)
sprintf(disk->disk_name, "ram%d", i);
set_capacity(disk, rd_size * 2);
 
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
+	dax_inode = alloc_dax_inode(brd, disk->disk_name, &brd_dax_ops);
+   if (!dax_inode)
+   goto out_free_inode;
+#endif
+
+
return brd;
 
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+out_free_inode:
+   kill_dax_inode(dax_inode);
+   put_dax_inode(dax_inode);
+#endif
 out_free_queue:
blk_cleanup_queue(brd->brd_queue);
 out_free_dev:
@@ -525,6 +560,10 @@ static struct brd_device *brd_init_one(int i, bool *new)
 static void brd_del_one(struct brd_device *brd)
 {
list_del(>brd_list);
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   kill_dax_inode(brd->dax_inode);
+   put_dax_inode(brd-

[RFC PATCH 06/17] axon_ram: add dax_operations support

2017-01-28 Thread Dan Williams
Setup a dax_inode to have the same lifetime as the axon_ram block device
and add a ->direct_access() method that is equivalent to
axon_ram_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old axon_ram_direct_access() will be removed.
---
 arch/powerpc/platforms/Kconfig |1 +
 arch/powerpc/sysdev/axonram.c  |   46 +++-
 2 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 7e3a2ebba29b..33244e3d9375 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -284,6 +284,7 @@ config CPM2
 config AXON_RAM
tristate "Axon DDR2 memory device driver"
depends on PPC_IBM_CELL_BLADE && BLOCK
+   select DAX
default m
help
  It registers one block device per Axon's DDR2 memory bank found
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ada29eaed6e2..4e1f58187726 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -25,6 +25,7 @@
 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 #include 
 #include 
@@ -62,6 +63,7 @@ static int azfs_major, azfs_minor;
 struct axon_ram_bank {
struct platform_device  *device;
struct gendisk  *disk;
+   struct dax_inode*dax_inode;
unsigned intirq_id;
unsigned long   ph_addr;
unsigned long   io_addr;
@@ -137,25 +139,45 @@ axon_ram_make_request(struct request_queue *queue, struct 
bio *bio)
return BLK_QC_T_NONE;
 }
 
+static long
+__axon_ram_direct_access(struct axon_ram_bank *bank, phys_addr_t offset,
+  void **kaddr, pfn_t *pfn, long size)
+{
+   *kaddr = (void *) bank->io_addr + offset;
+   *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+   return bank->size - offset;
+}
+
 /**
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
 static long
-axon_ram_direct_access(struct block_device *device, sector_t sector,
+axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
   void **kaddr, pfn_t *pfn, long size)
 {
struct axon_ram_bank *bank = device->bd_disk->private_data;
-   loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
 
-   *kaddr = (void *) bank->io_addr + offset;
-   *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
-   return bank->size - offset;
+   return __axon_ram_direct_access(bank, sector << AXON_RAM_SECTOR_SHIFT,
+   kaddr, pfn, size);
 }
 
 static const struct block_device_operations axon_ram_devops = {
.owner  = THIS_MODULE,
-   .direct_access  = axon_ram_direct_access
+   .direct_access  = axon_ram_blk_direct_access
+};
+
+static long
+axon_ram_dax_direct_access(struct dax_inode *dax_inode, phys_addr_t dev_addr,
+  void **kaddr, pfn_t *pfn, long size)
+{
+   struct axon_ram_bank *bank = dax_inode_get_private(dax_inode);
+
+   return __axon_ram_direct_access(bank, dev_addr, kaddr, pfn, size);
+}
+
+static const struct dax_operations axon_ram_dax_ops = {
+   .direct_access = axon_ram_dax_direct_access,
 };
 
 /**
@@ -219,6 +241,7 @@ static int axon_ram_probe(struct platform_device *device)
goto failed;
}
 
+
bank->disk->major = azfs_major;
bank->disk->first_minor = azfs_minor;
	bank->disk->fops = &axon_ram_devops;
@@ -227,6 +250,11 @@ static int axon_ram_probe(struct platform_device *device)
sprintf(bank->disk->disk_name, "%s%d",
AXON_RAM_DEVICE_NAME, axon_ram_bank_id);
 
+   bank->dax_inode = alloc_dax_inode(bank, bank->disk->disk_name,
+			&axon_ram_dax_ops);
+   if (!bank->dax_inode)
+   goto failed;
+
bank->disk->queue = blk_alloc_queue(GFP_KERNEL);
if (bank->disk->queue == NULL) {
		dev_err(&device->dev, "Cannot register disk queue\n");
@@ -276,6 +304,10 @@ static int axon_ram_probe(struct platform_device *device)
bank->disk->disk_name);
del_gendisk(bank->disk);
}
+   if (bank->dax_inode) {
+   kill_dax_inode(bank->dax_inode);
+   put_dax_inode(bank->dax_inode);
+   }
device->dev.platform_data = NULL;
if (bank->io_addr != 0)
iounmap((void __iomem *) bank->io_addr);
@@ -298,6 +330,8 @@ axon_ram_remove(struct platform_device *device)
 
	device_remove_file(&device->dev, &dev_attr_ecc);
free_irq(bank->irq_id, device);
+   kill_dax_inode(bank->dax_inode);
+   put_dax_inode(bank->dax_inode);
del_gendisk(bank->disk);
iounmap((void __iomem *) bank->io_addr);
kfree(bank);

--

[RFC PATCH 11/17] dm: add dax_operations support (producer)

2017-01-28 Thread Dan Williams
Setup a dax_inode to have the same lifetime as the dm block device and
add a ->direct_access() method that is equivalent to
dm_blk_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old dm_blk_direct_access() will be removed.

This enabling only covers the top-level dm representation exposed to
upper layers. Subsequent patches are needed to convert the bottom-layer
interface to the backing devices.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/md/Kconfig   |1 +
 drivers/md/dm-core.h |3 +++
 drivers/md/dm.c  |   42 +++---
 3 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index b7767da50c26..1de8372d9459 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
 config BLK_DEV_DM
tristate "Device mapper support"
select BLK_DEV_DM_BUILTIN
+   select DAX
---help---
  Device-mapper is a low level volume manager.  It works by allowing
  people to specify mappings for ranges of logical sectors.  Various
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 40ceba1fe8be..f6eb8d8db646 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -24,6 +24,8 @@ struct dm_kobject_holder {
struct completion completion;
 };
 
+struct dax_inode;
+
 /*
  * DM core internal structure that used directly by dm.c and dm-rq.c
  * DM targets must _not_ deference a mapped_device to directly access its 
members!
@@ -58,6 +60,7 @@ struct mapped_device {
struct target_type *immutable_target_type;
 
struct gendisk *disk;
+   struct dax_inode *dax_inode;
char name[16];
 
void *interface_ptr;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index db934b1dba9d..1b3d9253e92c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include <linux/dax.h>
 #include 
 #include 
 #include 
@@ -905,10 +906,10 @@ int dm_set_target_max_io_len(struct dm_target *ti, 
sector_t len)
 }
 EXPORT_SYMBOL_GPL(dm_set_target_max_io_len);
 
-static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
-void **kaddr, pfn_t *pfn, long size)
+static long __dm_direct_access(struct mapped_device *md, phys_addr_t dev_addr,
+  void **kaddr, pfn_t *pfn, long size)
 {
-   struct mapped_device *md = bdev->bd_disk->private_data;
+   sector_t sector = dev_addr >> SECTOR_SHIFT;
struct dm_table *map;
struct dm_target *ti;
int srcu_idx;
@@ -932,6 +933,23 @@ static long dm_blk_direct_access(struct block_device 
*bdev, sector_t sector,
return min(ret, size);
 }
 
+static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
+void **kaddr, pfn_t *pfn, long size)
+{
+   struct mapped_device *md = bdev->bd_disk->private_data;
+
+   return __dm_direct_access(md, sector << SECTOR_SHIFT, kaddr, pfn, size);
+}
+
+static long dm_dax_direct_access(struct dax_inode *dax_inode,
+phys_addr_t dev_addr, void **kaddr, pfn_t *pfn,
+long size)
+{
+   struct mapped_device *md = dax_inode_get_private(dax_inode);
+
+   return __dm_direct_access(md, dev_addr, kaddr, pfn, size);
+}
+
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
  * allowed for all bio types except REQ_PREFLUSH.
@@ -1376,6 +1394,7 @@ static int next_free_minor(int *minor)
 }
 
 static const struct block_device_operations dm_blk_dops;
+static const struct dax_operations dm_dax_ops;
 
 static void dm_wq_work(struct work_struct *work);
 
@@ -1423,6 +1442,12 @@ static void cleanup_mapped_device(struct mapped_device 
*md)
if (md->bs)
bioset_free(md->bs);
 
+   if (md->dax_inode) {
+   kill_dax_inode(md->dax_inode);
+   put_dax_inode(md->dax_inode);
+   md->dax_inode = NULL;
+   }
+
if (md->disk) {
spin_lock(&_minor_lock);
md->disk->private_data = NULL;
@@ -1450,6 +1475,7 @@ static void cleanup_mapped_device(struct mapped_device 
*md)
 static struct mapped_device *alloc_dev(int minor)
 {
int r, numa_node_id = dm_get_numa_node();
+   struct dax_inode *dax_inode;
struct mapped_device *md;
void *old_md;
 
@@ -1514,6 +1540,12 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->queue = md->queue;
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
+
+	dax_inode = alloc_dax_inode(md, md->disk->disk_name, &dm_dax_ops);
+   if (!dax_inode)
+   goto bad;
+   md->dax_inode = dax_inode;
+
add_disk(md->disk);
 

[RFC PATCH 16/17] fs, dax: convert filesystem-dax to bdev_dax_direct_access

2017-01-28 Thread Dan Williams
Now that a dax_inode is plumbed through all dax-capable drivers, we can
switch from block_device_operations to dax_operations for invoking
->direct_access.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/dax.c|  143 +++
 fs/iomap.c  |3 +
 include/linux/dax.h |6 +-
 3 files changed, 82 insertions(+), 70 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index a990211c8a3d..07b36a26db06 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -51,32 +51,6 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
-{
-   struct request_queue *q = bdev->bd_queue;
-   long rc = -EIO;
-
-   dax->addr = ERR_PTR(-EIO);
-   if (blk_queue_enter(q, true) != 0)
-   return rc;
-
-   rc = bdev_direct_access(bdev, dax);
-   if (rc < 0) {
-   dax->addr = ERR_PTR(rc);
-   blk_queue_exit(q);
-   return rc;
-   }
-   return rc;
-}
-
-static void dax_unmap_atomic(struct block_device *bdev,
-   const struct blk_dax_ctl *dax)
-{
-   if (IS_ERR(dax->addr))
-   return;
-   blk_queue_exit(bdev->bd_queue);
-}
-
 static int dax_is_pmd_entry(void *entry)
 {
return (unsigned long)entry & RADIX_DAX_PMD;
@@ -549,21 +523,28 @@ static int dax_load_hole(struct address_space *mapping, 
void **entry,
return ret;
 }
 
-static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t 
size,
-   struct page *to, unsigned long vaddr)
+static int copy_user_dax(struct block_device *bdev, struct dax_inode 
*dax_inode,
+   sector_t sector, size_t size, struct page *to,
+   unsigned long vaddr)
 {
struct blk_dax_ctl dax = {
.sector = sector,
.size = size,
};
void *vto;
+   long rc;
+   int id;
 
-	if (dax_map_atomic(bdev, &dax) < 0)
-		return PTR_ERR(dax.addr);
+	id = dax_read_lock();
+	rc = bdev_dax_direct_access(bdev, dax_inode, &dax);
+   if (rc < 0) {
+   dax_read_unlock(id);
+   return rc;
+   }
vto = kmap_atomic(to);
copy_user_page(vto, (void __force *)dax.addr, vaddr, to);
kunmap_atomic(vto);
-   dax_unmap_atomic(bdev, );
+   dax_read_unlock(id);
return 0;
 }
 
@@ -731,12 +712,13 @@ static void dax_mapping_entry_mkclean(struct 
address_space *mapping,
 }
 
 static int dax_writeback_one(struct block_device *bdev,
-   struct address_space *mapping, pgoff_t index, void *entry)
+   struct dax_inode *dax_inode, struct address_space *mapping,
+   pgoff_t index, void *entry)
 {
	struct radix_tree_root *page_tree = &mapping->page_tree;
struct blk_dax_ctl dax;
void *entry2, **slot;
-   int ret = 0;
+   int ret = 0, id;
 
/*
 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -789,18 +771,20 @@ static int dax_writeback_one(struct block_device *bdev,
dax.size = PAGE_SIZE << dax_radix_order(entry);
 
/*
-* We cannot hold tree_lock while calling dax_map_atomic() because it
-* eventually calls cond_resched().
+* bdev_dax_direct_access() may sleep, so cannot hold tree_lock
+* over its invocation.
 */
-	ret = dax_map_atomic(bdev, &dax);
+	id = dax_read_lock();
+	ret = bdev_dax_direct_access(bdev, dax_inode, &dax);
if (ret < 0) {
+   dax_read_unlock(id);
put_locked_mapping_entry(mapping, index, entry);
return ret;
}
 
if (WARN_ON_ONCE(ret < dax.size)) {
ret = -EIO;
-   goto unmap;
+   goto dax_unlock;
}
 
dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
@@ -814,8 +798,8 @@ static int dax_writeback_one(struct block_device *bdev,
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);
- unmap:
-	dax_unmap_atomic(bdev, &dax);
+ dax_unlock:
+   dax_read_unlock(id);
put_locked_mapping_entry(mapping, index, entry);
return ret;
 
@@ -836,6 +820,7 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
struct inode *inode = mapping->host;
pgoff_t start_index, end_index;
pgoff_t indices[PAGEVEC_SIZE];
+   struct dax_inode *dax_inode;
struct pagevec pvec;
bool done = false;
int i, ret = 0;
@@ -846,6 +831,10 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
return 0;
 
+   dax_inode = dax_get_by_host(bdev->bd_disk->disk_name);
+
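The hunk is truncated here; the rest of dax_writeback_mapping_range()
presumably threads dax_inode through dax_writeback_one() and drops the
reference on exit, along these lines (a sketch, not the literal patch
body; the pagevec walk variables come from the existing function):

	if (!dax_inode)
		return -EIO;

	/* walk the tagged radix-tree entries as before, now passing dax_inode down */
	for (i = 0; i < pagevec_count(&pvec); i++) {
		ret = dax_writeback_one(bdev, dax_inode, mapping,
				indices[i], pvec.pages[i]);
		if (ret < 0)
			break;
	}
	put_dax_inode(dax_inode);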

[RFC PATCH 14/17] ext2, ext4, xfs: retrieve dax_inode through iomap operations

2017-01-28 Thread Dan Williams
In preparation for converting fs/dax.c to use bdev_dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the
dax_inode determined at mount through ->iomap_begin.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/ext2/inode.c   |1 +
 fs/ext4/inode.c   |1 +
 fs/xfs/xfs_aops.c |   13 +
 fs/xfs/xfs_aops.h |1 +
 fs/xfs/xfs_buf.h  |1 +
 fs/xfs/xfs_iomap.c|1 +
 fs/xfs/xfs_super.c|3 +++
 include/linux/iomap.h |1 +
 8 files changed, 22 insertions(+)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f073bfca694b..c83f84748ec9 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -813,6 +813,7 @@ static int ext2_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
 
iomap->flags = 0;
iomap->bdev = inode->i_sb->s_bdev;
+   iomap->dax_inode = inode->i_sb->s_dax;
iomap->offset = (u64)first_block << blkbits;
 
if (ret == 0) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 88d57af1b516..ae6fa6a78d0d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3344,6 +3344,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
 
iomap->flags = 0;
iomap->bdev = inode->i_sb->s_bdev;
+   iomap->dax_inode = inode->i_sb->s_dax;
iomap->offset = first_block << blkbits;
 
if (ret == 0) {
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 631e7c0e0a29..7d22938a4d8b 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -80,6 +80,19 @@ xfs_find_bdev_for_inode(
return mp->m_ddev_targp->bt_bdev;
 }
 
+struct dax_inode *
+xfs_find_dax_for_inode(
+   struct inode*inode)
+{
+   struct xfs_inode*ip = XFS_I(inode);
+   struct xfs_mount*mp = ip->i_mount;
+
+   if (XFS_IS_REALTIME_INODE(ip))
+   return NULL;
+   else
+   return mp->m_ddev_targp->bt_dax;
+}
+
 /*
  * We're now finished for good with this page.  Update the page state via the
  * associated buffer_heads, paying attention to the start and end offsets that
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index cc174ec6c2fd..e5b65f436acf 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -59,5 +59,6 @@ int   xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, 
size_t size);
 
 extern void xfs_count_page_state(struct page *, int *, int *);
 extern struct block_device *xfs_find_bdev_for_inode(struct inode *);
+extern struct dax_inode *xfs_find_dax_for_inode(struct inode *);
 
 #endif /* __XFS_AOPS_H__ */
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 8a9d3a9599f0..1ff83f398649 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -109,6 +109,7 @@ typedef unsigned int xfs_buf_flags_t;
 typedef struct xfs_buftarg {
dev_t   bt_dev;
struct block_device *bt_bdev;
+   struct dax_inode*bt_dax;
struct backing_dev_info *bt_bdi;
struct xfs_mount*bt_mount;
unsigned intbt_meta_sectorsize;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 0d147428971e..1d08bd2433d5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -69,6 +69,7 @@ xfs_bmbt_to_iomap(
iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
iomap->bdev = xfs_find_bdev_for_inode(VFS_I(ip));
+   iomap->dax_inode = xfs_find_dax_for_inode(VFS_I(ip));
 }
 
 xfs_extlen_t
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index eecbaac08eba..1a99013a0701 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -774,6 +774,9 @@ xfs_open_devices(
if (!mp->m_ddev_targp)
goto out_close_rtdev;
 
+   /* associate dax inode for filesystem-dax */
+   mp->m_ddev_targp->bt_dax = mp->m_super->s_dax;
+
if (rtdev) {
mp->m_rtdev_targp = xfs_alloc_buftarg(mp, rtdev);
if (!mp->m_rtdev_targp)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index a4c94b86401e..01e265e7cf55 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -41,6 +41,7 @@ struct iomap {
u16 type;   /* type of mapping */
u16 flags;  /* flags for mapping */
struct block_device *bdev;  /* block device for I/O */
+   struct dax_inode*dax_inode; /* dax_inode for dax operations */
 };
 
 /*

--
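Any other iomap-based filesystem that wants fs-dax follows the same
one-line pattern in its ->iomap_begin; a sketch using a placeholder
"foofs" (not a real filesystem in this series):

static int foofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
		unsigned flags, struct iomap *iomap)
{
	iomap->bdev = inode->i_sb->s_bdev;
	iomap->dax_inode = inode->i_sb->s_dax;	/* set up at mount by mount_bdev() */
	/* block-mapping logic specific to the filesystem goes here */
	return 0;
}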


[RFC PATCH 12/17] dm: add dax_operations support (consumer)

2017-01-28 Thread Dan Williams
Arrange for dm to lookup the dax services available from member
devices. Update the dax-capable targets, linear and stripe, to route dax
operations to the underlying device.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/md/dm-linear.c|   24 
 drivers/md/dm-snap.c  |   12 
 drivers/md/dm-stripe.c|   30 ++
 drivers/md/dm-target.c|   11 +++
 drivers/md/dm.c   |   16 
 include/linux/device-mapper.h |7 +++
 6 files changed, 96 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 4788b0b989a9..e91ca8089333 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -159,6 +159,29 @@ static long linear_direct_access(struct dm_target *ti, 
sector_t sector,
return ret;
 }
 
+static long linear_dax_direct_access(struct dm_target *ti, phys_addr_t 
dev_addr,
+void **kaddr, pfn_t *pfn, long size)
+{
+   struct linear_c *lc = ti->private;
+   struct block_device *bdev = lc->dev->bdev;
+   struct dax_inode *dax_inode = lc->dev->dax_inode;
+   struct blk_dax_ctl dax = {
+   .sector = linear_map_sector(ti, dev_addr >> SECTOR_SHIFT),
+   .size = size,
+   };
+   long ret;
+
+	ret = bdev_dax_direct_access(bdev, dax_inode, &dax);
+   *kaddr = dax.addr;
+   *pfn = dax.pfn;
+
+   return ret;
+}
+
+static const struct dm_dax_operations linear_dax_ops = {
+   .dm_direct_access = linear_dax_direct_access,
+};
+
 static struct target_type linear_target = {
.name   = "linear",
.version = {1, 3, 0},
@@ -170,6 +193,7 @@ static struct target_type linear_target = {
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
.direct_access = linear_direct_access,
+	.dax_ops = &linear_dax_ops,
 };
 
 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index c65feeada864..1990e3bd6958 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2309,6 +2309,13 @@ static long origin_direct_access(struct dm_target *ti, 
sector_t sector,
return -EIO;
 }
 
+static long origin_dax_direct_access(struct dm_target *ti, phys_addr_t 
dev_addr,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   DMWARN("device does not support dax.");
+   return -EIO;
+}
+
 /*
  * Set the target "max_io_len" field to the minimum of all the snapshots'
  * chunk sizes.
@@ -2357,6 +2364,10 @@ static int origin_iterate_devices(struct dm_target *ti,
return fn(ti, o->dev, 0, ti->len, data);
 }
 
+static const struct dm_dax_operations origin_dax_ops = {
+   .dm_direct_access = origin_dax_direct_access,
+};
+
 static struct target_type origin_target = {
.name= "snapshot-origin",
.version = {1, 9, 0},
@@ -2369,6 +2380,7 @@ static struct target_type origin_target = {
.status  = origin_status,
.iterate_devices = origin_iterate_devices,
.direct_access = origin_direct_access,
+	.dax_ops = &origin_dax_ops,
 };
 
 static struct target_type snapshot_target = {
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 28193a57bf47..47fb56a6184a 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -331,6 +331,31 @@ static long stripe_direct_access(struct dm_target *ti, 
sector_t sector,
return ret;
 }
 
+static long stripe_dax_direct_access(struct dm_target *ti, phys_addr_t 
dev_addr,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   struct stripe_c *sc = ti->private;
+   uint32_t stripe;
+   struct block_device *bdev;
+   struct dax_inode *dax_inode;
+   struct blk_dax_ctl dax = {
+   .size = size,
+   };
+   long ret;
+
+	stripe_map_sector(sc, dev_addr >> SECTOR_SHIFT, &stripe, &dax.sector);
+
+   dax.sector += sc->stripe[stripe].physical_start;
+   bdev = sc->stripe[stripe].dev->bdev;
+   dax_inode = sc->stripe[stripe].dev->dax_inode;
+
+	ret = bdev_dax_direct_access(bdev, dax_inode, &dax);
+   *kaddr = dax.addr;
+   *pfn = dax.pfn;
+
+   return ret;
+}
+
 /*
  * Stripe status:
  *
@@ -437,6 +462,10 @@ static void stripe_io_hints(struct dm_target *ti,
blk_limits_io_opt(limits, chunk_size * sc->stripes);
 }
 
+static const struct dm_dax_operations stripe_dax_ops = {
+   .dm_direct_access = stripe_dax_direct_access,
+};
+
 static struct target_type stripe_target = {
.name   = "striped",
.version = {1, 6, 0},
@@ -449,6 +478,7 @@ static struct target_type stripe_target = {
.iterate_devices = stripe_iterate_devices,
.io_hints = stripe_io_hints,
.direct_access = stripe_direct_access,
+	.dax_ops = &stripe_dax_ops,
 };
 
 int __init dm_s
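The dm.c/dm-table.c portion of this patch is truncated above. One
plausible shape for populating dm_dev->dax_inode when a table acquires a
member device is the dax_get_by_host() facility from earlier in the
series; the helper names below are hypothetical, shown only to
illustrate the lifetime pairing:

static void dm_dev_attach_dax(struct dm_dev *dev)
{
	/* may be NULL if the member device registered no dax_operations */
	dev->dax_inode = dax_get_by_host(dev->bdev->bd_disk->disk_name);
}

static void dm_dev_detach_dax(struct dm_dev *dev)
{
	put_dax_inode(dev->dax_inode);
	dev->dax_inode = NULL;
}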

[RFC PATCH 17/17] block: remove block_device_operations.direct_access and related infrastructure

2017-01-28 Thread Dan Williams
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_inode, remove the block
device direct_access enabling.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/powerpc/sysdev/axonram.c |   15 --
 drivers/block/brd.c   |   11 --
 drivers/md/dm-linear.c|   19 -
 drivers/md/dm-snap.c  |8 ---
 drivers/md/dm-stripe.c|   24 --
 drivers/md/dm-table.c |2 +-
 drivers/md/dm-target.c|7 --
 drivers/md/dm.c   |   19 +++--
 drivers/nvdimm/pmem.c |9 
 drivers/s390/block/dcssblk.c  |   16 ---
 fs/block_dev.c|   45 -
 include/linux/blkdev.h|3 ---
 include/linux/device-mapper.h |9 
 13 files changed, 4 insertions(+), 183 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 4e1f58187726..1337b5829980 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -148,23 +148,8 @@ __axon_ram_direct_access(struct axon_ram_bank *bank, 
phys_addr_t offset,
return bank->size - offset;
 }
 
-/**
- * axon_ram_direct_access - direct_access() method for block device
- * @device, @sector, @data: see block_device_operations method
- */
-static long
-axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
-  void **kaddr, pfn_t *pfn, long size)
-{
-   struct axon_ram_bank *bank = device->bd_disk->private_data;
-
-   return __axon_ram_direct_access(bank, sector << AXON_RAM_SECTOR_SHIFT,
-   kaddr, pfn, size);
-}
-
 static const struct block_device_operations axon_ram_devops = {
.owner  = THIS_MODULE,
-   .direct_access  = axon_ram_blk_direct_access
 };
 
 static long
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 1279df4dc07c..52a1259f8ded 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -395,14 +395,6 @@ static long __brd_direct_access(struct brd_device *brd, 
phys_addr_t dev_addr,
return PAGE_SIZE;
 }
 
-static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
-{
-   struct brd_device *brd = bdev->bd_disk->private_data;
-
-   return __brd_direct_access(brd, sector * 512, kaddr, pfn, size);
-}
-
 static long brd_dax_direct_access(struct dax_inode *dax_inode,
phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size)
 {
@@ -414,14 +406,11 @@ static long brd_dax_direct_access(struct dax_inode 
*dax_inode,
 static const struct dax_operations brd_dax_ops = {
.direct_access = brd_dax_direct_access,
 };
-#else
-#define brd_blk_direct_access NULL
 #endif
 
 static const struct block_device_operations brd_fops = {
.owner =THIS_MODULE,
.rw_page =  brd_rw_page,
-   .direct_access =brd_blk_direct_access,
 };
 
 /*
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index e91ca8089333..7ec2a8eb8a14 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -141,24 +141,6 @@ static int linear_iterate_devices(struct dm_target *ti,
return fn(ti, lc->dev, lc->start, ti->len, data);
 }
 
-static long linear_direct_access(struct dm_target *ti, sector_t sector,
-void **kaddr, pfn_t *pfn, long size)
-{
-   struct linear_c *lc = ti->private;
-   struct block_device *bdev = lc->dev->bdev;
-   struct blk_dax_ctl dax = {
-   .sector = linear_map_sector(ti, sector),
-   .size = size,
-   };
-   long ret;
-
-	ret = bdev_direct_access(bdev, &dax);
-   *kaddr = dax.addr;
-   *pfn = dax.pfn;
-
-   return ret;
-}
-
 static long linear_dax_direct_access(struct dm_target *ti, phys_addr_t 
dev_addr,
 void **kaddr, pfn_t *pfn, long size)
 {
@@ -192,7 +174,6 @@ static struct target_type linear_target = {
.status = linear_status,
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
-   .direct_access = linear_direct_access,
	.dax_ops = &linear_dax_ops,
 };
 
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 1990e3bd6958..1d9407633bb5 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2302,13 +2302,6 @@ static int origin_map(struct dm_target *ti, struct bio 
*bio)
return do_origin(o->dev, bio);
 }
 
-static long origin_direct_access(struct dm_target *ti, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
-{
-   DMWARN("device does not support dax.");
-   return -EIO;
-}
-
 static long origin_dax_direct_access(struct dm_target *ti, phys_addr_t 
dev_addr,

[RFC PATCH 03/17] dax: add a facility to lookup a dax inode by 'host' device name

2017-01-28 Thread Dan Williams
For the current block_device-based filesystem-dax path, we need a way
to look up the dax_inode associated with a block_device. Add a 'host'
property to the dax_inode that can be used for this purpose. It is a
free-form string, but for a dax_inode associated with a block device it
is the bdev name.

This is a band-aid until filesystems are able to mount on a dax-inode
directly.

We use a hash list since blkdev_writepages() will need to use this
interface to issue dax_writeback_mapping_range().

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/dax/dax.h|2 +
 drivers/dax/device.c |2 +
 drivers/dax/super.c  |   79 +-
 include/linux/dax.h  |1 +
 4 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index def061aa75f4..f33c16ed2ec6 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,7 +13,7 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_inode;
-struct dax_inode *alloc_dax_inode(void *private);
+struct dax_inode *alloc_dax_inode(void *private, const char *host);
 void put_dax_inode(struct dax_inode *dax_inode);
 bool dax_inode_alive(struct dax_inode *dax_inode);
 void kill_dax_inode(struct dax_inode *dax_inode);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index af06d0bfd6ea..6d0a3241a608 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -560,7 +560,7 @@ struct dax_dev *devm_create_dax_dev(struct dax_region 
*dax_region,
goto err_id;
}
 
-   dax_inode = alloc_dax_inode(dax_dev);
+   dax_inode = alloc_dax_inode(dax_dev, NULL);
if (!dax_inode)
goto err_inode;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 7c4dc97d53a8..7ac048f94b2b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -30,6 +30,10 @@ static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
 
+#define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
+static struct hlist_head dax_host_list[DAX_HASH_SIZE];
+static DEFINE_SPINLOCK(dax_host_lock);
+
 int dax_read_lock(void)
 {
return srcu_read_lock(_srcu);
@@ -46,12 +50,15 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
  * struct dax_inode - anchor object for dax services
  * @inode: core vfs
  * @cdev: optional character interface for "device dax"
+ * @host: optional name for lookups where the device path is not available
  * @private: dax driver private data
  * @alive: !alive + rcu grace period == no new operations / mappings
  */
 struct dax_inode {
+   struct hlist_node list;
struct inode inode;
struct cdev cdev;
+   const char *host;
void *private;
bool alive;
 };
@@ -63,6 +70,11 @@ bool dax_inode_alive(struct dax_inode *dax_inode)
 }
 EXPORT_SYMBOL_GPL(dax_inode_alive);
 
+static int dax_host_hash(const char *host)
+{
+   return hashlen_hash(hashlen_string("DAX", host)) % DAX_HASH_SIZE;
+}
+
 /*
  * Note, rcu is not protecting the liveness of dax_inode, rcu is
  * ensuring that any fault handlers or operations that might have seen
@@ -75,6 +87,12 @@ void kill_dax_inode(struct dax_inode *dax_inode)
return;
 
dax_inode->alive = false;
+
+	spin_lock(&dax_host_lock);
+	if (!hlist_unhashed(&dax_inode->list))
+		hlist_del_init(&dax_inode->list);
+	spin_unlock(&dax_host_lock);
+
synchronize_srcu(_srcu);
dax_inode->private = NULL;
 }
@@ -98,6 +116,8 @@ static void dax_i_callback(struct rcu_head *head)
struct inode *inode = container_of(head, struct inode, i_rcu);
struct dax_inode *dax_inode = to_dax_inode(inode);
 
+   kfree(dax_inode->host);
+   dax_inode->host = NULL;
	ida_simple_remove(&dax_minor_ida, MINOR(inode->i_rdev));
kmem_cache_free(dax_cache, dax_inode);
 }
@@ -169,26 +189,49 @@ static struct dax_inode *dax_inode_get(dev_t devt)
return dax_inode;
 }
 
-struct dax_inode *alloc_dax_inode(void *private)
+static void dax_add_host(struct dax_inode *dax_inode, const char *host)
+{
+   int hash;
+
+   INIT_HLIST_NODE(&dax_inode->list);
+   if (!host)
+   return;
+
+   dax_inode->host = host;
+   hash = dax_host_hash(host);
+   spin_lock(&dax_host_lock);
+   hlist_add_head(&dax_inode->list, &dax_host_list[hash]);
+   spin_unlock(&dax_host_lock);
+}
+
+struct dax_inode *alloc_dax_inode(void *private, const char *__host)
 {
struct dax_inode *dax_inode;
+   const char *host;
dev_t devt;
int minor;
 
+   host = kstrdup(__host, GFP_KERNEL);
+   if (__host && !host)
+   return NULL;
+
minor = ida_simple_get(&dax_minor_ida, 0, nr_dax, GFP_KERNEL);
if (minor < 0)
-   return NULL;
+   goto err_minor;
 
devt = MKDEV(MAJOR(dax_devt), mino

[RFC PATCH 01/17] dax: refactor dax-fs into a generic provider of dax inodes

2017-01-28 Thread Dan Williams
We want dax capable drivers to be able to publish a set of dax
operations [1]. However, we do not want to further abuse block_devices
to advertise these operations. Instead we will attach these operations
to a dax inode and add a lookup mechanism to go from block device path
to a dax inode. A dax capable driver like pmem or brd is responsible for
registering a dax inode, alongside a block device, and then a dax
capable filesystem is responsible for retrieving the dax inode by path
name if it wants to call dax_operations.

For now, we refactor the dax pseudo-fs to be a generic facility rather
than an implementation detail of the device-dax use case. In this model
a "dax inode" is just an inode + dax infrastructure, and "Device DAX" is
a mapping service layered on top of that base inode. "Filesystem DAX" is
then a mapping service that layers a filesystem on top of the base dax
inode. Filesystem DAX goes through a block_device for now, but it could
mount directly on a dax inode in the future, e.g. for new pmem-only
filesystems.

[1]: https://lkml.org/lkml/2017/1/19/880
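
As an aside, the driver-facing flow with these interfaces is roughly the
following sketch (names like example_dev are hypothetical, and a later
patch in the series adds a 'host' argument to alloc_dax_inode() so
filesystems can look the dax inode up by bdev name):

/*
 * Sketch: a block driver such as pmem or brd publishing a dax_inode
 * alongside its gendisk, using the helpers declared in dax.h below.
 */
struct example_dev {
	struct gendisk *disk;
	struct dax_inode *dax_inode;
};

static int example_attach_dax(struct example_dev *dev)
{
	dev->dax_inode = alloc_dax_inode(dev);
	if (!dev->dax_inode)
		return -ENOMEM;
	return 0;
}

static void example_detach_dax(struct example_dev *dev)
{
	kill_dax_inode(dev->dax_inode);	/* fence off new operations */
	put_dax_inode(dev->dax_inode);	/* drop the final reference */
}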

Suggested-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/Makefile|2 
 drivers/dax/Kconfig |8 +
 drivers/dax/Makefile|5 +
 drivers/dax/dax.h   |   24 ++-
 drivers/dax/device-dax.h|   25 +++
 drivers/dax/device.c|  241 +
 drivers/dax/pmem.c  |2 
 drivers/dax/super.c |  310 +++
 tools/testing/nvdimm/Kbuild |6 -
 9 files changed, 402 insertions(+), 221 deletions(-)
 create mode 100644 drivers/dax/device-dax.h
 rename drivers/dax/{dax.c => device.c} (75%)
 create mode 100644 drivers/dax/super.c

diff --git a/drivers/Makefile b/drivers/Makefile
index 060026a02f59..17f42e4a6717 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -68,7 +68,7 @@ obj-$(CONFIG_PARPORT) += parport/
 obj-$(CONFIG_NVM)  += lightnvm/
 obj-y  += base/ block/ misc/ mfd/ nfc/
 obj-$(CONFIG_LIBNVDIMM)+= nvdimm/
-obj-$(CONFIG_DEV_DAX)  += dax/
+obj-$(CONFIG_DAX)  += dax/
 obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
 obj-$(CONFIG_NUBUS)+= nubus/
 obj-y  += macintosh/
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 3e2ab3b14eea..39bcbf4c5e40 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,6 +1,11 @@
-menuconfig DEV_DAX
+menuconfig DAX
tristate "DAX: direct access to differentiated memory"
default m if NVDIMM_DAX
+
+if DAX
+
+config DEV_DAX
+   tristate "Device DAX: direct access mapping device"
depends on TRANSPARENT_HUGEPAGE
help
  Support raw access to differentiated (persistence, bandwidth,
@@ -10,7 +15,6 @@ menuconfig DEV_DAX
  baseline memory pool.  Mappings of a /dev/daxX.Y device impose
  restrictions that make the mapping behavior deterministic.
 
-if DEV_DAX
 
 config DEV_DAX_PMEM
tristate "PMEM DAX: direct access to persistent memory"
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 27c54e38478a..dc7422530462 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -1,4 +1,7 @@
-obj-$(CONFIG_DEV_DAX) += dax.o
+obj-$(CONFIG_DAX) += dax.o
+obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
 
+dax-y := super.o
 dax_pmem-y := pmem.o
+device_dax-y := device.o
diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index ddd829ab58c0..def061aa75f4 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ * Copyright(c) 2016 - 2017 Intel Corporation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of version 2 of the GNU General Public License as
@@ -12,14 +12,16 @@
  */
 #ifndef __DAX_H__
 #define __DAX_H__
-struct device;
-struct dax_dev;
-struct resource;
-struct dax_region;
-void dax_region_put(struct dax_region *dax_region);
-struct dax_region *alloc_dax_region(struct device *parent,
-   int region_id, struct resource *res, unsigned int align,
-   void *addr, unsigned long flags);
-struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
-   struct resource *res, int count);
+struct dax_inode;
+struct dax_inode *alloc_dax_inode(void *private);
+void put_dax_inode(struct dax_inode *dax_inode);
+bool dax_inode_alive(struct dax_inode *dax_inode);
+void kill_dax_inode(struct dax_inode *dax_inode);
+struct dax_inode *inode_to_dax_inode(struct inode *inode);
+struct inode *dax_inode_to_inode(struct dax_inode *dax_inode);
+void *dax_inode_get_private(struct dax_inode *dax_inode);
+int dax_inode_register(struct dax_inode *dax_inode,
+   const

[RFC PATCH 00/17] introduce a dax_inode for dax_operations

2017-01-28 Thread Dan Williams
Recently there was an effort to introduce dax_operations to unwind the
abuse of the user-copy api in the pmem api [1]. Christoph noted that we
should not add new block-dax operations as it is further abuse of struct
block_device [2].

The ->direct_access() method in block_device_operations was an expedient
way to get the filesystem-dax capability bootstrapped. However, looking
forward to native persistent memory filesystems, they can forgo the
block layer and mount directly on a provider of dax services, a dax
inode.

For the time being, since current dax capable filesystems are block
based, we need a facility to look up this dax object via the
block-device name. If this approach looks reasonable I'll follow up with
reworking the proposed ->copy_from_iter(), ->flush(), and ->clear() dax
operations into this new scheme.
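
To make the end state concrete, the operations table could look roughly
like the sketch below; the member names follow the earlier discussion,
but the exact signatures are an assumption until the follow-up patches
are posted:

/*
 * Sketch of a possible dax_operations table.  Signatures are
 * illustrative only.
 */
struct dax_operations {
	/* translate a pgoff/nr_pages range to a kernel address and pfn */
	long (*direct_access)(struct dax_inode *dax_inode, pgoff_t pgoff,
			long nr_pages, void **kaddr, pfn_t *pfn);
	/* copy data in via the media-aware (e.g. pmem) path */
	size_t (*copy_from_iter)(struct dax_inode *dax_inode, pgoff_t pgoff,
			void *addr, size_t bytes, struct iov_iter *i);
	/* write back cpu caches for the given range */
	void (*flush)(struct dax_inode *dax_inode, pgoff_t pgoff,
			void *addr, size_t size);
	/* clear poison / zero the given range */
	int (*clear)(struct dax_inode *dax_inode, pgoff_t pgoff, size_t size);
};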

These patches survive a run of the libnvdimm unit tests, but I have not
tested the non-libnvdimm dax drivers.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008586.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008638.html

---

Dan Williams (17):
  dax: refactor dax-fs into a generic provider of dax inodes
  dax: convert dax_inode locking to srcu
  dax: add a facility to lookup a dax inode by 'host' device name
  dax: introduce dax_operations
  pmem: add dax_operations support
  axon_ram: add dax_operations support
  brd: add dax_operations support
  dcssblk: add dax_operations support
  block: kill bdev_dax_capable()
  block: introduce bdev_dax_direct_access()
  dm: add dax_operations support (producer)
  dm: add dax_operations support (consumer)
  fs: update mount_bdev() to lookup dax infrastructure
  ext2, ext4, xfs: retrieve dax_inode through iomap operations
  Revert "block: use DAX for partition table reads"
  fs, dax: convert filesystem-dax to bdev_dax_direct_access
  block: remove block_device_operations.direct_access and related infrastructure


 arch/powerpc/platforms/Kconfig  |1 
 arch/powerpc/sysdev/axonram.c   |   37 +++
 block/Kconfig   |1 
 block/partition-generic.c   |   17 --
 drivers/Makefile|2 
 drivers/block/Kconfig   |1 
 drivers/block/brd.c |   48 +++-
 drivers/dax/Kconfig |9 +
 drivers/dax/Makefile|5 
 drivers/dax/dax.h   |   19 +-
 drivers/dax/device-dax.h|   25 ++
 drivers/dax/device.c|  257 ---
 drivers/dax/pmem.c  |2 
 drivers/dax/super.c |  434 +++
 drivers/md/Kconfig  |1 
 drivers/md/dm-core.h|3 
 drivers/md/dm-linear.c  |   15 +
 drivers/md/dm-snap.c|8 +
 drivers/md/dm-stripe.c  |   16 +
 drivers/md/dm-table.c   |2 
 drivers/md/dm-target.c  |   10 +
 drivers/md/dm.c |   43 +++-
 drivers/nvdimm/Kconfig  |1 
 drivers/nvdimm/pmem.c   |   46 +++-
 drivers/nvdimm/pmem.h   |7 -
 drivers/s390/block/Kconfig  |1 
 drivers/s390/block/dcssblk.c|   41 +++-
 fs/block_dev.c  |   75 ++-
 fs/dax.c|  149 ++---
 fs/ext2/inode.c |1 
 fs/ext4/inode.c |1 
 fs/iomap.c  |3 
 fs/super.c  |   32 +++
 fs/xfs/xfs_aops.c   |   13 +
 fs/xfs/xfs_aops.h   |1 
 fs/xfs/xfs_buf.h|1 
 fs/xfs/xfs_iomap.c  |1 
 fs/xfs/xfs_super.c  |3 
 include/linux/blkdev.h  |7 -
 include/linux/dax.h |   29 ++-
 include/linux/device-mapper.h   |   16 +
 include/linux/fs.h  |1 
 include/linux/iomap.h   |1 
 tools/testing/nvdimm/Kbuild |6 -
 tools/testing/nvdimm/pmem-dax.c |   12 -
 45 files changed, 927 insertions(+), 477 deletions(-)
 create mode 100644 drivers/dax/device-dax.h
 rename drivers/dax/{dax.c => device.c} (74%)
 create mode 100644 drivers/dax/super.c


[RFC PATCH 02/17] dax: convert dax_inode locking to srcu

2017-01-28 Thread Dan Williams
In preparation for adding dax_operations that perform ->direct_access()
and user copy operations relative to a dax_inode, convert the existing
dax_inode locking to srcu. Some dax drivers need to sleep in their
->direct_access() methods and user copying may fault / sleep.
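
The read-side idiom that results, and which the diffs below apply to the
fault and mmap paths, is sketched here (example_dax_op() is a made-up
caller, not code from this patch):

/*
 * Sketch: bracket potentially-sleeping dax work with the srcu read
 * lock and re-check liveness inside the critical section.
 */
static int example_dax_op(struct dax_inode *dax_inode)
{
	int rc, id;

	id = dax_read_lock();
	if (!dax_inode_alive(dax_inode)) {
		dax_read_unlock(id);
		return -ENXIO;
	}
	rc = 0;		/* ... do the work that may fault or sleep ... */
	dax_read_unlock(id);

	return rc;
}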

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/dax/Kconfig  |1 +
 drivers/dax/device.c |   18 +-
 drivers/dax/super.c  |   20 
 include/linux/dax.h  |3 +++
 4 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 39bcbf4c5e40..b7053eafd88e 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,5 +1,6 @@
 menuconfig DAX
tristate "DAX: direct access to differentiated memory"
+   select SRCU
default m if NVDIMM_DAX
 
 if DAX
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 5b5572314929..af06d0bfd6ea 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -333,16 +333,16 @@ static int __dax_dev_fault(struct dax_dev *dax_dev, struct vm_area_struct *vma,
 
 static int dax_dev_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-   int rc;
+   int rc, id;
struct file *filp = vma->vm_file;
struct dax_dev *dax_dev = filp->private_data;
 
dev_dbg(&dax_dev->dev, "%s: %s: %s (%#lx - %#lx)\n", __func__,
current->comm, (vmf->flags & FAULT_FLAG_WRITE)
? "write" : "read", vma->vm_start, vma->vm_end);
-   rcu_read_lock();
+   id = dax_read_lock();
rc = __dax_dev_fault(dax_dev, vma, vmf);
-   rcu_read_unlock();
+   dax_read_unlock(id);
 
return rc;
 }
@@ -390,7 +390,7 @@ static int __dax_dev_pmd_fault(struct dax_dev *dax_dev,
 static int dax_dev_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, unsigned int flags)
 {
-   int rc;
+   int rc, id;
struct file *filp = vma->vm_file;
struct dax_dev *dax_dev = filp->private_data;
 
@@ -398,9 +398,9 @@ static int dax_dev_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
current->comm, (flags & FAULT_FLAG_WRITE)
? "write" : "read", vma->vm_start, vma->vm_end);
 
-   rcu_read_lock();
+   id = dax_read_lock();
rc = __dax_dev_pmd_fault(dax_dev, vma, addr, pmd, flags);
-   rcu_read_unlock();
+   dax_read_unlock(id);
 
return rc;
 }
@@ -412,8 +412,8 @@ static const struct vm_operations_struct dax_dev_vm_ops = {
 
 static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 {
+   int rc, id;
struct dax_dev *dax_dev = filp->private_data;
-   int rc;
 
dev_dbg(&dax_dev->dev, "%s\n", __func__);
 
@@ -421,9 +421,9 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 * We lock to check dax_inode liveness and will re-check at
 * fault time.
 */
-   rcu_read_lock();
+   id = dax_read_lock();
rc = check_vma(dax_dev, vma, __func__);
-   rcu_read_unlock();
+   dax_read_unlock(id);
if (rc)
return rc;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e6369b851619..7c4dc97d53a8 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -24,11 +24,24 @@ module_param(nr_dax, int, S_IRUGO);
 MODULE_PARM_DESC(nr_dax, "max number of dax device instances");
 
 static dev_t dax_devt;
+DEFINE_STATIC_SRCU(dax_srcu);
 static struct vfsmount *dax_mnt;
 static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
 
+int dax_read_lock(void)
+{
+   return srcu_read_lock(&dax_srcu);
+}
+EXPORT_SYMBOL_GPL(dax_read_lock);
+
+void dax_read_unlock(int id)
+{
+   srcu_read_unlock(&dax_srcu, id);
+}
+EXPORT_SYMBOL_GPL(dax_read_unlock);
+
 /**
  * struct dax_inode - anchor object for dax services
  * @inode: core vfs
@@ -45,8 +58,7 @@ struct dax_inode {
 
 bool dax_inode_alive(struct dax_inode *dax_inode)
 {
-   RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
-   "dax operations require rcu_read_lock()\n");
+   lockdep_assert_held(&dax_srcu);
return dax_inode->alive;
 }
 EXPORT_SYMBOL_GPL(dax_inode_alive);
@@ -55,7 +67,7 @@ EXPORT_SYMBOL_GPL(dax_inode_alive);
  * Note, rcu is not protecting the liveness of dax_inode, rcu is
  * ensuring that any fault handlers or operations that might have seen
  * dax_inode_alive(), have completed.  Any operations that start after
- * synchronize_rcu() has run will abort upon seeing !dax_inode_alive().
+ * synchronize_srcu() has run will abort upon seeing !dax_inode_alive().
  */
 void kill_dax_inode(struct dax_inode *dax_inode)
 {
@@ -63,7 +75,7 @@ void kill_dax_inode(struct dax_inode *dax_inode

Re: [PATCH 0/4 RFC] BDI lifetime fix

2017-01-26 Thread Dan Williams
On Thu, Jan 26, 2017 at 9:45 AM, Jan Kara  wrote:
> Hello,
>
> this patch series attempts to solve the problems with the life time of a
> backing_dev_info structure. Currently it lives inside request_queue structure
> and thus it gets destroyed as soon as request queue goes away. However
> the block device inode still stays around and thus inode_to_bdi() call on
> that inode (e.g. from flusher worker) may happen after request queue has been
> destroyed resulting in oops.
>
> This patch set tries to solve these problems by making backing_dev_info
> independent structure referenced from block device inode. That makes sure
> inode_to_bdi() cannot ever oops. The patches are lightly tested for now
> (they boot, basic tests with adding & removing loop devices seem to do what
> I'd expect them to do ;). If someone is able to reproduce crashes on bdi
> when device goes away, please test these patches.

This survives several runs of the libnvdimm unit tests which stress
del_gendisk() and blk_cleanup_queue(). I'll keep testing since the
failure was intermittent, but this is looking good.

> I'd also appreciate if people had a look whether the approach I took looks
> sensible.

Looks sensible, just the kref comment.

I also don't see a need to tack the bdi device name reuse problem onto
this series. I'm wondering if we can handle that separately with
device_rename(bdi->dev, ...) once we know scsi is done with the old
bdi but it has not yet finished being deleted.


Re: [PATCH 3/4] block: Dynamically allocate and refcount backing_dev_info

2017-01-26 Thread Dan Williams
On Thu, Jan 26, 2017 at 9:45 AM, Jan Kara  wrote:
> Instead of storing backing_dev_info inside struct request_queue,
> allocate it dynamically, reference count it, and free it when the last
> reference is dropped. Currently only request_queue holds the reference
> but in the following patch we add other users referencing
> backing_dev_info.
>
> Signed-off-by: Jan Kara 
[..]
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index e850e76acaaf..4282f21b1611 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -144,7 +144,9 @@ struct backing_dev_info {
>
> char *name;
>
> -   unsigned int capabilities; /* Device capabilities */
> +   atomic_t refcnt;/* Reference counter for the structure */
> +   unsigned int capabilities:31;   /* Device capabilities */
> +   unsigned int free_on_put:1; /* Structure will be freed on last bdi_put() */
> unsigned int min_ratio;
> unsigned int max_ratio, max_prop_frac;
>

Any reason to not just use struct kref for this? The "free on final
put" should be implicit. In other words, if there is a path that would
ever clear this flag, then it really should just be holding its own
reference.
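
For comparison, a kref-based variant would be along these lines (a
sketch only, with field and helper names invented for the example, not
a claim about how the final bdi patch should look):

/*
 * Sketch: struct kref makes "free on final put" implicit, so no
 * free_on_put flag is needed.
 */
#include <linux/kref.h>
#include <linux/slab.h>

struct example_bdi {
	struct kref refcnt;
	unsigned int capabilities;
	/* ... remainder of backing_dev_info ... */
};

static void example_bdi_release(struct kref *kref)
{
	struct example_bdi *bdi = container_of(kref, struct example_bdi, refcnt);

	kfree(bdi);
}

static inline void example_bdi_get(struct example_bdi *bdi)
{
	kref_get(&bdi->refcnt);
}

static inline void example_bdi_put(struct example_bdi *bdi)
{
	kref_put(&bdi->refcnt, example_bdi_release);
}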


Re: [RFC PATCH v2 0/2] block: fix backing_dev_info lifetime

2017-01-26 Thread Dan Williams
On Thu, Jan 26, 2017 at 5:17 AM, Christoph Hellwig  wrote:
> On Thu, Jan 26, 2017 at 11:06:53AM +0100, Jan Kara wrote:
>> Yeah, so my patches (and I suspect your as well), have a problem when the
>> backing_device_info stays around because blkdev inode still exists, device
>> gets removed (e.g. USB disk gets unplugged) but blkdev inode still stays
>> around (there doesn't appear to be anything that would be forcing blkdev
>> inode out of cache on device removal and there cannot be because different
>> processes may hold inode reference) and then some other device gets plugged
>> in and reuses the same MAJOR:MINOR combination. Things get awkward there, I
>> think we need to unhash blkdev inode on device removal but so far I didn't
>> make this work...
>
> The other option is to simply not release the dev_t until the backing_dev
> is gone.

I came to a similar conclusion here:

   https://marc.info/?l=linux-scsi&m=147103737421897&w=4

James had some concerns, but I think it's now clear this problem is
bigger than something we can fix locally in scsi.


Re: [RFC PATCH v2 0/2] block: fix backing_dev_info lifetime

2017-01-25 Thread Dan Williams
On Mon, Jan 23, 2017 at 1:17 PM, Thiago Jung Bauermann
<bauer...@linux.vnet.ibm.com> wrote:
> Hello Dan,
>
> On Friday, 6 January 2017, 17:02:51 BRST, Dan Williams wrote:
>> v1 of these changes [1] was a one line change to bdev_get_queue() to
>> prevent a shutdown crash when del_gendisk() races the final
>> __blkdev_put().
>>
>> While it is known at del_gendisk() time that the queue is still alive,
>> Jan Kara points to other paths [2] that are racing __blkdev_put() where
>> the assumption that ->bd_queue, or inode->i_wb is valid does not hold.
>>
>> Fix that broken assumption, make it the case that if you have a live
>> block_device, or block_device-inode that the corresponding queue and
>> inode-write-back data is still valid.
>>
>> These changes survive a run of the libnvdimm unit test suite which puts
>> some stress on the block_device shutdown path.
>
> I realize that the kernel test robot found problems with this series, but FWIW
> it fixes the bug mentioned in [2].
>

Thanks for the test result. I might take a look at cleaning up the
test robot reports and resubmitting this approach unless Jan beats me
to the punch with his backing_dev_info lifetime change patches.

>> [2]: http://www.spinics.net/lists/linux-fsdevel/msg105153.html


Re: [LSF/MM TOPIC] block level event logging for storage media management

2017-01-22 Thread Dan Williams
[ adding Oleg ]

On Sun, Jan 22, 2017 at 10:00 PM, Song Liu  wrote:
> Hi Dan,
>
> I think the block level event log is more like a log-only system. When an
> event happens, it is not necessary to take immediate action. (I guess this
> is different from the bad block list?)
>
> I would hope the event log can track more information. Some of these
> individual events may not be very interesting, for example, soft errors or
> latency outliers. However, when we gather event logs for a fleet of devices,
> these "soft events" may become valuable for health monitoring.

I'd be interested in this. It sounds like you're trying to fill a gap
between tracing and console log messages, which I believe others have
encountered as well.

