[RFC PATCH] scsi, block: fix duplicate bdi name registration crashes

2017-01-28 Thread Dan Williams
Warnings of the following form occur because scsi reuses a devt number
while the block layer still has it referenced as the name of the bdi
[1]:

 WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
 sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
 [..]
 Call Trace:
  dump_stack+0x86/0xc3
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  ? kernfs_path_from_node+0x4f/0x60
  sysfs_warn_dup+0x62/0x80
  sysfs_create_dir_ns+0x77/0x90
  kobject_add_internal+0xb2/0x350
  kobject_add+0x75/0xd0
  device_add+0x15a/0x650
  device_create_groups_vargs+0xe0/0xf0
  device_create_vargs+0x1c/0x20
  bdi_register+0x90/0x240
  ? lockdep_init_map+0x57/0x200
  bdi_register_owner+0x36/0x60
  device_add_disk+0x1bb/0x4e0
  ? __pm_runtime_use_autosuspend+0x5c/0x70
  sd_probe_async+0x10d/0x1c0
  async_run_entry_fn+0x39/0x170

This is a brute-force fix to pass the devt release information from
sd_probe() to the locations where we register the bdi,
device_add_disk(), and unregister the bdi, blk_cleanup_queue().
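
As a rough user-space illustration of the idea (not the kernel code): the
index that determines the devt is owned by a small reference-counted object,
and it only goes back to the allocator when the last holder drops its
reference. The names struct ref, devt_ref, index_owner and the index_in_use
table are hypothetical stand-ins for kref, disk_devt, sd_devt and
sd_index_ida, and the counter is deliberately non-atomic to keep the sketch
short.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, non-atomic stand-in for the kernel's struct kref. */
struct ref {
	int count;
};

/* Same shape as the patch's disk_devt: a refcount plus a release hook. */
struct devt_ref {
	struct ref ref;
	void (*release)(struct devt_ref *d);
};

/* Same shape as the patch's sd_devt: the index travels with the refcount. */
struct index_owner {
	int idx;
	struct devt_ref devt;
};

#define MAX_INDEX 8
static int index_in_use[MAX_INDEX];	/* toy replacement for an ida */

static void devt_get(struct devt_ref *d)
{
	if (d)
		d->ref.count++;
}

static void devt_put(struct devt_ref *d)
{
	/* The release hook runs only when the last reference goes away. */
	if (d && --d->ref.count == 0)
		d->release(d);
}

static void index_owner_release(struct devt_ref *d)
{
	/* Open-coded container_of(): recover the enclosing index_owner. */
	struct index_owner *o = (struct index_owner *)
		((char *)d - offsetof(struct index_owner, devt));

	index_in_use[o->idx] = 0;	/* the index may be reused from here on */
	free(o);
}

static struct devt_ref *index_owner_alloc(int idx)
{
	struct index_owner *o;

	if (idx < 0 || idx >= MAX_INDEX)
		return NULL;
	o = calloc(1, sizeof(*o));
	if (!o)
		return NULL;
	o->idx = idx;
	o->devt.ref.count = 1;		/* initial reference held by the disk */
	o->devt.release = index_owner_release;
	index_in_use[idx] = 1;
	return &o->devt;
}
```

In this sketch the registration path would call devt_get() and the queue
cleanup path would call devt_put(), so the index stays reserved until both
the disk and the bdi are done with it.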

Thanks to Omar for the quick reproducer script [2]. This patch survives
where an unmodified kernel fails in a few seconds.

[1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4
[2]: http://marc.info/?l=linux-block&m=148554717109098&w=2

Cc: James Bottomley 
Cc: Bart Van Assche 
Cc: "Martin K. Petersen" 
Cc: Christoph Hellwig 
Cc: Jens Axboe 
Reported-by: Omar Sandoval 
Signed-off-by: Dan Williams 
---
 block/blk-core.c   |1 +
 block/genhd.c  |7 +++
 drivers/scsi/sd.c  |   41 +
 include/linux/blkdev.h |1 +
 include/linux/genhd.h  |   17 +
 5 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c58b64..950cea1e202e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -597,6 +597,7 @@ void blk_cleanup_queue(struct request_queue *q)
spin_unlock_irq(lock);
 
bdi_unregister(&q->backing_dev_info);
+   put_disk_devt(q->disk_devt);
 
/* @q is and will stay empty, shutdown and put */
blk_put_queue(q);
diff --git a/block/genhd.c b/block/genhd.c
index fcd6d4fae657..eb8009e928f5 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -612,6 +612,13 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
 
disk_alloc_events(disk);
 
+   /*
+* Take a reference on the devt and assign it to queue since it
+* must not be reallocated while the bdi is registered
+*/
+   disk->queue->disk_devt = disk->disk_devt;
+   get_disk_devt(disk->disk_devt);
+
/* Register BDI before referencing it from bdev */
bdi = &disk->queue->backing_dev_info;
bdi_register_owner(bdi, disk_to_dev(disk));
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0b09638fa39b..09405351577c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3067,6 +3067,23 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
put_device(&sdkp->dev);
 }
 
+struct sd_devt {
+   int idx;
+   struct disk_devt disk_devt;
+};
+
+void sd_devt_release(struct kref *kref)
+{
+   struct sd_devt *sd_devt = container_of(kref, struct sd_devt,
+   disk_devt.kref);
+
+   spin_lock(&sd_index_lock);
+   ida_remove(&sd_index_ida, sd_devt->idx);
+   spin_unlock(&sd_index_lock);
+
+   kfree(sd_devt);
+}
+
 /**
  * sd_probe - called during driver initialization and whenever a
  * new scsi device is attached to the system. It is called once
@@ -3088,6 +3105,7 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
 static int sd_probe(struct device *dev)
 {
struct scsi_device *sdp = to_scsi_device(dev);
+   struct sd_devt *sd_devt;
struct scsi_disk *sdkp;
struct gendisk *gd;
int index;
@@ -3113,9 +3131,13 @@ static int sd_probe(struct device *dev)
if (!sdkp)
goto out;
 
+   sd_devt = kzalloc(sizeof(*sd_devt), GFP_KERNEL);
+   if (!sd_devt)
+   goto out_free;
+
gd = alloc_disk(SD_MINORS);
if (!gd)
-   goto out_free;
+   goto out_free_devt;
 
do {
if (!ida_pre_get(&sd_index_ida, GFP_KERNEL))
@@ -3131,6 +3153,11 @@ static int sd_probe(struct device *dev)
goto out_put;
}
 
+   kref_init(&sd_devt->disk_devt.kref);
+   sd_devt->disk_devt.release = sd_devt_release;
+   sd_devt->idx = index;
+   gd->disk_devt = &sd_devt->disk_devt;
+
error = sd_format_disk_name("sd", index, gd->disk_name, DISK_NAME_LEN);
if (error) {
sdev_printk(KERN_WARNING, sdp, "SCSI disk (sd) name length exceeded.\n");
@@ -3170,13 +3197,14 @@ static int sd_probe(struct device *dev)
return 0;
 
  

[PATCH V3 1/1] percpu-refcount: fix reference leak during percpu-atomic transition

2017-01-28 Thread Douglas Miller
percpu_ref_tryget() and percpu_ref_tryget_live() should return
"true" IFF they acquire a reference. But the return value from
atomic_long_inc_not_zero() is a long and may have high bits set,
e.g. PERCPU_COUNT_BIAS, and the return value of the tryget routines
is bool so the reference may actually be acquired but the routines
return "false" which results in a reference leak since the caller
assumes it does not need to do a corresponding percpu_ref_put().

This was seen when performing CPU hotplug during I/O, as hangs in
blk_mq_freeze_queue_wait where percpu_ref_kill (blk_mq_freeze_queue_start)
raced with percpu_ref_tryget (blk_mq_timeout_work).
Sample stack trace:

__switch_to+0x2c0/0x450
__schedule+0x2f8/0x970
schedule+0x48/0xc0
blk_mq_freeze_queue_wait+0x94/0x120
blk_mq_queue_reinit_work+0xb8/0x180
blk_mq_queue_reinit_prepare+0x84/0xa0
cpuhp_invoke_callback+0x17c/0x600
cpuhp_up_callbacks+0x58/0x150
_cpu_up+0xf0/0x1c0
do_cpu_up+0x120/0x150
cpu_subsys_online+0x64/0xe0
device_online+0xb4/0x120
online_store+0xb4/0xc0
dev_attr_store+0x68/0xa0
sysfs_kf_write+0x80/0xb0
kernfs_fop_write+0x17c/0x250
__vfs_write+0x6c/0x1e0
vfs_write+0xd0/0x270
SyS_write+0x6c/0x110
system_call+0x38/0xe0

Examination of the queue showed a single reference (no PERCPU_COUNT_BIAS,
and __PERCPU_REF_DEAD, __PERCPU_REF_ATOMIC set) and no requests.
However, the conditions at the time of the race were a count of
PERCPU_COUNT_BIAS + 0 with __PERCPU_REF_DEAD and __PERCPU_REF_ATOMIC set.

The fix is to make the tryget routines use an actual boolean internally instead
of the atomic long result truncated to an int.
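
A small user-space sketch of the truncation, for illustration only.
EXAMPLE_BIAS is an assumed value that mimics PERCPU_COUNT_BIAS on a 64-bit
build (only the top bit of the long set), and the two helper names are made
up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: top bit of a 64-bit long, like PERCPU_COUNT_BIAS. */
#define EXAMPLE_BIAS ((uint64_t)1 << 63)

/*
 * Keep only the low 32 bits, which is what storing a 64-bit result in an
 * int does on common 64-bit ABIs. Going through uint32_t keeps the
 * conversion well defined for this demonstration.
 */
static int store_in_int(uint64_t v)
{
	return (int)(uint32_t)v;
}

/* Conversion to bool compares against zero, so no bits are lost. */
static bool store_in_bool(uint64_t v)
{
	return (bool)v;
}
```

With these helpers, store_in_int(EXAMPLE_BIAS) is 0 while
store_in_bool(EXAMPLE_BIAS) is true, which is the difference between
reporting "no reference taken" and "reference taken" for the same nonzero
result.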

Fixes: e625305b3907 ("percpu-refcount: make percpu_ref based on longs instead of ints")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=190751
Signed-off-by: Douglas Miller 
---
 include/linux/percpu-refcount.h |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 1c7eec0..3a481a4 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -204,7 +204,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 {
unsigned long __percpu *percpu_count;
-   int ret;
+   bool ret;
 
rcu_read_lock_sched();
 
@@ -238,7 +238,7 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
 {
unsigned long __percpu *percpu_count;
-   int ret = false;
+   bool ret = false;
 
rcu_read_lock_sched();
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 1/1] percpu-refcount: fix reference leak during percpu-atomic transition

2017-01-28 Thread Tejun Heo
On Sat, Jan 28, 2017 at 06:42:20AM -0600, Douglas Miller wrote:
> percpu_ref_tryget() and percpu_ref_tryget_live() should return
> "true" IFF they acquire a reference. But the return value from
> atomic_long_inc_not_zero() is a long and may have high bits set,
> e.g. PERCPU_COUNT_BIAS, and the return value of the tryget routines
> is bool so the reference may actually be acquired but the routines
> return "false" which results in a reference leak since the caller
> assumes it does not need to do a corresponding percpu_ref_put().

Applied to percpu/for-4.10-fixes w/ stable cc'd.

Thanks!

-- 
tejun


[RFC PATCH 04/17] dax: introduce dax_operations

2017-01-28 Thread Dan Williams
Track a set of dax_operations per dax_inode that can be set at
alloc_dax_inode() time. These operations will be used to stop the abuse
of block_device_operations for communicating dax capabilities to
filesystems. It will also be used to replace the "pmem api" and move
pmem-specific cache maintenance, and other dax-driver-specific
filesystem-dax operations, to dax inode methods. In particular this
allows us to stop abusing __copy_user_nocache(), via memcpy_to_pmem(),
with a driver specific replacement.

This is a standalone introduction of the operations. Follow on patches
convert each dax-driver and teach fs/dax.c to use ->direct_access() from
dax_operations instead of block_device_operations.
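
For readers unfamiliar with the pattern, here is a minimal user-space sketch
of an operations table attached at allocation time and dispatched through a
common entry point. Everything in it (struct mem_dev, mem_dev_ops,
ram_backing and the helpers) is hypothetical and only mirrors the shape of
dax_inode / dax_operations; it is not the kernel interface.

```c
#include <errno.h>
#include <stddef.h>

struct mem_dev;

/* Hypothetical analogue of struct dax_operations. */
struct mem_dev_ops {
	long (*direct_access)(struct mem_dev *dev, unsigned long dev_addr,
			      void **kaddr, long size);
};

/* Hypothetical analogue of struct dax_inode: private data plus ops. */
struct mem_dev {
	void *private;
	const struct mem_dev_ops *ops;
};

static void mem_dev_init(struct mem_dev *dev, void *private,
			 const struct mem_dev_ops *ops)
{
	dev->private = private;
	dev->ops = ops;		/* may be NULL for devices without this method */
}

/* Common entry point: validate, call the driver, cap the result at size. */
static long mem_dev_direct_access(struct mem_dev *dev, unsigned long dev_addr,
				  void **kaddr, long size)
{
	long avail;

	if (!dev || !dev->ops || !dev->ops->direct_access)
		return -EOPNOTSUPP;
	avail = dev->ops->direct_access(dev, dev_addr, kaddr, size);
	if (avail < 0)
		return avail;
	return avail < size ? avail : size;
}

/* Example "driver": a flat buffer of len bytes. */
struct ram_backing {
	char *base;
	long len;
};

static long ram_direct_access(struct mem_dev *dev, unsigned long dev_addr,
			      void **kaddr, long size)
{
	struct ram_backing *ram = dev->private;

	(void)size;		/* this toy driver reports all remaining bytes */
	if (dev_addr >= (unsigned long)ram->len)
		return -ERANGE;
	*kaddr = ram->base + dev_addr;
	return ram->len - (long)dev_addr;
}

static const struct mem_dev_ops ram_ops = {
	.direct_access = ram_direct_access,
};
```

The point of the indirection is that callers only see the common entry
point, so the per-driver method can change without touching them.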

Suggested-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 drivers/dax/dax.h|4 +++-
 drivers/dax/device.c |6 +-
 drivers/dax/super.c  |6 +-
 include/linux/dax.h  |5 +
 4 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index f33c16ed2ec6..aeb1d49aafb8 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,7 +13,9 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_inode;
-struct dax_inode *alloc_dax_inode(void *private, const char *host);
+struct dax_operations;
+struct dax_inode *alloc_dax_inode(void *private, const char *host,
+   const struct dax_operations *ops);
 void put_dax_inode(struct dax_inode *dax_inode);
 bool dax_inode_alive(struct dax_inode *dax_inode);
 void kill_dax_inode(struct dax_inode *dax_inode);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 6d0a3241a608..c3d9405ec285 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -560,7 +560,11 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
goto err_id;
}
 
-   dax_inode = alloc_dax_inode(dax_dev, NULL);
+   /*
+* No 'host' or dax_operations since there is no access to this
+* device outside of mmap of the resulting character device.
+*/
+   dax_inode = alloc_dax_inode(dax_dev, NULL, NULL);
if (!dax_inode)
goto err_inode;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 7ac048f94b2b..eb844ffea3cf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static int nr_dax = CONFIG_NR_DEV_DAX;
@@ -61,6 +62,7 @@ struct dax_inode {
const char *host;
void *private;
bool alive;
+   const struct dax_operations *ops;
 };
 
 bool dax_inode_alive(struct dax_inode *dax_inode)
@@ -204,7 +206,8 @@ static void dax_add_host(struct dax_inode *dax_inode, const char *host)
spin_unlock(&dax_host_lock);
 }
 
-struct dax_inode *alloc_dax_inode(void *private, const char *__host)
+struct dax_inode *alloc_dax_inode(void *private, const char *__host,
+   const struct dax_operations *ops)
 {
struct dax_inode *dax_inode;
const char *host;
@@ -225,6 +228,7 @@ struct dax_inode *alloc_dax_inode(void *private, const char *__host)
goto err_inode;
 
dax_add_host(dax_inode, host);
+   dax_inode->ops = ops;
dax_inode->private = private;
return dax_inode;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8fe19230e118..def9a9d118c9 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -7,6 +7,11 @@
 #include 
 
 struct iomap_ops;
+struct dax_inode;
+struct dax_operations {
+   long (*direct_access)(struct dax_inode *, phys_addr_t, void **,
+   pfn_t *, long);
+};
 
 int dax_read_lock(void);
 void dax_read_unlock(int id);



[RFC PATCH 13/17] fs: update mount_bdev() to lookup dax infrastructure

2017-01-28 Thread Dan Williams
This is in preparation for removing the ->direct_access() method from
block_device_operations.

Signed-off-by: Dan Williams 
---
 fs/block_dev.c |6 --
 fs/super.c |   32 +---
 include/linux/fs.h |1 +
 3 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index bf4b51a3a412..a73f2388c515 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -806,14 +806,16 @@ int bdev_dax_supported(struct super_block *sb, int blocksize)
.sector = 0,
.size = PAGE_SIZE,
};
-   int err;
+   int err, id;
 
if (blocksize != PAGE_SIZE) {
vfs_msg(sb, KERN_ERR, "error: unsupported blocksize for dax");
return -EINVAL;
}
 
-   err = bdev_direct_access(sb->s_bdev, &dax);
+   id = dax_read_lock();
+   err = bdev_dax_direct_access(sb->s_bdev, sb->s_dax, &dax);
+   dax_read_unlock(id);
if (err < 0) {
switch (err) {
case -EOPNOTSUPP:
diff --git a/fs/super.c b/fs/super.c
index ea662b0e5e78..5e64d11c46c1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include/* for the emergency remount stuff */
+#include 
 #include 
 #include 
 #include 
@@ -1038,9 +1039,17 @@ struct dentry *mount_ns(struct file_system_type *fs_type,
 EXPORT_SYMBOL(mount_ns);
 
 #ifdef CONFIG_BLOCK
+struct mount_bdev_data {
+   struct block_device *bdev;
+   struct dax_inode *dax_inode;
+};
+
 static int set_bdev_super(struct super_block *s, void *data)
 {
-   s->s_bdev = data;
+   struct mount_bdev_data *mb_data = data;
+
+   s->s_bdev = mb_data->bdev;
+   s->s_dax = mb_data->dax_inode;
s->s_dev = s->s_bdev->bd_dev;
 
/*
@@ -1053,14 +1062,18 @@ static int set_bdev_super(struct super_block *s, void *data)
 
 static int test_bdev_super(struct super_block *s, void *data)
 {
-   return (void *)s->s_bdev == data;
+   struct mount_bdev_data *mb_data = data;
+
+   return s->s_bdev == mb_data->bdev;
 }
 
 struct dentry *mount_bdev(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data,
int (*fill_super)(struct super_block *, void *, int))
 {
+   struct mount_bdev_data mb_data;
struct block_device *bdev;
+   struct dax_inode *dax_inode;
struct super_block *s;
fmode_t mode = FMODE_READ | FMODE_EXCL;
int error = 0;
@@ -1072,6 +1085,11 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
if (IS_ERR(bdev))
return ERR_CAST(bdev);
 
+   if (IS_ENABLED(CONFIG_FS_DAX))
+   dax_inode = dax_get_by_host(bdev->bd_disk->disk_name);
+   else
+   dax_inode = NULL;
+
/*
 * once the super is inserted into the list by sget, s_umount
 * will protect the lockfs code from trying to start a snapshot
@@ -1083,8 +1101,13 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
error = -EBUSY;
goto error_bdev;
}
+
+   mb_data = (struct mount_bdev_data) {
+   .bdev = bdev,
+   .dax_inode = dax_inode,
+   };
s = sget(fs_type, test_bdev_super, set_bdev_super, flags | MS_NOSEC,
-bdev);
+&mb_data);
mutex_unlock(&bdev->bd_fsfreeze_mutex);
if (IS_ERR(s))
goto error_s;
@@ -1126,6 +1149,7 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
error = PTR_ERR(s);
 error_bdev:
blkdev_put(bdev, mode);
+   put_dax_inode(dax_inode);
 error:
return ERR_PTR(error);
 }
@@ -1133,6 +1157,7 @@ EXPORT_SYMBOL(mount_bdev);
 
 void kill_block_super(struct super_block *sb)
 {
+   struct dax_inode *dax_inode = sb->s_dax;
struct block_device *bdev = sb->s_bdev;
fmode_t mode = sb->s_mode;
 
@@ -1141,6 +1166,7 @@ void kill_block_super(struct super_block *sb)
sync_blockdev(bdev);
WARN_ON_ONCE(!(mode & FMODE_EXCL));
blkdev_put(bdev, mode | FMODE_EXCL);
+   put_dax_inode(dax_inode);
 }
 
 EXPORT_SYMBOL(kill_block_super);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c930cbc19342..fdad43169146 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1313,6 +1313,7 @@ struct super_block {
struct hlist_bl_head s_anon; /* anonymous dentries for (nfs) exporting */
struct list_head s_mounts;   /* list of mounts; _not_ for fs use */
struct block_device *s_bdev;
+   struct dax_inode*s_dax;
struct backing_dev_info *s_bdi;
struct mtd_info *s_mtd;
struct hlist_node   s_instances;



[RFC PATCH 15/17] Revert "block: use DAX for partition table reads"

2017-01-28 Thread Dan Williams
commit d1a5f2b4d8a1 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.

Now that we are moving ->direct_access() from a block_device operation
to a dax_inode operation we would need block devices to map and carry
their own dax_inode reference.

Unless / until we decide to revive dax mapping of raw block devices
through the dax_inode scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
223757016837 ("block_dev: remove DAX leftovers").

Signed-off-by: Dan Williams 
---
 block/partition-generic.c |   17 ++---
 fs/dax.c  |   20 
 include/linux/dax.h   |6 --
 3 files changed, 2 insertions(+), 41 deletions(-)

diff --git a/block/partition-generic.c b/block/partition-generic.c
index 7afb9907821f..5dfac337b0f2 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -16,7 +16,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include "partitions/check.h"
@@ -631,24 +630,12 @@ int invalidate_partitions(struct gendisk *disk, struct block_device *bdev)
return 0;
 }
 
-static struct page *read_pagecache_sector(struct block_device *bdev, sector_t n)
-{
-   struct address_space *mapping = bdev->bd_inode->i_mapping;
-
-   return read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)),
-NULL);
-}
-
 unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
 {
+   struct address_space *mapping = bdev->bd_inode->i_mapping;
struct page *page;
 
-   /* don't populate page cache for dax capable devices */
-   if (IS_DAX(bdev->bd_inode))
-   page = read_dax_sector(bdev, n);
-   else
-   page = read_pagecache_sector(bdev, n);
-
+   page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)), NULL);
if (!IS_ERR(page)) {
if (PageError(page))
goto fail;
diff --git a/fs/dax.c b/fs/dax.c
index ddcddfeaa03b..a990211c8a3d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -97,26 +97,6 @@ static int dax_is_empty_entry(void *entry)
return (unsigned long)entry & RADIX_DAX_EMPTY;
 }
 
-struct page *read_dax_sector(struct block_device *bdev, sector_t n)
-{
-   struct page *page = alloc_pages(GFP_KERNEL, 0);
-   struct blk_dax_ctl dax = {
-   .size = PAGE_SIZE,
-   .sector = n & ~((((int) PAGE_SIZE) / 512) - 1),
-   };
-   long rc;
-
-   if (!page)
-   return ERR_PTR(-ENOMEM);
-
-   rc = dax_map_atomic(bdev, &dax);
-   if (rc < 0)
-   return ERR_PTR(rc);
-   memcpy_from_pmem(page_address(page), dax.addr, PAGE_SIZE);
-   dax_unmap_atomic(bdev, &dax);
-   return page;
-}
-
 /*
  * DAX radix tree locking
  */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 2ef8e18e2587..10b742af3d56 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -65,15 +65,9 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
pgoff_t index, void *entry, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
-struct page *read_dax_sector(struct block_device *bdev, sector_t n);
 int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
unsigned int offset, unsigned int length);
 #else
-static inline struct page *read_dax_sector(struct block_device *bdev,
-   sector_t n)
-{
-   return ERR_PTR(-ENXIO);
-}
 static inline int __dax_zero_page_range(struct block_device *bdev,
sector_t sector, unsigned int offset, unsigned int length)
 {



[RFC PATCH 10/17] block: introduce bdev_dax_direct_access()

2017-01-28 Thread Dan Williams
Provide a replacement for bdev_direct_access() that uses
dax_operations.direct_access() instead of
block_device_operations.direct_access(). Once all consumers of the old
api have been converted bdev_direct_access() will be deleted.

Given that block device partitioning decisions can cause dax page
alignment constraints to be violated we still need to validate the
block_device before calling the dax ->direct_access method.
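
To make the partition concern concrete, here is a user-space sketch of the
two checks involved: the request must fit inside the partition, and the
resulting device byte address must be page aligned. struct partition and
both helpers are hypothetical, sectors are assumed to be 512 bytes, and
arithmetic overflow is ignored for brevity.

```c
#include <errno.h>
#include <stdint.h>

#define SECTOR_SIZE 512

/* Hypothetical partition description: start and length in sectors. */
struct partition {
	uint64_t start_sect;
	uint64_t nr_sects;
};

/*
 * Translate a partition-relative sector into a whole-device byte address.
 * Returns 0 on success or -ERANGE if the request runs past the partition.
 */
static int partition_to_dev_addr(const struct partition *part,
				 uint64_t sector, uint64_t size,
				 uint64_t *dev_addr)
{
	/* Round the byte count up to whole sectors. */
	uint64_t sectors = (size + SECTOR_SIZE - 1) / SECTOR_SIZE;

	if (sector + sectors > part->nr_sects)
		return -ERANGE;
	*dev_addr = (part->start_sect + sector) * SECTOR_SIZE;
	return 0;
}

/* A partition that starts on an odd sector yields unaligned addresses. */
static int dev_addr_is_page_aligned(uint64_t dev_addr, uint64_t page_size)
{
	return dev_addr % page_size == 0;
}
```

For example, with 4096-byte pages a partition starting at sector 2048 maps
sector 8 to an aligned address, while one starting at sector 2049 does not,
even though the filesystem-visible sector number is the same.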

Signed-off-by: Dan Williams 
---
 block/Kconfig  |1 +
 drivers/dax/super.c|   33 +
 fs/block_dev.c |   28 
 include/linux/blkdev.h |3 +++
 include/linux/dax.h|2 ++
 5 files changed, 67 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index 8bf114a3858a..9be785173280 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -6,6 +6,7 @@ menuconfig BLOCK
default y
select SBITMAP
select SRCU
+   select DAX
help
 Provide block layer support for the kernel.
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index eb844ffea3cf..ab5b082df5dd 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -65,6 +65,39 @@ struct dax_inode {
const struct dax_operations *ops;
 };
 
+long dax_direct_access(struct dax_inode *dax_inode, phys_addr_t dev_addr,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   long avail;
+
+   /*
+* The device driver is allowed to sleep, in order to make the
+* memory directly accessible.
+*/
+   might_sleep();
+
+   if (!dax_inode)
+   return -EOPNOTSUPP;
+
+   if (!dax_inode_alive(dax_inode))
+   return -ENXIO;
+
+   if (size < 0)
+   return size;
+
+   if (dev_addr % PAGE_SIZE)
+   return -EINVAL;
+
+   avail = dax_inode->ops->direct_access(dax_inode, dev_addr, kaddr, pfn,
+   size);
+   if (!avail)
+   return -ERANGE;
+   if (avail > 0 && avail & ~PAGE_MASK)
+   return -ENXIO;
+   return min(avail, size);
+}
+EXPORT_SYMBOL_GPL(dax_direct_access);
+
 bool dax_inode_alive(struct dax_inode *dax_inode)
 {
lockdep_assert_held(&dax_srcu);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index edb1d2b16b8f..bf4b51a3a412 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -763,6 +764,33 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
 EXPORT_SYMBOL_GPL(bdev_direct_access);
 
 /**
+ * bdev_dax_direct_access() - bdev-sector to pfn_t and kernel virtual address
+ * @bdev: host block device for @dax_inode
+ * @dax_inode: interface data and operations for a memory device
+ * @dax: control and output parameters for ->direct_access
+ *
+ * Return: negative errno if an error occurs, otherwise the number of bytes
+ * accessible at this address.
+ *
+ * Locking: must be called with dax_read_lock() held
+ */
+long bdev_dax_direct_access(struct block_device *bdev,
+   struct dax_inode *dax_inode, struct blk_dax_ctl *dax)
+{
+   sector_t sector = dax->sector;
+
+   if (!blk_queue_dax(bdev->bd_queue))
+   return -EOPNOTSUPP;
+   if ((sector + DIV_ROUND_UP(dax->size, 512))
+   > part_nr_sects_read(bdev->bd_part))
+   return -ERANGE;
+   sector += get_start_sect(bdev);
+   return dax_direct_access(dax_inode, sector * 512, &dax->addr,
+   &dax->pfn, dax->size);
+}
+EXPORT_SYMBOL_GPL(bdev_dax_direct_access);
+
+/**
  * bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
  * @blocksize: The block size of the device
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5e7706f7d533..3b3c5ce376fd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1903,6 +1903,9 @@ extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
+struct dax_inode;
+extern long bdev_dax_direct_access(struct block_device *bdev,
+   struct dax_inode *dax_inode, struct blk_dax_ctl *dax);
 extern int bdev_dax_supported(struct super_block *, int);
 #else /* CONFIG_BLOCK */
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 5aa620e8e5a2..2ef8e18e2587 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -22,6 +22,8 @@ void *dax_inode_get_private(struct dax_inode *dax_inode);
 void put_dax_inode(struct dax_inode *dax_inode);
 bool dax_inode_alive(struct dax_inode *dax_inode);
 void kill_dax_inode(struct dax_inode *dax_inode);
+long dax_direct_access(struct dax_inode *dax_inode, phys_addr_t dev_addr,
+  

[RFC PATCH 05/17] pmem: add dax_operations support

2017-01-28 Thread Dan Williams
Set up a dax_inode to have the same lifetime as the pmem block device and
add a ->direct_access() method that is equivalent to
pmem_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old pmem_direct_access() will be removed.
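
The transition approach can be sketched in user space as one shared helper
that takes a byte address, with a thin sector-based wrapper and a thin
byte-based wrapper in front of it. struct flat_mem and the flat_* functions
are hypothetical and only imitate the shape of the pmem code; 512-byte
sectors are assumed.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical device: a flat mapping with a reserved header area. */
struct flat_mem {
	char *virt;		/* start of the mapping */
	long size;		/* total bytes in the mapping */
	long data_offset;	/* bytes reserved before user data */
};

/* Shared helper: everything is expressed as a device byte address. */
static long flat_direct_access(struct flat_mem *m, unsigned long dev_addr,
			       void **kaddr)
{
	long offset = (long)dev_addr + m->data_offset;

	if (offset >= m->size)
		return -ERANGE;
	*kaddr = m->virt + offset;
	return m->size - offset;	/* bytes available from this address */
}

/* Old-style entry point: callers pass a 512-byte sector number. */
static long flat_blk_direct_access(struct flat_mem *m, unsigned long sector,
				   void **kaddr)
{
	return flat_direct_access(m, sector * 512, kaddr);
}

/* New-style entry point: callers pass the byte address directly. */
static long flat_dax_direct_access(struct flat_mem *m, unsigned long dev_addr,
				   void **kaddr)
{
	return flat_direct_access(m, dev_addr, kaddr);
}
```

Because both wrappers funnel into the same helper, the two interfaces give
identical results during the conversion, and the sector-based one can be
dropped later without changing behavior.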

Signed-off-by: Dan Williams 
---
 drivers/dax/dax.h   |7 -
 drivers/nvdimm/Kconfig  |1 +
 drivers/nvdimm/pmem.c   |   55 +++
 drivers/nvdimm/pmem.h   |7 -
 include/linux/dax.h |6 
 tools/testing/nvdimm/pmem-dax.c |   12 -
 6 files changed, 61 insertions(+), 27 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index aeb1d49aafb8..b4c686d2d446 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,15 +13,8 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_inode;
-struct dax_operations;
-struct dax_inode *alloc_dax_inode(void *private, const char *host,
-   const struct dax_operations *ops);
-void put_dax_inode(struct dax_inode *dax_inode);
-bool dax_inode_alive(struct dax_inode *dax_inode);
-void kill_dax_inode(struct dax_inode *dax_inode);
 struct dax_inode *inode_to_dax_inode(struct inode *inode);
 struct inode *dax_inode_to_inode(struct dax_inode *dax_inode);
-void *dax_inode_get_private(struct dax_inode *dax_inode);
 int dax_inode_register(struct dax_inode *dax_inode,
const struct file_operations *fops, struct module *owner,
struct kobject *parent);
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 59e750183b7f..5bdd499b5f4f 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -20,6 +20,7 @@ if LIBNVDIMM
 config BLK_DEV_PMEM
tristate "PMEM: Persistent memory block device support"
default LIBNVDIMM
+   select DAX
select ND_BTT if BTT
select ND_PFN if NVDIMM_PFN
help
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5b536be5a12e..d3d7de645e20 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "pmem.h"
 #include "pfn.h"
@@ -199,13 +200,12 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
-__weak long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
+__weak long __pmem_direct_access(struct pmem_device *pmem, phys_addr_t dev_addr,
+   void **kaddr, pfn_t *pfn, long size)
 {
-   struct pmem_device *pmem = bdev->bd_queue->queuedata;
-   resource_size_t offset = sector * 512 + pmem->data_offset;
+   resource_size_t offset = dev_addr + pmem->data_offset;
 
-   if (unlikely(is_bad_pmem(&pmem->bb, sector, size)))
+   if (unlikely(is_bad_pmem(&pmem->bb, dev_addr / 512, size)))
return -EIO;
*kaddr = pmem->virt_addr + offset;
*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
@@ -219,22 +219,46 @@ __weak long pmem_direct_access(struct block_device *bdev, sector_t sector,
return pmem->size - pmem->pfn_pad - offset;
 }
 
+static long pmem_blk_direct_access(struct block_device *bdev, sector_t sector,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   struct pmem_device *pmem = bdev->bd_queue->queuedata;
+
+   return __pmem_direct_access(pmem, sector * 512, kaddr, pfn, size);
+}
+
 static const struct block_device_operations pmem_fops = {
.owner =THIS_MODULE,
.rw_page =  pmem_rw_page,
-   .direct_access =pmem_direct_access,
+   .direct_access =pmem_blk_direct_access,
.revalidate_disk =  nvdimm_revalidate_disk,
 };
 
+static long pmem_dax_direct_access(struct dax_inode *dax_inode,
+   phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size)
+{
+   struct pmem_device *pmem = dax_inode_get_private(dax_inode);
+
+   return __pmem_direct_access(pmem, dev_addr, kaddr, pfn, size);
+}
+
+static const struct dax_operations pmem_dax_ops = {
+   .direct_access = pmem_dax_direct_access,
+};
+
 static void pmem_release_queue(void *q)
 {
blk_cleanup_queue(q);
 }
 
-static void pmem_release_disk(void *disk)
+static void pmem_release_disk(void *__pmem)
 {
-   del_gendisk(disk);
-   put_disk(disk);
+   struct pmem_device *pmem = __pmem;
+
+   kill_dax_inode(pmem->dax_inode);
+   put_dax_inode(pmem->dax_inode);
+   del_gendisk(pmem->disk);
+   put_disk(pmem->disk);
 }
 
 static int pmem_attach_disk(struct device *dev,
@@ -245,6 +269,7 @@ static int pmem_attach_disk(struct device *dev,
struct vmem_altmap __altmap, *altmap = NULL;
struct resource *res = &nsio->res;
struct nd_pfn *nd_pfn = NULL;
+   struct dax_inode *dax_inode;
int nid = dev_to_node(dev);
struct nd_pfn_sb 

[RFC PATCH 09/17] block: kill bdev_dax_capable()

2017-01-28 Thread Dan Williams
This is leftover dead code that has since been replaced by
bdev_dax_supported().

Signed-off-by: Dan Williams 
---
 fs/block_dev.c |   24 
 include/linux/blkdev.h |1 -
 2 files changed, 25 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 601b71b76d7f..edb1d2b16b8f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -807,30 +807,6 @@ int bdev_dax_supported(struct super_block *sb, int blocksize)
 }
 EXPORT_SYMBOL_GPL(bdev_dax_supported);
 
-/**
- * bdev_dax_capable() - Return if the raw device is capable for dax
- * @bdev: The device for raw block device access
- */
-bool bdev_dax_capable(struct block_device *bdev)
-{
-   struct blk_dax_ctl dax = {
-   .size = PAGE_SIZE,
-   };
-
-   if (!IS_ENABLED(CONFIG_FS_DAX))
-   return false;
-
-   dax.sector = 0;
-   if (bdev_direct_access(bdev, &dax) < 0)
-   return false;
-
-   dax.sector = bdev->bd_part->nr_sects - (PAGE_SIZE / 512);
-   if (bdev_direct_access(bdev, &dax) < 0)
-   return false;
-
-   return true;
-}
-
 /*
  * pseudo-fs
  */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3c0ff78b1219..5e7706f7d533 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1904,7 +1904,6 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 extern int bdev_dax_supported(struct super_block *, int);
-extern bool bdev_dax_capable(struct block_device *);
 #else /* CONFIG_BLOCK */
 
 struct block_device;



[RFC PATCH 08/17] dcssblk: add dax_operations support

2017-01-28 Thread Dan Williams
Set up a dax_inode to have the same lifetime as the dcssblk block device
and add a ->direct_access() method that is equivalent to
dcssblk_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old dcssblk_direct_access() will be removed.

Signed-off-by: Dan Williams 
---
 drivers/s390/block/Kconfig   |1 +
 drivers/s390/block/dcssblk.c |   53 +++---
 2 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index 4a3b62326183..0acb8c2f9475 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -14,6 +14,7 @@ config BLK_DEV_XPRAM
 
 config DCSSBLK
def_tristate m
+   select DAX
prompt "DCSSBLK support"
depends on S390 && BLOCK
help
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 9d66b4fb174b..67b0885b4d12 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -30,8 +31,10 @@ static int dcssblk_open(struct block_device *bdev, fmode_t 
mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static blk_qc_t dcssblk_make_request(struct request_queue *q,
struct bio *bio);
-static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+static long dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
 void **kaddr, pfn_t *pfn, long size);
+static long dcssblk_dax_direct_access(struct dax_inode *dax_inode,
+   phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -40,7 +43,11 @@ static const struct block_device_operations dcssblk_devops = 
{
.owner  = THIS_MODULE,
.open   = dcssblk_open,
.release= dcssblk_release,
-   .direct_access  = dcssblk_direct_access,
+   .direct_access  = dcssblk_blk_direct_access,
+};
+
+static const struct dax_operations dcssblk_dax_ops = {
+   .direct_access = dcssblk_dax_direct_access,
 };
 
 struct dcssblk_dev_info {
@@ -57,6 +64,7 @@ struct dcssblk_dev_info {
struct request_queue *dcssblk_queue;
int num_of_segments;
struct list_head seg_list;
+   struct dax_inode *dax_inode;
 };
 
 struct segment_info {
@@ -389,6 +397,8 @@ dcssblk_shared_store(struct device *dev, struct 
device_attribute *attr, const ch
}
list_del(&dev_info->lh);
 
+   kill_dax_inode(dev_info->dax_inode);
+   put_dax_inode(dev_info->dax_inode);
del_gendisk(dev_info->gd);
blk_cleanup_queue(dev_info->dcssblk_queue);
dev_info->gd->queue = NULL;
@@ -525,6 +535,7 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
+   struct dax_inode *dax_inode;
char *local_buf;
unsigned long seg_byte_size;
 
@@ -654,6 +665,11 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
if (rc)
goto put_dev;
 
+   dax_inode = alloc_dax_inode(dev_info, dev_info->gd->disk_name,
+   &dcssblk_dax_ops);
+   if (!dax_inode)
+   goto put_dev;
+
get_device(&dev_info->dev);
device_add_disk(&dev_info->dev, dev_info->gd);
 
@@ -752,6 +768,8 @@ dcssblk_remove_store(struct device *dev, struct 
device_attribute *attr, const ch
}
 
list_del(&dev_info->lh);
+   kill_dax_inode(dev_info->dax_inode);
+   put_dax_inode(dev_info->dax_inode);
del_gendisk(dev_info->gd);
blk_cleanup_queue(dev_info->dcssblk_queue);
dev_info->gd->queue = NULL;
@@ -883,21 +901,38 @@ dcssblk_make_request(struct request_queue *q, struct bio 
*bio)
 }
 
 static long
-dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
+__dcssblk_direct_access(struct dcssblk_dev_info *dev_info, phys_addr_t offset,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   unsigned long dev_sz;
+
+   dev_sz = dev_info->end - dev_info->start;
+   *kaddr = (void *) dev_info->start + offset;
+   *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
+
+   return dev_sz - offset;
+}
+
+static long
+dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
void **kaddr, pfn_t *pfn, long size)
 {
struct dcssblk_dev_info *dev_info;
-   unsigned long offset, dev_sz;
 
dev_info = bdev->bd_disk->private_data;
if (!dev_info)
return -ENODEV;
-   dev_sz = dev_info->end - dev_info->start;
-   offset = secnum * 512;
-   *kaddr = (void *) dev_info->start + offset;
-   *pfn = 

[RFC PATCH 07/17] brd: add dax_operations support

2017-01-28 Thread Dan Williams
Set up a dax_inode to have the same lifetime as the brd block device and
add a ->direct_access() method that is equivalent to
brd_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old brd_direct_access() will be removed.
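
The teardown paths in this patch call a "kill" step followed by a "put" step. One plausible reading, sketched below in user space, is that kill stops new users from reaching the driver data while put drops the allocation reference. The struct handle type and the handle_*() functions are hypothetical and are not the kernel's dax_inode API; the semantics shown are an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical handle; not the kernel's struct dax_inode. */
struct handle {
	int refcount;	/* number of outstanding references */
	bool alive;	/* cleared once the driver starts tearing down */
	void *private;	/* driver data, as passed at allocation time */
};

static int handles_freed;	/* bookkeeping for the example only */

static struct handle *handle_alloc(void *private)
{
	struct handle *h = calloc(1, sizeof(*h));

	if (!h)
		return NULL;
	h->refcount = 1;
	h->alive = true;
	h->private = private;
	return h;
}

/* Stop handing out the driver data; existing references stay valid. */
static void handle_kill(struct handle *h)
{
	h->alive = false;
	h->private = NULL;
}

/* Returns the driver data, or NULL after handle_kill(). */
static void *handle_get_private(const struct handle *h)
{
	return h->alive ? h->private : NULL;
}

/* Drop one reference and free the handle when the last one is gone. */
static void handle_put(struct handle *h)
{
	assert(h->refcount > 0);
	if (--h->refcount == 0) {
		free(h);
		handles_freed++;
	}
}
```

Under this model the order matters: killing first means a late caller sees NULL from handle_get_private() instead of a pointer to a device that is about to be deleted.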

Signed-off-by: Dan Williams 
---
 drivers/block/Kconfig |1 +
 drivers/block/brd.c   |   57 +
 2 files changed, 49 insertions(+), 9 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 223ff2fcae7e..604b51a884b6 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -337,6 +337,7 @@ config BLK_DEV_SX8
 
 config BLK_DEV_RAM
tristate "RAM block device support"
+   select DAX if BLK_DEV_RAM_DAX
---help---
  Saying Y here will allow you to use a portion of your RAM memory as
  a block device, so that you can make file systems on it, read and
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..1279df4dc07c 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -21,6 +21,7 @@
 #include 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 #include 
+#include 
 #endif
 
 #include 
@@ -41,6 +42,9 @@ struct brd_device {
 
struct request_queue*brd_queue;
struct gendisk  *brd_disk;
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   struct dax_inode*dax_inode;
+#endif
struct list_headbrd_list;
 
/*
@@ -375,15 +379,14 @@ static int brd_rw_page(struct block_device *bdev, 
sector_t sector,
 }
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
-static long brd_direct_access(struct block_device *bdev, sector_t sector,
+static long __brd_direct_access(struct brd_device *brd, phys_addr_t dev_addr,
void **kaddr, pfn_t *pfn, long size)
 {
-   struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
 
if (!brd)
return -ENODEV;
-   page = brd_insert_page(brd, sector);
+   page = brd_insert_page(brd, dev_addr / 512);
if (!page)
return -ENOSPC;
*kaddr = page_address(page);
@@ -391,14 +394,34 @@ static long brd_direct_access(struct block_device *bdev, 
sector_t sector,
 
return PAGE_SIZE;
 }
+
+static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   struct brd_device *brd = bdev->bd_disk->private_data;
+
+   return __brd_direct_access(brd, sector * 512, kaddr, pfn, size);
+}
+
+static long brd_dax_direct_access(struct dax_inode *dax_inode,
+   phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size)
+{
+   struct brd_device *brd = dax_inode_get_private(dax_inode);
+
+   return __brd_direct_access(brd, dev_addr, kaddr, pfn, size);
+}
+
+static const struct dax_operations brd_dax_ops = {
+   .direct_access = brd_dax_direct_access,
+};
 #else
-#define brd_direct_access NULL
+#define brd_blk_direct_access NULL
 #endif
 
 static const struct block_device_operations brd_fops = {
.owner =THIS_MODULE,
.rw_page =  brd_rw_page,
-   .direct_access =brd_direct_access,
+   .direct_access =brd_blk_direct_access,
 };
 
 /*
@@ -441,7 +464,9 @@ static struct brd_device *brd_alloc(int i)
 {
struct brd_device *brd;
struct gendisk *disk;
-
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   struct dax_inode *dax_inode;
+#endif
brd = kzalloc(sizeof(*brd), GFP_KERNEL);
if (!brd)
goto out;
@@ -469,9 +494,6 @@ static struct brd_device *brd_alloc(int i)
blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
brd->brd_queue->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
-#ifdef CONFIG_BLK_DEV_RAM_DAX
-   queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
-#endif
disk = brd->brd_disk = alloc_disk(max_part);
if (!disk)
goto out_free_queue;
@@ -484,8 +506,21 @@ static struct brd_device *brd_alloc(int i)
sprintf(disk->disk_name, "ram%d", i);
set_capacity(disk, rd_size * 2);
 
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
+   dax_inode = alloc_dax_inode(brd, disk->disk_name, &brd_dax_ops);
+   if (!dax_inode)
+   goto out_free_inode;
+#endif
+
+
return brd;
 
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+out_free_inode:
+   kill_dax_inode(dax_inode);
+   put_dax_inode(dax_inode);
+#endif
 out_free_queue:
blk_cleanup_queue(brd->brd_queue);
 out_free_dev:
@@ -525,6 +560,10 @@ static struct brd_device *brd_init_one(int i, bool *new)
 static void brd_del_one(struct brd_device *brd)
 {
list_del(&brd->brd_list);
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   kill_dax_inode(brd->dax_inode);
+   put_dax_inode(brd->dax_inode);
+#endif
del_gendisk(brd->brd_disk);
brd_free(brd);
 }


[RFC PATCH 06/17] axon_ram: add dax_operations support

2017-01-28 Thread Dan Williams
Set up a dax_inode to have the same lifetime as the axon_ram block device
and add a ->direct_access() method that is equivalent to
axon_ram_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old axon_ram_direct_access() will be removed.
---
 arch/powerpc/platforms/Kconfig |1 +
 arch/powerpc/sysdev/axonram.c  |   46 +++-
 2 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 7e3a2ebba29b..33244e3d9375 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -284,6 +284,7 @@ config CPM2
 config AXON_RAM
tristate "Axon DDR2 memory device driver"
depends on PPC_IBM_CELL_BLADE && BLOCK
+   select DAX
default m
help
  It registers one block device per Axon's DDR2 memory bank found
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ada29eaed6e2..4e1f58187726 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -25,6 +25,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -62,6 +63,7 @@ static int azfs_major, azfs_minor;
 struct axon_ram_bank {
struct platform_device  *device;
struct gendisk  *disk;
+   struct dax_inode*dax_inode;
unsigned intirq_id;
unsigned long   ph_addr;
unsigned long   io_addr;
@@ -137,25 +139,45 @@ axon_ram_make_request(struct request_queue *queue, struct 
bio *bio)
return BLK_QC_T_NONE;
 }
 
+static long
+__axon_ram_direct_access(struct axon_ram_bank *bank, phys_addr_t offset,
+  void **kaddr, pfn_t *pfn, long size)
+{
+   *kaddr = (void *) bank->io_addr + offset;
+   *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+   return bank->size - offset;
+}
+
 /**
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
 static long
-axon_ram_direct_access(struct block_device *device, sector_t sector,
+axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
   void **kaddr, pfn_t *pfn, long size)
 {
struct axon_ram_bank *bank = device->bd_disk->private_data;
-   loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
 
-   *kaddr = (void *) bank->io_addr + offset;
-   *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
-   return bank->size - offset;
+   return __axon_ram_direct_access(bank, sector << AXON_RAM_SECTOR_SHIFT,
+   kaddr, pfn, size);
 }
 
 static const struct block_device_operations axon_ram_devops = {
.owner  = THIS_MODULE,
-   .direct_access  = axon_ram_direct_access
+   .direct_access  = axon_ram_blk_direct_access
+};
+
+static long
+axon_ram_dax_direct_access(struct dax_inode *dax_inode, phys_addr_t dev_addr,
+  void **kaddr, pfn_t *pfn, long size)
+{
+   struct axon_ram_bank *bank = dax_inode_get_private(dax_inode);
+
+   return __axon_ram_direct_access(bank, dev_addr, kaddr, pfn, size);
+}
+
+static const struct dax_operations axon_ram_dax_ops = {
+   .direct_access = axon_ram_dax_direct_access,
 };
 
 /**
@@ -219,6 +241,7 @@ static int axon_ram_probe(struct platform_device *device)
goto failed;
}
 
+
bank->disk->major = azfs_major;
bank->disk->first_minor = azfs_minor;
bank->disk->fops = &axon_ram_devops;
@@ -227,6 +250,11 @@ static int axon_ram_probe(struct platform_device *device)
sprintf(bank->disk->disk_name, "%s%d",
AXON_RAM_DEVICE_NAME, axon_ram_bank_id);
 
+   bank->dax_inode = alloc_dax_inode(bank, bank->disk->disk_name,
+   &axon_ram_dax_ops);
+   if (!bank->dax_inode)
+   goto failed;
+
bank->disk->queue = blk_alloc_queue(GFP_KERNEL);
if (bank->disk->queue == NULL) {
dev_err(&device->dev, "Cannot register disk queue\n");
@@ -276,6 +304,10 @@ static int axon_ram_probe(struct platform_device *device)
bank->disk->disk_name);
del_gendisk(bank->disk);
}
+   if (bank->dax_inode) {
+   kill_dax_inode(bank->dax_inode);
+   put_dax_inode(bank->dax_inode);
+   }
device->dev.platform_data = NULL;
if (bank->io_addr != 0)
iounmap((void __iomem *) bank->io_addr);
@@ -298,6 +330,8 @@ axon_ram_remove(struct platform_device *device)
 
device_remove_file(&device->dev, &dev_attr_ecc);
free_irq(bank->irq_id, device);
+   kill_dax_inode(bank->dax_inode);
+   put_dax_inode(bank->dax_inode);
del_gendisk(bank->disk);
iounmap((void __iomem *) bank->io_addr);
kfree(bank);

--

[RFC PATCH 11/17] dm: add dax_operations support (producer)

2017-01-28 Thread Dan Williams
Set up a dax_inode to have the same lifetime as the dm block device and
add a ->direct_access() method that is equivalent to
dm_blk_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old dm_blk_direct_access() will be removed.

This enabling is only for the top-level dm representation to upper
layers. Subsequent patches are needed to convert the bottom layer
interface to backing devices.

Signed-off-by: Dan Williams 
---
 drivers/md/Kconfig   |1 +
 drivers/md/dm-core.h |3 +++
 drivers/md/dm.c  |   42 +++---
 3 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index b7767da50c26..1de8372d9459 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
 config BLK_DEV_DM
tristate "Device mapper support"
select BLK_DEV_DM_BUILTIN
+   select DAX
---help---
  Device-mapper is a low level volume manager.  It works by allowing
  people to specify mappings for ranges of logical sectors.  Various
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 40ceba1fe8be..f6eb8d8db646 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -24,6 +24,8 @@ struct dm_kobject_holder {
struct completion completion;
 };
 
+struct dax_inode;
+
 /*
  * DM core internal structure that used directly by dm.c and dm-rq.c
  * DM targets must _not_ deference a mapped_device to directly access its 
members!
@@ -58,6 +60,7 @@ struct mapped_device {
struct target_type *immutable_target_type;
 
struct gendisk *disk;
+   struct dax_inode *dax_inode;
char name[16];
 
void *interface_ptr;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index db934b1dba9d..1b3d9253e92c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -905,10 +906,10 @@ int dm_set_target_max_io_len(struct dm_target *ti, 
sector_t len)
 }
 EXPORT_SYMBOL_GPL(dm_set_target_max_io_len);
 
-static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
-void **kaddr, pfn_t *pfn, long size)
+static long __dm_direct_access(struct mapped_device *md, phys_addr_t dev_addr,
+  void **kaddr, pfn_t *pfn, long size)
 {
-   struct mapped_device *md = bdev->bd_disk->private_data;
+   sector_t sector = dev_addr >> SECTOR_SHIFT;
struct dm_table *map;
struct dm_target *ti;
int srcu_idx;
@@ -932,6 +933,23 @@ static long dm_blk_direct_access(struct block_device 
*bdev, sector_t sector,
return min(ret, size);
 }
 
+static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
+void **kaddr, pfn_t *pfn, long size)
+{
+   struct mapped_device *md = bdev->bd_disk->private_data;
+
+   return __dm_direct_access(md, sector << SECTOR_SHIFT, kaddr, pfn, size);
+}
+
+static long dm_dax_direct_access(struct dax_inode *dax_inode,
+phys_addr_t dev_addr, void **kaddr, pfn_t *pfn,
+long size)
+{
+   struct mapped_device *md = dax_inode_get_private(dax_inode);
+
+   return __dm_direct_access(md, dev_addr, kaddr, pfn, size);
+}
+
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
  * allowed for all bio types except REQ_PREFLUSH.
@@ -1376,6 +1394,7 @@ static int next_free_minor(int *minor)
 }
 
 static const struct block_device_operations dm_blk_dops;
+static const struct dax_operations dm_dax_ops;
 
 static void dm_wq_work(struct work_struct *work);
 
@@ -1423,6 +1442,12 @@ static void cleanup_mapped_device(struct mapped_device 
*md)
if (md->bs)
bioset_free(md->bs);
 
+   if (md->dax_inode) {
+   kill_dax_inode(md->dax_inode);
+   put_dax_inode(md->dax_inode);
+   md->dax_inode = NULL;
+   }
+
if (md->disk) {
spin_lock(&_minor_lock);
md->disk->private_data = NULL;
@@ -1450,6 +1475,7 @@ static void cleanup_mapped_device(struct mapped_device 
*md)
 static struct mapped_device *alloc_dev(int minor)
 {
int r, numa_node_id = dm_get_numa_node();
+   struct dax_inode *dax_inode;
struct mapped_device *md;
void *old_md;
 
@@ -1514,6 +1540,12 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->queue = md->queue;
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
+
+   dax_inode = alloc_dax_inode(md, md->disk->disk_name, &dm_dax_ops);
+   if (!dax_inode)
+   goto bad;
+   md->dax_inode = dax_inode;
+
add_disk(md->disk);
format_dev_t(md->name, MKDEV(_major, minor));
 
@@ -2735,6 +2767,10 @@ static const struct block_device_operations 

[RFC PATCH 16/17] fs, dax: convert filesystem-dax to bdev_dax_direct_access

2017-01-28 Thread Dan Williams
Now that a dax_inode is plumbed through all dax-capable drivers we can
switch from block_device_operations to dax_operations for invoking
->direct_access.
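
The conversion replaces the map/unmap pair with a read-side lock held around the direct-access call and every use of the returned address. The user-space sketch below shows only the pairing rule, including the unlock on the error path; demo_read_lock(), demo_read_unlock(), demo_direct_access() and demo_copy() are hypothetical, and the counter is a stand-in for whatever locking primitive the real code uses.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

static int read_depth;	/* hypothetical: counts active read-side sections */

static int demo_read_lock(void)
{
	return read_depth++;	/* the token identifies this section */
}

static void demo_read_unlock(int id)
{
	read_depth--;
	assert(read_depth == id);	/* sections must nest properly */
}

/* Hypothetical lookup: fails with -EIO when the device is gone. */
static long demo_direct_access(const char *dev_mem, void **kaddr)
{
	if (!dev_mem)
		return -EIO;
	*kaddr = (void *)dev_mem;
	return (long)strlen(dev_mem) + 1;
}

/* Copy from the device; the mapping is only used inside the locked section. */
static long demo_copy(const char *dev_mem, char *to, size_t size)
{
	void *kaddr;
	long rc;
	int id;

	id = demo_read_lock();
	rc = demo_direct_access(dev_mem, &kaddr);
	if (rc < 0) {
		demo_read_unlock(id);	/* the error path must unlock as well */
		return rc;
	}
	if ((size_t)rc > size)
		rc = (long)size;
	memcpy(to, kaddr, (size_t)rc);
	demo_read_unlock(id);
	return rc;
}
```

After any sequence of successful and failed calls the section count returns to zero, which is the property the goto-label renames in the diff are preserving.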

Signed-off-by: Dan Williams 
---
 fs/dax.c|  143 +++
 fs/iomap.c  |3 +
 include/linux/dax.h |6 +-
 3 files changed, 82 insertions(+), 70 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index a990211c8a3d..07b36a26db06 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -51,32 +51,6 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
-{
-   struct request_queue *q = bdev->bd_queue;
-   long rc = -EIO;
-
-   dax->addr = ERR_PTR(-EIO);
-   if (blk_queue_enter(q, true) != 0)
-   return rc;
-
-   rc = bdev_direct_access(bdev, dax);
-   if (rc < 0) {
-   dax->addr = ERR_PTR(rc);
-   blk_queue_exit(q);
-   return rc;
-   }
-   return rc;
-}
-
-static void dax_unmap_atomic(struct block_device *bdev,
-   const struct blk_dax_ctl *dax)
-{
-   if (IS_ERR(dax->addr))
-   return;
-   blk_queue_exit(bdev->bd_queue);
-}
-
 static int dax_is_pmd_entry(void *entry)
 {
return (unsigned long)entry & RADIX_DAX_PMD;
@@ -549,21 +523,28 @@ static int dax_load_hole(struct address_space *mapping, 
void **entry,
return ret;
 }
 
-static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size,
-   struct page *to, unsigned long vaddr)
+static int copy_user_dax(struct block_device *bdev, struct dax_inode *dax_inode,
+   sector_t sector, size_t size, struct page *to,
+   unsigned long vaddr)
 {
struct blk_dax_ctl dax = {
.sector = sector,
.size = size,
};
void *vto;
+   long rc;
+   int id;
 
-   if (dax_map_atomic(bdev, &dax) < 0)
-   return PTR_ERR(dax.addr);
+   id = dax_read_lock();
+   rc = bdev_dax_direct_access(bdev, dax_inode, &dax);
+   if (rc < 0) {
+   dax_read_unlock(id);
+   return rc;
+   }
vto = kmap_atomic(to);
copy_user_page(vto, (void __force *)dax.addr, vaddr, to);
kunmap_atomic(vto);
-   dax_unmap_atomic(bdev, &dax);
+   dax_read_unlock(id);
return 0;
 }
 
@@ -731,12 +712,13 @@ static void dax_mapping_entry_mkclean(struct 
address_space *mapping,
 }
 
 static int dax_writeback_one(struct block_device *bdev,
-   struct address_space *mapping, pgoff_t index, void *entry)
+   struct dax_inode *dax_inode, struct address_space *mapping,
+   pgoff_t index, void *entry)
 {
struct radix_tree_root *page_tree = &mapping->page_tree;
struct blk_dax_ctl dax;
void *entry2, **slot;
-   int ret = 0;
+   int ret = 0, id;
 
/*
 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -789,18 +771,20 @@ static int dax_writeback_one(struct block_device *bdev,
dax.size = PAGE_SIZE << dax_radix_order(entry);
 
/*
-* We cannot hold tree_lock while calling dax_map_atomic() because it
-* eventually calls cond_resched().
+* bdev_dax_direct_access() may sleep, so cannot hold tree_lock
+* over its invocation.
 */
-   ret = dax_map_atomic(bdev, &dax);
+   id = dax_read_lock();
+   ret = bdev_dax_direct_access(bdev, dax_inode, &dax);
if (ret < 0) {
+   dax_read_unlock(id);
put_locked_mapping_entry(mapping, index, entry);
return ret;
}
 
if (WARN_ON_ONCE(ret < dax.size)) {
ret = -EIO;
-   goto unmap;
+   goto dax_unlock;
}
 
dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
@@ -814,8 +798,8 @@ static int dax_writeback_one(struct block_device *bdev,
spin_lock_irq(&mapping->tree_lock);
radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
spin_unlock_irq(&mapping->tree_lock);
- unmap:
-   dax_unmap_atomic(bdev, &dax);
+ dax_unlock:
+   dax_read_unlock(id);
put_locked_mapping_entry(mapping, index, entry);
return ret;
 
@@ -836,6 +820,7 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
struct inode *inode = mapping->host;
pgoff_t start_index, end_index;
pgoff_t indices[PAGEVEC_SIZE];
+   struct dax_inode *dax_inode;
struct pagevec pvec;
bool done = false;
int i, ret = 0;
@@ -846,6 +831,10 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
return 0;
 
+   dax_inode = dax_get_by_host(bdev->bd_disk->disk_name);
+   if (!dax_inode)
+   return -EIO;
+
start_index = 

[RFC PATCH 14/17] ext2, ext4, xfs: retrieve dax_inode through iomap operations

2017-01-28 Thread Dan Williams
In preparation for converting fs/dax.c to use bdev_dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the
dax_inode determined at mount through ->iomap_begin.

Signed-off-by: Dan Williams 
---
 fs/ext2/inode.c   |1 +
 fs/ext4/inode.c   |1 +
 fs/xfs/xfs_aops.c |   13 +
 fs/xfs/xfs_aops.h |1 +
 fs/xfs/xfs_buf.h  |1 +
 fs/xfs/xfs_iomap.c|1 +
 fs/xfs/xfs_super.c|3 +++
 include/linux/iomap.h |1 +
 8 files changed, 22 insertions(+)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f073bfca694b..c83f84748ec9 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -813,6 +813,7 @@ static int ext2_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
 
iomap->flags = 0;
iomap->bdev = inode->i_sb->s_bdev;
+   iomap->dax_inode = inode->i_sb->s_dax;
iomap->offset = (u64)first_block << blkbits;
 
if (ret == 0) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 88d57af1b516..ae6fa6a78d0d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3344,6 +3344,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
 
iomap->flags = 0;
iomap->bdev = inode->i_sb->s_bdev;
+   iomap->dax_inode = inode->i_sb->s_dax;
iomap->offset = first_block << blkbits;
 
if (ret == 0) {
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 631e7c0e0a29..7d22938a4d8b 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -80,6 +80,19 @@ xfs_find_bdev_for_inode(
return mp->m_ddev_targp->bt_bdev;
 }
 
+struct dax_inode *
+xfs_find_dax_for_inode(
+   struct inode*inode)
+{
+   struct xfs_inode*ip = XFS_I(inode);
+   struct xfs_mount*mp = ip->i_mount;
+
+   if (XFS_IS_REALTIME_INODE(ip))
+   return NULL;
+   else
+   return mp->m_ddev_targp->bt_dax;
+}
+
 /*
  * We're now finished for good with this page.  Update the page state via the
  * associated buffer_heads, paying attention to the start and end offsets that
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index cc174ec6c2fd..e5b65f436acf 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -59,5 +59,6 @@ int   xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, 
size_t size);
 
 extern void xfs_count_page_state(struct page *, int *, int *);
 extern struct block_device *xfs_find_bdev_for_inode(struct inode *);
+extern struct dax_inode *xfs_find_dax_for_inode(struct inode *);
 
 #endif /* __XFS_AOPS_H__ */
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 8a9d3a9599f0..1ff83f398649 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -109,6 +109,7 @@ typedef unsigned int xfs_buf_flags_t;
 typedef struct xfs_buftarg {
dev_t   bt_dev;
struct block_device *bt_bdev;
+   struct dax_inode*bt_dax;
struct backing_dev_info *bt_bdi;
struct xfs_mount*bt_mount;
unsigned intbt_meta_sectorsize;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 0d147428971e..1d08bd2433d5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -69,6 +69,7 @@ xfs_bmbt_to_iomap(
iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
iomap->bdev = xfs_find_bdev_for_inode(VFS_I(ip));
+   iomap->dax_inode = xfs_find_dax_for_inode(VFS_I(ip));
 }
 
 xfs_extlen_t
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index eecbaac08eba..1a99013a0701 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -774,6 +774,9 @@ xfs_open_devices(
if (!mp->m_ddev_targp)
goto out_close_rtdev;
 
+   /* associate dax inode for filesystem-dax */
+   mp->m_ddev_targp->bt_dax = mp->m_super->s_dax;
+
if (rtdev) {
mp->m_rtdev_targp = xfs_alloc_buftarg(mp, rtdev);
if (!mp->m_rtdev_targp)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index a4c94b86401e..01e265e7cf55 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -41,6 +41,7 @@ struct iomap {
u16 type;   /* type of mapping */
u16 flags;  /* flags for mapping */
struct block_device *bdev;  /* block device for I/O */
+   struct dax_inode*dax_inode; /* dax_inode for dax operations */
 };
 
 /*



[RFC PATCH 12/17] dm: add dax_operations support (consumer)

2017-01-28 Thread Dan Williams
Arrange for dm to look up the dax services available from member
devices. Update the dax-capable targets, linear and stripe, to route dax
operations to the underlying device.
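
Routing a dax operation through a striped target means turning a byte address on the top-level device into a member index and a sector on that member. The sketch below is a simplified user-space model of that arithmetic: struct stripe_layout and stripe_map() are hypothetical names, chunks are assumed to be dealt round-robin across members, and the per-member start offset that the patch adds afterwards is left out.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

/* Hypothetical layout: equal-sized chunks dealt round-robin to the members. */
struct stripe_layout {
	uint32_t stripes;	/* number of member devices */
	uint64_t chunk_sectors;	/* chunk size in 512-byte sectors */
};

/*
 * Translate a byte address on the striped device into a member index and
 * a sector on that member.
 */
static void stripe_map(const struct stripe_layout *l, uint64_t dev_addr,
		       uint32_t *stripe, uint64_t *member_sector)
{
	uint64_t sector, chunk, chunk_offset;

	assert(l->stripes > 0 && l->chunk_sectors > 0);

	sector = dev_addr >> SECTOR_SHIFT;
	chunk = sector / l->chunk_sectors;		/* which chunk overall */
	chunk_offset = sector % l->chunk_sectors;	/* position inside it */

	*stripe = (uint32_t)(chunk % l->stripes);
	*member_sector = (chunk / l->stripes) * l->chunk_sectors + chunk_offset;
}
```

With two members and 8-sector chunks, sector 19 falls in chunk 2 at offset 3, so it lands on member 0 at sector 11. The linear target is the degenerate case where only a fixed start offset is added.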

Signed-off-by: Dan Williams 
---
 drivers/md/dm-linear.c|   24 
 drivers/md/dm-snap.c  |   12 
 drivers/md/dm-stripe.c|   30 ++
 drivers/md/dm-target.c|   11 +++
 drivers/md/dm.c   |   16 
 include/linux/device-mapper.h |7 +++
 6 files changed, 96 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 4788b0b989a9..e91ca8089333 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -159,6 +159,29 @@ static long linear_direct_access(struct dm_target *ti, 
sector_t sector,
return ret;
 }
 
+static long linear_dax_direct_access(struct dm_target *ti, phys_addr_t dev_addr,
+void **kaddr, pfn_t *pfn, long size)
+{
+   struct linear_c *lc = ti->private;
+   struct block_device *bdev = lc->dev->bdev;
+   struct dax_inode *dax_inode = lc->dev->dax_inode;
+   struct blk_dax_ctl dax = {
+   .sector = linear_map_sector(ti, dev_addr >> SECTOR_SHIFT),
+   .size = size,
+   };
+   long ret;
+
+   ret = bdev_dax_direct_access(bdev, dax_inode, &dax);
+   *kaddr = dax.addr;
+   *pfn = dax.pfn;
+
+   return ret;
+}
+
+static const struct dm_dax_operations linear_dax_ops = {
+   .dm_direct_access = linear_dax_direct_access,
+};
+
 static struct target_type linear_target = {
.name   = "linear",
.version = {1, 3, 0},
@@ -170,6 +193,7 @@ static struct target_type linear_target = {
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
.direct_access = linear_direct_access,
+   .dax_ops = &linear_dax_ops,
 };
 
 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index c65feeada864..1990e3bd6958 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2309,6 +2309,13 @@ static long origin_direct_access(struct dm_target *ti, 
sector_t sector,
return -EIO;
 }
 
+static long origin_dax_direct_access(struct dm_target *ti, phys_addr_t dev_addr,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   DMWARN("device does not support dax.");
+   return -EIO;
+}
+
 /*
  * Set the target "max_io_len" field to the minimum of all the snapshots'
  * chunk sizes.
@@ -2357,6 +2364,10 @@ static int origin_iterate_devices(struct dm_target *ti,
return fn(ti, o->dev, 0, ti->len, data);
 }
 
+static const struct dm_dax_operations origin_dax_ops = {
+   .dm_direct_access = origin_dax_direct_access,
+};
+
 static struct target_type origin_target = {
.name= "snapshot-origin",
.version = {1, 9, 0},
@@ -2369,6 +2380,7 @@ static struct target_type origin_target = {
.status  = origin_status,
.iterate_devices = origin_iterate_devices,
.direct_access = origin_direct_access,
+   .dax_ops = &origin_dax_ops,
 };
 
 static struct target_type snapshot_target = {
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 28193a57bf47..47fb56a6184a 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -331,6 +331,31 @@ static long stripe_direct_access(struct dm_target *ti, 
sector_t sector,
return ret;
 }
 
+static long stripe_dax_direct_access(struct dm_target *ti, phys_addr_t dev_addr,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   struct stripe_c *sc = ti->private;
+   uint32_t stripe;
+   struct block_device *bdev;
+   struct dax_inode *dax_inode;
+   struct blk_dax_ctl dax = {
+   .size = size,
+   };
+   long ret;
+
+   stripe_map_sector(sc, dev_addr >> SECTOR_SHIFT, &stripe, &dax.sector);
+
+   dax.sector += sc->stripe[stripe].physical_start;
+   bdev = sc->stripe[stripe].dev->bdev;
+   dax_inode = sc->stripe[stripe].dev->dax_inode;
+
+   ret = bdev_dax_direct_access(bdev, dax_inode, &dax);
+   *kaddr = dax.addr;
+   *pfn = dax.pfn;
+
+   return ret;
+}
+
 /*
  * Stripe status:
  *
@@ -437,6 +462,10 @@ static void stripe_io_hints(struct dm_target *ti,
blk_limits_io_opt(limits, chunk_size * sc->stripes);
 }
 
+static const struct dm_dax_operations stripe_dax_ops = {
+   .dm_direct_access = stripe_dax_direct_access,
+};
+
 static struct target_type stripe_target = {
.name   = "striped",
.version = {1, 6, 0},
@@ -449,6 +478,7 @@ static struct target_type stripe_target = {
.iterate_devices = stripe_iterate_devices,
.io_hints = stripe_io_hints,
.direct_access = stripe_direct_access,
+   .dax_ops = &stripe_dax_ops,
 };
 
 int __init dm_stripe_init(void)
diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index 710ae28fd618..ab072f53cf24 

[RFC PATCH 17/17] block: remove block_device_operations.direct_access and related infrastructure

2017-01-28 Thread Dan Williams
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_inode, remove the block
device direct_access enabling.

Signed-off-by: Dan Williams 
---
 arch/powerpc/sysdev/axonram.c |   15 --
 drivers/block/brd.c   |   11 --
 drivers/md/dm-linear.c|   19 -
 drivers/md/dm-snap.c  |8 ---
 drivers/md/dm-stripe.c|   24 --
 drivers/md/dm-table.c |2 +-
 drivers/md/dm-target.c|7 --
 drivers/md/dm.c   |   19 +++--
 drivers/nvdimm/pmem.c |9 
 drivers/s390/block/dcssblk.c  |   16 ---
 fs/block_dev.c|   45 -
 include/linux/blkdev.h|3 ---
 include/linux/device-mapper.h |9 
 13 files changed, 4 insertions(+), 183 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 4e1f58187726..1337b5829980 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -148,23 +148,8 @@ __axon_ram_direct_access(struct axon_ram_bank *bank, 
phys_addr_t offset,
return bank->size - offset;
 }
 
-/**
- * axon_ram_direct_access - direct_access() method for block device
- * @device, @sector, @data: see block_device_operations method
- */
-static long
-axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
-  void **kaddr, pfn_t *pfn, long size)
-{
-   struct axon_ram_bank *bank = device->bd_disk->private_data;
-
-   return __axon_ram_direct_access(bank, sector << AXON_RAM_SECTOR_SHIFT,
-   kaddr, pfn, size);
-}
-
 static const struct block_device_operations axon_ram_devops = {
.owner  = THIS_MODULE,
-   .direct_access  = axon_ram_blk_direct_access
 };
 
 static long
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 1279df4dc07c..52a1259f8ded 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -395,14 +395,6 @@ static long __brd_direct_access(struct brd_device *brd, 
phys_addr_t dev_addr,
return PAGE_SIZE;
 }
 
-static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
-{
-   struct brd_device *brd = bdev->bd_disk->private_data;
-
-   return __brd_direct_access(brd, sector * 512, kaddr, pfn, size);
-}
-
 static long brd_dax_direct_access(struct dax_inode *dax_inode,
phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size)
 {
@@ -414,14 +406,11 @@ static long brd_dax_direct_access(struct dax_inode *dax_inode,
 static const struct dax_operations brd_dax_ops = {
.direct_access = brd_dax_direct_access,
 };
-#else
-#define brd_blk_direct_access NULL
 #endif
 
 static const struct block_device_operations brd_fops = {
.owner =THIS_MODULE,
.rw_page =  brd_rw_page,
-   .direct_access =brd_blk_direct_access,
 };
 
 /*
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index e91ca8089333..7ec2a8eb8a14 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -141,24 +141,6 @@ static int linear_iterate_devices(struct dm_target *ti,
return fn(ti, lc->dev, lc->start, ti->len, data);
 }
 
-static long linear_direct_access(struct dm_target *ti, sector_t sector,
-void **kaddr, pfn_t *pfn, long size)
-{
-   struct linear_c *lc = ti->private;
-   struct block_device *bdev = lc->dev->bdev;
-   struct blk_dax_ctl dax = {
-   .sector = linear_map_sector(ti, sector),
-   .size = size,
-   };
-   long ret;
-
-   ret = bdev_direct_access(bdev, &dax);
-   *kaddr = dax.addr;
-   *pfn = dax.pfn;
-
-   return ret;
-}
-
 static long linear_dax_direct_access(struct dm_target *ti, phys_addr_t dev_addr,
 void **kaddr, pfn_t *pfn, long size)
 {
@@ -192,7 +174,6 @@ static struct target_type linear_target = {
.status = linear_status,
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
-   .direct_access = linear_direct_access,
.dax_ops = &linear_dax_ops,
 };
 
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 1990e3bd6958..1d9407633bb5 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2302,13 +2302,6 @@ static int origin_map(struct dm_target *ti, struct bio *bio)
return do_origin(o->dev, bio);
 }
 
-static long origin_direct_access(struct dm_target *ti, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
-{
-   DMWARN("device does not support dax.");
-   return -EIO;
-}
-
 static long origin_dax_direct_access(struct dm_target *ti, phys_addr_t dev_addr,
void **kaddr, pfn_t *pfn, long size)
 {
@@ -2379,7 +2372,6 

[RFC PATCH 03/17] dax: add a facility to lookup a dax inode by 'host' device name

2017-01-28 Thread Dan Williams
For the current block_device based filesystem-dax path, we need a way
for it to lookup the dax_inode associated with a block_device. Add a
'host' property of a dax_inode that can be used for this purpose. It is
a free form string, but for a dax_inode associated with a block device
it is the bdev name.

This is a band-aid until filesystems are able to mount on a dax-inode
directly.

We use a hash list since blkdev_writepages() will need to use this
interface to issue dax_writeback_mapping_range().
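
The name-keyed lookup described above can be sketched in plain user-space
C. This is only an illustration under assumptions: the toy_* names, the
fixed table size, and the djb2 hash are hypothetical, and the sketch
leaves out the spinlock that the patch takes around list updates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_HASH_SIZE 64	/* hypothetical; the patch derives its size from PAGE_SIZE */

/* Hypothetical stand-in for a dax_inode that carries a 'host' name. */
struct toy_dax {
	struct toy_dax *next;	/* bucket chain */
	const char *host;	/* e.g. a block device name */
	void *private;
};

static struct toy_dax *toy_host_list[TOY_HASH_SIZE];

/* Simple string hash (djb2); the patch uses hashlen_string() instead. */
static unsigned int toy_host_hash(const char *host)
{
	unsigned long h = 5381;

	while (*host)
		h = h * 33 + (unsigned char)*host++;
	return h % TOY_HASH_SIZE;
}

/* Publish an object under its host name; a NULL host is never hashed. */
static void toy_add_host(struct toy_dax *d, const char *host)
{
	unsigned int hash;

	d->next = NULL;
	d->host = host;
	if (!host)
		return;
	hash = toy_host_hash(host);
	d->next = toy_host_list[hash];
	toy_host_list[hash] = d;
}

/* Walk one bucket and compare names; returns NULL when nothing matches. */
static struct toy_dax *toy_find_by_host(const char *host)
{
	struct toy_dax *d;

	assert(host != NULL);
	for (d = toy_host_list[toy_host_hash(host)]; d; d = d->next)
		if (strcmp(d->host, host) == 0)
			return d;
	return NULL;
}
```

Objects registered with a NULL host, as the device-dax path does in this
patch, simply never appear in any bucket, so a lookup by name cannot find
them.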

Signed-off-by: Dan Williams 
---
 drivers/dax/dax.h|2 +
 drivers/dax/device.c |2 +
 drivers/dax/super.c  |   79 +-
 include/linux/dax.h  |1 +
 4 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index def061aa75f4..f33c16ed2ec6 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,7 +13,7 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_inode;
-struct dax_inode *alloc_dax_inode(void *private);
+struct dax_inode *alloc_dax_inode(void *private, const char *host);
 void put_dax_inode(struct dax_inode *dax_inode);
 bool dax_inode_alive(struct dax_inode *dax_inode);
 void kill_dax_inode(struct dax_inode *dax_inode);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index af06d0bfd6ea..6d0a3241a608 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -560,7 +560,7 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
goto err_id;
}
 
-   dax_inode = alloc_dax_inode(dax_dev);
+   dax_inode = alloc_dax_inode(dax_dev, NULL);
if (!dax_inode)
goto err_inode;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 7c4dc97d53a8..7ac048f94b2b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -30,6 +30,10 @@ static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
 
+#define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
+static struct hlist_head dax_host_list[DAX_HASH_SIZE];
+static DEFINE_SPINLOCK(dax_host_lock);
+
 int dax_read_lock(void)
 {
return srcu_read_lock(&dax_srcu);
@@ -46,12 +50,15 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
  * struct dax_inode - anchor object for dax services
  * @inode: core vfs
  * @cdev: optional character interface for "device dax"
+ * @host: optional name for lookups where the device path is not available
  * @private: dax driver private data
  * @alive: !alive + rcu grace period == no new operations / mappings
  */
 struct dax_inode {
+   struct hlist_node list;
struct inode inode;
struct cdev cdev;
+   const char *host;
void *private;
bool alive;
 };
@@ -63,6 +70,11 @@ bool dax_inode_alive(struct dax_inode *dax_inode)
 }
 EXPORT_SYMBOL_GPL(dax_inode_alive);
 
+static int dax_host_hash(const char *host)
+{
+   return hashlen_hash(hashlen_string("DAX", host)) % DAX_HASH_SIZE;
+}
+
 /*
  * Note, rcu is not protecting the liveness of dax_inode, rcu is
  * ensuring that any fault handlers or operations that might have seen
@@ -75,6 +87,12 @@ void kill_dax_inode(struct dax_inode *dax_inode)
return;
 
dax_inode->alive = false;
+
+   spin_lock(&dax_host_lock);
+   if (!hlist_unhashed(&dax_inode->list))
+   hlist_del_init(&dax_inode->list);
+   spin_unlock(&dax_host_lock);
+
synchronize_srcu(&dax_srcu);
dax_inode->private = NULL;
 }
@@ -98,6 +116,8 @@ static void dax_i_callback(struct rcu_head *head)
struct inode *inode = container_of(head, struct inode, i_rcu);
struct dax_inode *dax_inode = to_dax_inode(inode);
 
+   kfree(dax_inode->host);
+   dax_inode->host = NULL;
ida_simple_remove(&dax_minor_ida, MINOR(inode->i_rdev));
kmem_cache_free(dax_cache, dax_inode);
 }
@@ -169,26 +189,49 @@ static struct dax_inode *dax_inode_get(dev_t devt)
return dax_inode;
 }
 
-struct dax_inode *alloc_dax_inode(void *private)
+static void dax_add_host(struct dax_inode *dax_inode, const char *host)
+{
+   int hash;
+
+   INIT_HLIST_NODE(&dax_inode->list);
+   if (!host)
+   return;
+
+   dax_inode->host = host;
+   hash = dax_host_hash(host);
+   spin_lock(&dax_host_lock);
+   hlist_add_head(&dax_inode->list, &dax_host_list[hash]);
+   spin_unlock(&dax_host_lock);
+}
+
+struct dax_inode *alloc_dax_inode(void *private, const char *__host)
 {
struct dax_inode *dax_inode;
+   const char *host;
dev_t devt;
int minor;
 
+   host = kstrdup(__host, GFP_KERNEL);
+   if (__host && !host)
+   return NULL;
+
minor = ida_simple_get(&dax_minor_ida, 0, nr_dax, GFP_KERNEL);
if (minor < 0)
-   return NULL;
+   goto err_minor;
 
devt = MKDEV(MAJOR(dax_devt), minor);
dax_inode = dax_inode_get(devt);
if (!dax_inode)
   

[RFC PATCH 01/17] dax: refactor dax-fs into a generic provider of dax inodes

2017-01-28 Thread Dan Williams
We want dax capable drivers to be able to publish a set of dax
operations [1]. However, we do not want to further abuse block_devices
to advertise these operations. Instead we will attach these operations
to a dax inode and add a lookup mechanism to go from block device path
to a dax inode. A dax capable driver like pmem or brd is responsible for
registering a dax inode, alongside a block device, and then a dax
capable filesystem is responsible for retrieving the dax inode by path
name if it wants to call dax_operations.

For now, we refactor the dax pseudo-fs to be a generic facility, rather
than an implementation detail, of the device-dax use case. Where a "dax
inode" is just an inode + dax infrastructure, and "Device DAX" is a
mapping service layered on top of that base inode. "Filesystem DAX" is
then a mapping service that layers a filesystem on top of the base dax
inode. Filesystem DAX goes through a block_device for now, but perhaps
directly to a dax inode in the future, or for new pmem-only filesystems.
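
As a rough user-space illustration of the layering described above, a
driver can publish an operations table on an anchor object, and a consumer
can call through that object without knowing the driver. Everything named
toy_* here is hypothetical, the types are simplified, and the real
dax_operations introduced later in the series may look different.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dax_inode;

/* Hypothetical, simplified operations table. */
struct toy_dax_operations {
	long (*direct_access)(struct toy_dax_inode *dax, size_t dev_addr,
			      void **kaddr, long size);
};

/* Anchor object: driver-private data plus the operations it publishes. */
struct toy_dax_inode {
	void *private;
	const struct toy_dax_operations *ops;
};

/* Toy "ram disk" driver whose backing store is a plain byte array. */
struct toy_ramdisk {
	unsigned char *base;
	size_t len;
};

/* Returns the number of bytes addressable at *kaddr, or -1 on a bad range. */
static long toy_ram_direct_access(struct toy_dax_inode *dax, size_t dev_addr,
				  void **kaddr, long size)
{
	struct toy_ramdisk *rd = dax->private;
	size_t avail;

	assert(rd != NULL);
	if (size <= 0 || dev_addr >= rd->len)
		return -1;
	avail = rd->len - dev_addr;
	*kaddr = rd->base + dev_addr;
	return (size_t)size < avail ? size : (long)avail;
}

static const struct toy_dax_operations toy_ram_dax_ops = {
	.direct_access = toy_ram_direct_access,
};

/* Consumer side: a filesystem-like caller only sees the anchor object. */
static long toy_dax_direct_access(struct toy_dax_inode *dax, size_t dev_addr,
				  void **kaddr, long size)
{
	if (!dax->ops || !dax->ops->direct_access)
		return -1;
	return dax->ops->direct_access(dax, dev_addr, kaddr, size);
}
```

The consumer never touches struct toy_ramdisk, which mirrors the intent of
keeping the operations off the block device and on the dax object.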

[1]: https://lkml.org/lkml/2017/1/19/880

Suggested-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 drivers/Makefile|2 
 drivers/dax/Kconfig |8 +
 drivers/dax/Makefile|5 +
 drivers/dax/dax.h   |   24 ++-
 drivers/dax/device-dax.h|   25 +++
 drivers/dax/device.c|  241 +
 drivers/dax/pmem.c  |2 
 drivers/dax/super.c |  310 +++
 tools/testing/nvdimm/Kbuild |6 -
 9 files changed, 402 insertions(+), 221 deletions(-)
 create mode 100644 drivers/dax/device-dax.h
 rename drivers/dax/{dax.c => device.c} (75%)
 create mode 100644 drivers/dax/super.c

diff --git a/drivers/Makefile b/drivers/Makefile
index 060026a02f59..17f42e4a6717 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -68,7 +68,7 @@ obj-$(CONFIG_PARPORT) += parport/
 obj-$(CONFIG_NVM)  += lightnvm/
 obj-y  += base/ block/ misc/ mfd/ nfc/
 obj-$(CONFIG_LIBNVDIMM)+= nvdimm/
-obj-$(CONFIG_DEV_DAX)  += dax/
+obj-$(CONFIG_DAX)  += dax/
 obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
 obj-$(CONFIG_NUBUS)+= nubus/
 obj-y  += macintosh/
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 3e2ab3b14eea..39bcbf4c5e40 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,6 +1,11 @@
-menuconfig DEV_DAX
+menuconfig DAX
tristate "DAX: direct access to differentiated memory"
default m if NVDIMM_DAX
+
+if DAX
+
+config DEV_DAX
+   tristate "Device DAX: direct access mapping device"
depends on TRANSPARENT_HUGEPAGE
help
  Support raw access to differentiated (persistence, bandwidth,
@@ -10,7 +15,6 @@ menuconfig DEV_DAX
  baseline memory pool.  Mappings of a /dev/daxX.Y device impose
  restrictions that make the mapping behavior deterministic.
 
-if DEV_DAX
 
 config DEV_DAX_PMEM
tristate "PMEM DAX: direct access to persistent memory"
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 27c54e38478a..dc7422530462 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -1,4 +1,7 @@
-obj-$(CONFIG_DEV_DAX) += dax.o
+obj-$(CONFIG_DAX) += dax.o
+obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
 
+dax-y := super.o
 dax_pmem-y := pmem.o
+device_dax-y := device.o
diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index ddd829ab58c0..def061aa75f4 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ * Copyright(c) 2016 - 2017 Intel Corporation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of version 2 of the GNU General Public License as
@@ -12,14 +12,16 @@
  */
 #ifndef __DAX_H__
 #define __DAX_H__
-struct device;
-struct dax_dev;
-struct resource;
-struct dax_region;
-void dax_region_put(struct dax_region *dax_region);
-struct dax_region *alloc_dax_region(struct device *parent,
-   int region_id, struct resource *res, unsigned int align,
-   void *addr, unsigned long flags);
-struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
-   struct resource *res, int count);
+struct dax_inode;
+struct dax_inode *alloc_dax_inode(void *private);
+void put_dax_inode(struct dax_inode *dax_inode);
+bool dax_inode_alive(struct dax_inode *dax_inode);
+void kill_dax_inode(struct dax_inode *dax_inode);
+struct dax_inode *inode_to_dax_inode(struct inode *inode);
+struct inode *dax_inode_to_inode(struct dax_inode *dax_inode);
+void *dax_inode_get_private(struct dax_inode *dax_inode);
+int dax_inode_register(struct dax_inode *dax_inode,
+   const struct file_operations *fops, struct module *owner,
+   struct 

[RFC PATCH 00/17] introduce a dax_inode for dax_operations

2017-01-28 Thread Dan Williams
Recently there was an effort to introduce dax_operations to unwind the
abuse of the user-copy api in the pmem api [1]. Christoph noted that we
should not add new block-dax operations as it is further abuse of struct
block_device [2].

The ->direct_access() method in block_device_operations was an expedient
way to get the filesystem-dax capability bootstrapped. However, looking
forward to native persistent memory filesystems, they can forgo the
block layer and mount directly on a provider of dax services, a dax
inode.

For the time being, since current dax capable filesystems are block
based, we need a facility to look up this dax object via the
block-device name. If this approach looks reasonable I'll follow up with
reworking the proposed ->copy_from_iter(), ->flush(), and ->clear() dax
operations into this new scheme.

These patches survive a run of the libnvdimm unit tests, but I have not
tested the non-libnvdimm dax drivers.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008586.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008638.html

---

Dan Williams (17):
  dax: refactor dax-fs into a generic provider of dax inodes
  dax: convert dax_inode locking to srcu
  dax: add a facility to lookup a dax inode by 'host' device name
  dax: introduce dax_operations
  pmem: add dax_operations support
  axon_ram: add dax_operations support
  brd: add dax_operations support
  dcssblk: add dax_operations support
  block: kill bdev_dax_capable()
  block: introduce bdev_dax_direct_access()
  dm: add dax_operations support (producer)
  dm: add dax_operations support (consumer)
  fs: update mount_bdev() to lookup dax infrastructure
  ext2, ext4, xfs: retrieve dax_inode through iomap operations
  Revert "block: use DAX for partition table reads"
  fs, dax: convert filesystem-dax to bdev_dax_direct_access
  block: remove block_device_operations.direct_access and related infrastructure


 arch/powerpc/platforms/Kconfig  |1 
 arch/powerpc/sysdev/axonram.c   |   37 +++
 block/Kconfig   |1 
 block/partition-generic.c   |   17 --
 drivers/Makefile|2 
 drivers/block/Kconfig   |1 
 drivers/block/brd.c |   48 +++-
 drivers/dax/Kconfig |9 +
 drivers/dax/Makefile|5 
 drivers/dax/dax.h   |   19 +-
 drivers/dax/device-dax.h|   25 ++
 drivers/dax/device.c|  257 ---
 drivers/dax/pmem.c  |2 
 drivers/dax/super.c |  434 +++
 drivers/md/Kconfig  |1 
 drivers/md/dm-core.h|3 
 drivers/md/dm-linear.c  |   15 +
 drivers/md/dm-snap.c|8 +
 drivers/md/dm-stripe.c  |   16 +
 drivers/md/dm-table.c   |2 
 drivers/md/dm-target.c  |   10 +
 drivers/md/dm.c |   43 +++-
 drivers/nvdimm/Kconfig  |1 
 drivers/nvdimm/pmem.c   |   46 +++-
 drivers/nvdimm/pmem.h   |7 -
 drivers/s390/block/Kconfig  |1 
 drivers/s390/block/dcssblk.c|   41 +++-
 fs/block_dev.c  |   75 ++-
 fs/dax.c|  149 ++---
 fs/ext2/inode.c |1 
 fs/ext4/inode.c |1 
 fs/iomap.c  |3 
 fs/super.c  |   32 +++
 fs/xfs/xfs_aops.c   |   13 +
 fs/xfs/xfs_aops.h   |1 
 fs/xfs/xfs_buf.h|1 
 fs/xfs/xfs_iomap.c  |1 
 fs/xfs/xfs_super.c  |3 
 include/linux/blkdev.h  |7 -
 include/linux/dax.h |   29 ++-
 include/linux/device-mapper.h   |   16 +
 include/linux/fs.h  |1 
 include/linux/iomap.h   |1 
 tools/testing/nvdimm/Kbuild |6 -
 tools/testing/nvdimm/pmem-dax.c |   12 -
 45 files changed, 927 insertions(+), 477 deletions(-)
 create mode 100644 drivers/dax/device-dax.h
 rename drivers/dax/{dax.c => device.c} (74%)
 create mode 100644 drivers/dax/super.c
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 02/17] dax: convert dax_inode locking to srcu

2017-01-28 Thread Dan Williams
In preparation for adding dax_operations that perform ->direct_access()
and user copy operations relative to a dax_inode, convert the existing
dax_inode locking to srcu. Some dax drivers need to sleep in their
->direct_access() methods and user copying may fault / sleep.
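
The "check liveness inside a read-side section, then wait for readers
before tearing down" pattern can be modelled in user space. The sketch
below is a toy stand-in, not SRCU: it uses a single C11 atomic reader
count and a busy-wait, so unlike SRCU it also waits for readers that
start after the flag flips. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Count of active read-side sections (toy replacement for an srcu_struct). */
static atomic_int toy_readers;

struct toy_dax {
	atomic_bool alive;
	void *private;
};

static int toy_read_lock(void)
{
	atomic_fetch_add(&toy_readers, 1);
	return 0;	/* SRCU hands back an index; the toy has nothing to return */
}

static void toy_read_unlock(int id)
{
	(void)id;
	assert(atomic_load(&toy_readers) > 0);
	atomic_fetch_sub(&toy_readers, 1);
}

/* Only meaningful between toy_read_lock() and toy_read_unlock(). */
static bool toy_dax_alive(struct toy_dax *d)
{
	return atomic_load(&d->alive);
}

/*
 * Mark the object dead, wait until the reader count drops to zero, then
 * clear the private data. Must not be called from inside a read-side
 * section, or the busy-wait never ends.
 */
static void toy_kill_dax(struct toy_dax *d)
{
	atomic_store(&d->alive, false);
	while (atomic_load(&toy_readers) > 0)
		;	/* toy "grace period" */
	d->private = NULL;
}
```

A reader that enters after toy_kill_dax() has stored false observes
!alive and backs out, which is the property the comment in
kill_dax_inode() relies on.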

Signed-off-by: Dan Williams 
---
 drivers/dax/Kconfig  |1 +
 drivers/dax/device.c |   18 +-
 drivers/dax/super.c  |   20 
 include/linux/dax.h  |3 +++
 4 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 39bcbf4c5e40..b7053eafd88e 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,5 +1,6 @@
 menuconfig DAX
tristate "DAX: direct access to differentiated memory"
+   select SRCU
default m if NVDIMM_DAX
 
 if DAX
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 5b5572314929..af06d0bfd6ea 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -333,16 +333,16 @@ static int __dax_dev_fault(struct dax_dev *dax_dev, struct vm_area_struct *vma,
 
 static int dax_dev_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-   int rc;
+   int rc, id;
struct file *filp = vma->vm_file;
struct dax_dev *dax_dev = filp->private_data;
 
dev_dbg(&dax_dev->dev, "%s: %s: %s (%#lx - %#lx)\n", __func__,
current->comm, (vmf->flags & FAULT_FLAG_WRITE)
? "write" : "read", vma->vm_start, vma->vm_end);
-   rcu_read_lock();
+   id = dax_read_lock();
rc = __dax_dev_fault(dax_dev, vma, vmf);
-   rcu_read_unlock();
+   dax_read_unlock(id);
 
return rc;
 }
@@ -390,7 +390,7 @@ static int __dax_dev_pmd_fault(struct dax_dev *dax_dev,
 static int dax_dev_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, unsigned int flags)
 {
-   int rc;
+   int rc, id;
struct file *filp = vma->vm_file;
struct dax_dev *dax_dev = filp->private_data;
 
@@ -398,9 +398,9 @@ static int dax_dev_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
current->comm, (flags & FAULT_FLAG_WRITE)
? "write" : "read", vma->vm_start, vma->vm_end);
 
-   rcu_read_lock();
+   id = dax_read_lock();
rc = __dax_dev_pmd_fault(dax_dev, vma, addr, pmd, flags);
-   rcu_read_unlock();
+   dax_read_unlock(id);
 
return rc;
 }
@@ -412,8 +412,8 @@ static const struct vm_operations_struct dax_dev_vm_ops = {
 
 static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 {
+   int rc, id;
struct dax_dev *dax_dev = filp->private_data;
-   int rc;
 
dev_dbg(&dax_dev->dev, "%s\n", __func__);
 
@@ -421,9 +421,9 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 * We lock to check dax_inode liveness and will re-check at
 * fault time.
 */
-   rcu_read_lock();
+   id = dax_read_lock();
rc = check_vma(dax_dev, vma, __func__);
-   rcu_read_unlock();
+   dax_read_unlock(id);
if (rc)
return rc;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e6369b851619..7c4dc97d53a8 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -24,11 +24,24 @@ module_param(nr_dax, int, S_IRUGO);
 MODULE_PARM_DESC(nr_dax, "max number of dax device instances");
 
 static dev_t dax_devt;
+DEFINE_STATIC_SRCU(dax_srcu);
 static struct vfsmount *dax_mnt;
 static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
 
+int dax_read_lock(void)
+{
+   return srcu_read_lock(&dax_srcu);
+}
+EXPORT_SYMBOL_GPL(dax_read_lock);
+
+void dax_read_unlock(int id)
+{
+   srcu_read_unlock(&dax_srcu, id);
+}
+EXPORT_SYMBOL_GPL(dax_read_unlock);
+
 /**
  * struct dax_inode - anchor object for dax services
  * @inode: core vfs
@@ -45,8 +58,7 @@ struct dax_inode {
 
 bool dax_inode_alive(struct dax_inode *dax_inode)
 {
-   RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
-   "dax operations require rcu_read_lock()\n");
+   lockdep_assert_held(&dax_srcu);
return dax_inode->alive;
 }
 EXPORT_SYMBOL_GPL(dax_inode_alive);
@@ -55,7 +67,7 @@ EXPORT_SYMBOL_GPL(dax_inode_alive);
  * Note, rcu is not protecting the liveness of dax_inode, rcu is
  * ensuring that any fault handlers or operations that might have seen
  * dax_inode_alive(), have completed.  Any operations that start after
- * synchronize_rcu() has run will abort upon seeing !dax_inode_alive().
+ * synchronize_srcu() has run will abort upon seeing !dax_inode_alive().
  */
 void kill_dax_inode(struct dax_inode *dax_inode)
 {
@@ -63,7 +75,7 @@ void kill_dax_inode(struct dax_inode *dax_inode)
return;
 
dax_inode->alive = false;
-   synchronize_rcu();
+   synchronize_srcu(&dax_srcu);
dax_inode->private = NULL;
 }

Re: [PATCH 15/18] scsi: allocate scsi_cmnd structures as part of struct request

2017-01-28 Thread h...@lst.de
On Fri, Jan 27, 2017 at 06:39:46PM +, Bart Van Assche wrote:
> Why have the scsi_release_buffers() and scsi_put_command(cmd) calls been
> moved up? I haven't found an explanation for this change in the patch
> description.

Because they reference the scsi_cmnd, which is now part of the request
and thus freed by blk_finish_request.  And yes, I should have mentioned
it in the changelog, sorry.
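
To illustrate the ordering constraint with a user-space sketch: once the
command lives in the same allocation as the request, anything that touches
the command has to run before the request is freed. The toy_* names and
layouts below are hypothetical and are not the SCSI or block layer code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical request header; the driver payload follows it in memory. */
struct toy_request {
	size_t cmd_size;
	int tag;
};

/* Hypothetical driver payload, standing in for a SCSI command. */
struct toy_cmd {
	unsigned char cdb[16];
	void *buffers;
};

/* One allocation holds the request followed by cmd_size payload bytes. */
static struct toy_request *toy_alloc_request(size_t cmd_size)
{
	struct toy_request *rq = calloc(1, sizeof(*rq) + cmd_size);

	if (rq)
		rq->cmd_size = cmd_size;
	return rq;
}

/* The payload sits directly behind the request header. */
static void *toy_rq_to_pdu(struct toy_request *rq)
{
	return rq + 1;
}

/* Release what the command owns while the command memory is still valid. */
static void toy_release_buffers(struct toy_cmd *cmd)
{
	free(cmd->buffers);
	cmd->buffers = NULL;
}

/* Freeing the request also frees the embedded command. */
static void toy_finish_request(struct toy_request *rq)
{
	struct toy_cmd *cmd = toy_rq_to_pdu(rq);

	assert(rq->cmd_size >= sizeof(*cmd));
	toy_release_buffers(cmd);	/* must come first */
	free(rq);			/* cmd is invalid from here on */
}
```

Swapping the last two calls in toy_finish_request() would read freed
memory, which is the same reason the release calls had to move up in the
patch.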

> Please also consider to remove the cmd->request->special = NULL assignments
> via this patch. Since this patch makes the lifetime of struct scsi_cmnd and
> struct request identical these assignments are no longer needed.

True.  If I had to resend again I would have fixed it up, but it's probably
not worth the churn now.

> This patch introduces the function scsi_exit_rq(). Having two functions
> for the single-queue path that release resources (scsi_release_buffers()
> and scsi_exit_rq()) is confusing. Since every scsi_release_buffers() call
> is followed by a blk_unprep_request() call, have you considered to move
> the scsi_release_buffers() call into scsi_unprep_fn() via an additional
> patch?

We could have done that.  But it's just more change for a code path
that I hope won't survive this calendar year.


Re: split scsi passthrough fields out of struct request V2

2017-01-28 Thread h...@lst.de
On Fri, Jan 27, 2017 at 09:27:53PM +, Bart Van Assche wrote:
> Have you considered to convert all block drivers to the new
> approach and to get rid of request.special? If so, do you already
> have plans to start working on this? I'm namely wondering whether I
> should start working on this myself.

Hi Bart,

I'd love to have all drivers move off using .special (and thus reduce
request size further).  I think the general way to do that is to convert
them to blk-mq and not using the legacy cmd_size field.


Re: split scsi passthrough fields out of struct request V3

2017-01-28 Thread h...@lst.de
On Fri, Jan 27, 2017 at 06:58:53PM +, Bart Van Assche wrote:
> Version 3 of the patch with title "block: split scsi_request out of
> struct request" (commit 3c30af6ebe12) differs significantly from v2
> of that patch that has been posted on several mailing lists. E.g. v2
> moves __cmd[], cmd and cmd_len from struct request into struct
> scsi_request but v3 does not. Which version do you want us to review?

Hi Bart,

I tried to resend the whole updated v3 series, but the mail server
stopped accepting mails due to overload.  Otherwise it would have
included all the patches.  Jens instead took the updated version
straight from this git branch:


http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-pc-refactor


Re: [PATCH 14/18] scsi: remove __scsi_alloc_queue

2017-01-28 Thread h...@lst.de
On Fri, Jan 27, 2017 at 05:58:02PM +, Bart Van Assche wrote:
> Since __scsi_init_queue() modifies data in the Scsi_Host structure, have you
> considered to add the declaration for this function to ?
> If you want to keep this declaration in  please add a
> direct include of that header file to drivers/scsi/scsi_lib.c such that the
> declaration remains visible to the compiler if someone would minimize the
> number of #include directives in SCSI header files.

Feel free to send an incremental patch either way.  In the long run
I'd really like to kill off __scsi_init_queue and remove the transport
BSG queue abuse of SCSI internals, though.