[RFC PATCH] scsi, block: fix duplicate bdi name registration crashes
Warnings of the following form occur because scsi reuses a devt number
while the block layer still has it referenced as the name of the bdi
[1]:

 WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
 sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
 [..]
 Call Trace:
  dump_stack+0x86/0xc3
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  ? kernfs_path_from_node+0x4f/0x60
  sysfs_warn_dup+0x62/0x80
  sysfs_create_dir_ns+0x77/0x90
  kobject_add_internal+0xb2/0x350
  kobject_add+0x75/0xd0
  device_add+0x15a/0x650
  device_create_groups_vargs+0xe0/0xf0
  device_create_vargs+0x1c/0x20
  bdi_register+0x90/0x240
  ? lockdep_init_map+0x57/0x200
  bdi_register_owner+0x36/0x60
  device_add_disk+0x1bb/0x4e0
  ? __pm_runtime_use_autosuspend+0x5c/0x70
  sd_probe_async+0x10d/0x1c0
  async_run_entry_fn+0x39/0x170

This is a brute-force fix to pass the devt release information from
sd_probe() to the locations where we register the bdi,
device_add_disk(), and unregister the bdi, blk_cleanup_queue().

Thanks to Omar for the quick reproducer script [2]. This patch survives
where an unmodified kernel fails in a few seconds.

[1]: https://marc.info/?l=linux-scsi=147116857810716=4
[2]: http://marc.info/?l=linux-block=148554717109098=2

Cc: James Bottomley
Cc: Bart Van Assche
Cc: "Martin K. Petersen"
Cc: Christoph Hellwig
Cc: Jens Axboe
Reported-by: Omar Sandoval
Signed-off-by: Dan Williams
---
 block/blk-core.c       |  1 +
 block/genhd.c          |  7 +++
 drivers/scsi/sd.c      | 41 +
 include/linux/blkdev.h |  1 +
 include/linux/genhd.h  | 17 +
 5 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c58b64..950cea1e202e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -597,6 +597,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	spin_unlock_irq(lock);
 
 	bdi_unregister(>backing_dev_info);
+	put_disk_devt(q->disk_devt);
 
 	/* @q is and will stay empty, shutdown and put */
 	blk_put_queue(q);
diff --git a/block/genhd.c b/block/genhd.c
index fcd6d4fae657..eb8009e928f5 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -612,6 +612,13 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
 
 	disk_alloc_events(disk);
 
+	/*
+	 * Take a reference on the devt and assign it to queue since it
+	 * must not be reallocated while the bdi is registered
+	 */
+	disk->queue->disk_devt = disk->disk_devt;
+	get_disk_devt(disk->disk_devt);
+
 	/* Register BDI before referencing it from bdev */
 	bdi = >queue->backing_dev_info;
 	bdi_register_owner(bdi, disk_to_dev(disk));
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0b09638fa39b..09405351577c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3067,6 +3067,23 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
 	put_device(>dev);
 }
 
+struct sd_devt {
+	int idx;
+	struct disk_devt disk_devt;
+};
+
+void sd_devt_release(struct kref *kref)
+{
+	struct sd_devt *sd_devt = container_of(kref, struct sd_devt,
+			disk_devt.kref);
+
+	spin_lock(_index_lock);
+	ida_remove(_index_ida, sd_devt->idx);
+	spin_unlock(_index_lock);
+
+	kfree(sd_devt);
+}
+
 /**
  *	sd_probe - called during driver initialization and whenever a
  *	new scsi device is attached to the system. It is called once
@@ -3088,6 +3105,7 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
 static int sd_probe(struct device *dev)
 {
 	struct scsi_device *sdp = to_scsi_device(dev);
+	struct sd_devt *sd_devt;
 	struct scsi_disk *sdkp;
 	struct gendisk *gd;
 	int index;
@@ -3113,9 +3131,13 @@ static int sd_probe(struct device *dev)
 	if (!sdkp)
 		goto out;
 
+	sd_devt = kzalloc(sizeof(*sd_devt), GFP_KERNEL);
+	if (!sd_devt)
+		goto out_free;
+
 	gd = alloc_disk(SD_MINORS);
 	if (!gd)
-		goto out_free;
+		goto out_free_devt;
 
 	do {
 		if (!ida_pre_get(_index_ida, GFP_KERNEL))
@@ -3131,6 +3153,11 @@ static int sd_probe(struct device *dev)
 		goto out_put;
 	}
 
+	kref_init(_devt->disk_devt.kref);
+	sd_devt->disk_devt.release = sd_devt_release;
+	sd_devt->idx = index;
+	gd->disk_devt = _devt->disk_devt;
+
 	error = sd_format_disk_name("sd", index, gd->disk_name, DISK_NAME_LEN);
 	if (error) {
 		sdev_printk(KERN_WARNING, sdp, "SCSI disk (sd) name length exceeded.\n");
@@ -3170,13 +3197,14 @@ static int sd_probe(struct device *dev)
 	return 0;
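The lifetime rule this patch introduces can be illustrated with a minimal userspace sketch. It is not the kernel code: the disk index is wrapped in a reference-counted object and is only handed back once the last holder drops its reference. Every demo_* name below is hypothetical, the counter is a plain int rather than a kref, and no locking is modeled.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct disk_devt: a refcount plus a release hook. */
struct demo_devt {
	int refs;
	void (*release)(struct demo_devt *);
};

/* Hypothetical stand-in for struct sd_devt: the index travels with the refcount. */
struct demo_sd_devt {
	int idx;
	struct demo_devt devt;
};

/* Records which index was handed back, standing in for the ida removal. */
static int demo_released_idx = -1;

static void demo_get(struct demo_devt *d)
{
	if (d)
		d->refs++;
}

static void demo_put(struct demo_devt *d)
{
	/* Not atomic: a sketch only, the kernel would use kref_put(). */
	if (d && --d->refs == 0)
		d->release(d);
}

static void demo_sd_release(struct demo_devt *d)
{
	/* Open-coded container_of(): recover the wrapper from the member. */
	struct demo_sd_devt *sd = (struct demo_sd_devt *)
		((char *)d - offsetof(struct demo_sd_devt, devt));

	demo_released_idx = sd->idx;
	free(sd);
}

static struct demo_devt *demo_sd_alloc(int idx)
{
	struct demo_sd_devt *sd = calloc(1, sizeof(*sd));

	if (!sd)
		return NULL;
	sd->idx = idx;
	sd->devt.refs = 1;	/* the allocating driver holds the first reference */
	sd->devt.release = demo_sd_release;
	return &sd->devt;
}
```

In this model the second reference plays the role of the one taken in device_add_disk(): the driver can drop its own reference early, yet the index is only released when the queue-side holder also puts it.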
[PATCH V3 1/1] percpu-refcount: fix reference leak during percpu-atomic transition
percpu_ref_tryget() and percpu_ref_tryget_live() should return
"true" IFF they acquire a reference. But the return value from
atomic_long_inc_not_zero() is a long and may have high bits set,
e.g. PERCPU_COUNT_BIAS, and the return value of the tryget routines
is bool so the reference may actually be acquired but the routines
return "false" which results in a reference leak since the caller
assumes it does not need to do a corresponding percpu_ref_put().

This was seen when performing CPU hotplug during I/O, as hangs in
blk_mq_freeze_queue_wait where percpu_ref_kill (blk_mq_freeze_queue_start)
raced with percpu_ref_tryget (blk_mq_timeout_work).
Sample stack trace:

 __switch_to+0x2c0/0x450
 __schedule+0x2f8/0x970
 schedule+0x48/0xc0
 blk_mq_freeze_queue_wait+0x94/0x120
 blk_mq_queue_reinit_work+0xb8/0x180
 blk_mq_queue_reinit_prepare+0x84/0xa0
 cpuhp_invoke_callback+0x17c/0x600
 cpuhp_up_callbacks+0x58/0x150
 _cpu_up+0xf0/0x1c0
 do_cpu_up+0x120/0x150
 cpu_subsys_online+0x64/0xe0
 device_online+0xb4/0x120
 online_store+0xb4/0xc0
 dev_attr_store+0x68/0xa0
 sysfs_kf_write+0x80/0xb0
 kernfs_fop_write+0x17c/0x250
 __vfs_write+0x6c/0x1e0
 vfs_write+0xd0/0x270
 SyS_write+0x6c/0x110
 system_call+0x38/0xe0

Examination of the queue showed a single reference (no PERCPU_COUNT_BIAS,
and __PERCPU_REF_DEAD, __PERCPU_REF_ATOMIC set) and no requests.
However, conditions at the time of the race are count of PERCPU_COUNT_BIAS + 0
and __PERCPU_REF_DEAD and __PERCPU_REF_ATOMIC set.

The fix is to make the tryget routines use an actual boolean internally
instead of the atomic long result truncated to an int.

Fixes: e625305b3907 percpu-refcount: make percpu_ref based on longs instead of ints
Link: https://bugzilla.kernel.org/show_bug.cgi?id=190751
Signed-off-by: Douglas Miller
---
 include/linux/percpu-refcount.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 1c7eec0..3a481a4 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -204,7 +204,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 {
 	unsigned long __percpu *percpu_count;
-	int ret;
+	bool ret;
 
 	rcu_read_lock_sched();
 
@@ -238,7 +238,7 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
 {
 	unsigned long __percpu *percpu_count;
-	int ret = false;
+	bool ret = false;
 
 	rcu_read_lock_sched();
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
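The truncation described in this commit message can be reproduced in a small userspace sketch. It assumes a hypothetical inc-not-zero helper that reports success as the old, non-zero counter value rather than strictly 0 or 1; demo_inc_not_zero(), DEMO_BIAS and the two tryget variants are illustrative names, not kernel APIs. Narrowing a 64-bit value to int is implementation-defined in C; the sketch relies on the common behaviour of keeping the low 32 bits.

```c
#include <stdbool.h>

/* Stand-in for a bias such as PERCPU_COUNT_BIAS: all low 32 bits are zero. */
#define DEMO_BIAS (1LL << 62)

/*
 * Hypothetical inc-not-zero: bumps the counter unless it is zero and
 * returns the old value, so "success" is any non-zero result.
 */
static long long demo_inc_not_zero(long long *counter)
{
	long long old = *counter;

	if (old != 0)
		*counter = old + 1;
	return old;
}

/* Buggy shape: the wide result is squeezed through an int first. */
static bool demo_tryget_int(long long *counter)
{
	int ret = demo_inc_not_zero(counter);	/* high bits are dropped here */

	return ret;
}

/* Fixed shape: conversion to bool maps every non-zero value to true. */
static bool demo_tryget_bool(long long *counter)
{
	bool ret = demo_inc_not_zero(counter);

	return ret;
}
```

With the counter at DEMO_BIAS, both variants increment it, but only the bool variant reports that a reference was taken. The int variant returns false after incrementing, which is the leak the patch closes.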
Re: [PATCH V3 1/1] percpu-refcount: fix reference leak during percpu-atomic transition
On Sat, Jan 28, 2017 at 06:42:20AM -0600, Douglas Miller wrote:
> percpu_ref_tryget() and percpu_ref_tryget_live() should return
> "true" IFF they acquire a reference. But the return value from
> atomic_long_inc_not_zero() is a long and may have high bits set,
> e.g. PERCPU_COUNT_BIAS, and the return value of the tryget routines
> is bool so the reference may actually be acquired but the routines
> return "false" which results in a reference leak since the caller
> assumes it does not need to do a corresponding percpu_ref_put().

Applied to percpu/for-4.10-fixes w/ stable cc'd.

Thanks!

-- 
tejun
[RFC PATCH 04/17] dax: introduce dax_operations
Track a set of dax_operations per dax_inode that can be set at
alloc_dax_inode() time. These operations will be used to stop the abuse
of block_device_operations for communicating dax capabilities to
filesystems. It will also be used to replace the "pmem api" and move
pmem-specific cache maintenance, and other dax-driver-specific
filesystem-dax operations, to dax inode methods. In particular this
allows us to stop abusing __copy_user_nocache(), via memcpy_to_pmem(),
with a driver specific replacement.

This is a standalone introduction of the operations. Follow on patches
convert each dax-driver and teach fs/dax.c to use ->direct_access() from
dax_operations instead of block_device_operations.

Suggested-by: Christoph Hellwig
Signed-off-by: Dan Williams
---
 drivers/dax/dax.h    |    4 +++-
 drivers/dax/device.c |    6 +-
 drivers/dax/super.c  |    6 +-
 include/linux/dax.h  |    5 +
 4 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index f33c16ed2ec6..aeb1d49aafb8 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,7 +13,9 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_inode;
-struct dax_inode *alloc_dax_inode(void *private, const char *host);
+struct dax_operations;
+struct dax_inode *alloc_dax_inode(void *private, const char *host,
+		const struct dax_operations *ops);
 void put_dax_inode(struct dax_inode *dax_inode);
 bool dax_inode_alive(struct dax_inode *dax_inode);
 void kill_dax_inode(struct dax_inode *dax_inode);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 6d0a3241a608..c3d9405ec285 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -560,7 +560,11 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
 		goto err_id;
 	}
 
-	dax_inode = alloc_dax_inode(dax_dev, NULL);
+	/*
+	 * No 'host' or dax_operations since there is no access to this
+	 * device outside of mmap of the resulting character device.
+	 */
+	dax_inode = alloc_dax_inode(dax_dev, NULL, NULL);
 	if (!dax_inode)
 		goto err_inode;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 7ac048f94b2b..eb844ffea3cf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include
 
 static int nr_dax = CONFIG_NR_DEV_DAX;
@@ -61,6 +62,7 @@ struct dax_inode {
 	const char *host;
 	void *private;
 	bool alive;
+	const struct dax_operations *ops;
 };
 
 bool dax_inode_alive(struct dax_inode *dax_inode)
@@ -204,7 +206,8 @@ static void dax_add_host(struct dax_inode *dax_inode, const char *host)
 	spin_unlock(_host_lock);
 }
 
-struct dax_inode *alloc_dax_inode(void *private, const char *__host)
+struct dax_inode *alloc_dax_inode(void *private, const char *__host,
+		const struct dax_operations *ops)
 {
 	struct dax_inode *dax_inode;
 	const char *host;
@@ -225,6 +228,7 @@ struct dax_inode *alloc_dax_inode(void *private, const char *__host)
 		goto err_inode;
 
 	dax_add_host(dax_inode, host);
+	dax_inode->ops = ops;
 	dax_inode->private = private;
 	return dax_inode;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8fe19230e118..def9a9d118c9 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -7,6 +7,11 @@
 #include
 
 struct iomap_ops;
+struct dax_inode;
+struct dax_operations {
+	long (*direct_access)(struct dax_inode *, phys_addr_t, void **,
+			pfn_t *, long);
+};
 
 int dax_read_lock(void);
 void dax_read_unlock(int id);
[RFC PATCH 13/17] fs: update mount_bdev() to lookup dax infrastructure
This is in preparation for removing the ->direct_access() method from
block_device_operations.

Signed-off-by: Dan Williams
---
 fs/block_dev.c     |    6 --
 fs/super.c         |   32 +---
 include/linux/fs.h |    1 +
 3 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index bf4b51a3a412..a73f2388c515 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -806,14 +806,16 @@ int bdev_dax_supported(struct super_block *sb, int blocksize)
 		.sector = 0,
 		.size = PAGE_SIZE,
 	};
-	int err;
+	int err, id;
 
 	if (blocksize != PAGE_SIZE) {
 		vfs_msg(sb, KERN_ERR, "error: unsupported blocksize for dax");
 		return -EINVAL;
 	}
 
-	err = bdev_direct_access(sb->s_bdev, );
+	id = dax_read_lock();
+	err = bdev_dax_direct_access(sb->s_bdev, sb->s_dax, );
+	dax_read_unlock(id);
 	if (err < 0) {
 		switch (err) {
 		case -EOPNOTSUPP:
diff --git a/fs/super.c b/fs/super.c
index ea662b0e5e78..5e64d11c46c1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -26,6 +26,7 @@
 #include
 #include
 #include	/* for the emergency remount stuff */
+#include
 #include
 #include
 #include
@@ -1038,9 +1039,17 @@ struct dentry *mount_ns(struct file_system_type *fs_type,
 EXPORT_SYMBOL(mount_ns);
 
 #ifdef CONFIG_BLOCK
+struct mount_bdev_data {
+	struct block_device *bdev;
+	struct dax_inode *dax_inode;
+};
+
 static int set_bdev_super(struct super_block *s, void *data)
 {
-	s->s_bdev = data;
+	struct mount_bdev_data *mb_data = data;
+
+	s->s_bdev = mb_data->bdev;
+	s->s_dax = mb_data->dax_inode;
 	s->s_dev = s->s_bdev->bd_dev;
 
 	/*
@@ -1053,14 +1062,18 @@ static int set_bdev_super(struct super_block *s, void *data)
 
 static int test_bdev_super(struct super_block *s, void *data)
 {
-	return (void *)s->s_bdev == data;
+	struct mount_bdev_data *mb_data = data;
+
+	return s->s_bdev == mb_data->bdev;
 }
 
 struct dentry *mount_bdev(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data,
 	int (*fill_super)(struct super_block *, void *, int))
 {
+	struct mount_bdev_data mb_data;
 	struct block_device *bdev;
+	struct dax_inode *dax_inode;
 	struct super_block *s;
 	fmode_t mode = FMODE_READ | FMODE_EXCL;
 	int error = 0;
@@ -1072,6 +1085,11 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
 	if (IS_ERR(bdev))
 		return ERR_CAST(bdev);
 
+	if (IS_ENABLED(CONFIG_FS_DAX))
+		dax_inode = dax_get_by_host(bdev->bd_disk->disk_name);
+	else
+		dax_inode = NULL;
+
 	/*
 	 * once the super is inserted into the list by sget, s_umount
 	 * will protect the lockfs code from trying to start a snapshot
@@ -1083,8 +1101,13 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
 		error = -EBUSY;
 		goto error_bdev;
 	}
+
+	mb_data = (struct mount_bdev_data) {
+		.bdev = bdev,
+		.dax_inode = dax_inode,
+	};
 	s = sget(fs_type, test_bdev_super, set_bdev_super, flags | MS_NOSEC,
-		 bdev);
+		 _data);
 	mutex_unlock(>bd_fsfreeze_mutex);
 	if (IS_ERR(s))
 		goto error_s;
@@ -1126,6 +1149,7 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
 	error = PTR_ERR(s);
 error_bdev:
 	blkdev_put(bdev, mode);
+	put_dax_inode(dax_inode);
 error:
 	return ERR_PTR(error);
 }
@@ -1133,6 +1157,7 @@ EXPORT_SYMBOL(mount_bdev);
 
 void kill_block_super(struct super_block *sb)
 {
+	struct dax_inode *dax_inode = sb->s_dax;
 	struct block_device *bdev = sb->s_bdev;
 	fmode_t mode = sb->s_mode;
 
@@ -1141,6 +1166,7 @@ void kill_block_super(struct super_block *sb)
 	sync_blockdev(bdev);
 	WARN_ON_ONCE(!(mode & FMODE_EXCL));
 	blkdev_put(bdev, mode | FMODE_EXCL);
+	put_dax_inode(dax_inode);
 }
 
 EXPORT_SYMBOL(kill_block_super);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c930cbc19342..fdad43169146 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1313,6 +1313,7 @@ struct super_block {
 	struct hlist_bl_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
 	struct block_device	*s_bdev;
+	struct dax_inode	*s_dax;
 	struct backing_dev_info *s_bdi;
 	struct mtd_info		*s_mtd;
 	struct hlist_node	s_instances;
[RFC PATCH 15/17] Revert "block: use DAX for partition table reads"
commit d1a5f2b4d8a1 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.

Now that we are moving ->direct_access() from a block_device operation
to a dax_inode operation we would need block devices to map and carry
their own dax_inode reference.

Unless / until we decide to revive dax mapping of raw block devices
through the dax_inode scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
223757016837 ("block_dev: remove DAX leftovers").

Signed-off-by: Dan Williams
---
 block/partition-generic.c |   17 ++---
 fs/dax.c                  |   20 
 include/linux/dax.h       |    6 --
 3 files changed, 2 insertions(+), 41 deletions(-)

diff --git a/block/partition-generic.c b/block/partition-generic.c
index 7afb9907821f..5dfac337b0f2 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -16,7 +16,6 @@
 #include
 #include
 #include
-#include
 #include
 
 #include "partitions/check.h"
@@ -631,24 +630,12 @@ int invalidate_partitions(struct gendisk *disk, struct block_device *bdev)
 	return 0;
 }
 
-static struct page *read_pagecache_sector(struct block_device *bdev, sector_t n)
-{
-	struct address_space *mapping = bdev->bd_inode->i_mapping;
-
-	return read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)),
-			NULL);
-}
-
 unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
 {
+	struct address_space *mapping = bdev->bd_inode->i_mapping;
 	struct page *page;
 
-	/* don't populate page cache for dax capable devices */
-	if (IS_DAX(bdev->bd_inode))
-		page = read_dax_sector(bdev, n);
-	else
-		page = read_pagecache_sector(bdev, n);
-
+	page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)), NULL);
 	if (!IS_ERR(page)) {
 		if (PageError(page))
 			goto fail;
diff --git a/fs/dax.c b/fs/dax.c
index ddcddfeaa03b..a990211c8a3d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -97,26 +97,6 @@ static int dax_is_empty_entry(void *entry)
 	return (unsigned long)entry & RADIX_DAX_EMPTY;
 }
 
-struct page *read_dax_sector(struct block_device *bdev, sector_t n)
-{
-	struct page *page = alloc_pages(GFP_KERNEL, 0);
-	struct blk_dax_ctl dax = {
-		.size = PAGE_SIZE,
-		.sector = n & ~int) PAGE_SIZE) / 512) - 1),
-	};
-	long rc;
-
-	if (!page)
-		return ERR_PTR(-ENOMEM);
-
-	rc = dax_map_atomic(bdev, );
-	if (rc < 0)
-		return ERR_PTR(rc);
-	memcpy_from_pmem(page_address(page), dax.addr, PAGE_SIZE);
-	dax_unmap_atomic(bdev, );
-	return page;
-}
-
 /*
  * DAX radix tree locking
  */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 2ef8e18e2587..10b742af3d56 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -65,15 +65,9 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 		pgoff_t index, void *entry, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
-struct page *read_dax_sector(struct block_device *bdev, sector_t n);
 int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
 		unsigned int offset, unsigned int length);
 #else
-static inline struct page *read_dax_sector(struct block_device *bdev,
-		sector_t n)
-{
-	return ERR_PTR(-ENXIO);
-}
 static inline int __dax_zero_page_range(struct block_device *bdev,
 		sector_t sector, unsigned int offset, unsigned int length)
 {
[RFC PATCH 10/17] block: introduce bdev_dax_direct_access()
Provide a replacement for bdev_direct_access() that uses
dax_operations.direct_access() instead of
block_device_operations.direct_access(). Once all consumers of the old
api have been converted bdev_direct_access() will be deleted.

Given that block device partitioning decisions can cause dax page
alignment constraints to be violated we still need to validate the
block_device before calling the dax ->direct_access method.

Signed-off-by: Dan Williams
---
 block/Kconfig          |    1 +
 drivers/dax/super.c    |   33 +
 fs/block_dev.c         |   28 
 include/linux/blkdev.h |    3 +++
 include/linux/dax.h    |    2 ++
 5 files changed, 67 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index 8bf114a3858a..9be785173280 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -6,6 +6,7 @@ menuconfig BLOCK
 	default y
 	select SBITMAP
 	select SRCU
+	select DAX
 	help
 	 Provide block layer support for the kernel.
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index eb844ffea3cf..ab5b082df5dd 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -65,6 +65,39 @@ struct dax_inode {
 	const struct dax_operations *ops;
 };
 
+long dax_direct_access(struct dax_inode *dax_inode, phys_addr_t dev_addr,
+		void **kaddr, pfn_t *pfn, long size)
+{
+	long avail;
+
+	/*
+	 * The device driver is allowed to sleep, in order to make the
+	 * memory directly accessible.
+	 */
+	might_sleep();
+
+	if (!dax_inode)
+		return -EOPNOTSUPP;
+
+	if (!dax_inode_alive(dax_inode))
+		return -ENXIO;
+
+	if (size < 0)
+		return size;
+
+	if (dev_addr % PAGE_SIZE)
+		return -EINVAL;
+
+	avail = dax_inode->ops->direct_access(dax_inode, dev_addr, kaddr, pfn,
+			size);
+	if (!avail)
+		return -ERANGE;
+	if (avail > 0 && avail & ~PAGE_MASK)
+		return -ENXIO;
+	return min(avail, size);
+}
+EXPORT_SYMBOL_GPL(dax_direct_access);
+
 bool dax_inode_alive(struct dax_inode *dax_inode)
 {
 	lockdep_assert_held(_srcu);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index edb1d2b16b8f..bf4b51a3a412 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -763,6 +764,33 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
 EXPORT_SYMBOL_GPL(bdev_direct_access);
 
 /**
+ * bdev_dax_direct_access() - bdev-sector to pfn_t and kernel virtual address
+ * @bdev: host block device for @dax_inode
+ * @dax_inode: interface data and operations for a memory device
+ * @dax: control and output parameters for ->direct_access
+ *
+ * Return: negative errno if an error occurs, otherwise the number of bytes
+ * accessible at this address.
+ *
+ * Locking: must be called with dax_read_lock() held
+ */
+long bdev_dax_direct_access(struct block_device *bdev,
+		struct dax_inode *dax_inode, struct blk_dax_ctl *dax)
+{
+	sector_t sector = dax->sector;
+
+	if (!blk_queue_dax(bdev->bd_queue))
+		return -EOPNOTSUPP;
+	if ((sector + DIV_ROUND_UP(dax->size, 512))
+			> part_nr_sects_read(bdev->bd_part))
+		return -ERANGE;
+	sector += get_start_sect(bdev);
+	return dax_direct_access(dax_inode, sector * 512, >addr,
+			>pfn, dax->size);
+}
+EXPORT_SYMBOL_GPL(bdev_dax_direct_access);
+
+/**
  * bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
  * @blocksize: The block size of the device
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5e7706f7d533..3b3c5ce376fd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1903,6 +1903,9 @@ extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
+struct dax_inode;
+extern long bdev_dax_direct_access(struct block_device *bdev,
+		struct dax_inode *dax_inode, struct blk_dax_ctl *dax);
 extern int bdev_dax_supported(struct super_block *, int);
 #else /* CONFIG_BLOCK */
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 5aa620e8e5a2..2ef8e18e2587 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -22,6 +22,8 @@ void *dax_inode_get_private(struct dax_inode *dax_inode);
 void put_dax_inode(struct dax_inode *dax_inode);
 bool dax_inode_alive(struct dax_inode *dax_inode);
 void kill_dax_inode(struct dax_inode *dax_inode);
+long dax_direct_access(struct dax_inode *dax_inode, phys_addr_t dev_addr,
+
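The ordering of the checks in dax_direct_access() can be exercised outside the kernel with a small model. This is a sketch under stated assumptions, not the patch's code: the dax_inode and its ops are reduced to a callback pointer and an "alive" flag, kaddr/pfn outputs are omitted, the page size is fixed at 4096, and demo_dev() is a made-up 8-page device. All demo_* and DEMO_* names are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096L

/* Hypothetical driver callback: returns bytes available at dev_addr. */
typedef long (*demo_direct_access_fn)(unsigned long long dev_addr, long size);

/* Made-up device of 8 pages; reports 0 once dev_addr is past the end. */
static long demo_dev(unsigned long long dev_addr, long size)
{
	const long total = 8 * DEMO_PAGE_SIZE;

	(void)size;
	if (dev_addr >= (unsigned long long)total)
		return 0;
	return total - (long)dev_addr;
}

/* Mirrors the check order of the patch's dax_direct_access(). */
static long demo_dax_direct_access(demo_direct_access_fn fn, int alive,
		unsigned long long dev_addr, long size)
{
	long avail;

	if (!fn)				/* no dax device attached */
		return -EOPNOTSUPP;
	if (!alive)				/* device was killed */
		return -ENXIO;
	if (size < 0)				/* pass caller errors through */
		return size;
	if (dev_addr % DEMO_PAGE_SIZE)		/* must be page aligned */
		return -EINVAL;

	avail = fn(dev_addr, size);
	if (!avail)
		return -ERANGE;
	/* Same test as "avail & ~PAGE_MASK": a partial page is rejected. */
	if (avail > 0 && (avail & (DEMO_PAGE_SIZE - 1)))
		return -ENXIO;
	/* A negative avail is smaller than size, so errors propagate. */
	return avail < size ? avail : size;
}
```

Note that the result is clamped to the requested size, so a caller asking for two pages at the last page of the device gets one page back rather than an error.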
[RFC PATCH 05/17] pmem: add dax_operations support
Set up a dax_inode to have the same lifetime as the pmem block device and
add a ->direct_access() method that is equivalent to
pmem_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old pmem_direct_access() will be removed.

Signed-off-by: Dan Williams
---
 drivers/dax/dax.h               |    7 -
 drivers/nvdimm/Kconfig          |    1 +
 drivers/nvdimm/pmem.c           |   55 +++
 drivers/nvdimm/pmem.h           |    7 -
 include/linux/dax.h             |    6 
 tools/testing/nvdimm/pmem-dax.c |   12 -
 6 files changed, 61 insertions(+), 27 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index aeb1d49aafb8..b4c686d2d446 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,15 +13,8 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_inode;
-struct dax_operations;
-struct dax_inode *alloc_dax_inode(void *private, const char *host,
-		const struct dax_operations *ops);
-void put_dax_inode(struct dax_inode *dax_inode);
-bool dax_inode_alive(struct dax_inode *dax_inode);
-void kill_dax_inode(struct dax_inode *dax_inode);
 struct dax_inode *inode_to_dax_inode(struct inode *inode);
 struct inode *dax_inode_to_inode(struct dax_inode *dax_inode);
-void *dax_inode_get_private(struct dax_inode *dax_inode);
 int dax_inode_register(struct dax_inode *dax_inode,
 		const struct file_operations *fops, struct module *owner,
 		struct kobject *parent);
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 59e750183b7f..5bdd499b5f4f 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -20,6 +20,7 @@ if LIBNVDIMM
 config BLK_DEV_PMEM
 	tristate "PMEM: Persistent memory block device support"
 	default LIBNVDIMM
+	select DAX
 	select ND_BTT if BTT
 	select ND_PFN if NVDIMM_PFN
 	help
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5b536be5a12e..d3d7de645e20 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "pmem.h"
 #include "pfn.h"
@@ -199,13 +200,12 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
-__weak long pmem_direct_access(struct block_device *bdev, sector_t sector,
-		void **kaddr, pfn_t *pfn, long size)
+__weak long __pmem_direct_access(struct pmem_device *pmem, phys_addr_t dev_addr,
+		void **kaddr, pfn_t *pfn, long size)
 {
-	struct pmem_device *pmem = bdev->bd_queue->queuedata;
-	resource_size_t offset = sector * 512 + pmem->data_offset;
+	resource_size_t offset = dev_addr + pmem->data_offset;
 
-	if (unlikely(is_bad_pmem(>bb, sector, size)))
+	if (unlikely(is_bad_pmem(>bb, dev_addr / 512, size)))
 		return -EIO;
 	*kaddr = pmem->virt_addr + offset;
 	*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
@@ -219,22 +219,46 @@ __weak long pmem_direct_access(struct block_device *bdev, sector_t sector,
 	return pmem->size - pmem->pfn_pad - offset;
 }
 
+static long pmem_blk_direct_access(struct block_device *bdev, sector_t sector,
+		void **kaddr, pfn_t *pfn, long size)
+{
+	struct pmem_device *pmem = bdev->bd_queue->queuedata;
+
+	return __pmem_direct_access(pmem, sector * 512, kaddr, pfn, size);
+}
+
 static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		pmem_rw_page,
-	.direct_access =	pmem_direct_access,
+	.direct_access =	pmem_blk_direct_access,
 	.revalidate_disk =	nvdimm_revalidate_disk,
 };
 
+static long pmem_dax_direct_access(struct dax_inode *dax_inode,
+		phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size)
+{
+	struct pmem_device *pmem = dax_inode_get_private(dax_inode);
+
+	return __pmem_direct_access(pmem, dev_addr, kaddr, pfn, size);
+}
+
+static const struct dax_operations pmem_dax_ops = {
+	.direct_access = pmem_dax_direct_access,
+};
+
 static void pmem_release_queue(void *q)
 {
 	blk_cleanup_queue(q);
 }
 
-static void pmem_release_disk(void *disk)
+static void pmem_release_disk(void *__pmem)
 {
-	del_gendisk(disk);
-	put_disk(disk);
+	struct pmem_device *pmem = __pmem;
+
+	kill_dax_inode(pmem->dax_inode);
+	put_dax_inode(pmem->dax_inode);
+	del_gendisk(pmem->disk);
+	put_disk(pmem->disk);
 }
 
 static int pmem_attach_disk(struct device *dev,
@@ -245,6 +269,7 @@ static int pmem_attach_disk(struct device *dev,
 	struct vmem_altmap __altmap, *altmap = NULL;
 	struct resource *res = >res;
 	struct nd_pfn *nd_pfn = NULL;
+	struct dax_inode *dax_inode;
 	int nid = dev_to_node(dev);
 	struct nd_pfn_sb
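The refactoring in this patch, one byte-addressed helper with a sector-based wrapper for the block layer and a byte-based wrapper for dax, can be sketched in userspace as follows. This is an illustration only: struct demo_pmem and the demo_* functions are hypothetical, and the bad-block check, pfn output and pfn_pad handling of the real helper are left out.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct pmem_device. */
struct demo_pmem {
	char *virt_addr;	/* start of the mapped range */
	size_t data_offset;	/* offset of user data within the range */
	size_t size;		/* total bytes in the range */
};

/* Common helper: takes a byte address, like __pmem_direct_access(). */
static long demo_direct_access(struct demo_pmem *p,
		unsigned long long dev_addr, void **kaddr)
{
	size_t offset = (size_t)dev_addr + p->data_offset;

	*kaddr = p->virt_addr + offset;
	return (long)(p->size - offset);	/* bytes left after offset */
}

/* Block-layer flavour: callers speak in 512-byte sectors. */
static long demo_blk_direct_access(struct demo_pmem *p,
		unsigned long long sector, void **kaddr)
{
	return demo_direct_access(p, sector * 512, kaddr);
}

/* dax flavour: callers already speak in bytes. */
static long demo_dax_direct_access(struct demo_pmem *p,
		unsigned long long dev_addr, void **kaddr)
{
	return demo_direct_access(p, dev_addr, kaddr);
}
```

Sector 8 and byte address 4096 name the same location, so both wrappers should resolve to the same kernel address and the same remaining length.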
[RFC PATCH 09/17] block: kill bdev_dax_capable()
This is leftover dead code that has since been replaced by
bdev_dax_supported().

Signed-off-by: Dan Williams
---
 fs/block_dev.c         |   24 
 include/linux/blkdev.h |    1 -
 2 files changed, 25 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 601b71b76d7f..edb1d2b16b8f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -807,30 +807,6 @@ int bdev_dax_supported(struct super_block *sb, int blocksize)
 }
 EXPORT_SYMBOL_GPL(bdev_dax_supported);
 
-/**
- * bdev_dax_capable() - Return if the raw device is capable for dax
- * @bdev: The device for raw block device access
- */
-bool bdev_dax_capable(struct block_device *bdev)
-{
-	struct blk_dax_ctl dax = {
-		.size = PAGE_SIZE,
-	};
-
-	if (!IS_ENABLED(CONFIG_FS_DAX))
-		return false;
-
-	dax.sector = 0;
-	if (bdev_direct_access(bdev, ) < 0)
-		return false;
-
-	dax.sector = bdev->bd_part->nr_sects - (PAGE_SIZE / 512);
-	if (bdev_direct_access(bdev, ) < 0)
-		return false;
-
-	return true;
-}
-
 /*
  * pseudo-fs
  */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3c0ff78b1219..5e7706f7d533 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1904,7 +1904,6 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 extern int bdev_dax_supported(struct super_block *, int);
-extern bool bdev_dax_capable(struct block_device *);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
[RFC PATCH 08/17] dcssblk: add dax_operations support
Set up a dax_inode to have the same lifetime as the dcssblk block device
and add a ->direct_access() method that is equivalent to
dcssblk_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old dcssblk_direct_access() will be removed.

Signed-off-by: Dan Williams
---
 drivers/s390/block/Kconfig   |    1 +
 drivers/s390/block/dcssblk.c |   53 +++---
 2 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index 4a3b62326183..0acb8c2f9475 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -14,6 +14,7 @@ config BLK_DEV_XPRAM
 
 config DCSSBLK
 	def_tristate m
+	select DAX
 	prompt "DCSSBLK support"
 	depends on S390 && BLOCK
 	help
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 9d66b4fb174b..67b0885b4d12 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -30,8 +31,10 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static blk_qc_t dcssblk_make_request(struct request_queue *q,
 						struct bio *bio);
-static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+static long dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
 			 void **kaddr, pfn_t *pfn, long size);
+static long dcssblk_dax_direct_access(struct dax_inode *dax_inode,
+		phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -40,7 +43,11 @@ static const struct block_device_operations dcssblk_devops = {
 	.owner   	= THIS_MODULE,
 	.open    	= dcssblk_open,
 	.release 	= dcssblk_release,
-	.direct_access 	= dcssblk_direct_access,
+	.direct_access 	= dcssblk_blk_direct_access,
+};
+
+static const struct dax_operations dcssblk_dax_ops = {
+	.direct_access 	= dcssblk_dax_direct_access,
 };
 
 struct dcssblk_dev_info {
@@ -57,6 +64,7 @@ struct dcssblk_dev_info {
 	struct request_queue *dcssblk_queue;
 	int num_of_segments;
 	struct list_head seg_list;
+	struct dax_inode *dax_inode;
 };
 
 struct segment_info {
@@ -389,6 +397,8 @@ dcssblk_shared_store(struct device *dev, struct device_attribute *attr, const ch
 	}
 
 	list_del(_info->lh);
+	kill_dax_inode(dev_info->dax_inode);
+	put_dax_inode(dev_info->dax_inode);
 	del_gendisk(dev_info->gd);
 	blk_cleanup_queue(dev_info->dcssblk_queue);
 	dev_info->gd->queue = NULL;
@@ -525,6 +535,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
 	int rc, i, j, num_of_segments;
 	struct dcssblk_dev_info *dev_info;
 	struct segment_info *seg_info, *temp;
+	struct dax_inode *dax_inode;
 	char *local_buf;
 	unsigned long seg_byte_size;
 
@@ -654,6 +665,11 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
 	if (rc)
 		goto put_dev;
 
+	dax_inode = alloc_dax_inode(dev_info, dev_info->gd->disk_name,
+			_dax_ops);
+	if (!dax_inode)
+		goto put_dev;
+
 	get_device(_info->dev);
 	device_add_disk(_info->dev, dev_info->gd);
 
@@ -752,6 +768,8 @@ dcssblk_remove_store(struct device *dev, struct device_attribute *attr, const ch
 	}
 
 	list_del(_info->lh);
+	kill_dax_inode(dev_info->dax_inode);
+	put_dax_inode(dev_info->dax_inode);
 	del_gendisk(dev_info->gd);
 	blk_cleanup_queue(dev_info->dcssblk_queue);
 	dev_info->gd->queue = NULL;
@@ -883,21 +901,38 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
 }
 
 static long
-dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
+__dcssblk_direct_access(struct dcssblk_dev_info *dev_info, phys_addr_t offset,
+		void **kaddr, pfn_t *pfn, long size)
+{
+	unsigned long dev_sz;
+
+	dev_sz = dev_info->end - dev_info->start;
+	*kaddr = (void *) dev_info->start + offset;
+	*pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
+
+	return dev_sz - offset;
+}
+
+static long
+dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
 			void **kaddr, pfn_t *pfn, long size)
 {
 	struct dcssblk_dev_info *dev_info;
-	unsigned long offset, dev_sz;
 
 	dev_info = bdev->bd_disk->private_data;
 	if (!dev_info)
 		return -ENODEV;
-	dev_sz = dev_info->end - dev_info->start;
-	offset = secnum * 512;
-	*kaddr = (void *) dev_info->start + offset;
-	*pfn =
[RFC PATCH 07/17] brd: add dax_operations support
Set up a dax_inode with the same lifetime as the brd block device and add
a ->direct_access() method that is equivalent to brd_direct_access(). Once
fs/dax.c has been converted to use dax_operations, the old
brd_direct_access() will be removed.

Signed-off-by: Dan Williams
---
 drivers/block/Kconfig |    1 +
 drivers/block/brd.c   |   57 +
 2 files changed, 49 insertions(+), 9 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 223ff2fcae7e..604b51a884b6 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -337,6 +337,7 @@ config BLK_DEV_SX8

 config BLK_DEV_RAM
 	tristate "RAM block device support"
+	select DAX if BLK_DEV_RAM_DAX
 	---help---
 	  Saying Y here will allow you to use a portion of your RAM memory as
 	  a block device, so that you can make file systems on it, read and
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..1279df4dc07c 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -21,6 +21,7 @@
 #include
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 #include
+#include
 #endif

 #include
@@ -41,6 +42,9 @@ struct brd_device {

 	struct request_queue	*brd_queue;
 	struct gendisk		*brd_disk;
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+	struct dax_inode	*dax_inode;
+#endif
 	struct list_head	brd_list;

 	/*
@@ -375,15 +379,14 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 }

 #ifdef CONFIG_BLK_DEV_RAM_DAX
-static long brd_direct_access(struct block_device *bdev, sector_t sector,
+static long __brd_direct_access(struct brd_device *brd, phys_addr_t dev_addr,
 		void **kaddr, pfn_t *pfn, long size)
 {
-	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;

 	if (!brd)
 		return -ENODEV;
-	page = brd_insert_page(brd, sector);
+	page = brd_insert_page(brd, dev_addr / 512);
 	if (!page)
 		return -ENOSPC;
 	*kaddr = page_address(page);
@@ -391,14 +394,34 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,

 	return PAGE_SIZE;
 }
+
+static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
+		void **kaddr, pfn_t *pfn, long size)
+{
+	struct brd_device *brd = bdev->bd_disk->private_data;
+
+	return __brd_direct_access(brd, sector * 512, kaddr, pfn, size);
+}
+
+static long brd_dax_direct_access(struct dax_inode *dax_inode,
+		phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size)
+{
+	struct brd_device *brd = dax_inode_get_private(dax_inode);
+
+	return __brd_direct_access(brd, dev_addr, kaddr, pfn, size);
+}
+
+static const struct dax_operations brd_dax_ops = {
+	.direct_access = brd_dax_direct_access,
+};
 #else
-#define brd_direct_access NULL
+#define brd_blk_direct_access NULL
 #endif

 static const struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		brd_rw_page,
-	.direct_access =	brd_direct_access,
+	.direct_access =	brd_blk_direct_access,
 };

 /*
@@ -441,7 +464,9 @@ static struct brd_device *brd_alloc(int i)
 {
 	struct brd_device *brd;
 	struct gendisk *disk;
-
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+	struct dax_inode *dax_inode;
+#endif
 	brd = kzalloc(sizeof(*brd), GFP_KERNEL);
 	if (!brd)
 		goto out;
@@ -469,9 +494,6 @@ static struct brd_device *brd_alloc(int i)
 	blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
 	brd->brd_queue->limits.discard_zeroes_data = 1;
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
-#ifdef CONFIG_BLK_DEV_RAM_DAX
-	queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
-#endif
 	disk = brd->brd_disk = alloc_disk(max_part);
 	if (!disk)
 		goto out_free_queue;
@@ -484,8 +506,21 @@ static struct brd_device *brd_alloc(int i)
 	sprintf(disk->disk_name, "ram%d", i);
 	set_capacity(disk, rd_size * 2);

+#ifdef CONFIG_BLK_DEV_RAM_DAX
+	queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
+	dax_inode = alloc_dax_inode(brd, disk->disk_name, &brd_dax_ops);
+	if (!dax_inode)
+		goto out_free_inode;
+#endif
+
+
 	return brd;

+#ifdef CONFIG_BLK_DEV_RAM_DAX
+out_free_inode:
+	kill_dax_inode(dax_inode);
+	put_dax_inode(dax_inode);
+#endif
 out_free_queue:
 	blk_cleanup_queue(brd->brd_queue);
 out_free_dev:
@@ -525,6 +560,10 @@ static struct brd_device *brd_init_one(int i, bool *new)
 static void brd_del_one(struct brd_device *brd)
 {
 	list_del(&brd->brd_list);
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+	kill_dax_inode(brd->dax_inode);
+	put_dax_inode(brd->dax_inode);
+#endif
 	del_gendisk(brd->brd_disk);
 	brd_free(brd);
 }
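The split in this patch, one byte-addressed helper shared by a sector-based wrapper and a dax wrapper, can be modelled in userspace. The sketch below is not the brd code: `ram_model`, `model_direct_access()` and the two wrappers are hypothetical names, a flat buffer stands in for brd's page tree, and the page size is an assumption. Only the 512-byte sector conversion is taken from the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SECTOR_SIZE 512u   /* mirrors the "* 512" and "/ 512" in the patch */
#define MODEL_PAGE_SIZE   4096u  /* assumed page size for this model */

/* Hypothetical stand-in for struct brd_device: one flat buffer whose
 * size is assumed to be a multiple of MODEL_PAGE_SIZE. */
struct ram_model {
	unsigned char *base;
	size_t size;
};

/* Shared helper: takes a byte offset and returns the page containing it. */
static long model_direct_access(struct ram_model *ram, uint64_t dev_addr,
				void **kaddr)
{
	uint64_t page_start;

	assert(kaddr != NULL);
	if (!ram)
		return -ENODEV;
	if (dev_addr >= ram->size)
		return -ENOSPC;

	page_start = dev_addr - (dev_addr % MODEL_PAGE_SIZE);
	*kaddr = ram->base + page_start;
	return MODEL_PAGE_SIZE;
}

/* Block-style entry point: converts sectors to bytes, then delegates. */
static long model_blk_direct_access(struct ram_model *ram, uint64_t sector,
				    void **kaddr)
{
	return model_direct_access(ram, sector * MODEL_SECTOR_SIZE, kaddr);
}

/* Dax-style entry point: already byte-addressed, delegates directly. */
static long model_dax_direct_access(struct ram_model *ram, uint64_t dev_addr,
				    void **kaddr)
{
	return model_direct_access(ram, dev_addr, kaddr);
}
```

Both entry points resolve the same location to the same page, which is the property that lets the sector-based wrapper be dropped later in the series without changing what callers see.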
[RFC PATCH 06/17] axon_ram: add dax_operations support
Set up a dax_inode with the same lifetime as the axon_ram block device and
add a ->direct_access() method that is equivalent to
axon_ram_direct_access(). Once fs/dax.c has been converted to use
dax_operations, the old axon_ram_direct_access() will be removed.
---
 arch/powerpc/platforms/Kconfig |    1 +
 arch/powerpc/sysdev/axonram.c  |   46 +++-
 2 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 7e3a2ebba29b..33244e3d9375 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -284,6 +284,7 @@ config CPM2
 config AXON_RAM
 	tristate "Axon DDR2 memory device driver"
 	depends on PPC_IBM_CELL_BLADE && BLOCK
+	select DAX
 	default m
 	help
 	  It registers one block device per Axon's DDR2 memory bank found
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ada29eaed6e2..4e1f58187726 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -25,6 +25,7 @@

 #include
 #include
+#include
 #include
 #include
 #include
@@ -62,6 +63,7 @@ static int azfs_major, azfs_minor;
 struct axon_ram_bank {
 	struct platform_device	*device;
 	struct gendisk		*disk;
+	struct dax_inode	*dax_inode;
 	unsigned int		irq_id;
 	unsigned long		ph_addr;
 	unsigned long		io_addr;
@@ -137,25 +139,45 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
 	return BLK_QC_T_NONE;
 }

+static long
+__axon_ram_direct_access(struct axon_ram_bank *bank, phys_addr_t offset,
+		       void **kaddr, pfn_t *pfn, long size)
+{
+	*kaddr = (void *) bank->io_addr + offset;
+	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+	return bank->size - offset;
+}
+
 /**
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
 static long
-axon_ram_direct_access(struct block_device *device, sector_t sector,
+axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
 		       void **kaddr, pfn_t *pfn, long size)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
-	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;

-	*kaddr = (void *) bank->io_addr + offset;
-	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
-	return bank->size - offset;
+	return __axon_ram_direct_access(bank, sector << AXON_RAM_SECTOR_SHIFT,
+			kaddr, pfn, size);
 }

 static const struct block_device_operations axon_ram_devops = {
 	.owner		= THIS_MODULE,
-	.direct_access	= axon_ram_direct_access
+	.direct_access	= axon_ram_blk_direct_access
+};
+
+static long
+axon_ram_dax_direct_access(struct dax_inode *dax_inode, phys_addr_t dev_addr,
+		       void **kaddr, pfn_t *pfn, long size)
+{
+	struct axon_ram_bank *bank = dax_inode_get_private(dax_inode);
+
+	return __axon_ram_direct_access(bank, dev_addr, kaddr, pfn, size);
+}
+
+static const struct dax_operations axon_ram_dax_ops = {
+	.direct_access = axon_ram_dax_direct_access,
 };

 /**
@@ -219,6 +241,7 @@ static int axon_ram_probe(struct platform_device *device)
 		goto failed;
 	}

+
 	bank->disk->major = azfs_major;
 	bank->disk->first_minor = azfs_minor;
 	bank->disk->fops = &axon_ram_devops;
@@ -227,6 +250,11 @@ static int axon_ram_probe(struct platform_device *device)
 	sprintf(bank->disk->disk_name, "%s%d",
 			AXON_RAM_DEVICE_NAME, axon_ram_bank_id);

+	bank->dax_inode = alloc_dax_inode(bank, bank->disk->disk_name,
+			&axon_ram_dax_ops);
+	if (!bank->dax_inode)
+		goto failed;
+
 	bank->disk->queue = blk_alloc_queue(GFP_KERNEL);
 	if (bank->disk->queue == NULL) {
 		dev_err(&device->dev, "Cannot register disk queue\n");
@@ -276,6 +304,10 @@ static int axon_ram_probe(struct platform_device *device)
 					bank->disk->disk_name);
 			del_gendisk(bank->disk);
 		}
+		if (bank->dax_inode) {
+			kill_dax_inode(bank->dax_inode);
+			put_dax_inode(bank->dax_inode);
+		}
 		device->dev.platform_data = NULL;
 		if (bank->io_addr != 0)
 			iounmap((void __iomem *) bank->io_addr);
@@ -298,6 +330,8 @@ axon_ram_remove(struct platform_device *device)

 	device_remove_file(&device->dev, &dev_attr_ecc);
 	free_irq(bank->irq_id, device);
+	kill_dax_inode(bank->dax_inode);
+	put_dax_inode(bank->dax_inode);
 	del_gendisk(bank->disk);
 	iounmap((void __iomem *) bank->io_addr);
 	kfree(bank);
--
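One detail worth a second look in this patch: the old wrapper widened the sector with an explicit (loff_t) cast before shifting, while the new one passes `sector << AXON_RAM_SECTOR_SHIFT` directly to a phys_addr_t parameter. As far as I can tell the two are equivalent when sector_t is 64 bits wide, which should be the case on the 64-bit platform this driver targets. The userspace sketch below only illustrates why the order of widening and shifting matters when the shifted type is narrower than the destination; the function names and the shift value of 9 are assumptions, not taken from the driver.

```c
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9 /* assumed value: 512-byte sectors */

/* Shift is performed in 32 bits, so high bits are lost before widening. */
static uint64_t sector_to_bytes_narrow(uint32_t sector)
{
	uint32_t bytes = sector << MODEL_SECTOR_SHIFT;

	return bytes;
}

/* Widen first, then shift: the full byte offset survives. */
static uint64_t sector_to_bytes_wide(uint32_t sector)
{
	return (uint64_t)sector << MODEL_SECTOR_SHIFT;
}
```

For small sector numbers both variants agree. They diverge once the byte offset no longer fits in 32 bits, that is, at or beyond 4 GiB into the device.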
[RFC PATCH 11/17] dm: add dax_operations support (producer)
Set up a dax_inode with the same lifetime as the dm block device and add a
->direct_access() method that is equivalent to dm_blk_direct_access(). Once
fs/dax.c has been converted to use dax_operations, the old
dm_blk_direct_access() will be removed.

This enabling covers only the top-level dm representation presented to
upper layers. Subsequent patches are needed to convert the bottom-layer
interface to backing devices.

Signed-off-by: Dan Williams
---
 drivers/md/Kconfig   |    1 +
 drivers/md/dm-core.h |    3 +++
 drivers/md/dm.c      |   42 +++---
 3 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index b7767da50c26..1de8372d9459 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
 config BLK_DEV_DM
 	tristate "Device mapper support"
 	select BLK_DEV_DM_BUILTIN
+	select DAX
 	---help---
 	  Device-mapper is a low level volume manager.  It works by allowing
 	  people to specify mappings for ranges of logical sectors.  Various
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 40ceba1fe8be..f6eb8d8db646 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -24,6 +24,8 @@ struct dm_kobject_holder {
 	struct completion completion;
 };

+struct dax_inode;
+
 /*
  * DM core internal structure that used directly by dm.c and dm-rq.c
  * DM targets must _not_ deference a mapped_device to directly access its members!
@@ -58,6 +60,7 @@ struct mapped_device { struct target_type *immutable_target_type; struct gendisk *disk; + struct dax_inode *dax_inode; char name[16]; void *interface_ptr; diff --git a/drivers/md/dm.c b/drivers/md/dm.c index db934b1dba9d..1b3d9253e92c 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -905,10 +906,10 @@ int dm_set_target_max_io_len(struct dm_target *ti, sector_t len) } EXPORT_SYMBOL_GPL(dm_set_target_max_io_len); -static long dm_blk_direct_access(struct block_device *bdev, sector_t sector, -void **kaddr, pfn_t *pfn, long size) +static long __dm_direct_access(struct mapped_device *md, phys_addr_t dev_addr, + void **kaddr, pfn_t *pfn, long size) { - struct mapped_device *md = bdev->bd_disk->private_data; + sector_t sector = dev_addr >> SECTOR_SHIFT; struct dm_table *map; struct dm_target *ti; int srcu_idx; @@ -932,6 +933,23 @@ static long dm_blk_direct_access(struct block_device *bdev, sector_t sector, return min(ret, size); } +static long dm_blk_direct_access(struct block_device *bdev, sector_t sector, +void **kaddr, pfn_t *pfn, long size) +{ + struct mapped_device *md = bdev->bd_disk->private_data; + + return __dm_direct_access(md, sector << SECTOR_SHIFT, kaddr, pfn, size); +} + +static long dm_dax_direct_access(struct dax_inode *dax_inode, +phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, +long size) +{ + struct mapped_device *md = dax_inode_get_private(dax_inode); + + return __dm_direct_access(md, dev_addr, kaddr, pfn, size); +} + /* * A target may call dm_accept_partial_bio only from the map routine. It is * allowed for all bio types except REQ_PREFLUSH. 
@@ -1376,6 +1394,7 @@ static int next_free_minor(int *minor) } static const struct block_device_operations dm_blk_dops; +static const struct dax_operations dm_dax_ops; static void dm_wq_work(struct work_struct *work); @@ -1423,6 +1442,12 @@ static void cleanup_mapped_device(struct mapped_device *md) if (md->bs) bioset_free(md->bs); + if (md->dax_inode) { + kill_dax_inode(md->dax_inode); + put_dax_inode(md->dax_inode); + md->dax_inode = NULL; + } + if (md->disk) { spin_lock(&_minor_lock); md->disk->private_data = NULL; @@ -1450,6 +1475,7 @@ static void cleanup_mapped_device(struct mapped_device *md) static struct mapped_device *alloc_dev(int minor) { int r, numa_node_id = dm_get_numa_node(); + struct dax_inode *dax_inode; struct mapped_device *md; void *old_md; @@ -1514,6 +1540,12 @@ static struct mapped_device *alloc_dev(int minor) md->disk->queue = md->queue; md->disk->private_data = md; sprintf(md->disk->disk_name, "dm-%d", minor); + + dax_inode = alloc_dax_inode(md, md->disk->disk_name, _dax_ops); + if (!dax_inode) + goto bad; + md->dax_inode = dax_inode; + add_disk(md->disk); format_dev_t(md->name, MKDEV(_major, minor)); @@ -2735,6 +2767,10 @@ static const struct block_device_operations
[RFC PATCH 16/17] fs, dax: convert filesystem-dax to bdev_dax_direct_access
Now that a dax_inode is plumbed through all dax-capable drivers we can switch from block_device_operations to dax_operations for invoking ->direct_access. Signed-off-by: Dan Williams--- fs/dax.c| 143 +++ fs/iomap.c |3 + include/linux/dax.h |6 +- 3 files changed, 82 insertions(+), 70 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index a990211c8a3d..07b36a26db06 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -51,32 +51,6 @@ static int __init init_dax_wait_table(void) } fs_initcall(init_dax_wait_table); -static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax) -{ - struct request_queue *q = bdev->bd_queue; - long rc = -EIO; - - dax->addr = ERR_PTR(-EIO); - if (blk_queue_enter(q, true) != 0) - return rc; - - rc = bdev_direct_access(bdev, dax); - if (rc < 0) { - dax->addr = ERR_PTR(rc); - blk_queue_exit(q); - return rc; - } - return rc; -} - -static void dax_unmap_atomic(struct block_device *bdev, - const struct blk_dax_ctl *dax) -{ - if (IS_ERR(dax->addr)) - return; - blk_queue_exit(bdev->bd_queue); -} - static int dax_is_pmd_entry(void *entry) { return (unsigned long)entry & RADIX_DAX_PMD; @@ -549,21 +523,28 @@ static int dax_load_hole(struct address_space *mapping, void **entry, return ret; } -static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size, - struct page *to, unsigned long vaddr) +static int copy_user_dax(struct block_device *bdev, struct dax_inode *dax_inode, + sector_t sector, size_t size, struct page *to, + unsigned long vaddr) { struct blk_dax_ctl dax = { .sector = sector, .size = size, }; void *vto; + long rc; + int id; - if (dax_map_atomic(bdev, ) < 0) - return PTR_ERR(dax.addr); + id = dax_read_lock(); + rc = bdev_dax_direct_access(bdev, dax_inode, ); + if (rc < 0) { + dax_read_unlock(id); + return rc; + } vto = kmap_atomic(to); copy_user_page(vto, (void __force *)dax.addr, vaddr, to); kunmap_atomic(vto); - dax_unmap_atomic(bdev, ); + dax_read_unlock(id); return 0; } @@ -731,12 +712,13 @@ static void 
dax_mapping_entry_mkclean(struct address_space *mapping, } static int dax_writeback_one(struct block_device *bdev, - struct address_space *mapping, pgoff_t index, void *entry) + struct dax_inode *dax_inode, struct address_space *mapping, + pgoff_t index, void *entry) { struct radix_tree_root *page_tree = >page_tree; struct blk_dax_ctl dax; void *entry2, **slot; - int ret = 0; + int ret = 0, id; /* * A page got tagged dirty in DAX mapping? Something is seriously @@ -789,18 +771,20 @@ static int dax_writeback_one(struct block_device *bdev, dax.size = PAGE_SIZE << dax_radix_order(entry); /* -* We cannot hold tree_lock while calling dax_map_atomic() because it -* eventually calls cond_resched(). +* bdev_dax_direct_access() may sleep, so cannot hold tree_lock +* over its invocation. */ - ret = dax_map_atomic(bdev, ); + id = dax_read_lock(); + ret = bdev_dax_direct_access(bdev, dax_inode, ); if (ret < 0) { + dax_read_unlock(id); put_locked_mapping_entry(mapping, index, entry); return ret; } if (WARN_ON_ONCE(ret < dax.size)) { ret = -EIO; - goto unmap; + goto dax_unlock; } dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn)); @@ -814,8 +798,8 @@ static int dax_writeback_one(struct block_device *bdev, spin_lock_irq(>tree_lock); radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY); spin_unlock_irq(>tree_lock); - unmap: - dax_unmap_atomic(bdev, ); + dax_unlock: + dax_read_unlock(id); put_locked_mapping_entry(mapping, index, entry); return ret; @@ -836,6 +820,7 @@ int dax_writeback_mapping_range(struct address_space *mapping, struct inode *inode = mapping->host; pgoff_t start_index, end_index; pgoff_t indices[PAGEVEC_SIZE]; + struct dax_inode *dax_inode; struct pagevec pvec; bool done = false; int i, ret = 0; @@ -846,6 +831,10 @@ int dax_writeback_mapping_range(struct address_space *mapping, if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL) return 0; + dax_inode = dax_get_by_host(bdev->bd_disk->disk_name); + if (!dax_inode) + return -EIO; + 
start_index =
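The conversion in this patch replaces dax_map_atomic() and dax_unmap_atomic() with an explicit dax_read_lock(), bdev_dax_direct_access(), dax_read_unlock() sequence, so every early return now has to drop the read lock itself. A userspace sketch of that pairing follows. A plain counter stands in for the SRCU read-side section, and all `model_*` names are hypothetical; none of this is the fs/dax.c code.

```c
#include <errno.h>
#include <string.h>

/* Stands in for the SRCU read-side critical section depth. */
static int model_lock_depth;

static int model_read_lock(void)
{
	return model_lock_depth++;
}

static void model_read_unlock(int id)
{
	(void)id;
	model_lock_depth--;
}

/* Hypothetical lookup: fails with -EIO when the device is gone. */
static long model_direct_access(const char *dev, size_t size,
				const char **addr)
{
	if (!dev)
		return -EIO;
	*addr = dev;
	return (long)size;
}

/* Same shape as the converted copy_user_dax(): unlock on both paths. */
static int model_copy(const char *dev, char *to, size_t size)
{
	const char *addr;
	long rc;
	int id;

	id = model_read_lock();
	rc = model_direct_access(dev, size, &addr);
	if (rc < 0) {
		model_read_unlock(id);
		return (int)rc;
	}
	memcpy(to, addr, size);
	model_read_unlock(id);
	return 0;
}
```

The property to check is that the depth counter returns to zero after both the success path and the error path.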
[RFC PATCH 14/17] ext2, ext4, xfs: retrieve dax_inode through iomap operations
In preparation for converting fs/dax.c to use bdev_dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the dax_inode
determined at mount through ->iomap_begin.

Signed-off-by: Dan Williams
---
 fs/ext2/inode.c       |    1 +
 fs/ext4/inode.c       |    1 +
 fs/xfs/xfs_aops.c     |   13 +
 fs/xfs/xfs_aops.h     |    1 +
 fs/xfs/xfs_buf.h      |    1 +
 fs/xfs/xfs_iomap.c    |    1 +
 fs/xfs/xfs_super.c    |    3 +++
 include/linux/iomap.h |    1 +
 8 files changed, 22 insertions(+)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f073bfca694b..c83f84748ec9 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -813,6 +813,7 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,

 	iomap->flags = 0;
 	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->dax_inode = inode->i_sb->s_dax;
 	iomap->offset = (u64)first_block << blkbits;

 	if (ret == 0) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 88d57af1b516..ae6fa6a78d0d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3344,6 +3344,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,

 	iomap->flags = 0;
 	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->dax_inode = inode->i_sb->s_dax;
 	iomap->offset = first_block << blkbits;

 	if (ret == 0) {
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 631e7c0e0a29..7d22938a4d8b 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -80,6 +80,19 @@ xfs_find_bdev_for_inode(
 		return mp->m_ddev_targp->bt_bdev;
 }

+struct dax_inode *
+xfs_find_dax_for_inode(
+	struct inode		*inode)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (XFS_IS_REALTIME_INODE(ip))
+		return NULL;
+	else
+		return mp->m_ddev_targp->bt_dax;
+}
+
 /*
  * We're now finished for good with this page.  Update the page state via the
  * associated buffer_heads, paying attention to the start and end offsets that
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index cc174ec6c2fd..e5b65f436acf 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -59,5 +59,6 @@ int xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);

 extern void xfs_count_page_state(struct page *, int *, int *);
 extern struct block_device *xfs_find_bdev_for_inode(struct inode *);
+extern struct dax_inode *xfs_find_dax_for_inode(struct inode *);

 #endif /* __XFS_AOPS_H__ */
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 8a9d3a9599f0..1ff83f398649 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -109,6 +109,7 @@ typedef unsigned int xfs_buf_flags_t;
 typedef struct xfs_buftarg {
 	dev_t			bt_dev;
 	struct block_device	*bt_bdev;
+	struct dax_inode	*bt_dax;
 	struct backing_dev_info	*bt_bdi;
 	struct xfs_mount	*bt_mount;
 	unsigned int		bt_meta_sectorsize;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 0d147428971e..1d08bd2433d5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -69,6 +69,7 @@ xfs_bmbt_to_iomap(
 	iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
 	iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
 	iomap->bdev = xfs_find_bdev_for_inode(VFS_I(ip));
+	iomap->dax_inode = xfs_find_dax_for_inode(VFS_I(ip));
 }

 xfs_extlen_t
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index eecbaac08eba..1a99013a0701 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -774,6 +774,9 @@ xfs_open_devices(
 	if (!mp->m_ddev_targp)
 		goto out_close_rtdev;

+	/* associate dax inode for filesystem-dax */
+	mp->m_ddev_targp->bt_dax = mp->m_super->s_dax;
+
 	if (rtdev) {
 		mp->m_rtdev_targp = xfs_alloc_buftarg(mp, rtdev);
 		if (!mp->m_rtdev_targp)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index a4c94b86401e..01e265e7cf55 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -41,6 +41,7 @@ struct iomap {
 	u16			type;	/* type of mapping */
 	u16			flags;	/* flags for mapping */
 	struct block_device	*bdev;	/* block device for I/O */
+	struct dax_inode	*dax_inode; /* dax_inode for dax operations */
 };

 /*
--
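The xfs helper added in this patch returns no dax handle for realtime inodes and the data device's handle otherwise, and the iomap setup copies that result into the iomap. A minimal userspace sketch of that selection is below. It uses hypothetical `model_*` types rather than the real xfs structures and is only meant to show the decision, not the kernel plumbing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dax_inode. */
struct model_dax {
	const char *host;
};

/* Hypothetical stand-in for the mount: only the data device has a handle. */
struct model_mount {
	struct model_dax *data_dax;
};

struct model_inode {
	struct model_mount *mount;
	bool realtime;
};

struct model_iomap {
	struct model_dax *dax;
};

/* Realtime inodes get no dax handle; everything else uses the data device. */
static struct model_dax *model_find_dax_for_inode(const struct model_inode *ip)
{
	if (ip->realtime)
		return NULL;
	return ip->mount->data_dax;
}

/* Mirrors the one-line assignment added to the iomap setup paths. */
static void model_iomap_begin(const struct model_inode *ip,
			      struct model_iomap *iomap)
{
	iomap->dax = model_find_dax_for_inode(ip);
}
```

A consumer of the iomap therefore has to be prepared for a NULL handle, at least for the realtime case modelled here.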
[RFC PATCH 12/17] dm: add dax_operations support (consumer)
Arrange for dm to lookup the dax services available from member devices. Update the dax-capable targets, linear and stripe, to route dax operations to the underlying device. Signed-off-by: Dan Williams--- drivers/md/dm-linear.c| 24 drivers/md/dm-snap.c | 12 drivers/md/dm-stripe.c| 30 ++ drivers/md/dm-target.c| 11 +++ drivers/md/dm.c | 16 include/linux/device-mapper.h |7 +++ 6 files changed, 96 insertions(+), 4 deletions(-) diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c index 4788b0b989a9..e91ca8089333 100644 --- a/drivers/md/dm-linear.c +++ b/drivers/md/dm-linear.c @@ -159,6 +159,29 @@ static long linear_direct_access(struct dm_target *ti, sector_t sector, return ret; } +static long linear_dax_direct_access(struct dm_target *ti, phys_addr_t dev_addr, +void **kaddr, pfn_t *pfn, long size) +{ + struct linear_c *lc = ti->private; + struct block_device *bdev = lc->dev->bdev; + struct dax_inode *dax_inode = lc->dev->dax_inode; + struct blk_dax_ctl dax = { + .sector = linear_map_sector(ti, dev_addr >> SECTOR_SHIFT), + .size = size, + }; + long ret; + + ret = bdev_dax_direct_access(bdev, dax_inode, ); + *kaddr = dax.addr; + *pfn = dax.pfn; + + return ret; +} + +static const struct dm_dax_operations linear_dax_ops = { + .dm_direct_access = linear_dax_direct_access, +}; + static struct target_type linear_target = { .name = "linear", .version = {1, 3, 0}, @@ -170,6 +193,7 @@ static struct target_type linear_target = { .prepare_ioctl = linear_prepare_ioctl, .iterate_devices = linear_iterate_devices, .direct_access = linear_direct_access, + .dax_ops = _dax_ops, }; int __init dm_linear_init(void) diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c index c65feeada864..1990e3bd6958 100644 --- a/drivers/md/dm-snap.c +++ b/drivers/md/dm-snap.c @@ -2309,6 +2309,13 @@ static long origin_direct_access(struct dm_target *ti, sector_t sector, return -EIO; } +static long origin_dax_direct_access(struct dm_target *ti, phys_addr_t dev_addr, + void **kaddr, pfn_t *pfn, 
long size) +{ + DMWARN("device does not support dax."); + return -EIO; +} + /* * Set the target "max_io_len" field to the minimum of all the snapshots' * chunk sizes. @@ -2357,6 +2364,10 @@ static int origin_iterate_devices(struct dm_target *ti, return fn(ti, o->dev, 0, ti->len, data); } +static const struct dm_dax_operations origin_dax_ops = { + .dm_direct_access = origin_dax_direct_access, +}; + static struct target_type origin_target = { .name= "snapshot-origin", .version = {1, 9, 0}, @@ -2369,6 +2380,7 @@ static struct target_type origin_target = { .status = origin_status, .iterate_devices = origin_iterate_devices, .direct_access = origin_direct_access, + .dax_ops = _dax_ops, }; static struct target_type snapshot_target = { diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c index 28193a57bf47..47fb56a6184a 100644 --- a/drivers/md/dm-stripe.c +++ b/drivers/md/dm-stripe.c @@ -331,6 +331,31 @@ static long stripe_direct_access(struct dm_target *ti, sector_t sector, return ret; } +static long stripe_dax_direct_access(struct dm_target *ti, phys_addr_t dev_addr, + void **kaddr, pfn_t *pfn, long size) +{ + struct stripe_c *sc = ti->private; + uint32_t stripe; + struct block_device *bdev; + struct dax_inode *dax_inode; + struct blk_dax_ctl dax = { + .size = size, + }; + long ret; + + stripe_map_sector(sc, dev_addr >> SECTOR_SHIFT, , ); + + dax.sector += sc->stripe[stripe].physical_start; + bdev = sc->stripe[stripe].dev->bdev; + dax_inode = sc->stripe[stripe].dev->dax_inode; + + ret = bdev_dax_direct_access(bdev, dax_inode, ); + *kaddr = dax.addr; + *pfn = dax.pfn; + + return ret; +} + /* * Stripe status: * @@ -437,6 +462,10 @@ static void stripe_io_hints(struct dm_target *ti, blk_limits_io_opt(limits, chunk_size * sc->stripes); } +static const struct dm_dax_operations stripe_dax_ops = { + .dm_direct_access = stripe_dax_direct_access, +}; + static struct target_type stripe_target = { .name = "striped", .version = {1, 6, 0}, @@ -449,6 +478,7 @@ static struct 
target_type stripe_target = { .iterate_devices = stripe_iterate_devices, .io_hints = stripe_io_hints, .direct_access = stripe_direct_access, + .dax_ops = _dax_ops, }; int __init dm_stripe_init(void) diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c index 710ae28fd618..ab072f53cf24
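For the linear target, routing a dax request means converting the byte address to a sector, rebasing it from the target's range onto the backing device, and forwarding the result. The arithmetic can be sketched as follows. The struct and function names are hypothetical, the shift of 9 is an assumption, and the rebasing formula (device start plus offset into the target) is my reading of linear_map_sector(), which this excerpt does not show.

```c
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9 /* assumed: 512-byte sectors */

struct model_linear {
	uint64_t target_begin; /* first sector of the target within the dm device */
	uint64_t dev_start;    /* first sector used on the backing device */
};

/* Byte address on the dm device -> sector on the backing device. */
static uint64_t model_linear_map(const struct model_linear *lc,
				 uint64_t dev_addr)
{
	uint64_t sector = dev_addr >> MODEL_SECTOR_SHIFT;

	return lc->dev_start + (sector - lc->target_begin);
}
```

Note that the right shift discards any sub-sector part of the address, so two byte addresses inside the same sector map to the same backing sector in this model.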
[RFC PATCH 17/17] block: remove block_device_operations.direct_access and related infrastructure
Now that all the producers and consumers of dax interfaces have been converted to using dax_operations on a dax_inode, remove the block device direct_access enabling. Signed-off-by: Dan Williams--- arch/powerpc/sysdev/axonram.c | 15 -- drivers/block/brd.c | 11 -- drivers/md/dm-linear.c| 19 - drivers/md/dm-snap.c |8 --- drivers/md/dm-stripe.c| 24 -- drivers/md/dm-table.c |2 +- drivers/md/dm-target.c|7 -- drivers/md/dm.c | 19 +++-- drivers/nvdimm/pmem.c |9 drivers/s390/block/dcssblk.c | 16 --- fs/block_dev.c| 45 - include/linux/blkdev.h|3 --- include/linux/device-mapper.h |9 13 files changed, 4 insertions(+), 183 deletions(-) diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c index 4e1f58187726..1337b5829980 100644 --- a/arch/powerpc/sysdev/axonram.c +++ b/arch/powerpc/sysdev/axonram.c @@ -148,23 +148,8 @@ __axon_ram_direct_access(struct axon_ram_bank *bank, phys_addr_t offset, return bank->size - offset; } -/** - * axon_ram_direct_access - direct_access() method for block device - * @device, @sector, @data: see block_device_operations method - */ -static long -axon_ram_blk_direct_access(struct block_device *device, sector_t sector, - void **kaddr, pfn_t *pfn, long size) -{ - struct axon_ram_bank *bank = device->bd_disk->private_data; - - return __axon_ram_direct_access(bank, sector << AXON_RAM_SECTOR_SHIFT, - kaddr, pfn, size); -} - static const struct block_device_operations axon_ram_devops = { .owner = THIS_MODULE, - .direct_access = axon_ram_blk_direct_access }; static long diff --git a/drivers/block/brd.c b/drivers/block/brd.c index 1279df4dc07c..52a1259f8ded 100644 --- a/drivers/block/brd.c +++ b/drivers/block/brd.c @@ -395,14 +395,6 @@ static long __brd_direct_access(struct brd_device *brd, phys_addr_t dev_addr, return PAGE_SIZE; } -static long brd_blk_direct_access(struct block_device *bdev, sector_t sector, - void **kaddr, pfn_t *pfn, long size) -{ - struct brd_device *brd = bdev->bd_disk->private_data; - - return 
__brd_direct_access(brd, sector * 512, kaddr, pfn, size); -} - static long brd_dax_direct_access(struct dax_inode *dax_inode, phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size) { @@ -414,14 +406,11 @@ static long brd_dax_direct_access(struct dax_inode *dax_inode, static const struct dax_operations brd_dax_ops = { .direct_access = brd_dax_direct_access, }; -#else -#define brd_blk_direct_access NULL #endif static const struct block_device_operations brd_fops = { .owner =THIS_MODULE, .rw_page = brd_rw_page, - .direct_access =brd_blk_direct_access, }; /* diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c index e91ca8089333..7ec2a8eb8a14 100644 --- a/drivers/md/dm-linear.c +++ b/drivers/md/dm-linear.c @@ -141,24 +141,6 @@ static int linear_iterate_devices(struct dm_target *ti, return fn(ti, lc->dev, lc->start, ti->len, data); } -static long linear_direct_access(struct dm_target *ti, sector_t sector, -void **kaddr, pfn_t *pfn, long size) -{ - struct linear_c *lc = ti->private; - struct block_device *bdev = lc->dev->bdev; - struct blk_dax_ctl dax = { - .sector = linear_map_sector(ti, sector), - .size = size, - }; - long ret; - - ret = bdev_direct_access(bdev, ); - *kaddr = dax.addr; - *pfn = dax.pfn; - - return ret; -} - static long linear_dax_direct_access(struct dm_target *ti, phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size) { @@ -192,7 +174,6 @@ static struct target_type linear_target = { .status = linear_status, .prepare_ioctl = linear_prepare_ioctl, .iterate_devices = linear_iterate_devices, - .direct_access = linear_direct_access, .dax_ops = _dax_ops, }; diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c index 1990e3bd6958..1d9407633bb5 100644 --- a/drivers/md/dm-snap.c +++ b/drivers/md/dm-snap.c @@ -2302,13 +2302,6 @@ static int origin_map(struct dm_target *ti, struct bio *bio) return do_origin(o->dev, bio); } -static long origin_direct_access(struct dm_target *ti, sector_t sector, - void **kaddr, pfn_t *pfn, long size) -{ - 
DMWARN("device does not support dax."); - return -EIO; -} - static long origin_dax_direct_access(struct dm_target *ti, phys_addr_t dev_addr, void **kaddr, pfn_t *pfn, long size) { @@ -2379,7 +2372,6
[RFC PATCH 03/17] dax: add a facility to lookup a dax inode by 'host' device name
For the current block_device based filesystem-dax path, we need a way
for it to look up the dax_inode associated with a block_device. Add a
'host' property of a dax_inode that can be used for this purpose. It is
a free form string, but for a dax_inode associated with a block device
it is the bdev name.

This is a band-aid until filesystems are able to mount on a dax-inode
directly.

We use a hash list since blkdev_writepages() will need to use this
interface to issue dax_writeback_mapping_range().

Signed-off-by: Dan Williams
---
 drivers/dax/dax.h    |    2 +
 drivers/dax/device.c |    2 +
 drivers/dax/super.c  |   79 +-
 include/linux/dax.h  |    1 +
 4 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index def061aa75f4..f33c16ed2ec6 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,7 +13,7 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_inode;
-struct dax_inode *alloc_dax_inode(void *private);
+struct dax_inode *alloc_dax_inode(void *private, const char *host);
 void put_dax_inode(struct dax_inode *dax_inode);
 bool dax_inode_alive(struct dax_inode *dax_inode);
 void kill_dax_inode(struct dax_inode *dax_inode);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index af06d0bfd6ea..6d0a3241a608 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -560,7 +560,7 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
 		goto err_id;
 	}
 
-	dax_inode = alloc_dax_inode(dax_dev);
+	dax_inode = alloc_dax_inode(dax_dev, NULL);
 	if (!dax_inode)
 		goto err_inode;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 7c4dc97d53a8..7ac048f94b2b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -30,6 +30,10 @@ static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
 
+#define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
+static struct hlist_head dax_host_list[DAX_HASH_SIZE];
+static DEFINE_SPINLOCK(dax_host_lock);
+
 int dax_read_lock(void)
 {
 	return srcu_read_lock(&dax_srcu);
@@ -46,12 +50,15 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
  * struct dax_inode - anchor object for dax services
  * @inode: core vfs
  * @cdev: optional character interface for "device dax"
+ * @host: optional name for lookups where the device path is not available
  * @private: dax driver private data
  * @alive: !alive + rcu grace period == no new operations / mappings
  */
 struct dax_inode {
+	struct hlist_node list;
 	struct inode inode;
 	struct cdev cdev;
+	const char *host;
 	void *private;
 	bool alive;
 };
@@ -63,6 +70,11 @@ bool dax_inode_alive(struct dax_inode *dax_inode)
 }
 EXPORT_SYMBOL_GPL(dax_inode_alive);
 
+static int dax_host_hash(const char *host)
+{
+	return hashlen_hash(hashlen_string("DAX", host)) % DAX_HASH_SIZE;
+}
+
 /*
  * Note, rcu is not protecting the liveness of dax_inode, rcu is
  * ensuring that any fault handlers or operations that might have seen
@@ -75,6 +87,12 @@ void kill_dax_inode(struct dax_inode *dax_inode)
 		return;
 
 	dax_inode->alive = false;
+
+	spin_lock(&dax_host_lock);
+	if (!hlist_unhashed(&dax_inode->list))
+		hlist_del_init(&dax_inode->list);
+	spin_unlock(&dax_host_lock);
+
 	synchronize_srcu(&dax_srcu);
 	dax_inode->private = NULL;
 }
@@ -98,6 +116,8 @@ static void dax_i_callback(struct rcu_head *head)
 	struct inode *inode = container_of(head, struct inode, i_rcu);
 	struct dax_inode *dax_inode = to_dax_inode(inode);
 
+	kfree(dax_inode->host);
+	dax_inode->host = NULL;
 	ida_simple_remove(&dax_minor_ida, MINOR(inode->i_rdev));
 	kmem_cache_free(dax_cache, dax_inode);
 }
@@ -169,26 +189,49 @@ static struct dax_inode *dax_inode_get(dev_t devt)
 	return dax_inode;
 }
 
-struct dax_inode *alloc_dax_inode(void *private)
+static void dax_add_host(struct dax_inode *dax_inode, const char *host)
+{
+	int hash;
+
+	INIT_HLIST_NODE(&dax_inode->list);
+	if (!host)
+		return;
+
+	dax_inode->host = host;
+	hash = dax_host_hash(host);
+	spin_lock(&dax_host_lock);
+	hlist_add_head(&dax_inode->list, &dax_host_list[hash]);
+	spin_unlock(&dax_host_lock);
+}
+
+struct dax_inode *alloc_dax_inode(void *private, const char *__host)
 {
 	struct dax_inode *dax_inode;
+	const char *host;
 	dev_t devt;
 	int minor;
 
+	host = kstrdup(__host, GFP_KERNEL);
+	if (__host && !host)
+		return NULL;
+
 	minor = ida_simple_get(&dax_minor_ida, 0, nr_dax, GFP_KERNEL);
 	if (minor < 0)
-		return NULL;
+		goto err_minor;
 
 	devt = MKDEV(MAJOR(dax_devt), minor);
 	dax_inode = dax_inode_get(devt);
 	if (!dax_inode)
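
For readers who want to experiment with the 'host' lookup idea outside the
kernel, here is a minimal userspace sketch of a name-keyed hash list. It is
not the patch's code: host_add(), host_find(), host_del(), struct host_node,
HOST_HASH_SIZE and the djb2 hash are all hypothetical stand-ins, the name
must be non-NULL, and locking is left out (the patch serializes list updates
with dax_host_lock).

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define HOST_HASH_SIZE 64	/* hypothetical; the patch derives its size from PAGE_SIZE */

/* Hypothetical stand-in for the hashed part of struct dax_inode. */
struct host_node {
	struct host_node *next;	/* bucket chain, in place of hlist_node */
	char *host;		/* private copy of the name, as with kstrdup() */
	void *private;		/* driver data handed back by a lookup */
};

static struct host_node *host_table[HOST_HASH_SIZE];

/* djb2 string hash: an assumption, standing in for hashlen_string(). */
static unsigned int host_hash(const char *host)
{
	unsigned long h = 5381;

	while (*host)
		h = h * 33 + (unsigned char)*host++;
	return (unsigned int)(h % HOST_HASH_SIZE);
}

/* Copy the name and link the node at the head of its bucket. */
struct host_node *host_add(const char *host, void *private)
{
	size_t len = strlen(host) + 1;
	struct host_node *n = malloc(sizeof(*n));
	unsigned int bucket;

	if (!n)
		return NULL;
	n->host = malloc(len);
	if (!n->host) {
		free(n);
		return NULL;
	}
	memcpy(n->host, host, len);
	n->private = private;

	bucket = host_hash(host);
	n->next = host_table[bucket];
	host_table[bucket] = n;
	return n;
}

/* Walk one bucket and compare names; NULL means no such host. */
void *host_find(const char *host)
{
	struct host_node *n;

	for (n = host_table[host_hash(host)]; n; n = n->next)
		if (strcmp(n->host, host) == 0)
			return n->private;
	return NULL;
}

/* Unhash the node, then free the name copy and the node itself. */
void host_del(struct host_node *node)
{
	struct host_node **pp = &host_table[host_hash(node->host)];

	while (*pp && *pp != node)
		pp = &(*pp)->next;
	if (*pp)
		*pp = node->next;
	free(node->host);
	free(node);
}
```

A caller would register a name such as "pmem0" with host_add(), resolve it
later with host_find(), and call host_del() at teardown, which roughly mirrors
the add / kill / free split in the patch.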
[RFC PATCH 01/17] dax: refactor dax-fs into a generic provider of dax inodes
We want dax capable drivers to be able to publish a set of dax
operations [1]. However, we do not want to further abuse block_devices
to advertise these operations. Instead we will attach these operations
to a dax inode and add a lookup mechanism to go from block device path
to a dax inode. A dax capable driver like pmem or brd is responsible for
registering a dax inode, alongside a block device, and then a dax
capable filesystem is responsible for retrieving the dax inode by path
name if it wants to call dax_operations.

For now, we refactor the dax pseudo-fs to be a generic facility, rather
than an implementation detail, of the device-dax use case. Where a "dax
inode" is just an inode + dax infrastructure, and "Device DAX" is a
mapping service layered on top of that base inode. "Filesystem DAX" is
then a mapping service that layers a filesystem on top of the base dax
inode. Filesystem DAX goes through a block_device for now, but perhaps
directly to a dax inode in the future, or for new pmem-only filesystems.

[1]: https://lkml.org/lkml/2017/1/19/880

Suggested-by: Christoph Hellwig
Signed-off-by: Dan Williams
---
 drivers/Makefile            |    2
 drivers/dax/Kconfig         |    8 +
 drivers/dax/Makefile        |    5 +
 drivers/dax/dax.h           |   24 ++-
 drivers/dax/device-dax.h    |   25 +++
 drivers/dax/device.c        |  241 +
 drivers/dax/pmem.c          |    2
 drivers/dax/super.c         |  310 +++
 tools/testing/nvdimm/Kbuild |    6 -
 9 files changed, 402 insertions(+), 221 deletions(-)
 create mode 100644 drivers/dax/device-dax.h
 rename drivers/dax/{dax.c => device.c} (75%)
 create mode 100644 drivers/dax/super.c

diff --git a/drivers/Makefile b/drivers/Makefile
index 060026a02f59..17f42e4a6717 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -68,7 +68,7 @@ obj-$(CONFIG_PARPORT) += parport/
 obj-$(CONFIG_NVM)		+= lightnvm/
 obj-y				+= base/ block/ misc/ mfd/ nfc/
 obj-$(CONFIG_LIBNVDIMM)		+= nvdimm/
-obj-$(CONFIG_DEV_DAX)		+= dax/
+obj-$(CONFIG_DAX)		+= dax/
 obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
 obj-$(CONFIG_NUBUS)		+= nubus/
 obj-y				+= macintosh/
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 3e2ab3b14eea..39bcbf4c5e40 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,6 +1,11 @@
-menuconfig DEV_DAX
+menuconfig DAX
 	tristate "DAX: direct access to differentiated memory"
 	default m if NVDIMM_DAX
+
+if DAX
+
+config DEV_DAX
+	tristate "Device DAX: direct access mapping device"
 	depends on TRANSPARENT_HUGEPAGE
 	help
 	  Support raw access to differentiated (persistence, bandwidth,
@@ -10,7 +15,6 @@ menuconfig DEV_DAX
 	  baseline memory pool. Mappings of a /dev/daxX.Y device impose
 	  restrictions that make the mapping behavior deterministic.
 
-if DEV_DAX
 
 config DEV_DAX_PMEM
 	tristate "PMEM DAX: direct access to persistent memory"
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 27c54e38478a..dc7422530462 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -1,4 +1,7 @@
-obj-$(CONFIG_DEV_DAX) += dax.o
+obj-$(CONFIG_DAX) += dax.o
+obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
 
+dax-y := super.o
 dax_pmem-y := pmem.o
+device_dax-y := device.o
diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index ddd829ab58c0..def061aa75f4 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ * Copyright(c) 2016 - 2017 Intel Corporation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of version 2 of the GNU General Public License as
@@ -12,14 +12,16 @@
  */
 #ifndef __DAX_H__
 #define __DAX_H__
-struct device;
-struct dax_dev;
-struct resource;
-struct dax_region;
-void dax_region_put(struct dax_region *dax_region);
-struct dax_region *alloc_dax_region(struct device *parent,
-		int region_id, struct resource *res, unsigned int align,
-		void *addr, unsigned long flags);
-struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
-		struct resource *res, int count);
+struct dax_inode;
+struct dax_inode *alloc_dax_inode(void *private);
+void put_dax_inode(struct dax_inode *dax_inode);
+bool dax_inode_alive(struct dax_inode *dax_inode);
+void kill_dax_inode(struct dax_inode *dax_inode);
+struct dax_inode *inode_to_dax_inode(struct inode *inode);
+struct inode *dax_inode_to_inode(struct dax_inode *dax_inode);
+void *dax_inode_get_private(struct dax_inode *dax_inode);
+int dax_inode_register(struct dax_inode *dax_inode,
+		const struct file_operations *fops, struct module *owner,
+		struct
[RFC PATCH 00/17] introduce a dax_inode for dax_operations
Recently there was an effort to introduce dax_operations to unwind the
abuse of the user-copy api in the pmem api [1]. Christoph noted that we
should not add new block-dax operations as it is further abuse of struct
block_device [2].

The ->direct_access() method in block_device_operations was an expedient
way to get the filesystem-dax capability bootstrapped. However, looking
forward to native persistent memory filesystems, they can forgo the
block layer and mount directly on a provider of dax services, a dax
inode. For the time being, since current dax capable filesystems are
block based, we need a facility to look up this dax object via the
block-device name.

If this approach looks reasonable I'll follow up with reworking the
proposed ->copy_from_iter(), ->flush(), and ->clear() dax operations
into this new scheme.

These patches survive a run of the libnvdimm unit tests, but I have not
tested the non-libnvdimm dax drivers.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008586.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008638.html

---

Dan Williams (17):
      dax: refactor dax-fs into a generic provider of dax inodes
      dax: convert dax_inode locking to srcu
      dax: add a facility to lookup a dax inode by 'host' device name
      dax: introduce dax_operations
      pmem: add dax_operations support
      axon_ram: add dax_operations support
      brd: add dax_operations support
      dcssblk: add dax_operations support
      block: kill bdev_dax_capable()
      block: introduce bdev_dax_direct_access()
      dm: add dax_operations support (producer)
      dm: add dax_operations support (consumer)
      fs: update mount_bdev() to lookup dax infrastructure
      ext2, ext4, xfs: retrieve dax_inode through iomap operations
      Revert "block: use DAX for partition table reads"
      fs, dax: convert filesystem-dax to bdev_dax_direct_access
      block: remove block_device_operations.direct_access and related infrastructure

 arch/powerpc/platforms/Kconfig  |    1
 arch/powerpc/sysdev/axonram.c   |   37 +++
 block/Kconfig                   |    1
 block/partition-generic.c       |   17 --
 drivers/Makefile                |    2
 drivers/block/Kconfig           |    1
 drivers/block/brd.c             |   48 +++-
 drivers/dax/Kconfig             |    9 +
 drivers/dax/Makefile            |    5
 drivers/dax/dax.h               |   19 +-
 drivers/dax/device-dax.h        |   25 ++
 drivers/dax/device.c            |  257 ---
 drivers/dax/pmem.c              |    2
 drivers/dax/super.c             |  434 +++
 drivers/md/Kconfig              |    1
 drivers/md/dm-core.h            |    3
 drivers/md/dm-linear.c          |   15 +
 drivers/md/dm-snap.c            |    8 +
 drivers/md/dm-stripe.c          |   16 +
 drivers/md/dm-table.c           |    2
 drivers/md/dm-target.c          |   10 +
 drivers/md/dm.c                 |   43 +++-
 drivers/nvdimm/Kconfig          |    1
 drivers/nvdimm/pmem.c           |   46 +++-
 drivers/nvdimm/pmem.h           |    7 -
 drivers/s390/block/Kconfig      |    1
 drivers/s390/block/dcssblk.c    |   41 +++-
 fs/block_dev.c                  |   75 ++-
 fs/dax.c                        |  149 ++---
 fs/ext2/inode.c                 |    1
 fs/ext4/inode.c                 |    1
 fs/iomap.c                      |    3
 fs/super.c                      |   32 +++
 fs/xfs/xfs_aops.c               |   13 +
 fs/xfs/xfs_aops.h               |    1
 fs/xfs/xfs_buf.h                |    1
 fs/xfs/xfs_iomap.c              |    1
 fs/xfs/xfs_super.c              |    3
 include/linux/blkdev.h          |    7 -
 include/linux/dax.h             |   29 ++-
 include/linux/device-mapper.h   |   16 +
 include/linux/fs.h              |    1
 include/linux/iomap.h           |    1
 tools/testing/nvdimm/Kbuild     |    6 -
 tools/testing/nvdimm/pmem-dax.c |   12 -
 45 files changed, 927 insertions(+), 477 deletions(-)
 create mode 100644 drivers/dax/device-dax.h
 rename drivers/dax/{dax.c => device.c} (74%)
 create mode 100644 drivers/dax/super.c
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 02/17] dax: convert dax_inode locking to srcu
In preparation for adding dax_operations that perform ->direct_access()
and user copy operations relative to a dax_inode, convert the existing
dax_inode locking to srcu.

Some dax drivers need to sleep in their ->direct_access() methods and
user copying may fault / sleep.

Signed-off-by: Dan Williams
---
 drivers/dax/Kconfig  |    1 +
 drivers/dax/device.c |   18 +-
 drivers/dax/super.c  |   20
 include/linux/dax.h  |    3 +++
 4 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 39bcbf4c5e40..b7053eafd88e 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,5 +1,6 @@
 menuconfig DAX
 	tristate "DAX: direct access to differentiated memory"
+	select SRCU
 	default m if NVDIMM_DAX
 
 if DAX
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 5b5572314929..af06d0bfd6ea 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -333,16 +333,16 @@ static int __dax_dev_fault(struct dax_dev *dax_dev, struct vm_area_struct *vma,
 
 static int dax_dev_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	int rc;
+	int rc, id;
 	struct file *filp = vma->vm_file;
 	struct dax_dev *dax_dev = filp->private_data;
 
 	dev_dbg(&dax_dev->dev, "%s: %s: %s (%#lx - %#lx)\n", __func__,
 			current->comm, (vmf->flags & FAULT_FLAG_WRITE)
 			? "write" : "read", vma->vm_start, vma->vm_end);
-	rcu_read_lock();
+	id = dax_read_lock();
 	rc = __dax_dev_fault(dax_dev, vma, vmf);
-	rcu_read_unlock();
+	dax_read_unlock(id);
 
 	return rc;
 }
@@ -390,7 +390,7 @@ static int __dax_dev_pmd_fault(struct dax_dev *dax_dev,
 static int dax_dev_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, unsigned int flags)
 {
-	int rc;
+	int rc, id;
 	struct file *filp = vma->vm_file;
 	struct dax_dev *dax_dev = filp->private_data;
 
@@ -398,9 +398,9 @@ static int dax_dev_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 			current->comm, (flags & FAULT_FLAG_WRITE)
 			? "write" : "read", vma->vm_start, vma->vm_end);
 
-	rcu_read_lock();
+	id = dax_read_lock();
 	rc = __dax_dev_pmd_fault(dax_dev, vma, addr, pmd, flags);
-	rcu_read_unlock();
+	dax_read_unlock(id);
 
 	return rc;
 }
@@ -412,8 +412,8 @@ static const struct vm_operations_struct dax_dev_vm_ops = {
 
 static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 {
+	int rc, id;
 	struct dax_dev *dax_dev = filp->private_data;
-	int rc;
 
 	dev_dbg(&dax_dev->dev, "%s\n", __func__);
 
@@ -421,9 +421,9 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * We lock to check dax_inode liveness and will re-check at
 	 * fault time.
 	 */
-	rcu_read_lock();
+	id = dax_read_lock();
 	rc = check_vma(dax_dev, vma, __func__);
-	rcu_read_unlock();
+	dax_read_unlock(id);
 	if (rc)
 		return rc;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e6369b851619..7c4dc97d53a8 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -24,11 +24,24 @@ module_param(nr_dax, int, S_IRUGO);
 MODULE_PARM_DESC(nr_dax, "max number of dax device instances");
 
 static dev_t dax_devt;
+DEFINE_STATIC_SRCU(dax_srcu);
 static struct vfsmount *dax_mnt;
 static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
 
+int dax_read_lock(void)
+{
+	return srcu_read_lock(&dax_srcu);
+}
+EXPORT_SYMBOL_GPL(dax_read_lock);
+
+void dax_read_unlock(int id)
+{
+	srcu_read_unlock(&dax_srcu, id);
+}
+EXPORT_SYMBOL_GPL(dax_read_unlock);
+
 /**
  * struct dax_inode - anchor object for dax services
  * @inode: core vfs
@@ -45,8 +58,7 @@ struct dax_inode {
 
 bool dax_inode_alive(struct dax_inode *dax_inode)
 {
-	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
-			"dax operations require rcu_read_lock()\n");
+	lockdep_assert_held(&dax_srcu);
 	return dax_inode->alive;
 }
 EXPORT_SYMBOL_GPL(dax_inode_alive);
@@ -55,7 +67,7 @@ EXPORT_SYMBOL_GPL(dax_inode_alive);
  * Note, rcu is not protecting the liveness of dax_inode, rcu is
  * ensuring that any fault handlers or operations that might have seen
  * dax_inode_alive(), have completed. Any operations that start after
- * synchronize_rcu() has run will abort upon seeing !dax_inode_alive().
+ * synchronize_srcu() has run will abort upon seeing !dax_inode_alive().
  */
 void kill_dax_inode(struct dax_inode *dax_inode)
 {
@@ -63,7 +75,7 @@ void kill_dax_inode(struct dax_inode *dax_inode)
 		return;
 
 	dax_inode->alive = false;
-	synchronize_rcu();
+	synchronize_srcu(&dax_srcu);
 	dax_inode->private = NULL;
 }
 
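
The "clear the alive flag, wait for readers, then tear down" ordering that
kill_dax_inode() relies on can be tried out in userspace with the toy model
below. This is only a sketch under loud assumptions: it is not SRCU and not
the patch's code. A C11 atomic reader counter with a busy-wait stands in for
srcu_read_lock() / synchronize_srcu(), and struct toy_dev, toy_init(),
toy_read_int() and toy_kill() are hypothetical names.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the alive/private pair in struct dax_inode. */
struct toy_dev {
	atomic_bool alive;	/* cleared first during teardown */
	atomic_int readers;	/* in-flight read-side sections (not real SRCU) */
	void *private;		/* only dereferenced while alive is observed true */
};

void toy_init(struct toy_dev *d, void *private)
{
	atomic_init(&d->alive, true);
	atomic_init(&d->readers, 0);
	d->private = private;
}

static void toy_read_lock(struct toy_dev *d)
{
	atomic_fetch_add(&d->readers, 1);
}

static void toy_read_unlock(struct toy_dev *d)
{
	atomic_fetch_sub(&d->readers, 1);
}

/*
 * An operation: enter the read side, re-check liveness, and only then
 * touch the private data (assumed here to point at an int).
 */
int toy_read_int(struct toy_dev *d, int *out)
{
	int rc = -ENXIO;

	toy_read_lock(d);
	if (atomic_load(&d->alive)) {
		*out = *(int *)d->private;
		rc = 0;
	}
	toy_read_unlock(d);
	return rc;
}

/*
 * Teardown: mark dead, wait until every reader that might have seen
 * alive == true has left, then drop the private pointer. The default
 * sequentially consistent atomics ensure a reader that enters after
 * the wait observes alive == false.
 */
void toy_kill(struct toy_dev *d)
{
	atomic_store(&d->alive, false);
	while (atomic_load(&d->readers) > 0)
		;	/* crude stand-in for synchronize_srcu() */
	d->private = NULL;
}
```

Unlike this spinning model, real SRCU readers may sleep inside the read-side
section, which is the property the commit message gives as the reason for
moving away from plain rcu_read_lock().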
Re: [PATCH 15/18] scsi: allocate scsi_cmnd structures as part of struct request
On Fri, Jan 27, 2017 at 06:39:46PM +, Bart Van Assche wrote:
> Why have the scsi_release_buffers() and scsi_put_command(cmd) calls been
> moved up? I haven't found an explanation for this change in the patch
> description.

Because they reference the scsi_cmnd, which is now part of the request
and thus freed by blk_finish_request. And yes, I should have mentioned
it in the changelog, sorry.

> Please also consider to remove the cmd->request->special = NULL assignments
> via this patch. Since this patch makes the lifetime of struct scsi_cmnd and
> struct request identical these assignments are no longer needed.

True. If I had to resend again I would have fixed it up, but it's
probably not worth the churn now.

> This patch introduces the function scsi_exit_rq(). Having two functions
> for the single-queue path that release resources (scsi_release_buffers()
> and scsi_exit_rq()) is confusing. Since every scsi_release_buffers() call
> is followed by a blk_unprep_request() call, have you considered to move
> the scsi_release_buffers() call into scsi_unprep_fn() via an additional
> patch?

We could have done that. But it's just more change for a code path
that I hope won't survive this calendar year.
Re: split scsi passthrough fields out of struct request V2
On Fri, Jan 27, 2017 at 09:27:53PM +, Bart Van Assche wrote:
> Have you considered to convert all block drivers to the new
> approach and to get rid of request.special? If so, do you already
> have plans to start working on this? I'm namely wondering whether I
> should start working on this myself.

Hi Bart,

I'd love to have all drivers move off using .special (and thus reduce
request size further). I think the general way to do that is to convert
them to blk-mq rather than using the legacy cmd_size field.
Re: split scsi passthrough fields out of struct request V3
On Fri, Jan 27, 2017 at 06:58:53PM +, Bart Van Assche wrote:
> Version 3 of the patch with title "block: split scsi_request out of
> struct request" (commit 3c30af6ebe12) differs significantly from v2
> of that patch that has been posted on several mailing lists. E.g. v2
> moves __cmd[], cmd and cmd_len from struct request into struct
> scsi_request but v3 not. Which version do you want us to review?

Hi Bart,

I tried to resend the whole updated v3 series, but the mail server
stopped accepting mails due to overload. Otherwise it would have
included all the patches.

Jens instead took the updated version straight from this git branch:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-pc-refactor
Re: [PATCH 14/18] scsi: remove __scsi_alloc_queue
On Fri, Jan 27, 2017 at 05:58:02PM +, Bart Van Assche wrote:
> Since __scsi_init_queue() modifies data in the Scsi_Host structure, have you
> considered to add the declaration for this function to ?
> If you want to keep this declaration in please add a
> direct include of that header file to drivers/scsi/scsi_lib.c such that the
> declaration remains visible to the compiler if someone would minimize the
> number of #include directives in SCSI header files.

Feel free to send an incremental patch either way. In the long run
I'd really like to kill off __scsi_init_queue and remove the transport
BSG queue abuse of SCSI internals, though.