Re: [PATCH V2 0/2] block: remove unnecessary RESTART
On Fri, Nov 03, 2017 at 02:42:50AM +0000, Bart Van Assche wrote:
> On Fri, 2017-11-03 at 10:12 +0800, Ming Lei wrote:
> > [root@ibclient srp-test]# ./run_tests
> > modprobe: FATAL: Module target_core_mod is in use.
>
> LIO must be unloaded before srp-test software is started.

Yeah, I can get this test to kick off after running 'modprobe -fr ib_isert'
first, but it looks like all the tests fail without any hang. Could you check
whether that is the expected result?

	https://pastebin.com/VXe66Jpg

Also, could you provide some output from debugfs when this hang happens,
such as dispatch/busy/... under hctxN?

--
Ming
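The debugfs state requested above can be collected with a small script; a sketch, assuming CONFIG_BLK_DEBUG_FS is enabled and debugfs is mounted at /sys/kernel/debug (the `dm-0` default and the exact set of per-hctx files vary by kernel version):

```shell
# Dump per-hctx blk-mq debugfs state (dispatch list, busy requests,
# hctx state flags, tags) for one block device.
dump_hctx() {
    dev=$1
    base=/sys/kernel/debug/block/$dev
    if [ ! -d "$base" ]; then
        echo "no debugfs directory for $dev (is debugfs mounted?)" >&2
        return 0
    fi
    for hctx in "$base"/hctx*; do
        for f in dispatch busy state tags; do
            [ -e "$hctx/$f" ] || continue
            echo "== $hctx/$f =="
            cat "$hctx/$f"
        done
    done
}

dump_hctx "${1:-dm-0}"
```

Run it as e.g. `./dump_hctx.sh dm-0` while the hang is in progress; a non-empty `dispatch` with an empty `busy` is the kind of signal being asked about here.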
Re: [PATCH V2 0/2] block: remove unnecessary RESTART
On Fri, 2017-11-03 at 10:12 +0800, Ming Lei wrote:
> [root@ibclient srp-test]# ./run_tests
> modprobe: FATAL: Module target_core_mod is in use.

LIO must be unloaded before srp-test software is started.

Bart.
Re: [PATCH V2 0/2] block: remove unnecessary RESTART
Hi Laurence,

On Thu, Nov 02, 2017 at 09:16:04PM -0400, Laurence Oberman wrote:
> Hi Ming
> I have used Bart's tests on my test bed; they should run fine.
> Are you using my SRP setup?

Yeah, I am using your SRP setup. Once the three directories are created, I
see the new failure below, even though both ib/srp and dm-mpath have been
working well. Laurence and Bart, do you have a recipe for getting the test
working first?

[root@ibclient srp-test]# ./run_tests
modprobe: FATAL: Module target_core_mod is in use.

[root@ibclient srp-test]# lsmod | grep target
iscsi_target_mod      294912  1 ib_isert
target_core_mod       352256  2 iscsi_target_mod,ib_isert

[root@ibclient srp-test]# lsblk | head -n 10
NAME                                 MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
sdy                                   65:128  0 3.9G  0 disk
└─36001405b2b5c6c24c084b6fa4d55da2f  253:16   0 3.9G  0 mpath
sdam                                  66:96   0 3.9G  0 disk
└─360014055f563ebf2abc4609a00a16794  253:11   0 3.9G  0 mpath
sdbb                                  67:80   0 3.9G  0 disk
└─36001405f58163b828714988bd2e88358  253:17   0 3.9G  0 mpath
sdf                                    8:80   0 3.9G  0 disk
└─36001405fb1402260c1346688f23d07e0  253:7    0 3.9G  0 mpath
sdau                                  66:224  0 3.9G  0 disk

Thanks,
Ming

>
> Thanks
> Laurence
>
> On Nov 2, 2017 8:48 PM, "Bart Van Assche" wrote:
>
> > On Fri, 2017-11-03 at 08:15 +0800, Ming Lei wrote:
> > > On Thu, Nov 02, 2017 at 11:54:57PM +0000, Bart Van Assche wrote:
> > > > On Fri, 2017-11-03 at 07:48 +0800, Ming Lei wrote:
> > > > > Could you please share your srp-tests script? I may find an IB/SRP
> > > > > system to see if I can reproduce this issue and figure out one
> > > > > solution.
> > > >
> > > > Please have a look at https://github.com/bvanassche/srp-test.
> > >
> > > Could you provide me the exact command line for the reproduction?
> > >
> > > I simply run ./run_tests, and it looks like it doesn't work.
> > >
> > > [root@ srp-test]# ./run_tests
> > > df: ‘/home/ming/git/srp-test/mnt2’: No such file or directory
> > > df: ‘/home/ming/git/srp-test/mnt2’: No such file or directory
> > > Unmounting /home/ming/git/srp-test/mnt2 from
> > > umount: /home/ming/git/srp-test/mnt2: mountpoint not found
> > > df: ‘/home/ming/git/srp-test/mnt2’: No such file or directory
> > > Error: unmounting /home/ming/git/srp-test/mnt2 failed
> >
> > This means that I have to add the following to the README document:
> >
> > Run "mkdir mnt1 mnt2 mnt3" before starting the tests.
> >
> > Bart.
Re: [PATCH V2 0/2] block: remove unnecessary RESTART
On Fri, 2017-11-03 at 08:15 +0800, Ming Lei wrote:
> On Thu, Nov 02, 2017 at 11:54:57PM +0000, Bart Van Assche wrote:
> > On Fri, 2017-11-03 at 07:48 +0800, Ming Lei wrote:
> > > Could you please share your srp-tests script? I may find an IB/SRP system
> > > to see if I can reproduce this issue and figure out one solution.
> >
> > Please have a look at https://github.com/bvanassche/srp-test.
>
> Could you provide me the exact command line for the reproduction?
>
> I simply run ./run_tests, and it looks like it doesn't work.
>
> [root@ srp-test]# ./run_tests
> df: ‘/home/ming/git/srp-test/mnt2’: No such file or directory
> df: ‘/home/ming/git/srp-test/mnt2’: No such file or directory
> Unmounting /home/ming/git/srp-test/mnt2 from
> umount: /home/ming/git/srp-test/mnt2: mountpoint not found
> df: ‘/home/ming/git/srp-test/mnt2’: No such file or directory
> Error: unmounting /home/ming/git/srp-test/mnt2 failed

This means that I have to add the following to the README document:

Run "mkdir mnt1 mnt2 mnt3" before starting the tests.

Bart.
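Until the README gains that note, the missing mount points are a one-time setup in the srp-test checkout; a sketch (run from wherever the repository was cloned):

```shell
# One-time setup before ./run_tests: the test scripts expect these
# mount-point directories at the top of the srp-test checkout.
mkdir -p mnt1 mnt2 mnt3
ls -d mnt1 mnt2 mnt3
```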
Re: [PATCH V2 0/2] block: remove unnecessary RESTART
On Thu, Nov 02, 2017 at 11:54:57PM +0000, Bart Van Assche wrote:
> On Fri, 2017-11-03 at 07:48 +0800, Ming Lei wrote:
> > Could you please share your srp-tests script? I may find an IB/SRP system
> > to see if I can reproduce this issue and figure out one solution.
>
> Please have a look at https://github.com/bvanassche/srp-test.

Could you provide me the exact command line for the reproduction?

I simply run ./run_tests, and it looks like it doesn't work.

[root@ srp-test]# ./run_tests
df: ‘/home/ming/git/srp-test/mnt2’: No such file or directory
df: ‘/home/ming/git/srp-test/mnt2’: No such file or directory
Unmounting /home/ming/git/srp-test/mnt2 from
umount: /home/ming/git/srp-test/mnt2: mountpoint not found
df: ‘/home/ming/git/srp-test/mnt2’: No such file or directory
Error: unmounting /home/ming/git/srp-test/mnt2 failed

Thanks,
Ming
Re: [PATCH V2 0/2] block: remove unnecessary RESTART
On Fri, 2017-11-03 at 07:48 +0800, Ming Lei wrote:
> Could you please share your srp-tests script? I may find an IB/SRP system
> to see if I can reproduce this issue and figure out one solution.

Please have a look at https://github.com/bvanassche/srp-test.

Bart.
Re: [PATCH V2 0/2] block: remove unnecessary RESTART
On Thu, Nov 02, 2017 at 11:43:55PM +0000, Bart Van Assche wrote:
> On Fri, 2017-11-03 at 07:38 +0800, Ming Lei wrote:
> > On Thu, Nov 02, 2017 at 03:57:05PM +0000, Bart Van Assche wrote:
> > > On Wed, 2017-11-01 at 08:21 -0600, Jens Axboe wrote:
> > > > Fixed that up, and applied these two patches as well.
> > >
> > > Hello Jens,
> > >
> > > Recently I noticed that a test system sporadically hangs during boot
> > > (Dell PowerEdge R720 that boots from a hard disk connected to a MegaRAID
> > > SAS adapter) and also that srp-tests systematically hangs. Reverting the
> > > two patches from this series fixes both issues. I'm not sure there is
> > > another solution than reverting the two patches from this series.
> >
> > Then we need to find the root cause, instead of using this sort of
> > workaround as before.
> >
> > For SCSI, the restart is always run from scsi_end_request(), and this
> > kind of restart from all hctxs isn't necessary at all.
>
> Yes, we need to find the root cause, but after these two patches have either
> been reverted or have been removed from the for-4.15 queue. Otherwise, if a
> bisect lands in the middle of these two patches and the fix for these
> patches, that would be very frustrating for the person who is running the
> bisect.

Could you please share your srp-tests script? I may find an IB/SRP system
to see if I can reproduce this issue and figure out one solution.

--
Ming
Re: [PATCH V2 0/2] block: remove unnecessary RESTART
On Fri, 2017-11-03 at 07:38 +0800, Ming Lei wrote:
> On Thu, Nov 02, 2017 at 03:57:05PM +0000, Bart Van Assche wrote:
> > On Wed, 2017-11-01 at 08:21 -0600, Jens Axboe wrote:
> > > Fixed that up, and applied these two patches as well.
> >
> > Hello Jens,
> >
> > Recently I noticed that a test system sporadically hangs during boot
> > (Dell PowerEdge R720 that boots from a hard disk connected to a MegaRAID
> > SAS adapter) and also that srp-tests systematically hangs. Reverting the
> > two patches from this series fixes both issues. I'm not sure there is
> > another solution than reverting the two patches from this series.
>
> Then we need to find the root cause, instead of using this sort of
> workaround as before.
>
> For SCSI, the restart is always run from scsi_end_request(), and this
> kind of restart from all hctxs isn't necessary at all.

Yes, we need to find the root cause, but after these two patches have either
been reverted or have been removed from the for-4.15 queue. Otherwise, if a
bisect lands in the middle of these two patches and the fix for these patches,
that would be very frustrating for the person who is running the bisect.

Bart.
Re: [PATCH V2 0/2] block: remove unnecessary RESTART
On Thu, Nov 02, 2017 at 03:57:05PM +0000, Bart Van Assche wrote:
> On Wed, 2017-11-01 at 08:21 -0600, Jens Axboe wrote:
> > Fixed that up, and applied these two patches as well.
>
> Hello Jens,
>
> Recently I noticed that a test system sporadically hangs during boot (Dell
> PowerEdge R720 that boots from a hard disk connected to a MegaRAID SAS
> adapter) and also that srp-tests systematically hangs. Reverting the two
> patches from this series fixes both issues. I'm not sure there is another
> solution than reverting the two patches from this series.

Then we need to find the root cause, instead of using this sort of workaround
as before.

For SCSI, the restart is always run from scsi_end_request(), and this kind of
restart from all hctxs isn't necessary at all.

> Bart.
>
> BTW, the following appeared in the kernel log when I tried to run srp-tests
> against a kernel with the two patches from this series applied:
>
> INFO: task kworker/19:1:209 blocked for more than 480 seconds.
> Tainted: G W 4.14.0-rc7-dbg+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/19:1 D 0 209 2 0x8000
> Workqueue: srp_remove srp_remove_work [ib_srp]
> Call Trace:
>  __schedule+0x2fa/0xbb0
>  schedule+0x36/0x90
>  async_synchronize_cookie_domain+0x88/0x130
>  ? finish_wait+0x90/0x90
>  async_synchronize_full_domain+0x18/0x20
>  sd_remove+0x4d/0xc0 [sd_mod]
>  device_release_driver_internal+0x160/0x210
>  device_release_driver+0x12/0x20
>  bus_remove_device+0x100/0x180
>  device_del+0x1d8/0x340
>  __scsi_remove_device+0xfc/0x130
>  scsi_forget_host+0x25/0x70
>  scsi_remove_host+0x79/0x120
>  srp_remove_work+0x90/0x1d0 [ib_srp]
>  process_one_work+0x20a/0x660
>  worker_thread+0x3d/0x3b0
>  kthread+0x13a/0x150
>  ? process_one_work+0x660/0x660
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x27/0x40
>
> Showing all locks held in the system:
>
> 1 lock held by khungtaskd/170:
>  #0: (tasklist_lock){.+.+}, at: [] debug_show_all_locks+0x3d/0x1a0
> 4 locks held by kworker/19:1/209:
>  #0: ("%s"("srp_remove")){+.+.}, at: [] process_one_work+0x195/0x660
>  #1: ((>remove_work)){+.+.}, at: [] process_one_work+0x195/0x660
>  #2: (>scan_mutex){+.+.}, at: [] scsi_remove_host+0x1f/0x120
>  #3: (>mutex){}, at: [] device_release_driver_internal+0x39/0x210
> 2 locks held by kworker/u66:0/1927:
>  #0: ("events_unbound"){+.+.}, at: [] process_one_work+0x195/0x660
>  #1: ((>work)){+.+.}, at: [] process_one_work+0x195/0x660
> 2 locks held by kworker/5:0/2047:
>  #0: ("kaluad"){+.+.}, at: [] process_one_work+0x195/0x660
>  #1: ((&(>rtpg_work)->work)){+.+.}, at: [] process_one_work+0x195/0x660
>
> =
>
> INFO: task kworker/19:1:209 blocked for more than 480 seconds.
> Tainted: G W 4.14.0-rc7-dbg+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
[PATCH 5/5] nvme: also expose the namespace identification sysfs files for mpath nodes
We do this by adding a helper that returns the ns_head for a device that can
belong to either the per-controller or per-subsystem block device nodes, and
otherwise reuse all the existing code.

Signed-off-by: Christoph Hellwig
Reviewed-by: Keith Busch
Reviewed-by: Sagi Grimberg
Reviewed-by: Johannes Thumshirn
---
 drivers/nvme/host/core.c      | 57 +++
 drivers/nvme/host/multipath.c |  6 +
 drivers/nvme/host/nvme.h      |  1 +
 3 files changed, 38 insertions(+), 26 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 09f0e826b426..9209b8bca64b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2258,12 +2258,22 @@ static ssize_t nvme_sysfs_rescan(struct device *dev,
 }
 static DEVICE_ATTR(rescan_controller, S_IWUSR, NULL, nvme_sysfs_rescan);

+static inline struct nvme_ns_head *dev_to_ns_head(struct device *dev)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+
+	if (disk->fops == &nvme_fops)
+		return nvme_get_ns_from_dev(dev)->head;
+	else
+		return disk->private_data;
+}
+
 static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
-	char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->head->ids;
-	struct nvme_subsystem *subsys = ns->ctrl->subsys;
+	struct nvme_ns_head *head = dev_to_ns_head(dev);
+	struct nvme_ns_ids *ids = &head->ids;
+	struct nvme_subsystem *subsys = head->subsys;
 	int serial_len = sizeof(subsys->serial);
 	int model_len = sizeof(subsys->model);
@@ -2285,23 +2295,21 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
 		serial_len, subsys->serial, model_len, subsys->model,
-		ns->head->ns_id);
+		head->ns_id);
 }
 static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);

 static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
-	char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%pU\n", ns->head->ids.nguid);
+	return sprintf(buf, "%pU\n", dev_to_ns_head(dev)->ids.nguid);
 }
 static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);

 static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
-	char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->head->ids;
+	struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;

 	/* For backward compatibility expose the NGUID to userspace if
 	 * we have no UUID set
@@ -2316,22 +2324,20 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
 static DEVICE_ATTR(uuid, S_IRUGO, uuid_show, NULL);

 static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
-	char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%8phd\n", ns->head->ids.eui64);
+	return sprintf(buf, "%8phd\n", dev_to_ns_head(dev)->ids.eui64);
 }
 static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);

 static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
-	char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%d\n", ns->head->ns_id);
+	return sprintf(buf, "%d\n", dev_to_ns_head(dev)->ns_id);
 }
 static DEVICE_ATTR(nsid, S_IRUGO, nsid_show, NULL);

-static struct attribute *nvme_ns_attrs[] = {
+static struct attribute *nvme_ns_id_attrs[] = {
 	&dev_attr_wwid.attr,
 	&dev_attr_uuid.attr,
 	&dev_attr_nguid.attr,
@@ -2340,12 +2346,11 @@ static struct attribute *nvme_ns_attrs[] = {
 	NULL,
 };

-static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
+static umode_t nvme_ns_id_attrs_are_visible(struct kobject *kobj,
 		struct attribute *a, int n)
 {
 	struct device *dev = container_of(kobj, struct device, kobj);
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->head->ids;
+	struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;

 	if (a == &dev_attr_uuid.attr) {
 		if (uuid_is_null(&ids->uuid) ||
@@ -2363,9 +2368,9 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 	return a->mode;
 }

-static const struct attribute_group nvme_ns_attr_group = {
-	.attrs = nvme_ns_attrs,
-	.is_visible =
[PATCH 7/7] block: add a poll_fn callback to struct request_queue
That way we can also poll non blk-mq queues. Mostly needed for the NVMe
multipath code, but could also be useful elsewhere.

Signed-off-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
---
 block/blk-core.c             | 11 +++
 block/blk-mq.c               | 14 +-
 drivers/nvme/target/io-cmd.c |  2 +-
 fs/block_dev.c               |  4 ++--
 fs/direct-io.c               |  2 +-
 fs/iomap.c                   |  2 +-
 include/linux/blkdev.h       |  4 +++-
 mm/page_io.c                 |  2 +-
 8 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 54fe673fb3ba..1cab621a552a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2319,6 +2319,17 @@ blk_qc_t submit_bio(struct bio *bio)
 }
 EXPORT_SYMBOL(submit_bio);

+bool blk_poll(struct request_queue *q, blk_qc_t cookie)
+{
+	if (!q->poll_fn || !blk_qc_t_valid(cookie))
+		return false;
+
+	if (current->plug)
+		blk_flush_plug_list(current->plug, false);
+	return q->poll_fn(q, cookie);
+}
+EXPORT_SYMBOL_GPL(blk_poll);
+
 /**
  * blk_cloned_rq_check_limits - Helper function to check a cloned request
  * for new the queue limits
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7f01d69879d6..10c99cf6fd71 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -37,6 +37,7 @@
 #include "blk-wbt.h"
 #include "blk-mq-sched.h"

+static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
@@ -2401,6 +2402,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	spin_lock_init(&q->requeue_lock);

 	blk_queue_make_request(q, blk_mq_make_request);
+	if (q->mq_ops->poll)
+		q->poll_fn = blk_mq_poll;

 	/*
 	 * Do this after blk_queue_make_request() overrides it...
@@ -2860,20 +2863,14 @@ static bool __blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	return false;
 }

-bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
+static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
 {
 	struct blk_mq_hw_ctx *hctx;
-	struct blk_plug *plug;
 	struct request *rq;

-	if (!q->mq_ops || !q->mq_ops->poll || !blk_qc_t_valid(cookie) ||
-	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+	if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
 		return false;

-	plug = current->plug;
-	if (plug)
-		blk_flush_plug_list(plug, false);
-
 	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
 	if (!blk_qc_t_is_internal(cookie))
 		rq = blk_mq_tag_to_rq(hctx->tags, blk_qc_t_to_tag(cookie));
@@ -2891,7 +2888,6 @@ bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
 	return __blk_mq_poll(hctx, rq);
 }
-EXPORT_SYMBOL_GPL(blk_mq_poll);

 static int __init blk_mq_init(void)
 {
diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
index 0d4c23dc4532..db632818777d 100644
--- a/drivers/nvme/target/io-cmd.c
+++ b/drivers/nvme/target/io-cmd.c
@@ -94,7 +94,7 @@ static void nvmet_execute_rw(struct nvmet_req *req)
 	cookie = submit_bio(bio);

-	blk_mq_poll(bdev_get_queue(req->ns->bdev), cookie);
+	blk_poll(bdev_get_queue(req->ns->bdev), cookie);
 }

 static void nvmet_execute_flush(struct nvmet_req *req)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93d088ffc05c..49a55246ba50 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -249,7 +249,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		if (!READ_ONCE(bio.bi_private))
 			break;
 		if (!(iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_mq_poll(bdev_get_queue(bdev), qc))
+		    !blk_poll(bdev_get_queue(bdev), qc))
 			io_schedule();
 	}
 	__set_current_state(TASK_RUNNING);
@@ -414,7 +414,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 			break;
 		if (!(iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_mq_poll(bdev_get_queue(bdev), qc))
+		    !blk_poll(bdev_get_queue(bdev), qc))
 			io_schedule();
 	}
 	__set_current_state(TASK_RUNNING);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 62cf812ed0e5..d2bc339cb1e9 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -486,7 +486,7 @@ static struct bio *dio_await_one(struct dio *dio)
 		dio->waiter = current;
 		spin_unlock_irqrestore(&dio->bio_lock, flags);
 		if (!(dio->iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_mq_poll(dio->bio_disk->queue, dio->bio_cookie))
+		    !blk_poll(dio->bio_disk->queue, dio->bio_cookie))
nvme multipath support V6
Hi all,

this series adds support for multipathing, that is, accessing nvme
namespaces through multiple controllers, to the nvme core driver.

It is a very thin and efficient implementation that relies on close
cooperation with other bits of the nvme driver, and a few small and simple
block helpers.

Compared to dm-multipath, the important differences are how management of
the paths is done, and how the I/O path works.

Management of the paths is fully integrated into the nvme driver: for each
newly found nvme controller we check if there are other controllers that
refer to the same subsystem, and if so we link them up in the nvme driver.
Then for each namespace found we check if the namespace id and identifiers
match to check if we have multiple controllers that refer to the same
namespace. For now path availability is based entirely on the controller
status, which at least for fabrics will be continuously updated based on the
mandatory keep alive timer. Once the Asynchronous Namespace Access (ANA)
proposal passes in NVMe we will also get per-namespace states in addition to
that, but for now any details of that remain confidential to NVMe members.

The I/O path is very different from the existing multipath drivers, which is
enabled by the fact that NVMe (unlike SCSI) does not support partial
completions - a controller will either complete a whole command or not, but
never complete only parts of it. Because of that there is no need to clone
bios or requests - the I/O path simply redirects the I/O to a suitable path.
For successful commands multipath is not in the completion stack at all. For
failed commands we decide if the error could be a path failure, and if yes
remove the bios from the request structure and requeue them before completing
the request.

All together this means there is no performance degradation compared to
normal nvme operation when using the multipath device node (at least not
until I find a dual ported DRAM backed device :))

A git tree is available at:

    git://git.infradead.org/users/hch/block.git nvme-mpath

gitweb:

    http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-mpath

Changes since V5:
 - dropped various prep patches merged into the nvme tree
 - removed a leftover debug printk
 - small cleanups to blk_steal_bio
 - add a sysfs attribute for hidden gendisks
 - don't complete cancelled requests if another path is available
 - use the instance index for device naming
 - make the multipath code optional at compile time
 - don't use the multipath node for single-controller subsystems
 - add a module_option to disable multipathing (temporarily)

Changes since V4:
 - add a refcount to release the device in struct nvme_subsystem
 - use the instance to name the nvme_subsystems in sysfs
 - remove a NULL check before nvme_put_ns_head
 - take a ns_head reference in ->open
 - improve open protection for GENHD_FL_HIDDEN
 - add poll support for the mpath device

Changes since V3:
 - new block layer support for hidden gendisks
 - a couple new patches to refactor device handling before the actual
   multipath support
 - don't expose per-controller block device nodes
 - use /dev/nvmeXnZ as the device nodes for the whole subsystem
 - expose subsystems in sysfs (Hannes Reinecke)
 - fix a subsystem leak when duplicate NQNs are found
 - fix up some names
 - don't clear current_path if freeing a different namespace

Changes since V2:
 - don't create duplicate subsystems on reset (Keith Busch)
 - free requests properly when failing over in I/O completion (Keith Busch)
 - new device names: /dev/nvm-sub%dn%d
 - expose the namespace identification sysfs files for the mpath nodes

Changes since V1:
 - introduce new nvme_ns_ids structure to clean up identifier handling
 - generic_make_request_fast is now named direct_make_request and calls
   generic_make_request_checks
 - reset bi_disk on resubmission
 - create sysfs links between the existing nvme namespace block devices and
   the new shared mpath device
 - temporarily added the timeout patches from James; this should go into
   nvme-4.14, though
[PATCH 4/5] nvme: implement multipath access to nvme subsystems
This patch adds native multipath support to the nvme driver. For each
namespace we create only a single block device node, which can be used to
access that namespace through any of the controllers that refer to it. The
gendisk for each controller's path to the namespace still exists inside the
kernel, but is hidden from userspace. The character device nodes are still
available on a per-controller basis. A new link from the sysfs directory for
the subsystem allows finding all controllers for a given subsystem.

Currently we will always send I/O to the first available path; this will be
changed once the NVMe Asynchronous Namespace Access (ANA) TP is ratified and
implemented, at which point we will look at the ANA state for each namespace.
Another possibility that was prototyped is to use the path that is closest to
the submitting NUMA node, which will be mostly interesting for PCI, but might
also be useful for RDMA or FC transports in the future. There is no plan to
implement round robin or I/O service time path selectors, as those are not
scalable with the performance rates provided by NVMe.

The multipath device will go away once all paths to it disappear; any delay
to keep it alive needs to be implemented at the controller level.

Signed-off-by: Christoph Hellwig
---
 drivers/nvme/host/Kconfig     |   9 ++
 drivers/nvme/host/Makefile    |   1 +
 drivers/nvme/host/core.c      | 133 +++---
 drivers/nvme/host/multipath.c | 255 ++
 drivers/nvme/host/nvme.h      |  57 ++
 5 files changed, 440 insertions(+), 15 deletions(-)
 create mode 100644 drivers/nvme/host/multipath.c

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 46d6cb1e03bd..45886800a1df 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -13,6 +13,15 @@ config BLK_DEV_NVME
 	  To compile this driver as a module, choose M here: the
 	  module will be called nvme.

+config NVME_MULTIPATH
+	bool "NVMe multipath support"
+	depends on NVME_CORE
+	---help---
+	  This option enables support for multipath access to NVMe
+	  subsystems. If this option is enabled only a single
+	  /dev/nvmeXnY device will show up for each NVMe namespace,
+	  even if it is accessible through multiple controllers.
+
 config NVME_FABRICS
 	tristate
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index cc0aacb4c8b4..b856f2f549cd 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_NVME_RDMA) += nvme-rdma.o
 obj-$(CONFIG_NVME_FC)   += nvme-fc.o

 nvme-core-y                        := core.o
+nvme-core-$(CONFIG_NVME_MULTIPATH) += multipath.o
 nvme-core-$(CONFIG_NVM)            += lightnvm.o

 nvme-y                             += pci.o
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7a9ea44814ed..09f0e826b426 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -177,17 +177,22 @@ static inline bool nvme_req_needs_retry(struct request *req)
 		return false;
 	if (nvme_req(req)->retries >= nvme_max_retries)
 		return false;
-	if (blk_queue_dying(req->q))
-		return false;
 	return true;
 }

 void nvme_complete_rq(struct request *req)
 {
 	if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
-		nvme_req(req)->retries++;
-		blk_mq_requeue_request(req, true);
-		return;
+		if (nvme_req_needs_failover(req)) {
+			nvme_failover_req(req);
+			return;
+		}
+
+		if (!blk_queue_dying(req->q)) {
+			nvme_req(req)->retries++;
+			blk_mq_requeue_request(req, true);
+			return;
+		}
 	}

 	blk_mq_end_request(req, nvme_error_status(req));
@@ -278,7 +283,8 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 	ctrl->state = new_state;

 	spin_unlock_irqrestore(&ctrl->lock, flags);
-
+	if (changed && ctrl->state == NVME_CTRL_LIVE)
+		nvme_kick_requeue_lists(ctrl);
 	return changed;
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
@@ -288,6 +294,7 @@ static void nvme_free_ns_head(struct kref *ref)
 	struct nvme_ns_head *head =
 		container_of(ref, struct nvme_ns_head, ref);

+	nvme_mpath_remove_disk(head);
 	ida_simple_remove(&head->subsys->ns_ida, head->instance);
 	list_del_init(&head->entry);
 	cleanup_srcu_struct(&head->srcu);
@@ -1051,11 +1058,33 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	return status;
 }

-static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
-		unsigned int cmd, unsigned long arg)
+/*
+ * Issue ioctl requests on
[PATCH 2/5] nvme: introduce a nvme_ns_ids structure
This allows us to manage the various unique namespace identifiers together,
instead of needing various variables and arguments.

Signed-off-by: Christoph Hellwig
Reviewed-by: Keith Busch
Reviewed-by: Sagi Grimberg
Reviewed-by: Hannes Reinecke
---
 drivers/nvme/host/core.c | 69 +++-
 drivers/nvme/host/nvme.h | 14 +++---
 2 files changed, 49 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f385c35ed2d9..dc0bbbd4ef46 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -788,7 +788,7 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
 }

 static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
-		u8 *eui64, u8 *nguid, uuid_t *uuid)
+		struct nvme_ns_ids *ids)
 {
 	struct nvme_command c = { };
 	int status;
@@ -824,7 +824,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
 				goto free_data;
 			}
 			len = NVME_NIDT_EUI64_LEN;
-			memcpy(eui64, data + pos + sizeof(*cur), len);
+			memcpy(ids->eui64, data + pos + sizeof(*cur), len);
 			break;
 		case NVME_NIDT_NGUID:
 			if (cur->nidl != NVME_NIDT_NGUID_LEN) {
@@ -834,7 +834,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
 				goto free_data;
 			}
 			len = NVME_NIDT_NGUID_LEN;
-			memcpy(nguid, data + pos + sizeof(*cur), len);
+			memcpy(ids->nguid, data + pos + sizeof(*cur), len);
 			break;
 		case NVME_NIDT_UUID:
 			if (cur->nidl != NVME_NIDT_UUID_LEN) {
@@ -844,7 +844,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
 				goto free_data;
 			}
 			len = NVME_NIDT_UUID_LEN;
-			uuid_copy(uuid, data + pos + sizeof(*cur));
+			uuid_copy(&ids->uuid, data + pos + sizeof(*cur));
 			break;
 		default:
 			/* Skip unnkown types */
@@ -1146,22 +1146,31 @@ static void nvme_config_discard(struct nvme_ctrl *ctrl,
 }

 static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
-		struct nvme_id_ns *id, u8 *eui64, u8 *nguid, uuid_t *uuid)
+		struct nvme_id_ns *id, struct nvme_ns_ids *ids)
 {
+	memset(ids, 0, sizeof(*ids));
+
 	if (ctrl->vs >= NVME_VS(1, 1, 0))
-		memcpy(eui64, id->eui64, sizeof(id->eui64));
+		memcpy(ids->eui64, id->eui64, sizeof(id->eui64));
 	if (ctrl->vs >= NVME_VS(1, 2, 0))
-		memcpy(nguid, id->nguid, sizeof(id->nguid));
+		memcpy(ids->nguid, id->nguid, sizeof(id->nguid));
 	if (ctrl->vs >= NVME_VS(1, 3, 0)) {
 		/* Don't treat error as fatal we potentially
 		 * already have a NGUID or EUI-64
 		 */
-		if (nvme_identify_ns_descs(ctrl, nsid, eui64, nguid, uuid))
+		if (nvme_identify_ns_descs(ctrl, nsid, ids))
 			dev_warn(ctrl->device, "%s: Identify Descriptors failed\n",
 				 __func__);
 	}
 }

+static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
+{
+	return uuid_equal(&a->uuid, &b->uuid) &&
+		memcmp(&a->nguid, &b->nguid, sizeof(a->nguid)) == 0 &&
+		memcmp(&a->eui64, &b->eui64, sizeof(a->eui64)) == 0;
+}
+
 static void nvme_update_disk_info(struct gendisk *disk, struct nvme_ns *ns,
 		struct nvme_id_ns *id)
 {
@@ -1217,8 +1226,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 	struct nvme_ns *ns = disk->private_data;
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	struct nvme_id_ns *id;
-	u8 eui64[8] = { 0 }, nguid[16] = { 0 };
-	uuid_t uuid = uuid_null;
+	struct nvme_ns_ids ids;
 	int ret = 0;

 	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
@@ -1235,10 +1243,8 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		goto out;
 	}

-	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
-	if (!uuid_equal(&ns->uuid, &uuid) ||
-	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
-	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
+	nvme_report_ns_ids(ctrl, ns->ns_id, id, &ids);
+	if (!nvme_ns_ids_equal(&ns->ids, &ids)) {
 		dev_err(ctrl->device,
 			"identifiers changed for nsid %d\n", ns->ns_id);
 		ret = -ENODEV;
@@ -2142,18 +2148,19 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
[PATCH 3/5] nvme: track shared namespaces
Introduce a new struct nvme_ns_head that holds information about an actual namespace, unlike struct nvme_ns, which only holds the per-controller namespace information. For private namespaces there is a 1:1 relation of the two, but for shared namespaces this lets us discover all the paths to it. For now only the identifiers are moved to the new structure, but most of the information in struct nvme_ns should eventually move over. To allow lockless path lookup the list of nvme_ns structures per nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns structure through call_srcu. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Javier González Reviewed-by: Johannes Thumshirn Reviewed-by: Hannes Reinecke --- drivers/nvme/host/core.c | 220 +++ drivers/nvme/host/lightnvm.c | 14 +-- drivers/nvme/host/nvme.h | 26 - 3 files changed, 210 insertions(+), 50 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index dc0bbbd4ef46..7a9ea44814ed 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -283,6 +283,22 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl, } EXPORT_SYMBOL_GPL(nvme_change_ctrl_state); +static void nvme_free_ns_head(struct kref *ref) +{ + struct nvme_ns_head *head = + container_of(ref, struct nvme_ns_head, ref); + + ida_simple_remove(&head->subsys->ns_ida, head->instance); + list_del_init(&head->entry); + cleanup_srcu_struct(&head->srcu); + kfree(head); +} + +static void nvme_put_ns_head(struct nvme_ns_head *head) +{ + kref_put(&head->ref, nvme_free_ns_head); +} + static void nvme_free_ns(struct kref *kref) { struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref); @@ -291,7 +307,7 @@ static void nvme_free_ns(struct kref *kref) nvme_nvm_unregister(ns); put_disk(ns->disk); - ida_simple_remove(&ns->ctrl->ns_ida, ns->instance); + nvme_put_ns_head(ns->head); nvme_put_ctrl(ns->ctrl); kfree(ns); } @@ -427,7 +443,7 @@ static inline void nvme_setup_flush(struct nvme_ns *ns, { memset(cmnd, 0, sizeof(*cmnd)); cmnd->common.opcode = nvme_cmd_flush; - cmnd->common.nsid = cpu_to_le32(ns->ns_id); + cmnd->common.nsid = cpu_to_le32(ns->head->ns_id); } static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, @@ -458,7 +474,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, memset(cmnd, 0, sizeof(*cmnd)); cmnd->dsm.opcode = nvme_cmd_dsm; - cmnd->dsm.nsid = cpu_to_le32(ns->ns_id); + cmnd->dsm.nsid = cpu_to_le32(ns->head->ns_id); cmnd->dsm.nr = cpu_to_le32(segments - 1); cmnd->dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD); @@ -497,7 +513,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns, memset(cmnd, 0, sizeof(*cmnd)); cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read); - cmnd->rw.nsid = cpu_to_le32(ns->ns_id); + cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id); cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req))); cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1); @@ -978,7 +994,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio) memset(&c, 0, sizeof(c)); c.rw.opcode = io.opcode; c.rw.flags = io.flags; - c.rw.nsid = cpu_to_le32(ns->ns_id); + c.rw.nsid = cpu_to_le32(ns->head->ns_id); c.rw.slba = cpu_to_le64(io.slba); c.rw.length = cpu_to_le16(io.nblocks); c.rw.control = cpu_to_le16(io.control); @@ -1043,7 +1059,7 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode, switch (cmd) { case NVME_IOCTL_ID: force_successful_syscall_return(); - return ns->ns_id; + return ns->head->ns_id; case NVME_IOCTL_ADMIN_CMD: return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg); case NVME_IOCTL_IO_CMD: @@ -1164,6 +1180,13 @@ static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid, } } +static bool nvme_ns_ids_valid(struct nvme_ns_ids *ids) +{ + return !uuid_is_null(&ids->uuid) || + memchr_inv(ids->nguid, 0, sizeof(ids->nguid)) || + memchr_inv(ids->eui64, 0, sizeof(ids->eui64)); +} + static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b) { return uuid_equal(&a->uuid, &b->uuid) && @@ -1234,7 +1257,7 @@ static int nvme_revalidate_disk(struct gendisk *disk) return -ENODEV; } - id = nvme_identify_ns(ctrl, ns->ns_id); + id = nvme_identify_ns(ctrl, ns->head->ns_id); if (!id) return -ENODEV; @@ -1243,10 +1266,10 @@ static int
[PATCH 6/7] block: introduce GENHD_FL_HIDDEN
With this flag a driver can create a gendisk that can be used for I/O submission inside the kernel, but which is not registered as a user-facing block device. This will be useful for the NVMe multipath implementation. Signed-off-by: Christoph Hellwig --- block/genhd.c | 68 +-- include/linux/genhd.h | 1 + 2 files changed, 51 insertions(+), 18 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index 1174d24e405e..835e907e6e01 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -585,6 +585,11 @@ static void register_disk(struct device *parent, struct gendisk *disk) */ pm_runtime_set_memalloc_noio(ddev, true); + if (disk->flags & GENHD_FL_HIDDEN) { + dev_set_uevent_suppress(ddev, 0); + return; + } + disk->part0.holder_dir = kobject_create_and_add("holders", &ddev->kobj); disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj); @@ -616,6 +621,11 @@ static void register_disk(struct device *parent, struct gendisk *disk) while ((part = disk_part_iter_next(&piter))) kobject_uevent(&part_to_dev(part)->kobj, KOBJ_ADD); disk_part_iter_exit(&piter); + + err = sysfs_create_link(&ddev->kobj, + &disk->queue->backing_dev_info->dev->kobj, + "bdi"); + WARN_ON(err); } /** @@ -630,7 +640,6 @@ static void register_disk(struct device *parent, struct gendisk *disk) */ void device_add_disk(struct device *parent, struct gendisk *disk) { - struct backing_dev_info *bdi; dev_t devt; int retval; @@ -639,7 +648,8 @@ void device_add_disk(struct device *parent, struct gendisk *disk) * parameters make sense. */ WARN_ON(disk->minors && !(disk->major || disk->first_minor)); - WARN_ON(!disk->minors && !(disk->flags & GENHD_FL_EXT_DEVT)); + WARN_ON(!disk->minors && + !(disk->flags & (GENHD_FL_EXT_DEVT | GENHD_FL_HIDDEN))); disk->flags |= GENHD_FL_UP; @@ -648,18 +658,26 @@ void device_add_disk(struct device *parent, struct gendisk *disk) WARN_ON(1); return; } - disk_to_dev(disk)->devt = devt; disk->major = MAJOR(devt); disk->first_minor = MINOR(devt); disk_alloc_events(disk); - /* Register BDI before referencing it from bdev */ - bdi = disk->queue->backing_dev_info; - bdi_register_owner(bdi, disk_to_dev(disk)); - - blk_register_region(disk_devt(disk), disk->minors, NULL, - exact_match, exact_lock, disk); + if (disk->flags & GENHD_FL_HIDDEN) { + /* + * Don't let hidden disks show up in /proc/partitions, + * and don't bother scanning for partitions either. + */ + disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO; + disk->flags |= GENHD_FL_NO_PART_SCAN; + } else { + /* Register BDI before referencing it from bdev */ + disk_to_dev(disk)->devt = devt; + bdi_register_owner(disk->queue->backing_dev_info, + disk_to_dev(disk)); + blk_register_region(disk_devt(disk), disk->minors, NULL, + exact_match, exact_lock, disk); + } register_disk(parent, disk); blk_register_queue(disk); @@ -669,10 +687,6 @@ void device_add_disk(struct device *parent, struct gendisk *disk) */ WARN_ON_ONCE(!blk_get_queue(disk->queue)); - retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj, - "bdi"); - WARN_ON(retval); - disk_add_events(disk); blk_integrity_add(disk); } @@ -701,7 +715,8 @@ void del_gendisk(struct gendisk *disk) set_capacity(disk, 0); disk->flags &= ~GENHD_FL_UP; - sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi"); + if (!(disk->flags & GENHD_FL_HIDDEN)) + sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi"); if (disk->queue) { /* * Unregister bdi before releasing device numbers (as they can @@ -712,13 +727,15 @@ void del_gendisk(struct gendisk *disk) } else { WARN_ON(1); } - blk_unregister_region(disk_devt(disk), disk->minors); + + if (!(disk->flags & GENHD_FL_HIDDEN)) { + blk_unregister_region(disk_devt(disk), disk->minors); + kobject_put(disk->part0.holder_dir); + kobject_put(disk->slave_dir); + } part_stat_set_all(&disk->part0, 0); disk->part0.stamp = 0; - - kobject_put(disk->part0.holder_dir); - kobject_put(disk->slave_dir); if (!sysfs_deprecated) sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk))); pm_runtime_set_memalloc_noio(disk_to_dev(disk), false); @@ -781,6
[PATCH 1/5] nvme: track subsystems
This adds a new nvme_subsystem structure so that we can track multiple controllers that belong to a single subsystem. For now we only use it to store the NQN, and to check that we don't have duplicate NQNs unless the involved subsystems support multiple controllers. Includes code originally from Hannes Reinecke to expose the subsystems in sysfs. Signed-off-by: Christoph Hellwig Reviewed-by: Hannes Reinecke --- drivers/nvme/host/core.c | 210 ++-- drivers/nvme/host/fabrics.c | 4 +- drivers/nvme/host/nvme.h | 27 -- 3 files changed, 205 insertions(+), 36 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 66211617e7e8..f385c35ed2d9 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -68,9 +68,14 @@ MODULE_PARM_DESC(streams, "turn on support for Streams write directives"); struct workqueue_struct *nvme_wq; EXPORT_SYMBOL_GPL(nvme_wq); +static DEFINE_IDA(nvme_subsystems_ida); +static LIST_HEAD(nvme_subsystems); +static DEFINE_MUTEX(nvme_subsystems_lock); + static DEFINE_IDA(nvme_instance_ida); static dev_t nvme_chr_devt; static struct class *nvme_class; +static struct class *nvme_subsys_class; static __le32 nvme_get_log_dw10(u8 lid, size_t size) { @@ -1717,14 +1722,15 @@ static bool quirk_matches(const struct nvme_id_ctrl *id, string_matches(id->fr, q->fr, sizeof(id->fr)); } -static void nvme_init_subnqn(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) +static void nvme_init_subnqn(struct nvme_subsystem *subsys, struct nvme_ctrl *ctrl, + struct nvme_id_ctrl *id) { size_t nqnlen; int off; nqnlen = strnlen(id->subnqn, NVMF_NQN_SIZE); if (nqnlen > 0 && nqnlen < NVMF_NQN_SIZE) { - strcpy(ctrl->subnqn, id->subnqn); + strcpy(subsys->subnqn, id->subnqn); return; } @@ -1732,14 +1738,140 @@ static void nvme_init_subnqn(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) dev_warn(ctrl->device, "missing or invalid SUBNQN field.\n"); /* Generate a "fake" NQN per Figure 254 in NVMe 1.3 + ECN 001 */ - off = snprintf(ctrl->subnqn, NVMF_NQN_SIZE, + off = snprintf(subsys->subnqn, NVMF_NQN_SIZE, "nqn.2014.08.org.nvmexpress:%4x%4x", le16_to_cpu(id->vid), le16_to_cpu(id->ssvid)); - memcpy(ctrl->subnqn + off, id->sn, sizeof(id->sn)); + memcpy(subsys->subnqn + off, id->sn, sizeof(id->sn)); off += sizeof(id->sn); - memcpy(ctrl->subnqn + off, id->mn, sizeof(id->mn)); + memcpy(subsys->subnqn + off, id->mn, sizeof(id->mn)); off += sizeof(id->mn); - memset(ctrl->subnqn + off, 0, sizeof(ctrl->subnqn) - off); + memset(subsys->subnqn + off, 0, sizeof(subsys->subnqn) - off); +} + +static void __nvme_release_subsystem(struct nvme_subsystem *subsys) +{ + ida_simple_remove(&nvme_subsystems_ida, subsys->instance); + kfree(subsys); +} + +static void nvme_release_subsystem(struct device *dev) +{ + __nvme_release_subsystem(container_of(dev, struct nvme_subsystem, dev)); +} + +static void nvme_destroy_subsystem(struct kref *ref) +{ + struct nvme_subsystem *subsys = + container_of(ref, struct nvme_subsystem, ref); + + mutex_lock(&nvme_subsystems_lock); + list_del(&subsys->entry); + mutex_unlock(&nvme_subsystems_lock); + + device_del(&subsys->dev); + put_device(&subsys->dev); +} + +static void nvme_put_subsystem(struct nvme_subsystem *subsys) +{ + kref_put(&subsys->ref, nvme_destroy_subsystem); +} + +static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn) +{ + struct nvme_subsystem *subsys; + + lockdep_assert_held(&nvme_subsystems_lock); + + list_for_each_entry(subsys, &nvme_subsystems, entry) { + if (strcmp(subsys->subnqn, subsysnqn)) + continue; + if (!kref_get_unless_zero(&subsys->ref)) + continue; + return subsys; + } + + return NULL; +} + +static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) +{ + struct nvme_subsystem *subsys, *found; + int ret; + + subsys = kzalloc(sizeof(*subsys), GFP_KERNEL); + if (!subsys) + return -ENOMEM; + ret = ida_simple_get(&nvme_subsystems_ida, 0, 0, GFP_KERNEL); + if (ret < 0) { + kfree(subsys); + return ret; + } + subsys->instance = ret; + mutex_init(&subsys->lock); + kref_init(&subsys->ref); + INIT_LIST_HEAD(&subsys->ctrls); + nvme_init_subnqn(subsys, ctrl, id); + memcpy(subsys->serial, id->sn, sizeof(subsys->serial)); + memcpy(subsys->model, id->mn, sizeof(subsys->model)); + memcpy(subsys->firmware_rev, id->fr, sizeof(subsys->firmware_rev)); + subsys->vendor_id = le16_to_cpu(id->vid); + subsys->cmic = id->cmic; +
[PATCH 5/7] block: don't look at the struct device dev_t in disk_devt
The hidden gendisks introduced in the next patch need to keep the dev field in their struct device empty so that udev won't try to create block device nodes for them. To support that rewrite disk_devt to look at the major and first_minor fields in the gendisk itself instead of looking into the struct device. Signed-off-by: Christoph HellwigReviewed-by: Johannes Thumshirn Reviewed-by: Hannes Reinecke --- block/genhd.c | 4 include/linux/genhd.h | 2 +- 2 files changed, 1 insertion(+), 5 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index dd305c65ffb0..1174d24e405e 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -649,10 +649,6 @@ void device_add_disk(struct device *parent, struct gendisk *disk) return; } disk_to_dev(disk)->devt = devt; - - /* ->major and ->first_minor aren't supposed to be -* dereferenced from here on, but set them just in case. -*/ disk->major = MAJOR(devt); disk->first_minor = MINOR(devt); diff --git a/include/linux/genhd.h b/include/linux/genhd.h index ea652bfcd675..5c0ed5db33c2 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -234,7 +234,7 @@ static inline bool disk_part_scan_enabled(struct gendisk *disk) static inline dev_t disk_devt(struct gendisk *disk) { - return disk_to_dev(disk)->devt; + return MKDEV(disk->major, disk->first_minor); } static inline dev_t part_devt(struct hd_struct *part) -- 2.14.2
[PATCH 4/7] block: add a blk_steal_bios helper
This helper allows stealing the uncompleted bios from a request so that they can be reissued on another path. Signed-off-by: Christoph Hellwig Reviewed-by: Sagi Grimberg Reviewed-by: Hannes Reinecke --- block/blk-core.c | 21 + include/linux/blkdev.h | 2 ++ 2 files changed, 23 insertions(+) diff --git a/block/blk-core.c b/block/blk-core.c index b8c80f39f5fe..54fe673fb3ba 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2767,6 +2767,27 @@ struct request *blk_fetch_request(struct request_queue *q) } EXPORT_SYMBOL(blk_fetch_request); +/* + * Steal bios from a request and add them to a bio list. + * The request must not have been partially completed before. + */ +void blk_steal_bios(struct bio_list *list, struct request *rq) +{ + if (rq->bio) { + if (list->tail) + list->tail->bi_next = rq->bio; + else + list->head = rq->bio; + list->tail = rq->biotail; + + rq->bio = NULL; + rq->biotail = NULL; + } + + rq->__data_len = 0; +} +EXPORT_SYMBOL_GPL(blk_steal_bios); + /** * blk_update_request - Special helper function for request stacking drivers * @req: the request being processed diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 780f01db5899..45c63764a14e 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1110,6 +1110,8 @@ extern struct request *blk_peek_request(struct request_queue *q); extern void blk_start_request(struct request *rq); extern struct request *blk_fetch_request(struct request_queue *q); +void blk_steal_bios(struct bio_list *list, struct request *rq); + /* * Request completion related functions. * -- 2.14.2
[PATCH 2/7] block: add REQ_DRV bit
Set aside a bit in the request/bio flags for driver use. Signed-off-by: Christoph HellwigReviewed-by: Sagi Grimberg Reviewed-by: Hannes Reinecke Reviewed-by: Johannes Thumshirn --- include/linux/blk_types.h | 5 + 1 file changed, 5 insertions(+) diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index acc2f3cdc2fc..7ec2ed097a8a 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -229,6 +229,9 @@ enum req_flag_bits { /* command specific flags for REQ_OP_WRITE_ZEROES: */ __REQ_NOUNMAP, /* do not free blocks when zeroing */ + /* for driver use */ + __REQ_DRV, + __REQ_NR_BITS, /* stops here */ }; @@ -249,6 +252,8 @@ enum req_flag_bits { #define REQ_NOUNMAP(1ULL << __REQ_NOUNMAP) +#define REQ_DRV(1ULL << __REQ_DRV) + #define REQ_FAILFAST_MASK \ (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER) -- 2.14.2
[PATCH 3/7] block: provide a direct_make_request helper
This helper allows reinserting a bio into a new queue without much overhead, but requires all queue limits to be the same for the upper and lower queues, and it does not provide any recursion prevention. Signed-off-by: Christoph Hellwig Reviewed-by: Sagi Grimberg Reviewed-by: Javier González Reviewed-by: Hannes Reinecke --- block/blk-core.c | 34 ++ include/linux/blkdev.h | 1 + 2 files changed, 35 insertions(+) diff --git a/block/blk-core.c b/block/blk-core.c index 14f7674fa0b1..b8c80f39f5fe 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2241,6 +2241,40 @@ blk_qc_t generic_make_request(struct bio *bio) } EXPORT_SYMBOL(generic_make_request); +/** + * direct_make_request - hand a buffer directly to its device driver for I/O + * @bio: The bio describing the location in memory and on the device. + * + * This function behaves like generic_make_request(), but does not protect + * against recursion. Must only be used if the called driver is known + * to not call generic_make_request (or direct_make_request) again from + * its make_request function. (Calling direct_make_request again from + * a workqueue is perfectly fine as that doesn't recurse).
+ */ +blk_qc_t direct_make_request(struct bio *bio) +{ + struct request_queue *q = bio->bi_disk->queue; + bool nowait = bio->bi_opf & REQ_NOWAIT; + blk_qc_t ret; + + if (!generic_make_request_checks(bio)) + return BLK_QC_T_NONE; + + if (unlikely(blk_queue_enter(q, nowait))) { + if (nowait && !blk_queue_dying(q)) + bio->bi_status = BLK_STS_AGAIN; + else + bio->bi_status = BLK_STS_IOERR; + bio_endio(bio); + return BLK_QC_T_NONE; + } + + ret = q->make_request_fn(q, bio); + blk_queue_exit(q); + return ret; +} +EXPORT_SYMBOL_GPL(direct_make_request); + /** * submit_bio - submit a bio to the block device layer for I/O * @bio: The bio which describes the I/O diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 02fa42d24b52..780f01db5899 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -936,6 +936,7 @@ do { \ extern int blk_register_queue(struct gendisk *disk); extern void blk_unregister_queue(struct gendisk *disk); extern blk_qc_t generic_make_request(struct bio *bio); +extern blk_qc_t direct_make_request(struct bio *bio); extern void blk_rq_init(struct request_queue *q, struct request *rq); extern void blk_init_request_from_bio(struct request *req, struct bio *bio); extern void blk_put_request(struct request *); -- 2.14.2
block layer patches for nvme multipath support
Hi Jens, these patches add the block layer helpers / tweaks for NVMe multipath support. Can you review them for inclusion? There have been no functional changes relative to the versions posted with the previous NVMe multipath patch set.
[PATCH 1/7] block: move REQ_NOWAIT
This flag should be before the operation-specific REQ_NOUNMAP bit. Signed-off-by: Christoph HellwigReviewed-by: Sagi Grimberg Reviewed-by: Hannes Reinecke Reviewed-by: Johannes Thumshirn --- include/linux/blk_types.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index a2d2aa709cef..acc2f3cdc2fc 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -224,11 +224,11 @@ enum req_flag_bits { __REQ_PREFLUSH, /* request for cache flush */ __REQ_RAHEAD, /* read ahead, can fail anytime */ __REQ_BACKGROUND, /* background IO */ + __REQ_NOWAIT, /* Don't wait if request will block */ /* command specific flags for REQ_OP_WRITE_ZEROES: */ __REQ_NOUNMAP, /* do not free blocks when zeroing */ - __REQ_NOWAIT, /* Don't wait if request will block */ __REQ_NR_BITS, /* stops here */ }; @@ -245,9 +245,9 @@ enum req_flag_bits { #define REQ_PREFLUSH (1ULL << __REQ_PREFLUSH) #define REQ_RAHEAD (1ULL << __REQ_RAHEAD) #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND) +#define REQ_NOWAIT (1ULL << __REQ_NOWAIT) #define REQ_NOUNMAP(1ULL << __REQ_NOUNMAP) -#define REQ_NOWAIT (1ULL << __REQ_NOWAIT) #define REQ_FAILFAST_MASK \ (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER) -- 2.14.2
[PATCH 5/6] nvme: set the chunk size before freezing the queue
We don't need a frozen queue to update the chunk_size, which is just a hint, and moving it a little earlier will allow for some better code reuse with the multipath code. Signed-off-by: Christoph Hellwig --- drivers/nvme/host/core.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index bfdbecdc307a..3758528be0ef 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1184,12 +1184,13 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id) if (ctrl->nr_streams && ns->sws && ns->sgs) stream_alignment = ns->sws * ns->sgs; + if (ns->noiob) + nvme_set_chunk_size(ns); + blk_mq_freeze_queue(disk->queue); blk_integrity_unregister(disk); blk_queue_logical_block_size(ns->queue, bs); - if (ns->noiob) - nvme_set_chunk_size(ns); if (ns->ms && !ns->ext && (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)) nvme_init_integrity(disk, ns->ms, ns->pi_type); -- 2.14.2
[PATCH 6/6] nvme: split __nvme_revalidate_disk
Split out the code that applies the calculated values to a given disk/queue into a new helper that can be reused by the multipath code. Signed-off-by: Christoph Hellwig --- drivers/nvme/host/core.c | 49 +--- 1 file changed, 26 insertions(+), 23 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 3758528be0ef..66211617e7e8 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1157,12 +1157,34 @@ static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid, } } +static void nvme_update_disk_info(struct gendisk *disk, + struct nvme_ns *ns, struct nvme_id_ns *id) +{ + sector_t capacity = le64_to_cpup(&id->nsze) << (ns->lba_shift - 9); + unsigned stream_alignment = 0; + + if (ns->ctrl->nr_streams && ns->sws && ns->sgs) + stream_alignment = ns->sws * ns->sgs; + + blk_mq_freeze_queue(disk->queue); + blk_integrity_unregister(disk); + + blk_queue_logical_block_size(disk->queue, 1 << ns->lba_shift); + if (ns->ms && !ns->ext && + (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)) + nvme_init_integrity(disk, ns->ms, ns->pi_type); + if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk)) + capacity = 0; + set_capacity(disk, capacity); + + if (ns->ctrl->oncs & NVME_CTRL_ONCS_DSM) + nvme_config_discard(ns->ctrl, stream_alignment, disk->queue); + blk_mq_unfreeze_queue(disk->queue); +} + static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id) { struct nvme_ns *ns = disk->private_data; - struct nvme_ctrl *ctrl = ns->ctrl; - unsigned stream_alignment = 0; - u16 bs; /* * If identify namespace failed, use default 512 byte block size so @@ -1171,7 +1193,6 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id) ns->lba_shift = id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ds; if (ns->lba_shift == 0) ns->lba_shift = 9; - bs = 1 << ns->lba_shift; ns->noiob = le16_to_cpu(id->noiob); ns->ext = ns->ms && (id->flbas & NVME_NS_FLBAS_META_EXT); ns->ms = le16_to_cpu(id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ms); @@ -1181,27 +1202,9 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id) else ns->pi_type = 0; - if (ctrl->nr_streams && ns->sws && ns->sgs) - stream_alignment = ns->sws * ns->sgs; - if (ns->noiob) nvme_set_chunk_size(ns); - - blk_mq_freeze_queue(disk->queue); - blk_integrity_unregister(disk); - - blk_queue_logical_block_size(ns->queue, bs); - if (ns->ms && !ns->ext && - (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)) - nvme_init_integrity(disk, ns->ms, ns->pi_type); - if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk)) - set_capacity(disk, 0); - else - set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9)); - - if (ctrl->oncs & NVME_CTRL_ONCS_DSM) - nvme_config_discard(ctrl, stream_alignment, disk->queue); - blk_mq_unfreeze_queue(disk->queue); + nvme_update_disk_info(disk, ns, id); } static int nvme_revalidate_disk(struct gendisk *disk) -- 2.14.2
[PATCH 2/6] nvme: always unregister the integrity profile in __nvme_revalidate_disk
This is safe because the queue is always frozen when we revalidate, and it simplifies both the existing code as well as the multipath implementation. Signed-off-by: Christoph Hellwig--- drivers/nvme/host/core.c | 40 ++-- 1 file changed, 10 insertions(+), 30 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 1c2686d4eaf3..bc27f6603861 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1081,29 +1081,6 @@ static int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo) } #ifdef CONFIG_BLK_DEV_INTEGRITY -static void nvme_prep_integrity(struct gendisk *disk, struct nvme_id_ns *id, - u16 bs) -{ - struct nvme_ns *ns = disk->private_data; - u16 old_ms = ns->ms; - u8 pi_type = 0; - - ns->ms = le16_to_cpu(id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ms); - ns->ext = ns->ms && (id->flbas & NVME_NS_FLBAS_META_EXT); - - /* PI implementation requires metadata equal t10 pi tuple size */ - if (ns->ms == sizeof(struct t10_pi_tuple)) - pi_type = id->dps & NVME_NS_DPS_PI_MASK; - - if (blk_get_integrity(disk) && - (ns->pi_type != pi_type || ns->ms != old_ms || -bs != queue_logical_block_size(disk->queue) || -(ns->ms && ns->ext))) - blk_integrity_unregister(disk); - - ns->pi_type = pi_type; -} - static void nvme_init_integrity(struct nvme_ns *ns) { struct blk_integrity integrity; @@ -1130,10 +1107,6 @@ static void nvme_init_integrity(struct nvme_ns *ns) blk_queue_max_integrity_segments(ns->queue, 1); } #else -static void nvme_prep_integrity(struct gendisk *disk, struct nvme_id_ns *id, - u16 bs) -{ -} static void nvme_init_integrity(struct nvme_ns *ns) { } @@ -1202,15 +1175,22 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id) ns->lba_shift = 9; bs = 1 << ns->lba_shift; ns->noiob = le16_to_cpu(id->noiob); + ns->ext = ns->ms && (id->flbas & NVME_NS_FLBAS_META_EXT); + ns->ms = le16_to_cpu(id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ms); + /* the PI implementation requires metadata equal t10 
pi tuple size */ + if (ns->ms == sizeof(struct t10_pi_tuple)) + ns->pi_type = id->dps & NVME_NS_DPS_PI_MASK; + else + ns->pi_type = 0; blk_mq_freeze_queue(disk->queue); + blk_integrity_unregister(disk); - if (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED) - nvme_prep_integrity(disk, id, bs); blk_queue_logical_block_size(ns->queue, bs); if (ns->noiob) nvme_set_chunk_size(ns); - if (ns->ms && !blk_get_integrity(disk) && !ns->ext) + if (ns->ms && !ns->ext && + (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)) nvme_init_integrity(ns); if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk)) set_capacity(disk, 0); -- 2.14.2
[PATCH 1/6] nvme: move the dying queue check from cancel to completion
With multipath we don't want a hard DNR bit on a request that is cancelled by a controller reset, but instead want to be able to retry it on another path. To achieve this, don't always set the DNR bit when the queue is dying in nvme_cancel_request, but defer that decision to nvme_req_needs_retry. Note that it applies to any command there and not just cancelled commands, but once the queue is dying that is the right thing to do anyway. Signed-off-by: Christoph Hellwig --- drivers/nvme/host/core.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 07b9f4e1c283..1c2686d4eaf3 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -172,6 +172,8 @@ static inline bool nvme_req_needs_retry(struct request *req) return false; if (nvme_req(req)->retries >= nvme_max_retries) return false; + if (blk_queue_dying(req->q)) + return false; return true; } @@ -189,18 +191,13 @@ EXPORT_SYMBOL_GPL(nvme_complete_rq); void nvme_cancel_request(struct request *req, void *data, bool reserved) { - int status; - if (!blk_mq_request_started(req)) return; dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device, "Cancelling I/O %d", req->tag); - status = NVME_SC_ABORT_REQ; - if (blk_queue_dying(req->q)) - status |= NVME_SC_DNR; - nvme_req(req)->status = status; + nvme_req(req)->status = NVME_SC_ABORT_REQ; blk_mq_complete_request(req); } -- 2.14.2
[PATCH 3/6] nvme: don't pass struct nvme_ns to nvme_init_integrity
To allow reusing this function for the multipath node. Signed-off-by: Christoph Hellwig --- drivers/nvme/host/core.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index bc27f6603861..8702e49c5c45 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1081,12 +1081,12 @@ static int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo) } #ifdef CONFIG_BLK_DEV_INTEGRITY -static void nvme_init_integrity(struct nvme_ns *ns) +static void nvme_init_integrity(struct gendisk *disk, u16 ms, u8 pi_type) { struct blk_integrity integrity; memset(&integrity, 0, sizeof(integrity)); - switch (ns->pi_type) { + switch (pi_type) { case NVME_NS_DPS_PI_TYPE3: integrity.profile = &t10_pi_type3_crc; integrity.tag_size = sizeof(u16) + sizeof(u32); @@ -1102,12 +1102,12 @@ static void nvme_init_integrity(struct nvme_ns *ns) integrity.profile = NULL; break; } - integrity.tuple_size = ns->ms; - blk_integrity_register(ns->disk, &integrity); - blk_queue_max_integrity_segments(ns->queue, 1); + integrity.tuple_size = ms; + blk_integrity_register(disk, &integrity); + blk_queue_max_integrity_segments(disk->queue, 1); } #else -static void nvme_init_integrity(struct nvme_ns *ns) +static void nvme_init_integrity(struct gendisk *disk, u16 ms, u8 pi_type) { } #endif /* CONFIG_BLK_DEV_INTEGRITY */ @@ -1191,7 +1191,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id) nvme_set_chunk_size(ns); if (ns->ms && !ns->ext && (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)) - nvme_init_integrity(ns); + nvme_init_integrity(disk, ns->ms, ns->pi_type); if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk)) set_capacity(disk, 0); else -- 2.14.2
[PATCH 4/6] nvme: don't pass struct nvme_ns to nvme_config_discard
To allow reusing this function for the multipath node. Signed-off-by: Christoph Hellwig--- drivers/nvme/host/core.c | 33 + 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 8702e49c5c45..bfdbecdc307a 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1118,29 +1118,26 @@ static void nvme_set_chunk_size(struct nvme_ns *ns) blk_queue_chunk_sectors(ns->queue, rounddown_pow_of_two(chunk_size)); } -static void nvme_config_discard(struct nvme_ns *ns) +static void nvme_config_discard(struct nvme_ctrl *ctrl, + unsigned stream_alignment, struct request_queue *queue) { - struct nvme_ctrl *ctrl = ns->ctrl; - u32 logical_block_size = queue_logical_block_size(ns->queue); + u32 size = queue_logical_block_size(queue); + + if (stream_alignment) + size *= stream_alignment; BUILD_BUG_ON(PAGE_SIZE / sizeof(struct nvme_dsm_range) < NVME_DSM_MAX_RANGES); - if (ctrl->nr_streams && ns->sws && ns->sgs) { - unsigned int sz = logical_block_size * ns->sws * ns->sgs; + queue->limits.discard_alignment = size; + queue->limits.discard_granularity = size; - ns->queue->limits.discard_alignment = sz; - ns->queue->limits.discard_granularity = sz; - } else { - ns->queue->limits.discard_alignment = logical_block_size; - ns->queue->limits.discard_granularity = logical_block_size; - } - blk_queue_max_discard_sectors(ns->queue, UINT_MAX); - blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES); - queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue); + blk_queue_max_discard_sectors(queue, UINT_MAX); + blk_queue_max_discard_segments(queue, NVME_DSM_MAX_RANGES); + queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, queue); if (ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES) - blk_queue_max_write_zeroes_sectors(ns->queue, UINT_MAX); + blk_queue_max_write_zeroes_sectors(queue, UINT_MAX); } static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid, @@ -1164,6 +1161,7 @@ static void 
__nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id) { struct nvme_ns *ns = disk->private_data; struct nvme_ctrl *ctrl = ns->ctrl; + unsigned stream_alignment = 0; u16 bs; /* @@ -1183,6 +1181,9 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id) else ns->pi_type = 0; + if (ctrl->nr_streams && ns->sws && ns->sgs) + stream_alignment = ns->sws * ns->sgs; + blk_mq_freeze_queue(disk->queue); blk_integrity_unregister(disk); @@ -1198,7 +1199,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id) set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9)); if (ctrl->oncs & NVME_CTRL_ONCS_DSM) - nvme_config_discard(ctrl, stream_alignment, disk->queue); blk_mq_unfreeze_queue(disk->queue); } -- 2.14.2
nvme cleanups in preparation for the multipath code
Hi all, below are a couple of cleanup patches to prepare for the next version of the nvme multipath code.
Re: [PATCH 16/17] nvme: implement multipath access to nvme subsystems
On Mon, Oct 30, 2017 at 11:37:55AM +0800, Guan Junxiong wrote: > > + head->disk->flags = GENHD_FL_EXT_DEVT; > > + sprintf(head->disk->disk_name, "nvme%dn%d", > > + ctrl->subsys->instance, nsid); > Is it okay to use head->instance instead of nsid for the disk name nvme#n#? > Because _nsid_ sets are not always contiguous, so the disk name is ugly in > that case. This actually was supposed to be the ns_head instance, that's why the ns_ida moved to the subsystem. I've fixed it up.
Re: [PATCH] bcache: add a comment in journal bucket reading
Tang-- On 11/01/2017 08:08 PM, tang.jun...@zte.com.cn wrote: > From: Tang Junhui > > The journal bucket is a circular buffer; the buckets > can look like YYYNNNYY, which means the first valid journal is in > the 7th bucket and the latest valid journal is in the third bucket. In > this case, if we do not try the zero index first, we > may get a valid journal in the 7th bucket, then call > find_next_bit(bitmap, ca->sb.njournal_buckets, l + 1) to get the > first invalid bucket after the 7th bucket; because all these > buckets are valid, no bit is set in the bitmap, so find_next_bit() > would return ca->sb.njournal_buckets (8). After that, > bcache only reads the journals in the 7th and 8th buckets, and > the journals in the first to third buckets are lost. > > So it is important to let developers know that we need to try > the zero index first in the hash search, and avoid any breakage > in future code modifications. > > Signed-off-by: Tang Junhui Fixed a small whitespace error at end of line. Reviewed-by: Michael Lyle Thank you for making this much more clear. Mike
Re: [PATCH V2 0/2] block: remove unnecessary RESTART
On Wed, 2017-11-01 at 08:21 -0600, Jens Axboe wrote: > Fixed that up, and applied these two patches as well. Hello Jens, Recently I noticed that a test system sporadically hangs during boot (Dell PowerEdge R720 that boots from a hard disk connected to a MegaRAID SAS adapter) and also that srp-tests systematically hangs. Reverting the two patches from this series fixes both issues. I'm not sure there is another solution than reverting the two patches from this series. Bart. BTW, the following appeared in the kernel log when I tried to run srp-tests against a kernel with the two patches from this series applied:
INFO: task kworker/19:1:209 blocked for more than 480 seconds.
Tainted: G W 4.14.0-rc7-dbg+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/19:1 D 0 209 2 0x8000
Workqueue: srp_remove srp_remove_work [ib_srp]
Call Trace:
__schedule+0x2fa/0xbb0
schedule+0x36/0x90
async_synchronize_cookie_domain+0x88/0x130
? finish_wait+0x90/0x90
async_synchronize_full_domain+0x18/0x20
sd_remove+0x4d/0xc0 [sd_mod]
device_release_driver_internal+0x160/0x210
device_release_driver+0x12/0x20
bus_remove_device+0x100/0x180
device_del+0x1d8/0x340
__scsi_remove_device+0xfc/0x130
scsi_forget_host+0x25/0x70
scsi_remove_host+0x79/0x120
srp_remove_work+0x90/0x1d0 [ib_srp]
process_one_work+0x20a/0x660
worker_thread+0x3d/0x3b0
kthread+0x13a/0x150
? process_one_work+0x660/0x660
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x27/0x40
Showing all locks held in the system:
1 lock held by khungtaskd/170:
#0: (tasklist_lock){.+.+}, at: [] debug_show_all_locks+0x3d/0x1a0
4 locks held by kworker/19:1/209:
#0: ("%s"("srp_remove")){+.+.}, at: [] process_one_work+0x195/0x660
#1: ((>remove_work)){+.+.}, at: [] process_one_work+0x195/0x660
#2: (>scan_mutex){+.+.}, at: [] scsi_remove_host+0x1f/0x120
#3: (>mutex){}, at: [] device_release_driver_internal+0x39/0x210
2 locks held by kworker/u66:0/1927:
#0: ("events_unbound"){+.+.}, at: [] process_one_work+0x195/0x660
#1: ((>work)){+.+.}, at: [] process_one_work+0x195/0x660
2 locks held by kworker/5:0/2047:
#0: ("kaluad"){+.+.}, at: [] process_one_work+0x195/0x660
#1: ((&(>rtpg_work)->work)){+.+.}, at: [] process_one_work+0x195/0x660
[PATCH V3 0/7] blk-mq: don't allocate driver tag beforehand for flush rq
Hi Jens, This patchset avoids allocating a driver tag beforehand for the flush rq in case of an I/O scheduler, so the flush rq isn't treated specially wrt. get/put driver tag, the code gets cleaned up a lot (for example, reorder_tags_to_front() is removed), and we needn't worry about request order in the dispatch list for avoiding I/O deadlock. 'dbench -t 30 -s 64' has been run on different devices (shared tag, multi-queue, single queue, ...), and no issues were observed; even when very low queue depth tests are run, dbench still works well. Please consider it for v4.15, thanks! V3: - rebase on the latest for-next of the block tree - clean up commit log and comments - include Jianchao's fix on put driver tag V2: - release driver tag before requeue - warn on inserting a req with driver tag Jianchao Wang (1): blk-mq: put the driver tag of nxt rq before first one is requeued Ming Lei (6): blk-flush: don't run queue for requests of bypassing flush block: pass 'run_queue' to blk_mq_request_bypass_insert blk-flush: use blk_mq_request_bypass_insert() blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ blk-mq: move blk_mq_put_driver_tag*() into blk-mq.h blk-mq: don't allocate driver tag beforehand for flush rq block/blk-core.c | 2 +- block/blk-flush.c | 37 ++-- block/blk-mq-sched.c | 63 +- block/blk-mq.c | 97 ++-- block/blk-mq.h | 35 ++- 5 files changed, 97 insertions(+), 137 deletions(-) -- 2.9.5
[PATCH V3 1/7] blk-mq: put the driver tag of nxt rq before first one is requeued
From: Jianchao Wang When freeing the driver tag of the next rq with an I/O scheduler configured, we get the first entry of the list; however, at that moment the failed rq has been requeued at the head of the list, so the rq we get is the failed rq, not the next rq. Free the driver tag of the next rq before the failed one is requeued in the failure branch of the queue_rq callback; it is only needed there. Signed-off-by: Jianchao Wang --- block/blk-mq.c | 19 +++++++++++-------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index e4d2490f4e7e..fe14b28760eb 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1091,7 +1091,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, bool got_budget) { struct blk_mq_hw_ctx *hctx; - struct request *rq; + struct request *rq, *nxt; int errors, queued; if (list_empty(list)) @@ -1153,14 +1153,20 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, if (list_empty(list)) bd.last = true; else { - struct request *nxt; - nxt = list_first_entry(list, struct request, queuelist); bd.last = !blk_mq_get_driver_tag(nxt, NULL, false); } ret = q->mq_ops->queue_rq(hctx, &bd); if (ret == BLK_STS_RESOURCE) { + /* +* If an I/O scheduler has been configured and we got a +* driver tag for the next request already, free it again. +*/ + if (!list_empty(list)) { + nxt = list_first_entry(list, struct request, queuelist); + blk_mq_put_driver_tag(nxt); + } blk_mq_put_driver_tag_hctx(hctx, rq); list_add(&rq->queuelist, list); __blk_mq_requeue_request(rq); @@ -1184,13 +1190,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, * that is where we will continue on next queue run. */ if (!list_empty(list)) { - /* -* If an I/O scheduler has been configured and we got a driver -* tag for the next request already, free it again. -*/ - rq = list_first_entry(list, struct request, queuelist); - blk_mq_put_driver_tag(rq); - spin_lock(&hctx->lock); list_splice_init(list, &hctx->dispatch); spin_unlock(&hctx->lock); -- 2.9.5
[PATCH V3 4/7] blk-flush: use blk_mq_request_bypass_insert()
In the following patch, we will use RQF_FLUSH_SEQ to decide: 1) if the flag isn't set, the flush rq needs to be inserted via blk_insert_flush() 2) otherwise, the flush rq needs to be dispatched directly since it is in the flush machinery now. So we use blk_mq_request_bypass_insert() for requests bypassing the flush machinery, just like the legacy path did. Signed-off-by: Ming Lei --- block/blk-flush.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-flush.c b/block/blk-flush.c index 81bd1a843043..a9773d2075ac 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -463,7 +463,7 @@ void blk_insert_flush(struct request *rq) if ((policy & REQ_FSEQ_DATA) && !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) { if (q->mq_ops) - blk_mq_sched_insert_request(rq, false, false, false, false); + blk_mq_request_bypass_insert(rq, false); else list_add_tail(&rq->queuelist, &q->queue_head); return; -- 2.9.5
[PATCH V3 2/7] blk-flush: don't run queue for requests of bypassing flush
blk_insert_flush() should only insert the request, since a queue run always follows it. In the case of bypassing flush, we don't need to run the queue because every blk_insert_flush() is followed by one queue run. Signed-off-by: Ming Lei --- block/blk-flush.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-flush.c b/block/blk-flush.c index 4938bec8cfef..81bd1a843043 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -463,7 +463,7 @@ void blk_insert_flush(struct request *rq) if ((policy & REQ_FSEQ_DATA) && !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) { if (q->mq_ops) - blk_mq_sched_insert_request(rq, false, true, false, false); + blk_mq_sched_insert_request(rq, false, false, false, false); else list_add_tail(&rq->queuelist, &q->queue_head); return; -- 2.9.5
[PATCH V3 5/7] blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ
In the case of an IO scheduler we always pre-allocate one driver tag before calling blk_insert_flush(), and the flush request will be marked as RQF_FLUSH_SEQ once it is in the flush machinery. So if RQF_FLUSH_SEQ isn't set, we call blk_insert_flush() to handle the request; otherwise the flush request is dispatched to the ->dispatch list directly. This is a preparation patch for not pre-allocating the driver tag for the flush rq, and not treating the flush rq as a special case, just like the legacy path did. Signed-off-by: Ming Lei --- block/blk-mq-sched.c | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 7775f6b12fa9..32e4ea2d0a77 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -355,21 +355,23 @@ void blk_mq_sched_request_inserted(struct request *rq) EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted); static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, + bool has_sched, struct request *rq) { - if (rq->tag == -1) { + /* dispatch flush rq directly */ + if (rq->rq_flags & RQF_FLUSH_SEQ) { + spin_lock(&hctx->lock); + list_add(&rq->queuelist, &hctx->dispatch); + spin_unlock(&hctx->lock); + return true; + } + + if (has_sched) { rq->rq_flags |= RQF_SORTED; - return false; + WARN_ON(rq->tag != -1); } - /* -* If we already have a real request tag, send directly to -* the dispatch list. -*/ - spin_lock(&hctx->lock); - list_add(&rq->queuelist, &hctx->dispatch); - spin_unlock(&hctx->lock); - return true; + return false; } /* @@ -395,12 +397,13 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head, struct blk_mq_ctx *ctx = rq->mq_ctx; struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu); - if (rq->tag == -1 && op_is_flush(rq->cmd_flags)) { + /* flush rq in flush machinery need to be dispatched directly */ + if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) { blk_mq_sched_insert_flush(hctx, rq, can_block); return; } - if (e && blk_mq_sched_bypass_insert(hctx, rq)) + if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) goto run; if (e && e->type->ops.mq.insert_requests) { @@ -438,7 +441,7 @@ void blk_mq_sched_insert_requests(struct request_queue *q, list_for_each_entry_safe(rq, next, list, queuelist) { if (WARN_ON_ONCE(rq->tag != -1)) { list_del_init(&rq->queuelist); - blk_mq_sched_bypass_insert(hctx, rq); + blk_mq_sched_bypass_insert(hctx, true, rq); } } } -- 2.9.5
[PATCH V3 6/7] blk-mq: move blk_mq_put_driver_tag*() into blk-mq.h
We need this helper to put the driver tag for the flush rq, since we will not share the tag in the flush request sequence in the following patch when an I/O scheduler is applied. Signed-off-by: Ming Lei --- block/blk-mq.c | 32 -------------------------------- block/blk-mq.h | 33 +++++++++++++++++++++++++++++++++ 2 files changed, 33 insertions(+), 32 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 5e2866b83305..bf62065a15ec 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -993,38 +993,6 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx, return rq->tag != -1; } -static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx, - struct request *rq) -{ - blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag); - rq->tag = -1; - - if (rq->rq_flags & RQF_MQ_INFLIGHT) { - rq->rq_flags &= ~RQF_MQ_INFLIGHT; - atomic_dec(&hctx->nr_active); - } -} - -static void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx, - struct request *rq) -{ - if (rq->tag == -1 || rq->internal_tag == -1) - return; - - __blk_mq_put_driver_tag(hctx, rq); -} - -static void blk_mq_put_driver_tag(struct request *rq) -{ - struct blk_mq_hw_ctx *hctx; - - if (rq->tag == -1 || rq->internal_tag == -1) - return; - - hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu); - __blk_mq_put_driver_tag(hctx, rq); -} - /* * If we fail getting a driver tag because all the driver tags are already * assigned and on the dispatch list, BUT the first entry does not have a diff --git a/block/blk-mq.h b/block/blk-mq.h index 713be5ae83ba..f55d39b0c6ed 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -2,6 +2,7 @@ #define INT_BLK_MQ_H #include "blk-stat.h" +#include "blk-mq-tag.h" struct blk_mq_tag_set; @@ -157,4 +158,36 @@ static inline blk_status_t blk_mq_get_dispatch_budget( return BLK_STS_OK; } +static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx, + struct request *rq) +{ + blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag); + rq->tag = -1; + + if (rq->rq_flags & RQF_MQ_INFLIGHT) { + rq->rq_flags &= ~RQF_MQ_INFLIGHT; + atomic_dec(&hctx->nr_active); + } +} + +static inline void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx, + struct request *rq) +{ + if (rq->tag == -1 || rq->internal_tag == -1) + return; + + __blk_mq_put_driver_tag(hctx, rq); +} + +static inline void blk_mq_put_driver_tag(struct request *rq) +{ + struct blk_mq_hw_ctx *hctx; + + if (rq->tag == -1 || rq->internal_tag == -1) + return; + + hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu); + __blk_mq_put_driver_tag(hctx, rq); +} + #endif -- 2.9.5
[PATCH V3 7/7] blk-mq: don't allocate driver tag beforehand for flush rq
The idea behind this is simple: 1) for the none scheduler, the driver tag has to be borrowed for the flush rq, otherwise we may run out of tags and cause an IO hang. And get/put driver tag is actually a noop, so reordering tags isn't necessary at all. 2) for a real I/O scheduler, we needn't allocate a driver tag beforehand for the flush rq, and it works just fine to follow the way for normal requests: allocate the driver tag for each rq just before calling .queue_rq(). One visible change to drivers is that the driver tag isn't shared in the flush request sequence; that won't be a problem, since we always do that in the legacy path. Then the flush rq needn't be treated specially wrt. get/put driver tag, the code gets cleaned up a lot (for example, reorder_tags_to_front() is removed), and we needn't worry about request order in the dispatch list for avoiding I/O deadlock. Also we have to put the driver tag before requeueing. Signed-off-by: Ming Lei --- block/blk-flush.c | 35 ++- block/blk-mq-sched.c | 42 +- block/blk-mq.c | 41 ++--- 3 files changed, 37 insertions(+), 81 deletions(-) diff --git a/block/blk-flush.c b/block/blk-flush.c index a9773d2075ac..f17170675917 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -231,8 +231,13 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error) /* release the tag's ownership to the req cloned from */ spin_lock_irqsave(&fq->mq_flush_lock, flags); hctx = blk_mq_map_queue(q, flush_rq->mq_ctx->cpu); - blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq); - flush_rq->tag = -1; + if (!q->elevator) { + blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq); + flush_rq->tag = -1; + } else { + blk_mq_put_driver_tag_hctx(hctx, flush_rq); + flush_rq->internal_tag = -1; + } } running = &fq->flush_queue[fq->flush_running_idx]; @@ -318,19 +323,26 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq) blk_rq_init(q, flush_rq); /* -* Borrow tag from the first request since they can't -* be in flight at the same time. And acquire the tag's -* ownership for flush req. +* In case of none scheduler, borrow tag from the first request +* since they can't be in flight at the same time. And acquire +* the tag's ownership for flush req. +* +* In case of IO scheduler, flush rq need to borrow scheduler tag +* just for cheating put/get driver tag. */ if (q->mq_ops) { struct blk_mq_hw_ctx *hctx; flush_rq->mq_ctx = first_rq->mq_ctx; - flush_rq->tag = first_rq->tag; - fq->orig_rq = first_rq; - hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu); - blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq); + if (!q->elevator) { + fq->orig_rq = first_rq; + flush_rq->tag = first_rq->tag; + hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu); + blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq); + } else { + flush_rq->internal_tag = first_rq->internal_tag; + } } flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH; @@ -394,6 +406,11 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error) hctx = blk_mq_map_queue(q, ctx->cpu); + if (q->elevator) { + WARN_ON(rq->tag < 0); + blk_mq_put_driver_tag_hctx(hctx, rq); + } + /* * After populating an empty queue, kick it to avoid stall. Read * the comment in flush_end_io(). diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 32e4ea2d0a77..eb082f82b470 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -366,29 +366,12 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, return true; } - if (has_sched) { + if (has_sched) rq->rq_flags |= RQF_SORTED; - WARN_ON(rq->tag != -1); - } return false; } -/* - * Add flush/fua to the queue. If we fail getting a driver tag, then - * punt to the requeue list. Requeue will re-invoke us from a context - * that's safe to block from. - */ -static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx *hctx, - struct request *rq, bool can_block) -{ - if (blk_mq_get_driver_tag(rq, &hctx, can_block)) { - blk_insert_flush(rq); - blk_mq_run_hw_queue(hctx, true); - } else - blk_mq_add_to_requeue_list(rq, false, true); -} -
Re: [PATCH] ide: Make ide_cdrom_prep_fs() initialize the sense buffer pointer
On Thu, 2017-11-02 at 10:36 +0800, Hongxu Jia wrote: > Apply this patch, and the test on my platform is passed. Thank you for having tested this patch. Does this mean that you are OK with adding the following to this patch: Tested-by: Hongxu Jia? Bart.
Re: [PATCH] ide: Make ide_cdrom_prep_fs() initialize the sense buffer pointer
On 11/01/2017 04:31 PM, Bart Van Assche wrote: > The changes introduced through commit 82ed4db499b8 assume that the > sense buffer pointer in struct scsi_request is initialized for all > requests - passthrough and filesystem requests. Hence make sure > that that pointer is initialized for filesystem requests. Remove > the memset() call that clears .cmd because the scsi_req_init() > call in ide_initialize_rq() already initializes the .cmd. I'll queue this up for 4.15, post the initial merge. -- Jens Axboe
Re: [PATCH] skd: use ktime_get_real_seconds()
On 11/02/2017 05:42 AM, Arnd Bergmann wrote: > Like many storage drivers, skd uses an unsigned 32-bit number for > interchanging the current time with the firmware. This will overflow in > y2106 and is otherwise safe. > > However, the get_seconds() function is generally considered deprecated > since the behavior is different between 32-bit and 64-bit architectures, > and using it may indicate a bigger problem. > > To annotate that we've thought about this, let's add a comment here > and migrate to the ktime_get_real_seconds() function that consistently > returns a 64-bit number. Thanks Arnd, applied. -- Jens Axboe
Re: [bug report after v4.5-rc1]block: When the scsi device has a timeout IO, the scsi device is stuck when it is deleted
OK, thanks. -----Original Message----- From: chenxiang (M) Sent: 2 November 2017 20:34 To: Zouming (IT); linux-block@vger.kernel.org; ax...@fb.com Cc: wangzhoumengjian Subject: Re: [bug report after v4.5-rc1]block: When the scsi device has a timeout IO, the scsi device is stuck when it is deleted On 2017/11/2 20:16, Zouming (IT) wrote: > 1. Steps to reproduce: > (1) Send IO on the device /dev/sdx. > (2) Simulate an IO loss. > (3) Use the following command to delete the scsi device before the IO times out: > echo 1 > /sys/class/sdx/device/delete > > 2. The stack of the delete thread is as follows: > [] msleep+0x2f/0x40 > [] __blk_drain_queue+0xa4/0x170 [] > blk_cleanup_queue+0x13d/0x150 [] > __scsi_remove_device+0x4a/0xd0 [] > scsi_remove_device+0x26/0x40 [] > sdev_store_delete_callback+0x15/0x20 > [] sysfs_schedule_callback_work+0x14/0x60 > [] process_one_work+0x17a/0x440 [] > worker_thread+0x126/0x3c0 [] kthread+0xcf/0xe0 > [] ret_from_fork+0x58/0x90 > > 3. The reason is as follows: > (1) When the scsi device is deleted, blk_cleanup_queue() is invoked to > set the dying flag of the request_queue and wait for all IO to come back. > > (2) When the IO times out, the timeout workqueue invokes blk_timeout_work() > to abort the IO, > but it will not abort the IO because blk_queue_enter() judges > that the request_queue is dying and returns directly without doing > anything. Hi Zouming, You can have a test on Bart's patch "[PATCH] block: Fix a race between blk_cleanup_queue() and timeout handling" for this issue. I think this patch can solve your issue.
Re: [bug report after v4.5-rc1]block: When the scsi device has a timeout IO, the scsi device is stuck when it is deleted
On 2017/11/2 20:16, Zouming (IT) wrote: 1. Steps to reproduce: (1) Send IO on the device /dev/sdx. (2) Simulate an IO loss. (3) Use the following command to delete the scsi device before the IO times out: echo 1 > /sys/class/sdx/device/delete 2. The stack of the delete thread is as follows: [] msleep+0x2f/0x40 [] __blk_drain_queue+0xa4/0x170 [] blk_cleanup_queue+0x13d/0x150 [] __scsi_remove_device+0x4a/0xd0 [] scsi_remove_device+0x26/0x40 [] sdev_store_delete_callback+0x15/0x20 [] sysfs_schedule_callback_work+0x14/0x60 [] process_one_work+0x17a/0x440 [] worker_thread+0x126/0x3c0 [] kthread+0xcf/0xe0 [] ret_from_fork+0x58/0x90 3. The reason is as follows: (1) When the scsi device is deleted, blk_cleanup_queue() is invoked to set the dying flag of the request_queue and wait for all IO to come back. (2) When the IO times out, the timeout workqueue invokes blk_timeout_work() to abort the IO, but it will not abort the IO because blk_queue_enter() judges that the request_queue is dying and returns directly without doing anything. Hi Zouming, You can have a test on Bart's patch "[PATCH] block: Fix a race between blk_cleanup_queue() and timeout handling" for this issue. I think this patch can solve your issue.
[bug report after v4.5-rc1]block: When the scsi device has a timeout IO, the scsi device is stuck when it is deleted
1. Steps to reproduce: (1) Send IO on the device /dev/sdx. (2) Simulate an IO loss. (3) Use the following command to delete the scsi device before the IO times out: echo 1 > /sys/class/sdx/device/delete 2. The stack of the delete thread is as follows: [] msleep+0x2f/0x40 [] __blk_drain_queue+0xa4/0x170 [] blk_cleanup_queue+0x13d/0x150 [] __scsi_remove_device+0x4a/0xd0 [] scsi_remove_device+0x26/0x40 [] sdev_store_delete_callback+0x15/0x20 [] sysfs_schedule_callback_work+0x14/0x60 [] process_one_work+0x17a/0x440 [] worker_thread+0x126/0x3c0 [] kthread+0xcf/0xe0 [] ret_from_fork+0x58/0x90 3. The reason is as follows: (1) When the scsi device is deleted, blk_cleanup_queue() is invoked to set the dying flag of the request_queue and wait for all IO to come back. (2) When the IO times out, the timeout workqueue invokes blk_timeout_work() to abort the IO, but it will not abort the IO because blk_queue_enter() judges that the request_queue is dying and returns directly without doing anything.
[PATCH] skd: use ktime_get_real_seconds()
Like many storage drivers, skd uses an unsigned 32-bit number for interchanging the current time with the firmware. This will overflow in y2106 and is otherwise safe. However, the get_seconds() function is generally considered deprecated since the behavior is different between 32-bit and 64-bit architectures, and using it may indicate a bigger problem. To annotate that we've thought about this, let's add a comment here and migrate to the ktime_get_real_seconds() function that consistently returns a 64-bit number. Signed-off-by: Arnd Bergmann --- drivers/block/skd_main.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c index 64d0fc17c174..2819f23e8bf2 100644 --- a/drivers/block/skd_main.c +++ b/drivers/block/skd_main.c @@ -1967,7 +1967,8 @@ static void skd_isr_msg_from_dev(struct skd_device *skdev) break; case FIT_MTD_CMD_LOG_HOST_ID: - skdev->connect_time_stamp = get_seconds(); + /* hardware interface overflows in y2106 */ + skdev->connect_time_stamp = (u32)ktime_get_real_seconds(); data = skdev->connect_time_stamp & 0xffff; mtd = FIT_MXD_CONS(FIT_MTD_CMD_LOG_TIME_STAMP_LO, 0, data); SKD_WRITEL(skdev, mtd, FIT_MSG_TO_DEVICE); -- 2.9.0