Re: [PATCH v2] blk-mq-debugfs: Add names for recently added flags
On Thu, Aug 24, 2017 at 11:30:46PM +, Bart Van Assche wrote:
> On Fri, 2017-08-18 at 15:52 -0700, Bart Van Assche wrote:
> > The symbolic constants QUEUE_FLAG_SCSI_PASSTHROUGH, QUEUE_FLAG_QUIESCED
> > and REQ_NOWAIT are missing from blk-mq-debugfs.c. Add these to
> > blk-mq-debugfs.c such that these appear as names in debugfs instead of
> > as numbers.
>
> Hi Jens,
>
> Have you already had the time to have a look at this patch?
>
> Thanks,
>
> Bart.

FWIW,

Reviewed-by: Omar Sandoval
Re: [PATCH v2] blk-mq-debugfs: Add names for recently added flags
On Fri, 2017-08-18 at 15:52 -0700, Bart Van Assche wrote:
> The symbolic constants QUEUE_FLAG_SCSI_PASSTHROUGH, QUEUE_FLAG_QUIESCED
> and REQ_NOWAIT are missing from blk-mq-debugfs.c. Add these to
> blk-mq-debugfs.c such that these appear as names in debugfs instead of
> as numbers.

Hi Jens,

Have you already had the time to have a look at this patch?

Thanks,

Bart.
BFQ + dm-mpath
Hello Paolo,

Has BFQ ever been tested in combination with dm-mpath? A few seconds after I had
started a run of the srp-tests software with BFQ a kernel oops appeared on the
console. The command I ran is:

$ while true; do ~bart/software/infiniband/srp-test/run_tests -d -r 30 -e bfq; done

And this is what appeared on the console:

[89251.977201] BUG: unable to handle kernel NULL pointer dereference at 0018
[89251.987831] IP: bfq_setup_cooperator+0x16/0x2e0 [bfq]
[89251.995241] PGD 0
[89251.995243] P4D 0
[89251.999194]
[89252.006423] Oops: [#1] PREEMPT SMP
[89252.012354] Modules linked in: ib_srp libcrc32c scsi_transport_srp target_core_file ib_srpt target_core_iblock target_core_mod scsi_debug brd bfq dm_service_time rdma_cm iw_cm bridge stp llc ip6table_filter ip6_tables iptable_filter intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm af_packet irqbypass ipmi_ssif dcdbas crct10dif_pclmul ioatdma crc32_pclmul mei_me joydev ghash_clmulni_intel mei aesni_intel sg shpchp ipmi_si aes_x86_64 ipmi_devintf crypto_simd dca cryptd glue_helper lpc_ich ipmi_msghandler acpi_power_meter acpi_pad wmi mfd_core ib_ipoib ib_cm ib_uverbs ib_umad mlx4_ib ib_core dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua netconsole configfs ip_tables x_tables autofs4 ext4 mbcache jbd2 mlx4_en hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper
[89252.103941] syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ehci_pci mlx4_core tg3 ehci_hcd crc32c_intel devlink drm ptp usbcore usb_common pps_core megaraid_sas libphy [last unloaded: scsi_transport_srp]
[89252.128375] CPU: 13 PID: 352 Comm: kworker/13:1H Tainted: G W 4.13.0-rc6-dbg+ #2
[89252.139884] Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 1.3.6 09/11/2012
[89252.150243] Workqueue: kblockd blk_mq_run_work_fn
[89252.157449] task: 880911d80040 task.stack: c90007c64000
[89252.166038] RIP: 0010:bfq_setup_cooperator+0x16/0x2e0 [bfq]
[89252.174222] RSP: 0018:c90007c67b20 EFLAGS: 00010082
[89252.182026] RAX: ffe0 RBX: 8807b9988700 RCX: 0001
[89252.191990] RDX: 8807b9988700 RSI: RDI: 880288554a68
[89252.201956] RBP: c90007c67b68 R08: 825820c0 R09:
[89252.211899] R10: c90007c67ae8 R11: a05a5ad5 R12: 880288554dc8
[89252.221821] R13: 880288554a68 R14: 880928078bd0 R15: 8807bbe03528
[89252.231715] FS: () GS:88093f78() knlGS:
[89252.242667] CS: 0010 DS: ES: CR0: 80050033
[89252.250958] CR2: 0018 CR3:
Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
On Thu, 2017-08-24 at 10:59 +0200, h...@lst.de wrote:
> On Wed, Aug 23, 2017 at 06:21:55PM +, Bart Van Assche wrote:
> > Since generic_make_request_fast() returns BLK_STS_AGAIN for a dying path:
> > can the same kind of soft lockups occur with the NVMe multipathing code as
> > with the current upstream device mapper multipathing code? See e.g.
> > "[PATCH 3/7] dm-mpath: Do not lock up a CPU with requeuing activity"
> > (https://www.redhat.com/archives/dm-devel/2017-August/msg00124.html).
>
> I suspect the code is not going to hit it because we check the controller
> state before trying to queue I/O on the lower queue. But if you point
> me to a good reproducer test case I'd like to check.

For NVMe over RDMA, how about the simulate_network_failure_loop() function in
https://github.com/bvanassche/srp-test/blob/master/lib/functions? It simulates
a network failure by writing into the reset_controller sysfs attribute.

> Also does the "single queue" case in your mail refer to the old
> request code? nvme only uses blk-mq so it would not hit that.
> But either way I think get_request should be fixed to return
> BLK_STS_IOERR if the queue is dying instead of BLK_STS_AGAIN.

The description in the patch I referred to indeed refers to the old request
code in the block layer. When I prepared that patch I had analyzed the
behavior of the old request code only.

Bart.
[PATCH V2 2/2] block/loop: allow request merge for directio mode
From: Shaohua Li

Currently loop disables merge. While it makes sense for buffer IO mode,
directio mode can benefit from request merge. Without merge, loop could
send small size IO to underlayer disk and harm performance.

Reviewed-by: Omar Sandoval
Signed-off-by: Shaohua Li
---
 drivers/block/loop.c | 66
 drivers/block/loop.h |  1 +
 2 files changed, 52 insertions(+), 15 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 428da07..75a8f6e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -213,10 +213,13 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 	 */
 	blk_mq_freeze_queue(lo->lo_queue);
 	lo->use_dio = use_dio;
-	if (use_dio)
+	if (use_dio) {
+		queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
-	else
+	} else {
+		queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
 		lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
+	}
 	blk_mq_unfreeze_queue(lo->lo_queue);
 }
@@ -464,6 +467,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
 {
 	struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
+	kfree(cmd->bvec);
+	cmd->bvec = NULL;
 	cmd->ret = ret;
 	blk_mq_complete_request(cmd->rq);
 }
@@ -473,22 +478,50 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 {
 	struct iov_iter iter;
 	struct bio_vec *bvec;
-	struct bio *bio = cmd->rq->bio;
+	struct request *rq = cmd->rq;
+	struct bio *bio = rq->bio;
 	struct file *file = lo->lo_backing_file;
+	unsigned int offset;
+	int segments = 0;
 	int ret;
-	/* nomerge for loop request queue */
-	WARN_ON(cmd->rq->bio != cmd->rq->biotail);
+	if (rq->bio != rq->biotail) {
+		struct req_iterator iter;
+		struct bio_vec tmp;
+
+		__rq_for_each_bio(bio, rq)
+			segments += bio_segments(bio);
+		bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_KERNEL);
+		if (!bvec)
+			return -EIO;
+		cmd->bvec = bvec;
+
+		/*
+		 * The bios of the request may be started from the middle of
+		 * the 'bvec' because of bio splitting, so we can't directly
+		 * copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
+		 * API will take care of all details for us.
+		 */
+		rq_for_each_segment(tmp, rq, iter) {
+			*bvec = tmp;
+			bvec++;
+		}
+		bvec = cmd->bvec;
+		offset = 0;
+	} else {
+		/*
+		 * Same here, this bio may be started from the middle of the
+		 * 'bvec' because of bio splitting, so offset from the bvec
+		 * must be passed to iov iterator
+		 */
+		offset = bio->bi_iter.bi_bvec_done;
+		bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+		segments = bio_segments(bio);
+	}
-	bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
 	iov_iter_bvec(&iter, ITER_BVEC | rw, bvec,
-		      bio_segments(bio), blk_rq_bytes(cmd->rq));
-	/*
-	 * This bio may be started from the middle of the 'bvec'
-	 * because of bio splitting, so offset from the bvec must
-	 * be passed to iov iterator
-	 */
-	iter.iov_offset = bio->bi_iter.bi_bvec_done;
+		      segments, blk_rq_bytes(rq));
+	iter.iov_offset = offset;
 	cmd->iocb.ki_pos = pos;
 	cmd->iocb.ki_filp = file;
@@ -1800,9 +1833,12 @@ static int loop_add(struct loop_device **l, int i)
 	lo->lo_queue->queuedata = lo;
 	blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
+
 	/*
-	 * It doesn't make sense to enable merge because the I/O
-	 * submitted to backing file is handled page by page.
+	 * By default, we do buffer IO, so it doesn't make sense to enable
+	 * merge because the I/O submitted to backing file is handled page by
+	 * page. For directio mode, merge does help to dispatch bigger request
+	 * to underlayer disk. We will enable merge once directio is enabled.
 	 */
 	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index fecd3f9..bc9cf91 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -72,6 +72,7 @@ struct loop_cmd {
 	bool use_aio; /* use AIO interface to handle I/O */
 	long ret;
 	struct kiocb iocb;
+	struct bio_vec *bvec;
 };
 /* Support for loadable transfer modules */
[PATCH V2 0/2] block/loop: improve performance
From: Shaohua Li

two small patches to improve performance for loop in directio mode. The goal
is to increase IO size sending to underlayer disks.

As Omar pointed out, the patches have slight conflict with his, but should be
easy to fix.

Thanks,
Shaohua

Shaohua Li (2):
  block/loop: set hw_sectors
  block/loop: allow request merge for directio mode

 drivers/block/loop.c | 67
 drivers/block/loop.h |  1 +
 2 files changed, 53 insertions(+), 15 deletions(-)

-- 
2.9.5
[PATCH V2 1/2] block/loop: set hw_sectors
From: Shaohua Li

Loop can handle any size of request. Limiting it to 255 sectors just
burns the CPU for bio split and request merge for underlayer disk and
also cause bad fs block allocation in directio mode.

Reviewed-by: Omar Sandoval
Signed-off-by: Shaohua Li
---
 drivers/block/loop.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b55a1f8..428da07 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1799,6 +1799,7 @@ static int loop_add(struct loop_device **l, int i)
 	}
 	lo->lo_queue->queuedata = lo;

+	blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
 	/*
 	 * It doesn't make sense to enable merge because the I/O
 	 * submitted to backing file is handled page by page.
-- 
2.9.5
Re: [bug report] skd: Avoid that module unloading triggers a use-after-free
On Thu, Aug 24, 2017 at 03:04:12PM +, Bart Van Assche wrote:
> On Thu, 2017-08-24 at 14:04 +0300, Dan Carpenter wrote:
> > Hello Bart Van Assche,
> >
> > This is a semi-automatic email about new static checker warnings.
> >
> > The patch 7277cc67b391: "skd: Avoid that module unloading triggers a
> > use-after-free" from Aug 17, 2017, leads to the following Smatch
> > complaint:
> >
> > drivers/block/skd_main.c:3080 skd_free_disk()
> > error: we previously assumed 'disk' could be null (see line 3074)
> >
> > drivers/block/skd_main.c
> >   3073
> >   3074		if (disk && (disk->flags & GENHD_FL_UP))
> >
> > Existing code checked for NULL. The new code shuffles things around.
> >
> >   3075			del_gendisk(disk);
> >   3076
> >   3077		if (skdev->queue) {
> >   3078			blk_cleanup_queue(skdev->queue);
> >   3079			skdev->queue = NULL;
> >   3080			disk->queue = NULL;
> >                                 ^^^
> > Now we don't check here.
> >
> >   3081		}
> >   3082
> >
> > regards,
> > dan carpenter
>
> Hello Dan,
>
> If you have a look at skd_cons_disk() you will see that skdev->queue != NULL
> implies that skdev->disk != NULL. So I think the above report is a false
> positive.

Oh, yeah. You're right. Thanks for taking a look at this.

regards,
dan carpenter
Re: [PATCH 2/2] block/loop: allow request merge for directio mode
On Thu, Aug 24, 2017 at 10:57:39AM -0700, Omar Sandoval wrote: > On Wed, Aug 23, 2017 at 04:49:24PM -0700, Shaohua Li wrote: > > From: Shaohua Li> > > > Currently loop disables merge. While it makes sense for buffer IO mode, > > directio mode can benefit from request merge. Without merge, loop could > > send small size IO to underlayer disk and harm performance. > > A couple of nits on comments below, besides that > > Reviewed-by: Omar Sandoval > > > Signed-off-by: Shaohua Li > > --- > > drivers/block/loop.c | 43 +++ > > drivers/block/loop.h | 1 + > > 2 files changed, 36 insertions(+), 8 deletions(-) > > > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > > index 428da07..1bfa3ff 100644 > > --- a/drivers/block/loop.c > > +++ b/drivers/block/loop.c > > @@ -213,10 +213,13 @@ static void __loop_update_dio(struct loop_device *lo, > > bool dio) > > */ > > blk_mq_freeze_queue(lo->lo_queue); > > lo->use_dio = use_dio; > > - if (use_dio) > > + if (use_dio) { > > + queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue); > > lo->lo_flags |= LO_FLAGS_DIRECT_IO; > > - else > > + } else { > > + queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue); > > lo->lo_flags &= ~LO_FLAGS_DIRECT_IO; > > + } > > Could you please also update the comment in loop_add() where we set > QUEUE_FLAG_NOMERGES to say something like > > /* >* By default we do buffered I/O, so it doesn't make sense to >* enable merging because the I/O submitted to backing file is >* handled page by page. We enable merging when switching to >* direct I/O mode. 
>*/ sure, will add > > blk_mq_unfreeze_queue(lo->lo_queue); > > } > > > > @@ -464,6 +467,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long > > ret, long ret2) > > { > > struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb); > > > > + kfree(cmd->bvec); > > + cmd->bvec = NULL; > > cmd->ret = ret; > > blk_mq_complete_request(cmd->rq); > > } > > @@ -473,22 +478,44 @@ static int lo_rw_aio(struct loop_device *lo, struct > > loop_cmd *cmd, > > { > > struct iov_iter iter; > > struct bio_vec *bvec; > > - struct bio *bio = cmd->rq->bio; > > + struct request *rq = cmd->rq; > > + struct bio *bio = rq->bio; > > struct file *file = lo->lo_backing_file; > > + unsigned int offset; > > + int segments = 0; > > int ret; > > > > - /* nomerge for loop request queue */ > > - WARN_ON(cmd->rq->bio != cmd->rq->biotail); > > + if (rq->bio != rq->biotail) { > > + struct req_iterator iter; > > + struct bio_vec tmp; > > + > > + __rq_for_each_bio(bio, rq) > > + segments += bio_segments(bio); > > + bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_KERNEL); > > + if (!bvec) > > + return -EIO; > > + cmd->bvec = bvec; > > + > > + rq_for_each_segment(tmp, rq, iter) { > > + *bvec = tmp; > > + bvec++; > > + } > > + bvec = cmd->bvec; > > + offset = 0; > > + } else { > > + offset = bio->bi_iter.bi_bvec_done; > > + bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter); > > + segments = bio_segments(bio); > > + } > > > > - bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter); > > iov_iter_bvec(, ITER_BVEC | rw, bvec, > > - bio_segments(bio), blk_rq_bytes(cmd->rq)); > > + segments, blk_rq_bytes(rq)); > > /* > > * This bio may be started from the middle of the 'bvec' > > * because of bio splitting, so offset from the bvec must > > * be passed to iov iterator > > */ > > This comment should be moved above to where you do > > offset = bio->bi_iter.bi_bvec_done; This comment actually applies to 'rq->bio != rq->biotail' case too for each bio of the request. 
That's why I don't just copy bio->bi_io_vec. I'll explain this in a different way.

Thanks,
Shaohua
[PATCH] block: update comments to reflect REQ_FLUSH -> REQ_PREFLUSH rename
From: Omar Sandoval

Normally I wouldn't bother with this, but in my opinion the comments are the
most important part of this whole file since without them no one would have
any clue how this insanity works.

Signed-off-by: Omar Sandoval
---
Based on Jens' for-next branch, for 4.14.

 block/blk-flush.c | 24
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index ed5fe322abba..3e77676244d9 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -1,12 +1,12 @@
 /*
- * Functions to sequence FLUSH and FUA writes.
+ * Functions to sequence PREFLUSH and FUA writes.
  *
  * Copyright (C) 2011 Max Planck Institute for Gravitational Physics
  * Copyright (C) 2011 Tejun Heo
  *
  * This file is released under the GPLv2.
  *
- * REQ_{FLUSH|FUA} requests are decomposed to sequences consisted of three
+ * REQ_{PREFLUSH|FUA} requests are decomposed to sequences consisted of three
  * optional steps - PREFLUSH, DATA and POSTFLUSH - according to the request
  * properties and hardware capability.
  *
@@ -16,9 +16,9 @@
  * REQ_FUA means that the data must be on non-volatile media on request
  * completion.
  *
- * If the device doesn't have writeback cache, FLUSH and FUA don't make any
- * difference. The requests are either completed immediately if there's no
- * data or executed as normal requests otherwise.
+ * If the device doesn't have writeback cache, PREFLUSH and FUA don't make any
+ * difference. The requests are either completed immediately if there's no data
+ * or executed as normal requests otherwise.
  *
  * If the device has writeback cache and supports FUA, REQ_PREFLUSH is
  * translated to PREFLUSH but REQ_FUA is passed down directly with DATA.
@@ -31,7 +31,7 @@
  * fq->flush_queue[fq->flush_pending_idx]. Once certain criteria are met, a
  * REQ_OP_FLUSH is issued and the pending_idx is toggled. When the flush
  * completes, all the requests which were pending are proceeded to the next
- * step. This allows arbitrary merging of different types of FLUSH/FUA
+ * step. This allows arbitrary merging of different types of PREFLUSH/FUA
  * requests.
  *
  * Currently, the following conditions are used to determine when to issue
@@ -47,19 +47,19 @@
  * C3. The second condition is ignored if there is a request which has
  *     waited longer than FLUSH_PENDING_TIMEOUT. This is to avoid
  *     starvation in the unlikely case where there are continuous stream of
- *     FUA (without FLUSH) requests.
+ *     FUA (without PREFLUSH) requests.
  *
  * For devices which support FUA, it isn't clear whether C2 (and thus C3)
  * is beneficial.
  *
- * Note that a sequenced FLUSH/FUA request with DATA is completed twice.
+ * Note that a sequenced PREFLUSH/FUA request with DATA is completed twice.
  * Once while executing DATA and again after the whole sequence is
  * complete. The first completion updates the contained bio but doesn't
  * finish it so that the bio submitter is notified only after the whole
  * sequence is complete. This is implemented by testing RQF_FLUSH_SEQ in
  * req_bio_endio().
  *
- * The above peculiarity requires that each FLUSH/FUA request has only one
+ * The above peculiarity requires that each PREFLUSH/FUA request has only one
  * bio attached to it, which is guaranteed as they aren't allowed to be
  * merged in the usual way.
  */
@@ -76,7 +76,7 @@
 #include "blk-mq-tag.h"
 #include "blk-mq-sched.h"

-/* FLUSH/FUA sequences */
+/* PREFLUSH/FUA sequences */
 enum {
 	REQ_FSEQ_PREFLUSH = (1 << 0), /* pre-flushing in progress */
 	REQ_FSEQ_DATA = (1 << 1), /* data write in progress */
@@ -148,7 +148,7 @@ static bool blk_flush_queue_rq(struct request *rq, bool add_front)
 /**
  * blk_flush_complete_seq - complete flush sequence
- * @rq: FLUSH/FUA request being sequenced
+ * @rq: PREFLUSH/FUA request being sequenced
  * @fq: flush queue
  * @seq: sequences to complete (mask of %REQ_FSEQ_*, can be zero)
  * @error: whether an error occurred
@@ -406,7 +406,7 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
 }

 /**
- * blk_insert_flush - insert a new FLUSH/FUA request
+ * blk_insert_flush - insert a new PREFLUSH/FUA request
  * @rq: request to insert
  *
  * To be called from __elv_add_request() for %ELEVATOR_INSERT_FLUSH insertions.
-- 
2.14.1
Re: [PATCH 2/2] block/loop: allow request merge for directio mode
On Wed, Aug 23, 2017 at 04:49:24PM -0700, Shaohua Li wrote: > From: Shaohua Li> > Currently loop disables merge. While it makes sense for buffer IO mode, > directio mode can benefit from request merge. Without merge, loop could > send small size IO to underlayer disk and harm performance. A couple of nits on comments below, besides that Reviewed-by: Omar Sandoval > Signed-off-by: Shaohua Li > --- > drivers/block/loop.c | 43 +++ > drivers/block/loop.h | 1 + > 2 files changed, 36 insertions(+), 8 deletions(-) > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index 428da07..1bfa3ff 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -213,10 +213,13 @@ static void __loop_update_dio(struct loop_device *lo, > bool dio) >*/ > blk_mq_freeze_queue(lo->lo_queue); > lo->use_dio = use_dio; > - if (use_dio) > + if (use_dio) { > + queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue); > lo->lo_flags |= LO_FLAGS_DIRECT_IO; > - else > + } else { > + queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue); > lo->lo_flags &= ~LO_FLAGS_DIRECT_IO; > + } Could you please also update the comment in loop_add() where we set QUEUE_FLAG_NOMERGES to say something like /* * By default we do buffered I/O, so it doesn't make sense to * enable merging because the I/O submitted to backing file is * handled page by page. We enable merging when switching to * direct I/O mode. 
*/ > blk_mq_unfreeze_queue(lo->lo_queue); > } > > @@ -464,6 +467,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long > ret, long ret2) > { > struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb); > > + kfree(cmd->bvec); > + cmd->bvec = NULL; > cmd->ret = ret; > blk_mq_complete_request(cmd->rq); > } > @@ -473,22 +478,44 @@ static int lo_rw_aio(struct loop_device *lo, struct > loop_cmd *cmd, > { > struct iov_iter iter; > struct bio_vec *bvec; > - struct bio *bio = cmd->rq->bio; > + struct request *rq = cmd->rq; > + struct bio *bio = rq->bio; > struct file *file = lo->lo_backing_file; > + unsigned int offset; > + int segments = 0; > int ret; > > - /* nomerge for loop request queue */ > - WARN_ON(cmd->rq->bio != cmd->rq->biotail); > + if (rq->bio != rq->biotail) { > + struct req_iterator iter; > + struct bio_vec tmp; > + > + __rq_for_each_bio(bio, rq) > + segments += bio_segments(bio); > + bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_KERNEL); > + if (!bvec) > + return -EIO; > + cmd->bvec = bvec; > + > + rq_for_each_segment(tmp, rq, iter) { > + *bvec = tmp; > + bvec++; > + } > + bvec = cmd->bvec; > + offset = 0; > + } else { > + offset = bio->bi_iter.bi_bvec_done; > + bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter); > + segments = bio_segments(bio); > + } > > - bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter); > iov_iter_bvec(, ITER_BVEC | rw, bvec, > - bio_segments(bio), blk_rq_bytes(cmd->rq)); > + segments, blk_rq_bytes(rq)); > /* >* This bio may be started from the middle of the 'bvec' >* because of bio splitting, so offset from the bvec must >* be passed to iov iterator >*/ This comment should be moved above to where you do offset = bio->bi_iter.bi_bvec_done; > - iter.iov_offset = bio->bi_iter.bi_bvec_done; > + iter.iov_offset = offset; > > cmd->iocb.ki_pos = pos; > cmd->iocb.ki_filp = file; > diff --git a/drivers/block/loop.h b/drivers/block/loop.h > index fecd3f9..bc9cf91 100644 > --- a/drivers/block/loop.h > +++ 
b/drivers/block/loop.h > @@ -72,6 +72,7 @@ struct loop_cmd { > bool use_aio; /* use AIO interface to handle I/O */ > long ret; > struct kiocb iocb; > + struct bio_vec *bvec; > }; > > /* Support for loadable transfer modules */ > -- > 2.9.5 >
Re: [PATCH 1/2] block/loop: set hw_sectors
On Wed, Aug 23, 2017 at 04:49:23PM -0700, Shaohua Li wrote:
> From: Shaohua Li
>
> Loop can handle any size of request. Limiting it to 255 sectors just
> burns the CPU for bio split and request merge for underlayer disk and
> also cause bad fs block allocation in directio mode.

Reviewed-by: Omar Sandoval

Note that this will conflict with my loop blocksize series, we can fix
up whichever series goes in second.

> Signed-off-by: Shaohua Li
> ---
>  drivers/block/loop.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index b55a1f8..428da07 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1799,6 +1799,7 @@ static int loop_add(struct loop_device **l, int i)
> 	}
> 	lo->lo_queue->queuedata = lo;
>
> +	blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
> 	/*
> 	 * It doesn't make sense to enable merge because the I/O
> 	 * submitted to backing file is handled page by page.
> -- 
> 2.9.5
>
[PATCH] block, scheduler: convert xxx_var_store to void
The last parameter "count" never be used in xxx_var_store, convert these functions to void. Signed-off-by: weiping zhang--- block/bfq-iosched.c | 33 + block/cfq-iosched.c | 13 ++--- block/deadline-iosched.c | 9 - block/mq-deadline.c | 9 - 4 files changed, 31 insertions(+), 33 deletions(-) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 436b6ca..7a4085d 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -4787,16 +4787,13 @@ static ssize_t bfq_var_show(unsigned int var, char *page) return sprintf(page, "%u\n", var); } -static ssize_t bfq_var_store(unsigned long *var, const char *page, -size_t count) +static void bfq_var_store(unsigned long *var, const char *page) { unsigned long new_val; int ret = kstrtoul(page, 10, _val); if (ret == 0) *var = new_val; - - return count; } #define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \ @@ -4838,7 +4835,7 @@ __FUNC(struct elevator_queue *e, const char *page, size_t count) \ { \ struct bfq_data *bfqd = e->elevator_data; \ unsigned long uninitialized_var(__data);\ - int ret = bfq_var_store(&__data, (page), count);\ + bfq_var_store(&__data, (page)); \ if (__data < (MIN)) \ __data = (MIN); \ else if (__data > (MAX))\ @@ -4849,7 +4846,7 @@ __FUNC(struct elevator_queue *e, const char *page, size_t count) \ *(__PTR) = (u64)__data * NSEC_PER_MSEC; \ else\ *(__PTR) = __data; \ - return ret; \ + return count; \ } STORE_FUNCTION(bfq_fifo_expire_sync_store, >bfq_fifo_expire[1], 1, INT_MAX, 2); @@ -4866,13 +4863,13 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\ { \ struct bfq_data *bfqd = e->elevator_data; \ unsigned long uninitialized_var(__data);\ - int ret = bfq_var_store(&__data, (page), count);\ + bfq_var_store(&__data, (page)); \ if (__data < (MIN)) \ __data = (MIN); \ else if (__data > (MAX))\ __data = (MAX); \ *(__PTR) = (u64)__data * NSEC_PER_USEC; \ - return ret; \ + return count; \ } USEC_STORE_FUNCTION(bfq_slice_idle_us_store, >bfq_slice_idle, 0, UINT_MAX); @@ -4883,7 +4880,8 
@@ static ssize_t bfq_max_budget_store(struct elevator_queue *e, { struct bfq_data *bfqd = e->elevator_data; unsigned long uninitialized_var(__data); - int ret = bfq_var_store(&__data, (page), count); + + bfq_var_store(&__data, (page)); if (__data == 0) bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd); @@ -4895,7 +4893,7 @@ static ssize_t bfq_max_budget_store(struct elevator_queue *e, bfqd->bfq_user_max_budget = __data; - return ret; + return count; } /* @@ -4907,7 +4905,8 @@ static ssize_t bfq_timeout_sync_store(struct elevator_queue *e, { struct bfq_data *bfqd = e->elevator_data; unsigned long uninitialized_var(__data); - int ret = bfq_var_store(&__data, (page), count); + + bfq_var_store(&__data, (page)); if (__data < 1) __data = 1; @@ -4918,7 +4917,7 @@ static ssize_t bfq_timeout_sync_store(struct elevator_queue *e, if (bfqd->bfq_user_max_budget == 0) bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd); - return ret; + return count; } static ssize_t bfq_strict_guarantees_store(struct elevator_queue *e, @@ -4926,7 +4925,8 @@ static ssize_t bfq_strict_guarantees_store(struct elevator_queue *e, { struct bfq_data *bfqd = e->elevator_data; unsigned long uninitialized_var(__data); - int ret = bfq_var_store(&__data, (page), count); + + bfq_var_store(&__data, (page)); if (__data > 1) __data = 1; @@
Re: [PATCH 3/4] loop: add ioctl for changing logical block size
On Thu, Aug 24, 2017 at 10:06:57AM +0200, Karel Zak wrote:
> On Thu, Aug 24, 2017 at 12:03:43AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval
> >
> > This is a different approach from the first attempt in f2c6df7dbf9a
> > ("loop: support 4k physical blocksize"). Rather than extending
> > LOOP_{GET,SET}_STATUS, add a separate ioctl just for setting the block
> > size.
> >
> > Signed-off-by: Omar Sandoval
> > ---
> >  drivers/block/loop.c      | 24
> >  include/uapi/linux/loop.h |  1 +
> >  2 files changed, 25 insertions(+)
> >
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 1a5b4ecf54ec..6b1d932028e3 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -1047,6 +1047,7 @@ static int loop_clr_fd(struct loop_device *lo)
> > 	memset(lo->lo_encrypt_key, 0, LO_KEY_SIZE);
> > 	memset(lo->lo_crypt_name, 0, LO_NAME_SIZE);
> > 	memset(lo->lo_file_name, 0, LO_NAME_SIZE);
> > +	blk_queue_logical_block_size(lo->lo_queue, 512);
> > 	if (bdev) {
> > 		bdput(bdev);
> > 		invalidate_bdev(bdev);
> > @@ -1330,6 +1331,24 @@ static int loop_set_dio(struct loop_device *lo, unsigned long arg)
> > 	return error;
> > }
> >
> > +static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
> > +{
> > +	if (lo->lo_state != Lo_bound)
> > +		return -ENXIO;
> > +
> > +	if (arg < 512 || arg > PAGE_SIZE || !is_power_of_2(arg))
> > +		return -EINVAL;
>
> I'm not sure if the PAGE_SIZE is the right limit for the use-case
> where backing file is a block device ...but it's nothing important.

A bigger logical block size just means we'll send bigger I/Os down to the
backing file, so I think it should be okay.

> > +	blk_mq_freeze_queue(lo->lo_queue);
> > +
> > +	blk_queue_logical_block_size(lo->lo_queue, arg);
>
> Just note that this function also updates physical_block_size if the
> 'arg' is greater than the current setting. That's good thing :-)

Patch 2 sets the physical block size to PAGE_SIZE, and we cap the logical
block size to PAGE_SIZE, so we won't actually hit that case anyways :)

> > +	loop_update_dio(lo);
> > +
> > +	blk_mq_unfreeze_queue(lo->lo_queue);
> > +
> > +	return 0;
> > +}
> > +
> > static int lo_ioctl(struct block_device *bdev, fmode_t mode,
> > 		unsigned int cmd, unsigned long arg)
> > {
> > @@ -1378,6 +1397,11 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
> > 		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
> > 			err = loop_set_dio(lo, arg);
> > 		break;
> > +	case LOOP_SET_BLOCK_SIZE:
> > +		err = -EPERM;
> > +		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
> > +			err = loop_set_block_size(lo, arg);
> > +		break;
>
> I like it this ioctl idea. It makes loop based tests more comfortable
> as we can change sector size on the fly without loop device reset.
>
> I did not test it yet, but this implementation seems the best. Thanks.

Thanks! Let me know if you spot any issues when you test it.
Re: [PATCH 4/5] skd: Avoid double completions in case of a timeout
On Wed, 2017-08-23 at 20:08 +0200, Christoph Hellwig wrote: > Although as a follow on I would also move the debug printks > from skd_end_request to skd_softirq_done, then remove skd_end_request > and rename skd_softirq_done to skd_complete_rq to fit what other > drivers are doing a little more closely. Hello Christoph, Thanks for the review of this patch series. I will include the above changes when I post my next patch series for the skd driver. Bart.
Re: [bug report] skd: Avoid that module unloading triggers a use-after-free
On Thu, 2017-08-24 at 14:04 +0300, Dan Carpenter wrote:
> Hello Bart Van Assche,
>
> This is a semi-automatic email about new static checker warnings.
>
> The patch 7277cc67b391: "skd: Avoid that module unloading triggers a
> use-after-free" from Aug 17, 2017, leads to the following Smatch
> complaint:
>
> drivers/block/skd_main.c:3080 skd_free_disk()
> 	error: we previously assumed 'disk' could be null (see line 3074)
>
> drivers/block/skd_main.c
>   3073
>   3074	if (disk && (disk->flags & GENHD_FL_UP))
>
> Existing code checked for NULL.  The new code shuffles things around.
>
>   3075		del_gendisk(disk);
>   3076
>   3077	if (skdev->queue) {
>   3078		blk_cleanup_queue(skdev->queue);
>   3079		skdev->queue = NULL;
>   3080		disk->queue = NULL;
>                 ^^^
> Now we don't check here.
>
>   3081	}
>   3082
>
> regards,
> dan carpenter

Hello Dan,

If you have a look at skd_cons_disk() you will see that skdev->queue != NULL implies that skdev->disk != NULL. So I think the above report is a false positive.

Bart.
Re: [PATCH 2/2] compat_hdio_ioctl: Fix a declaration
On 08/23/2017 04:29 PM, Bart Van Assche wrote:
> This patch avoids that sparse reports the following warning messages:
>
> block/compat_ioctl.c:85:11: warning: incorrect type in assignment (different address spaces)
> block/compat_ioctl.c:85:11:    expected unsigned long *[noderef] p
> block/compat_ioctl.c:85:11:    got void [noderef] *
> block/compat_ioctl.c:91:21: warning: incorrect type in argument 1 (different address spaces)
> block/compat_ioctl.c:91:21:    expected void const volatile [noderef] *
> block/compat_ioctl.c:91:21:    got unsigned long *[noderef] p
> block/compat_ioctl.c:87:53: warning: dereference of noderef expression
> block/compat_ioctl.c:91:21: warning: dereference of noderef expression

Applied for 4.14, thanks Bart.

-- 
Jens Axboe
Re: [PATCH 4/4] block_dev: support RWF_NOWAIT on block device nodes
On Tue 22-08-17 18:17:12, Christoph Hellwig wrote:
> All support is already there in the generic code, we just need to wire
> it up.
>
> Signed-off-by: Christoph Hellwig

Looks good. You can add:

Reviewed-by: Jan Kara

								Honza

> ---
>  fs/block_dev.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 9941dc8342df..ea21d18d8e79 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1739,6 +1739,8 @@ static int blkdev_open(struct inode * inode, struct file * filp)
>  	 */
>  	filp->f_flags |= O_LARGEFILE;
>
> +	filp->f_mode |= FMODE_NOWAIT;
> +
>  	if (filp->f_flags & O_NDELAY)
>  		filp->f_mode |= FMODE_NDELAY;
>  	if (filp->f_flags & O_EXCL)
> @@ -1891,6 +1893,9 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	if (iocb->ki_pos >= size)
>  		return -ENOSPC;
>
> +	if ((iocb->ki_flags & (IOCB_NOWAIT | IOCB_DIRECT)) == IOCB_NOWAIT)
> +		return -EOPNOTSUPP;
> +
>  	iov_iter_truncate(from, size - iocb->ki_pos);
>
>  	blk_start_plug(&plug);
> --
> 2.11.0

-- 
Jan Kara
SUSE Labs, CR
Re: [PATCH 3/4] fs: support RWF_NOWAIT for buffered reads
On Tue 22-08-17 18:17:11, Christoph Hellwig wrote: > This is based on the old idea and code from Milosz Tanski. With the aio > nowait code it becomes mostly trivial now. Buffered writes continue to > return -EOPNOTSUPP if RWF_NOWAIT is passed. > > Signed-off-by: Christoph HellwigLooks good. You can add: Reviewed-by: Jan Kara Honza > --- > fs/aio.c | 6 -- > fs/btrfs/file.c| 6 +- > fs/ext4/file.c | 6 +++--- > fs/xfs/xfs_file.c | 11 +-- > include/linux/fs.h | 6 +++--- > 5 files changed, 20 insertions(+), 15 deletions(-) > > diff --git a/fs/aio.c b/fs/aio.c > index dcad3a66748c..d93daa076726 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1593,12 +1593,6 @@ static int io_submit_one(struct kioctx *ctx, struct > iocb __user *user_iocb, > goto out_put_req; > } > > - if ((req->common.ki_flags & IOCB_NOWAIT) && > - !(req->common.ki_flags & IOCB_DIRECT)) { > - ret = -EOPNOTSUPP; > - goto out_put_req; > - } > - > ret = put_user(KIOCB_KEY, _iocb->aio_key); > if (unlikely(ret)) { > pr_debug("EFAULT: aio_key\n"); > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c > index 9e75d8a39aac..e62dd55b4079 100644 > --- a/fs/btrfs/file.c > +++ b/fs/btrfs/file.c > @@ -1886,6 +1886,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb > *iocb, > loff_t oldsize; > int clean_page = 0; > > + if (!(iocb->ki_flags & IOCB_DIRECT) && > + (iocb->ki_flags & IOCB_NOWAIT)) > + return -EOPNOTSUPP; > + > if (!inode_trylock(inode)) { > if (iocb->ki_flags & IOCB_NOWAIT) > return -EAGAIN; > @@ -3105,7 +3109,7 @@ static loff_t btrfs_file_llseek(struct file *file, > loff_t offset, int whence) > > static int btrfs_file_open(struct inode *inode, struct file *filp) > { > - filp->f_mode |= FMODE_AIO_NOWAIT; > + filp->f_mode |= FMODE_NOWAIT; > return generic_file_open(inode, filp); > } > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > index 0d7cf0cc9b87..f83521337b8f 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -223,6 +223,8 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter > *from) > if 
(IS_DAX(inode)) > return ext4_dax_write_iter(iocb, from); > #endif > + if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT)) > + return -EOPNOTSUPP; > > if (!inode_trylock(inode)) { > if (iocb->ki_flags & IOCB_NOWAIT) > @@ -448,9 +450,7 @@ static int ext4_file_open(struct inode * inode, struct > file * filp) > return ret; > } > > - /* Set the flags to support nowait AIO */ > - filp->f_mode |= FMODE_AIO_NOWAIT; > - > + filp->f_mode |= FMODE_NOWAIT; > return dquot_file_open(inode, filp); > } > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index c4893e226fd8..1a09104b3eb0 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -259,7 +259,11 @@ xfs_file_buffered_aio_read( > > trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos); > > - xfs_ilock(ip, XFS_IOLOCK_SHARED); > + if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) { > + if (iocb->ki_flags & IOCB_NOWAIT) > + return -EAGAIN; > + xfs_ilock(ip, XFS_IOLOCK_SHARED); > + } > ret = generic_file_read_iter(iocb, to); > xfs_iunlock(ip, XFS_IOLOCK_SHARED); > > @@ -636,6 +640,9 @@ xfs_file_buffered_aio_write( > int enospc = 0; > int iolock; > > + if (iocb->ki_flags & IOCB_NOWAIT) > + return -EOPNOTSUPP; > + > write_retry: > iolock = XFS_IOLOCK_EXCL; > xfs_ilock(ip, iolock); > @@ -912,7 +919,7 @@ xfs_file_open( > return -EFBIG; > if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb))) > return -EIO; > - file->f_mode |= FMODE_AIO_NOWAIT; > + file->f_mode |= FMODE_NOWAIT; > return 0; > } > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 6e1fd5d21248..ded258edc425 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -146,8 +146,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t > offset, > /* File was opened by fanotify and shouldn't generate fanotify events */ > #define FMODE_NONOTIFY ((__force fmode_t)0x400) > > -/* File is capable of returning -EAGAIN if AIO will block */ > -#define FMODE_AIO_NOWAIT ((__force fmode_t)0x800) > +/* File is capable of returning -EAGAIN if I/O will 
block */ > +#define FMODE_NOWAIT ((__force fmode_t)0x800) > > /* > * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector > @@ -3149,7 +3149,7 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, > int flags) > return -EOPNOTSUPP; > > if (flags &
Re: [PATCH 2/4] fs: support IOCB_NOWAIT in generic_file_buffered_read
On Tue 22-08-17 18:17:10, Christoph Hellwig wrote:
> From: Milosz Tanski
>
> Allow generic_file_buffered_read to bail out early instead of waiting for
> the page lock or reading a page if IOCB_NOWAIT is specified.
>
> Signed-off-by: Milosz Tanski
> Reviewed-by: Christoph Hellwig
> Reviewed-by: Jeff Moyer
> Acked-by: Sage Weil
> ---
>  mm/filemap.c | 15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4bcfa74ad802..9f60255fb7bb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1937,6 +1937,8 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
>
>  		page = find_get_page(mapping, index);
>  		if (!page) {
> +			if (iocb->ki_flags & IOCB_NOWAIT)
> +				goto would_block;
>  			page_cache_sync_readahead(mapping,
>  					ra, filp,
>  					index, last_index - index);

Hum, we have:

		if (!PageUptodate(page)) {
			/*
			 * See comment in do_read_cache_page on why
			 * wait_on_page_locked is used to avoid unnecessarily
			 * serialisations and why it's safe.
			 */
			error = wait_on_page_locked_killable(page);
			if (unlikely(error))
				goto readpage_error;
			if (PageUptodate(page))
				goto page_ok;

And wait_on_page_locked_killable() above does not seem to be handled in your patch. I would just check IOCB_NOWAIT in the !PageUptodate(page) branch above and bail - which also removes the need for the two checks below...

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: [PATCH] block: remove blk_free_devt in add_partition
On 08/18/2017 09:54 AM, weiping zhang wrote: > put_device(pdev) will call pdev->type->release finally, and blk_free_devt > has been called in part_release(), so remove it. Your analysis looks correct to me, applied for 4.14. -- Jens Axboe
Re: [PATCH 1/4] fs: pass iocb to do_generic_file_read
On Tue 22-08-17 18:17:09, Christoph Hellwig wrote:
> And rename it to the more descriptive generic_file_buffered_read while
> at it.
>
> Signed-off-by: Christoph Hellwig

Looks good. You can add:

Reviewed-by: Jan Kara

								Honza

> ---
>  mm/filemap.c | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a49702445ce0..4bcfa74ad802 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1886,9 +1886,8 @@ static void shrink_readahead_size_eio(struct file *filp,
>  }
>
>  /**
> - * do_generic_file_read - generic file read routine
> - * @filp:	the file to read
> - * @ppos:	current file position
> + * generic_file_buffered_read - generic file read routine
> + * @iocb:	the iocb to read
>  * @iter:	data destination
>  * @written:	already copied
>  *
> @@ -1898,12 +1897,14 @@ static void shrink_readahead_size_eio(struct file *filp,
>  * This is really ugly. But the goto's actually try to clarify some
>  * of the logic when it comes to error handling etc.
>  */
> -static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
> +static ssize_t generic_file_buffered_read(struct kiocb *iocb,
>  		struct iov_iter *iter, ssize_t written)
>  {
> +	struct file *filp = iocb->ki_filp;
>  	struct address_space *mapping = filp->f_mapping;
>  	struct inode *inode = mapping->host;
>  	struct file_ra_state *ra = &filp->f_ra;
> +	loff_t *ppos = &iocb->ki_pos;
>  	pgoff_t index;
>  	pgoff_t last_index;
>  	pgoff_t prev_index;
> @@ -2151,14 +2152,14 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
>  ssize_t
>  generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  {
> -	struct file *file = iocb->ki_filp;
> -	ssize_t retval = 0;
>  	size_t count = iov_iter_count(iter);
> +	ssize_t retval = 0;
>
>  	if (!count)
>  		goto out; /* skip atime */
>
>  	if (iocb->ki_flags & IOCB_DIRECT) {
> +		struct file *file = iocb->ki_filp;
>  		struct address_space *mapping = file->f_mapping;
>  		struct inode *inode = mapping->host;
>  		loff_t size;
> @@ -2199,7 +2200,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  		goto out;
>  	}
>
> -	retval = do_generic_file_read(file, &iocb->ki_pos, iter, retval);
> +	retval = generic_file_buffered_read(iocb, iter, retval);
>  out:
>  	return retval;
>  }
> --
> 2.11.0

-- 
Jan Kara
SUSE Labs, CR
Re: [PATCH v2 0/1] bsg: fix regression resulting in panics when sending commands via BSG
On 08/23/2017 05:57 PM, Benjamin Block wrote: > Hello all, > > This is the second try for fixing the regression in the BSG-interface that > exists since v4.11 (for more infos see the first series). > > I separated my other changes from the bug-fix so that it is easier to apply > if judged good. I will rebase my cleanups I sent in v1 and send them when I > get a bit more time. But the regression-fix is more important, so here's > that. > > I did some more tests on it than on v1, including some heavy parallel I/O > on the same blk-queue using both BSG and the normal SCSI-stack at the same > time (throwing some intentional bad commands in it too). That seemed to > work all well enough - i.e. it didn't crash and got the expected results. I > haven't done any external error-inject, but IMO that would be beyond the > scope right now. > > The fix is based on Christoph's idea, I discussed this with him off-list > already. > > I rebased the series on Jens' for-next. Added for 4.13, thanks. -- Jens Axboe
Re: [PATCH for-next] block: Fix left-over bi_bdev usage
On 08/24/2017 02:41 AM, Christoph Hellwig wrote: > On Wed, Aug 23, 2017 at 07:45:55PM -0400, Keith Busch wrote: >> The bi_bdev field was replaced with the gendisk. This patch just fixes >> an omission. > > Thanks Keith. I'm pretty sure I fixed exactly this one when > rebasing from my nvme tree to the block tree. But I guess I didn't > finish the rebase fully, as usual - git rebase considered harmful :) Problem is, the issue doesn't exist in my for-4.14/block branch, it only exists when that is merged with master. I'll apply it to my for-next branch, even though that one is not persistent, so that it builds. -- Jens Axboe
Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs
Hello, Laurent.

On Thu, Aug 24, 2017 at 02:10:31PM +0200, Laurent Vivier wrote:
> > Yeah, it just needs to match up new cpus to the cpu ids assigned to
> > the right node.
>
> We are not able to assign the cpu ids to the right node before the CPU
> is present, because firmware doesn't provide CPU mapping <-> node id
> before that.

What I meant was to assign the matching CPU ID when the CPU becomes present - ie. have CPU IDs available for different nodes and allocate them to the new CPU according to its node mapping when it actually comes up. Please note that I'm not saying this is the way to go, just that it is a solvable problem from the arch code.

> > The node mapping for that cpu id changes *dynamically* while the
> > system is running and that can race with node-affinity sensitive
> > operations such as memory allocations.
>
> Memory is mapped to the node through its own firmware entry, so I don't
> think cpu id change can affect memory affinity, and before we know the
> node id of the CPU, the CPU is not present and thus it can't use memory.

The latter part isn't true. For example, percpu memory gets allocated for all possible CPUs according to their node affinity, so the memory node association change which happens when the CPU comes up for the first time can race against such allocations. I don't know whether that's actually problematic but we don't have *any* synchronization around it. If you think it's safe to have such races, please explain why that is.

> > Please take a step back and think through the problem again. You
> > can't bandaid it this way.
>
> Could you give some ideas, proposals?
> As the firmware doesn't provide the information before the CPU is really
> plugged, I really don't know how to manage this problem.

There are two possible approaches, I think.

1. Make physical cpu -> logical cpu mapping indirect so that the
   kernel's cpu ID assignment is always on the right numa node. This may
   mean that the kernel might have to keep more possible CPUs around
   than necessary but it does have the benefit that all memory
   allocations are affine to the right node.

2. Make cpu <-> node mapping properly dynamic. Identify what sort of
   synchronization we'd need around the mapping changing dynamically.
   Note that we might not need much but it'll most likely need some.
   Build synchronization and notification infrastructure around it.

Thanks.

-- 
tejun
Re: [PATCH v2 1/1] bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer
On Thu, Aug 24, 2017 at 10:45:56AM +0200, Christoph Hellwig wrote:
> > /**
> > - * bsg_destroy_job - routine to teardown/delete a bsg job
> > + * bsg_teardown_job - routine to teardown a bsg job
> >  * @job: bsg_job that is to be torn down
> >  */
> > -static void bsg_destroy_job(struct kref *kref)
> > +static void bsg_teardown_job(struct kref *kref)
>
> Why this rename? The destroy name seems to be one of the most
> common patterns for the kref_put callbacks.

Hmm, I did it mostly so it is symmetric with bsg_prepare_job() and it doesn't really itself destroy the job-struct anymore. If there are other things amiss I can change that along with them, if it bothers people.

Beste Grüße / Best regards,
  - Benjamin Block

> Otherwise this looks fine:
>
> Reviewed-by: Christoph Hellwig

-- 
Linux on z Systems Development / IBM Systems & Technology Group
IBM Deutschland Research & Development GmbH
Vorsitz. AufsR.: Martina Koederitz / Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: [PATCH 4/4] loop: fold loop_switch() into callers
On 08/24/2017 09:03 AM, Omar Sandoval wrote:
> From: Omar Sandoval
>
> The comments here are really outdated, and blk-mq made flushing much
> simpler, so just fold the two cases into the callers.
>
> Signed-off-by: Omar Sandoval
> ---
>  drivers/block/loop.c | 76
>  1 file changed, 11 insertions(+), 65 deletions(-)

Reviewed-by: Hannes Reinecke

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH 3/4] loop: add ioctl for changing logical block size
On 08/24/2017 09:03 AM, Omar Sandoval wrote: > From: Omar Sandoval> > This is a different approach from the first attempt in f2c6df7dbf9a > ("loop: support 4k physical blocksize"). Rather than extending > LOOP_{GET,SET}_STATUS, add a separate ioctl just for setting the block > size. > > Signed-off-by: Omar Sandoval > --- > drivers/block/loop.c | 24 > include/uapi/linux/loop.h | 1 + > 2 files changed, 25 insertions(+) > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index 1a5b4ecf54ec..6b1d932028e3 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -1047,6 +1047,7 @@ static int loop_clr_fd(struct loop_device *lo) > memset(lo->lo_encrypt_key, 0, LO_KEY_SIZE); > memset(lo->lo_crypt_name, 0, LO_NAME_SIZE); > memset(lo->lo_file_name, 0, LO_NAME_SIZE); > + blk_queue_logical_block_size(lo->lo_queue, 512); > if (bdev) { > bdput(bdev); > invalidate_bdev(bdev); > @@ -1330,6 +1331,24 @@ static int loop_set_dio(struct loop_device *lo, > unsigned long arg) > return error; > } > > +static int loop_set_block_size(struct loop_device *lo, unsigned long arg) > +{ > + if (lo->lo_state != Lo_bound) > + return -ENXIO; > + > + if (arg < 512 || arg > PAGE_SIZE || !is_power_of_2(arg)) > + return -EINVAL; > + > + blk_mq_freeze_queue(lo->lo_queue); > + > + blk_queue_logical_block_size(lo->lo_queue, arg); > + loop_update_dio(lo); > + > + blk_mq_unfreeze_queue(lo->lo_queue); > + > + return 0; > +} > + > static int lo_ioctl(struct block_device *bdev, fmode_t mode, > unsigned int cmd, unsigned long arg) > { > @@ -1378,6 +1397,11 @@ static int lo_ioctl(struct block_device *bdev, fmode_t > mode, > if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) > err = loop_set_dio(lo, arg); > break; > + case LOOP_SET_BLOCK_SIZE: > + err = -EPERM; > + if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) > + err = loop_set_block_size(lo, arg); > + break; > default: > err = lo->ioctl ? 
lo->ioctl(lo, cmd, arg) : -EINVAL; > } > diff --git a/include/uapi/linux/loop.h b/include/uapi/linux/loop.h > index c8125ec1f4f2..23158dbe2424 100644 > --- a/include/uapi/linux/loop.h > +++ b/include/uapi/linux/loop.h > @@ -88,6 +88,7 @@ struct loop_info64 { > #define LOOP_CHANGE_FD 0x4C06 > #define LOOP_SET_CAPACITY0x4C07 > #define LOOP_SET_DIRECT_IO 0x4C08 > +#define LOOP_SET_BLOCK_SIZE 0x4C09 > > /* /dev/loop-control interface */ > #define LOOP_CTL_ADD 0x4C80 > Hmm. If the losetup / util-linux maintainers are on board with this, okay. Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
Re: [PATCH 2/4] loop: set physical block size to PAGE_SIZE
On 08/24/2017 09:03 AM, Omar Sandoval wrote:
> From: Omar Sandoval
>
> The physical block size is "the lowest possible sector size that the
> hardware can operate on without reverting to read-modify-write
> operations" (from the comment on blk_queue_physical_block_size()). Since
> loop does buffered I/O on the backing file by default, the RMW unit is a
> page. This isn't the case for direct I/O mode, but let's keep it simple.
>
> Signed-off-by: Omar Sandoval
> ---
>  drivers/block/loop.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 54e091887199..1a5b4ecf54ec 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1764,6 +1764,8 @@ static int loop_add(struct loop_device **l, int i)
>  	}
>  	lo->lo_queue->queuedata = lo;
>
> +	blk_queue_physical_block_size(lo->lo_queue, PAGE_SIZE);
> +
>  	/*
>  	 * It doesn't make sense to enable merge because the I/O
>  	 * submitted to backing file is handled page by page.

Let's see if this one goes through ...

Reviewed-by: Hannes Reinecke

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH 1/4] loop: get rid of lo_blocksize
On 08/24/2017 09:03 AM, Omar Sandoval wrote:
> From: Omar Sandoval
>
> This is only used for setting the soft block size on the struct
> block_device once and then never used again.
>
> Signed-off-by: Omar Sandoval
> ---
>  drivers/block/loop.c | 10 ++
>  drivers/block/loop.h |  1 -
>  2 files changed, 2 insertions(+), 9 deletions(-)

Reviewed-by: Hannes Reinecke

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs
On 23/08/2017 15:26, Tejun Heo wrote: > Hello, Michael. > > On Wed, Aug 23, 2017 at 09:00:39PM +1000, Michael Ellerman wrote: >>> I don't think that's true. The CPU id used in kernel doesn't have to >>> match the physical one and arch code should be able to pre-map CPU IDs >>> to nodes and use the matching one when hotplugging CPUs. I'm not >>> saying that's the best way to solve the problem tho. >> >> We already virtualise the CPU numbers, but not the node IDs. And it's >> the node IDs that are really the problem. > > Yeah, it just needs to match up new cpus to the cpu ids assigned to > the right node. We are not able to assign the cpu ids to the right node before the CPU is present, because firmware doesn't provide CPU mapping <-> node id before that. >>> It could be that the best way forward is making cpu <-> node mapping >>> dynamic and properly synchronized. >> >> We don't need it to be dynamic (at least for this bug). > > The node mapping for that cpu id changes *dynamically* while the > system is running and that can race with node-affinity sensitive > operations such as memory allocations. Memory is mapped to the node through its own firmware entry, so I don't think cpu id change can affect memory affinity, and before we know the node id of the CPU, the CPU is not present and thus it can't use memory. >> Laurent is booting Qemu with a fixed CPU <-> Node mapping, it's just >> that because some CPUs aren't present at boot we don't know what the >> node mapping is. (Correct me if I'm wrong Laurent). >> >> So all we need is: >> - the workqueue code to cope with CPUs that are possible but not online >>having NUMA_NO_NODE to begin with. >> - a way to update the workqueue cpumask when the CPU comes online. >> >> Which seems reasonable to me? > > Please take a step back and think through the problem again. You > can't bandaid it this way. Could you give some ideas, proposals? 
As the firmware doesn't provide the information before the CPU is really plugged, I really don't know how to manage this problem. Thanks, Laurent
Re: [PATCH] block: remove blk_free_devt in add_partition
On Fri, Aug 18, 2017 at 11:54:45PM +0800, weiping zhang wrote:
> put_device(pdev) will call pdev->type->release finally, and blk_free_devt
> has been called in part_release(), so remove it.
>
> Signed-off-by: weiping zhang
> ---
>  block/partition-generic.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/block/partition-generic.c b/block/partition-generic.c
> index c5ec824..04d82dd 100644
> --- a/block/partition-generic.c
> +++ b/block/partition-generic.c
> @@ -391,7 +391,6 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
>  	device_del(pdev);
>  out_put:
>  	put_device(pdev);
> -	blk_free_devt(devt);
>  	return ERR_PTR(err);
>  }
>
> --
> 2.9.4
>

Hi Jens, would you please look at this trivial patch at your convenience? Thanks
[bug report] skd: Avoid that module unloading triggers a use-after-free
Hello Bart Van Assche, This is a semi-automatic email about new static checker warnings. The patch 7277cc67b391: "skd: Avoid that module unloading triggers a use-after-free" from Aug 17, 2017, leads to the following Smatch complaint: drivers/block/skd_main.c:3080 skd_free_disk() error: we previously assumed 'disk' could be null (see line 3074) drivers/block/skd_main.c 3073 3074 if (disk && (disk->flags & GENHD_FL_UP)) Existing code checked for NULL. The new code shuffles things around. 3075 del_gendisk(disk); 3076 3077 if (skdev->queue) { 3078 blk_cleanup_queue(skdev->queue); 3079 skdev->queue = NULL; 3080 disk->queue = NULL; ^^^ Now we don't check here. 3081 } 3082 regards, dan carpenter
Re: [PATCH 2/6] raid5: remove a call to get_start_sect
On Wed, Aug 23, 2017 at 11:23:38AM -0700, Shaohua Li wrote: > On Wed, Aug 23, 2017 at 07:10:28PM +0200, Christoph Hellwig wrote: > > The block layer always remaps partitions before calling into the > > ->make_request methods of drivers. Thus the call to get_start_sect in > > in_chunk_boundary will always return 0 and can be removed. > > > > Signed-off-by: Christoph Hellwig> > --- > > drivers/md/raid5.c | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c > > index 0fc2748aaf95..d687aeb1b538 100644 > > --- a/drivers/md/raid5.c > > +++ b/drivers/md/raid5.c > > @@ -5092,10 +5092,12 @@ static int raid5_congested(struct mddev *mddev, int > > bits) > > static int in_chunk_boundary(struct mddev *mddev, struct bio *bio) > > { > > struct r5conf *conf = mddev->private; > > - sector_t sector = bio->bi_iter.bi_sector + get_start_sect(bio->bi_bdev); > > + sector_t sector = bio->bi_iter.bi_sector; > > unsigned int chunk_sectors; > > unsigned int bio_sectors = bio_sectors(bio); > > > > + WARN_ON_ONCE(bio->bi_partno); > > + Meh, of course bi_partno is only added a few patches later, so this breaks bisectability. But given that the patch is already in there's probably nothing I can do here.
Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
On Wed, Aug 23, 2017 at 06:21:55PM +0000, Bart Van Assche wrote:
> On Wed, 2017-08-23 at 19:58 +0200, Christoph Hellwig wrote:
> > +static blk_qc_t nvme_make_request(struct request_queue *q, struct bio *bio)
> > +{
> > +	struct nvme_ns_head *head = q->queuedata;
> > +	struct nvme_ns *ns;
> > +	blk_qc_t ret = BLK_QC_T_NONE;
> > +	int srcu_idx;
> > +
> > +	srcu_idx = srcu_read_lock(&head->srcu);
> > +	ns = srcu_dereference(head->current_path, &head->srcu);
> > +	if (unlikely(!ns || ns->ctrl->state != NVME_CTRL_LIVE))
> > +		ns = nvme_find_path(head);
> > +	if (likely(ns)) {
> > +		bio->bi_disk = ns->disk;
> > +		bio->bi_opf |= REQ_FAILFAST_TRANSPORT;
> > +		ret = generic_make_request_fast(bio);
> > +	} else if (!list_empty_careful(&head->list)) {
> > +		printk_ratelimited("no path available - requeing I/O\n");
> > +
> > +		spin_lock_irq(&head->requeue_lock);
> > +		bio_list_add(&head->requeue_list, bio);
> > +		spin_unlock_irq(&head->requeue_lock);
> > +	} else {
> > +		printk_ratelimited("no path - failing I/O\n");
> > +
> > +		bio->bi_status = BLK_STS_IOERR;
> > +		bio_endio(bio);
> > +	}
> > +
> > +	srcu_read_unlock(&head->srcu, srcu_idx);
> > +	return ret;
> > +}
>
> Hello Christoph,
>
> Since generic_make_request_fast() returns BLK_STS_AGAIN for a dying path:
> can the same kind of soft lockups occur with the NVMe multipathing code as
> with the current upstream device mapper multipathing code? See e.g.
> "[PATCH 3/7] dm-mpath: Do not lock up a CPU with requeuing activity"
> (https://www.redhat.com/archives/dm-devel/2017-August/msg00124.html).

I suspect the code is not going to hit it because we check the controller state before trying to queue I/O on the lower queue. But if you point me to a good reproducer test case I'd like to check.

Also does the "single queue" case in your mail refer to the old request code? nvme only uses blk-mq so it would not hit that. But either way I think get_request should be fixed to return BLK_STS_IOERR if the queue is dying instead of BLK_STS_AGAIN.
> Another question about this code is what will happen if > generic_make_request_fast() returns BLK_STS_AGAIN and the submit_bio() or > generic_make_request() caller ignores the return value of the called > function? A quick grep revealed that there is plenty of code that ignores > the return value of these last two functions. generic_make_request and generic_make_request_fast only return the polling cookie (blk_qc_t), not a block status. Note that we do not use blk_get_request / blk_mq_alloc_request for the request allocation of the request on the lower device, so unless the caller passed REQ_NOWAIT and is able to handle BLK_STS_AGAIN we won't ever return it.
Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
On Wed, Aug 23, 2017 at 06:53:00PM -0400, Keith Busch wrote: > On Wed, Aug 23, 2017 at 07:58:15PM +0200, Christoph Hellwig wrote: > > > > TODO: implement sysfs interfaces for the new subsystem and > > subsystem-namespace object. Unless we can come up with something > > better than sysfs here.. > > Can we get symlinks from the multipath'ed nvme block device to the > individual paths? I think it should be something like the following: > > > @@ -2616,6 +2821,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, > > unsigned nsid) > > if (ns->ndev && nvme_nvm_register_sysfs(ns)) > > pr_warn("%s: failed to register lightnvm sysfs group for > > identification\n", > > ns->disk->disk_name); > > + > > + if (new) > > + add_disk(ns->head->disk); > + > + sysfs_create_link(_to_dev(ns->head->disk)->kobj, > + _to_dev(ns->disk)->kobj, > + ns->disk->name); Yeah, probably..
Re: [PATCH 06/10] nvme: track subsystems
On Wed, Aug 23, 2017 at 06:04:34PM -0400, Keith Busch wrote: > > + struct nvme_subsystem *subsys; > > + > > + lockdep_assert_held(_subsystems_lock); > > + > > + list_for_each_entry(subsys, _subsystems, entry) { > > + if (strcmp(subsys->subnqn, subsysnqn)) > > + continue; > > + if (!kref_get_unless_zero(>ref)) > > + continue; > > You should be able to just return immediately here since there can't be > a duplicated subsysnqn in the list. We could have a race where we tear one subsystem instance down and are setting up another one. > > + INIT_LIST_HEAD(>ctrls); > > + kref_init(>ref); > > + nvme_init_subnqn(subsys, ctrl, id); > > + mutex_init(>lock); > > + > > + mutex_lock(_subsystems_lock); > > This could be a spinlock instead of a mutex. We could. But given that the lock is not in the hot path there seems to be no point in making it a spinlock. > > + found = __nvme_find_get_subsystem(subsys->subnqn); > > + if (found) { > > + /* > > +* Verify that the subsystem actually supports multiple > > +* controllers, else bail out. > > +*/ > > + kfree(subsys); > > + if (!(id->cmic & (1 << 1))) { > > + dev_err(ctrl->device, > > + "ignoring ctrl due to duplicate subnqn (%s).\n", > > + found->subnqn); > > + mutex_unlock(_subsystems_lock); > > + return -EINVAL; > > Returning -EINVAL here will cause nvme_init_identify to fail. Do we want > that to happen here? I think we want to be able to manage controllers > in such a state, but just checking if there's a good reason to not allow > them. Without this we will get duplicate nvme_subsystem structures, messing up the whole lookup. We could mark them as buggy with a flag and make sure controllers without CMIC bit 1 set will never be linked.
Re: [PATCH v2 1/1] bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer
> /** > - * bsg_destroy_job - routine to teardown/delete a bsg job > + * bsg_teardown_job - routine to teardown a bsg job > * @job: bsg_job that is to be torn down > */ > -static void bsg_destroy_job(struct kref *kref) > +static void bsg_teardown_job(struct kref *kref) Why this rename? The destroy name seems to be one of the most common patterns for the kref_put callbacks. Otherwise this looks fine: Reviewed-by: Christoph Hellwig
Re: [PATCH for-next] block: Fix left-over bi_bdev usage
On Wed, Aug 23, 2017 at 07:45:55PM -0400, Keith Busch wrote: > The bi_bdev field was replaced with the gendisk. This patch just fixes > an omission. Thanks Keith. I'm pretty sure I fixed exactly this one when rebasing from my nvme tree to the block tree. But I guess I didn't finish the rebase fully, as usual - git rebase considered harmful :) Acked-by: Christoph Hellwig
Re: [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list
On 8/24/17 16:10, Ming Lei wrote: > On Thu, Aug 24, 2017 at 09:59:13AM +0900, Damien Le Moal wrote: >> >> On 8/23/17 05:43, Bart Van Assche wrote: >>> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote: +static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx) +{ + return !list_empty_careful(&hctx->dispatch); +} + +static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx, + struct request *rq) +{ + spin_lock(&hctx->lock); + list_add(&rq->queuelist, &hctx->dispatch); + blk_mq_hctx_set_dispatch_busy(hctx); + spin_unlock(&hctx->lock); +} + +static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx, + struct list_head *list) +{ + spin_lock(&hctx->lock); + list_splice_init(list, &hctx->dispatch); + blk_mq_hctx_set_dispatch_busy(hctx); + spin_unlock(&hctx->lock); +} + +static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx, + struct list_head *list) +{ + spin_lock(&hctx->lock); + list_splice_tail_init(list, &hctx->dispatch); + blk_mq_hctx_set_dispatch_busy(hctx); + spin_unlock(&hctx->lock); +} + +static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx, + struct list_head *list) +{ + spin_lock(&hctx->lock); + list_splice_init(&hctx->dispatch, list); + spin_unlock(&hctx->lock); +} >>> >>> Same comment for this patch: these helper functions are so short that I'm not >>> sure it is useful to introduce these helper functions. >>> >>> Bart. >> >> Personally, I like those very much as they give a place to hook up >> different dispatch_list handling without having to change blk-mq.c and >> blk-mq-sched.c all over the place. >> >> I am thinking of SMR (zoned block device) support here since we need to >> sort-insert write requests in blk_mq_add_rq_to_dispatch() and >> blk_mq_add_list_to_dispatch_tail(). For this last one, the name would > > Could you explain a bit why you sort-insert write rq in these two > helpers? which are only triggered in case of dispatch busy. 
For host-managed zoned block devices (SMR disks), writes have to be sequential per zone of the device, otherwise an "unaligned write error" is returned by the device. This constraint is exposed to the user (FS, device mapper or applications doing raw I/Os to the block device). So a well-behaved user (f2fs and dm-zoned in the kernel) will issue sequential writes to zones (this includes write same and write zeroes). However, the current blk-mq code does not guarantee that this issuing order will be respected at dispatch time. There are multiple reasons: 1) On submit, requests first go into per-CPU context lists. So if the issuer thread is moved from one CPU to another in the middle of a sequential write request sequence, the sequence would be broken in two parts into different lists. When a scheduler is used, this can get fixed by a nice LBA ordering (mq-deadline would do that), but in the "none" scheduler case, the requests in the CPU lists are merged into the hctx dispatch queue in CPU number order, potentially reordering the write sequence. Hence the sort to fix that. 2) Dispatch may happen in parallel with multiple contexts getting requests from the scheduler (or the CPU context lists) and again break write sequences. I think your patch partly addresses this with the BUSY flag, but only once requests are in the dispatch queue already. However, at dispatch time, requests go directly from the scheduler (or CPU context lists) into the dispatch caller's local list, bypassing the hctx dispatch queue. Write ordering gets broken again here. 3) The SCSI layer locks a device zone when it sees a write, to prevent subsequent writes to the same zone. This is to avoid reordering at the HBA level (AHCI has a 100% chance of doing it). In effect, this is a write QD=1 per zone implementation, and that causes requeue of write requests. In that case, this is a requeue happening after the ->queue_rq() call, so using the dispatch context local list. 
Combine that with (2) and reordering can happen here too. I have a series implementing a "dispatcher", which is basically a set of operations that a device driver can specify to handle the dispatch queue. The operations are basically "insert" and "get", essentially what you have as helpers. The default dispatcher uses operations very similar to your helpers minus the busy bit. That is, the default dispatcher does not change the current behavior. But a ZBC dispatcher can be installed by the scsi layer for a ZBC drive at device scan time and do the sort-insert of write requests to address all the potential reordering. That series goes beyond what you did, and I was waiting for your series to stabilize to rebase (or rework) what I did on top of your changes, which look like they would simplify handling of ZBC drives. Note: ZBC drives are
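A minimal user-space sketch of the sort-insert described above: when a write request joins the dispatch list of a zoned device, link it in ascending start-sector order so a sequential write stream stays sequential even if it arrives out of order. The singly linked list and names here are illustrative only; the kernel helpers operate on `struct request` and `hctx->dispatch` under `hctx->lock`:

```c
#include <assert.h>
#include <stddef.h>

struct rq {
	unsigned long sector;	/* start LBA of the write */
	struct rq *next;
};

/* Walk the list and link rq before the first entry with a higher
 * start sector, keeping the dispatch list sorted ascending. */
static void dispatch_add_sorted(struct rq **head, struct rq *rq)
{
	while (*head && (*head)->sector < rq->sector)
		head = &(*head)->next;
	rq->next = *head;
	*head = rq;
}
```

With this in the insert path, a sequence submitted as 8, 0, 16, 4 drains from the dispatch list as 0, 4, 8, 16, which is what a zone with a sequential-write constraint requires.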
Re: [PATCH V2 14/20] blk-mq-sched: improve IO scheduling on SCSI devcie
On Tue, Aug 22, 2017 at 08:51:32PM +0000, Bart Van Assche wrote: > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote: > > This patch introduces per-request_queue dispatch > > list for this purpose, and only when all requests > > in this list are dispatched out successfully, we > > can restart to dequeue request from sw/scheduler > > queue and dispatch it to lld. > > Wasn't one of the design goals of blk-mq to avoid a per-request_queue list? Yes, but why does the scsi device have a per-request_queue depth? Anyway this single dispatch list won't hurt devices which don't have a per-request_queue depth. For a scsi device which has q->queue_depth, if one hctx is stuck, all hctxs are stuck actually, so the single dispatch list just fits this kind of use case, doesn't it? -- Ming
Re: [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list
On Thu, Aug 24, 2017 at 09:59:13AM +0900, Damien Le Moal wrote: > > On 8/23/17 05:43, Bart Van Assche wrote: > > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote: > >> +static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx) > >> +{ > >> + return !list_empty_careful(&hctx->dispatch); > >> +} > >> + > >> +static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx, > >> + struct request *rq) > >> +{ > >> + spin_lock(&hctx->lock); > >> + list_add(&rq->queuelist, &hctx->dispatch); > >> + blk_mq_hctx_set_dispatch_busy(hctx); > >> + spin_unlock(&hctx->lock); > >> +} > >> + > >> +static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx, > >> + struct list_head *list) > >> +{ > >> + spin_lock(&hctx->lock); > >> + list_splice_init(list, &hctx->dispatch); > >> + blk_mq_hctx_set_dispatch_busy(hctx); > >> + spin_unlock(&hctx->lock); > >> +} > >> + > >> +static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx > >> *hctx, > >> + struct list_head *list) > >> +{ > >> + spin_lock(&hctx->lock); > >> + list_splice_tail_init(list, &hctx->dispatch); > >> + blk_mq_hctx_set_dispatch_busy(hctx); > >> + spin_unlock(&hctx->lock); > >> +} > >> + > >> +static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx > >> *hctx, > >> + struct list_head *list) > >> +{ > >> + spin_lock(&hctx->lock); > >> + list_splice_init(&hctx->dispatch, list); > >> + spin_unlock(&hctx->lock); > >> +} > > > > Same comment for this patch: these helper functions are so short that I'm not > > sure it is useful to introduce these helper functions. > > > > Bart. > > Personally, I like those very much as they give a place to hook up > different dispatch_list handling without having to change blk-mq.c and > blk-mq-sched.c all over the place. > > I am thinking of SMR (zoned block device) support here since we need to > sort-insert write requests in blk_mq_add_rq_to_dispatch() and > blk_mq_add_list_to_dispatch_tail(). For this last one, the name would Could you explain a bit why you sort-insert write rq in these two helpers? 
which are only triggered in case of dispatch busy. > become a little awkward though. This sort-insert would be to avoid > breaking a sequential write request sequence sent by the disk user. This > is needed since this reordering breakage cannot be solved only from the > SCSI layer. Basically this patchset tries to flush hctx->dispatch first, then fetch requests from the scheduler queue after hctx->dispatch is flushed. But it isn't strictly done this way because the implementation is lockless. So it looks like a request from hctx->dispatch won't break incoming requests from the scheduler. -- Ming
[PATCH 0/4] loop: support different logical block sizes
From: Omar Sandoval

Hi, everyone, Here's another attempt at supporting different logical block sizes for loop devices. First, some terminology: * Logical block size: the lowest possible block size that the storage device can address. * Physical block size: the lowest possible sector size that the hardware can operate on without reverting to read-modify-write operations. Always greater than or equal to the logical block size. These are characteristics of the device itself tracked in the request_queue. For loop devices, both are currently always 512 bytes. struct block_device also has another block size, basically a "soft" block size which is the preferred I/O size and can be changed from userspace. For loop devices, this is PAGE_SIZE if the backing file is a regular file or otherwise the soft block size of the backing block device. Patch 1 simplifies how the soft block size is set. I'm tempted to not set it at all and use the default of PAGE_SIZE, but maybe someone is depending on inheriting the soft block size from the backing block device... Patch 2 sets the physical block size of loop devices to a more meaningful value for loop, PAGE_SIZE. Patch 3 allows us to change the logical block size. It adds a new ioctl instead of the previous approach (which extends LOOP_{GET,SET}_STATUS). Hannes, I assume you had a specific reason for doing it the way you did, care to explain? Patch 4 is a cleanup. This is based on my patch reverting the original block size support, targeting 4.14. Thanks! Omar Sandoval (4): loop: get rid of lo_blocksize loop: set physical block size to PAGE_SIZE loop: add ioctl for changing logical block size loop: fold loop_switch() into callers drivers/block/loop.c | 112 -- drivers/block/loop.h | 1 - include/uapi/linux/loop.h | 1 + 3 files changed, 40 insertions(+), 74 deletions(-) -- 2.14.1
[PATCH 1/4] loop: get rid of lo_blocksize
From: Omar Sandoval

This is only used for setting the soft block size on the struct block_device once and then never used again. Signed-off-by: Omar Sandoval --- drivers/block/loop.c | 10 ++ drivers/block/loop.h | 1 - 2 files changed, 2 insertions(+), 9 deletions(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index f321b96405f5..54e091887199 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -573,8 +573,6 @@ static void do_loop_switch(struct loop_device *lo, struct switch_request *p) mapping = file->f_mapping; mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask); lo->lo_backing_file = file; - lo->lo_blocksize = S_ISBLK(mapping->host->i_mode) ? - mapping->host->i_bdev->bd_block_size : PAGE_SIZE; lo->old_gfp_mask = mapping_gfp_mask(mapping); mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS)); loop_update_dio(lo); @@ -867,7 +865,6 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode, struct file *file, *f; struct inode *inode; struct address_space *mapping; - unsigned lo_blocksize; int lo_flags = 0; int error; loff_t size; @@ -911,9 +908,6 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode, !file->f_op->write_iter) lo_flags |= LO_FLAGS_READ_ONLY; - lo_blocksize = S_ISBLK(inode->i_mode) ? - inode->i_bdev->bd_block_size : PAGE_SIZE; - error = -EFBIG; size = get_loop_size(lo, file); if ((loff_t)(sector_t)size != size) @@ -927,7 +921,6 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode, set_device_ro(bdev, (lo_flags & LO_FLAGS_READ_ONLY) != 0); lo->use_dio = false; - lo->lo_blocksize = lo_blocksize; lo->lo_device = bdev; lo->lo_flags = lo_flags; lo->lo_backing_file = file; @@ -947,7 +940,8 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode, /* let user-space know about the new size */ kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE); - set_blocksize(bdev, lo_blocksize); + set_blocksize(bdev, S_ISBLK(inode->i_mode) ? 
+ block_size(inode->i_bdev) : PAGE_SIZE); lo->lo_state = Lo_bound; if (part_shift) diff --git a/drivers/block/loop.h b/drivers/block/loop.h index fecd3f97ef8c..efe57189d01e 100644 --- a/drivers/block/loop.h +++ b/drivers/block/loop.h @@ -48,7 +48,6 @@ struct loop_device { struct file * lo_backing_file; struct block_device *lo_device; - unsigned lo_blocksize; void *key_data; gfp_t old_gfp_mask; -- 2.14.1
[PATCH 2/4] loop: set physical block size to PAGE_SIZE
From: Omar Sandoval

The physical block size is "the lowest possible sector size that the hardware can operate on without reverting to read-modify-write operations" (from the comment on blk_queue_physical_block_size()). Since loop does buffered I/O on the backing file by default, the RMW unit is a page. This isn't the case for direct I/O mode, but let's keep it simple. Signed-off-by: Omar Sandoval --- drivers/block/loop.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 54e091887199..1a5b4ecf54ec 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -1764,6 +1764,8 @@ static int loop_add(struct loop_device **l, int i) } lo->lo_queue->queuedata = lo; + blk_queue_physical_block_size(lo->lo_queue, PAGE_SIZE); + /* * It doesn't make sense to enable merge because the I/O * submitted to backing file is handled page by page. -- 2.14.1
[PATCH 4/4] loop: fold loop_switch() into callers
From: Omar Sandoval

The comments here are really outdated, and blk-mq made flushing much simpler, so just fold the two cases into the callers. Signed-off-by: Omar Sandoval --- drivers/block/loop.c | 76 1 file changed, 11 insertions(+), 65 deletions(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 6b1d932028e3..e60b4ac77577 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -546,72 +546,12 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq) } } -struct switch_request { - struct file *file; - struct completion wait; -}; - static inline void loop_update_dio(struct loop_device *lo) { __loop_update_dio(lo, io_is_direct(lo->lo_backing_file) | lo->use_dio); } -/* - * Do the actual switch; called from the BIO completion routine - */ -static void do_loop_switch(struct loop_device *lo, struct switch_request *p) -{ - struct file *file = p->file; - struct file *old_file = lo->lo_backing_file; - struct address_space *mapping; - - /* if no new file, only flush of queued bios requested */ - if (!file) - return; - - mapping = file->f_mapping; - mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask); - lo->lo_backing_file = file; - lo->old_gfp_mask = mapping_gfp_mask(mapping); - mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS)); - loop_update_dio(lo); -} - -/* - * loop_switch performs the hard work of switching a backing store. - * First it needs to flush existing IO, it does this by sending a magic - * BIO down the pipe. The completion of this BIO does the actual switch. 
- */ -static int loop_switch(struct loop_device *lo, struct file *file) -{ - struct switch_request w; - - w.file = file; - - /* freeze queue and wait for completion of scheduled requests */ - blk_mq_freeze_queue(lo->lo_queue); - - /* do the switch action */ - do_loop_switch(lo, &w); - - /* unfreeze */ - blk_mq_unfreeze_queue(lo->lo_queue); - - return 0; -} - -/* - * Helper to flush the IOs in loop, but keeping loop thread running - */ -static int loop_flush(struct loop_device *lo) -{ - /* loop not yet configured, no running thread, nothing to flush */ - if (lo->lo_state != Lo_bound) - return 0; - return loop_switch(lo, NULL); -} - static void loop_reread_partitions(struct loop_device *lo, struct block_device *bdev) { @@ -676,9 +616,14 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev, goto out_putf; /* and ... switch */ - error = loop_switch(lo, file); - if (error) - goto out_putf; + blk_mq_freeze_queue(lo->lo_queue); + mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask); + lo->lo_backing_file = file; + lo->old_gfp_mask = mapping_gfp_mask(file->f_mapping); + mapping_set_gfp_mask(file->f_mapping, + lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS)); + loop_update_dio(lo); + blk_mq_unfreeze_queue(lo->lo_queue); fput(old_file); if (lo->lo_flags & LO_FLAGS_PARTSCAN) @@ -1601,12 +1546,13 @@ static void lo_release(struct gendisk *disk, fmode_t mode) err = loop_clr_fd(lo); if (!err) return; - } else { + } else if (lo->lo_state == Lo_bound) { /* * Otherwise keep thread (if running) and config, * but flush possible ongoing bios in thread. */ - loop_flush(lo); + blk_mq_freeze_queue(lo->lo_queue); + blk_mq_unfreeze_queue(lo->lo_queue); } mutex_unlock(&lo->lo_ctl_mutex); -- 2.14.1
[PATCH 3/4] loop: add ioctl for changing logical block size
From: Omar Sandoval

This is a different approach from the first attempt in f2c6df7dbf9a ("loop: support 4k physical blocksize"). Rather than extending LOOP_{GET,SET}_STATUS, add a separate ioctl just for setting the block size. Signed-off-by: Omar Sandoval --- drivers/block/loop.c | 24 include/uapi/linux/loop.h | 1 + 2 files changed, 25 insertions(+) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 1a5b4ecf54ec..6b1d932028e3 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -1047,6 +1047,7 @@ static int loop_clr_fd(struct loop_device *lo) memset(lo->lo_encrypt_key, 0, LO_KEY_SIZE); memset(lo->lo_crypt_name, 0, LO_NAME_SIZE); memset(lo->lo_file_name, 0, LO_NAME_SIZE); + blk_queue_logical_block_size(lo->lo_queue, 512); if (bdev) { bdput(bdev); invalidate_bdev(bdev); @@ -1330,6 +1331,24 @@ static int loop_set_dio(struct loop_device *lo, unsigned long arg) return error; } +static int loop_set_block_size(struct loop_device *lo, unsigned long arg) +{ + if (lo->lo_state != Lo_bound) + return -ENXIO; + + if (arg < 512 || arg > PAGE_SIZE || !is_power_of_2(arg)) + return -EINVAL; + + blk_mq_freeze_queue(lo->lo_queue); + + blk_queue_logical_block_size(lo->lo_queue, arg); + loop_update_dio(lo); + + blk_mq_unfreeze_queue(lo->lo_queue); + + return 0; +} + static int lo_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd, unsigned long arg) { @@ -1378,6 +1397,11 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode, if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) err = loop_set_dio(lo, arg); break; + case LOOP_SET_BLOCK_SIZE: + err = -EPERM; + if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) + err = loop_set_block_size(lo, arg); + break; default: err = lo->ioctl ? 
lo->ioctl(lo, cmd, arg) : -EINVAL; } diff --git a/include/uapi/linux/loop.h b/include/uapi/linux/loop.h index c8125ec1f4f2..23158dbe2424 100644 --- a/include/uapi/linux/loop.h +++ b/include/uapi/linux/loop.h @@ -88,6 +88,7 @@ struct loop_info64 { #define LOOP_CHANGE_FD 0x4C06 #define LOOP_SET_CAPACITY 0x4C07 #define LOOP_SET_DIRECT_IO 0x4C08 +#define LOOP_SET_BLOCK_SIZE 0x4C09 /* /dev/loop-control interface */ #define LOOP_CTL_ADD 0x4C80 -- 2.14.1
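The argument check in loop_set_block_size() can be exercised stand-alone. The sketch below mirrors the patch's validation (a power of two between 512 and PAGE_SIZE); PAGE_SIZE is assumed to be 4096 here, and the local -EINVAL definition stands in for the kernel's errno header:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE 4096UL
#define EINVAL 22

/* Same predicate the kernel's is_power_of_2() implements. */
static bool is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Mirrors the check in loop_set_block_size(): reject sizes below
 * 512, above PAGE_SIZE, or not a power of two. */
static int validate_loop_block_size(unsigned long arg)
{
	if (arg < 512 || arg > PAGE_SIZE || !is_power_of_2(arg))
		return -EINVAL;
	return 0;
}
```

So with 4 KiB pages the only accepted values are 512, 1024, 2048, and 4096, which matches the logical block sizes a loop device can honor without read-modify-write below the page level.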
Re: [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list
On Tue, Aug 22, 2017 at 08:43:12PM +0000, Bart Van Assche wrote: > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote: > > +static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx) > > +{ > > + return !list_empty_careful(&hctx->dispatch); > > +} > > + > > +static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx, > > + struct request *rq) > > +{ > > + spin_lock(&hctx->lock); > > + list_add(&rq->queuelist, &hctx->dispatch); > > + blk_mq_hctx_set_dispatch_busy(hctx); > > + spin_unlock(&hctx->lock); > > +} > > + > > +static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx, > > + struct list_head *list) > > +{ > > + spin_lock(&hctx->lock); > > + list_splice_init(list, &hctx->dispatch); > > + blk_mq_hctx_set_dispatch_busy(hctx); > > + spin_unlock(&hctx->lock); > > +} > > + > > +static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx > > *hctx, > > + struct list_head *list) > > +{ > > + spin_lock(&hctx->lock); > > + list_splice_tail_init(list, &hctx->dispatch); > > + blk_mq_hctx_set_dispatch_busy(hctx); > > + spin_unlock(&hctx->lock); > > +} > > + > > +static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx > > *hctx, > > + struct list_head *list) > > +{ > > + spin_lock(&hctx->lock); > > + list_splice_init(&hctx->dispatch, list); > > + spin_unlock(&hctx->lock); > > +} > > Same comment for this patch: these helper functions are so short that I'm not > sure it is useful to introduce these helper functions. They will become longer in the following patches. -- Ming
Re: [PATCH V2 10/20] blk-mq-sched: introduce helpers for query, change busy state
On Wed, Aug 23, 2017 at 02:02:20PM -0600, Jens Axboe wrote: > On Tue, Aug 22 2017, Bart Van Assche wrote: > > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote: > > > +static inline bool blk_mq_hctx_is_dispatch_busy(struct blk_mq_hw_ctx *hctx) > > > +{ > > > + return test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state); > > > +} > > > + > > > +static inline void blk_mq_hctx_set_dispatch_busy(struct blk_mq_hw_ctx *hctx) > > > +{ > > > + set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state); > > > +} > > > + > > > +static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx) > > > +{ > > > + clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state); > > > +} > > > > Hello Ming, > > > > Are these helper functions modified in a later patch? If not, please drop > > this patch. One line helper functions are not useful and do not improve > > readability of source code. > > Agree, they just obfuscate the code. Only reason to do this is to do > things like: If you look at the following patch, these introduced functions are modified a lot. > > static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx) > { > if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state)) > clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state); > } > > to avoid unnecessary RMW (and locked instruction). At least then you can > add a single comment as to why it's done that way. Apart from that, I > prefer to open-code it so people don't have to grep to figure out wtf > blk_mq_hctx_clear_dispatch_busy() does. Ok, will do it that way. -- Ming
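Jens's test-before-clear pattern can be sketched in user space with C11 atomics standing in for the kernel's bit operations. The point is that the common "bit already clear" case is served by a plain load and never pays for a locked read-modify-write; this is an illustration, not the kernel code:

```c
#include <assert.h>
#include <stdatomic.h>

#define DISPATCH_BUSY (1u << 0)

static atomic_uint hctx_state;

static void clear_dispatch_busy(void)
{
	/* cheap relaxed read first; only issue the atomic RMW
	 * when the bit is actually set */
	if (atomic_load_explicit(&hctx_state, memory_order_relaxed) & DISPATCH_BUSY)
		atomic_fetch_and(&hctx_state, ~DISPATCH_BUSY);
}
```

On x86 the unconditional variant always emits a `lock`-prefixed instruction that bounces the cache line; the guarded version keeps the line shared when the bit is already clear, which is why it is worth a comment when open-coded.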
Re: [PATCH V2 10/20] blk-mq-sched: introduce helpers for query, change busy state
On Tue, Aug 22, 2017 at 08:41:14PM +0000, Bart Van Assche wrote: > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote: > > +static inline bool blk_mq_hctx_is_dispatch_busy(struct blk_mq_hw_ctx *hctx) > > +{ > > + return test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state); > > +} > > + > > +static inline void blk_mq_hctx_set_dispatch_busy(struct blk_mq_hw_ctx *hctx) > > +{ > > + set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state); > > +} > > + > > +static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx) > > +{ > > + clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state); > > +} > > Hello Ming, > > Are these helper functions modified in a later patch? If not, please drop > this patch. One line helper functions are not useful and do not improve > readability of source code. They are definitely modified in a later patch, :-) -- Ming
Re: [PATCH V2 09/20] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
On Tue, Aug 22, 2017 at 09:55:57PM +0000, Bart Van Assche wrote: > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote: > > + /* > > + * if there is q->queue_depth, all hw queues share > > + * this queue depth limit > > + */ > > + if (q->queue_depth) { > > + queue_for_each_hw_ctx(q, hctx, i) > > + hctx->flags |= BLK_MQ_F_SHARED_DEPTH; > > + } > > + > > + if (!q->elevator) > > + goto exit; > > Hello Ming, > > It seems very fragile to me to set BLK_MQ_F_SHARED_DEPTH if and only if > q->queue_depth != 0. Wouldn't it be better to let the block driver tell the > block layer whether or not there is a queue depth limit across hardware > queues, e.g. through a tag set flag? One reason for not doing it that way is that q->queue_depth can be changed via the sysfs interface. Another reason is that it is better not to expose this flag to drivers since it isn't necessary; that said, it is an internal flag actually. -- Ming
Re: [PATCH V2 08/20] blk-mq-sched: use q->queue_depth as hint for q->nr_requests
On Tue, Aug 22, 2017 at 08:20:20PM +0000, Bart Van Assche wrote: > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote: > > SCSI sets q->queue_depth from shost->cmd_per_lun, and > > q->queue_depth is per request_queue and more related to > > scheduler queue compared with hw queue depth, which can be > > shared by queues, such as TAG_SHARED. > > > > This patch trys to use q->queue_depth as hint for computing > > tries? Will fix it in V3. > > q->nr_requests, which should be more effective than > > current way. > > Reviewed-by: Bart Van Assche

Thanks! -- Ming
Re: [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
On Wed, Aug 23, 2017 at 01:56:50PM -0600, Jens Axboe wrote: > On Sat, Aug 05 2017, Ming Lei wrote: > > During dispatching, we moved all requests from hctx->dispatch to > > one temporary list, then dispatch them one by one from this list. > > Unfortunately during this period, run queue from other contexts > > may think the queue is idle, then start to dequeue from sw/scheduler > > queue and still try to dispatch because ->dispatch is empty. This way > > hurts sequential I/O performance because requests are dequeued when > > lld queue is busy. > > > > This patch introduces the state of BLK_MQ_S_DISPATCH_BUSY to > > make sure that request isn't dequeued until ->dispatch is > > flushed. > > I don't like how this patch introduces a bunch of locked setting of a > flag under the hctx lock. Especially since I think we can easily avoid > it. Actually the lock isn't needed for setting the flag, will move it out in V3. > > > - } else if (!has_sched_dispatch && !q->queue_depth) { > > + blk_mq_dispatch_rq_list(q, &rq_list); > > + > > + /* > > + * We may clear DISPATCH_BUSY just after it > > + * is set from another context, the only cost > > + * is that one request is dequeued a bit early, > > + * we can survive that. Given the window is > > + * too small, no need to worry about performance > > + * effect. > > + */ > > + if (list_empty_careful(&hctx->dispatch)) > > + clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state); > > This is basically the only place where we modify it without holding the > hctx lock. Can we move it into blk_mq_dispatch_rq_list()? The list is The problem is that blk_mq_dispatch_rq_list() doesn't know if it is handling requests from hctx->dispatch or the sw/scheduler queue. We only need to clear the bit after hctx->dispatch is flushed. So the clearing can't be moved into blk_mq_dispatch_rq_list(). > generally empty, unless for the case where we splice residuals back. If > we splice them back, we grab the lock anyway. 
> > The other places it's set under the hctx lock, yet we end up using an > atomic operation to do it. As the comment explained, it needn't be atomic. -- Ming
Re: [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
On Tue, Aug 22, 2017 at 08:09:32PM +0000, Bart Van Assche wrote: > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote: > > + /* > > + * Wherever DISPATCH_BUSY is set, blk_mq_run_hw_queue() > > + * will be run to try to make progress, so it is always > > + * safe to check the state here. > > + */ > > + if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state)) > > + return; > > The comment above test_bit() is useful but does not explain the purpose of > the early return. Is the purpose of the early return perhaps to serialize > blk_mq_sched_dispatch_requests() calls? If so, please mention this. If the bit is set, that means there are requests in hctx->dispatch, so return early to avoid dequeuing requests from the sw/scheduler queue unnecessarily. I thought the code was self-explanatory, so I didn't explain it; will add the comment. -- Ming
Re: [PATCH] Revert "loop: support 4k physical blocksize"
On 08/23/2017 11:59 PM, Jens Axboe wrote: > On 08/23/2017 03:54 PM, Omar Sandoval wrote: >> From: Omar Sandoval >> >> There's some stuff still up in the air, let's not get stuck with a >> subpar ABI. I'll follow up with something better for 4.14. > > I think that's a good idea, there was already too much late churn > fixing up the 4k support. Dropped the fixups, added this. I'll never get it upstream ... Cheers, Hannes -- Dr. Hannes Reinecke, Teamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
Re: [PATCH V2 05/20] blk-mq-sched: improve dispatching from sw queue
On Tue, Aug 22, 2017 at 08:57:03PM +0000, Bart Van Assche wrote: > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote: > > +static inline void blk_mq_do_dispatch_ctx(struct request_queue *q, > > + struct blk_mq_hw_ctx *hctx) > > +{ > > + LIST_HEAD(rq_list); > > + struct blk_mq_ctx *ctx = NULL; > > + > > + do { > > + struct request *rq; > > + > > + rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx); > > + if (!rq) > > + break; > > + list_add(&rq->queuelist, &rq_list); > > + > > + /* round robin for fair dispatch */ > > + ctx = blk_mq_next_ctx(hctx, rq->mq_ctx); > > + } while (blk_mq_dispatch_rq_list(q, &rq_list)); > > +} > > An additional question about this patch: shouldn't request dequeuing start > at the software queue next to the last one from which a request got dequeued > instead of always starting at the first software queue (struct blk_mq_ctx > *ctx = NULL) to be truly round robin? Looks like a good idea, will introduce a ctx hint in hctx for starting when blk_mq_dispatch_rq_list() returns zero. No lock is needed since it is just a hint. -- Ming
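The round-robin hint agreed on above can be modeled with a plain array standing in for the per-CPU software queues: dispatch resumes at the queue after the last one served, so no single ctx starves the others. All names here are illustrative; in blk-mq the hint would live in struct blk_mq_hw_ctx and, as noted, needs no locking because it is only a hint:

```c
#include <assert.h>

#define NR_CTX 4

struct hctx_model {
	int pending[NR_CTX];	/* requests queued per software queue */
	int next_ctx;		/* dispatch hint; a stale value is harmless */
};

/* Return the index of the next non-empty ctx in round-robin order
 * starting at the hint, take one request from it, and advance the
 * hint past it; return -1 if all software queues are empty. */
static int dispatch_next_ctx(struct hctx_model *h)
{
	for (int i = 0; i < NR_CTX; i++) {
		int idx = (h->next_ctx + i) % NR_CTX;

		if (h->pending[idx] > 0) {
			h->pending[idx]--;
			h->next_ctx = (idx + 1) % NR_CTX;
			return idx;
		}
	}
	return -1;
}
```

Starting every scan at index 0 instead (the `ctx = NULL` case Bart points out) would let a busy low-numbered ctx monopolize dispatch; resuming from the saved hint visits each non-empty queue in turn.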