Re: [PATCH] block: fix direct dispatch issue failure for clones
On 12/7/18 9:34 AM, Jens Axboe wrote:
> On 12/6/18 6:22 PM, jianchao.wang wrote:
>> On 12/7/18 9:13 AM, Jens Axboe wrote:
>>> On 12/6/18 6:04 PM, jianchao.wang wrote:
>>>> On 12/7/18 6:20 AM, Jens Axboe wrote:
>>>>> After the direct dispatch corruption fix, we permanently disallow direct
>>>>> dispatch of non read/write requests. This works fine off the normal IO
>>>>> path, as they will be retried like any other failed direct dispatch
>>>>> request. But for the blk_insert_cloned_request() that only DM uses to
>>>>> bypass the bottom level scheduler, we always first attempt direct
>>>>> dispatch. For some types of requests, that's now a permanent failure,
>>>>> and no amount of retrying will make that succeed.
>>>>>
>>>>> Don't use direct dispatch off the cloned insert path, always just use
>>>>> bypass inserts. This still bypasses the bottom level scheduler, which is
>>>>> what DM wants.
>>>>>
>>>>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>>>>> Signed-off-by: Jens Axboe
>>>>>
>>>>> ---
>>>>>
>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>> index deb56932f8c4..4c44e6fa0d08 100644
>>>>> --- a/block/blk-core.c
>>>>> +++ b/block/blk-core.c
>>>>> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>>>>>  	 * bypass a potential scheduler on the bottom device for
>>>>>  	 * insert.
>>>>>  	 */
>>>>> -	return blk_mq_request_issue_directly(rq);
>>>>> +	blk_mq_request_bypass_insert(rq, true);
>>>>> +	return BLK_STS_OK;
>>>>>  }
>>>>>
>>>>>  	spin_lock_irqsave(q->queue_lock, flags);
>>>>>
>>>> Not sure about this because it will break the merging promotion for
>>>> request based DM from Ming.
>>>> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
>>>> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request
>>>> feedback)
>>>>
>>>> We could use some other way to fix this.
>>>
>>> That really shouldn't matter as this is the cloned insert, merging should
>>> have been done on the original request.
>>>
>> Just quote some comments from the patch.
>>
>> "
>> But dm-rq currently can't get the underlying queue's dispatch feedback at
>> all. Without knowing whether a request was issued or not (e.g. due to
>> underlying queue being busy) the dm-rq elevator will not be able to provide
>> effective IO merging (as a side-effect of dm-rq currently blindly destaging
>> a request from its elevator only to requeue it after a delay, which kills
>> any opportunity for merging). This obviously causes very bad sequential IO
>> performance.
>> ...
>> With this, request-based DM's blk-mq sequential IO performance is vastly
>> improved (as much as 3X in mpath/virtio-scsi testing)
>> "
>>
>> Using blk_mq_request_bypass_insert to replace blk_mq_request_issue_directly
>> could be a fast method to fix the current issue. Maybe we could get the
>> merging promotion back after some time.
>
> This really sucks, mostly because DM wants to have it both ways - not use
> the bottom level IO scheduler, but still actually use it if it makes sense.
>
> There is another way to fix this - still do the direct dispatch, but have
> dm track if it failed and do bypass insert in that case. I didn't want to
> do that since it's more involved, but it's doable.
>
> Let me cook that up and test it... Don't like it, though.

Actually, I have tried to fix this issue in the 1st patch of my patchset
"blk-mq: refactor code of issue directly". Just insert the non-read-write
command into the dispatch list directly and return BLK_STS_OK.

Thanks
Jianchao
Re: [PATCH] block: fix direct dispatch issue failure for clones
On 12/7/18 9:13 AM, Jens Axboe wrote:
> On 12/6/18 6:04 PM, jianchao.wang wrote:
>> On 12/7/18 6:20 AM, Jens Axboe wrote:
>>> After the direct dispatch corruption fix, we permanently disallow direct
>>> dispatch of non read/write requests. This works fine off the normal IO
>>> path, as they will be retried like any other failed direct dispatch
>>> request. But for the blk_insert_cloned_request() that only DM uses to
>>> bypass the bottom level scheduler, we always first attempt direct
>>> dispatch. For some types of requests, that's now a permanent failure,
>>> and no amount of retrying will make that succeed.
>>>
>>> Don't use direct dispatch off the cloned insert path, always just use
>>> bypass inserts. This still bypasses the bottom level scheduler, which is
>>> what DM wants.
>>>
>>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>>> Signed-off-by: Jens Axboe
>>>
>>> ---
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index deb56932f8c4..4c44e6fa0d08 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>>>  	 * bypass a potential scheduler on the bottom device for
>>>  	 * insert.
>>>  	 */
>>> -	return blk_mq_request_issue_directly(rq);
>>> +	blk_mq_request_bypass_insert(rq, true);
>>> +	return BLK_STS_OK;
>>>  }
>>>
>>>  	spin_lock_irqsave(q->queue_lock, flags);
>>>
>> Not sure about this because it will break the merging promotion for
>> request based DM from Ming.
>> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
>> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request
>> feedback)
>>
>> We could use some other way to fix this.
>
> That really shouldn't matter as this is the cloned insert, merging should
> have been done on the original request.
>

Just quote some comments from the patch.

"
But dm-rq currently can't get the underlying queue's dispatch feedback at
all. Without knowing whether a request was issued or not (e.g. due to
underlying queue being busy) the dm-rq elevator will not be able to provide
effective IO merging (as a side-effect of dm-rq currently blindly destaging
a request from its elevator only to requeue it after a delay, which kills
any opportunity for merging). This obviously causes very bad sequential IO
performance.
...
With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing)
"

Using blk_mq_request_bypass_insert to replace blk_mq_request_issue_directly
could be a fast method to fix the current issue. Maybe we could get the
merging promotion back after some time.

Thanks
Jianchao
Re: [PATCH] block: fix direct dispatch issue failure for clones
On 12/7/18 6:20 AM, Jens Axboe wrote:
> After the direct dispatch corruption fix, we permanently disallow direct
> dispatch of non read/write requests. This works fine off the normal IO
> path, as they will be retried like any other failed direct dispatch
> request. But for the blk_insert_cloned_request() that only DM uses to
> bypass the bottom level scheduler, we always first attempt direct
> dispatch. For some types of requests, that's now a permanent failure,
> and no amount of retrying will make that succeed.
>
> Don't use direct dispatch off the cloned insert path, always just use
> bypass inserts. This still bypasses the bottom level scheduler, which is
> what DM wants.
>
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Signed-off-by: Jens Axboe
>
> ---
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index deb56932f8c4..4c44e6fa0d08 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>  	 * bypass a potential scheduler on the bottom device for
>  	 * insert.
>  	 */
> -	return blk_mq_request_issue_directly(rq);
> +	blk_mq_request_bypass_insert(rq, true);
> +	return BLK_STS_OK;
>  }
>
>  	spin_lock_irqsave(q->queue_lock, flags);
>
Not sure about this because it will break the merging promotion for
request based DM from Ming.
396eaf21ee17c476e8f66249fb1f4a39003d0ab4
(blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request
feedback)

We could use some other way to fix this.

Thanks
Jianchao
Re: [PATCH 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance
On 11/16/18 11:28 AM, Ming Lei wrote:

...

> +struct blk_mq_kobj {
> +	struct kobject kobj;
> +};
> +
>  static void blk_mq_sysfs_release(struct kobject *kobj)
>  {
> +	struct blk_mq_kobj *mq_kobj = container_of(kobj, struct blk_mq_kobj,
> +						   kobj);
> +	kfree(mq_kobj);
> +}
> +

...

> -void blk_mq_sysfs_init(struct request_queue *q)
> +int blk_mq_sysfs_init(struct request_queue *q)
>  {
>  	struct blk_mq_ctx *ctx;
>  	int cpu;
> +	struct blk_mq_kobj *mq_kobj;
> +
> +	mq_kobj = kzalloc(sizeof(struct blk_mq_kobj), GFP_KERNEL);
> +	if (!mq_kobj)
> +		return -ENOMEM;
>
> -	kobject_init(&q->mq_kobj, &blk_mq_ktype);
> +	kobject_init(&mq_kobj->kobj, &blk_mq_ktype);
>
>  	for_each_possible_cpu(cpu) {
> -		ctx = per_cpu_ptr(q->queue_ctx, cpu);
> +		ctx = kzalloc_node(sizeof(*ctx), GFP_KERNEL, cpu_to_node(cpu));
> +		if (!ctx)
> +			goto fail;
> +		*per_cpu_ptr(q->queue_ctx, cpu) = ctx;
>  		kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
>  	}
> +	q->mq_kobj = &mq_kobj->kobj;
> +	return 0;
> +
> + fail:
> +	for_each_possible_cpu(cpu) {
> +		ctx = *per_cpu_ptr(q->queue_ctx, cpu);
> +		if (ctx)
> +			kobject_put(&ctx->kobj);
> +	}
> +	kobject_put(&mq_kobj->kobj);
> +	return -ENOMEM;
> +}

blk_mq_kobj looks meaningless, why do we need it, or do I miss something?

And maybe we should allocate ctx in blk_mq_init_allocated_queue.

Thanks
Jianchao
Re: [PATCH 05/11] block: avoid ordered task state change for polled IO
Hi Jens

On 11/13/18 11:42 PM, Jens Axboe wrote:
> @@ -181,6 +181,7 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>  	struct task_struct *waiter = bio->bi_private;
>
>  	WRITE_ONCE(bio->bi_private, NULL);
> +	smp_wmb();
>  	wake_up_process(waiter);
>  }

The wake up path already contains a full memory barrier, from the
raw_spin_lock and the following smp_mb__after_spinlock. Please refer to:

wake_up_process
  -> try_to_wake_up
       raw_spin_lock_irqsave(&p->pi_lock, flags);
       smp_mb__after_spinlock();

So the smp_wmb here is not necessary.

Thanks
Jianchao
Re: [PATCH 4/6] block: avoid ordered task state change for polled IO
Hi Jens

On 11/13/18 12:26 AM, Jens Axboe wrote:
> On 11/12/18 2:35 AM, jianchao.wang wrote:
>> Hi Jens
>>
>> On 11/10/18 11:13 PM, Jens Axboe wrote:
>>> We only really need the barrier if we're going to be sleeping,
>>> if we're just polling we're fine with the __set_current_state()
>>> variant.
>>>
>>> Signed-off-by: Jens Axboe
>>> ---
>>>  fs/block_dev.c | 25 +++++++++++++++++++++----
>>>  1 file changed, 21 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>>> index c039abfb2052..e7550b1f9670 100644
>>> --- a/fs/block_dev.c
>>> +++ b/fs/block_dev.c
>>> @@ -237,12 +237,23 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>>>
>>>  	qc = submit_bio(&bio);
>>>  	for (;;) {
>>> -		set_current_state(TASK_UNINTERRUPTIBLE);
>>> +		/*
>>> +		 * Using non-atomic task state for polling
>>> +		 */
>>> +		__set_current_state(TASK_UNINTERRUPTIBLE);
>>> +
>>>  		if (!READ_ONCE(bio.bi_private))
>>>  			break;
>>
>> When we don't use polling, blkdev_bio_end_io_simple may come on a
>> different cpu before submit_bio returns. For example,
>>
>> __blkdev_direct_IO_simple
>>   qc = submit_bio(&bio)
>>                                    blkdev_bio_end_io_simple
>>                                      WRITE_ONCE(bio.bi_private, NULL)
>>                                      wake_up_process(waiter)
>>   __set_current_state(TASK_UNINTERRUPTIBLE)
>>   READ_ONCE(bio.bi_private)
>
> While I agree that you are right, that's an existing issue. The store
> barrier implied by set_current_state() doesn't fix that.
>
> How about this variant:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-perf&id=036ba7a8d1334889c0fe55d4858d78f4e8022f12
>
> which then changes the later wake up helper patch to be:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-perf&id=f8c3f188425967adb040cfefb799b0a5a1df769d
>

The wake up path already contains a full memory barrier, from the
raw_spin_lock and the following smp_mb__after_spinlock:

wake_up_process
  -> try_to_wake_up
       raw_spin_lock_irqsave(&p->pi_lock, flags);
       smp_mb__after_spinlock();

So a smp_rmb() could be enough. :)

Thanks
Jianchao
Re: [PATCH 4/6] block: avoid ordered task state change for polled IO
Hi Jens

On 11/10/18 11:13 PM, Jens Axboe wrote:
> We only really need the barrier if we're going to be sleeping,
> if we're just polling we're fine with the __set_current_state()
> variant.
>
> Signed-off-by: Jens Axboe
> ---
>  fs/block_dev.c | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index c039abfb2052..e7550b1f9670 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -237,12 +237,23 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>
>  	qc = submit_bio(&bio);
>  	for (;;) {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> +		/*
> +		 * Using non-atomic task state for polling
> +		 */
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +
>  		if (!READ_ONCE(bio.bi_private))
>  			break;

When we don't use polling, blkdev_bio_end_io_simple may come on a
different cpu before submit_bio returns. For example,

__blkdev_direct_IO_simple
  qc = submit_bio(&bio)
                                   blkdev_bio_end_io_simple
                                     WRITE_ONCE(bio.bi_private, NULL)
                                     wake_up_process(waiter)
  __set_current_state(TASK_UNINTERRUPTIBLE)
  READ_ONCE(bio.bi_private)

At that moment, at least a smp_rmb may be needed before
READ_ONCE(bio.bi_private) to ensure the WRITE_ONCE(bio.bi_private, NULL)
is seen if we miss the wakeup.

> +
>  		if (!(iocb->ki_flags & IOCB_HIPRI) ||
> -		    !blk_poll(bdev_get_queue(bdev), qc))
> +		    !blk_poll(bdev_get_queue(bdev), qc)) {
> +			/*
> +			 * If we're scheduling, ensure that task state
> +			 * change is ordered and seen.
> +			 */
> +			smp_mb();
>  			io_schedule();
> +		}
>  	}
>  	__set_current_state(TASK_RUNNING);
>
> @@ -403,13 +414,19 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  		return -EIOCBQUEUED;
>
>  	for (;;) {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> +		/*
> +		 * See __blkdev_direct_IO_simple() loop
> +		 */
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +

Similar with above.

>  		if (!READ_ONCE(dio->waiter))
>  			break;
>
>  		if (!(iocb->ki_flags & IOCB_HIPRI) ||
> -		    !blk_poll(bdev_get_queue(bdev), qc))
> +		    !blk_poll(bdev_get_queue(bdev), qc)) {
> +			smp_mb();
>  			io_schedule();
> +		}
>  	}
>  	__set_current_state(TASK_RUNNING);

Thanks
Jianchao
Re: [PATCHSET v2 0/2] Add queue_is_busy helper
On 11/9/18 9:53 AM, jianchao.wang wrote:
> Hi Jens
>
> On 11/9/18 12:06 AM, Jens Axboe wrote:
>> DM currently uses atomic inc/dec to maintain a busy count of
>> IO on a given device. For the dm-mq path, we can replace this
>> with a helper that just checks the state of the tags on the device.
>>
>> First patch is a prep patch that allows the iteration helpers
>> to return true/false, like we support internally in sbitmap.
>> For a busy check we don't care about how many requests are
>> busy, just if some are or not. Hence we can stop iterating
>> tags as soon as we find one that is allocated.
>
> If we don't care about how many requests are busy, why not check
> sb->map[idx].word directly? It could be more efficient.

Oh, we need to check rq->q for the tag shared case.

Thanks
Jianchao
Re: [PATCHSET v2 0/2] Add queue_is_busy helper
Hi Jens

On 11/9/18 12:06 AM, Jens Axboe wrote:
> DM currently uses atomic inc/dec to maintain a busy count of
> IO on a given device. For the dm-mq path, we can replace this
> with a helper that just checks the state of the tags on the device.
>
> First patch is a prep patch that allows the iteration helpers
> to return true/false, like we support internally in sbitmap.
> For a busy check we don't care about how many requests are
> busy, just if some are or not. Hence we can stop iterating
> tags as soon as we find one that is allocated.

If we don't care about how many requests are busy, why not check
sb->map[idx].word directly? It could be more efficient.

Thanks
Jianchao
Re: [PATCH 01/14] blk-mq: kill q->mq_map
On 10/29/18 11:00 AM, Jens Axboe wrote:
> On 10/28/18 8:40 PM, jianchao.wang wrote:
>> On 10/29/18 10:02 AM, jianchao.wang wrote:
>>> Hi Jens
>>>
>>> On 10/28/18 12:52 AM, Jens Axboe wrote:
>>>> On 10/27/18 10:48 AM, Jens Axboe wrote:
>>>>> On 10/27/18 8:19 AM, jianchao.wang wrote:
>>>>>> Hi Jens
>>>>>>
>>>>>> On 10/26/18 5:16 AM, Jens Axboe wrote:
>>>>>>> It's just a pointer to set->mq_map, use that instead.
>>>>>> Instead of using the set->mq_map and then a two-dimensional set->mq_map,
>>>>>> how about migrating the mq_map from per-set to per-cpu ctx?
>>>>>> Something like:
>>>>>>     q->queue_hw_ctx[ctx->map[type]]
>>>>>
>>>>> I think the current series is pretty clean in that regard, it goes
>>>>> from members -> map -> map array. I'd be willing to look at a
>>>>> conversion on top of that, if it makes things cleaner.
>>>>
>>>> On top of that, this:
>>>>
>>>> q->queue_hw_ctx[set->map[type].mq_map[cpu]]
>>>>
>>>> is one less pointer dereference, so more efficient.
>>>>
>>> Saving the blk_mq_hw_ctx into blk_mq_ctx directly may be more direct
>>> and efficient. Something like,
>>>
>>>    ctx->map_hctxs[type]
>>>
>>> All of the memory accessing is local.
>>>
>>> What we need to pay for this is do the mapping for every request_queue.
>>
>> Actually not, this could be done in blk_mq_map_swqueue.
>
> I'm not interested in pursuing this for this series. If you want
> to work on it on top and have it tested and evaluated separately,
> then that's obviously fine. But there's no point in mixing it
> up with these changes, as it's an orthogonal change.

Yes, got it. Thanks for your kind response.

Sincerely
Jianchao
Re: [PATCH 01/14] blk-mq: kill q->mq_map
On 10/29/18 10:02 AM, jianchao.wang wrote:
> Hi Jens
>
> On 10/28/18 12:52 AM, Jens Axboe wrote:
>> On 10/27/18 10:48 AM, Jens Axboe wrote:
>>> On 10/27/18 8:19 AM, jianchao.wang wrote:
>>>> Hi Jens
>>>>
>>>> On 10/26/18 5:16 AM, Jens Axboe wrote:
>>>>> It's just a pointer to set->mq_map, use that instead.
>>>> Instead of using the set->mq_map and then a two-dimensional set->mq_map,
>>>> how about migrating the mq_map from per-set to per-cpu ctx?
>>>> Something like:
>>>>     q->queue_hw_ctx[ctx->map[type]]
>>>
>>> I think the current series is pretty clean in that regard, it goes
>>> from members -> map -> map array. I'd be willing to look at a
>>> conversion on top of that, if it makes things cleaner.
>>
>> On top of that, this:
>>
>> q->queue_hw_ctx[set->map[type].mq_map[cpu]]
>>
>> is one less pointer dereference, so more efficient.
>>
> Saving the blk_mq_hw_ctx into blk_mq_ctx directly may be more direct
> and efficient. Something like,
>
>    ctx->map_hctxs[type]
>
> All of the memory accessing is local.
>
> What we need to pay for this is do the mapping for every request_queue.

Actually not, this could be done in blk_mq_map_swqueue.

Thanks
Jianchao
Re: [PATCH 01/14] blk-mq: kill q->mq_map
Hi Jens

On 10/28/18 12:52 AM, Jens Axboe wrote:
> On 10/27/18 10:48 AM, Jens Axboe wrote:
>> On 10/27/18 8:19 AM, jianchao.wang wrote:
>>> Hi Jens
>>>
>>> On 10/26/18 5:16 AM, Jens Axboe wrote:
>>>> It's just a pointer to set->mq_map, use that instead.
>>> Instead of using the set->mq_map and then a two-dimensional set->mq_map,
>>> how about migrating the mq_map from per-set to per-cpu ctx?
>>> Something like:
>>>     q->queue_hw_ctx[ctx->map[type]]
>>
>> I think the current series is pretty clean in that regard, it goes
>> from members -> map -> map array. I'd be willing to look at a
>> conversion on top of that, if it makes things cleaner.
>
> On top of that, this:
>
> q->queue_hw_ctx[set->map[type].mq_map[cpu]]
>
> is one less pointer dereference, so more efficient.
>

Saving the blk_mq_hw_ctx into blk_mq_ctx directly may be more direct and
efficient. Something like,

   ctx->map_hctxs[type]

All of the memory accessing is local.

What we need to pay for this is do the mapping for every request_queue.

Thanks
Jianchao
Re: [PATCH 01/14] blk-mq: kill q->mq_map
Hi Jens

On 10/26/18 5:16 AM, Jens Axboe wrote:
> It's just a pointer to set->mq_map, use that instead.

Instead of using the set->mq_map and then a two-dimensional set->mq_map,
how about migrating the mq_map from per-set to per-cpu ctx?
Something like:

    q->queue_hw_ctx[ctx->map[type]]

Thanks
Jianchao
Re: [PATCH] blk-mq: place trace_block_getrq() in correct place
On 10/23/18 10:30 PM, Xiaoguang Wang wrote:
> trace_block_getrq() is to indicate a request struct has been allocated
> for the queue, so put it in the right place.
>
> Signed-off-by: Xiaoguang Wang
> ---
>  block/blk-mq.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e3c39ea8e17b..93aec25f1340 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1833,8 +1833,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>
>  	rq_qos_throttle(q, bio, NULL);
>
> -	trace_block_getrq(q, bio, bio->bi_opf);
> -
>  	rq = blk_mq_get_request(q, bio, bio->bi_opf, &data);
>  	if (unlikely(!rq)) {
>  		rq_qos_cleanup(q, bio);
> @@ -1843,6 +1841,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>  		return BLK_QC_T_NONE;
>  	}
>
> +	trace_block_getrq(q, bio, bio->bi_opf);
> +
>  	rq_qos_track(q, rq, bio);
>
>  	cookie = request_to_qc_t(data.hctx, rq);

Looks fine.

Reviewed-by: Jianchao Wang

Thanks
Jianchao
Re: [PATCH] blk-mq: complete req in softirq context in case of single queue
Hi Ming

On 09/27/2018 12:08 AM, Ming Lei wrote:
> Lots of controllers may have only one irq vector for completing IO
> requests. And usually the affinity of the only irq vector is all
> possible CPUs; however, on most ARCHs, there may be only one specific
> CPU for handling this interrupt.
>
> So if all IOs are completed in hardirq context, it is inevitable to
> degrade IO performance because of increased irq latency.
>
> This patch tries to address this issue by allowing request completion
> in softirq context, like the legacy IO path.
>
> IOPS is observed as ~13%+ in the following randread test on raid0 over
> virtio-scsi.
>
> mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi
>
> fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=32 --rw=randread --blocksize=4k
>
> Cc: Zach Marano
> Cc: Christoph Hellwig
> Cc: Bart Van Assche
> Cc: Jianchao Wang
> Signed-off-by: Ming Lei
> ---
>  block/blk-mq.c      | 14 ++++++++++++++
>  block/blk-softirq.c |  7 +++++--
>  2 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 85a1c1a59c72..d4792c3ac983 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -565,6 +565,20 @@ static void __blk_mq_complete_request(struct request *rq)
>  	if (rq->internal_tag != -1)
>  		blk_mq_sched_completed_request(rq);
>
> +	/*
> +	 * On most single queue controllers, there is only one irq vector
> +	 * for handling IO completion, and the only irq's affinity is set
> +	 * as all possible CPUs. On most ARCHs, this affinity means the
> +	 * irq is handled on one specific CPU.
> +	 *
> +	 * So complete the IO request in softirq context in case of a
> +	 * single queue, to avoid degrading IO performance via irqsoff
> +	 * latency.
> +	 */
> +	if (rq->q->nr_hw_queues == 1) {
> +		__blk_complete_request(rq);
> +		return;
> +	}
> +
>  	if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
>  		rq->q->softirq_done_fn(rq);
>  		return;
> diff --git a/block/blk-softirq.c b/block/blk-softirq.c
> index 15c1f5e12eb8..b1df9b6c1731 100644
> --- a/block/blk-softirq.c
> +++ b/block/blk-softirq.c
> @@ -101,17 +101,20 @@ void __blk_complete_request(struct request *req)
>  	struct request_queue *q = req->q;
>  	unsigned long flags;
>  	bool shared = false;
> +	int rq_cpu;
>
>  	BUG_ON(!q->softirq_done_fn);
>
> +	rq_cpu = q->mq_ops ? req->mq_ctx->cpu : req->cpu;
> +
>  	local_irq_save(flags);
>  	cpu = smp_processor_id();
>
>  	/*
>  	 * Select completion CPU
>  	 */
> -	if (req->cpu != -1) {
> -		ccpu = req->cpu;
> +	if (rq_cpu != -1) {
> +		ccpu = q->mq_ops ? req->mq_ctx->cpu : req->cpu;
>  		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
>  			shared = cpus_share_cache(cpu, ccpu);
>  	} else
>  		ccpu = cpu;

Only when QUEUE_FLAG_SAME_COMP is set will we try to complete the request
on the cpu where it was allocated. For blk-legacy, req->cpu is -1 when
QUEUE_FLAG_SAME_COMP is not set:

blk_queue_bio
    if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
        req->cpu = raw_smp_processor_id();

But for blk-mq, req->mq_ctx->cpu is always valid. So we should check
QUEUE_FLAG_SAME_COMP for the blk-mq case as well.

On the other hand, how about splitting the softirq completion part out of
__blk_complete_request? It is confusing to find __blk_complete_request
inside __blk_mq_complete_request.

Thanks
Jianchao
Re: [PATCHv2] blk-mq: Allow blocking queue tag iter callbacks
Hi Keith

On 09/26/2018 12:36 AM, Keith Busch wrote:
> A recent commit runs tag iterator callbacks under the rcu read lock,
> but existing callbacks do not satisfy the non-blocking requirement.
> The commit intended to prevent an iterator from accessing a queue that's
> being modified. This patch fixes the original issue by taking a queue
> reference instead of reading it, which allows callbacks to make blocking
> calls.
>
> Fixes: f5e4d6357 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
> Cc: Jianchao Wang
> Signed-off-by: Keith Busch
> ---
> v1 -> v2:
>
>   Leave the iterator interface as-is, making this change transparent
>   to the callers. This will increment the queue usage counter in the
>   timeout path twice, which is harmless.

This is OK to me.

Acked-by: Jianchao Wang

>   Changelog.
>
>  block/blk-mq-tag.c | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 94e1ed667b6e..41317c50a446 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -322,16 +322,11 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>
>  	/*
>  	 * __blk_mq_update_nr_hw_queues will update the nr_hw_queues and
> -	 * queue_hw_ctx after freeze the queue. So we could use q_usage_counter
> -	 * to avoid race with it. __blk_mq_update_nr_hw_queues will users
> -	 * synchronize_rcu to ensure all of the users go out of the critical
> -	 * section below and see zeroed q_usage_counter.
> +	 * queue_hw_ctx after freeze the queue, so we use q_usage_counter
> +	 * to avoid race with it.
>  	 */
> -	rcu_read_lock();
> -	if (percpu_ref_is_zero(&q->q_usage_counter)) {
> -		rcu_read_unlock();
> +	if (!percpu_ref_tryget(&q->q_usage_counter))
>  		return;
> -	}
>
>  	queue_for_each_hw_ctx(q, hctx, i) {
>  		struct blk_mq_tags *tags = hctx->tags;
> @@ -347,7 +342,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>  			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
>  		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
>  	}
> -	rcu_read_unlock();
> +	blk_queue_exit(q);
>  }
>
>  static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
Re: [PATCH] blk-mq: Allow blocking queue tag iter callbacks
Hi Bart

On 09/25/2018 10:20 AM, Bart Van Assche wrote:
> On 9/24/18 7:11 PM, jianchao.wang wrote:
>> Hi Keith
>>
>> On 09/25/2018 05:09 AM, Keith Busch wrote:
>>> -	/* A deadlock might occur if a request is stuck requiring a
>>> -	 * timeout at the same time a queue freeze is waiting
>>> -	 * completion, since the timeout code would not be able to
>>> -	 * acquire the queue reference here.
>>> -	 *
>>> -	 * That's why we don't use blk_queue_enter here; instead, we use
>>> -	 * percpu_ref_tryget directly, because we need to be able to
>>> -	 * obtain a reference even in the short window between the queue
>>> -	 * starting to freeze, by dropping the first reference in
>>> -	 * blk_freeze_queue_start, and the moment the last request is
>>> -	 * consumed, marked by the instant q_usage_counter reaches
>>> -	 * zero.
>>> -	 */
>>> -	if (!percpu_ref_tryget(&q->q_usage_counter))
>>> +	if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next))
>>>  		return;
>>
>> We cannot discard the percpu_ref_tryget here.
>>
>> The following code in blk_mq_timeout_work still needs it:
>>
>> if (next != 0) {
>>         mod_timer(&q->timeout, next);
>> } else {
>>         queue_for_each_hw_ctx(q, hctx, i) {
>>                 /* the hctx may be unmapped, so check it here */
>>                 if (blk_mq_hw_queue_mapped(hctx))
>>                         blk_mq_tag_idle(hctx);
>>         }
>> }
>
> Hi Jianchao,
>
> Had you noticed that the percpu_ref_tryget() call has been moved into
> blk_mq_queue_tag_busy_iter()?

Yes. But the issue is that the remaining part of blk_mq_timeout_work is
moved out of the protection of the queue refcount.

Thanks
Jianchao
Re: [PATCH] blk-mq: Allow blocking queue tag iter callbacks
Hi Keith

On 09/25/2018 05:09 AM, Keith Busch wrote:
> -	/* A deadlock might occur if a request is stuck requiring a
> -	 * timeout at the same time a queue freeze is waiting
> -	 * completion, since the timeout code would not be able to
> -	 * acquire the queue reference here.
> -	 *
> -	 * That's why we don't use blk_queue_enter here; instead, we use
> -	 * percpu_ref_tryget directly, because we need to be able to
> -	 * obtain a reference even in the short window between the queue
> -	 * starting to freeze, by dropping the first reference in
> -	 * blk_freeze_queue_start, and the moment the last request is
> -	 * consumed, marked by the instant q_usage_counter reaches
> -	 * zero.
> -	 */
> -	if (!percpu_ref_tryget(&q->q_usage_counter))
> +	if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next))
>  		return;

We cannot discard the percpu_ref_tryget here.

The following code in blk_mq_timeout_work still needs it:

if (next != 0) {
        mod_timer(&q->timeout, next);
} else {
        queue_for_each_hw_ctx(q, hctx, i) {
                /* the hctx may be unmapped, so check it here */
                if (blk_mq_hw_queue_mapped(hctx))
                        blk_mq_tag_idle(hctx);
        }
}

Thanks
Jianchao
Re: [PATCH v7 5/6] block: Make blk_get_request() block for non-PM requests while suspended
Hi Bart

On 09/19/2018 01:44 AM, Bart Van Assche wrote:
>>> +	 * ago. Since blk_queue_enter() is called by the request allocation
>>> +	 * code before pm_request_resume(), if no requests have a tag assigned
>>> +	 * it is safe to suspend the device.
>>> +	 */
>>> +	ret = -EBUSY;
>>> +	if (blk_requests_in_flight(q) == 0) {
>>> +		/*
>>> +		 * Call synchronize_rcu() such that later blk_queue_enter()
>>> +		 * calls see the pm-only state. See also
>>> +		 * http://lwn.net/Articles/573497/.
>>> +		 */
>>> +		synchronize_rcu();
>>> +		if (blk_requests_in_flight(q) == 0)
>>
>> Seems not safe here.
>>
>> For the blk-mq path: someone may have escaped from the preempt checking,
>> missed the blk_pm_request_resume there, entered generic_make_request,
>> but not yet allocated a request or occupied any tag.
>>
>> There could be a similar scenario for the blk-legacy path: q->nr_pending
>> is only increased when the request is queued.
>>
>> So I guess the q_usage_counter checking is still needed here.
>
> There is only one blk_pm_request_resume() call and that call is inside
> blk_queue_enter() after the pm_only counter has been checked.
>
> For the legacy block layer, nr_pending is increased after the
> blk_queue_enter() call from inside blk_old_get_request() succeeded.
>
> So I don't see how blk_pm_request_resume() or q->nr_pending++ could escape
> from the preempt checking?

For example:

blk_pre_runtime_suspend            generic_make_request
                                     blk_queue_enter
                                     // succeeds; no blk_pm_request_resume here
                                     blk_mq_make_request
  blk_set_pm_only
  if (blk_requests_in_flight(q) == 0) {
      synchronize_rcu();
      if (blk_requests_in_flight(q) == 0)
          ret = 0;
  }
                                     blk_mq_get_request

Thanks
Jianchao
Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
Hi Ming On 09/18/2018 03:42 PM, Ming Lei wrote: > On Tue, Sep 18, 2018 at 09:17:12AM +0800, jianchao.wang wrote: >> Hi Ming >> >> On 09/17/2018 08:07 PM, Ming Lei wrote: >>>>> This way will delay runtime pm or system suspend until the queue is >>>>> unfrozen, >>>>> but it isn't reasonable. >>>> This interface is for the __scsi_execute only, before we call into >>>> function, we should have >>>> get the device resumed synchronously. >>> I mean when the queue is frozen, it is reasonable to runtime suspend the >>> queue. However, >>> blk_queue_preempt_enter() is still waiting for queue becoming unfreezing >>> first. >> >> We don't freeze the queue, but set preempt-only mode when suspend the queue. >> :) > > But the queue can be frozen by other paths. Even though it is frozen, the > queue > should be allowed to suspend too. > Yes, the race between freeze and preempt-only is a tricky issue. Thanks Jianchao
Re: [PATCH v7 5/6] block: Make blk_get_request() block for non-PM requests while suspended
Hi Bart On 09/18/2018 04:15 AM, Bart Van Assche wrote: > Instead of allowing requests that are not power management requests > to enter the queue in runtime suspended status (RPM_SUSPENDED), make > the blk_get_request() caller block. This change fixes a starvation > issue: it is now guaranteed that power management requests will be > executed no matter how many blk_get_request() callers are waiting. > Instead of maintaining the q->nr_pending counter, rely on > q->q_usage_counter. Looks like we still depend on this nr_pending for blk-legacy. Call pm_runtime_mark_last_busy() every time a > request finishes instead of only if the queue depth drops to zero. ... > /* > diff --git a/block/blk-pm.c b/block/blk-pm.c > index 9b636960d285..5f21bedcb307 100644 > --- a/block/blk-pm.c > +++ b/block/blk-pm.c > @@ -1,8 +1,11 @@ > // SPDX-License-Identifier: GPL-2.0 > > +#include > #include > #include > #include > +#include "blk-mq.h" > +#include "blk-mq-tag.h" > > /** > * blk_pm_runtime_init - Block layer runtime PM initialization routine > @@ -40,6 +43,36 @@ void blk_pm_runtime_init(struct request_queue *q, struct > device *dev) > } > EXPORT_SYMBOL(blk_pm_runtime_init); > > +struct in_flight_data { > + struct request_queue*q; > + int in_flight; > +}; > + > +static void blk_count_in_flight(struct blk_mq_hw_ctx *hctx, struct request > *rq, > + void *priv, bool reserved) > +{ > + struct in_flight_data *in_flight = priv; > + > + if (rq->q == in_flight->q) > + in_flight->in_flight++; > +} > + > +/* > + * Count the number of requests that are in flight for request queue @q. Use > + * @q->nr_pending for legacy queues. Iterate over the tag set for blk-mq > + * queues. Use blk_mq_queue_tag_busy_iter() instead of > + * blk_mq_tagset_busy_iter() because the latter only considers requests that > + * have already been started. 
> + */ > +static int blk_requests_in_flight(struct request_queue *q) > +{ > + struct in_flight_data in_flight = { .q = q }; > + > + if (q->mq_ops) > + blk_mq_queue_tag_busy_iter(q, blk_count_in_flight, &in_flight); > + return q->nr_pending + in_flight.in_flight; > +} blk_mq_queue_tag_busy_iter only accounts for driver tags, so this will only work without an io scheduler > + > /** > * blk_pre_runtime_suspend - Pre runtime suspend check > * @q: the queue of the device > @@ -68,14 +101,38 @@ int blk_pre_runtime_suspend(struct request_queue *q) > if (!q->dev) > return ret; > > + WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE); > + > + blk_set_pm_only(q); > + /* > + * This function only gets called if the most recent > + * pm_request_resume() call occurred at least autosuspend_delay_ms ^^^ pm_runtime_mark_last_busy ? > + * ago. Since blk_queue_enter() is called by the request allocation > + * code before pm_request_resume(), if no requests have a tag assigned > + * it is safe to suspend the device. > + */ > + ret = -EBUSY; > + if (blk_requests_in_flight(q) == 0) { > + /* > + * Call synchronize_rcu() such that later blk_queue_enter() > + * calls see the pm-only state. See also > + * http://lwn.net/Articles/573497/. > + */ > + synchronize_rcu(); > + if (blk_requests_in_flight(q) == 0) Seems not safe here. For the blk-mq path: Someone may have escaped from the preempt checking, missed the blk_pm_request_resume there, and entered generic_make_request, but has not yet allocated a request or occupied any tag. There could be a similar scenario for the blk-legacy path, since q->nr_pending is only increased when the request is queued. So I guess the q_usage_counter checking is still needed here. 
> + ret = 0; > + } > + > spin_lock_irq(q->queue_lock); > - if (q->nr_pending) { > - ret = -EBUSY; > + if (ret < 0) > pm_runtime_mark_last_busy(q->dev); > - } else { > + else > q->rpm_status = RPM_SUSPENDING; > - } > spin_unlock_irq(q->queue_lock); > + > + if (ret) > + blk_clear_pm_only(q); > + > return ret; > } > EXPORT_SYMBOL(blk_pre_runtime_suspend); > @@ -106,6 +163,9 @@ void blk_post_runtime_suspend(struct request_queue *q, > int err) > pm_runtime_mark_last_busy(q->dev); > } > spin_unlock_irq(q->queue_lock); > + > + if (err) > + blk_clear_pm_only(q); > } > EXPORT_SYMBOL(blk_post_runtime_suspend); > > @@ -153,13 +213,15 @@ void blk_post_runtime_resume(struct request_queue *q, > int err) > spin_lock_irq(q->queue_lock); >
Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
Hi Ming On 09/17/2018 08:07 PM, Ming Lei wrote: >>> This way will delay runtime pm or system suspend until the queue is >>> unfrozen, >>> but it isn't reasonable. >> This interface is for the __scsi_execute only, before we call into function, >> we should have >> get the device resumed synchronously. > I mean when the queue is frozen, it is reasonable to runtime suspend the > queue. However, > blk_queue_preempt_enter() is still waiting for queue becoming unfreezing > first. We don't freeze the queue, but set preempt-only mode when suspending the queue. :) Thanks Jianchao
Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
Hi Ming Thanks for your kindly response. On 09/16/2018 09:09 PM, Ming Lei wrote: > On Fri, Sep 14, 2018 at 03:27:44PM +0800, jianchao.wang wrote: >> Hi Ming >> >> On 09/13/2018 08:15 PM, Ming Lei wrote: >>> This patchset introduces per-host admin request queue for submitting >>> admin request only, and uses this approach to implement both SCSI >>> quiesce and runtime PM in one very simply way. Also runtime PM deadlock >>> can be avoided in case that request pool is used up, such as when too >>> many IO requests are allocated before resuming device >> >> To be honest, after look in to the SCSI part of this patchset, I really >> suspect whether it is worth to introduce this per scsi-host adminq. >> Too much complexity is introduced into SCSI. Compared with this, the current > > Can you comment on individual patch about which part is complicated? > I will explain them one by one. I have read through the scsi part of your patch many times. :) After this patchset, we could control the IO request_queues separately. This is more convenient. But what we have to pay is the relative complexity _spread_ over scsi-layer code. Especially, we have to handle that a request to a scsi device could come from either its own IO request_queue or the per-host adminq, at both the submit and completion sides. We have handled those cases in the patchset, but that looks really boring. And also it is risky in the current stable scsi mid-layer code. I really think we should confine the complexity to a very small range. > >> preempt-only feature look much simpler. >> >> If you just don't like the preempt-only checking in the hot path, we may >> introduce >> a new interface similar with blk_queue_enter for the non hot path. >> >> With your patch percpu-refcount: relax limit on percpu_ref_reinit() > > Seems this way may be one improvement on Bart's V6. 
> >> >> (totally no test) >> diff --git a/block/blk-core.c b/block/blk-core.c >> index 4dbc93f..dd7f007 100644 >> --- a/block/blk-core.c >> +++ b/block/blk-core.c >> @@ -427,14 +427,24 @@ EXPORT_SYMBOL(blk_sync_queue); >> * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was >> not >> * set and 1 if the flag was already set. >> */ >> -int blk_set_preempt_only(struct request_queue *q) >> +int blk_set_preempt_only(struct request_queue *q, bool drain) >> { >> -return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q); >> +if (blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q)) >> +return 1; >> +percpu_ref_kill(&q->q_usage_counter); >> +synchronize_sched(); >> + >> +if (drain) >> +wait_event(q->mq_freeze_wq, >> +percpu_ref_is_zero(&q->q_usage_counter)); >> + >> +return 0; >> } >> EXPORT_SYMBOL_GPL(blk_set_preempt_only); > > It is easy to cause race between the normal freeze interface and the > above one. That may be one new complexity of the preempt only approach, > because there can be more than one path to kill/reinit .q_usage_counter. Yes, indeed. > >> >> void blk_clear_preempt_only(struct request_queue *q) >> { >> +percpu_ref_reinit(&q->q_usage_counter); >> blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q); >> wake_up_all(&q->mq_freeze_wq); >> } >> @@ -913,31 +923,13 @@ EXPORT_SYMBOL(blk_alloc_queue); >> /** >> * blk_queue_enter() - try to increase q->q_usage_counter >> * @q: request queue pointer >> - * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT >> + * @flags: BLK_MQ_REQ_NOWAIT >> */ >> int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) >> { >> -const bool preempt = flags & BLK_MQ_REQ_PREEMPT; >> - >> while (true) { >> -bool success = false; >> - >> -rcu_read_lock(); >> -if (percpu_ref_tryget_live(&q->q_usage_counter)) { >> -/* >> - * The code that sets the PREEMPT_ONLY flag is >> - * responsible for ensuring that that flag is globally >> - * visible before the queue is unfrozen. 
>> - */ >> -if (preempt || !blk_queue_preempt_only(q)) { >> -success = true; >> -} else { >> -percpu_ref_put(&q->q_usage_counter); >> -} >> -} >> -rcu_read_unlock(); >> >> -if (success) >> +if (percpu_ref_tryget_live(&q->q_usage_counter)) >> return 0; >> >> if (flags
Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
Hi Ming On 09/13/2018 08:15 PM, Ming Lei wrote: > This patchset introduces per-host admin request queue for submitting > admin request only, and uses this approach to implement both SCSI > quiesce and runtime PM in one very simply way. Also runtime PM deadlock > can be avoided in case that request pool is used up, such as when too > many IO requests are allocated before resuming device To be honest, after looking into the SCSI part of this patchset, I really doubt whether it is worth introducing this per scsi-host adminq. Too much complexity is introduced into SCSI. Compared with this, the current preempt-only feature looks much simpler. If you just don't like the preempt-only checking in the hot path, we may introduce a new interface similar to blk_queue_enter for the non-hot path. With your patch percpu-refcount: relax limit on percpu_ref_reinit() (totally no test) diff --git a/block/blk-core.c b/block/blk-core.c index 4dbc93f..dd7f007 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -427,14 +427,24 @@ EXPORT_SYMBOL(blk_sync_queue); * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not * set and 1 if the flag was already set. 
*/ -int blk_set_preempt_only(struct request_queue *q) +int blk_set_preempt_only(struct request_queue *q, bool drain) { -return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q); +if (blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q)) +return 1; +percpu_ref_kill(&q->q_usage_counter); +synchronize_sched(); + +if (drain) +wait_event(q->mq_freeze_wq, +percpu_ref_is_zero(&q->q_usage_counter)); + +return 0; } EXPORT_SYMBOL_GPL(blk_set_preempt_only); void blk_clear_preempt_only(struct request_queue *q) { +percpu_ref_reinit(&q->q_usage_counter); blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q); wake_up_all(&q->mq_freeze_wq); } @@ -913,31 +923,13 @@ EXPORT_SYMBOL(blk_alloc_queue); /** * blk_queue_enter() - try to increase q->q_usage_counter * @q: request queue pointer - * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT + * @flags: BLK_MQ_REQ_NOWAIT */ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) { -const bool preempt = flags & BLK_MQ_REQ_PREEMPT; - while (true) { -bool success = false; - -rcu_read_lock(); -if (percpu_ref_tryget_live(&q->q_usage_counter)) { -/* - * The code that sets the PREEMPT_ONLY flag is - * responsible for ensuring that that flag is globally - * visible before the queue is unfrozen. - */ -if (preempt || !blk_queue_preempt_only(q)) { -success = true; -} else { -percpu_ref_put(&q->q_usage_counter); -} -} -rcu_read_unlock(); -if (success) +if (percpu_ref_tryget_live(&q->q_usage_counter)) return 0; if (flags & BLK_MQ_REQ_NOWAIT) @@ -954,7 +946,44 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) wait_event(q->mq_freeze_wq, (atomic_read(&q->mq_freeze_depth) == 0 && -(preempt || !blk_queue_preempt_only(q))) || + !blk_queue_preempt_only(q)) || + blk_queue_dying(q)); +if (blk_queue_dying(q)) +return -ENODEV; +} +} + +/* + * When set PREEMPT_ONLY mode, q->q_usage_counter is killed. + * If only PREEMPT_ONLY mode, go on. + * If queue frozen, wait. 
+ */ +int blk_queue_preempt_enter(struct request_queue *q, +blk_mq_req_flags_t flags) +{ +while (true) { + +if (percpu_ref_tryget_live(&q->q_usage_counter)) +return 0; + +smp_rmb(); + +/* + * If queue is only in PREEMPT_ONLY mode, get the ref + * directly. The one who set PREEMPT_ONLY mode is responsible + * to wake up the waiters on mq_freeze_wq. + */ +if (!atomic_read(&q->mq_freeze_depth) && +blk_queue_preempt_only(q)) { +percpu_ref_get_many(&q->q_usage_counter, 1); +return 0; +} + +if (flags & BLK_MQ_REQ_NOWAIT) +return -EBUSY; + +wait_event(q->mq_freeze_wq, + (atomic_read(&q->mq_freeze_depth) == 0 || blk_queue_dying(q)); if (blk_queue_dying(q)) return -ENODEV; @@ -1587,7 +1616,7 @@ static struct request *blk_old_get_request(struct request_queue *q, /* create ioc upfront */ create_io_context(gfp_mask, q->node); -ret = blk_queue_enter(q, flags); +ret = blk_queue_preempt_enter(q, flags); if (ret) return ERR_PTR(ret); spin_lock_irq(q->queue_lock); diff --git a/block/blk-mq.c b/block/blk-mq.c index 85a1c1a..3aea78f 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -403,7 +403,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op, struct request *rq; int ret; -ret =
Re: No protection on the hctx->dispatch_busy
On 08/27/2018 03:00 PM, Ming Lei wrote: > On Mon, Aug 27, 2018 at 01:56:39PM +0800, jianchao.wang wrote: >> Hi Ming >> >> Currently, blk_mq_update_dispatch_busy is hooked in blk_mq_dispatch_rq_list >> and __blk_mq_issue_directly. blk_mq_update_dispatch_busy could be invoked on >> multiple >> cpus concurrently. But there is not any protection on the >> hctx->dispatch_busy. We cannot >> ensure the update on the dispatch_busy atomically. > > The update itself is atomic given type of this variable is 'unsigned int'. blk_mq_update_dispatch_busy doesn't just write an unsigned int variable; it reads, calculates and then writes. The whole operation is not atomic. > >> >> >> Look at the test result after applied the debug patch below: >> >> fio-1761 [000] 227.246251: >> blk_mq_update_dispatch_busy.part.50: old 0 ewma 2 cur 2 >> fio-1766 [004] 227.246252: >> blk_mq_update_dispatch_busy.part.50: old 2 ewma 1 cur 1 >> fio-1755 [000] 227.246366: >> blk_mq_update_dispatch_busy.part.50: old 1 ewma 0 cur 0 >> fio-1754 [003] 227.266050: >> blk_mq_update_dispatch_busy.part.50: old 2 ewma 3 cur 3 >> fio-1763 [007] 227.266050: >> blk_mq_update_dispatch_busy.part.50: old 0 ewma 2 cur 2 >> fio-1761 [000] 227.266051: >> blk_mq_update_dispatch_busy.part.50: old 3 ewma 2 cur 2 >> fio-1766 [004] 227.266051: >> blk_mq_update_dispatch_busy.part.50: old 3 ewma 2 cur 2 >> fio-1760 [005] 227.266165: >> blk_mq_update_dispatch_busy.part.50: old 2 ewma 1 cur 1 >> ... >> >> Is it expected ? > > Yes, it won't be a issue in reality given hctx->dispatch_busy is used as > a hint, and it often works as expected and hctx->dispatch_busy is convergent > finally because it is exponential weighted moving average. My concern is just that the value of dispatch_busy will bounce up and down in a small range under high workload on systems with 32 or more cores, due to cache effects and the non-atomic update. > > Thanks, > Ming >
No protection on the hctx->dispatch_busy
Hi Ming Currently, blk_mq_update_dispatch_busy is hooked in blk_mq_dispatch_rq_list and __blk_mq_issue_directly. blk_mq_update_dispatch_busy could be invoked on multiple cpus concurrently. But there is not any protection on the hctx->dispatch_busy. We cannot ensure the update on the dispatch_busy atomically. Look at the test result after applied the debug patch below: fio-1761 [000] 227.246251: blk_mq_update_dispatch_busy.part.50: old 0 ewma 2 cur 2 fio-1766 [004] 227.246252: blk_mq_update_dispatch_busy.part.50: old 2 ewma 1 cur 1 fio-1755 [000] 227.246366: blk_mq_update_dispatch_busy.part.50: old 1 ewma 0 cur 0 fio-1754 [003] 227.266050: blk_mq_update_dispatch_busy.part.50: old 2 ewma 3 cur 3 fio-1763 [007] 227.266050: blk_mq_update_dispatch_busy.part.50: old 0 ewma 2 cur 2 fio-1761 [000] 227.266051: blk_mq_update_dispatch_busy.part.50: old 3 ewma 2 cur 2 fio-1766 [004] 227.266051: blk_mq_update_dispatch_busy.part.50: old 3 ewma 2 cur 2 fio-1760 [005] 227.266165: blk_mq_update_dispatch_busy.part.50: old 2 ewma 1 cur 1 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1088,11 +1088,12 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx, static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy) { unsigned int ewma; + unsigned int old; if (hctx->queue->elevator) return; - ewma = hctx->dispatch_busy; + old = ewma = hctx->dispatch_busy; if (!ewma && !busy) return; @@ -1103,6 +1104,8 @@ static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy) ewma /= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT; hctx->dispatch_busy = ewma; + + trace_printk("old %u ewma %u cur %u\n", old, ewma, READ_ONCE(hctx->dispatch_busy)); } Is it expected ? Thanks Jianchao
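The lost-update window is easiest to see if the EWMA arithmetic is lifted out into a pure function. The sketch below is user-space C, not kernel code; the constants and formula mirror the blk_mq_update_dispatch_busy() body quoted above (BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT = 8, BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR = 4 in blk-mq.c at the time of this thread):

```c
#define BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT 8
#define BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR 4

/* Same arithmetic as blk_mq_update_dispatch_busy() above, as a pure
 * function: the real code does this as an unlocked read-modify-write
 * on hctx->dispatch_busy. */
unsigned int ewma_update(unsigned int ewma, int busy)
{
	if (!ewma && !busy)
		return 0;

	ewma *= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT - 1;
	if (busy)
		ewma += 1 << BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR;
	ewma /= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT;

	return ewma;
}
```

With this, the race is easy to state: if two CPUs both read old = 2, one computes ewma_update(2, 1) = 3 and the other ewma_update(2, 0) = 1, and whichever store lands last wins; the other update is silently discarded, which is the interleaving visible in the trace above.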
Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
Hi Jens On 08/25/2018 11:41 PM, Jens Axboe wrote: > do { > - set_current_state(TASK_UNINTERRUPTIBLE); > + if (test_bit(0, )) > + break; > > - if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw))) > + WARN_ON_ONCE(list_empty()); > + > + if (!has_sleeper && > + rq_wait_inc_below(rqw, get_limit(rwb, rw))) { > + finish_wait(&rqw->wait, ); > + > + /* > + * We raced with wbt_wake_function() getting a token, > + * which means we now have two. Put ours and wake > + * anyone else potentially waiting for one. > + */ > + if (test_bit(0, )) > + wbt_rqw_done(rwb, rqw, wb_acct); > break; Just using a 'bool' variable should be OK. After finish_wait, no one could race with us here. > + } > > if (lock) { > spin_unlock_irq(lock); > @@ -511,11 +569,11 @@ static void __wbt_wait(struct rq_wb *rwb, enum > wbt_flags wb_acct, > spin_lock_irq(lock); > } else > io_schedule(); > + > has_sleeper = false; > } while (1); I cannot get the point of "since we can't rely on just being woken from the ->func handler we set". Do you mean someone else could wake up this task ? Thanks Jianchao
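For reference, the token accounting both the waker and this loop go through is a limit-bounded atomic increment. A user-space sketch of that primitive, modeled on the kernel's atomic_inc_below() behind rq_wait_inc_below() (the name is kept, kernel atomics are swapped for C11 atomics):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a token only while the in-flight count stays below the limit;
 * modeled on atomic_inc_below() as used by rq_wait_inc_below(). */
bool rq_wait_inc_below(atomic_int *inflight, int limit)
{
	int cur = atomic_load(inflight);

	do {
		if (cur >= limit)
			return false;
		/* on CAS failure, cur is reloaded with the current value */
	} while (!atomic_compare_exchange_weak(inflight, &cur, cur + 1));

	return true;
}
```

The race discussed in the patch is then simply two successful grabs for one waiter: the waker hands the task a token via the ->func handler while the task itself also wins rq_wait_inc_below(), so one of the two tokens must be put back, which is what the wbt_rqw_done() call in the quoted hunk does.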
Re: [PATCH v6 08/12] block, scsi: Introduce blk_pm_runtime_exit()
Hi Bart On 08/11/2018 12:17 AM, Bart Van Assche wrote: > void blk_pm_runtime_exit(struct request_queue *q) > { > if (!q->dev) > return; > > pm_runtime_get_sync(q->dev); > blk_freeze_queue(q); > q->dev = NULL; > blk_unfreeze_queue(q); > } I'm afraid this will not work. In following patch: @@ -972,6 +972,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) if (success) return 0; + blk_pm_request_resume(q); + if (flags & BLK_MQ_REQ_NOWAIT) return -EBUSY; We could still invoke blk_pm_request_resume even if the queue is frozen. And we have blk_pm_request_resume later as following: +static inline void blk_pm_request_resume(struct request_queue *q) +{ + if (q->dev && (q->rpm_status == RPM_SUSPENDED || + q->rpm_status == RPM_SUSPENDING)) + pm_request_resume(q->dev); +} Thanks Jianchao
Re: [PATCH v6 10/12] block: Change the runtime power management approach (1/2)
Hi Bart On 08/10/2018 03:41 AM, Bart Van Assche wrote: > Instead of scheduling runtime resume of a request queue after a > request has been queued, schedule asynchronous resume during request > allocation. The new pm_request_resume() calls occur after > blk_queue_enter() has increased the q_usage_counter request queue ^^^ > member. This change is needed for a later patch that will make request > allocation block while the queue status is not RPM_ACTIVE. Is it "after getting q->q_usage_counter fails" ? And also this blk_pm_request_resume will not affect the normal path. ;) Thanks Jianchao
Re: [PATCH v6 11/12] block: Change the runtime power management approach (2/2)
Hi Bart On 08/10/2018 03:41 AM, Bart Van Assche wrote: > + > + blk_set_pm_only(q); > + /* > + * This function only gets called if the most recent > + * pm_request_resume() call occurred at least autosuspend_delay_ms > + * ago. Since blk_queue_enter() is called by the request allocation > + * code before pm_request_resume(), if no requests have a tag assigned > + * it is safe to suspend the device. > + */ > + ret = -EBUSY; > + if (blk_requests_in_flight(q) == 0) { > + /* > + * Call synchronize_rcu() such that later blk_queue_enter() > + * calls see the preempt-only state. See also > + * http://lwn.net/Articles/573497/. > + */ > + synchronize_rcu(); > + if (blk_requests_in_flight(q) == 0) > + ret = 0; > + } I still think blk_set_pm_only should be moved after blk_requests_in_flight. Otherwise, the normal IO will be blocked for a little while if there are still busy requests. Thanks Jianchao
Re: [PATCH v6 05/12] block, scsi: Rename QUEUE_FLAG_PREEMPT_ONLY into DV_ONLY and introduce PM_ONLY
Hi Bart On 08/10/2018 03:41 AM, Bart Van Assche wrote: > +/* > + * Whether or not blk_queue_enter() should proceed. RQF_PM requests are > always > + * allowed. RQF_DV requests are allowed if the PM_ONLY queue flag has not > been > + * set. Other requests are only allowed if neither PM_ONLY nor DV_ONLY has > been > + * set. > + */ > +static inline bool blk_enter_allowed(struct request_queue *q, > + blk_mq_req_flags_t flags) > +{ > + return flags & BLK_MQ_REQ_PM || > + (!blk_queue_pm_only(q) && > + (flags & BLK_MQ_REQ_DV || !blk_queue_dv_only(q))); > +} If a new state is indeed necessary, I think this kind of checking in hot path is inefficient. How about introduce a new state into request_queue, such as request_queue->gate_state. Set the PM_ONLY and DV_ONLY into this state, then we could just check request_queue->gate_state > 0 before do further checking. Thanks Jianchao
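To make the suggestion concrete, here is a hypothetical user-space sketch of such a combined gate word. The GATE_* names and the function name are illustrative, not from the patch; the point is only the fast path, a single test when nothing is gated, while keeping the same semantics as the quoted blk_enter_allowed():

```c
/* Hypothetical combined gate word: both flags live in one field so the
 * hot path pays a single comparison when no gating is active. */
enum {
	GATE_PM_ONLY = 1u << 0,
	GATE_DV_ONLY = 1u << 1,
};

/* Same semantics as the quoted blk_enter_allowed(): PM requests always
 * pass; DV requests pass unless PM_ONLY is set; other requests pass
 * only when neither flag is set. */
int gate_enter_allowed(unsigned int gate_state, int req_pm, int req_dv)
{
	if (gate_state == 0)
		return 1;		/* fast path: nothing gated */
	if (req_pm)
		return 1;		/* RQF_PM requests always pass */
	if (gate_state & GATE_PM_ONLY)
		return 0;		/* pm-only blocks everything else */
	return req_dv;			/* dv-only: only RQF_DV passes */
}
```

The slow-path flag tests only run once the single `gate_state == 0` comparison has failed, which is the inefficiency argument above.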
Re: [PATCH v5 5/9] block: Change the runtime power management approach (1/2)
On 08/08/2018 02:11 PM, jianchao.wang wrote: > Hi Bart > > On 08/08/2018 06:51 AM, Bart Van Assche wrote: >> @@ -391,6 +393,9 @@ static struct request *blk_mq_get_request(struct >> request_queue *q, >> } >> } >> data->hctx->queued++; >> + >> +blk_pm_add_request(q, rq); >> + >> return rq; >> } > > The request_queue is in pm_only mode when suspended, who can reach here to do > the resume ? I mean, in the original blk-legacy runtime pm implementation, any new IO could trigger the resume. After your patch set, only a pm request can pass blk_queue_enter while the queue is suspended and in pm-only mode. But if there is no resume, where does the pm request come from ? The blk_pm_add_request should be added to blk_queue_enter. It would look like the following: 1. when a normal io reaches blk_queue_enter, if the queue is in suspended mode, it invokes blk_pm_add_request to trigger the resume, then waits there for the pm_only mode to be cleared. 2. the runtime pm core does the resume work and finally clears the pm_only mode 3. the task blocked in blk_queue_enter is woken up and proceeds. Thanks Jianchao
Re: [PATCH v5 5/9] block: Change the runtime power management approach (1/2)
Hi Bart On 08/08/2018 06:51 AM, Bart Van Assche wrote: > @@ -391,6 +393,9 @@ static struct request *blk_mq_get_request(struct > request_queue *q, > } > } > data->hctx->queued++; > + > + blk_pm_add_request(q, rq); > + > return rq; > } The request_queue is in pm_only mode when suspended, who can reach here to do the resume ? Thanks Jianchao
Re: [PATCH v4 07/10] block, scsi: Rework runtime power management
Hi Bart On 08/04/2018 08:03 AM, Bart Van Assche wrote: > Instead of allowing requests that are not power management requests > to enter the queue in runtime suspended status (RPM_SUSPENDED), make > the blk_get_request() caller block. Who will resume the runtime suspended device ? In the original blk-legacy code, when a request is added, blk_pm_add_request calls pm_request_resume to do that. The request is then issued after the q has been resumed successfully. After this patch, non-pm requests will be blocked and pm_request_resume is removed from blk_pm_add_request, so who resumes the runtime suspended q and device ? Thanks Jianchao
Re: [PATCH v2 4/5] block, scsi: Rework runtime power management
On 07/27/2018 11:29 PM, Bart Van Assche wrote: > On Fri, 2018-07-27 at 09:57 +0800, jianchao.wang wrote: >> Something like: >> >> percpu_ref_switch_to_atomic_sync(&q->q_usage_counter); >> if (!percpu_ref_is_zero(&q->q_usage_counter)) { >> ret = -EBUSY; >> pm_runtime_mark_last_busy(q->dev); >> } else { >> blk_set_preempt_only(q); >> if (!percpu_ref_is_zero(&q->q_usage_counter)) { >> ret = -EBUSY; >> pm_runtime_mark_last_busy(q->dev); >> blk_clear_preempt_only(q); >> } else { >> q->rpm_status = RPM_SUSPENDING; >> } >> } > > I think this code is racy. Because there is no memory barrier in > blk_queue_enter() between the percpu_ref_tryget_live() and the > blk_queue_preempt_only() calls, the context that sets the PREEMPT_ONLY flag > has to use synchronize_rcu() or call_rcu() to ensure that blk_queue_enter() > sees the PREEMPT_ONLY flag after it has called percpu_ref_tryget_live(). > See also http://lwn.net/Articles/573497/. Yes, a synchronize_rcu is indeed needed here to ensure that q_usage_counter is either got successfully, with the reference increased, or not got at all, without increasing it. Thanks Jianchao
Re: [PATCH v2 4/5] block, scsi: Rework runtime power management
Hi Bart On 07/27/2018 11:09 PM, Bart Van Assche wrote: > On Fri, 2018-07-27 at 09:57 +0800, jianchao.wang wrote: >> If q_usage_counter is not zero here, we will leave the request_queue in >> preempt only mode. > > That's on purpose. If q_usage_counter is not zero then > blk_pre_runtime_suspend() > will return -EBUSY. That error code will be passed to > blk_post_runtime_suspend() > and that will cause that function to clear QUEUE_FLAG_PREEMPT_ONLY. > static int sdev_runtime_suspend(struct device *dev) { const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL; struct scsi_device *sdev = to_scsi_device(dev); int err = 0; err = blk_pre_runtime_suspend(sdev->request_queue); if (err) return err; if (pm && pm->runtime_suspend) err = pm->runtime_suspend(dev); blk_post_runtime_suspend(sdev->request_queue, err); return err; } If blk_pre_runtime_suspend returns -EBUSY, blk_post_runtime_suspend will not be invoked. >> The request_queue should be set to preempt only mode only when we confirm we >> could set >> rpm_status to RPM_SUSPENDING or RPM_RESUMING. > > Why do you think this? https://marc.info/?l=linux-scsi=133727953625963=2 " If q->rpm_status is RPM_SUSPENDED, they shouldn't do anything -- act as though the queue is empty. If q->rpm_status is RPM_SUSPENDING or RPM_RESUMING, they should hand over the request only if it has the REQ_PM flag set. " In addition, if we set the preempt-only mode here unconditionally, the normal IO will be blocked during blk_pre_runtime_suspend. In your patch, q_usage_counter will be switched to atomic mode, which could cost some time. Is it really OK ? Thanks Jianchao > > Bart. > >
Re: [PATCH v2 4/5] block, scsi: Rework runtime power management
On 07/27/2018 06:24 AM, Bart Van Assche wrote: > On Thu, 2018-07-26 at 10:45 +0800, jianchao.wang wrote: >> On 07/26/2018 06:26 AM, Bart Van Assche wrote: >>> @@ -102,9 +109,11 @@ int blk_pre_runtime_suspend(struct request_queue *q) >>> return ret; >>> >>> blk_pm_runtime_lock(q); >>> + blk_set_preempt_only(q); >> >> We only stop non-RQF_PM request entering when RPM_SUSPENDING and >> RPM_RESUMING. >> blk_pre_runtime_suspend should only _check_ whether runtime suspend is >> allowed. >> So we should not set preempt_only here. >> >>> + percpu_ref_switch_to_atomic_sync(>q_usage_counter); >>> >>> spin_lock_irq(q->queue_lock); >>> - if (q->nr_pending) { >>> + if (!percpu_ref_is_zero(>q_usage_counter)) { >>> ret = -EBUSY; >>> pm_runtime_mark_last_busy(q->dev); >>> } else { >>> @@ -112,6 +121,7 @@ int blk_pre_runtime_suspend(struct request_queue *q) >>> } >>> spin_unlock_irq(q->queue_lock); >>> >>> + percpu_ref_switch_to_percpu(>q_usage_counter); >>> blk_pm_runtime_unlock(q); > > Hello Jianchao, > > There is only one caller of blk_post_runtime_suspend(), namely > sdev_runtime_suspend(). That function calls pm->runtime_suspend() before > it calls blk_post_runtime_suspend(). I think it would be wrong to set the > PREEMPT_ONLY flag from inside blk_post_runtime_suspend() because that could > cause pm->runtime_suspend() while a request is in progress. > Hi Bart If q_usage_counter is not zero here, we will leave the request_queue in preempt only mode. The request_queue should be set to preempt only mode only when we confirm we could set rpm_status to RPM_SUSPENDING or RPM_RESUMING. Maybe we could set the PREEMPT_ONLY after we confirm we could set the rpm_status to RPM_SUSPENDING. 
Something like: percpu_ref_switch_to_atomic_sync(&q->q_usage_counter); if (!percpu_ref_is_zero(&q->q_usage_counter)) { ret = -EBUSY; pm_runtime_mark_last_busy(q->dev); } else { blk_set_preempt_only(q); if (!percpu_ref_is_zero(&q->q_usage_counter)) { ret = -EBUSY; pm_runtime_mark_last_busy(q->dev); blk_clear_preempt_only(q); } else { q->rpm_status = RPM_SUSPENDING; } } Thanks Jianchao
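The check/set/re-check ordering sketched above can be modeled in user space. This is a single-threaded illustration of the control flow only (the kernel version additionally needs the synchronize_rcu()/memory-barrier step so that enterers observe the flag before the re-check is trusted); the struct and function names are hypothetical, with C11 atomics standing in for the kernel primitives:

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct queue_model {
	atomic_int in_flight;		/* stands in for q_usage_counter */
	atomic_bool preempt_only;
};

/* Claim preempt-only mode only if no request sneaks in between the
 * first check and the flag becoming visible; otherwise roll the flag
 * back and report busy, mirroring the sketch above. */
int pre_runtime_suspend(struct queue_model *q)
{
	if (atomic_load(&q->in_flight) != 0)
		return -EBUSY;

	atomic_store(&q->preempt_only, true);
	/* kernel version: synchronize_rcu() here, so blk_queue_enter()
	 * callers see preempt_only before the re-check below */
	if (atomic_load(&q->in_flight) != 0) {
		atomic_store(&q->preempt_only, false);	/* roll back */
		return -EBUSY;
	}

	return 0;	/* safe to set rpm_status = RPM_SUSPENDING */
}
```

The design point is that the queue only ever stays in preempt-only mode on the success path, so the "leave the request_queue in preempt only mode" problem discussed above cannot occur.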
Re: [PATCH v2 4/5] block, scsi: Rework runtime power management
On 07/26/2018 03:52 PM, jianchao.wang wrote: > In addition, .runtime_suspend is invoked under spinlock and irq-disabled. > So sleep is forbidden here. > Please refer to rpm_suspend > > * This function must be called under dev->power.lock with interrupts disabled > __rpm_callback will unlock the spinlock before invoking the callback. Sorry for the noise. Thanks Jianchao
Re: [PATCH v2 4/5] block, scsi: Rework runtime power management
On 07/26/2018 10:45 AM, jianchao.wang wrote: > Hi Bart > > On 07/26/2018 06:26 AM, Bart Van Assche wrote: >> @@ -102,9 +109,11 @@ int blk_pre_runtime_suspend(struct request_queue *q) >> return ret; >> >> blk_pm_runtime_lock(q); >> +blk_set_preempt_only(q); > > We only stop non-RQF_PM request entering when RPM_SUSPENDING and RPM_RESUMING. > blk_pre_runtime_suspend should only _check_ whether runtime suspend is > allowed. > So we should not set preempt_only here. > > >> +percpu_ref_switch_to_atomic_sync(&q->q_usage_counter); In addition, .runtime_suspend is invoked under spinlock and irq-disabled. So sleep is forbidden here. Please refer to rpm_suspend * This function must be called under dev->power.lock with interrupts disabled Thanks Jianchao >> >> spin_lock_irq(q->queue_lock); >> -if (q->nr_pending) { >> +if (!percpu_ref_is_zero(&q->q_usage_counter)) { >> ret = -EBUSY; >> pm_runtime_mark_last_busy(q->dev); >> } else { >> @@ -112,6 +121,7 @@ int blk_pre_runtime_suspend(struct request_queue *q) >> } >> spin_unlock_irq(q->queue_lock); >> >> +percpu_ref_switch_to_percpu(&q->q_usage_counter); >> blk_pm_runtime_unlock(q); >
Re: [PATCH v2 3/5] block: Serialize queue freezing and blk_pre_runtime_suspend()
Hi Bart

On 07/26/2018 06:26 AM, Bart Van Assche wrote:
> +
> +void blk_pm_runtime_lock(struct request_queue *q)
> +{
> +	spin_lock(&q->rpm_lock);
> +	wait_event_interruptible_locked(q->rpm_wq,
> +			q->rpm_owner == NULL || q->rpm_owner == current);
> +	if (q->rpm_owner == NULL)
> +		q->rpm_owner = current;
> +	q->rpm_nesting_level++;
> +	spin_unlock(&q->rpm_lock);
> +}

The lock which wait_event_interruptible_locked wants to hold is the wq.lock.
Please refer to the comment of wait_event_interruptible_locked:

 * It must be called with wq.lock being held.  This spinlock is
 * unlocked while sleeping but @condition testing is done while lock
 * is held and when this macro exits the lock is held.

Thanks
Jianchao
Re: [PATCH v2 4/5] block, scsi: Rework runtime power management
Hi Bart

On 07/26/2018 06:26 AM, Bart Van Assche wrote:
> @@ -102,9 +109,11 @@ int blk_pre_runtime_suspend(struct request_queue *q)
> 		return ret;
>
> 	blk_pm_runtime_lock(q);
> +	blk_set_preempt_only(q);

We only stop non-RQF_PM requests from entering when RPM_SUSPENDING and
RPM_RESUMING. blk_pre_runtime_suspend should only _check_ whether runtime
suspend is allowed, so we should not set preempt_only here.

> +	percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
>
> 	spin_lock_irq(q->queue_lock);
> -	if (q->nr_pending) {
> +	if (!percpu_ref_is_zero(&q->q_usage_counter)) {
> 		ret = -EBUSY;
> 		pm_runtime_mark_last_busy(q->dev);
> 	} else {
> @@ -112,6 +121,7 @@ int blk_pre_runtime_suspend(struct request_queue *q)
> 	}
> 	spin_unlock_irq(q->queue_lock);
>
> +	percpu_ref_switch_to_percpu(&q->q_usage_counter);
> 	blk_pm_runtime_unlock(q);
Re: [PATCH] blk-mq: Avoid that a request queue stalls when restarting a shared hctx
On 07/23/2018 11:50 PM, Bart Van Assche wrote:
> The patch below fixes queue stalling when shared hctx marked for restart
> (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero. The
> root cause is that hctxs are shared between queues, but 'shared_hctx_restart'

The blk_mq_hw_ctx structure is also per request_queue. Please refer to
blk_mq_init_allocated_queue -> blk_mq_realloc_hw_ctxs

> belongs to the particular queue, which in fact may not need to be restarted,
> thus we return from blk_mq_sched_restart() and leave shared hctx of another
> queue never restarted.
>
> The fix is to make shared_hctx_restart counter belong not to the queue, but
> to tags, thereby counter will reflect real number of shared hctx needed to
> be restarted.
>
> During tests 1 hctx (set->nr_hw_queues) was used and all stalled requests
> were noticed in dd->fifo_list of mq-deadline scheduler.
>
> Seeming possible sequence of events:
>
> 1. Request A of queue A is inserted into dd->fifo_list of the scheduler.
>
> 2. Request B of queue A bypasses scheduler and goes directly to
>    hctx->dispatch.
>
> 3. Request C of queue B is inserted.
>
> 4. blk_mq_sched_dispatch_requests() is invoked, since hctx->dispatch is not
>    empty (request B is in the list) hctx is only marked for next restart
>    and request A is left in a list (see comment "So it's best to leave them
>    there for as long as we can. Mark the hw queue as needing a restart in
>    that case." in blk-mq-sched.c)
>
> 5. Eventually request B is completed/freed and blk_mq_sched_restart() is
>    called, but by chance hctx from queue B is chosen for restart and request C
>    gets a chance to be dispatched.

Request C is just inserted into queue B. If there is no restart mark there,
it will not be chosen.

blk_mq_sched_restart_hctx
	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
		return false;

If blk_mq_sched_restart chooses queue B, one of its hctxs must have the
SCHED_RESTART flag, and q->shared_hctx_restart must not be zero.

> 6. Eventually request C is completed/freed and blk_mq_sched_restart() is
>    called, but shared_hctx_restart for queue B is zero and we return without
>    attempt to restart hctx from queue A, thus request A is stuck forever.
>
> But stalling queue is not the only one problem with blk_mq_sched_restart().
> My tests show that those loops thru all queues and hctxs can be very costly,
> even with shared_hctx_restart counter, which aims to fix performance issue.

Currently, SCHED_RESTART is always set when there are requests on the
hctx->dispatch list in blk_mq_sched_dispatch_requests, and the no-driver-tag
case is the main reason why hctx->dispatch is not empty when there is an io
scheduler. Therefore, most of the time, blk_mq_sched_restart is invoked for
the no-driver-tag case. For the non-shared-tag case, it will wake up the
hctx. But for the shared-tag case, it is unnecessary, because the
sbitmap_queue wakeup hook will work and hctx_may_queue will avoid starvation
of the other ones. Therefore, the costly loop through the queues and hctxs
is unnecessary most of the time.

Thanks
Jianchao
Re: [RFC PATCH] blk-mq: move timeout handling from queue to tagset
Hi Keith

On 07/19/2018 01:45 AM, Keith Busch wrote:
>>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>> 	/*
>>> 	 * Request timeouts are handled as a forward rolling timer. If
>>> 	 * we end up here it means that no requests are pending and
>>> @@ -881,7 +868,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
>>> 				blk_mq_tag_idle(hctx);
>>> 		}
>>> 	}
>>> -	blk_queue_exit(q);

The tags sharing fairness mechanism between different request_queues cannot
work well here. When the timer is per-request_queue, if there is no request
on one request_queue, it can be idled. But now, with a per-tagset timer, we
cannot detect the idle one at all.

>>> +	timer_setup(&set->timer, blk_mq_timed_out_timer, 0);
>>> +	INIT_WORK(&set->timeout_work, blk_mq_timeout_work);
>>> [ ... ]
>>> --- a/include/linux/blk-mq.h
>>> +++ b/include/linux/blk-mq.h
>>> @@ -86,6 +86,8 @@ struct blk_mq_tag_set {
>>>
>>> 	struct blk_mq_tags **tags;
>>>
>>> +	struct timer_list timer;
>>> +	struct work_struct timeout_work;
>> Can the timer and timeout_work data structures be replaced by a single
>> delayed_work instance?
> I think so. I wanted to keep blk_add_timer relatively unchanged for this
> proposal, so I followed the existing pattern with the timer kicking the
> work. I don't see why that extra indirection is necessary, so I think
> it's a great idea. Unless anyone knows a reason not to, we can collapse
> this into a single delayed work for both mq and legacy as a prep patch
> before this one.

mod_delayed_work_on is very tricky in our scenario. It will grab the pending
work entry and queue it again:

delayed_work.timer triggers
  queue_work timeout_work
                                delayed_work.timer not pending
                                mod_delayed_work_on
                                  grab the pending timeout_work
                                  re-arm the timer

The timeout_work would not be run.

Thanks
Jianchao
Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
On 07/13/2018 06:24 AM, Bart Van Assche wrote:
> Hello Keith,
>
> Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> request completion was reported after request timeout processing had
> started, completion handling was skipped. The following code in
> blk_mq_complete_request() realized that:
>
> 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> 		__blk_mq_complete_request(rq);

Even before tejun's patch, we also had this for both the blk-mq and
blk-legacy code:

blk_rq_check_expired
	if (time_after_eq(jiffies, rq->deadline)) {
		list_del_init(&rq->timeout_list);
		/*
		 * Check if we raced with end io completion
		 */
		if (!blk_mark_rq_complete(rq))
			blk_rq_timed_out(rq);
	}

blk_complete_request
	if (!blk_mark_rq_complete(req))
		__blk_complete_request(req);

blk_mq_check_expired
	if (time_after_eq(jiffies, rq->deadline)) {
		if (!blk_mark_rq_complete(rq))
			blk_mq_rq_timed_out(rq, reserved);
	}

blk_mq_complete_request
	if (!blk_mark_rq_complete(rq))
		__blk_mq_complete_request(rq);

Thanks
Jianchao

> Since commit 12f5b9314545, if a completion occurs after request timeout
> processing has started, that completion is processed if the request has the
> state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request
> state unless the block driver timeout handler modifies it, e.g. by calling
> blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical
> behavior of scsi_times_out() is to queue sending of a SCSI abort and hence
> not to change the request state immediately. In other words, if a request
> completion occurs during or shortly after a timeout occurred then
> blk_mq_complete_request() will call __blk_mq_complete_request() and will
> complete the request, although that is not allowed because timeout handling
> has already started. Do you agree with this analysis?
Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
On 07/13/2018 06:24 AM, Bart Van Assche wrote: > On Thu, 2018-07-12 at 13:24 -0600, Keith Busch wrote: >> On Thu, Jul 12, 2018 at 06:16:12PM +, Bart Van Assche wrote: >>> What prevents that a request finishes and gets reused after the >>> blk_mq_req_expired() call has finished and before kref_get_unless_zero() is >>> called? Is this perhaps a race condition that has not yet been triggered by >>> any existing block layer test? Please note that there is no such race >>> condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling >>> again" - >>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.net_lists_linux-2Dblock_msg26489.html=DwIGaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ=zqZd2myYLkxjU6DWtRKpls-gvzEGEB4vv8bdYq5CiBs=-d79KAhEM83ShMp8xCHKoE6Dp5Gxf98L94DuamLEAaU=). >> >> I don't think there's any such race in the merged implementation >> either. If the request is reallocated, then the kref check may succeed, >> but the blk_mq_req_expired() check would surely fail, allowing the >> request to proceed as normal. The code comments at least say as much. > > Hello Keith, > > Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a > request completion was reported after request timeout processing had > started, completion handling was skipped. The following code in > blk_mq_complete_request() realized that: > > if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) > __blk_mq_complete_request(rq); > > Since commit 12f5b9314545, if a completion occurs after request timeout > processing has started, that completion is processed if the request has the > state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request > state unless the block driver timeout handler modifies it, e.g. by calling > blk_mq_end_request() or by calling blk_mq_requeue_request(). 
The typical > behavior of scsi_times_out() is to queue sending of a SCSI abort and hence > not to change the request state immediately. In other words, if a request > completion occurs during or shortly after a timeout occurred then > blk_mq_complete_request() will call __blk_mq_complete_request() and will > complete the request, although that is not allowed because timeout handling > has already started. Do you agree with this analysis? >

Oh, thank goodness to hear Bart say this. I was always saying the same thing
in the mail https://marc.info/?l=linux-block&m=152950093831738&w=2
Even though my voice is tiny, I definitely support Bart's point.

Thanks
Jianchao

> Thanks,
>
> Bart.
Re: [PATCH 0/3] blk-mq: improve IO perf in case of none io sched
On 06/28/2018 01:42 PM, Kashyap Desai wrote: > Ming - > > Performance drop is resolved on my setup, but may be some stability of the > kernel is caused due to this patch set. I have not tried without patch > set, but in case you can figure out if below crash is due to this patch > set, I can try potential fix as well. I am seeing kernel panic if I use > below three settings in my fio script. > > iodepth_batch=8 > iodepth_batch_complete_min=4 > iodepth_batch_complete_max=16 > > > Here is panic detail - > > [65237.831029] [ cut here ] > [65237.831031] list_add corruption. prev->next should be next > (bb1d0a41fdf8), but was 94d0f0c57240. (prev=94d0f0c57240). > [65237.831046] WARNING: CPU: 18 PID: 13897 at lib/list_debug.c:28 > __list_add_valid+0x6a/0x70 > [65237.831046] Modules linked in: mpt3sas raid_class ses enclosure > scsi_transport_sas megaraid_sas(OE) xt_CHECKSUM ipt_MASQUERADE tun > ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 > xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc > ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle > ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 > nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle > iptable_security iptable_raw ebtable_filter ebtables ip6table_filter > ip6_tables iptable_filter intel_rapl skx_edac nfit libnvdimm > x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass > crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel > snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel crypto_simd > cryptd > [65237.831079] snd_hda_codec snd_hda_core glue_helper snd_hwdep iTCO_wdt > snd_seq iTCO_vendor_support snd_seq_device pcspkr snd_pcm sg i2c_i801 > joydev wmi mei_me ipmi_si mei snd_timer acpi_power_meter ioatdma snd > acpi_pad ipmi_devintf ipmi_msghandler soundcore lpc_ich nfsd auth_rpcgss > nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ast i2c_algo_bit > drm_kms_helper 
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe > ahci libahci crc32c_intel libata mdio i2c_core dca dm_mirror > dm_region_hash dm_log dm_mod > [65237.831109] CPU: 18 PID: 13897 Comm: fio Kdump: loaded Tainted: G > OE 4.18.0-rc1+ #1 > [65237.831110] Hardware name: Supermicro Super Server/X11DPG-QT, BIOS 1.0 > 06/22/2017 > [65237.831112] RIP: 0010:__list_add_valid+0x6a/0x70 > [65237.831112] Code: 31 c0 48 c7 c7 90 08 ab a0 e8 22 b4 cd ff 0f 0b 31 c0 > c3 48 89 d1 48 c7 c7 40 08 ab a0 48 89 f2 48 89 c6 31 c0 e8 06 b4 cd ff > <0f> 0b 31 c0 c3 90 48 8b 07 48 b9 00 01 00 00 00 00 ad de 48 8b 57 > [65237.831130] RSP: 0018:bb1d0a41fdd8 EFLAGS: 00010286 > [65237.831131] RAX: RBX: 94d0ebf9 RCX: > 0006 > [65237.831132] RDX: RSI: 0086 RDI: > 94d91ec16970 > [65237.831132] RBP: db1cff59e980 R08: R09: > 0663 > [65237.831133] R10: 0004 R11: R12: > 0001 > [65237.831134] R13: R14: 94d0f0c57240 R15: > 94d0f0f04c40 > [65237.831135] FS: 7fad10d02740() GS:94d91ec0() > knlGS: > [65237.831135] CS: 0010 DS: ES: CR0: 80050033 > [65237.831136] CR2: 7fad10c2a004 CR3: 00085600e006 CR4: > 007606e0 > [65237.831137] DR0: DR1: DR2: > > [65237.831138] DR3: DR6: fffe0ff0 DR7: > 0400 > [65237.831138] PKRU: 5554 > [65237.831138] Call Trace: > [65237.831145] blk_mq_flush_plug_list+0x12d/0x260 > [65237.831150] blk_flush_plug_list+0xe7/0x240 > [65237.831152] blk_finish_plug+0x27/0x40 > [65237.831156] __x64_sys_io_submit+0xc9/0x180 > [65237.831162] do_syscall_64+0x5b/0x180 > [65237.831166] entry_SYSCALL_64_after_hwframe+0x44/0xa9

I guess we need to clean the list after list_splice_tail in the 1/1 patch as
follows:

@@ -1533,19 +1533,19 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 				struct list_head *list)
 {
...
+
+	spin_lock(&ctx->lock);
+	list_splice_tail(list, &ctx->rq_list);
+	INIT_LIST_HEAD(list); // ---> HERE !
 	blk_mq_hctx_mark_pending(hctx, ctx);
 	spin_unlock(&ctx->lock);

Thanks
Jianchao
Re: Regarding the recent new blk-mq timeout handling
On 06/12/2018 09:01 PM, jianchao.wang wrote:
> Hi ming
>
> Thanks for your kindly response.
>
> On 06/12/2018 06:17 PM, Ming Lei wrote:
>> On Tue, Jun 12, 2018 at 6:04 PM, jianchao.wang wrote:
>>> Hi Jens and Christoph
>>>
>>> In the recent commit of new blk-mq timeout handling, we don't have any
>>> protection on timed out request against the completion path. We just
>>> hold a request->ref count, it just could avoid the request tag to be
>>> released and life-recycle, but not completion.
>>>
>>> For the scsi mid-layer, what if a request is in error handler and normal
>>> completion come at the moment ?
>>
>> Per my understanding, now the protection needs to be done completely by
>> driver.
>
> But looks like the drivers have not prepared well to take over this work
> right now.

I modified the scsi-debug module as in the attachment
0001-scsi-debug-make-normal-completion-and-timeout-could-.patch and tried to
simulate the scenario where the timeout and completion paths occur
concurrently. The system runs into a crash easily; 4.17-rc7 survived this
test.

Maybe we could do as in the attachment
0001-blk-mq-protect-timed-out-request-against-completion-.patch and then
replace all the blk_mq_complete_request calls in the timeout path. Then we
could preserve the capability to protect the timed out request against the
completion path. With this patch, the test also survived.

Thanks
Jianchao

>> Thanks,
>> Ming Lei

From 640a67e7b4386ac42ee789f54dd0898ecd00f8f7 Mon Sep 17 00:00:00 2001
From: Jianchao Wang
Date: Tue, 12 Jun 2018 12:04:26 +0800
Subject: [PATCH] scsi-debug: make normal completion and timeout could occur
 concurrently

Invoke blk_abort_request to force the request timed out periodically, when
completing the request in workqueue context.

Signed-off-by: Jianchao Wang
---
 drivers/scsi/scsi_debug.c | 8
 1 file changed, 8 insertions(+)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 656c98e..2ca0280 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -4323,6 +4323,8 @@ static void setup_inject(struct sdebug_queue *sqp,
 	sqcp->inj_host_busy = !!(SDEBUG_OPT_HOST_BUSY & sdebug_opts);
 }

+static atomic_t g_abort_counter;
+
 /* Complete the processing of the thread that queued a SCSI command to this
  * driver. It either completes the command by calling cmnd_done() or
  * schedules a hr timer or work queue then returns 0. Returns
@@ -4459,6 +4461,11 @@ static int schedule_resp(struct scsi_cmnd *cmnd, struct sdebug_dev_info *devip,
 		sd_dp->issuing_cpu = raw_smp_processor_id();
 		sd_dp->defer_t = SDEB_DEFER_WQ;
 		schedule_work(&sd_dp->ew.work);
+		atomic_inc(&g_abort_counter);
+		if (atomic_read(&g_abort_counter)%2000 == 0) {
+			blk_abort_request(cmnd->request);
+			trace_printk("abort request tag %d\n", cmnd->request->tag);
+		}
 	}
 	if (unlikely((SDEBUG_OPT_Q_NOISE & sdebug_opts) &&
 		     (scsi_result == device_qfull_result)))
@@ -5843,6 +5850,7 @@ static int sdebug_driver_probe(struct device *dev)
 	struct Scsi_Host *hpnt;
 	int hprot;

+	atomic_set(&g_abort_counter, 0);
 	sdbg_host = to_sdebug_host(dev);
 	sdebug_driver_template.can_queue = sdebug_max_queue;
--
2.7.4

From fcc515b3a642c909e8b82d2a240014faff5acd44 Mon Sep 17 00:00:00 2001
From: Jianchao Wang
Date: Tue, 12 Jun 2018 21:20:13 +0800
Subject: [PATCH] blk-mq: protect timed out request against completion path

Signed-off-by: Jianchao Wang
---
 block/blk-mq.c         | 22 +++---
 include/linux/blk-mq.h |  1 +
 include/linux/blkdev.h |  6 ++
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6332940..2714a23 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -473,6 +473,7 @@ static void __blk_mq_free_request(struct request *rq)
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
 	const int sched_tag = rq->internal_tag;

+	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -509,7 +510,6 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));

-	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
 	if (refcount_dec_and_test(&rq->ref))
 		__blk_mq_free_request(rq);
 }
@@ -552,15 +552,17 @@ static void __blk_mq_complete_request_remote(void *data)
 	rq->q->softirq_done_fn(rq);
 }

-static void __blk_mq_complete_request(struct request *rq)
+/*
+ * The LLDD timeout path must invoke this interface to complete
+ * the request.
+ */
+void __blk_mq_complete_request(struct request *rq)
 {
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	bool shared = false;
 	int cpu;

-	if (cmpxchg(&rq->state, MQ_RQ_IN_FL
Re: Regarding the recent new blk-mq timeout handling
Hi ming

Thanks for your kind response.

On 06/12/2018 06:17 PM, Ming Lei wrote:
> On Tue, Jun 12, 2018 at 6:04 PM, jianchao.wang wrote:
>> Hi Jens and Christoph
>>
>> In the recent commit of new blk-mq timeout handling, we don't have any
>> protection on timed out request against the completion path. We just
>> hold a request->ref count, it just could avoid the request tag to be
>> released and life-recycle, but not completion.
>>
>> For the scsi mid-layer, what if a request is in error handler and normal
>> completion come at the moment ?
>
> Per my understanding, now the protection needs to be done completely by
> driver.

But it looks like the drivers are not prepared to take over this work right
now.

Thanks
Jianchao

> Thanks,
> Ming Lei
Regarding the recent new blk-mq timeout handling
Hi Jens and Christoph

In the recent commit of new blk-mq timeout handling, we don't have any
protection on the timed out request against the completion path. We just
hold a request->ref count; that can only prevent the request tag from being
released and recycled, not completion.

For the scsi mid-layer, what if a request is in the error handler and a
normal completion comes at the moment? Or do I miss anything about this
commit?

Thanks
Jianchao
Re: [PATCH] block: kyber: make kyber more friendly with merging
Hi Jens and Holger

Thanks for your kind responses. That's really appreciated. I will post the
next version based on Jens' patch.

Thanks
Jianchao

On 05/23/2018 02:32 AM, Holger Hoffstätte wrote:

This looks great but prevents kyber from being built as module, which is
AFAIK supposed to work (and works now):

..
  CC [M]  block/kyber-iosched.o
  Building modules, stage 2.
  MODPOST 313 modules
ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
..

It does build fine when compiled in, obviously. :)

>>> It's basically duplicating the contents of blk_attempt_plug_merge().
>>> I would suggest abstracting out the list loop and merge check
>>> into a helper, that could then both be called from kyber and the
>>> plug merge function.
>> See attached, prep patch and yours rebased on top of it.
> That was quick. :)
>
> Applies smoothly on top of my 4.16++ tree, now builds correctly as
> module and is reproducibly (slightly) faster even on my pedestrian
> SATA SSDs, now on par or occasionally even faster than mq-deadline.
> What's not to like? So:
>
> Tested-by: Holger Hoffstätte
Re: [PATCH] block: kyber: make kyber more friendly with merging
Hi Omar

Thanks for your kind response.

On 05/23/2018 04:02 AM, Omar Sandoval wrote:
> On Tue, May 22, 2018 at 10:48:29PM +0800, Jianchao Wang wrote:
>> Currently, kyber is very unfriendly with merging. kyber depends
>> on ctx rq_list to do merging, however, most of time, it will not
>> leave any requests in ctx rq_list. This is because even if tokens
>> of one domain is used up, kyber will try to dispatch requests
>> from other domain and flush the rq_list there.
>
> That's a great catch, I totally missed this.
>
> This approach does end up duplicating a lot of code with the blk-mq core
> even after Jens' change, so I'm curious if you tried other approaches.
> One idea I had is to try the bio merge against the kqd->rqs lists. Since
> that's per-queue, the locking overhead might be too high. Alternatively,

Yes, I used to have a patch that does as you say and tries the bio merge
against kqd->rqs directly. That patch looks even simpler. However, because
khd->lock is needed every time we try a bio merge, there may be high
contention overhead on khd->lock when the cpu-hctx mapping is not 1:1.

> you could keep the software queues as-is but add our own version of
> flush_busy_ctxs() that only removes requests of the domain that we want.
> If one domain gets backed up, that might get messy with long iterations,
> though.

Yes, I also considered this approach :)
But the long iterations over every ctx->rq_list look really inefficient.

>
> Regarding this approach, a couple of comments below.

...

>> 	}
>> @@ -379,12 +414,33 @@ static void kyber_exit_sched(struct elevator_queue *e)
>>  static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
>>  {
>>  	struct kyber_hctx_data *khd;
>> +	struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
>>  	int i;
>> +	int sd;
>>
>>  	khd = kmalloc_node(sizeof(*khd), GFP_KERNEL, hctx->numa_node);
>>  	if (!khd)
>>  		return -ENOMEM;
>>
>> +	khd->kcqs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
>> +				       GFP_KERNEL, hctx->numa_node);
>> +	if (!khd->kcqs)
>> +		goto err_khd;
>
> Why the double indirection of a percpu allocation per hardware queue
> here? With, say, 56 cpus and that many hardware queues, that's 3136
> pointers, which seems like overkill. Can't you just use the percpu array
> in the kqd directly, or make it per-hardware queue instead?

Oops, I forgot to change the nr_cpu_ids to hctx->nr_ctx. The mapping between
cpu and hctx has been set up when kyber_init_hctx is invoked, so we just
need to allocate hctx->nr_ctx * struct kyber_ctx_queue per khd.

...

>> +static int bio_sched_domain(const struct bio *bio)
>> +{
>> +	unsigned int op = bio->bi_opf;
>> +
>> +	if ((op & REQ_OP_MASK) == REQ_OP_READ)
>> +		return KYBER_READ;
>> +	else if ((op & REQ_OP_MASK) == REQ_OP_WRITE && op_is_sync(op))
>> +		return KYBER_SYNC_WRITE;
>> +	else
>> +		return KYBER_OTHER;
>> +}
>
> Please add a common helper for rq_sched_domain() and bio_sched_domain()
> instead of duplicating the logic.

Yes, I will do it in the next version.

Thanks
Jianchao
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi Ming

On 05/15/2018 08:56 PM, Ming Lei wrote:
> Looks a nice fix on nvme_create_queue(), but seems the change on
> adapter_alloc_cq() is missed in above patch.
>
> Could you prepare a formal one so that I may integrate it to V6?

Please refer to the following.

Thanks
Jianchao

From 9bb6db79901ef303cd40c4c911604bc053b1ad95 Mon Sep 17 00:00:00 2001
From: Jianchao Wang
Date: Wed, 16 May 2018 09:45:45 +0800
Subject: [PATCH] nvme-pci: set nvmeq->cq_vector after alloc cq/sq

Currently nvmeq->cq_vector is set before alloc cq/sq. If the alloc cq/sq
command timeout, nvme_suspend_queue will invoke free_irq for the nvmeq
because the cq_vector is valid, this will cause warning 'Trying to free
already-free IRQ xxx'. Set nvmeq->cq_vector after alloc cq/sq succeeds to
fix this.

Signed-off-by: Jianchao Wang
---
 drivers/nvme/host/pci.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fbc71fa..c830092 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1070,7 +1070,7 @@ static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)
 }

 static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
-		struct nvme_queue *nvmeq)
+		struct nvme_queue *nvmeq, int cq_vector)
 {
 	struct nvme_command c;
 	int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;
@@ -1085,7 +1085,7 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 	c.create_cq.cqid = cpu_to_le16(qid);
 	c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
 	c.create_cq.cq_flags = cpu_to_le16(flags);
-	c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);
+	c.create_cq.irq_vector = cpu_to_le16(cq_vector);

 	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 }
@@ -1450,6 +1450,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 	int result;
+	int cq_vector;

 	if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
 		unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
@@ -1462,15 +1463,21 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
 	 */
-	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
-	result = adapter_alloc_cq(dev, qid, nvmeq);
+	cq_vector = dev->num_vecs == 1 ? 0 : qid;
+	result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
 	if (result < 0)
-		goto release_vector;
+		goto out;

 	result = adapter_alloc_sq(dev, qid, nvmeq);
 	if (result < 0)
 		goto release_cq;

+	/*
+	 * set cq_vector after alloc cq/sq, otherwise, if alloc cq/sq command
+	 * timeout, nvme_suspend_queue will invoke free_irq for it and cause warning
+	 * 'Trying to free already-free IRQ xxx'
+	 */
+	nvmeq->cq_vector = cq_vector;
 	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
@@ -1478,13 +1485,13 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)

 	return result;

- release_sq:
+release_sq:
+	nvmeq->cq_vector = -1;
 	dev->online_queues--;
 	adapter_delete_sq(dev, qid);
- release_cq:
+release_cq:
 	adapter_delete_cq(dev, qid);
- release_vector:
-	nvmeq->cq_vector = -1;
+out:
 	return result;
 }
--
2.7.4
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi ming

On 05/16/2018 10:09 AM, Ming Lei wrote:
> So could you check if only the patch ("unquiesce admin queue after shutdown
> controller") can fix your IO hang issue?

I indeed tested this before fixing the warning. It could fix the IO hang
issue. :)

Thanks
Jianchao
Re: [PATCH V5 9/9] nvme: pci: support nested EH
Hi ming

On 05/11/2018 08:29 PM, Ming Lei wrote:
> +static void nvme_eh_done(struct nvme_eh_work *eh_work, int result)
> +{
> +	struct nvme_dev *dev = eh_work->dev;
> +	bool top_eh;
> +
> +	spin_lock(&dev->eh_lock);
> +	top_eh = list_is_last(&eh_work->list, &dev->eh_head);
> +	dev->nested_eh--;
> +
> +	/* Fail controller if the top EH can't recover it */
> +	if (!result)
> +		wake_up_all(&dev->eh_wq);
> +	else if (top_eh) {
> +		dev->ctrl_failed = true;
> +		nvme_eh_sched_fail_ctrl(dev);
> +		wake_up_all(&dev->eh_wq);
> +	}
> +
> +	list_del(&eh_work->list);
> +	spin_unlock(&dev->eh_lock);
> +
> +	dev_info(dev->ctrl.device, "EH %d: state %d, eh_done %d, top eh %d\n",
> +		eh_work->seq, dev->ctrl.state, result, top_eh);
> +	wait_event(dev->eh_wq, nvme_eh_reset_done(dev));

nested_eh is decreased before the EH exits, so a new EH will get a confusing
seq number. Please refer to the following log:

[ 1342.961869] nvme nvme0: Abort status: 0x0
[ 1342.961878] nvme nvme0: Abort status: 0x0
[ 1343.148341] nvme nvme0: EH 0: after shutdown, top eh: 1
[ 1403.828484] nvme nvme0: I/O 21 QID 0 timeout, disable controller
[ 1403.828603] nvme nvme0: EH 1: before shutdown
... warning logs are ignored here
[ 1403.984731] nvme nvme0: EH 0: state 4, eh_done -4, top eh 0  // EH0 goes to wait
[ 1403.984786] nvme nvme0: EH 1: after shutdown, top eh: 1
[ 1464.856290] nvme nvme0: I/O 22 QID 0 timeout, disable controller  // timeout again in EH 1
[ 1464.856411] nvme nvme0: EH 1: before shutdown  // a new EH has seq number 1

Is it expected that the new EH has seq number 1 instead of 2?

Thanks
Jianchao
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi ming

On 05/15/2018 08:33 AM, Ming Lei wrote:
> We still have to quiesce admin queue before canceling request, so looks
> the following patch is better, so please ignore the above patch and try
> the following one and see if your hang can be addressed:
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f509d37b2fb8..c2adc76472a8 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> 			dev->ctrl.admin_q = NULL;
> 			return -ENODEV;
> 		}
> -	} else
> -		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> +	}
>
> 	return 0;
> }
> @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
> 	 */
> 	if (shutdown)
> 		nvme_start_queues(&dev->ctrl);
> +
> +	/*
> +	 * Avoid to suck reset because timeout may happen during reset and
> +	 * reset may hang forever if admin queue is kept as quiesced
> +	 */
> +	blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> 	mutex_unlock(&dev->shutdown_lock);
> }

With the patch above and the patch below, neither the warning nor the IO
hang issue has reproduced so far.

@@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 	int result;
+	int cq_vector;

 	if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
 		unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
@@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
 	 */
-	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
-	result = adapter_alloc_cq(dev, qid, nvmeq);
+	cq_vector = dev->num_vecs == 1 ? 0 : qid;
+	result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
 	if (result < 0)
-		goto release_vector;
+		goto out;

 	result = adapter_alloc_sq(dev, qid, nvmeq);
 	if (result < 0)
 		goto release_cq;
-
+
+	nvmeq->cq_vector = cq_vector;
 	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
@@ -1479,12 +1679,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)

 	return result;

 release_sq:
+	nvmeq->cq_vector = -1;
 	dev->online_queues--;
 	adapter_delete_sq(dev, qid);
 release_cq:
 	adapter_delete_cq(dev, qid);
-	release_vector:
-	nvmeq->cq_vector = -1;
+ out:
 	return result;
 }

Thanks
Jianchao
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi ming On 05/14/2018 05:38 PM, Ming Lei wrote: >> Here is the deadlock scenario. >> >> nvme_eh_work // EH0 >> -> nvme_reset_dev //hold reset_lock >> -> nvme_setup_io_queues >> -> nvme_create_io_queues >> -> nvme_create_queue >> -> set nvmeq->cq_vector >> -> adapter_alloc_cq >> -> adapter_alloc_sq >> irq has not been requested >> io timeout >> nvme_eh_work //EH1 >> -> nvme_dev_disable >> -> quiesce the adminq //> here ! >> -> nvme_suspend_queue >> print out warning Trying to free >> already-free IRQ 133 >> -> nvme_cancel_request // complete >> the timeout admin request >> -> require reset_lock >> -> adapter_delete_cq > If the admin IO submitted in adapter_alloc_sq() is timed out, > nvme_dev_disable() in EH1 will complete it which is set as > REQ_FAILFAST_DRIVER, > then adapter_alloc_sq() should return error, and the whole reset in EH0 > should have been terminated immediately. Please refer to the following segment: static int nvme_create_queue(struct nvme_queue *nvmeq, int qid) { struct nvme_dev *dev = nvmeq->dev; int result; ... nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid; result = adapter_alloc_cq(dev, qid, nvmeq); if (result < 0) goto release_vector; result = adapter_alloc_sq(dev, qid, nvmeq); // if timeout and failed here if (result < 0) goto release_cq; nvme_init_queue(nvmeq, qid); result = queue_request_irq(nvmeq); if (result < 0) goto release_sq; return result; release_sq: dev->online_queues--; adapter_delete_sq(dev, qid); release_cq:// we will be here ! adapter_delete_cq(dev, qid);// another cq delete admin command will be sent out. release_vector: nvmeq->cq_vector = -1; return result; } > > I guess the issue should be that nvme_create_io_queues() ignores the failure. > > Could you dump the stack trace of EH0 reset task? So that we may see > where EH0 reset kthread hangs. 
root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2273/stack [<0>] blk_execute_rq+0xf7/0x150 [<0>] __nvme_submit_sync_cmd+0x94/0x110 [<0>] nvme_submit_sync_cmd+0x1b/0x20 [<0>] adapter_delete_queue+0xad/0xf0 [<0>] nvme_reset_dev+0x1b67/0x2450 [<0>] nvme_eh_work+0x19c/0x4b0 [<0>] process_one_work+0x3ca/0xaa0 [<0>] worker_thread+0x89/0x6c0 [<0>] kthread+0x18d/0x1e0 [<0>] ret_from_fork+0x24/0x30 [<0>] 0x root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2275/stack [<0>] nvme_eh_work+0x11a/0x4b0 [<0>] process_one_work+0x3ca/0xaa0 [<0>] worker_thread+0x89/0x6c0 [<0>] kthread+0x18d/0x1e0 [<0>] ret_from_fork+0x24/0x30 [<0>] 0x > >> -> adapter_delete_queue // submit to the adminq which has been >> quiesced. >> -> nvme_submit_sync_cmd >> -> blk_execute_rq >> -> wait_for_completion_io_timeout >> hang_check is true, so there is no hung task warning for >> this context >> >> EH0 submit cq delete admin command, but it will never be completed or timed >> out, because the admin request queue has >> been quiesced, so the reset_lock cannot be released, and EH1 cannot get >> reset_lock and make things forward. > The nvme_dev_disable() in outer EH(EH1 in above log) will complete all > admin command, which won't be retried because it is set as > REQ_FAILFAST_DRIVER, so nvme_cancel_request() will complete it in > nvme_dev_disable(). This cq delete admin command is sent out after EH 1's nvme_dev_disable has completed and failed the previously timed-out sq alloc admin command. Please refer to the code segment above. Thanks Jianchao
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi ming Please refer to my test log and analysis. [ 229.872622] nvme nvme0: I/O 164 QID 1 timeout, reset controller [ 229.872649] nvme nvme0: EH 0: before shutdown [ 229.872683] nvme nvme0: I/O 165 QID 1 timeout, reset controller [ 229.872700] nvme nvme0: I/O 166 QID 1 timeout, reset controller [ 229.872716] nvme nvme0: I/O 167 QID 1 timeout, reset controller [ 229.872733] nvme nvme0: I/O 169 QID 1 timeout, reset controller [ 229.872750] nvme nvme0: I/O 173 QID 1 timeout, reset controller [ 229.872766] nvme nvme0: I/O 174 QID 1 timeout, reset controller [ 229.872783] nvme nvme0: I/O 178 QID 1 timeout, reset controller [ 229.872800] nvme nvme0: I/O 182 QID 1 timeout, aborting [ 229.872900] nvme nvme0: I/O 185 QID 1 timeout, aborting [ 229.872975] nvme nvme0: I/O 190 QID 1 timeout, reset controller [ 229.872990] nvme nvme0: I/O 191 QID 1 timeout, aborting [ 229.873021] nvme nvme0: I/O 5 QID 2 timeout, aborting [ 229.873096] nvme nvme0: I/O 40 QID 2 timeout, aborting [ 229.874041] nvme nvme0: Abort status: 0x0 [ 229.874064] nvme nvme0: Abort status: 0x0 [ 229.874078] nvme nvme0: Abort status: 0x0 [ 229.874092] nvme nvme0: Abort status: 0x0 [ 229.874106] nvme nvme0: Abort status: 0x0 [ 230.060698] nvme nvme0: EH 0: after shutdown, top eh: 1 [ 290.840592] nvme nvme0: I/O 18 QID 0 timeout, disable controller [ 290.840746] nvme nvme0: EH 1: before shutdown [ 290.992650] [ cut here ] [ 290.992751] Trying to free already-free IRQ 133 [ 290.992783] WARNING: CPU: 6 PID: 227 at /home/will/u04/source_code/linux-stable/kernel/irq/manage.c:1549 __free_irq+0xf6/0x420 [ 290.993394] CPU: 6 PID: 227 Comm: kworker/u16:4 Kdump: loaded Not tainted 4.17.0-rc3+ #37 [ 290.993402] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017 [ 290.993418] Workqueue: nvme-reset-wq nvme_eh_work [ 290.993436] RIP: 0010:__free_irq+0xf6/0x420 ... 
[ 290.993541] Call Trace: [ 290.993581] free_irq+0x4c/0xa0 [ 290.993600] pci_free_irq+0x18/0x30 [ 290.993615] nvme_dev_disable+0x20b/0x720 [ 290.993745] nvme_eh_work+0xdd/0x4b0 [ 290.993866] process_one_work+0x3ca/0xaa0 [ 290.993960] worker_thread+0x89/0x6c0 [ 290.994017] kthread+0x18d/0x1e0 [ 290.994061] ret_from_fork+0x24/0x30 [ 338.912379] INFO: task kworker/u16:4:227 blocked for more than 30 seconds. [ 338.913348] Tainted: GW 4.17.0-rc3+ #37 [ 338.913549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 338.913809] kworker/u16:4 D0 227 2 0x8000 [ 338.913842] Workqueue: nvme-reset-wq nvme_eh_work [ 338.913857] Call Trace: [ 338.914002] schedule+0x52/0xe0 [ 338.914019] schedule_preempt_disabled+0x14/0x20 [ 338.914029] __mutex_lock+0x5b4/0xca0 [ 338.914228] nvme_eh_work+0x11a/0x4b0 [ 338.914344] process_one_work+0x3ca/0xaa0 [ 338.914440] worker_thread+0x89/0x6c0 [ 338.914496] kthread+0x18d/0x1e0 [ 338.914542] ret_from_fork+0x24/0x30 [ 338.914648] Here is the deadlock scenario. nvme_eh_work // EH0 -> nvme_reset_dev //hold reset_lock -> nvme_setup_io_queues -> nvme_create_io_queues -> nvme_create_queue -> set nvmeq->cq_vector -> adapter_alloc_cq -> adapter_alloc_sq irq has not been requested io timeout nvme_eh_work //EH1 -> nvme_dev_disable -> quiesce the adminq //> here ! -> nvme_suspend_queue print out warning Trying to free already-free IRQ 133 -> nvme_cancel_request // complete the timeout admin request -> require reset_lock -> adapter_delete_cq -> adapter_delete_queue // submit to the adminq which has been quiesced. -> nvme_submit_sync_cmd -> blk_execute_rq -> wait_for_completion_io_timeout hang_check is true, so there is no hung task warning for this context EH0 submits a cq delete admin command, but it will never be completed or time out, because the admin request queue has been quiesced; so the reset_lock cannot be released, and EH1 cannot get the reset_lock to make things move forward. Thanks Jianchao
Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
Hi Bart On 05/14/2018 12:03 PM, Bart Van Assche wrote: > On Mon, 2018-05-14 at 09:37 +0800, jianchao.wang wrote: >> In addition, on a 64bit system, how do you set up the timer with a 32bit >> deadline ? > > If timeout handling occurs less than (1 << 31) / HZ seconds after a request > has been > submitted then a 32-bit variable is sufficient to store the deadline. If you > want to > verify the implementation, please have a look at the int32_t variables in > blk_mq_check_expired(). > A 32-bit deadline is indeed enough to judge whether a request has timed out, but how is the expires value determined for __blk_add_timer -> mod_timer ? As we know, when a request is started, we need to arm a timer for it, and the expires value is an 'unsigned long'. Thanks Jianchao
Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
Hi Bart On 05/11/2018 11:26 PM, Bart Van Assche wrote: > The bug is in the above trace_printk() call: blk_rq_deadline() must only be > used > for the legacy block layer and not for blk-mq code. If you have a look at the > value > of the das.deadline field then one can see that the value of that field is > correct: blk_rq_deadline is used by __blk_add_timer, and __blk_add_timer is also used by blk_mq_add_timer. That is why I used blk_rq_deadline to print out the value in my test. On a 64-bit system das.deadline will be a correct value, but what happens if jiffies has crossed the 32-bit boundary ? > 0x550c - 0x37c0 = 7500 = 30 * 250. Does that mean that HZ = 250 on > the system > on which you ran this test? > Yes, the HZ is 250 Thanks Jianchao
Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
Hi bart On 05/11/2018 11:29 PM, Bart Van Assche wrote: > On Fri, 2018-05-11 at 14:35 +0200, Christoph Hellwig wrote: >>> It should be due to union blk_deadline_and_state. >>> +union blk_deadline_and_state { >>> + struct { >>> + uint32_t generation:30; >>> + uint32_t state:2; >>> + uint32_t deadline; >>> + }; >>> + unsigned long legacy_deadline; >>> + uint64_t das; >>> +}; >> >> Yikes. Or we just move it into a separate field. This patch already >> shrinks struct request a lot, so I'd rather do that to keep it simple. > > Hello Christoph, > > Are you perhaps referring to the legacy_deadline field? Have you noticed that > Jianchao used a legacy block layer function in blk-mq code and that that is > why a wrong value for the deadline appeared in the trace output? > I use blk_rq_deadline because __blk_add_timer uses it. - /* * If the timer isn't already pending or this timeout is earlier * than an existing one, modify the timer. Round up to next nearest * second. */ expiry = blk_rq_timeout(round_jiffies_up(blk_rq_deadline(req))); - if (!timer_pending(&q->timeout) || time_before(expiry, q->timeout.expires)) { unsigned long diff = q->timeout.expires - expiry; This is also the reason why there was no timeout issue in testing: blk_rq_timeout will return a reasonable value. In addition, on a 64bit system, how do you set up the timer with a 32bit deadline ? Thanks Jianchao
Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
Hi bart I add debug log in blk_mq_add_timer as following void blk_mq_add_timer(struct request *req, enum mq_rq_state old, enum mq_rq_state new) { struct request_queue *q = req->q; if (!req->timeout) req->timeout = q->rq_timeout; if (!blk_mq_rq_set_deadline(req, jiffies + req->timeout, old, new)) WARN_ON_ONCE(true); trace_printk("jiffies %lx to %x ldl %lx gen %u dl %x\n", jiffies, req->timeout, blk_rq_deadline(req), req->das.generation, req->das.deadline); return __blk_add_timer(req); } And get the log below: jbd2/sda2-8-320 [000] ...195.030824: blk_mq_add_timer: jiffies 37c0 to 1d4c ldl 550c4000 gen 0 dl 550c kworker/0:1H-136 [000] ...195.031822: blk_mq_add_timer: jiffies 37c0 to 1d4c ldl 550c4000 gen 0 dl 550c kworker/6:1H-244 [006] ...195.041695: blk_mq_add_timer: jiffies 37c3 to 1d4c ldl 550f4000 gen 0 dl 550f kworker/6:1H-244 [006] ...195.041954: blk_mq_add_timer: jiffies 37c3 to 1d4c ldl 550f4000 gen 0 dl 550f blk_rq_deadline returns 550c4000, which looks really crazy. It should be due to union blk_deadline_and_state. +union blk_deadline_and_state { + struct { + uint32_t generation:30; + uint32_t state:2; + uint32_t deadline; + }; + unsigned long legacy_deadline; + uint64_t das; +}; And generation never changes. Thanks Jianchao
Re: [PATCH V4 0/7] nvme: pci: fix & improve timeout handling
Hi ming I did some tests on my machine. [ 598.828578] nvme nvme0: I/O 51 QID 4 timeout, disable controller This should be a timeout in nvme_reset_dev->nvme_wait_freeze. [ 598.828743] nvme nvme0: EH 1: before shutdown [ 599.013586] nvme nvme0: EH 1: after shutdown [ 599.137197] nvme nvme0: EH 1: after recovery EH 1 has marked the state LIVE, [ 599.137241] nvme nvme0: failed to mark controller state 1 so EH 0 failed to mark the state LIVE and the card was removed. This should not be expected with nested EH. [ 599.137322] nvme nvme0: Removing after probe failure status: 0 [ 599.326539] nvme nvme0: EH 0: after recovery [ 599.326760] nvme0n1: detected capacity change from 128035676160 to 0 [ 599.457208] nvme nvme0: failed to set APST feature (-19) nvme_reset_dev should identify whether it is nested. Thanks Jianchao
Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
Hi ming On 05/04/2018 04:02 PM, Ming Lei wrote: >> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing >> another interface. >> Then it is more convenient to ensure that there will be only one resetting >> instance running. >> > But as you mentioned above, reset_work has to be splitted into two > contexts for handling IO timeout during wait_freeze in reset_work, > so single instance of nvme_reset_ctrl() may not work well. I mean the EH kthread and the reset_work, both of which could reset the ctrl, instead of the pre and post reset contexts. Honestly, I doubt whether it is worth trying to recover from [1]. The EH kthread solution could make things easier, but the code for recovering from [1] has become really complicated, and it is more difficult to unify nvme-pci, rdma and fc. How about just failing the reset, as in Keith's solution ? [1] io timeout when nvme_reset_work or the new nvme_post_reset_dev invokes nvme_wait_freeze. Thanks Jianchao
Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
Oh sorry. On 05/04/2018 02:10 PM, jianchao.wang wrote: > nvme_error_handler should invoke nvme_reset_ctrl instead of introducing > another interface. > Then it is more convenient to ensure that there will be only one resetting > instance running. The ctrl is still in the RESETTING state there, so nvme_reset_ctrl cannot work. Thanks Jianchao
Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
Hi ming On 05/04/2018 12:24 PM, Ming Lei wrote: >> Just invoke nvme_dev_disable in nvme_error_handler context and hand over the >> other things >> to nvme_reset_work as the v2 patch series seems clearer. > That way may not fix the current race: nvme_dev_disable() will > quiesce/freeze queue again when resetting is in-progress, then > nothing can move on. With timeout quiesce and the nvme EH kthread, nvme_dev_disable could ensure that all the in-flight requests are handled. When we queue the reset_work in nvme_error_handler, there will be no timeout anymore. If there is a timeout due to admin io in reset_work, this is expected: reset_work will be interrupted and fail. If io times out while reset_work waits for freeze, things cannot move on because reset_work is waiting. We have multiple methods to handle this: 1. as Keith's solution, change ctrl state to DELETING and fail the reset_work. 2. as in the current patch set, try to recover it, but we have to split reset_work into two contexts. Actually, there are many other places where nvme_dev_disable will be invoked. > > One big change in V3 is to run nvme_dev_disable() & nvme_pre_reset_dev() > in the EH thread, and make sure queues are started once > nvme_pre_reset_dev() returns. But as you pointed, it may not work well > enough if admin requests are timed-out in nvme_pre_reset_dev(). > >> The primary target of nvme_error_handler is to drain IOs and disable >> controller. >> reset_work could take over the left things of the recovery work > Draining IOs isn't necessary, we can remove freezing queues & wait_freeze() > from nvme_dev_disable() & nvme_reset_work() for non-shutdown, that can be > easy to fix all these races since the .eh_reset_work isn't needed any more, > because quiescing is enough to prevent IOs from entering driver/device. > > The only issue is that more requests may be failed when reset failed, > so I don't want to try that, and just follows the current behaviour.
> > IMO the primary goal of nvme EH is to recover controller, that is > what nvme_dev_disable() and resetting should do, if IO isn't drained, > timeout still may be triggered. > > The big trouble is about percpu_ref, once the q_usage_counter is killed > to start freezing for preventing IO from entering block layer, we have > to run blk_mq_freeze_queue_wait() before unfreezing the usage counter. If > blk_mq_freeze_queue_wait() isn't needed, things can be much easier for us. Sorry for my bad description. I mean nvme_dev_disable will drain all the in-flight requests and requeue or fail them, so we will not get any timeout again. In fact, I used to talk with Keith about removing freezing queues & wait_freeze for non-shutdown. There is an obstacle: nvme_dev_add -> blk_mq_update_nr_hw_queues. When the ctrl comes back, the number of hw queues may be changed. Even if we move the wait_freeze, blk_mq_update_nr_hw_queues will do that. On the other hand, if we do not freeze the queue, the requests that are submitted to the dying hw queue during the reset will be sacrificed. > >> IMO, modify the nvme_reset_ctrl_sync in nvme_error_handler to >> nvme_reset_ctrl should be ok. >> If the admin ios in reset_work timeout, the nvme_error_handler could also >> provide help to complete them. >> > Right, that is one goal of this patchset. > >>> There are actually two issues here, both are covered in this patchset: >>> >>> 1) when nvme_wait_freeze() is for draining IO, controller error may >>> happen again, this patch can shutdown & reset controller again to >>> recover it. >>> >> We could split the nvme_reset_work into two contexts instead of introduce >> another nvme_eh_reset. >> Otherwise, the code looks really complicated.
> Yeah, that is what I plan to do in V4, introduce .eh_pre_reset_work and > .eh_post_reset_work, and run the following things in .eh_pre_reset_work: > > - nvme_pre_reset_dev() > - schedule .eh_post_reset_work > > and run draining IO & updating controller state in .eh_post_reset_work. > .eh_pre_reset_work will be scheduled in nvme_error_handler(). > > This way may address the issue of admin req timeout in nvme_pre_reset_dev(), > what do you think of this approach? nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface. Then it is more convenient to ensure that there will be only one resetting instance running. Thanks Jianchao
Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
Hi Ming Thanks for your kind response. On 05/03/2018 06:08 PM, Ming Lei wrote: > nvme_eh_reset() can move on, if controller state is either CONNECTING or > RESETTING, nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING) won't > be called in nvme_eh_reset(), and nvme_pre_reset_dev() will be called > directly, then follows scheduling of the eh_reset_work. We have to consider another context, nvme_reset_work. There are two contexts that will invoke nvme_pre_reset_dev, and nvme_eh_reset could invoke nvme_pre_reset_dev even if the state is RESETTING or CONNECTING. For example, with nvme_error_handler (A) racing against nvme_reset_ctrl (B): B: change state to RESETTING -> queue reset work; A: nvme_dev_disable -> nvme_eh_reset -> nvme_pre_reset_dev -> change state to CONNECTING; B: nvme_reset_work -> nvme_pre_reset_dev -> change state fails -> nvme_remove_dead_ctrl. There seem to be other race scenarios between them. Just invoke nvme_dev_disable in nvme_error_handler context and hand over the other things to nvme_reset_work, as the v2 patch series seems clearer. The primary target of nvme_error_handler is to drain IOs and disable the controller; reset_work could take over the remaining recovery work. IMO, modifying the nvme_reset_ctrl_sync in nvme_error_handler to nvme_reset_ctrl should be ok. If the admin ios in reset_work time out, the nvme_error_handler could also provide help to complete them. > There are actually two issues here, both are covered in this patchset: > > 1) when nvme_wait_freeze() is for draining IO, controller error may > happen again, this patch can shutdown & reset controller again to > recover it. > We could split the nvme_reset_work into two contexts instead of introducing another nvme_eh_reset. Otherwise, the code looks really complicated. In addition, the fix for this issue could be deferred until after your great EH kthread solution is submitted. Thanks Jianchao > 2) there is also another issue: queues may not be put into freeze state > in nvme_dev_disable(), then nvme_wait_freeze() will wait forever.
> This issue is addressed by the 4th patch. > > Both can be observed in blktests block/011 and verified by these patches. > > Thanks for your great review, please let me know if there are other > issues.
Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
Hi ming On 05/03/2018 11:17 AM, Ming Lei wrote: > static int io_queue_depth_set(const char *val, const struct kernel_param *kp) > @@ -1199,7 +1204,7 @@ static enum blk_eh_timer_return nvme_timeout(struct > request *req, bool reserved) > if (nvme_should_reset(dev, csts)) { > nvme_warn_reset(dev, csts); > nvme_dev_disable(dev, false, true); > - nvme_reset_ctrl(&dev->ctrl); > + nvme_eh_reset(dev); > return BLK_EH_HANDLED; > } > > @@ -1242,7 +1247,7 @@ static enum blk_eh_timer_return nvme_timeout(struct > request *req, bool reserved) > "I/O %d QID %d timeout, reset controller\n", > req->tag, nvmeq->qid); > nvme_dev_disable(dev, false, true); > - nvme_reset_ctrl(&dev->ctrl); > + nvme_eh_reset(dev); w/o the 8th patch, invoking nvme_eh_reset in nvme_timeout is dangerous. nvme_pre_reset_dev will send a lot of admin io when initializing the controller. If these admin ios time out, nvme_timeout cannot handle them because the timeout work is sleeping, waiting for the admin ios. In addition, even if we take nvme_wait_freeze out of nvme_eh_reset and put it into another context, the ctrl state is still CONNECTING, so nvme_eh_reset cannot move forward. Actually, I used to report this issue to Keith. I met an io hang when the controller died in nvme_reset_work -> nvme_wait_freeze. As you know, nvme_reset_work cannot be scheduled because it is waiting. Here is Keith's commit for this: http://lists.infradead.org/pipermail/linux-nvme/2018-February/015603.html Thanks Jianchao
Re: [PATCH V2 1/5] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
Hi Ming On 05/02/2018 11:33 AM, Ming Lei wrote: > No, there isn't such race, the 'mod_timer' doesn't make a difference > because 'q->timeout_off' will be visible in new work func after > cancel_work_sync() returns. So even the timer is expired, work func > still returns immediately. Yes, you are right. Even if the timer is set up, the timeout work will still return immediately. Thanks Jianchao
Re: [PATCH V2 5/5] nvme: pci: simplify timeout handling
Hi Ming On 05/02/2018 12:54 PM, Ming Lei wrote: >> We need to return BLK_EH_RESET_TIMER in nvme_timeout then: >> 1. defer the completion. we can't unmap the io request before close the >> controller totally, so not BLK_EH_HANDLED. >> 2. nvme_cancel_request could complete it. blk_mq_complete_request is invoked >> by nvme_cancel_request, so not BLK_EH_NOT_HANDLED. > We don't need to change return value of .timeout() any more after > calling nvme_quiesce_timeout(): > > Thanks to the single EH kthread, once nvme_quiesce_timeout() returns, all > timed-out requests have been handled already. Some of them may be completed, > and others may be handled as RESET_TIMER, but none of them are handled as > NOT_HANDLED because nvme_timeout() won't return that value. > > So the following blk_mq_tagset_busy_iter(nvme_cancel_request) can handle > all in-flight requests because timeout is drained and quiesced. The key point here is that we cannot unmap the io requests before the controller is closed completely. The nvme controller may still hold the command after we complete and unmap the io request in nvme_timeout, which would cause memory corruption. So we cannot just schedule the eh recovery kthread and then return BLK_EH_HANDLED. We need to defer the completion of timeout requests until the controller has been closed completely in nvme_dev_disable. Thanks Jianchao
Re: [PATCH V2 5/5] nvme: pci: simplify timeout handling
Hi Ming On 04/29/2018 11:41 PM, Ming Lei wrote: > + > static enum blk_eh_timer_return nvme_timeout(struct request *req, bool > reserved) > { > struct nvme_iod *iod = blk_mq_rq_to_pdu(req); > @@ -1197,8 +1297,7 @@ static enum blk_eh_timer_return nvme_timeout(struct > request *req, bool reserved) > */ > if (nvme_should_reset(dev, csts)) { > nvme_warn_reset(dev, csts); > - nvme_dev_disable(dev, false); > - nvme_reset_ctrl(&dev->ctrl); > + nvme_eh_schedule(dev); > return BLK_EH_HANDLED; > } > > @@ -1224,8 +1323,8 @@ static enum blk_eh_timer_return nvme_timeout(struct > request *req, bool reserved) > dev_warn(dev->ctrl.device, > "I/O %d QID %d timeout, disable controller\n", > req->tag, nvmeq->qid); > - nvme_dev_disable(dev, false); > nvme_req(req)->flags |= NVME_REQ_CANCELLED; > + nvme_eh_schedule(dev); > return BLK_EH_HANDLED; > default: > break; > @@ -1240,14 +1339,12 @@ static enum blk_eh_timer_return nvme_timeout(struct > request *req, bool reserved) > dev_warn(dev->ctrl.device, > "I/O %d QID %d timeout, reset controller\n", > req->tag, nvmeq->qid); > - nvme_dev_disable(dev, false); > - nvme_reset_ctrl(&dev->ctrl); > - > /* > * Mark the request as handled, since the inline shutdown > * forces all outstanding requests to complete.
> */ > nvme_req(req)->flags |= NVME_REQ_CANCELLED; > + nvme_eh_schedule(dev); > return BLK_EH_HANDLED; > } > > @@ -2246,8 +2343,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool > shutdown) > if (pci_is_enabled(pdev)) { > u32 csts = readl(dev->bar + NVME_REG_CSTS); > > - if (dev->ctrl.state == NVME_CTRL_LIVE || > - dev->ctrl.state == NVME_CTRL_RESETTING) { > + if (shutdown && (dev->ctrl.state == NVME_CTRL_LIVE || > + dev->ctrl.state == NVME_CTRL_RESETTING)) { > nvme_start_freeze(&dev->ctrl); > frozen = true; > } > @@ -2281,11 +2378,23 @@ static void nvme_dev_disable(struct nvme_dev *dev, > bool shutdown) > for (i = dev->ctrl.queue_count - 1; i >= 0; i--) > nvme_suspend_queue(&dev->queues[i]); > > + /* safe to sync timeout after queues are quiesced */ > + nvme_quiesce_timeout(&dev->ctrl); > + blk_quiesce_timeout(dev->ctrl.admin_q); > + > nvme_pci_disable(dev); > > + /* > + * Both timeout and interrupt handler have been drained, and all > + * in-flight requests will be canceled now. > + */ > blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl); > blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, > &dev->ctrl); We need to return BLK_EH_RESET_TIMER in nvme_timeout then: 1. defer the completion; we can't unmap the io request before closing the controller completely, so not BLK_EH_HANDLED. 2. nvme_cancel_request could complete it; blk_mq_complete_request is invoked by nvme_cancel_request, so not BLK_EH_NOT_HANDLED. Thanks Jianchao
Re: [PATCH V2 1/5] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
Hi ming On 04/29/2018 11:41 PM, Ming Lei wrote: > > +static void __blk_unquiesce_timeout(struct request_queue *q) > +{ > + unsigned long flags; > + > + spin_lock_irqsave(q->queue_lock, flags); > + q->timeout_off = false; > + spin_unlock_irqrestore(q->queue_lock, flags); > +} > + > +void blk_unquiesce_timeout(struct request_queue *q) > +{ > + __blk_unquiesce_timeout(q); > + mod_timer(&q->timeout, jiffies + q->rq_timeout); > +} > +EXPORT_SYMBOL(blk_unquiesce_timeout); > + > +void blk_quiesce_timeout(struct request_queue *q) > +{ > + unsigned long flags; > + > + spin_lock_irqsave(q->queue_lock, flags); > + q->timeout_off = true; > + spin_unlock_irqrestore(q->queue_lock, flags); > + > + del_timer_sync(&q->timeout); > + cancel_work_sync(&q->timeout_work); > +} > +EXPORT_SYMBOL(blk_quiesce_timeout); > + > /** > * blk_sync_queue - cancel any pending callbacks on a queue > * @q: the queue > @@ -408,8 +437,7 @@ EXPORT_SYMBOL(blk_stop_queue); > */ > void blk_sync_queue(struct request_queue *q) > { > - del_timer_sync(&q->timeout); > - cancel_work_sync(&q->timeout_work); > + blk_quiesce_timeout(q); > > if (q->mq_ops) { > struct blk_mq_hw_ctx *hctx; > @@ -421,6 +449,8 @@ void blk_sync_queue(struct request_queue *q) > } else { > cancel_delayed_work_sync(&q->delay_work); > } > + > + __blk_unquiesce_timeout(q); > } > EXPORT_SYMBOL(blk_sync_queue); > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 0dc9e341c2a7..890dd3b3e3e1 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -901,6 +901,15 @@ static void blk_mq_timeout_work(struct work_struct *work) > }; > struct blk_mq_hw_ctx *hctx; > int i; > + bool timeout_off; > + unsigned long flags; > + > + spin_lock_irqsave(q->queue_lock, flags); > + timeout_off = q->timeout_off; > + spin_unlock_irqrestore(q->queue_lock, flags); > + > + if (timeout_off) > + return; Looks like there is still a race, as follows, with blk_quiesce_timeout (A) running against blk_mq_timeout_work (B): B: timeout_off = q->timeout_off; // still false; A: q->timeout_off = true; A: del_timer_sync(&q->timeout); B: mod_timer; A: cancel_work_sync(&q->timeout_work); How about this: introduce a new request queue flag QUEUE_FLAG_TIMEOUT_QUIESCED void blk_quiesce_queue_timeout(struct request_queue *q) { unsigned long flags; spin_lock_irqsave(q->queue_lock, flags); queue_flag_set(QUEUE_FLAG_TIMEOUT_QUIESCED, q); spin_unlock_irqrestore(q->queue_lock, flags); synchronize_rcu_bh(); cancel_work_sync(&q->timeout_work); /* * The timeout work will not be invoked anymore. */ del_timer_sync(&q->timeout); } void blk_set_queue_dying(struct request_queue *q) { spin_lock_irq(q->queue_lock); @@ -867,6 +905,9 @@ static void blk_rq_timed_out_timer(struct timer_list *t) { struct request_queue *q = from_timer(q, t, timeout); + if (blk_queue_timeout_quiesced(q)) + return; kblockd_schedule_work(&q->timeout_work); } Thanks Jianchao
Re: [PATCH 1/2] nvme: pci: simplify timeout handling
Hi ming On 04/29/2018 09:36 AM, Ming Lei wrote: > On Sun, Apr 29, 2018 at 6:27 AM, Ming Lei <tom.leim...@gmail.com> wrote: >> On Sun, Apr 29, 2018 at 5:57 AM, Ming Lei <tom.leim...@gmail.com> wrote: >>> On Sat, Apr 28, 2018 at 10:00 PM, jianchao.wang >>> <jianchao.w.w...@oracle.com> wrote: >>>> Hi ming >>>> >>>> On 04/27/2018 10:57 PM, Ming Lei wrote: >>>>> I may not understand your point, once blk_sync_queue() returns, the >>>>> timer itself is deactivated, meantime the synced .nvme_timeout() only >>>>> returns EH_NOT_HANDLED before the deactivation. >>>>> >>>>> That means this timer won't be expired any more, so could you explain >>>>> a bit why timeout can come again after blk_sync_queue() returns >>>> >>>> Please consider the following case: >>>> >>>> blk_sync_queue >>>> -> del_timer_sync >>>> blk_mq_timeout_work >>>> -> blk_mq_check_expired // return the timeout >>>> value >>>> -> blk_mq_terminate_expired >>>> -> .timeout //return EH_NOT_HANDLED >>>> -> mod_timer // setup the timer again based on >>>> the result of blk_mq_check_expired >>>> -> cancel_work_sync >>>> So after the blk_sync_queue, the timer may come back again, then the >>>> timeout work. >>> >>> OK, I was trying to avoid to use blk_abort_request(), but looks we >>> may have to depend on it or similar way. >>> >>> BTW, that means blk_sync_queue() has been broken, even though the uses >>> in blk_cleanup_queue(). >>> >>> Another approach is to introduce one perpcu_ref of >>> 'q->timeout_usage_counter' for >>> syncing timeout, seems a bit over-kill too, but simpler in both theory >>> and implement. >> >> Or one timout_mutex is enough. > > Turns out it is SRCU. > After splitting the timeout path into two parts, timer and workqueue, if we don't drain the in-flight requests, the request_queue->timeout timer and the timeout work look like a 'chicken and egg' problem. How about introducing a flag to disable triggering of the timeout work ? Thanks Jianchao
Re: [PATCH 1/2] nvme: pci: simplify timeout handling
Hi Ming and Keith

Let me expand on this in more detail here. :)

On 04/28/2018 09:35 PM, Keith Busch wrote:
>> Actually there isn't the case before, even for legacy path, one
>> .timeout() handles one request only.

Yes, .timeout should be invoked for every timed-out request, and .timeout
should also handle only this one request in principle. However,
nvme_timeout will invoke nvme_dev_disable.

> That's not quite what I was talking about.
>
> Before, only the command that was about to be sent to the driver's
> .timeout() was marked completed. The driver could (and did) complete
> other timed out commands in a single .timeout(), and the tag would
> clear, so we could handle all timeouts in a single .timeout().

I think Keith is saying that before this new blk-mq timeout
implementation, the logic of blk_mq_timeout_work was:

  get _only_ _one_ timeout request
  mark it completed
  invoke .timeout; in nvme, it is nvme_timeout
    then nvme_dev_disable is invoked, and thus other requests could be
    completed by blk_mq_complete_request because they have not been
    marked completed

> Now, blk-mq marks all timed out commands as aborted prior to calling
> the driver's .timeout(). If the driver completes any of those commands,
> the tag does not clear, so the driver's .timeout() just gets to be called
> again for commands it already reaped.
>

After the new blk-mq timeout implementation:

  set the aborted_gstate of _all_ the timeout requests
  invoke the .timeout one by one
  for the first timeout request's .timeout (in nvme, it is nvme_timeout):
    nvme_dev_disable is invoked and tries to complete all the in-flight
    requests through blk_mq_complete_request
    but timeout requests whose aborted_gstate has been set cannot be
    handled by blk_mq_complete_request
    so some requests are leaked by nvme_dev_disable
  these residual timeout requests will still be handled by
  blk_mq_timeout_work through invoking .timeout one by one

Thanks
Jianchao
Re: [PATCH 1/2] nvme: pci: simplify timeout handling
Hi ming

On 04/27/2018 10:57 PM, Ming Lei wrote:
> I may not understand your point, once blk_sync_queue() returns, the
> timer itself is deactivated, meantime the synced .nvme_timeout() only
> returns EH_NOT_HANDLED before the deactivation.
>
> That means this timer won't be expired any more, so could you explain
> a bit why timeout can come again after blk_sync_queue() returns

Please consider the following case:

blk_sync_queue
  -> del_timer_sync
                     blk_mq_timeout_work
                       -> blk_mq_check_expired // return the timeout value
                       -> blk_mq_terminate_expired
                         -> .timeout // return EH_NOT_HANDLED
                       -> mod_timer // set up the timer again based on
                                    // the result of blk_mq_check_expired
  -> cancel_work_sync

So after the blk_sync_queue, the timer may come back again, then the
timeout work.

Thanks
Jianchao
Re: testing io.low limit for blk-throttle
Hi Tejun and Joseph

On 04/27/2018 02:32 AM, Tejun Heo wrote:
> Hello,
>
> On Tue, Apr 24, 2018 at 02:12:51PM +0200, Paolo Valente wrote:
>> +Tejun (I guess he might be interested in the results below)
>
> Our experiments didn't work out too well either. At this point, it
> isn't clear whether io.low will ever leave experimental state. We're
> trying to find a working solution.

Would you please take a look at the following two patches.

https://marc.info/?l=linux-block&m=152325456307423&w=2
https://marc.info/?l=linux-block&m=152325457607425&w=2

In addition, when I tested blk-throtl io.low on an NVMe card, I always hit
the case that even if the iops has been lower than the io.low limit for a
while, the downgrade always fails because the group is not judged idle,
due to this check:

  tg->latency_target && tg->bio_cnt &&
  tg->bad_bio_cnt * 5 < tg->bio_cnt

The latency always looks well even when the sum of the two groups' iops
has reached the top, so I disabled this check in my test; plus the 2
patches above, io.low could basically work.

My NVMe card's max bps is ~600M, and max iops is ~160k.
Here is my config:

io.low riops=5 wiops=5 rbps=209715200 wbps=209715200 idle=200 latency=10
io.max riops=15

There are two cgroups in my test, both of them have the same config.

In addition, I say "basically work" because the iops of the two cgroups
will jump up and down. For example, when I launched one fio test per
cgroup, the iops waved as follows:

group0  30k  50k  70k  60k  40k
group1 120k 100k  80k  90k 110k

However, if I launched two fio tests only in one cgroup, the iops of the
two tests could stay at about 70k~80k.

Could you help to explain this scenario ?

Thanks in advance
Jianchao
Re: [PATCH 1/2] nvme: pci: simplify timeout handling
On 04/26/2018 11:57 PM, Ming Lei wrote:
> Hi Jianchao,
>
> On Thu, Apr 26, 2018 at 11:07:56PM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> Thanks for your wonderful solution. :)
>>
>> On 04/26/2018 08:39 PM, Ming Lei wrote:
>>> +/*
>>> + * This one is called after queues are quiesced, and no in-flight timeout
>>> + * and nvme interrupt handling.
>>> + */
>>> +static void nvme_pci_cancel_request(struct request *req, void *data,
>>> +		bool reserved)
>>> +{
>>> +	/* make sure timed-out requests are covered too */
>>> +	if (req->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) {
>>> +		req->aborted_gstate = 0;
>>> +		req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
>>> +	}
>>> +
>>> +	nvme_cancel_request(req, data, reserved);
>>> +}
>>> +
>>>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>>  {
>>>  	int i;
>>> @@ -2223,10 +2316,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>>  	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
>>>  		nvme_suspend_queue(&dev->queues[i]);
>>>
>>> +	/*
>>> +	 * safe to sync timeout after queues are quiesced, then all
>>> +	 * requests(include the time-out ones) will be canceled.
>>> +	 */
>>> +	nvme_sync_queues(&dev->ctrl);
>>> +	blk_sync_queue(dev->ctrl.admin_q);
>>> +
>> Looks like blk_sync_queue cannot drain all the timeout work.
>>
>> blk_sync_queue
>>   -> del_timer_sync
>>                      blk_mq_timeout_work
>>                        -> mod_timer
>>   -> cancel_work_sync
>>
>> the timeout work may come back again.
>> we may need to force all the in-flight requests to be timed out with
>> blk_abort_request
>>
>
> blk_abort_request() seems over-kill, we could avoid this race simply by
> returning EH_NOT_HANDLED if the controller is in-recovery.

Returning EH_NOT_HANDLED may be not enough.
Please consider the following scenario.
nvme_error_handler
  -> nvme_dev_disable
    -> blk_sync_queue
                        // timeout comes again due to the scenario above
                        blk_mq_timeout_work
                          -> blk_mq_check_expired
                            -> set aborted_gstate
  -> nvme_pci_cancel_request
    -> RQF_MQ_TIMEOUT_EXPIRED has not been set
    -> nvme_cancel_request
      -> blk_mq_complete_request
        -> do nothing
                          -> blk_mq_terminate_expired
                            -> blk_mq_rq_timed_out
                              -> set RQF_MQ_TIMEOUT_EXPIRED
                              -> .timeout returns BLK_EH_NOT_HANDLED

Then the timeout request is leaked.

>
>>> 	nvme_pci_disable(dev);
>>
>> the interrupt will not come, but there may be a running one.
>> a synchronize_sched() here ?
>
> We may cover this case by moving nvme_suspend_queue() before
> nvme_stop_queues().
>
> Both two are very good catches, thanks!
>
Re: [PATCH 1/2] nvme: pci: simplify timeout handling
Hi Ming

Thanks for your wonderful solution. :)

On 04/26/2018 08:39 PM, Ming Lei wrote:
> +/*
> + * This one is called after queues are quiesced, and no in-flight timeout
> + * and nvme interrupt handling.
> + */
> +static void nvme_pci_cancel_request(struct request *req, void *data,
> +		bool reserved)
> +{
> +	/* make sure timed-out requests are covered too */
> +	if (req->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) {
> +		req->aborted_gstate = 0;
> +		req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
> +	}
> +
> +	nvme_cancel_request(req, data, reserved);
> +}
> +
>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  {
>  	int i;
> @@ -2223,10 +2316,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
>  		nvme_suspend_queue(&dev->queues[i]);
>
> +	/*
> +	 * safe to sync timeout after queues are quiesced, then all
> +	 * requests(include the time-out ones) will be canceled.
> +	 */
> +	nvme_sync_queues(&dev->ctrl);
> +	blk_sync_queue(dev->ctrl.admin_q);
> +

Looks like blk_sync_queue cannot drain all the timeout work.

blk_sync_queue
  -> del_timer_sync
                     blk_mq_timeout_work
                       -> mod_timer
  -> cancel_work_sync

the timeout work may come back again.
we may need to force all the in-flight requests to be timed out with
blk_abort_request

> 	nvme_pci_disable(dev);

the interrupt will not come, but there may be a running one.
a synchronize_sched() here ?

Thanks
Jianchao

> -	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
> -	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request,
> -			&dev->ctrl);
> +	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_cancel_request,
> +			&dev->ctrl);
> +	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_pci_cancel_request,
> +			&dev->ctrl);
Re: testing io.low limit for blk-throttle
Hi Paolo

When I executed the script, I got this:

8:0 rbps=1000 wbps=0 riops=0 wiops=0 idle=0 latency=max

The idle is 0. I'm afraid the io.low would not work.
Please refer to the following code in tg_set_limit:

	/* force user to configure all settings for low limit  */
	if (!(tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW] ||
	      tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW]) ||
	    tg->idletime_threshold_conf == DFL_IDLE_THRESHOLD || //-> HERE
	    tg->latency_target_conf == DFL_LATENCY_TARGET) {
		tg->bps[READ][LIMIT_LOW] = 0;
		tg->bps[WRITE][LIMIT_LOW] = 0;
		tg->iops[READ][LIMIT_LOW] = 0;
		tg->iops[WRITE][LIMIT_LOW] = 0;
		tg->idletime_threshold = DFL_IDLE_THRESHOLD;
		tg->latency_target = DFL_LATENCY_TARGET;
	} else if (index == LIMIT_LOW) {
		tg->idletime_threshold = tg->idletime_threshold_conf;
		tg->latency_target = tg->latency_target_conf;
	}

	blk_throtl_update_limit_valid(tg->td);

Thanks
Jianchao

On 04/23/2018 03:37 PM, Paolo Valente wrote:
> cd thr-lat-with-interference
> sudo ./thr-lat-with-interference.sh -b t -w 1 -W "1000 1000
> 1000 1000 1000 1000" -n 6 -T "read read read read read read"
> -R "0 0 0 0 0 0"
Re: testing io.low limit for blk-throttle
Hi Paolo On 04/23/2018 01:32 PM, Paolo Valente wrote: > Thanks for sharing this fix. I tried it too, but nothing changes in > my test :(> That's really sad. > At this point, my doubt is still: am I getting io.low limit right? I > understand that an I/O-bound group should be guaranteed a rbps at > least equal to the rbps set with io.low for that group (of course, > provided that the sum of io.low limits is lower than the rate at which > the device serves all the I/O generated by the groups). Is this > really what io.low shall guarantee? I agree with your point about this even if I'm not qualified to judge it. On the other hand, could you share your test case and blk-throl config here ? Thanks Jianchao
Re: testing io.low limit for blk-throttle
Hi Paolo

As I said, I used to meet a similar scenario. After some debugging, I
found out 3 issues.

Here are my setup commands:

mkdir test0 test1
echo "259:0 riops=15" > test0/io.max
echo "259:0 riops=15" > test1/io.max
echo "259:0 riops=15" > test2/io.max
echo "259:0 riops=5 wiops=5 rbps=209715200 wbps=209715200 idle=200 latency=10" > test0/io.low
echo "259:0 riops=5 wiops=5 rbps=209715200 wbps=209715200 idle=200 latency=10" > test1/io.low

My NVMe card's max bps is ~600M, and max iops is ~160k.
The two cgroups' io.low is bps 200M and 50k iops; io.max is 150k iops.

1. I set up 2 cgroups test0 and test1, one process per cgroup.
Even if only the process in test0 does IO, its iops is just 50k.
This is fixed by the following patch.
https://marc.info/?l=linux-block&m=152325457607425&w=2

2. Let the processes in test0 and test1 both do IO.
Sometimes, the iops of both cgroups are 50k; looking at the log,
blk-throtl's upgrade always fails.
This is fixed by the following patch:
https://marc.info/?l=linux-block&m=152325456307423&w=2

3. After applying patches 1 and 2, I still see that one cgroup's iops
will fall down to 30k ~ 40k but blk-throtl doesn't downgrade. This is
because even if the iops has been lower than the io.low limit for some
time, the cgroup is judged idle, so the downgrade fails. More
specifically, it is due to this code segment in throtl_tg_is_idle:

	(tg->latency_target && tg->bio_cnt &&
	 tg->bad_bio_cnt * 5 < tg->bio_cnt)

I fixed it with the following patch. But I'm not sure about this patch,
so I didn't submit it. Please also try it.
:)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b5ba845..c9a43a4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1819,7 +1819,7 @@ static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
 	return ret;
 }
 
-static bool throtl_tg_is_idle(struct throtl_grp *tg)
+static bool throtl_tg_is_idle(struct throtl_grp *tg, bool latency)
 {
 	/*
 	 * cgroup is idle if:
@@ -1836,7 +1836,7 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
 	      tg->idletime_threshold == DFL_IDLE_THRESHOLD ||
 	      (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
 	      tg->avg_idletime > tg->idletime_threshold ||
-	      (tg->latency_target && tg->bio_cnt &&
+	      (tg->latency_target && tg->bio_cnt && latency &&
 		tg->bad_bio_cnt * 5 < tg->bio_cnt);
 	throtl_log(&tg->service_queue,
 		"avg_idle=%ld, idle_threshold=%ld, bad_bio=%d, total_bio=%d, is_idle=%d, scale=%d",
@@ -1867,7 +1867,7 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
 	if (time_after_eq(jiffies,
 		tg_last_low_overflow_time(tg) + tg->td->throtl_slice) &&
-		throtl_tg_is_idle(tg))
+		throtl_tg_is_idle(tg, true))
 		return true;
 	return false;
 }
@@ -1983,7 +1983,7 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
 	if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
 	    time_after_eq(now, tg_last_low_overflow_time(tg) +
 					td->throtl_slice) &&
-	    (!throtl_tg_is_idle(tg) ||
+	    (!throtl_tg_is_idle(tg, false) ||
 	     !list_empty(&tg_to_blkg(tg)->blkcg->css.children)))
 		return true;
 	return false;

On 04/22/2018 11:53 PM, Paolo Valente wrote:
>
>
>> Il giorno 22 apr 2018, alle ore 15:29, jianchao.wang
>> <jianchao.w.w...@oracle.com> ha scritto:
>>
>> Hi Paolo
>>
>> I used to meet similar issue on io.low.
>> Can you try the following patch to see whether the issue could be fixed.
>> https://marc.info/?l=linux-block&m=152325456307423&w=2
>> https://marc.info/?l=linux-block&m=152325457607425&w=2
>>
>
> Just tried. Unfortunately, nothing seems to change :(
>
> Thanks,
> Paolo
>
>> Thanks
>> Jianchao
>>
>> On 04/22/2018 05:23 PM, Paolo Valente wrote:
>>> Hi Shaohua, all,
>>> at last, I started testing your io.low limit for blk-throttle. One of
>>> the things I'm interested in is how good throttling is in achieving a
>>> high thr
Re: testing io.low limit for blk-throttle
Hi Paolo

I used to meet a similar issue on io.low.
Can you try the following patches to see whether the issue could be fixed.

https://marc.info/?l=linux-block&m=152325456307423&w=2
https://marc.info/?l=linux-block&m=152325457607425&w=2

Thanks
Jianchao

On 04/22/2018 05:23 PM, Paolo Valente wrote:
> Hi Shaohua, all,
> at last, I started testing your io.low limit for blk-throttle. One of
> the things I'm interested in is how good throttling is in achieving a
> high throughput in the presence of realistic, variable workloads.
>
> However, I seem to have bumped into a totally different problem. The
> io.low parameter doesn't seem to guarantee what I understand it is meant
> to guarantee: minimum per-group bandwidths. For example, with
> - one group, the interfered, containing one process that does sequential
>   reads with fio
> - io.low set to 100MB/s for the interfered
> - six other groups, the interferers, with each interferer containing one
>   process doing sequential read with fio
> - io.low set to 10MB/s for each interferer
> - the workload executed on an SSD, with a 500MB/s of overall throughput
> the interfered gets only 75MB/s.
>
> In particular, the throughput of the interfered becomes lower and
> lower as the number of interferers is increased. So you can make it
> become even much lower than the 75MB/s in the example above. There
> seems to be no control on bandwidth.
>
> Am I doing something wrong? Or did I simply misunderstand the goal of
> io.low, and the only parameter for guaranteeing the desired bandwidth to
> a group is io.max (to be used indirectly, by limiting the bandwidth of
> the interferers)?
>
> If useful for you, you can reproduce the above test very quickly, by
> using the S suite [1] and typing:
>
> cd thr-lat-with-interference
> sudo ./thr-lat-with-interference.sh -b t -w 1 -W "1000 1000
> 1000 1000 1000 1000" -n 6 -T "read read read read read read"
> -R "0 0 0 0 0 0"
>
> Looking forward to your feedback,
> Paolo
>
> [1]
Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
On 04/21/2018 10:10 PM, Jens Axboe wrote: > On 4/21/18 7:34 AM, jianchao.wang wrote: >> Hi Bart >> >> Thanks for your kindly response. >> >> On 04/20/2018 10:11 PM, Bart Van Assche wrote: >>> On Fri, 2018-04-20 at 14:55 +0800, jianchao.wang wrote: >>>> Hi Bart >>>> >>>> On 04/20/2018 12:43 AM, Bart Van Assche wrote: >>>>> Use the deadline instead of the request generation to detect whether >>>>> or not a request timer fired after reinitialization of a request >>>> >>>> Maybe use deadline to do this is not suitable. >>>> >>>> Let's think of the following scenario. >>>> >>>> T1/T2 times in nano seconds >>>> J1/J2 times in jiffies >>>> >>>> rq is started at T1,J1 >>>> blk_mq_timeout_work >>>> -> blk_mq_check_expired >>>> -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT) >>>> >>>> rq is completed and freed >>>> rq is allocated and started >>>> again on T2 >>>> rq->__deadline = J2 + >>>> MQ_RQ_IN_FLIGHT >>>> -> synchronize srcu/rcu >>>> -> blk_mq_terminate_expired >>>> -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT) >>>> >>>> 1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3) >>>>if J2-J1 < 4 jiffies, we will get the same deadline value. >>>> >>>> 2. even if we do some bit shift when save and get deadline >>>>if T2 - T1 < time of 1 jiffies, we sill get the same deadline. >>> >>> Hello Jianchao, >>> >>> How about using the upper 16 or 32 bits of rq->__deadline to store a >>> generation >>> number? I don't know any block driver for which the product of (deadline in >>> seconds) and HZ exceeds (1 << 32). >>> >> yes, we don't need so long timeout value. >> However, req->__deadline is an absolute time, not a relative one, >> its type is unsigned long variable which is same with jiffies. >> If we reserve some bits (not just 1 or 2 bits) of req->__deadline for >> generation number, >> how to handle it when mod_timer ? > > We don't need to support timeouts to the full granularity of jiffies. > Most normal commands are in the 30-60s range, and some lower level > commands may take minutes. 
> We're well within the range of chopping
> off some bits and not having to worry about that at all.
>

Yes. So we have a __deadline as below:

       deadline              gen   state
 |_______________________|________|____|
  \__________  _________/
             \/
        granularity

Even though we usually have a 30~60s timeout, we should support a 1s
granularity at least. How to decide the granularity ?

Thanks
Jianchao
Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
Hi Bart

Thanks for your kindly response.

On 04/20/2018 10:11 PM, Bart Van Assche wrote:
> On Fri, 2018-04-20 at 14:55 +0800, jianchao.wang wrote:
>> Hi Bart
>>
>> On 04/20/2018 12:43 AM, Bart Van Assche wrote:
>>> Use the deadline instead of the request generation to detect whether
>>> or not a request timer fired after reinitialization of a request
>>
>> Maybe using the deadline to do this is not suitable.
>>
>> Let's think of the following scenario.
>>
>> T1/T2: times in nanoseconds
>> J1/J2: times in jiffies
>>
>> rq is started at T1,J1
>> blk_mq_timeout_work
>>   -> blk_mq_check_expired
>>     -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)
>>
>>                             rq is completed and freed
>>                             rq is allocated and started again on T2
>>                             rq->__deadline = J2 + MQ_RQ_IN_FLIGHT
>>   -> synchronize srcu/rcu
>>   -> blk_mq_terminate_expired
>>     -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)
>>
>> 1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
>>    if J2 - J1 < 4 jiffies, we will get the same deadline value.
>>
>> 2. even if we do some bit shift when saving and getting the deadline,
>>    if T2 - T1 < the time of 1 jiffy, we will still get the same deadline.
>
> Hello Jianchao,
>
> How about using the upper 16 or 32 bits of rq->__deadline to store a
> generation number? I don't know any block driver for which the product
> of (deadline in seconds) and HZ exceeds (1 << 32).
>

Yes, we don't need such a long timeout value.
However, req->__deadline is an absolute time, not a relative one;
its type is unsigned long, the same as jiffies.
If we reserve some bits (not just 1 or 2 bits) of req->__deadline for a
generation number, how do we handle it when we mod_timer ?

Thanks
Jianchao
Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
Hi Bart

On 04/20/2018 12:43 AM, Bart Van Assche wrote:
> Use the deadline instead of the request generation to detect whether
> or not a request timer fired after reinitialization of a request

Maybe using the deadline to do this is not suitable.

Let's think of the following scenario.

T1/T2: times in nanoseconds
J1/J2: times in jiffies

rq is started at T1,J1
blk_mq_timeout_work
  -> blk_mq_check_expired
    -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)

                            rq is completed and freed
                            rq is allocated and started again on T2
                            rq->__deadline = J2 + MQ_RQ_IN_FLIGHT
  -> synchronize srcu/rcu
  -> blk_mq_terminate_expired
    -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)

1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
   if J2 - J1 < 4 jiffies, we will get the same deadline value.

2. even if we do some bit shift when saving and getting the deadline,
   if T2 - T1 < the time of 1 jiffy, we will still get the same deadline.

Thanks
Jianchao
Re: [PATCH] blk-mq: start request gstate with gen 1
Hi Martin On 04/17/2018 08:10 PM, Martin Steigerwald wrote: > For testing it I add it to 4.16.2 with the patches I have already? You could try to only apply this patch to have a test. :) > > - '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a > request.mbox' > > - '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into > rcu_read_{lock,unlock}().mbox' > > - '[PATCH V4 1_2] blk-mq_set RQF_MQ_TIMEOUT_EXPIRED when the rq'\''s > timeout isn'\''t handled.mbox' > > - '[PATCH V4 2_2] blk-mq_fix race between complete and > BLK_EH_RESET_TIMER.mbox Thanks Jianchao
Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER
Hi bart

Thanks for your kindly response. I have sent out the patch.
Please refer to
https://marc.info/?l=linux-block&m=152393666517449&w=2

Thanks
Jianchao

On 04/17/2018 08:15 AM, Bart Van Assche wrote:
> On Tue, 2018-04-17 at 00:04 +0800, jianchao.wang wrote:
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 16e83e6..be9b435 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2077,6 +2077,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
>>
>>  	seqcount_init(&rq->gstate_seq);
>>  	u64_stats_init(&rq->aborted_gstate_sync);
>> +	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
>>  	return 0;
>>  }
>
> Hello Jianchao,
>
> Your approach looks interesting to me. Can you send an official patch to Jens?
>
> Thanks,
>
> Bart.
Re: [PATCH] blk-mq: mark hctx RESTART when get budget fails
Hi Ming

Thanks for your kindly response.

On 04/16/2018 04:15 PM, Ming Lei wrote:
>> -	if (!blk_mq_get_dispatch_budget(hctx))
>> +	if (!blk_mq_get_dispatch_budget(hctx)) {
>> +		blk_mq_sched_mark_restart_hctx(hctx);
> The RESTART flag still may not take effect if all requests are
> completed before calling blk_mq_sched_mark_restart_hctx().
>

Yes, this is indeed a very tricky scenario.

Sincerely
Jianchao
Re: [PATCH V2] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
Hi Ming

On 04/12/2018 07:38 AM, Ming Lei wrote:
> +	 *
> +	 * Cover complete vs BLK_EH_RESET_TIMER race in slow path with
> +	 * holding queue lock.
>  	 */
>  	hctx_lock(hctx, &srcu_idx);
>  	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>  		__blk_mq_complete_request(rq);
> +	else {
> +		unsigned long flags;
> +		bool need_complete = false;
> +
> +		spin_lock_irqsave(q->queue_lock, flags);
> +		if (!blk_mq_rq_aborted_gstate(rq))
> +			need_complete = true;
> +		else
> +			blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_TIMEOUT);
> +		spin_unlock_irqrestore(q->queue_lock, flags);

What if the .timeout returns BLK_EH_HANDLED during this ?

timeout context                     irq context

.timeout()
                                    blk_mq_complete_request
                                      set state to MQ_RQ_COMPLETE_IN_TIMEOUT
__blk_mq_complete_request
  WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);

Furthermore, upon blk_mq_free_request, the finally freed request may be
left in MQ_RQ_COMPLETE_IN_TIMEOUT instead of MQ_RQ_IDLE.

Thanks
Jianchao
Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
Hi Bart

Thanks for your kindly response.

On 04/10/2018 09:01 PM, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 15:59 +0800, jianchao.wang wrote:
>> If yes, how does the timeout handler get the freed request when the tag has
>> been freed ?
>
> Hello Jianchao,
>
> Have you noticed that the timeout handler does not check whether or not the
> request tag is freed? Additionally, I don't think it would be possible to add
> such a check to the timeout code without introducing a new race condition.

Doesn't blk_mq_queue_tag_busy_iter only iterate the tags that have been
allocated/set ?
When the request is freed, the tag will be cleared through
blk_mq_put_tag->sbitmap_queue_clear.

Do I miss something else ?

Thanks
Jianchao
Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
Hi Bart

On 04/10/2018 09:34 AM, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout
> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs

Would you please detail more here about why the request could be
completed twice ?

Is it the scenario you described as below in
https://marc.info/?l=linux-block&m=151796816127318 ?

The following race can occur between the code that resets the timer
and completion handling:
- The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
- A completion occurs and blk_mq_complete_request() calls
  __blk_mq_complete_request().
- The timeout code calls blk_add_timer() and that function sets the
  request deadline and adjusts the timer.
- __blk_mq_complete_request() frees the request tag.
- The timer fires and the timeout handler gets called for a freed
  request.

If yes, how does the timeout handler get the freed request when the tag
has been freed ?

Thanks
Jianchao
Re: blk-mq and CPU hotplug error
Hi Jose On 03/27/2018 05:25 AM, jos...@linux.vnet.ibm.com wrote: > Hello everyone! > > I'm running Ubuntu 18.04 (4.15.0-12-generic) in KVM/QEMU (powerpc64le). > Everything looks good until I try to hotplug CPUs in my VM. As soon as I > do that I get the following error in my VM dmesg: Please refer to the following commit: 20e4d813 blk-mq: simplify queue mapping & schedule with each possisble CPU 84676c1 genirq/affinity: assign vectors to all possible CPUs Thanks Jianchao
Re: [PATCH 1/3] blk-mq: Allow PCI vector offset for mapping queues
Hi Keith Thanks for your time and patch for this. On 03/24/2018 06:19 AM, Keith Busch wrote: > The PCI interrupt vectors intended to be associated with a queue may > not start at 0. This patch adds an offset parameter so blk-mq may find > the intended affinity mask. The default value is 0 so existing drivers > that don't care about this parameter don't need to change. > > Signed-off-by: Keith Busch> --- > block/blk-mq-pci.c | 12 ++-- > include/linux/blk-mq-pci.h | 2 ++ > 2 files changed, 12 insertions(+), 2 deletions(-) > > diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c > index 76944e3271bf..1040a7705c13 100644 > --- a/block/blk-mq-pci.c > +++ b/block/blk-mq-pci.c > @@ -21,6 +21,7 @@ > * blk_mq_pci_map_queues - provide a default queue mapping for PCI device > * @set: tagset to provide the mapping for > * @pdev:PCI device associated with @set. > + * @offset: PCI irq starting vector offset > * > * This function assumes the PCI device @pdev has at least as many available > * interrupt vectors as @set has queues. It will then query the vector > @@ -28,13 +29,14 @@ > * that maps a queue to the CPUs that have irq affinity for the corresponding > * vector. > */ > -int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev) > +int __blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev, > + int offset) > { > const struct cpumask *mask; > unsigned int queue, cpu; > > for (queue = 0; queue < set->nr_hw_queues; queue++) { > - mask = pci_irq_get_affinity(pdev, queue); > + mask = pci_irq_get_affinity(pdev, queue + offset); > if (!mask) > goto fallback; > Maybe we could provide a callback parameter for __blk_mq_pci_map_queues which give the mapping from hctx queue num to device-relative interrupt vector index. Thanks Jianchao
Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3
Sorry, Jens, I think I didn't get the point. Do I miss anything ? On 02/01/2018 11:07 AM, Jens Axboe wrote: > Yeah I agree, and my last patch missed that we do care about segments for > discards. Below should be better... > > diff --git a/block/blk-merge.c b/block/blk-merge.c > index 8452fc7164cc..055057bd727f 100644 > --- a/block/blk-merge.c > +++ b/block/blk-merge.c > @@ -553,9 +553,8 @@ static bool req_no_special_merge(struct request *req) > static int ll_merge_requests_fn(struct request_queue *q, struct request *req, > struct request *next) > { > - int total_phys_segments; > - unsigned int seg_size = > - req->biotail->bi_seg_back_size + next->bio->bi_seg_front_size; > + int total_phys_segments = req->nr_phys_segments + > + next->nr_phys_segments; For DISCARD reqs, the total_phys_segments is still zero here. > > /* >* First check if the either of the requests are re-queued > @@ -574,8 +573,15 @@ static int ll_merge_requests_fn(struct request_queue *q, > struct request *req, > blk_rq_get_max_sectors(req, blk_rq_pos(req))) > return 0; > > - total_phys_segments = req->nr_phys_segments + next->nr_phys_segments; > - if (blk_phys_contig_segment(q, req->biotail, next->bio)) { > + /* > + * If the requests aren't carrying any data payloads, we don't need > + * to look at the segment count > + */ > + if (bio_has_data(next->bio) && > + blk_phys_contig_segment(q, req->biotail, next->bio)) { > + unsigned int seg_size = req->biotail->bi_seg_back_size + > + next->bio->bi_seg_front_size; Yes, total_phys_segments will not be decreased. > + > if (req->nr_phys_segments == 1) > req->bio->bi_seg_front_size = seg_size; > if (next->nr_phys_segments == 1) > @@ -584,7 +590,7 @@ static int ll_merge_requests_fn(struct request_queue *q, > struct request *req, > } > > if (total_phys_segments > queue_max_segments(q)) > - return 0; > + return 0; > > if (blk_integrity_merge_rq(q, req, next) == false) > return 0; But finally, the merged DISCARD req's nr_phys_segment is still zero. 
blk_rq_nr_discard_segments will return 1 but this req has two bios.
blk_rq_nr_discard_segments's comment says
-- Each discard bio merged into a request is counted as one segment --

Maybe the patch below should follow yours.

diff --git a/block/blk-core.c b/block/blk-core.c
index a2005a4..b444fb7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1763,7 +1763,6 @@ bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
 	req->biotail = bio;
 	req->__data_len += bio->bi_iter.bi_size;
 	req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
-	req->nr_phys_segments = segments + 1;
 
 	blk_account_io_start(req, false);
 	return true;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4f3df80..1af2138 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1312,7 +1312,13 @@ static inline unsigned short blk_rq_nr_phys_segments(struct request *rq)
  */
 static inline unsigned short blk_rq_nr_discard_segments(struct request *rq)
 {
-	return max_t(unsigned short, rq->nr_phys_segments, 1);
+	struct bio *bio;
+	unsigned short count = 0;
+
+	__rq_for_each_bio(bio, rq)
+		count++;
+
+	return count;
 }

Thanks
Jianchao
Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3
Hi Jens

On 01/31/2018 11:29 PM, Jens Axboe wrote:
> How about something like the below?
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8452fc7164cc..cee102fb060e 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -574,8 +574,13 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
>  	    blk_rq_get_max_sectors(req, blk_rq_pos(req)))
>  		return 0;
>
> +	/*
> +	 * For DISCARDs, the segment count isn't interesting since
> +	 * the requests have no data attached.
> +	 */
>  	total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
> -	if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
> +	if (total_phys_segments &&
> +	    blk_phys_contig_segment(q, req->biotail, next->bio)) {
>  		if (req->nr_phys_segments == 1)
>  			req->bio->bi_seg_front_size = seg_size;
>  		if (next->nr_phys_segments == 1)

This patch avoids nr_phys_segments being set to 0xffff, but the merged req will have two bios and yet zero nr_phys_segments. We have to align with the DISCARD merging strategy. Please refer to:

/*
 * Number of discard segments (or ranges) the driver needs to fill in.
 * Each discard bio merged into a request is counted as one segment.
 */
static inline unsigned short blk_rq_nr_discard_segments(struct request *rq)
{
	return max_t(unsigned short, rq->nr_phys_segments, 1);
}

bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
		struct bio *bio)
{
	unsigned short segments = blk_rq_nr_discard_segments(req);

	if (segments >= queue_max_discard_segments(q))
		goto no_merge;
	if (blk_rq_sectors(req) + bio_sectors(bio) >
	    blk_rq_get_max_sectors(req, blk_rq_pos(req)))
		goto no_merge;

	req->biotail->bi_next = bio;
	req->biotail = bio;
	req->__data_len += bio->bi_iter.bi_size;
	req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
	req->nr_phys_segments = segments + 1;

	blk_account_io_start(req, false);
	return true;
no_merge:
	req_set_nomerge(q, req);
	return false;
}

blk_rq_nr_discard_segments will get a wrong value in the end.
Maybe we could change blk_rq_nr_discard_segments to iterate over and count the bios in a request to determine the DISCARD request's segment count, drop the nr_phys_segments updates from the DISCARD merging path, and apply your patch here on top.

Thanks
Jianchao
Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3
Hi Jens On 01/30/2018 11:57 PM, Jens Axboe wrote: > On 1/30/18 8:41 AM, Jens Axboe wrote: >> Hi, >> >> I just hit this on 4.15+ on the laptop, it's running Linus' git >> as of yesterday, right after the block tree merge: >> >> commit 0a4b6e2f80aad46fb55a5cf7b1664c0aef030ee0 >> Merge: 9697e9da8429 796baeeef85a >> Author: Linus Torvalds>> Date: Mon Jan 29 11:51:49 2018 -0800 >> >> Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block >> >> It's hitting this case: >> >> if (WARN_ON_ONCE(n != segments)) { >> >> >> in nvme, indicating that rq->nr_phys_segments does not equal the >> number of bios in the request. Anyone seen this? I'll take a closer >> look as time permits, full trace below. >> >> >> WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 >> nvme_setup_cmd+0x3d3/0x430 >> Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc >> snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat >> snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 >> snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq >> x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb >> snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer >> videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 >> crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore >> hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd >> intel_gtt >> CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U 4.15.0+ >> #176 >> Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) >> 12/19/2017 >> RIP: 0010:nvme_setup_cmd+0x3d3/0x430 >> RSP: 0018:880423e9f838 EFLAGS: 00010217 >> RAX: RBX: 880423e9f8c8 RCX: 0001 >> RDX: 88022b200010 RSI: 0002 RDI: 327f >> RBP: 880421251400 R08: 88022b20 R09: 0009 >> R10: R11: R12: >> R13: 88042341e280 R14: R15: 880421251440 >> FS: () GS:88044150() knlGS: >> CS: 0010 DS: ES: CR0: 80050033 >> CR2: 55b684795030 
CR3: 02e09006 CR4: 001606e0 >> DR0: DR1: DR2: >> DR3: DR6: fffe0ff0 DR7: 0400 >> Call Trace: >> nvme_queue_rq+0x40/0xa00 >> ? __sbitmap_queue_get+0x24/0x90 >> ? blk_mq_get_tag+0xa3/0x250 >> ? wait_woken+0x80/0x80 >> ? blk_mq_get_driver_tag+0x97/0xf0 >> blk_mq_dispatch_rq_list+0x7b/0x4a0 >> ? deadline_remove_request+0x49/0xb0 >> blk_mq_do_dispatch_sched+0x4f/0xc0 >> blk_mq_sched_dispatch_requests+0x106/0x170 >> __blk_mq_run_hw_queue+0x53/0xa0 >> __blk_mq_delay_run_hw_queue+0x83/0xa0 >> blk_mq_run_hw_queue+0x6c/0xd0 >> blk_mq_sched_insert_request+0x96/0x140 >> __blk_mq_try_issue_directly+0x3d/0x190 >> blk_mq_try_issue_directly+0x30/0x70 >> blk_mq_make_request+0x1a4/0x6a0 >> generic_make_request+0xfd/0x2f0 >> ? submit_bio+0x5c/0x110 >> submit_bio+0x5c/0x110 >> ? __blkdev_issue_discard+0x152/0x200 >> submit_bio_wait+0x43/0x60 >> ext4_process_freed_data+0x1cd/0x440 >> ? account_page_dirtied+0xe2/0x1a0 >> ext4_journal_commit_callback+0x4a/0xc0 >> jbd2_journal_commit_transaction+0x17e2/0x19e0 >> ? kjournald2+0xb0/0x250 >> kjournald2+0xb0/0x250 >> ? wait_woken+0x80/0x80 >> ? commit_timeout+0x10/0x10 >> kthread+0x111/0x130 >> ? kthread_create_worker_on_cpu+0x50/0x50 >> ? do_group_exit+0x3a/0xa0 >> ret_from_fork+0x1f/0x30 >> Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 >> 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 >> 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff >> ---[ end trace 50d361cc444506c8 ]--- >> print_req_error: I/O error, dev nvme0n1, sector 847167488 > > Looking at the disassembly, 'n' is 2 and 'segments' is 0x. > Let's consider the case following: blk_mq_bio_to_request() -> blk_init_request_from_bio() -> blk_rq_bio_prep() if (bio_has_data(bio)) rq->nr_phys_segments = bio_phys_segments(q, bio); static inline bool bio_has_data(struct bio *bio) { if (bio && bio->bi_iter.bi_size && bio_op(bio) != REQ_OP_DISCARD && //-> HERE ! 
	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
		return true;

	return false;
}

For the DISCARD req, the nr_phys_segments is 0.

dd_insert_request()
  -> blk_mq_sched_try_insert_merge()
    -> elv_attempt_insert_merge()
      -> blk_attempt_req_merge()
        -> attempt_merge()
          -> ll_merge_requests_fn()
               total_phys_segments = req->nr_phys_segments + next->nr_phys_segments; //
Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
Hi Ming

Sorry for the delayed report on this.

On 01/17/2018 05:57 PM, Ming Lei wrote:
> 2) hctx->next_cpu can become offline from online before __blk_mq_run_hw_queue
> is run, there isn't warning, but once the IO is submitted to hardware,
> after it is completed, how does the HBA/hw queue notify CPU since CPUs
> assigned to this hw queue(irq vector) are offline? blk-mq's timeout
> handler may cover that, but looks too tricky.

In theory, the irq affinity will be migrated to another cpu. This is done by fixup_irqs() in the context of stop_machine. However, in my test I found this log:

[  267.161043] do_IRQ: 7.33 No irq handler for vector

The 33 is the vector used by the nvme cq. The irq seems to be missed, and sometimes an IO hang occurred. It is not every time; I think that may be due to nvme_process_cq in nvme_queue_rq. I added a dump_stack behind the error log and got the following:

[  267.161043] do_IRQ: 7.33 No irq handler for vector migration/7
[  267.161045] CPU: 7 PID: 52 Comm: migration/7 Not tainted 4.15.0-rc7+ #27
[  267.161045] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  267.161046] Call Trace:
[  267.161047]
[  267.161052]  dump_stack+0x7c/0xb5
[  267.161054]  do_IRQ+0xb9/0xf0
[  267.161056]  common_interrupt+0xa2/0xa2
[  267.161057]
[  267.161059] RIP: 0010:multi_cpu_stop+0xb0/0x120
[  267.161060] RSP: 0018:bb6c81af7e70 EFLAGS: 0202 ORIG_RAX: ffde
[  267.161061] RAX: 0001 RBX: 0004 RCX:
[  267.161062] RDX: 0006 RSI: 898c4591 RDI: 0202
[  267.161063] RBP: bb6c826e7c88 R08: 991abc1256bc R09: 0005
[  267.161063] R10: bb6c81af7db8 R11: 89c91d20 R12: 0001
[  267.161064] R13: bb6c826e7cac R14: 0003 R15:
[  267.161067]  ? cpu_stop_queue_work+0x90/0x90
[  267.161068]  cpu_stopper_thread+0x83/0x100
[  267.161070]  smpboot_thread_fn+0x161/0x220
[  267.161072]  kthread+0xf5/0x130
[  267.161073]  ? sort_range+0x20/0x20
[  267.161074]  ? kthread_associate_blkcg+0xe0/0xe0
[  267.161076]  ret_from_fork+0x24/0x30

The irq occurred just after interrupts were re-enabled in multi_cpu_stop.
0x8112d655 is in multi_cpu_stop (/home/will/u04/source_code/linux-block/kernel/stop_machine.c:223).

218				 */
219				touch_nmi_watchdog();
220			}
221		} while (curstate != MULTI_STOP_EXIT);
222
223		local_irq_restore(flags);
224		return err;
225	}

Thanks
Jianchao
Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
Hi Ming

Thanks for your kind response.

On 01/17/2018 02:22 PM, Ming Lei wrote:
> This warning can't be removed completely, for example, the CPU figured
> in blk_mq_hctx_next_cpu(hctx) can be put on again just after the
> following call returns and before __blk_mq_run_hw_queue() is scheduled
> to run.
>
> kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
>		&hctx->run_work, msecs_to_jiffies(msecs))

We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window. There is a big gap between cpu_online and cpu_active; rebind_workers also runs between them.

> Just be curious how you trigger this issue? And is it triggered in CPU
> hotplug stress test? Or in a normal use case?

In fact, this was my own investigation into whether the .queue_rq of one hardware queue could be executed on a cpu it is not mapped to. I finally found this hole with cpu hotplug. I did the test on an NVMe device, which has a 1-to-1 mapping between cpu and hctx:

- A special patch that holds some requests on ctx->rq_list through .get_budget
- A script that issues IOs with fio
- A script that onlines/offlines the cpus continuously

At first, there was just the warning above. Then, after this patch was introduced, the panic came up.

Thanks
Jianchao
Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
Hi ming Thanks for your kindly response. On 01/17/2018 11:52 AM, Ming Lei wrote: >> It is here. >> __blk_mq_run_hw_queue() >> >> WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) && >> cpu_online(hctx->next_cpu)); > I think this warning is triggered after the CPU of hctx->next_cpu becomes > online again, and it should have been dealt with by the following trick, > and this patch is against the previous one, please test it and see if > the warning can be fixed. > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 23f0f3ddffcf..0620ccb65e4e 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1452,6 +1452,9 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx > *hctx) > tried = true; > goto select_cpu; > } > + > + /* handle after this CPU of hctx->next_cpu becomes online again > */ > + hctx->next_cpu_batch = 1; > return WORK_CPU_UNBOUND; > } > return hctx->next_cpu; > The WARNING still could be triggered. [ 282.194532] WARNING: CPU: 0 PID: 222 at /home/will/u04/source_code/linux-block/block/blk-mq.c:1315 __blk_mq_run_hw_queue+0x92/0xa0 [ 282.194534] Modules linked in: [ 282.194561] CPU: 0 PID: 222 Comm: kworker/2:1H Tainted: GW 4.15.0-rc7+ #17 [ 282.194562] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017 [ 282.194563] Workqueue: kblockd blk_mq_run_work_fn [ 282.194565] RIP: 0010:__blk_mq_run_hw_queue+0x92/0xa0 [ 282.194565] RSP: 0018:96c50223be60 EFLAGS: 00010202 [ 282.194566] RAX: 0001 RBX: 8a7110233800 RCX: [ 282.194567] RDX: 8a711255f2d0 RSI: RDI: 8a7110233800 [ 282.194567] RBP: 8a7122ca2140 R08: R09: [ 282.194568] R10: 0263 R11: R12: 8a7122ca8200 [ 282.194568] R13: R14: 08a7122ca820 R15: 8a70bcef0780 [ 282.194569] FS: () GS:8a7122c0() knlGS: [ 282.194569] CS: 0010 DS: ES: CR0: 80050033 [ 282.194570] CR2: 555e14d89d60 CR3: 3c809002 CR4: 003606f0 [ 282.194571] DR0: DR1: DR2: [ 282.194571] DR3: DR6: fffe0ff0 DR7: 0400 [ 282.194571] Call Trace: [ 282.194576] process_one_work+0x14b/0x420 [ 282.194577] 
worker_thread+0x47/0x420 [ 282.194579] kthread+0xf5/0x130 [ 282.194581] ? process_one_work+0x420/0x420 [ 282.194581] ? kthread_associate_blkcg+0xe0/0xe0 [ 282.194583] ret_from_fork+0x24/0x30 [ 282.194585] Code: ff 48 89 df e8 e0 6b 00 00 8b 74 24 04 48 89 df e8 a4 fc ff ff 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 0e 48 83 c4 10 5b c3 <0f> ff eb b7 0f ff eb c1 e8 e1 02 c6 ff 90 0f 1f 44 00 00 48 8b [ 282.194601] ---[ end trace c104f0a63d3df31b ]--- >> >> >> To eliminate this risk totally, we could blk_mq_hctx_next_cpu return the cpu >> even if the cpu is offlined and modify the cpu_online above to cpu_active. >> The kworkers of the per-cpu pool must have be migrated back when the cpu is >> set active. >> But there seems to be issues in DASD as your previous comment. > Yes, we can't break DASD. > >> That is the original version of this patch, and both Christian and Stefan >> reported that system can't boot from DASD in this way[2], and I changed >> to AND with cpu_online_mask, then their system can boot well >> On the other hand, there is also risk in >> >> @@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct >> request_queue *q, >> blk_queue_exit(q); >> return ERR_PTR(-EXDEV); >> } >> -cpu = cpumask_first(alloc_data.hctx->cpumask); >> +cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask); >> alloc_data.ctx = __blk_mq_get_ctx(q, cpu); >> >> what if the cpus in alloc_data.hctx->cpumask are all offlined ? > This one is crazy, and is used by NVMe only, it should be fine if > the passed 'hctx_idx' is retrieved by the current running CPU, such > as the way of blk_mq_map_queue(). But if not, bad thing may happen. Even if retrieved by current running cpu, the task still could be migrated away when the cpu is offlined. Thanks Jianchao
Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
Hi ming Thanks for your patch and kindly response. On 01/16/2018 11:32 PM, Ming Lei wrote: > OK, I got it, and it should have been the only corner case in which > all CPUs mapped to this hctx become offline, and I believe the following > patch should address this case, could you give a test? > > --- > diff --git a/block/blk-mq.c b/block/blk-mq.c > index c376d1b6309a..23f0f3ddffcf 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1416,21 +1416,44 @@ static void __blk_mq_run_hw_queue(struct > blk_mq_hw_ctx *hctx) > */ > static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx) > { > + bool tried = false; > + > if (hctx->queue->nr_hw_queues == 1) > return WORK_CPU_UNBOUND; > > if (--hctx->next_cpu_batch <= 0) { > int next_cpu; > +select_cpu: > > next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask, > cpu_online_mask); > if (next_cpu >= nr_cpu_ids) > next_cpu = > cpumask_first_and(hctx->cpumask,cpu_online_mask); > > - hctx->next_cpu = next_cpu; > + /* > + * No online CPU can be found here when running from > + * blk_mq_hctx_notify_dead(), so make sure hctx->next_cpu > + * is set correctly. > + */ > + if (next_cpu >= nr_cpu_ids) > + hctx->next_cpu = cpumask_first_and(hctx->cpumask, > + cpu_possible_mask); > + else > + hctx->next_cpu = next_cpu; > hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH; > } > > + /* > + * Do unbound schedule if we can't find a online CPU for this hctx, > + * and it should happen only if hctx->next_cpu is becoming DEAD. > + */ > + if (!cpu_online(hctx->next_cpu)) { > + if (!tried) { > + tried = true; > + goto select_cpu; > + } > + return WORK_CPU_UNBOUND; > + } > return hctx->next_cpu; > } I have tested this patch. The panic was gone, but I got the following: [ 231.674464] WARNING: CPU: 0 PID: 263 at /home/will/u04/source_code/linux-block/block/blk-mq.c:1315 __blk_mq_run_hw_queue+0x92/0xa0 [ 231.674466] Modules linked in: . 
[  231.674494] CPU: 0 PID: 263 Comm: kworker/2:1H Not tainted 4.15.0-rc7+ #12
[  231.674495] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  231.674496] Workqueue: kblockd blk_mq_run_work_fn
[  231.674498] RIP: 0010:__blk_mq_run_hw_queue+0x92/0xa0
[  231.674499] RSP: 0018:a9c801fcfe60 EFLAGS: 00010202
[  231.674500] RAX: 0001 RBX: 9c7c90231400 RCX:
[  231.674500] RDX: 9c7c9255b0f8 RSI: RDI: 9c7c90231400
[  231.674500] RBP: 9c7ca2ca2140 R08: R09:
[  231.674501] R10: 03cb R11: R12: 9c7ca2ca8200
[  231.674501] R13: R14: 09c7ca2ca820 R15: 9c7c3df25240
[  231.674502] FS: () GS:9c7ca2c0() knlGS:
[  231.674502] CS: 0010 DS: ES: CR0: 80050033
[  231.674503] CR2: 01727008 CR3: 000336409003 CR4: 003606f0
[  231.674504] DR0: DR1: DR2:
[  231.674504] DR3: DR6: fffe0ff0 DR7: 0400
[  231.674504] Call Trace:
[  231.674509]  process_one_work+0x14b/0x420
[  231.674510]  worker_thread+0x47/0x420
[  231.674512]  kthread+0xf5/0x130
[  231.674514]  ? process_one_work+0x420/0x420
[  231.674515]  ? kthread_associate_blkcg+0xe0/0xe0
[  231.674517]  ? do_group_exit+0x3a/0xa0
[  231.674518]  ret_from_fork+0x24/0x30
[  231.674520] Code: ff 48 89 df e8 e0 6b 00 00 8b 74 24 04 48 89 df e8 a4 fc ff ff 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 0e 48 83 c4 10 5b c3 <0f> ff eb b7 0f ff eb c1 e8 e1 02 c6 ff 90 0f 1f 44 00 00 48 8b
[  231.674537] ---[ end trace cc2de957e0e0fed4 ]---

It is here:

__blk_mq_run_hw_queue()
	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
		cpu_online(hctx->next_cpu));

To eliminate this risk totally, we could have blk_mq_hctx_next_cpu return the cpu even if that cpu is offlined, and modify the cpu_online above to cpu_active. The kworkers of the per-cpu pool must have been migrated back by the time the cpu is set active. But there seem to be issues with DASD, as your previous comment said.
That is the original version of this patch, and both Christian and Stefan reported that the system can't boot from DASD in this way[2], and I changed to AND with cpu_online_mask, then their system can boot well.

On the other hand, there is also risk in:

@@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		blk_queue_exit(q);
 		return ERR_PTR(-EXDEV);
 	}
-	cpu =
Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
Hi Ming Lei

On 01/16/2018 08:10 PM, Ming Lei wrote:
>>> -		next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
>>> +		next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
>>> +				cpu_online_mask);
>>> 		if (next_cpu >= nr_cpu_ids)
>>> -			next_cpu = cpumask_first(hctx->cpumask);
>>> +			next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
>> the next_cpu here could be >= nr_cpu_ids when the none of on hctx->cpumask
>> is online.
> That supposes not happen because storage device(blk-mq hw queue) is
> generally C/S model, that means the queue becomes only active when
> there is online CPU mapped to it.
>
> But it won't be true for non-block-IO queue, such as HPSA's queues[1], and
> network controller RX queues.
>
> [1]
> https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dkernel-26m-3D151601867018444-26w-3D2=DwIBAg=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ=tCZdQH6JUW1dkNCN92ycoUoKfDU_qWj-7EsUoYpOeJ0=vgHC9sbjYQb7mtY9MUJzbVXyVEyjoNJPWEx4_rfrHxU=
>
> One thing I am still not sure(but generic irq affinity supposes to deal with
> well) is that the CPU may become offline after the IO is just submitted,
> then where the IRQ controller delivers the interrupt of this hw queue
> to?
>
>> This could be reproduced on NVMe with a patch that could hold some rqs on
>> ctx->rq_list,
>> meanwhile a script online and offline the cpus. Then a panic occurred in
>> __queue_work().
> That shouldn't happen, when CPU offline happens the rqs in ctx->rq_list
> are dispatched directly, please see blk_mq_hctx_notify_dead().

Yes, I know. blk_mq_hctx_notify_dead will be invoked only after the cpu has already been set offline. Please refer to the following diagram.
CPU A (cpuhp kthread)                        CPU T (the dying cpu)

kick _cpu_down()
  -> cpuhp_thread_fun

AP_ACTIVE            (clear cpu_active_mask)
     |
     v
AP_WORKQUEUE_ONLINE  (unbind workers)
     |
     v
TEARDOWN_CPU         (stop_machine)          take_cpu_down (migration kthread)
     |                                         set_cpu_online(smp_processor_id(), false)  --> Here !!!
     |                                         (__cpu_disable)
     v
AP_SCHED_STARTING    (migrate_tasks;
     |                the cpuhp kthread
     |                is migrated away)
     v
AP_OFFLINE                                   do_idle (idle task)
     |                                         -> cpuhp_report_idle_dead
     |                                            -> complete st->done_down
     v
__cpu_die            (cpuhp kthread, teardown_cpu)
     |
     v
BRINGUP_CPU
     |
     v
BLK_MQ_DEAD          --> Here !!! (blk_mq_hctx_notify_dead)
     |
     v
OFFLINE

The cpu has already been cleared from cpu_online_mask when blk_mq_hctx_notify_dead is invoked. If the device is an NVMe, which has only one cpu mapped on the hctx, cpumask_first_and(hctx->cpumask, cpu_online_mask) will return a bad value.

I even got a backtrace showing that the panic occurred in blk_mq_hctx_notify_dead -> kblockd_schedule_delayed_work_on -> __queue_work. Kdump doesn't work well on my machine, so I cannot share the backtrace here, which is really sad.

I even added a BUG_ON as follows, and it could be triggered:

	if (next_cpu >= nr_cpu_ids)
		next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
	BUG_ON(next_cpu >= nr_cpu_ids);

Thanks
Jianchao

>
>> maybe cpu_possible_mask here, the workers in the pool of the offlined cpu
>> has been unbound.
>> It should be ok to queue on them.