Re: [PATCH] block: fix direct dispatch issue failure for clones

2018-12-06 Thread jianchao.wang



On 12/7/18 9:34 AM, Jens Axboe wrote:
> On 12/6/18 6:22 PM, jianchao.wang wrote:
>>
>>
>> On 12/7/18 9:13 AM, Jens Axboe wrote:
>>> On 12/6/18 6:04 PM, jianchao.wang wrote:
>>>>
>>>>
>>>> On 12/7/18 6:20 AM, Jens Axboe wrote:
>>>>> After the direct dispatch corruption fix, we permanently disallow direct
>>>>> dispatch of non read/write requests. This works fine off the normal IO
>>>>> path, as they will be retried like any other failed direct dispatch
>>>>> request. But for the blk_insert_cloned_request() that only DM uses to
>>>>> bypass the bottom level scheduler, we always first attempt direct
>>>>> dispatch. For some types of requests, that's now a permanent failure,
>>>>> and no amount of retrying will make that succeed.
>>>>>
>>>>> Don't use direct dispatch off the cloned insert path, always just use
>>>>> bypass inserts. This still bypasses the bottom level scheduler, which is
>>>>> what DM wants.
>>>>>
>>>>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>>>>> Signed-off-by: Jens Axboe 
>>>>>
>>>>> ---
>>>>>
>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>> index deb56932f8c4..4c44e6fa0d08 100644
>>>>> --- a/block/blk-core.c
>>>>> +++ b/block/blk-core.c
>>>>> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct 
>>>>> request_queue *q, struct request *
>>>>>* bypass a potential scheduler on the bottom device for
>>>>>* insert.
>>>>>*/
>>>>> - return blk_mq_request_issue_directly(rq);
>>>>> + blk_mq_request_bypass_insert(rq, true);
>>>>> + return BLK_STS_OK;
>>>>>   }
>>>>>  
>>>>>   spin_lock_irqsave(q->queue_lock, flags);
>>>>>
>>>> Not sure about this because it will break the merging promotion for 
>>>> request based DM
>>>> from Ming.
>>>> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
>>>> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
>>>> feedback)
>>>>
>>>> We could use some other way to fix this.
>>>
>>> That really shouldn't matter as this is the cloned insert, merging should
>>> have been done on the original request.
>>>
>>>
>> Just quote some comments from the patch.
>>
>> "
>>But dm-rq currently can't get the underlying queue's
>> dispatch feedback at all.  Without knowing whether a request was issued
>> or not (e.g. due to underlying queue being busy) the dm-rq elevator will
>> not be able to provide effective IO merging (as a side-effect of dm-rq
>> currently blindly destaging a request from its elevator only to requeue
>> it after a delay, which kills any opportunity for merging).  This
>> obviously causes very bad sequential IO performance.
>> ...
>> With this, request-based DM's blk-mq sequential IO performance is vastly
>> improved (as much as 3X in mpath/virtio-scsi testing)
>> "
>>
>> Using blk_mq_request_bypass_insert to replace the 
>> blk_mq_request_issue_directly
>> could be a fast method to fix the current issue. Maybe we could get the 
>> merging
>> promotion back after some time.
> 
> This really sucks, mostly because DM wants to have it both ways - not use
> the bottom level IO scheduler, but still actually use it if it makes sense.
> 
> There is another way to fix this - still do the direct dispatch, but have
> dm track if it failed and do bypass insert in that case. I didn't want to do
> that since it's more involved, but it's doable.
> 
> Let me cook that up and test it... Don't like it, though.
> 
Actually, I have tried to fix this issue in the 1st patch of my patchset
"blk-mq: refactor code of issue directly".

Just insert the non-read/write command into the dispatch list directly and return
BLK_STS_OK, as sketched below.
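
For illustration, a rough sketch of that idea (not the posted patch; it is
essentially what blk_mq_request_bypass_insert(rq, true) does under the hood):

static blk_status_t issue_directly_or_queue(struct blk_mq_hw_ctx *hctx,
                                            struct request *rq)
{
        /*
         * Non-read/write (passthrough) requests can no longer be issued
         * directly, so park them on the hctx dispatch list and report
         * success instead of bouncing them back to the caller.
         */
        if (blk_rq_is_passthrough(rq)) {
                spin_lock(&hctx->lock);
                list_add_tail(&rq->queuelist, &hctx->dispatch);
                spin_unlock(&hctx->lock);
                blk_mq_run_hw_queue(hctx, false);
                return BLK_STS_OK;
        }

        return blk_mq_request_issue_directly(rq);
}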

Thanks
Jianchao


Re: [PATCH] block: fix direct dispatch issue failure for clones

2018-12-06 Thread jianchao.wang



On 12/7/18 9:13 AM, Jens Axboe wrote:
> On 12/6/18 6:04 PM, jianchao.wang wrote:
>>
>>
>> On 12/7/18 6:20 AM, Jens Axboe wrote:
>>> After the direct dispatch corruption fix, we permanently disallow direct
>>> dispatch of non read/write requests. This works fine off the normal IO
>>> path, as they will be retried like any other failed direct dispatch
>>> request. But for the blk_insert_cloned_request() that only DM uses to
>>> bypass the bottom level scheduler, we always first attempt direct
>>> dispatch. For some types of requests, that's now a permanent failure,
>>> and no amount of retrying will make that succeed.
>>>
>>> Don't use direct dispatch off the cloned insert path, always just use
>>> bypass inserts. This still bypasses the bottom level scheduler, which is
>>> what DM wants.
>>>
>>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>>> Signed-off-by: Jens Axboe 
>>>
>>> ---
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index deb56932f8c4..4c44e6fa0d08 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct 
>>> request_queue *q, struct request *
>>>  * bypass a potential scheduler on the bottom device for
>>>  * insert.
>>>  */
>>> -   return blk_mq_request_issue_directly(rq);
>>> +   blk_mq_request_bypass_insert(rq, true);
>>> +   return BLK_STS_OK;
>>> }
>>>  
>>> spin_lock_irqsave(q->queue_lock, flags);
>>>
>> Not sure about this because it will break the merging promotion for request 
>> based DM
>> from Ming.
>> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
>> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
>> feedback)
>>
>> We could use some other way to fix this.
> 
> That really shouldn't matter as this is the cloned insert, merging should
> have been done on the original request.
> 
> 
Just quote some comments from the patch.

"
   But dm-rq currently can't get the underlying queue's
dispatch feedback at all.  Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging).  This
obviously causes very bad sequential IO performance.
...
With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing)
"

Using blk_mq_request_bypass_insert to replace the blk_mq_request_issue_directly
could be a fast method to fix the current issue. Maybe we could get the merging
promotion back after some time.



Thanks
Jianchao



Re: [PATCH] block: fix direct dispatch issue failure for clones

2018-12-06 Thread jianchao.wang



On 12/7/18 6:20 AM, Jens Axboe wrote:
> After the direct dispatch corruption fix, we permanently disallow direct
> dispatch of non read/write requests. This works fine off the normal IO
> path, as they will be retried like any other failed direct dispatch
> request. But for the blk_insert_cloned_request() that only DM uses to
> bypass the bottom level scheduler, we always first attempt direct
> dispatch. For some types of requests, that's now a permanent failure,
> and no amount of retrying will make that succeed.
> 
> Don't use direct dispatch off the cloned insert path, always just use
> bypass inserts. This still bypasses the bottom level scheduler, which is
> what DM wants.
> 
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index deb56932f8c4..4c44e6fa0d08 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct 
> request_queue *q, struct request *
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - return blk_mq_request_issue_directly(rq);
> + blk_mq_request_bypass_insert(rq, true);
> + return BLK_STS_OK;
>   }
>  
>   spin_lock_irqsave(q->queue_lock, flags);
> 
Not sure about this because it will break the merging promotion for request 
based DM
from Ming.
396eaf21ee17c476e8f66249fb1f4a39003d0ab4
(blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback)

We could use some other way to fix this.

Thanks
Jianchao


Re: [PATCH 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance

2018-11-15 Thread jianchao.wang



On 11/16/18 11:28 AM, Ming Lei wrote:
...
>  
> +struct blk_mq_kobj {
> + struct kobject kobj;
> +};
> +
>  static void blk_mq_sysfs_release(struct kobject *kobj)
>  {
> + struct blk_mq_kobj *mq_kobj = container_of(kobj, struct blk_mq_kobj,
> +kobj);
> + kfree(mq_kobj);
> +}
> +
...
>  
> -void blk_mq_sysfs_init(struct request_queue *q)
> +int blk_mq_sysfs_init(struct request_queue *q)
>  {
>   struct blk_mq_ctx *ctx;
>   int cpu;
> + struct blk_mq_kobj *mq_kobj;
> +
> + mq_kobj = kzalloc(sizeof(struct blk_mq_kobj), GFP_KERNEL);
> + if (!mq_kobj)
> + return -ENOMEM;
>  
> - kobject_init(&q->mq_kobj, &blk_mq_ktype);
> + kobject_init(&mq_kobj->kobj, &blk_mq_ktype);
>  
>   for_each_possible_cpu(cpu) {
> - ctx = per_cpu_ptr(q->queue_ctx, cpu);
> + ctx = kzalloc_node(sizeof(*ctx), GFP_KERNEL, cpu_to_node(cpu));
> + if (!ctx)
> + goto fail;
> + *per_cpu_ptr(q->queue_ctx, cpu) = ctx;
>   kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
>   }
> + q->mq_kobj = &mq_kobj->kobj;
> + return 0;
> +
> + fail:
> + for_each_possible_cpu(cpu) {
> + ctx = *per_cpu_ptr(q->queue_ctx, cpu);
> + if (ctx)
> + kobject_put(&ctx->kobj);
> + }
> + kobject_put(&mq_kobj->kobj);
> + return -ENOMEM;
>  }


blk_mq_kobj looks meaningless; why do we need it, or am I missing something?
And maybe we should allocate the ctx in blk_mq_init_allocated_queue() instead.

Thanks
Jianchao



Re: [PATCH 05/11] block: avoid ordered task state change for polled IO

2018-11-13 Thread jianchao.wang
Hi Jens

On 11/13/18 11:42 PM, Jens Axboe wrote:
> @@ -181,6 +181,7 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>   struct task_struct *waiter = bio->bi_private;
>  
>   WRITE_ONCE(bio->bi_private, NULL);
> + smp_wmb();
>   wake_up_process(waiter);
>  }


The wake-up path already contains a full memory barrier, via the raw_spin_lock
and the following smp_mb__after_spinlock().

Please refer to:

wake_up_process
  -> try_to_wake_up

raw_spin_lock_irqsave(&p->pi_lock, flags);
smp_mb__after_spinlock();

So the smp_wmb here is not necessary.

Thanks
Jianchao


Re: [PATCH 4/6] block: avoid ordered task state change for polled IO

2018-11-12 Thread jianchao.wang
Hi Jens

On 11/13/18 12:26 AM, Jens Axboe wrote:
> On 11/12/18 2:35 AM, jianchao.wang wrote:
>> Hi Jens
>>
>> On 11/10/18 11:13 PM, Jens Axboe wrote:
>>> We only really need the barrier if we're going to be sleeping,
>>> if we're just polling we're fine with the __set_current_state()
>>> variant.
>>>
>>> Signed-off-by: Jens Axboe 
>>> ---
>>>  fs/block_dev.c | 25 +
>>>  1 file changed, 21 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>>> index c039abfb2052..e7550b1f9670 100644
>>> --- a/fs/block_dev.c
>>> +++ b/fs/block_dev.c
>>> @@ -237,12 +237,23 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
>>> iov_iter *iter,
>>>  
>>> 	qc = submit_bio(&bio);
>>> for (;;) {
>>> -   set_current_state(TASK_UNINTERRUPTIBLE);
>>> +   /*
>>> +* Using non-atomic task state for polling
>>> +*/
>>> +   __set_current_state(TASK_UNINTERRUPTIBLE);
>>> +
>>> if (!READ_ONCE(bio.bi_private))
>>> break;
>>
>> When we don't use polling, the blkdev_bio_end_io_simple may come on a 
>> different cpu before
>> submit_bio returns 
>> For example,
>> __blkdev_direct_IO_simple
>>   qc = submit_bio(&bio) 
>>   blkdev_bio_end_io_simple
>> WRITE_ONCE(bio.bi_private, NULL)
>> wake_up_process(waiter)
>>   __set_current_state(TASK_UNINTERRUPTIBLE)  
>>   READ_ONCE(bio.bi_private)
> 
> While I agree that you are right, that's an existing issue. The store
> barrier implied by set_current_state() doesn't fix that.
> 
> How about this variant:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-perf&id=036ba7a8d1334889c0fe55d4858d78f4e8022f12
> 
> which then changes the later wake up helper patch to be:
>  
> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-perf&id=f8c3f188425967adb040cfefb799b0a5a1df769d
> 

The wake-up path already contains a full memory barrier, via the raw_spin_lock and
the following smp_mb__after_spinlock():

wake_up_process
  -> try_to_wake_up

raw_spin_lock_irqsave(&p->pi_lock, flags);
smp_mb__after_spinlock();

So a smp_rmb() could be enough. :)
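
For illustration only, a sketch of a polled-wait helper with the smp_rmb()
placed as suggested (a hypothetical helper, not the posted patch):

static void poll_wait_for_bio(struct bio *bio, struct kiocb *iocb,
                              struct block_device *bdev, blk_qc_t qc)
{
        for (;;) {
                __set_current_state(TASK_UNINTERRUPTIBLE);

                /*
                 * Paired with the full barrier implied by wake_up_process()
                 * on the completion side; ensures a completion that ran on
                 * another CPU before we set the task state is observed.
                 */
                smp_rmb();
                if (!READ_ONCE(bio->bi_private))
                        break;

                if (!(iocb->ki_flags & IOCB_HIPRI) ||
                    !blk_poll(bdev_get_queue(bdev), qc))
                        io_schedule();
        }
        __set_current_state(TASK_RUNNING);
}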

Thanks
Jianchao


Re: [PATCH 4/6] block: avoid ordered task state change for polled IO

2018-11-12 Thread jianchao.wang
Hi Jens

On 11/10/18 11:13 PM, Jens Axboe wrote:
> We only really need the barrier if we're going to be sleeping,
> if we're just polling we're fine with the __set_current_state()
> variant.
> 
> Signed-off-by: Jens Axboe 
> ---
>  fs/block_dev.c | 25 +
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index c039abfb2052..e7550b1f9670 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -237,12 +237,23 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
> iov_iter *iter,
>  
>   qc = submit_bio(&bio);
>   for (;;) {
> - set_current_state(TASK_UNINTERRUPTIBLE);
> + /*
> +  * Using non-atomic task state for polling
> +  */
> + __set_current_state(TASK_UNINTERRUPTIBLE);
> +
>   if (!READ_ONCE(bio.bi_private))
>   break;

When we don't use polling, blkdev_bio_end_io_simple() may run on a different
CPU before submit_bio() returns. For example:

__blkdev_direct_IO_simple                    blkdev_bio_end_io_simple
  qc = submit_bio(&bio)
                                               WRITE_ONCE(bio.bi_private, NULL)
                                               wake_up_process(waiter)
  __set_current_state(TASK_UNINTERRUPTIBLE)
  READ_ONCE(bio.bi_private)

At that moment, at least an smp_rmb() may be needed before READ_ONCE(bio.bi_private)
to ensure the WRITE_ONCE(bio.bi_private, NULL) is seen if we miss the wakeup.

> +
>   if (!(iocb->ki_flags & IOCB_HIPRI) ||
> - !blk_poll(bdev_get_queue(bdev), qc))
> + !blk_poll(bdev_get_queue(bdev), qc)) {
> + /*
> +  * If we're scheduling, ensure that task state
> +  * change is ordered and seen.
> +  */
> + smp_mb();
>   io_schedule();
> + }
>   }
>   __set_current_state(TASK_RUNNING);
>  
> @@ -403,13 +414,19 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, int nr_pages)
>   return -EIOCBQUEUED;
>  
>   for (;;) {
> - set_current_state(TASK_UNINTERRUPTIBLE);
> + /*
> +  * See __blkdev_direct_IO_simple() loop
> +  */
> + __set_current_state(TASK_UNINTERRUPTIBLE);
> +

Similar with above.

>   if (!READ_ONCE(dio->waiter))
>   break;
>  
>   if (!(iocb->ki_flags & IOCB_HIPRI) ||
> - !blk_poll(bdev_get_queue(bdev), qc))
> + !blk_poll(bdev_get_queue(bdev), qc)) {
> + smp_mb();
>   io_schedule();
> + }
>   }
>   __set_current_state(TASK_RUNNING);
>  
> 

Thanks
Jianchao


Re: [PATCHSET v2 0/2] Add queue_is_busy helper

2018-11-08 Thread jianchao.wang



On 11/9/18 9:53 AM, jianchao.wang wrote:
> Hi Jens
> 
> On 11/9/18 12:06 AM, Jens Axboe wrote:
>> DM currently uses atomic inc/dec to maintain a busy count of
>> IO on a given device. For the dm-mq path, we can replace this
>> with helper that just checks the state of the tags on the device.
>>
>> First patch is a prep patch that allows the iteration helpers
>> to return true/false, like we support internally in sbitmap.
>> For a busy check we don't care about how many requests are
>> busy, just if some are or not. Hence we can stop iterating
>> tags as soon as we find one that is allocated.
> 
> If we don't care about how many requests are busy, why not check
> sb->map[idx].word directly? It could be more efficient.
> 

Oh, we need to check rq->q for the tag shared case.

> Thanks
> Jianchao
> 


Re: [PATCHSET v2 0/2] Add queue_is_busy helper

2018-11-08 Thread jianchao.wang
Hi Jens

On 11/9/18 12:06 AM, Jens Axboe wrote:
> DM currently uses atomic inc/dec to maintain a busy count of
> IO on a given device. For the dm-mq path, we can replace this
> with helper that just checks the state of the tags on the device.
> 
> First patch is a prep patch that allows the iteration helpers
> to return true/false, like we support internally in sbitmap.
> For a busy check we don't care about how many requests are
> busy, just if some are or not. Hence we can stop iterating
> tags as soon as we find one that is allocated.

If we don't care about how many requests are busy, why not check
sb->map[idx].word directly? It could be more efficient.
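
A rough sketch of that check (illustration only; field names follow the
4.19-era struct sbitmap, and the existing sbitmap_any_bit_set() helper does
essentially this). As noted in the follow-up, this alone is not enough for the
shared-tag case, where rq->q still has to be checked:

static bool tag_words_have_bit_set(const struct sbitmap *sb)
{
        unsigned int i;

        /* Report busy as soon as any word has a tag bit set. */
        for (i = 0; i < sb->map_nr; i++)
                if (READ_ONCE(sb->map[i].word))
                        return true;

        return false;
}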

Thanks
Jianchao


Re: [PATCH 01/14] blk-mq: kill q->mq_map

2018-10-29 Thread jianchao.wang



On 10/29/18 11:00 AM, Jens Axboe wrote:
> On 10/28/18 8:40 PM, jianchao.wang wrote:
>>
>>
>> On 10/29/18 10:02 AM, jianchao.wang wrote:
>>> Hi Jens
>>>
>>> On 10/28/18 12:52 AM, Jens Axboe wrote:
>>>> On 10/27/18 10:48 AM, Jens Axboe wrote:
>>>>> On 10/27/18 8:19 AM, jianchao.wang wrote:
>>>>>> Hi Jens
>>>>>>
>>>>>> On 10/26/18 5:16 AM, Jens Axboe wrote:
>>>>>>> It's just a pointer to set->mq_map, use that instead.
>>>>>> Instead of using the set->mq_map and then a two-dimensional set->mq_map,
>>>>>> how about migrate the mq_map from per-set to per-cpuctx ?
>>>>>> something like:
>>>>>> q->queue_hw_ctx[ctx->map[type]]
>>>>>
>>>>> I think the current series is pretty clean in that regard, it goes
>>>>> from members -> map -> map array. I'd be willing to look at a
>>>>> conversion on top of that, if it makes things cleaner.
>>>>
>>>> On top of that, this:
>>>>
>>>> q->queue_hw_ctx[set->map[type].mq_map[cpu]]
>>>>
>>>> is one less pointer dereference, so more efficient.
>>>>
>>>
>>> Saving the blk_mq_hw_ctx into the blk_mq_ctx directly may be more direct and
>>> efficient.
>>> something like,
>>>
>>>ctx->map_hctxs[type]
>>>
>>> all of the memory accessing is local.
>>>
>>> What we need to pay for this is do mapping for every request_queue.
>>
>> Actually not, this could be done in blk_mq_map_swqueue.
> 
> I'm not interested in pursuing this for this series. If you want
> to work on it on top and have it tested and evaluated separately,
> then that's obviously fine. But there's no point in mixing it
> up with these changes, as it's an orthogonal change.
> 

Yes, got it.
Thanks for your kind response.

Sincerely
Jianchao


Re: [PATCH 01/14] blk-mq: kill q->mq_map

2018-10-28 Thread jianchao.wang



On 10/29/18 10:02 AM, jianchao.wang wrote:
> Hi Jens
> 
> On 10/28/18 12:52 AM, Jens Axboe wrote:
>> On 10/27/18 10:48 AM, Jens Axboe wrote:
>>> On 10/27/18 8:19 AM, jianchao.wang wrote:
>>>> Hi Jens
>>>>
>>>> On 10/26/18 5:16 AM, Jens Axboe wrote:
>>>>> It's just a pointer to set->mq_map, use that instead.
>>>> Instead of using the set->mq_map and then a two-dimensional set->mq_map,
>>>> how about migrate the mq_map from per-set to per-cpuctx ?
>>>> something like:
>>>> q->queue_hw_ctx[ctx->map[type]]
>>>
>>> I think the current series is pretty clean in that regard, it goes
>>> from members -> map -> map array. I'd be willing to look at a
>>> conversion on top of that, if it makes things cleaner.
>>
>> On top of that, this:
>>
>> q->queue_hw_ctx[set->map[type].mq_map[cpu]]
>>
>> is one less pointer dereference, so more efficient.
>>
> 
> Saving the blk_mq_hw_ctx into the blk_mq_ctx directly may be more direct and
> efficient.
> something like,
> 
>ctx->map_hctxs[type]
> 
> all of the memory accessing is local.
> 
> What we need to pay for this is do mapping for every request_queue.

Actually not, this could be done in blk_mq_map_swqueue.

> 
> Thanks
> Jianchao
> 
> 
> 


Re: [PATCH 01/14] blk-mq: kill q->mq_map

2018-10-28 Thread jianchao.wang
Hi Jens

On 10/28/18 12:52 AM, Jens Axboe wrote:
> On 10/27/18 10:48 AM, Jens Axboe wrote:
>> On 10/27/18 8:19 AM, jianchao.wang wrote:
>>> Hi Jens
>>>
>>> On 10/26/18 5:16 AM, Jens Axboe wrote:
>>>> It's just a pointer to set->mq_map, use that instead.
>>> Instead of using the set->mq_map and then a two-dimensional set->mq_map,
>>> how about migrate the mq_map from per-set to per-cpuctx ?
>>> something like:
>>> q->queue_hw_ctx[ctx->map[type]]
>>
>> I think the current series is pretty clean in that regard, it goes
>> from members -> map -> map array. I'd be willing to look at a
>> conversion on top of that, if it makes things cleaner.
> 
> On top of that, this:
> 
> q->queue_hw_ctx[set->map[type].mq_map[cpu]]
> 
> is one less pointer dereference, so more efficient.
> 

Saving the blk_mq_hw_ctx into the blk_mq_ctx directly may be more direct and
efficient, something like:

   ctx->map_hctxs[type]

All of the memory accesses are local.

What we need to pay for this is doing the mapping for every request_queue.
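
A hypothetical sketch of that per-ctx mapping (the field and enum names here
are illustrative, not from the posted series): cache the hctx pointer per type
in each software context when the queue is mapped, e.g. from
blk_mq_map_swqueue(), so the hot-path lookup is purely local:

enum sketch_hctx_type {
        SKETCH_HCTX_TYPE_DEFAULT,
        SKETCH_HCTX_TYPE_READ,
        SKETCH_HCTX_TYPE_POLL,
        SKETCH_HCTX_TYPE_MAX,
};

struct sketch_blk_mq_ctx {
        /* ... existing blk_mq_ctx fields ... */
        struct blk_mq_hw_ctx *map_hctxs[SKETCH_HCTX_TYPE_MAX];
};

static inline struct blk_mq_hw_ctx *
sketch_blk_mq_map_queue(struct sketch_blk_mq_ctx *ctx,
                        enum sketch_hctx_type type)
{
        return ctx->map_hctxs[type];
}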

Thanks
Jianchao




Re: [PATCH 01/14] blk-mq: kill q->mq_map

2018-10-27 Thread jianchao.wang
Hi Jens

On 10/26/18 5:16 AM, Jens Axboe wrote:
> It's just a pointer to set->mq_map, use that instead.
Instead of using the set->mq_map and then a two-dimensional set->mq_map,
how about migrate the mq_map from per-set to per-cpuctx ?
something like:
q->queue_hw_ctx[ctx->map[type]]


Thanks
Jianchao


Re: [PATCH] blk-mq: place trace_block_getrq() in correct place

2018-10-23 Thread jianchao.wang



On 10/23/18 10:30 PM, Xiaoguang Wang wrote:
> trace_block_getrq() is meant to indicate that a request struct has been allocated
> for the queue, so put it in the right place.
> 
> Signed-off-by: Xiaoguang Wang 
> ---
>  block/blk-mq.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e3c39ea8e17b..93aec25f1340 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1833,8 +1833,6 @@ static blk_qc_t blk_mq_make_request(struct 
> request_queue *q, struct bio *bio)
>  
>   rq_qos_throttle(q, bio, NULL);
>  
> - trace_block_getrq(q, bio, bio->bi_opf);
> -
>   rq = blk_mq_get_request(q, bio, bio->bi_opf, &data);
>   if (unlikely(!rq)) {
>   rq_qos_cleanup(q, bio);
> @@ -1843,6 +1841,8 @@ static blk_qc_t blk_mq_make_request(struct 
> request_queue *q, struct bio *bio)
>   return BLK_QC_T_NONE;
>   }
>  
> + trace_block_getrq(q, bio, bio->bi_opf);
> +
>   rq_qos_track(q, rq, bio);
>  
>   cookie = request_to_qc_t(data.hctx, rq);
> 

Looks fine.

Reviewed-by:  Jianchao Wang 

Thanks
Jianchao


Re: [PATCH] blk-mq: complete req in softirq context in case of single queue

2018-09-26 Thread jianchao.wang
Hi Ming

On 09/27/2018 12:08 AM, Ming Lei wrote:
> Lot of controllers may have only one irq vector for completing IO
> request. And usually affinity of the only irq vector is all possible
> CPUs, however, on most of ARCH, there may be only one specific CPU
> for handling this interrupt.
> 
> So if all IOs are completed in hardirq context, it is inevitable to
> degrade IO performance because of increased irq latency.
> 
> This patch tries to address this issue by allowing to complete request
> in softirq context, like the legacy IO path.
> 
> IOPS is observed as ~13%+ in the following randread test on raid0 over
> virtio-scsi.
> 
> mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 --raid-devices=8 
> /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi
> 
> fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 
> --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 
> --verify=0 --verify_fatal=0 --numjobs=32 --rw=randread --blocksize=4k
> 
> Cc: Zach Marano 
> Cc: Christoph Hellwig 
> Cc: Bart Van Assche 
> Cc: Jianchao Wang 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq.c  | 14 ++
>  block/blk-softirq.c |  7 +--
>  2 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 85a1c1a59c72..d4792c3ac983 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -565,6 +565,20 @@ static void __blk_mq_complete_request(struct request *rq)
>   if (rq->internal_tag != -1)
>   blk_mq_sched_completed_request(rq);
>  
> + /*
> +  * Most of single queue controllers, there is only one irq vector
> +  * for handling IO completion, and the only irq's affinity is set
> +  * as all possible CPUs. On most of ARCHs, this affinity means the
> +  * irq is handled on one specific CPU.
> +  *
> +  * So complete IO request in softirq context in case of single queue
> +  * for not degrading IO performance by irqsoff latency.
> +  */
> + if (rq->q->nr_hw_queues == 1) {
> + __blk_complete_request(rq);
> + return;
> + }
> +
>   if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
>   rq->q->softirq_done_fn(rq);
>   return;
> diff --git a/block/blk-softirq.c b/block/blk-softirq.c
> index 15c1f5e12eb8..b1df9b6c1731 100644
> --- a/block/blk-softirq.c
> +++ b/block/blk-softirq.c
> @@ -101,17 +101,20 @@ void __blk_complete_request(struct request *req)
>   struct request_queue *q = req->q;
>   unsigned long flags;
>   bool shared = false;
> + int rq_cpu;
>  
>   BUG_ON(!q->softirq_done_fn);
>  
> + rq_cpu = q->mq_ops ? req->mq_ctx->cpu : req->cpu;
> +
>   local_irq_save(flags);
>   cpu = smp_processor_id();
>  
>   /*
>* Select completion CPU
>*/
> - if (req->cpu != -1) {
> - ccpu = req->cpu;
> + if (rq_cpu != -1) {
> + ccpu = q->mq_ops ? req->mq_ctx->cpu : req->cpu;
>   if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
>   shared = cpus_share_cache(cpu, ccpu);
>   } else
> 

Only when QUEUE_FLAG_SAME_COMP is set should we try to complete the request on the CPU
where it was allocated.

For blk-legacy, req->cpu is -1 when QUEUE_FLAG_SAME_COMP is not set:

blk_queue_bio

if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
req->cpu = raw_smp_processor_id()

But for blk-mq, req->mq_ctx->cpu is always valid.
So we should also check QUEUE_FLAG_SAME_COMP for the blk-mq case.

On the other hand, how about splitting the softirq completion part out of
__blk_complete_request()? It is confusing to find __blk_complete_request()
inside __blk_mq_complete_request().
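
A small sketch of the check being suggested (illustration only, not a posted
patch): pick the completion CPU only when QUEUE_FLAG_SAME_COMP is set, for both
blk-mq and blk-legacy, with -1 meaning "complete locally":

static int blk_complete_target_cpu(struct request *req)
{
        struct request_queue *q = req->q;

        if (!test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
                return -1;      /* complete on the local CPU */

        return q->mq_ops ? req->mq_ctx->cpu : req->cpu;
}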

Thanks
Jianchao


Re: [PATCHv2] blk-mq: Allow blocking queue tag iter callbacks

2018-09-25 Thread jianchao.wang
Hi Keith

On 09/26/2018 12:36 AM, Keith Busch wrote:
> A recent commit runs tag iterator callbacks under the rcu read lock,
> but existing callbacks do not satisfy the non-blocking requirement.
> The commit intended to prevent an iterator from accessing a queue that's
> being modified. This patch fixes the original issue by taking a queue
> reference instead of reading it, which allows callbacks to make blocking
> calls.
> 
> Fixes: f5e4d6357 ("blk-mq: sync the update nr_hw_queues with 
> blk_mq_queue_tag_busy_iter")
> Cc: Jianchao Wang 
> Signed-off-by: Keith Busch 
> ---
> v1 -> v2:
> 
>   Leave the iterator interface as-is, making this change transparent
>   to the callers. This will increment the queue usage counter in the
>   timeout path twice, which is harmless.
> 

This is OK to me.
Acked-by: Jianchao Wang 

>   Changelog.
> 
>  block/blk-mq-tag.c | 13 -
>  1 file changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 94e1ed667b6e..41317c50a446 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -322,16 +322,11 @@ void blk_mq_queue_tag_busy_iter(struct request_queue 
> *q, busy_iter_fn *fn,
>  
>   /*
>* __blk_mq_update_nr_hw_queues will update the nr_hw_queues and
> -  * queue_hw_ctx after freeze the queue. So we could use q_usage_counter
> -  * to avoid race with it. __blk_mq_update_nr_hw_queues will users
> -  * synchronize_rcu to ensure all of the users go out of the critical
> -  * section below and see zeroed q_usage_counter.
> +  * queue_hw_ctx after freeze the queue, so we use q_usage_counter
> +  * to avoid race with it.
>*/
> - rcu_read_lock();
> - if (percpu_ref_is_zero(&q->q_usage_counter)) {
> - rcu_read_unlock();
> + if (!percpu_ref_tryget(&q->q_usage_counter))
>   return;
> - }
>  
>   queue_for_each_hw_ctx(q, hctx, i) {
>   struct blk_mq_tags *tags = hctx->tags;
> @@ -347,7 +342,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
> busy_iter_fn *fn,
>   bt_for_each(hctx, &tags->breserved_tags, fn, priv, 
> true);
>   bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
>   }
> - rcu_read_unlock();
> + blk_queue_exit(q);
>  }
>  
>  static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
> 


Re: [PATCH] blk-mq: Allow blocking queue tag iter callbacks

2018-09-24 Thread jianchao.wang
Hi Bart

On 09/25/2018 10:20 AM, Bart Van Assche wrote:
> On 9/24/18 7:11 PM, jianchao.wang wrote:
>> Hi Keith
>>
>> On 09/25/2018 05:09 AM, Keith Busch wrote:
>>> -    /* A deadlock might occur if a request is stuck requiring a
>>> - * timeout at the same time a queue freeze is waiting
>>> - * completion, since the timeout code would not be able to
>>> - * acquire the queue reference here.
>>> - *
>>> - * That's why we don't use blk_queue_enter here; instead, we use
>>> - * percpu_ref_tryget directly, because we need to be able to
>>> - * obtain a reference even in the short window between the queue
>>> - * starting to freeze, by dropping the first reference in
>>> - * blk_freeze_queue_start, and the moment the last request is
>>> - * consumed, marked by the instant q_usage_counter reaches
>>> - * zero.
>>> - */
>>> -    if (!percpu_ref_tryget(&q->q_usage_counter))
>>> +    if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next))
>>>   return;
>>
>> We cannot discard the percpu_ref_tryget here.
>>
>> The following code in blk_mq_timeout_work() still needs it:
>>
>> if (next != 0) {
>>     mod_timer(&q->timeout, next);
>> } else {
>>     queue_for_each_hw_ctx(q, hctx, i) {
>>     /* the hctx may be unmapped, so check it here */
>>     if (blk_mq_hw_queue_mapped(hctx))
>>     blk_mq_tag_idle(hctx);
>>     }
>> }
> 
> Hi Jianchao,
> 
> Had you noticed that the percpu_ref_tryget() call has been moved into
> blk_mq_queue_tag_busy_iter()?
> 

Yes.

But the issue is that the remaining part of blk_mq_timeout_work() now runs outside
the protection of the queue refcount.
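
For illustration, one way to address that (a sketch only, names follow the
4.19-era code): keep an outer reference across the whole timeout work, even if
the iterator also takes its own reference internally:

static void blk_mq_timeout_work_sketch(struct work_struct *work)
{
        struct request_queue *q =
                container_of(work, struct request_queue, timeout_work);
        unsigned long next = 0;
        struct blk_mq_hw_ctx *hctx;
        int i;

        if (!percpu_ref_tryget(&q->q_usage_counter))
                return;

        blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);

        if (next != 0) {
                mod_timer(&q->timeout, next);
        } else {
                queue_for_each_hw_ctx(q, hctx, i) {
                        /* the hctx may be unmapped, so check it here */
                        if (blk_mq_hw_queue_mapped(hctx))
                                blk_mq_tag_idle(hctx);
                }
        }
        blk_queue_exit(q);
}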

Thanks
Jianchao


Re: [PATCH] blk-mq: Allow blocking queue tag iter callbacks

2018-09-24 Thread jianchao.wang
Hi Keith

On 09/25/2018 05:09 AM, Keith Busch wrote:
> - /* A deadlock might occur if a request is stuck requiring a
> -  * timeout at the same time a queue freeze is waiting
> -  * completion, since the timeout code would not be able to
> -  * acquire the queue reference here.
> -  *
> -  * That's why we don't use blk_queue_enter here; instead, we use
> -  * percpu_ref_tryget directly, because we need to be able to
> -  * obtain a reference even in the short window between the queue
> -  * starting to freeze, by dropping the first reference in
> -  * blk_freeze_queue_start, and the moment the last request is
> -  * consumed, marked by the instant q_usage_counter reaches
> -  * zero.
> -  */
> - if (!percpu_ref_tryget(&q->q_usage_counter))
> + if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next))
>   return;

We cannot discard the percpu_ref_tryget here.

The following code in blk_mq_timeout_work() still needs it:

if (next != 0) {
mod_timer(&q->timeout, next);
} else {
queue_for_each_hw_ctx(q, hctx, i) {
/* the hctx may be unmapped, so check it here */
if (blk_mq_hw_queue_mapped(hctx))
blk_mq_tag_idle(hctx);
}
}

Thanks
Jianchao


Re: [PATCH v7 5/6] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-18 Thread jianchao.wang
Hi Bart

On 09/19/2018 01:44 AM, Bart Van Assche wrote:
>>> + * ago. Since blk_queue_enter() is called by the request allocation
>>> + * code before pm_request_resume(), if no requests have a tag assigned
>>> + * it is safe to suspend the device.
>>> + */
>>> +    ret = -EBUSY;
>>> +    if (blk_requests_in_flight(q) == 0) {
>>> +    /*
>>> + * Call synchronize_rcu() such that later blk_queue_enter()
>>> + * calls see the pm-only state. See also
>>> + * 
>>> http://lwn.net/Articles/573497/.
>>> + */
>>> +    synchronize_rcu();
>>> +    if (blk_requests_in_flight(q) == 0)
>>
>> Seems not safe here.
>>
>> For blk-mq path:
>> Someone may have escaped from the preempt checking, missed the 
>> blk_pm_request_resume there,
>> entered into generic_make_request, but have not allocated request and 
>> occupied any tag.
>>
>> There could be a similar scenario for blk-legacy path, the q->nr_pending is 
>> increased when
>> request is queued.
>>
>> So I guess the q_usage_counter checking is still needed here.
> 
> There is only one blk_pm_request_resume() call and that call is inside 
> blk_queue_enter() after the pm_only counter has been checked.
> 
> For the legacy block layer, nr_pending is increased after the 
> blk_queue_enter() call from inside blk_old_get_request() succeeded.
> 
> So I don't see how blk_pm_request_resume() or q->nr_pending++ could escape 
> from the preempt checking?

For example:


blk_pre_runtime_suspend                    generic_make_request
                                             blk_queue_enter  // succeeds here,
                                                              // no blk_pm_request_resume
                                             blk_mq_make_request
blk_set_pm_only

if (blk_requests_in_flight(q) == 0) {
  synchronize_rcu();
  if (blk_requests_in_flight(q) == 0)
    ret = 0;
}
                                             blk_mq_get_request

Thanks
Jianchao


Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM

2018-09-18 Thread jianchao.wang
Hi Ming

On 09/18/2018 03:42 PM, Ming Lei wrote:
> On Tue, Sep 18, 2018 at 09:17:12AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 09/17/2018 08:07 PM, Ming Lei wrote:
>>>>> This way will delay runtime pm or system suspend until the queue is 
>>>>> unfrozen,
>>>>> but it isn't reasonable.
>>>> This interface is for the __scsi_execute only, before we call into 
>>>> function, we should have
>>>> get the device resumed synchronously.
>>> I mean when the queue is frozen, it is reasonable to runtime suspend the 
>>> queue. However,
>>> blk_queue_preempt_enter() is still waiting for queue becoming unfreezing 
>>> first.
>>
>> We don't freeze the queue, but set preempt-only mode when suspend the queue. 
>> :)
> 
> But the queue can be frozen by other paths. Even though it is frozen, the 
> queue
> should be allowed to suspend too.
> 

Yes, the race between freeze and preempt-only is a tricky issue.

Thanks
Jianchao


Re: [PATCH v7 5/6] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-17 Thread jianchao.wang
Hi Bart

On 09/18/2018 04:15 AM, Bart Van Assche wrote:
> Instead of allowing requests that are not power management requests
> to enter the queue in runtime suspended status (RPM_SUSPENDED), make
> the blk_get_request() caller block. This change fixes a starvation
> issue: it is now guaranteed that power management requests will be
> executed no matter how many blk_get_request() callers are waiting.
> Instead of maintaining the q->nr_pending counter, rely on
> q->q_usage_counter. 

Looks like we still depend on this nr_pending for blk-legacy.

> Call pm_runtime_mark_last_busy() every time a
> request finishes instead of only if the queue depth drops to zero.
...
>   /*
> diff --git a/block/blk-pm.c b/block/blk-pm.c
> index 9b636960d285..5f21bedcb307 100644
> --- a/block/blk-pm.c
> +++ b/block/blk-pm.c
> @@ -1,8 +1,11 @@
>  // SPDX-License-Identifier: GPL-2.0
>  
> +#include 
>  #include 
>  #include 
>  #include 
> +#include "blk-mq.h"
> +#include "blk-mq-tag.h"
>  
>  /**
>   * blk_pm_runtime_init - Block layer runtime PM initialization routine
> @@ -40,6 +43,36 @@ void blk_pm_runtime_init(struct request_queue *q, struct 
> device *dev)
>  }
>  EXPORT_SYMBOL(blk_pm_runtime_init);
>  
> +struct in_flight_data {
> + struct request_queue*q;
> + int in_flight;
> +};
> +
> +static void blk_count_in_flight(struct blk_mq_hw_ctx *hctx, struct request 
> *rq,
> + void *priv, bool reserved)
> +{
> + struct in_flight_data *in_flight = priv;
> +
> + if (rq->q == in_flight->q)
> + in_flight->in_flight++;
> +}
> +
> +/*
> + * Count the number of requests that are in flight for request queue @q. Use
> + * @q->nr_pending for legacy queues. Iterate over the tag set for blk-mq
> + * queues.  Use blk_mq_queue_tag_busy_iter() instead of
> + * blk_mq_tagset_busy_iter() because the latter only considers requests that
> + * have already been started.
> + */
> +static int blk_requests_in_flight(struct request_queue *q)
> +{
> + struct in_flight_data in_flight = { .q = q };
> +
> + if (q->mq_ops)
> + blk_mq_queue_tag_busy_iter(q, blk_count_in_flight, &in_flight);
> + return q->nr_pending + in_flight.in_flight;
> +}

blk_mq_queue_tag_busy_iter() only accounts for driver tags, so this will only work
when no I/O scheduler is attached.

> +
>  /**
>   * blk_pre_runtime_suspend - Pre runtime suspend check
>   * @q: the queue of the device
> @@ -68,14 +101,38 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>   if (!q->dev)
>   return ret;
>  
> + WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
> +
> + blk_set_pm_only(q);
> + /*
> +  * This function only gets called if the most recent
> +  * pm_request_resume() call occurred at least autosuspend_delay_ms
   ^^^
pm_runtime_mark_last_busy ?

> +  * ago. Since blk_queue_enter() is called by the request allocation
> +  * code before pm_request_resume(), if no requests have a tag assigned
> +  * it is safe to suspend the device.
> +  */
> + ret = -EBUSY;
> + if (blk_requests_in_flight(q) == 0) {
> + /*
> +  * Call synchronize_rcu() such that later blk_queue_enter()
> +  * calls see the pm-only state. See also
> +  * 
> http://lwn.net/Articles/573497/.
> +  */
> + synchronize_rcu();
> + if (blk_requests_in_flight(q) == 0)

This seems not safe here.

For the blk-mq path:
someone may have escaped from the preempt checking, missed the
blk_pm_request_resume() there, and entered generic_make_request(), but not yet
allocated a request or occupied any tag.

There could be a similar scenario for the blk-legacy path, where q->nr_pending is
only increased when the request is queued.

So I guess the q_usage_counter check is still needed here.
  
> + ret = 0;
> + }
> +
>   spin_lock_irq(q->queue_lock);
> - if (q->nr_pending) {
> - ret = -EBUSY;
> + if (ret < 0)
>   pm_runtime_mark_last_busy(q->dev);
> - } else {
> + else
>   q->rpm_status = RPM_SUSPENDING;
> - }
>   spin_unlock_irq(q->queue_lock);
> +
> + if (ret)
> + blk_clear_pm_only(q);
> +
>   return ret;
>  }
>  EXPORT_SYMBOL(blk_pre_runtime_suspend);
> @@ -106,6 +163,9 @@ void blk_post_runtime_suspend(struct request_queue *q, 
> int err)
>   pm_runtime_mark_last_busy(q->dev);
>   }
>   spin_unlock_irq(q->queue_lock);
> +
> + if (err)
> + blk_clear_pm_only(q);
>  }
>  EXPORT_SYMBOL(blk_post_runtime_suspend);
>  
> @@ -153,13 +213,15 @@ void blk_post_runtime_resume(struct request_queue *q, 
> int err)
>   spin_lock_irq(q->queue_lock);
>   

Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM

2018-09-17 Thread jianchao.wang
Hi Ming

On 09/17/2018 08:07 PM, Ming Lei wrote:
>>> This way will delay runtime pm or system suspend until the queue is 
>>> unfrozen,
>>> but it isn't reasonable.
>> This interface is for the __scsi_execute only, before we call into function, 
>> we should have
>> get the device resumed synchronously.
> I mean when the queue is frozen, it is reasonable to runtime suspend the 
> queue. However,
> blk_queue_preempt_enter() is still waiting for queue becoming unfreezing 
> first.

We don't freeze the queue, but set preempt-only mode when suspend the queue. :)

Thanks
Jianchao


Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM

2018-09-16 Thread jianchao.wang
Hi Ming

Thanks for your kind response.

On 09/16/2018 09:09 PM, Ming Lei wrote:
> On Fri, Sep 14, 2018 at 03:27:44PM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 09/13/2018 08:15 PM, Ming Lei wrote:
>>> This patchset introduces per-host admin request queue for submitting
>>> admin request only, and uses this approach to implement both SCSI
>>> quiesce and runtime PM in one very simply way. Also runtime PM deadlock
>>> can be avoided in case that request pool is used up, such as when too
>>> many IO requests are allocated before resuming device
>>
>> To be honest, after look in to the SCSI part of this patchset, I really
>> suspect whether it is worth to introduce this per scsi-host adminq.
>> Too much complexity is introduced into SCSI. Compared with this, the current
> 
> Can you comment on individual patch about which part is complicated?
> I will explain them to you only by one.

I have read through the scsi part of your patch many times. :)

After this patchset, we can control the IO request_queues separately, which is
more convenient.

But what we have to pay for it is the relative complexity _spread_ over the
SCSI-layer code. In particular, we have to handle the fact that a request to a
SCSI device could come either from its own IO request_queue or from the per-host
adminq, on both the submit and the complete side.

Those cases are handled in the patchset, but it looks really tedious, and it is
also risky in the currently stable SCSI mid-layer code.

I really think we should keep the complexity confined to a very small area.

> 
>> preempt-only feature look much simpler.
>>
>> If you just don't like the preempt-only checking in the hot path, we may 
>> introduce
>> a new interface similar with blk_queue_enter for the non hot path.
>>
>> With your patch percpu-refcount: relax limit on percpu_ref_reinit()
> 
> Seems this way may be one improvement on Bart's V6.
> 
>>
>> (totally no test)
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 4dbc93f..dd7f007 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -427,14 +427,24 @@ EXPORT_SYMBOL(blk_sync_queue);
>>   * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was 
>> not
>>   * set and 1 if the flag was already set.
>>   */
>> -int blk_set_preempt_only(struct request_queue *q)
>> +int blk_set_preempt_only(struct request_queue *q, bool drain)
>>  {
>> -return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
>> +if (blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q))
>> +return 1;
>> +percpu_ref_kill(&q->q_usage_counter);
>> +synchronize_sched();
>> +
>> +if (drain)
>> +wait_event(q->mq_freeze_wq,
>> +percpu_ref_is_zero(&q->q_usage_counter));
>> +
>> +return 0;
>>  }
>>  EXPORT_SYMBOL_GPL(blk_set_preempt_only);
> 
> It is easy to cause race between the normal freeze interface and the
> above one. That may be one new complexity of the preempt only approach,
> because there can be more than one path to kill/reinit .q_usage_counter.

Yes, indeed.

> 
>>  
>>  void blk_clear_preempt_only(struct request_queue *q)
>>  {
>> +percpu_ref_reinit(&q->q_usage_counter);
>>  blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
>>  wake_up_all(>mq_freeze_wq);
>>  }
>> @@ -913,31 +923,13 @@ EXPORT_SYMBOL(blk_alloc_queue);
>>  /**
>>   * blk_queue_enter() - try to increase q->q_usage_counter
>>   * @q: request queue pointer
>> - * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
>> + * @flags: BLK_MQ_REQ_NOWAIT
>>   */
>>  int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>>  {
>> -const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
>> -
>>  while (true) {
>> -bool success = false;
>> -
>> -rcu_read_lock();
>> -if (percpu_ref_tryget_live(&q->q_usage_counter)) {
>> -/*
>> - * The code that sets the PREEMPT_ONLY flag is
>> - * responsible for ensuring that that flag is globally
>> - * visible before the queue is unfrozen.
>> - */
>> -if (preempt || !blk_queue_preempt_only(q)) {
>> -success = true;
>> -} else {
>> -percpu_ref_put(&q->q_usage_counter);
>> -}
>> -}
>> -rcu_read_unlock();
>>  
>> -if (success)
>> +if (percpu_ref_tryget_live(&q->q_usage_counter))
>>  return 0;
>>  
>>  if (flags

Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM

2018-09-14 Thread jianchao.wang
Hi Ming

On 09/13/2018 08:15 PM, Ming Lei wrote:
> This patchset introduces per-host admin request queue for submitting
> admin request only, and uses this approach to implement both SCSI
> quiesce and runtime PM in one very simply way. Also runtime PM deadlock
> can be avoided in case that request pool is used up, such as when too
> many IO requests are allocated before resuming device

To be honest, after looking into the SCSI part of this patchset, I really
doubt whether it is worth introducing this per-scsi-host adminq.
Too much complexity is introduced into SCSI. Compared with this, the current
preempt-only feature looks much simpler.

If you just don't like the preempt-only check in the hot path, we could introduce
a new interface similar to blk_queue_enter() for the non-hot path.

With your patch percpu-refcount: relax limit on percpu_ref_reinit()

(totally no test)
diff --git a/block/blk-core.c b/block/blk-core.c
index 4dbc93f..dd7f007 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -427,14 +427,24 @@ EXPORT_SYMBOL(blk_sync_queue);
  * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
  * set and 1 if the flag was already set.
  */
-int blk_set_preempt_only(struct request_queue *q)
+int blk_set_preempt_only(struct request_queue *q, bool drain)
 {
-return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+if (blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q))
+return 1;
+percpu_ref_kill(&q->q_usage_counter);
+synchronize_sched();
+
+if (drain)
+wait_event(q->mq_freeze_wq,
+percpu_ref_is_zero(&q->q_usage_counter));
+
+return 0;
 }
 EXPORT_SYMBOL_GPL(blk_set_preempt_only);
 
 void blk_clear_preempt_only(struct request_queue *q)
 {
+percpu_ref_reinit(&q->q_usage_counter);
 blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
 wake_up_all(>mq_freeze_wq);
 }
@@ -913,31 +923,13 @@ EXPORT_SYMBOL(blk_alloc_queue);
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
- * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
+ * @flags: BLK_MQ_REQ_NOWAIT
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
-
 while (true) {
-bool success = false;
-
-rcu_read_lock();
-if (percpu_ref_tryget_live(&q->q_usage_counter)) {
-/*
- * The code that sets the PREEMPT_ONLY flag is
- * responsible for ensuring that that flag is globally
- * visible before the queue is unfrozen.
- */
-if (preempt || !blk_queue_preempt_only(q)) {
-success = true;
-} else {
-percpu_ref_put(&q->q_usage_counter);
-}
-}
-rcu_read_unlock();
 
-if (success)
+if (percpu_ref_tryget_live(&q->q_usage_counter))
 return 0;
 
 if (flags & BLK_MQ_REQ_NOWAIT)
@@ -954,7 +946,44 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 
 wait_event(q->mq_freeze_wq,
(atomic_read(&q->mq_freeze_depth) == 0 &&
-(preempt || !blk_queue_preempt_only(q))) ||
+   !blk_queue_preempt_only(q)) ||
+   blk_queue_dying(q));
+if (blk_queue_dying(q))
+return -ENODEV;
+}
+}
+
+/*
+ * When set PREEMPT_ONLY mode, q->q_usage_counter is killed.
+ * If only PREEMPT_ONLY mode, go on.
+ * If queue frozen, wait.
+ */
+int blk_queue_preempt_enter(struct request_queue *q,
+blk_mq_req_flags_t flags)
+{
+while (true) {
+
+if (percpu_ref_tryget_live(&q->q_usage_counter))
+return 0;
+
+smp_rmb();
+
+/*
+ * If queue is only in PREEMPT_ONLY mode, get the ref
+ * directly. The one who set PREEMPT_ONLY mode is responsible
+ * to wake up the waiters on mq_freeze_wq.
+ */
+if (!atomic_read(&q->mq_freeze_depth) &&
+blk_queue_preempt_only(q)) {
+percpu_ref_get_many(&q->q_usage_counter, 1);
+return 0;
+}
+
+if (flags & BLK_MQ_REQ_NOWAIT)
+return -EBUSY;
+
+wait_event(q->mq_freeze_wq,
+   (atomic_read(&q->mq_freeze_depth) == 0 ||
blk_queue_dying(q));
 if (blk_queue_dying(q))
 return -ENODEV;
@@ -1587,7 +1616,7 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
 /* create ioc upfront */
 create_io_context(gfp_mask, q->node);
 
-ret = blk_queue_enter(q, flags);
+ret = blk_queue_preempt_enter(q, flags);
 if (ret)
 return ERR_PTR(ret);
 spin_lock_irq(q->queue_lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85a1c1a..3aea78f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -403,7 +403,7 @@ struct request *blk_mq_alloc_request(struct request_queue 
*q, unsigned int op,
 struct request *rq;
 int ret;
 
-ret = 

Re: No protection on the hctx->dispatch_busy

2018-08-27 Thread jianchao.wang



On 08/27/2018 03:00 PM, Ming Lei wrote:
> On Mon, Aug 27, 2018 at 01:56:39PM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> Currently, blk_mq_update_dispatch_busy is hooked in blk_mq_dispatch_rq_list
>> and __blk_mq_issue_directly. blk_mq_update_dispatch_busy could be invoked on 
>> multiple
>> cpus concurrently. But there is not any protection on the 
>> hctx->dispatch_busy. We cannot
>> ensure the update on the dispatch_busy atomically.
> 
> The update itself is atomic given type of this variable is 'unsigned int'.

blk_mq_update_dispatch_busy() doesn't just write an unsigned int variable;
it reads, calculates, and then writes. The whole operation is not atomic.

> 
>>
>>
>> Look at the test result after applied the debug patch below:
>>
>>  fio-1761  [000]  227.246251: blk_mq_update_dispatch_busy.part.50: old 0 ewma 2 cur 2
>>  fio-1766  [004]  227.246252: blk_mq_update_dispatch_busy.part.50: old 2 ewma 1 cur 1
>>  fio-1755  [000]  227.246366: blk_mq_update_dispatch_busy.part.50: old 1 ewma 0 cur 0
>>  fio-1754  [003]  227.266050: blk_mq_update_dispatch_busy.part.50: old 2 ewma 3 cur 3
>>  fio-1763  [007]  227.266050: blk_mq_update_dispatch_busy.part.50: old 0 ewma 2 cur 2
>>  fio-1761  [000]  227.266051: blk_mq_update_dispatch_busy.part.50: old 3 ewma 2 cur 2
>>  fio-1766  [004]  227.266051: blk_mq_update_dispatch_busy.part.50: old 3 ewma 2 cur 2
>>  fio-1760  [005]  227.266165: blk_mq_update_dispatch_busy.part.50: old 2 ewma 1 cur 1
>>
...
>>
>> Is it expected ?
> 
> Yes, it won't be a issue in reality given hctx->dispatch_busy is used as
> a hint, and it often works as expected and hctx->dispatch_busy is convergent
> finally because it is exponential weighted moving average.
I'm just concerned that the value of dispatch_busy will bounce up and down within a
small range under high workloads on systems with 32 or more cores, due to caching and
the non-atomic update.
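
One hypothetical way to close that window (illustration only, not a posted
patch; macro names follow the quoted blk-mq code) would be a cmpxchg() loop
around the EWMA update:

static void blk_mq_update_dispatch_busy_atomic(struct blk_mq_hw_ctx *hctx,
                                               bool busy)
{
        unsigned int old, ewma, prev;

        if (hctx->queue->elevator)
                return;

        old = READ_ONCE(hctx->dispatch_busy);
        do {
                if (!old && !busy)
                        return;

                ewma = old * (BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT - 1);
                if (busy)
                        ewma += 1 << BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR;
                ewma /= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT;

                /* Retry if another CPU updated dispatch_busy under us. */
                prev = cmpxchg(&hctx->dispatch_busy, old, ewma);
                if (prev == old)
                        break;
                old = prev;
        } while (1);
}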


> 
> Thanks,
> Ming
> 


No protection on the hctx->dispatch_busy

2018-08-26 Thread jianchao.wang
Hi Ming

Currently, blk_mq_update_dispatch_busy() is hooked into blk_mq_dispatch_rq_list()
and __blk_mq_issue_directly(), so blk_mq_update_dispatch_busy() can be invoked on
multiple CPUs concurrently. But there is no protection on hctx->dispatch_busy,
so we cannot guarantee that the update of dispatch_busy is atomic.


Look at the test result after applied the debug patch below:

 fio-1761  [000]  227.246251: blk_mq_update_dispatch_busy.part.50: old 0 ewma 2 cur 2
 fio-1766  [004]  227.246252: blk_mq_update_dispatch_busy.part.50: old 2 ewma 1 cur 1
 fio-1755  [000]  227.246366: blk_mq_update_dispatch_busy.part.50: old 1 ewma 0 cur 0
 fio-1754  [003]  227.266050: blk_mq_update_dispatch_busy.part.50: old 2 ewma 3 cur 3
 fio-1763  [007]  227.266050: blk_mq_update_dispatch_busy.part.50: old 0 ewma 2 cur 2
 fio-1761  [000]  227.266051: blk_mq_update_dispatch_busy.part.50: old 3 ewma 2 cur 2
 fio-1766  [004]  227.266051: blk_mq_update_dispatch_busy.part.50: old 3 ewma 2 cur 2
 fio-1760  [005]  227.266165: blk_mq_update_dispatch_busy.part.50: old 2 ewma 1 cur 1

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1088,11 +1088,12 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
*hctx,
 static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
 {
unsigned int ewma;
+   unsigned int old;
 
if (hctx->queue->elevator)
return;
 
-   ewma = hctx->dispatch_busy;
+   old = ewma = hctx->dispatch_busy;
 
if (!ewma && !busy)
return;
@@ -1103,6 +1104,8 @@ static void blk_mq_update_dispatch_busy(struct 
blk_mq_hw_ctx *hctx, bool busy)
ewma /= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT;
 
hctx->dispatch_busy = ewma;
+
+   trace_printk("old %u ewma %u cur %u\n", old, ewma, 
READ_ONCE(hctx->dispatch_busy));
 }


Is it expected ?

Thanks
Jianchao


Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done

2018-08-26 Thread jianchao.wang
Hi Jens

On 08/25/2018 11:41 PM, Jens Axboe wrote:
>   do {
> - set_current_state(TASK_UNINTERRUPTIBLE);
> + if (test_bit(0, ))
> + break;
>  
> - if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
> + WARN_ON_ONCE(list_empty());
> +
> + if (!has_sleeper &&
> + rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
> + finish_wait(>wait, );
> +
> + /*
> +  * We raced with wbt_wake_function() getting a token,
> +  * which means we now have two. Put ours and wake
> +  * anyone else potentially waiting for one.
> +  */
> + if (test_bit(0, ))
> + wbt_rqw_done(rwb, rqw, wb_acct);
>   break;

Just using a 'bool' variable should be OK.
After finish_wait(), no one can race with us here.

> + }
>  
>   if (lock) {
>   spin_unlock_irq(lock);
> @@ -511,11 +569,11 @@ static void __wbt_wait(struct rq_wb *rwb, enum 
> wbt_flags wb_acct,
>   spin_lock_irq(lock);
>   } else
>   io_schedule();
> +
>   has_sleeper = false;
>   } while (1);

I cannot get the point of "since we can't rely on just being woken from the
->func handler we set".
Do you mean someone else could wake up this task?

Thanks
Jianchao


Re: [PATCH v6 08/12] block, scsi: Introduce blk_pm_runtime_exit()

2018-08-13 Thread jianchao.wang
Hi Bart

On 08/11/2018 12:17 AM, Bart Van Assche wrote:
> void blk_pm_runtime_exit(struct request_queue *q)
> {
>   if (!q->dev)
>   return;
> 
>   pm_runtime_get_sync(q->dev);
>   blk_freeze_queue(q);
>   q->dev = NULL;
>   blk_unfreeze_queue(q);
> }

I'm afraid this will not work.

In following patch:

@@ -972,6 +972,8 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
if (success)
return 0;
 
+   blk_pm_request_resume(q);
+
if (flags & BLK_MQ_REQ_NOWAIT)
return -EBUSY;

We could still invoke blk_pm_request_resume even if the queue is frozen.

And we have blk_pm_request_resume later as following:

+static inline void blk_pm_request_resume(struct request_queue *q)
+{
+   if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
+  q->rpm_status == RPM_SUSPENDING))
+   pm_request_resume(q->dev);
+}

Thanks
Jianchao


Re: [PATCH v6 10/12] block: Change the runtime power management approach (1/2)

2018-08-09 Thread jianchao.wang
Hi Bart

On 08/10/2018 03:41 AM, Bart Van Assche wrote:
> Instead of scheduling runtime resume of a request queue after a
> request has been queued, schedule asynchronous resume during request
> allocation. The new pm_request_resume() calls occur after
> blk_queue_enter() has increased the q_usage_counter request queue
^^^
> member. This change is needed for a later patch that will make request
> allocation block while the queue status is not RPM_ACTIVE.

Is it "after getting q->q_usage_counter fails" ?
And also this blk_pm_request_resume  will not affect the normal path. ;)

Thanks
Jianchao


Re: [PATCH v6 11/12] block: Change the runtime power management approach (2/2)

2018-08-09 Thread jianchao.wang
Hi Bart

On 08/10/2018 03:41 AM, Bart Van Assche wrote:
> +
> + blk_set_pm_only(q);
> + /*
> +  * This function only gets called if the most recent
> +  * pm_request_resume() call occurred at least autosuspend_delay_ms
> +  * ago. Since blk_queue_enter() is called by the request allocation
> +  * code before pm_request_resume(), if no requests have a tag assigned
> +  * it is safe to suspend the device.
> +  */
> + ret = -EBUSY;
> + if (blk_requests_in_flight(q) == 0) {
> + /*
> +  * Call synchronize_rcu() such that later blk_queue_enter()
> +  * calls see the preempt-only state. See also
> +  * 
> http://lwn.net/Articles/573497/.
> +  */
> + synchronize_rcu();
> + if (blk_requests_in_flight(q) == 0)
> + ret = 0;
> + }

I still think blk_set_pm_only() should be moved to after the blk_requests_in_flight()
check. Otherwise, normal IO will be blocked for a little while if there are still
busy requests.

Thanks
Jianchao


Re: [PATCH v6 05/12] block, scsi: Rename QUEUE_FLAG_PREEMPT_ONLY into DV_ONLY and introduce PM_ONLY

2018-08-09 Thread jianchao.wang
Hi Bart

On 08/10/2018 03:41 AM, Bart Van Assche wrote:
> +/*
> + * Whether or not blk_queue_enter() should proceed. RQF_PM requests are 
> always
> + * allowed. RQF_DV requests are allowed if the PM_ONLY queue flag has not 
> been
> + * set. Other requests are only allowed if neither PM_ONLY nor DV_ONLY has 
> been
> + * set.
> + */
> +static inline bool blk_enter_allowed(struct request_queue *q,
> +  blk_mq_req_flags_t flags)
> +{
> + return flags & BLK_MQ_REQ_PM ||
> + (!blk_queue_pm_only(q) &&
> +  (flags & BLK_MQ_REQ_DV || !blk_queue_dv_only(q)));
> +}

If a new state is indeed necessary, I think this kind of checking in the hot
path is inefficient.
How about introducing a new field into the request_queue, such as
request_queue->gate_state ? Record the PM_ONLY and DV_ONLY conditions in this
state, then we could just check request_queue->gate_state > 0 before doing the
more detailed checking. A rough sketch follows.
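A minimal sketch of the idea (gate_state is a hypothetical field, not something
from this patch set):

	static inline bool blk_enter_allowed(struct request_queue *q,
					     blk_mq_req_flags_t flags)
	{
		/* fast path: no gate is set, any request may enter */
		if (likely(!READ_ONCE(q->gate_state)))
			return true;

		/* slow path: fall back to the detailed PM_ONLY/DV_ONLY checks */
		return flags & BLK_MQ_REQ_PM ||
			(!blk_queue_pm_only(q) &&
			 (flags & BLK_MQ_REQ_DV || !blk_queue_dv_only(q)));
	}

Setting PM_ONLY or DV_ONLY would then also set the corresponding bit in
q->gate_state, so the common case pays for a single load and test.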

Thanks
Jianchao


Re: [PATCH v5 5/9] block: Change the runtime power management approach (1/2)

2018-08-08 Thread jianchao.wang



On 08/08/2018 02:11 PM, jianchao.wang wrote:
> Hi Bart
> 
> On 08/08/2018 06:51 AM, Bart Van Assche wrote:
>> @@ -391,6 +393,9 @@ static struct request *blk_mq_get_request(struct 
>> request_queue *q,
>>  }
>>  }
>>  data->hctx->queued++;
>> +
>> +blk_pm_add_request(q, rq);
>> +
>>  return rq;
>>  }
> 
> The request_queue is in pm_only mode when suspended, who can reach here to do 
> the resume ?

I mean, in the original blk-legacy runtime pm implementation, any new IO could
trigger the resume. After your patch set, only a pm request can pass
blk_queue_enter while the queue is suspended and in pm-only mode. But if there
is no resume, where does the pm request come from ?

The blk_pm_add_request should be added to blk_queue_enter.
It would look like the following (see also the sketch after this list):
   1. when a normal io reaches blk_queue_enter, if the queue is in suspended
      mode, it invokes blk_pm_add_request to trigger the resume, then waits
      here for the pm-only mode to be cleared.
   2. the runtime pm core does the resume work and finally clears the pm-only
      mode.
   3. the task blocked in blk_queue_enter is woken up and proceeds.
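A rough sketch of that flow (the helper names are borrowed from the patches in
this series; the exact placement is my assumption, not Bart's code):

	int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
	{
		const bool pm = flags & BLK_MQ_REQ_PM;

		while (true) {
			if (percpu_ref_tryget_live(&q->q_usage_counter)) {
				if (pm || !blk_queue_pm_only(q))
					return 0;
				percpu_ref_put(&q->q_usage_counter);
			}

			if (flags & BLK_MQ_REQ_NOWAIT)
				return -EBUSY;

			/* step 1: a normal io asks for the resume ... */
			blk_pm_request_resume(q);

			/* ... and waits until the pm-only mode is cleared (steps 2/3) */
			wait_event(q->mq_freeze_wq,
				   !blk_queue_pm_only(q) || blk_queue_dying(q));
			if (blk_queue_dying(q))
				return -ENODEV;
		}
	}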

Thanks
Jianchao


Re: [PATCH v5 5/9] block: Change the runtime power management approach (1/2)

2018-08-08 Thread jianchao.wang
Hi Bart

On 08/08/2018 06:51 AM, Bart Van Assche wrote:
> @@ -391,6 +393,9 @@ static struct request *blk_mq_get_request(struct 
> request_queue *q,
>   }
>   }
>   data->hctx->queued++;
> +
> + blk_pm_add_request(q, rq);
> +
>   return rq;
>  }

The request_queue is in pm_only mode when suspended, who can reach here to do 
the resume ?

Thanks
Jianchao


Re: [PATCH v4 07/10] block, scsi: Rework runtime power management

2018-08-05 Thread jianchao.wang
Hi Bart

On 08/04/2018 08:03 AM, Bart Van Assche wrote:
> Instead of allowing requests that are not power management requests
> to enter the queue in runtime suspended status (RPM_SUSPENDED), make
> the blk_get_request() caller block. 

Who will resume the runtime suspended device ?

In the original blk-legacy code, when a request is added, blk_pm_add_request
calls pm_request_resume, which does that. The request is then issued after the
queue has been resumed successfully.

After this patch, non-pm requests are blocked and pm_request_resume is removed
from blk_pm_add_request, so who resumes the runtime suspended queue and device ?

Thanks
Jianchao


Re: [PATCH v2 4/5] block, scsi: Rework runtime power management

2018-07-29 Thread jianchao.wang



On 07/27/2018 11:29 PM, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 09:57 +0800, jianchao.wang wrote:
>> Something like:
>>
>>  percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
>>  if (!percpu_ref_is_zero(&q->q_usage_counter)) {
>>  ret = -EBUSY;
>>  pm_runtime_mark_last_busy(q->dev);
>>  } else {
>>  blk_set_preempt_only(q);
>>  if (!percpu_ref_is_zero(&q->q_usage_counter)) {
>>  ret = -EBUSY;
>>  pm_runtime_mark_last_busy(q->dev);
>>  blk_clear_preempt_only(q);
>>  } else {
>>  q->rpm_status = RPM_SUSPENDING;
>>  }
>> }
> 
> I think this code is racy. Because there is no memory barrier in
> blk_queue_enter() between the percpu_ref_tryget_live() and the
> blk_queue_preempt_only() calls, the context that sets the PREEMPT_ONLY flag
> has to use synchronize_rcu() or call_rcu() to ensure that blk_queue_enter()
> sees the PREEMPT_ONLY flag after it has called percpu_ref_tryget_live().
> See also 
> http://lwn.net/Articles/573497/.

Yes, a synchronize_rcu is indeed needed here to ensure that q_usage_counter is
either grabbed successfully with the counter increased, or fails without
increasing it.
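A minimal sketch of the ordering Bart describes (the placement is my
assumption, and it presumes q_usage_counter has already been switched to atomic
mode as in the code above):

	blk_set_preempt_only(q);
	/*
	 * Wait a grace period so that any blk_queue_enter() caller that
	 * already passed percpu_ref_tryget_live() has its increment visible
	 * in q_usage_counter, and so that later callers are guaranteed to
	 * see the PREEMPT_ONLY flag.
	 */
	synchronize_rcu();
	if (!percpu_ref_is_zero(&q->q_usage_counter)) {
		/* someone still holds a reference; back out */
		blk_clear_preempt_only(q);
		ret = -EBUSY;
	}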

Thanks
Jianchao


Re: [PATCH v2 4/5] block, scsi: Rework runtime power management

2018-07-29 Thread jianchao.wang


Hi Bart

On 07/27/2018 11:09 PM, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 09:57 +0800, jianchao.wang wrote:
>> If q_usage_counter is not zero here, we will leave the request_queue in 
>> preempt only mode.
> 
> That's on purpose. If q_usage_counter is not zero then 
> blk_pre_runtime_suspend()
> will return -EBUSY. That error code will be passed to 
> blk_post_runtime_suspend()
> and that will cause that function to clear QUEUE_FLAG_PREEMPT_ONLY.
> 

static int sdev_runtime_suspend(struct device *dev)
{
const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
struct scsi_device *sdev = to_scsi_device(dev);
int err = 0;

err = blk_pre_runtime_suspend(sdev->request_queue);
if (err)
return err;
if (pm && pm->runtime_suspend)
err = pm->runtime_suspend(dev);
blk_post_runtime_suspend(sdev->request_queue, err);

return err;
}

If blk_pre_runtime_suspend returns -EBUSY, blk_post_runtime_suspend will not be 
invoked.

>> The request_queue should be set to preempt only mode only when we confirm we 
>> could set
>> rpm_status to RPM_SUSPENDING or RPM_RESUMING.
> 
> Why do you think this?

https://marc.info/?l=linux-scsi&m=133727953625963&w=2
"
If q->rpm_status is RPM_SUSPENDED, they shouldn't do anything -- act as though 
the queue is
empty.  If q->rpm_status is RPM_SUSPENDING or RPM_RESUMING, they should hand 
over the request
only if it has the REQ_PM flag set.
"
In addition, if we set preempt-only here unconditionally, normal IO will be
blocked during blk_pre_runtime_suspend. In your patch, q_usage_counter will be
switched to atomic mode, which could take some time. Is that really OK ?

Thanks
Jianchao
> 
> Bart.
> 
> 


Re: [PATCH v2 4/5] block, scsi: Rework runtime power management

2018-07-26 Thread jianchao.wang



On 07/27/2018 06:24 AM, Bart Van Assche wrote:
> On Thu, 2018-07-26 at 10:45 +0800, jianchao.wang wrote:
>> On 07/26/2018 06:26 AM, Bart Van Assche wrote:
>>> @@ -102,9 +109,11 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>> return ret;
>>>  
>>> blk_pm_runtime_lock(q);
>>> +   blk_set_preempt_only(q);
>>
>> We only stop non-RQF_PM request entering when RPM_SUSPENDING and 
>> RPM_RESUMING.
>> blk_pre_runtime_suspend should only _check_ whether runtime suspend is 
>> allowed.
>> So we should not set preempt_only here.
>>
>>> +   percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
>>>  
>>> spin_lock_irq(q->queue_lock);
>>> -   if (q->nr_pending) {
>>> +   if (!percpu_ref_is_zero(&q->q_usage_counter)) {
>>> ret = -EBUSY;
>>> pm_runtime_mark_last_busy(q->dev);
>>> } else {
>>> @@ -112,6 +121,7 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>> }
>>> spin_unlock_irq(q->queue_lock);
>>>  
>>> +   percpu_ref_switch_to_percpu(&q->q_usage_counter);
>>> blk_pm_runtime_unlock(q);
> 
> Hello Jianchao,
> 
> There is only one caller of blk_post_runtime_suspend(), namely
> sdev_runtime_suspend(). That function calls pm->runtime_suspend() before
> it calls blk_post_runtime_suspend(). I think it would be wrong to set the
> PREEMPT_ONLY flag from inside blk_post_runtime_suspend() because that could
> cause pm->runtime_suspend() while a request is in progress.
> 

Hi Bart

If q_usage_counter is not zero here, we will leave the request_queue in
preempt-only mode.
The request_queue should be set to preempt-only mode only when we have
confirmed that we can set rpm_status to RPM_SUSPENDING or RPM_RESUMING.

Maybe we could set the PREEMPT_ONLY after we confirm we could set the rpm_status
to RPM_SUSPENDING.
Something like:

percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
if (!percpu_ref_is_zero(&q->q_usage_counter)) {
	ret = -EBUSY;
	pm_runtime_mark_last_busy(q->dev);
} else {
	blk_set_preempt_only(q);
	if (!percpu_ref_is_zero(&q->q_usage_counter)) {
		ret = -EBUSY;
		pm_runtime_mark_last_busy(q->dev);
		blk_clear_preempt_only(q);
	} else {
		q->rpm_status = RPM_SUSPENDING;
	}
}


Thanks
Jianchao


Re: [PATCH v2 4/5] block, scsi: Rework runtime power management

2018-07-26 Thread jianchao.wang



On 07/26/2018 03:52 PM, jianchao.wang wrote:
> In addition, .runtime_suspend is invoked under spinlock and irq-disabled.
> So sleep is forbidden here.
> Please refer to rpm_suspend
> 
> * This function must be called under dev->power.lock with interrupts disabled
> 
__rpm_callback will unlock the spinlock before invoke the callback.
Sorry for the noise.

Thanks
Jianchao



Re: [PATCH v2 4/5] block, scsi: Rework runtime power management

2018-07-26 Thread jianchao.wang



On 07/26/2018 10:45 AM, jianchao.wang wrote:
> Hi Bart
> 
> On 07/26/2018 06:26 AM, Bart Van Assche wrote:
>> @@ -102,9 +109,11 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>  return ret;
>>  
>>  blk_pm_runtime_lock(q);
>> +blk_set_preempt_only(q);
> 
> We only stop non-RQF_PM request entering when RPM_SUSPENDING and RPM_RESUMING.
> blk_pre_runtime_suspend should only _check_ whether runtime suspend is 
> allowed.
> So we should not set preempt_only here.
> 
> 
>> +percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);

In addition, .runtime_suspend is invoked under spinlock and irq-disabled.
So sleep is forbidden here.
Please refer to rpm_suspend

* This function must be called under dev->power.lock with interrupts disabled

Thanks
Jianchao
>>  
>>  spin_lock_irq(q->queue_lock);
>> -if (q->nr_pending) {
>> +if (!percpu_ref_is_zero(&q->q_usage_counter)) {
>>  ret = -EBUSY;
>>  pm_runtime_mark_last_busy(q->dev);
>>  } else {
>> @@ -112,6 +121,7 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>  }
>>  spin_unlock_irq(q->queue_lock);
>>  
>> +percpu_ref_switch_to_percpu(&q->q_usage_counter);
>>  blk_pm_runtime_unlock(q);
> 


Re: [PATCH v2 3/5] block: Serialize queue freezing and blk_pre_runtime_suspend()

2018-07-25 Thread jianchao.wang
Hi Bart

On 07/26/2018 06:26 AM, Bart Van Assche wrote:
> +
> +void blk_pm_runtime_lock(struct request_queue *q)
> +{
> + spin_lock(&q->rpm_lock);
> + wait_event_interruptible_locked(q->rpm_wq,
> +   q->rpm_owner == NULL || q->rpm_owner == current);
> + if (q->rpm_owner == NULL)
> + q->rpm_owner = current;
> + q->rpm_nesting_level++;
> + spin_unlock(&q->rpm_lock);
> +}

The lock which wait_event_interruptible_locked wants to be held is the wq.lock,
not q->rpm_lock. Please refer to the comment of wait_event_interruptible_locked:

 * It must be called with wq.lock being held.  This spinlock is
 * unlocked while sleeping but @condition testing is done while lock
 * is held and when this macro exits the lock is held.
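A minimal sketch of what that comment requires (my own rework of the quoted
helper, not Bart's patch): the caller has to hold the wait queue's own lock,
q->rpm_wq.lock, around wait_event_interruptible_locked().

	void blk_pm_runtime_lock(struct request_queue *q)
	{
		spin_lock(&q->rpm_wq.lock);
		wait_event_interruptible_locked(q->rpm_wq,
			q->rpm_owner == NULL || q->rpm_owner == current);
		if (q->rpm_owner == NULL)
			q->rpm_owner = current;
		q->rpm_nesting_level++;
		spin_unlock(&q->rpm_wq.lock);
	}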

Thanks
Jianchao


Re: [PATCH v2 4/5] block, scsi: Rework runtime power management

2018-07-25 Thread jianchao.wang
Hi Bart

On 07/26/2018 06:26 AM, Bart Van Assche wrote:
> @@ -102,9 +109,11 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>   return ret;
>  
>   blk_pm_runtime_lock(q);
> + blk_set_preempt_only(q);

We only stop non-RQF_PM request entering when RPM_SUSPENDING and RPM_RESUMING.
blk_pre_runtime_suspend should only _check_ whether runtime suspend is allowed.
So we should not set preempt_only here.


> + percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
>  
>   spin_lock_irq(q->queue_lock);
> - if (q->nr_pending) {
> + if (!percpu_ref_is_zero(&q->q_usage_counter)) {
>   ret = -EBUSY;
>   pm_runtime_mark_last_busy(q->dev);
>   } else {
> @@ -112,6 +121,7 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>   }
>   spin_unlock_irq(q->queue_lock);
>  
> + percpu_ref_switch_to_percpu(&q->q_usage_counter);
>   blk_pm_runtime_unlock(q);


Re: [PATCH] blk-mq: Avoid that a request queue stalls when restarting a shared hctx

2018-07-24 Thread jianchao.wang



On 07/23/2018 11:50 PM, Bart Van Assche wrote:
> The patch below fixes queue stalling when shared hctx marked for restart
> (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero.  The
> root cause is that hctxs are shared between queues, but 'shared_hctx_restart'

The blk_mq_hw_ctx structure is also per request_queue
Please refer to blk_mq_init_allocated_queue -> blk_mq_realloc_hw_ctxs

> belongs to the particular queue, which in fact may not need to be restarted,
> thus we return from blk_mq_sched_restart() and leave shared hctx of another
> queue never restarted.
> 
> The fix is to make shared_hctx_restart counter belong not to the queue, but
> to tags, thereby counter will reflect real number of shared hctx needed to
> be restarted.
> 
> During tests 1 hctx (set->nr_hw_queues) was used and all stalled requests
> were noticed in dd->fifo_list of mq-deadline scheduler.
> 
> Seeming possible sequence of events:
> 
> 1. Request A of queue A is inserted into dd->fifo_list of the scheduler.
> 
> 2. Request B of queue A bypasses scheduler and goes directly to
>hctx->dispatch.
> 
> 3. Request C of queue B is inserted.
> 
> 4. blk_mq_sched_dispatch_requests() is invoked, since hctx->dispatch is not
>empty (request B is in the list) hctx is only marked for for next restart
>and request A is left in a list (see comment "So it's best to leave them
>there for as long as we can. Mark the hw queue as needing a restart in
>that case." in blk-mq-sched.c)
> 
> 5. Eventually request B is completed/freed and blk_mq_sched_restart() is
>called, but by chance hctx from queue B is chosen for restart and request C
>gets a chance to be dispatched.

Request C has just been inserted into queue B. If the restart mark is not set
there, it will not be chosen.
blk_mq_sched_restart_hctx

if (!test_bit(BLK_MQ_S_SCHED_RESTART, >state))
return false;

If blk_mq_sched_restart chooses queue B, one of its hctxs must have the
SCHED_RESTART flag set, and its q->shared_hctx_restart must not be zero.

> 
> 6. Eventually request C is completed/freed and blk_mq_sched_restart() is
>called, but shared_hctx_restart for queue B is zero and we return without
>attempt to restart hctx from queue A, thus request A is stuck forever.
> 
> But stalling queue is not the only one problem with blk_mq_sched_restart().
> My tests show that those loops thru all queues and hctxs can be very costly,
> even with shared_hctx_restart counter, which aims to fix performance issue.

Currently, SCHED_RESTART is always set when there are requests on the
hctx->dispatch list in blk_mq_sched_dispatch_requests. And the no-driver-tag
case is the main reason why hctx->dispatch is not empty when there is an io
scheduler. Therefore, most of the time, blk_mq_sched_restart will further be
invoked for the no-driver-tag case.

For the non-shared-tag case, it will wake up the hctx.
But for the shared-tag case, it is unnecessary, because the sbitmap_queue
wakeup hook will work and hctx_may_queue will avoid starvation of the other
queues.

Therefore, the costly loop through the queues and hctxs is unnecessary most of
the time.

Thanks
Jianchao







Re: [RFC PATCH] blk-mq: move timeout handling from queue to tagset

2018-07-19 Thread jianchao.wang
Hi Keith

On 07/19/2018 01:45 AM, Keith Busch wrote:
>>> +   list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>> /*
>>>  * Request timeouts are handled as a forward rolling timer. If
>>>  * we end up here it means that no requests are pending and
>>> @@ -881,7 +868,6 @@ static void blk_mq_timeout_work(struct work_struct 
>>> *work)
>>> blk_mq_tag_idle(hctx);
>>> }
>>> }
>>> -   blk_queue_exit(q);

The tag-sharing fairness mechanism between different request_queues cannot
work well here.
When the timer is per-request_queue, if there is no request on one
request_queue, it can be idled. But now, with a per-tagset timer, we cannot
detect the idle one at all.


>   
>>> +   timer_setup(&set->timer, blk_mq_timed_out_timer, 0);
>>> +   INIT_WORK(&set->timeout_work, blk_mq_timeout_work);
>>> [ ... ]
>>> --- a/include/linux/blk-mq.h
>>> +++ b/include/linux/blk-mq.h
>>> @@ -86,6 +86,8 @@ struct blk_mq_tag_set {
>>>  
>>> struct blk_mq_tags  **tags;
>>>  
>>> +   struct timer_list   timer;
>>> +   struct work_struct  timeout_work;
>> Can the timer and timeout_work data structures be replaced by a single
>> delayed_work instance?
> I think so. I wanted to keep blk_add_timer relatively unchanged for this
> proposal, so I followed the existing pattern with the timer kicking the
> work. I don't see why that extra indirection is necessary, so I think
> it's a great idea. Unless anyone knows a reason not to, we can collapse
> this into a single delayed work for both mq and legacy as a prep patch
> before this one.

mod_delayed_work_on is very tricky in our scenario. It will grab the pending
work entry and queue it again:

context A (timer expiry)               context B (re-arming)
delayed_work.timer triggers
queue_work(timeout_work)
delayed_work.timer is not pending
                                       mod_delayed_work_on
                                         grabs the pending timeout_work
                                         re-arms the timer

The timeout_work would not be run.

Thanks
Jianchao



Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-12 Thread jianchao.wang


On 07/13/2018 06:24 AM, Bart Van Assche wrote:
> Hello Keith,
> 
> Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> request completion was reported after request timeout processing had
> started, completion handling was skipped. The following code in
> blk_mq_complete_request() realized that:
> 
>   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>   __blk_mq_complete_request(rq);

Even before tejun's patch, we already had this check in both the blk-mq and
blk-legacy code.

blk_rq_check_expired

if (time_after_eq(jiffies, rq->deadline)) {
list_del_init(&rq->timeout_list);

/*
 * Check if we raced with end io completion
 */
if (!blk_mark_rq_complete(rq))
blk_rq_timed_out(rq);
} 
 

blk_complete_request

if (!blk_mark_rq_complete(req))
__blk_complete_request(req);

blk_mq_check_expired

if (time_after_eq(jiffies, rq->deadline)) {
if (!blk_mark_rq_complete(rq))
blk_mq_rq_timed_out(rq, reserved);
}


blk_mq_complete_request

if (!blk_mark_rq_complete(rq))
__blk_mq_complete_request(rq);

Thanks
Jianchao

> 
> Since commit 12f5b9314545, if a completion occurs after request timeout
> processing has started, that completion is processed if the request has the
> state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request
> state unless the block driver timeout handler modifies it, e.g. by calling
> blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical
> behavior of scsi_times_out() is to queue sending of a SCSI abort and hence
> not to change the request state immediately. In other words, if a request
> completion occurs during or shortly after a timeout occurred then
> blk_mq_complete_request() will call __blk_mq_complete_request() and will
> complete the request, although that is not allowed because timeout handling
> has already started. Do you agree with this analysis?


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-12 Thread jianchao.wang



On 07/13/2018 06:24 AM, Bart Van Assche wrote:
> On Thu, 2018-07-12 at 13:24 -0600, Keith Busch wrote:
>> On Thu, Jul 12, 2018 at 06:16:12PM +, Bart Van Assche wrote:
>>> What prevents that a request finishes and gets reused after the
>>> blk_mq_req_expired() call has finished and before kref_get_unless_zero() is
>>> called? Is this perhaps a race condition that has not yet been triggered by
>>> any existing block layer test? Please note that there is no such race
>>> condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling
>>> again" - 
>>> https://www.spinics.net/lists/linux-block/msg26489.html).
>>
>> I don't think there's any such race in the merged implementation
>> either. If the request is reallocated, then the kref check may succeed,
>> but the blk_mq_req_expired() check would surely fail, allowing the
>> request to proceed as normal. The code comments at least say as much.
> 
> Hello Keith,
> 
> Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> request completion was reported after request timeout processing had
> started, completion handling was skipped. The following code in
> blk_mq_complete_request() realized that:
> 
>   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>   __blk_mq_complete_request(rq);
> 
> Since commit 12f5b9314545, if a completion occurs after request timeout
> processing has started, that completion is processed if the request has the
> state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request
> state unless the block driver timeout handler modifies it, e.g. by calling
> blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical
> behavior of scsi_times_out() is to queue sending of a SCSI abort and hence
> not to change the request state immediately. In other words, if a request
> completion occurs during or shortly after a timeout occurred then
> blk_mq_complete_request() will call __blk_mq_complete_request() and will
> complete the request, although that is not allowed because timeout handling
> has already started. Do you agree with this analysis?
>

Oh, thank goodness Bart has said this.
I have been saying the same thing in this mail:
https://marc.info/?l=linux-block&m=152950093831738&w=2
Even though my voice is tiny, I definitely support Bart's point.

Thanks
Jianchao
 
> Thanks,
> 
> Bart.
> 
> 
> 
> 
> 
> 
> 
> 
> 


Re: [PATCH 0/3] blk-mq: improve IO perf in case of none io sched

2018-06-28 Thread jianchao.wang



On 06/28/2018 01:42 PM, Kashyap Desai wrote:
> Ming -
> 
> Performance drop is resolved on my setup, but may be some stability of the
> kernel is caused due to this patch set. I have not tried without patch
> set, but in case you can figure out if below crash is due to this patch
> set, I can try potential fix as well.  I am seeing kernel panic if I use
> below three settings in my fio script.
> 
> iodepth_batch=8
> iodepth_batch_complete_min=4
> iodepth_batch_complete_max=16
> 
> 
> Here is panic detail -
> 
> [65237.831029] [ cut here ]
> [65237.831031] list_add corruption. prev->next should be next
> (bb1d0a41fdf8), but was 94d0f0c57240. (prev=94d0f0c57240).
> [65237.831046] WARNING: CPU: 18 PID: 13897 at lib/list_debug.c:28
> __list_add_valid+0x6a/0x70
> [65237.831046] Modules linked in: mpt3sas raid_class ses enclosure
> scsi_transport_sas megaraid_sas(OE) xt_CHECKSUM ipt_MASQUERADE tun
> ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6
> xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc
> ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
> ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
> iptable_security iptable_raw ebtable_filter ebtables ip6table_filter
> ip6_tables iptable_filter intel_rapl skx_edac nfit libnvdimm
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
> snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel crypto_simd
> cryptd
> [65237.831079]  snd_hda_codec snd_hda_core glue_helper snd_hwdep iTCO_wdt
> snd_seq iTCO_vendor_support snd_seq_device pcspkr snd_pcm sg i2c_i801
> joydev wmi mei_me ipmi_si mei snd_timer acpi_power_meter ioatdma snd
> acpi_pad ipmi_devintf ipmi_msghandler soundcore lpc_ich nfsd auth_rpcgss
> nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ast i2c_algo_bit
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe
> ahci libahci crc32c_intel libata mdio i2c_core dca dm_mirror
> dm_region_hash dm_log dm_mod
> [65237.831109] CPU: 18 PID: 13897 Comm: fio Kdump: loaded Tainted: G
> OE 4.18.0-rc1+ #1
> [65237.831110] Hardware name: Supermicro Super Server/X11DPG-QT, BIOS 1.0
> 06/22/2017
> [65237.831112] RIP: 0010:__list_add_valid+0x6a/0x70
> [65237.831112] Code: 31 c0 48 c7 c7 90 08 ab a0 e8 22 b4 cd ff 0f 0b 31 c0
> c3 48 89 d1 48 c7 c7 40 08 ab a0 48 89 f2 48 89 c6 31 c0 e8 06 b4 cd ff
> <0f> 0b 31 c0 c3 90 48 8b 07 48 b9 00 01 00 00 00 00 ad de 48 8b 57
> [65237.831130] RSP: 0018:bb1d0a41fdd8 EFLAGS: 00010286
> [65237.831131] RAX:  RBX: 94d0ebf9 RCX:
> 0006
> [65237.831132] RDX:  RSI: 0086 RDI:
> 94d91ec16970
> [65237.831132] RBP: db1cff59e980 R08:  R09:
> 0663
> [65237.831133] R10: 0004 R11:  R12:
> 0001
> [65237.831134] R13:  R14: 94d0f0c57240 R15:
> 94d0f0f04c40
> [65237.831135] FS:  7fad10d02740() GS:94d91ec0()
> knlGS:
> [65237.831135] CS:  0010 DS:  ES:  CR0: 80050033
> [65237.831136] CR2: 7fad10c2a004 CR3: 00085600e006 CR4:
> 007606e0
> [65237.831137] DR0:  DR1:  DR2:
> 
> [65237.831138] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [65237.831138] PKRU: 5554
> [65237.831138] Call Trace:
> [65237.831145]  blk_mq_flush_plug_list+0x12d/0x260
> [65237.831150]  blk_flush_plug_list+0xe7/0x240
> [65237.831152]  blk_finish_plug+0x27/0x40
> [65237.831156]  __x64_sys_io_submit+0xc9/0x180
> [65237.831162]  do_syscall_64+0x5b/0x180
> [65237.831166]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

I guess we need to reinitialize the list after list_splice_tail in the 1/1
patch, as follows:
@@ -1533,19 +1533,19 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx, struct list_head *list)
 
 {
...
+
+   spin_lock(&ctx->lock);
+   list_splice_tail(list, &ctx->rq_list);
+   INIT_LIST_HEAD(list);  // ---> HERE !
blk_mq_hctx_mark_pending(hctx, ctx);
spin_unlock(&ctx->lock);

Thanks
Jianchao


Re: Regarding the recent new blk-mq timeout handling

2018-06-12 Thread jianchao.wang


On 06/12/2018 09:01 PM, jianchao.wang wrote:
> Hi ming
> 
> Thanks for your kindly response.
> 
> On 06/12/2018 06:17 PM, Ming Lei wrote:
>> On Tue, Jun 12, 2018 at 6:04 PM, jianchao.wang
>>  wrote:
>>> Hi Jens and Christoph
>>>
>>> In the recent commit of new blk-mq timeout handling, we don't have any 
>>> protection
>>> on timed out request against the completion path. We just hold a 
>>> request->ref count,
>>> it just could avoid the request tag to be released and life-recycle, but 
>>> not completion.
>>>
>>> For the scsi mid-layer, what if a request is in error handler and normal 
>>> completion come
>>> at the moment ?
>>
>> Per my understanding, now the protection needs to be done completely by 
>> driver.
>>
> 
> But looks like the drivers have not prepared well to take over this work 
> right now.
> 

I modified the scsi-debug module as in the attachment
0001-scsi-debug-make-normal-completion-and-timeout-could-.patch
and tried to simulate the scenario where the timeout and completion paths
occur concurrently.
The system would run into a crash easily.
4.17-rc7 survived this test.

Maybe we could do as in the attachment
0001-blk-mq-protect-timed-out-request-against-completion-.patch
and then replace all the blk_mq_complete_request calls in the timeout path with
the new interface. Then we could preserve the capability to protect a timed-out
request against the completion path.
That patch also survived the test.

Thanks
Jianchao
>>
>> Thanks,
>> Ming Lei
>>
> 
>From 640a67e7b4386ac42ee789f54dd0898ecd00f8f7 Mon Sep 17 00:00:00 2001
From: Jianchao Wang 
Date: Tue, 12 Jun 2018 12:04:26 +0800
Subject: [PATCH] scsi-debug: make normal completion and timeout could occur
 concurrently

Invoke blk_abort_request to force the request timed out periodically,
when complete the request in workqueue context.

Signed-off-by: Jianchao Wang 
---
 drivers/scsi/scsi_debug.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 656c98e..2ca0280 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -4323,6 +4323,8 @@ static void setup_inject(struct sdebug_queue *sqp,
 	sqcp->inj_host_busy = !!(SDEBUG_OPT_HOST_BUSY & sdebug_opts);
 }
 
+static atomic_t g_abort_counter;
+
 /* Complete the processing of the thread that queued a SCSI command to this
  * driver. It either completes the command by calling cmnd_done() or
  * schedules a hr timer or work queue then returns 0. Returns
@@ -4459,6 +4461,11 @@ static int schedule_resp(struct scsi_cmnd *cmnd, struct sdebug_dev_info *devip,
 			sd_dp->issuing_cpu = raw_smp_processor_id();
 		sd_dp->defer_t = SDEB_DEFER_WQ;
		schedule_work(&sd_dp->ew.work);
+		atomic_inc(&g_abort_counter);
+		if (atomic_read(&g_abort_counter)%2000 == 0) {
+			blk_abort_request(cmnd->request);
+			trace_printk("abort request tag %d\n", cmnd->request->tag);
+		}
 	}
 	if (unlikely((SDEBUG_OPT_Q_NOISE & sdebug_opts) &&
 		 (scsi_result == device_qfull_result)))
@@ -5843,6 +5850,7 @@ static int sdebug_driver_probe(struct device *dev)
 	struct Scsi_Host *hpnt;
 	int hprot;
 
+	atomic_set(&g_abort_counter, 0);
 	sdbg_host = to_sdebug_host(dev);
 
 	sdebug_driver_template.can_queue = sdebug_max_queue;
-- 
2.7.4

>From fcc515b3a642c909e8b82d2a240014faff5acd44 Mon Sep 17 00:00:00 2001
From: Jianchao Wang 
Date: Tue, 12 Jun 2018 21:20:13 +0800
Subject: [PATCH] blk-mq: protect timed out request against completion path

Signed-off-by: Jianchao Wang 
---
 block/blk-mq.c | 22 +++---
 include/linux/blk-mq.h |  1 +
 include/linux/blkdev.h |  6 ++
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6332940..2714a23 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -473,6 +473,7 @@ static void __blk_mq_free_request(struct request *rq)
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
 	const int sched_tag = rq->internal_tag;
 
+	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -509,7 +510,6 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
-	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
+	if (refcount_dec_and_test(&rq->ref))
 		__blk_mq_free_request(rq);
 }
@@ -552,15 +552,17 @@ static void __blk_mq_complete_request_remote(void *data)
 	rq->q->softirq_done_fn(rq);
 }
 
-static void __blk_mq_complete_request(struct request *rq)
+/*
+ * The LLDD timeout path must invoke this interface to complete
+ * the request.
+ */
+void __blk_mq_complete_request(struct request *rq)
 {
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	bool shared = false;
 	int cpu;
 
-	if (cmpxchg(>state, MQ_RQ_IN_FL

Re: Regarding the recent new blk-mq timeout handling

2018-06-12 Thread jianchao.wang
Hi ming

Thanks for your kind response.

On 06/12/2018 06:17 PM, Ming Lei wrote:
> On Tue, Jun 12, 2018 at 6:04 PM, jianchao.wang
>  wrote:
>> Hi Jens and Christoph
>>
>> In the recent commit of new blk-mq timeout handling, we don't have any 
>> protection
>> on timed out request against the completion path. We just hold a 
>> request->ref count,
>> it just could avoid the request tag to be released and life-recycle, but not 
>> completion.
>>
>> For the scsi mid-layer, what if a request is in error handler and normal 
>> completion come
>> at the moment ?
> 
> Per my understanding, now the protection needs to be done completely by 
> driver.
> 

But it looks like the drivers are not well prepared to take over this work
right now.

Thanks
Jianchao


> 
> Thanks,
> Ming Lei
> 


Regarding the recent new blk-mq timeout handling

2018-06-12 Thread jianchao.wang
Hi Jens and Christoph

In the recent commit of the new blk-mq timeout handling, we don't have any
protection of a timed-out request against the completion path. We just hold a
request->ref count, which only prevents the request tag from being released
and recycled, but does not prevent completion.

For the scsi mid-layer, what if a request is in the error handler and a normal
completion comes at that moment ?

Or am I missing something about this commit ?


Thanks
Jianchao


Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread jianchao.wang
Hi Jens and Holger

Thanks for your kind response.
That's really appreciated.

I will post next version based on Jens' patch.

Thanks
Jianchao

On 05/23/2018 02:32 AM, Holger Hoffstätte wrote:
 This looks great but prevents kyber from being built as module,
 which is AFAIK supposed to work (and works now):

 ..
   CC [M]  block/kyber-iosched.o
   Building modules, stage 2.
   MODPOST 313 modules
 ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
 ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
 ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
 ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
 ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
 ..

 It does build fine when compiled in, obviously. :)
>>> It's basically duplicating the contents of blk_attempt_plug_merge().
>>> I would suggest abstracting out the list loop and merge check
>>> into a helper, that could then both be called from kyber and the
>>> plug merge function.
>> See attached, prep patch and yours rebased on top of it.
> That was quick. :)
> 
> Applies smoothly on top of my 4.16++ tree, now builds correctly as
> module and is reproducibly (slightly) faster even on my pedestrian
> SATA SSDs, now on par or occasionally even faster than mq-deadline.
> What's not to like? So:
> 
> Tested-by: Holger Hoffstätte 


Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread jianchao.wang
Hi Omar

Thanks for your kind response.

On 05/23/2018 04:02 AM, Omar Sandoval wrote:
> On Tue, May 22, 2018 at 10:48:29PM +0800, Jianchao Wang wrote:
>> Currently, kyber is very unfriendly with merging. kyber depends
>> on ctx rq_list to do merging, however, most of time, it will not
>> leave any requests in ctx rq_list. This is because even if tokens
>> of one domain is used up, kyber will try to dispatch requests
>> from other domain and flush the rq_list there.
> 
> That's a great catch, I totally missed this.
> 
> This approach does end up duplicating a lot of code with the blk-mq core
> even after Jens' change, so I'm curious if you tried other approaches.
> One idea I had is to try the bio merge against the kqd->rqs lists. Since
> that's per-queue, the locking overhead might be too high. Alternatively,

Yes, I did make a patch as you say, trying the bio merge against kqd->rqs
directly. That patch looks even simpler. However, because the khd->lock is
needed every time we try a bio merge, there may be high contention overhead on
khd->lock when the cpu-hctx mapping is not 1:1.

> you could keep the software queues as-is but add our own version of
> flush_busy_ctxs() that only removes requests of the domain that we want.
> If one domain gets backed up, that might get messy with long iterations,
> though.

Yes, I also considered this approach :)
But the long iterations over every ctx->rq_list look really inefficient.

> 
> Regarding this approach, a couple of comments below.
...
>>  }
>> @@ -379,12 +414,33 @@ static void kyber_exit_sched(struct elevator_queue *e)
>>  static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int 
>> hctx_idx)
>>  {
>>  struct kyber_hctx_data *khd;
>> +struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
>>  int i;
>> +int sd;
>>  
>>  khd = kmalloc_node(sizeof(*khd), GFP_KERNEL, hctx->numa_node);
>>  if (!khd)
>>  return -ENOMEM;
>>  
>> +khd->kcqs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
>> +GFP_KERNEL, hctx->numa_node);
>> +if (!khd->kcqs)
>> +goto err_khd;
> 
> Why the double indirection of a percpu allocation per hardware queue
> here? With, say, 56 cpus and that many hardware queues, that's 3136
> pointers, which seems like overkill. Can't you just use the percpu array
> in the kqd directly, or make it per-hardware queue instead?

Oops, I forgot to change nr_cpu_ids to hctx->nr_ctx.
The mapping between cpu and hctx has already been set up when kyber_init_hctx
is invoked, so we only need to allocate hctx->nr_ctx kyber_ctx_queue entries
per khd; a small sketch of the corrected allocation follows.
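A small sketch of the corrected allocation (kyber_ctx_queue and khd->kcqs come
from the patch under discussion; the init helper name is assumed):

	khd->kcqs = kmalloc_array_node(hctx->nr_ctx,
				       sizeof(struct kyber_ctx_queue),
				       GFP_KERNEL, hctx->numa_node);
	if (!khd->kcqs)
		return -ENOMEM;

	for (i = 0; i < hctx->nr_ctx; i++)
		kyber_ctx_queue_init(&khd->kcqs[i]);

So the allocation scales with the number of ctxs mapped to this hctx rather
than with nr_cpu_ids.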

...
>> +static int bio_sched_domain(const struct bio *bio)
>> +{
>> +unsigned int op = bio->bi_opf;
>> +
>> +if ((op & REQ_OP_MASK) == REQ_OP_READ)
>> +return KYBER_READ;
>> +else if ((op & REQ_OP_MASK) == REQ_OP_WRITE && op_is_sync(op))
>> +return KYBER_SYNC_WRITE;
>> +else
>> +return KYBER_OTHER;
>> +}
> 
> Please add a common helper for rq_sched_domain() and bio_sched_domain()
> instead of duplicating the logic.
> 

Yes, I will do it in the next version; a possible common helper is sketched
below.
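A possible shape for that common helper (kyber_sched_domain() is a name I made
up; it just centralizes the op/sync classification shared by rq_sched_domain()
and bio_sched_domain()):

	static unsigned int kyber_sched_domain(unsigned int op)
	{
		if ((op & REQ_OP_MASK) == REQ_OP_READ)
			return KYBER_READ;
		else if ((op & REQ_OP_MASK) == REQ_OP_WRITE && op_is_sync(op))
			return KYBER_SYNC_WRITE;
		else
			return KYBER_OTHER;
	}

Then rq_sched_domain() would pass rq->cmd_flags and bio_sched_domain() would
pass bio->bi_opf.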

Thanks
Jianchao


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-15 Thread jianchao.wang
Hi Ming

On 05/15/2018 08:56 PM, Ming Lei wrote:
> Looks a nice fix on nvme_create_queue(), but seems the change on 
> adapter_alloc_cq() is missed in above patch.
> 
> Could you prepare a formal one so that I may integrate it to V6?

Please refer to the patch below:

Thanks
Jianchao
>From 9bb6db79901ef303cd40c4c911604bc053b1ad95 Mon Sep 17 00:00:00 2001
From: Jianchao Wang 
Date: Wed, 16 May 2018 09:45:45 +0800
Subject: [PATCH] nvme-pci: set nvmeq->cq_vector after alloc cq/sq

Currently nvmeq->cq_vector is set before alloc cq/sq. If the alloc
cq/sq command timeout, nvme_suspend_queue will invoke free_irq
for the nvmeq because the cq_vector is valid, this will cause warning
'Trying to free already-free IRQ xxx'. set nvmeq->cq_vector after
alloc cq/sq successes to fix this.

Signed-off-by: Jianchao Wang 
---
 drivers/nvme/host/pci.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fbc71fa..c830092 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1070,7 +1070,7 @@ static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)
 }
 
 static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
-		struct nvme_queue *nvmeq)
+		struct nvme_queue *nvmeq, int cq_vector)
 {
 	struct nvme_command c;
 	int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;
@@ -1085,7 +1085,7 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 	c.create_cq.cqid = cpu_to_le16(qid);
 	c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
 	c.create_cq.cq_flags = cpu_to_le16(flags);
-	c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);
+	c.create_cq.irq_vector = cpu_to_le16(cq_vector);
 
	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 }
@@ -1450,6 +1450,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 	int result;
+	int cq_vector;
 
 	if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
 		unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
@@ -1462,15 +1463,21 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
 	 */
-	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
-	result = adapter_alloc_cq(dev, qid, nvmeq);
+	cq_vector = dev->num_vecs == 1 ? 0 : qid;
+	result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
 	if (result < 0)
-		goto release_vector;
+		goto out;
 
 	result = adapter_alloc_sq(dev, qid, nvmeq);
 	if (result < 0)
 		goto release_cq;
 
+	/*
+	 * set cq_vector after alloc cq/sq, otherwise, if alloc cq/sq command
+	 * timeout, nvme_suspend_queue will invoke free_irq for it and cause warning
+	 * 'Trying to free already-free IRQ xxx'
+	 */
+	nvmeq->cq_vector = cq_vector;
 	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
@@ -1478,13 +1485,13 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 
 	return result;
 
- release_sq:
+release_sq:
+	nvmeq->cq_vector = -1;
 	dev->online_queues--;
 	adapter_delete_sq(dev, qid);
- release_cq:
+release_cq:
 	adapter_delete_cq(dev, qid);
- release_vector:
-	nvmeq->cq_vector = -1;
+out:
 	return result;
 }
 
-- 
2.7.4



Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-15 Thread jianchao.wang
Hi ming

On 05/16/2018 10:09 AM, Ming Lei wrote:
> So could you check if only the patch("unquiesce admin queue after shutdown
> controller") can fix your IO hang issue?

I did test this before fixing the warning.
It could fix the io hang issue. :)

Thanks
Jianchao


Re: [PATCH V5 9/9] nvme: pci: support nested EH

2018-05-15 Thread jianchao.wang
Hi ming

On 05/11/2018 08:29 PM, Ming Lei wrote:
> +static void nvme_eh_done(struct nvme_eh_work *eh_work, int result)
> +{
> + struct nvme_dev *dev = eh_work->dev;
> + bool top_eh;
> +
> + spin_lock(&dev->eh_lock);
> + top_eh = list_is_last(&eh_work->list, &dev->eh_head);
> + dev->nested_eh--;
> +
> + /* Fail controller if the top EH can't recover it */
> + if (!result)
> + wake_up_all(&dev->eh_wq);
> + else if (top_eh) {
> + dev->ctrl_failed = true;
> + nvme_eh_sched_fail_ctrl(dev);
> + wake_up_all(&dev->eh_wq);
> + }
> +
> + list_del(&eh_work->list);
> + spin_unlock(&dev->eh_lock);
> +
> + dev_info(dev->ctrl.device, "EH %d: state %d, eh_done %d, top eh %d\n",
> + eh_work->seq, dev->ctrl.state, result, top_eh);
> + wait_event(dev->eh_wq, nvme_eh_reset_done(dev));

nested_eh is decreased before the EH exits, so a new EH started afterwards will
have a confusing seq number.
Please refer to the following log:
[ 1342.961869] nvme nvme0: Abort status: 0x0
[ 1342.961878] nvme nvme0: Abort status: 0x0
[ 1343.148341] nvme nvme0: EH 0: after shutdown, top eh: 1
[ 1403.828484] nvme nvme0: I/O 21 QID 0 timeout, disable controller
[ 1403.828603] nvme nvme0: EH 1: before shutdown
... waring logs are ignored here 
[ 1403.984731] nvme nvme0: EH 0: state 4, eh_done -4, top eh 0  // EH0 goes to wait
[ 1403.984786] nvme nvme0: EH 1: after shutdown, top eh: 1
[ 1464.856290] nvme nvme0: I/O 22 QID 0 timeout, disable controller  // timeout again in EH 1
[ 1464.856411] nvme nvme0: EH 1: before shutdown // a new EH has a 1 seq number

Is it expected that the new EH has seq number 1 instead of 2 ?

Thanks
Jianchao


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-15 Thread jianchao.wang
Hi ming

On 05/15/2018 08:33 AM, Ming Lei wrote:
> We still have to quiesce admin queue before canceling request, so looks
> the following patch is better, so please ignore the above patch and try
> the following one and see if your hang can be addressed:
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f509d37b2fb8..c2adc76472a8 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>   dev->ctrl.admin_q = NULL;
>   return -ENODEV;
>   }
> - } else
> - blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> + }
>  
>   return 0;
>  }
> @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, 
> bool shutdown, bool
>*/
>   if (shutdown)
> + nvme_start_queues(&dev->ctrl);
> +
> + /*
> +  * Avoid to suck reset because timeout may happen during reset and
> +  * reset may hang forever if admin queue is kept as quiesced
> +  */
> + blk_mq_unquiesce_queue(dev->ctrl.admin_q);
>   mutex_unlock(>shutdown_lock);
>  }

With the patch above and the patch below, neither the warning nor the IO hang
issue has reproduced so far.


@@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, 
int qid)
 {
struct nvme_dev *dev = nvmeq->dev;
int result;
+   int cq_vector;
 
if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
@@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, 
int qid)
 * A queue's vector matches the queue identifier unless the controller
 * has only one vector available.
 */
-   nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
-   result = adapter_alloc_cq(dev, qid, nvmeq);
+   cq_vector = dev->num_vecs == 1 ? 0 : qid;
+   result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
if (result < 0)
-   goto release_vector;
+   goto out;
 
result = adapter_alloc_sq(dev, qid, nvmeq);
if (result < 0)
goto release_cq;
-
+   
+   nvmeq->cq_vector = cq_vector;
nvme_init_queue(nvmeq, qid);
result = queue_request_irq(nvmeq);
if (result < 0)
@@ -1479,12 +1679,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, 
int qid)
return result;
 
  release_sq:
+   nvmeq->cq_vector = -1;
dev->online_queues--;
adapter_delete_sq(dev, qid);
  release_cq:
adapter_delete_cq(dev, qid);
- release_vector:
-   nvmeq->cq_vector = -1;
+ out:
return result;
 }

Thanks
Jianchao


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread jianchao.wang
Hi ming

On 05/14/2018 05:38 PM, Ming Lei wrote:
>> Here is the deadlock scenario.
>>
>> nvme_eh_work // EH0
>>   -> nvme_reset_dev //hold reset_lock
>> -> nvme_setup_io_queues
>>   -> nvme_create_io_queues
>> -> nvme_create_queue
>>   -> set nvmeq->cq_vector
>>   -> adapter_alloc_cq
>>   -> adapter_alloc_sq
>>  irq has not been requested
>>  io timeout 
>> nvme_eh_work //EH1
>>   -> nvme_dev_disable
>> -> quiesce the adminq //> here !
>> -> nvme_suspend_queue
>>   print out warning Trying to free 
>> already-free IRQ 133
>> -> nvme_cancel_request // complete 
>> the timeout admin request
>>   -> require reset_lock
>>   -> adapter_delete_cq
> If the admin IO submitted in adapter_alloc_sq() is timed out,
> nvme_dev_disable() in EH1 will complete it which is set as 
> REQ_FAILFAST_DRIVER,
> then adapter_alloc_sq() should return error, and the whole reset in EH0
> should have been terminated immediately.

Please refer to the following segment:

static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
{
	struct nvme_dev *dev = nvmeq->dev;
	int result;
...
	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
	result = adapter_alloc_cq(dev, qid, nvmeq);
	if (result < 0)
		goto release_vector;

	result = adapter_alloc_sq(dev, qid, nvmeq);	// if this times out and fails here
	if (result < 0)
		goto release_cq;

	nvme_init_queue(nvmeq, qid);
	result = queue_request_irq(nvmeq);
	if (result < 0)
		goto release_sq;

	return result;

 release_sq:
	dev->online_queues--;
	adapter_delete_sq(dev, qid);
 release_cq:				// we will be here !
	adapter_delete_cq(dev, qid);	// another cq delete admin command will be sent out
 release_vector:
	nvmeq->cq_vector = -1;
	return result;
}


> 
> I guess the issue should be that nvme_create_io_queues() ignores the failure.
> 
> Could you dump the stack trace of EH0 reset task? So that we may see
> where EH0 reset kthread hangs.

root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2273/stack 
[<0>] blk_execute_rq+0xf7/0x150
[<0>] __nvme_submit_sync_cmd+0x94/0x110
[<0>] nvme_submit_sync_cmd+0x1b/0x20
[<0>] adapter_delete_queue+0xad/0xf0
[<0>] nvme_reset_dev+0x1b67/0x2450
[<0>] nvme_eh_work+0x19c/0x4b0
[<0>] process_one_work+0x3ca/0xaa0
[<0>] worker_thread+0x89/0x6c0
[<0>] kthread+0x18d/0x1e0
[<0>] ret_from_fork+0x24/0x30
[<0>] 0x
root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2275/stack 
[<0>] nvme_eh_work+0x11a/0x4b0
[<0>] process_one_work+0x3ca/0xaa0
[<0>] worker_thread+0x89/0x6c0
[<0>] kthread+0x18d/0x1e0
[<0>] ret_from_fork+0x24/0x30
[<0>] 0x

> 
>> -> adapter_delete_queue // submit to the adminq which has been 
>> quiesced.
>>   -> nvme_submit_sync_cmd
>> -> blk_execute_rq
>>   -> wait_for_completion_io_timeout
>>   hang_check is true, so there is no hung task warning for 
>> this context
>>
>> EH0 submit cq delete admin command, but it will never be completed or timed 
>> out, because the admin request queue has
>> been quiesced, so the reset_lock cannot be released, and EH1 cannot get 
>> reset_lock and make things forward.
> The nvme_dev_disable() in outer EH(EH1 in above log) will complete all
> admin command, which won't be retried because it is set as
> REQ_FAILFAST_DRIVER, so nvme_cancel_request() will complete it in
> nvme_dev_disable().

This cq delete admin command is sent out after EH 1's nvme_dev_disable has
completed and failed the previously timed-out sq alloc admin command. Please
refer to the code segment above.

Thanks
jianchao


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread jianchao.wang
Hi ming

Please refer to my test log and analysis.

[  229.872622] nvme nvme0: I/O 164 QID 1 timeout, reset controller
[  229.872649] nvme nvme0: EH 0: before shutdown
[  229.872683] nvme nvme0: I/O 165 QID 1 timeout, reset controller
[  229.872700] nvme nvme0: I/O 166 QID 1 timeout, reset controller
[  229.872716] nvme nvme0: I/O 167 QID 1 timeout, reset controller
[  229.872733] nvme nvme0: I/O 169 QID 1 timeout, reset controller
[  229.872750] nvme nvme0: I/O 173 QID 1 timeout, reset controller
[  229.872766] nvme nvme0: I/O 174 QID 1 timeout, reset controller
[  229.872783] nvme nvme0: I/O 178 QID 1 timeout, reset controller
[  229.872800] nvme nvme0: I/O 182 QID 1 timeout, aborting
[  229.872900] nvme nvme0: I/O 185 QID 1 timeout, aborting
[  229.872975] nvme nvme0: I/O 190 QID 1 timeout, reset controller
[  229.872990] nvme nvme0: I/O 191 QID 1 timeout, aborting
[  229.873021] nvme nvme0: I/O 5 QID 2 timeout, aborting
[  229.873096] nvme nvme0: I/O 40 QID 2 timeout, aborting
[  229.874041] nvme nvme0: Abort status: 0x0
[  229.874064] nvme nvme0: Abort status: 0x0
[  229.874078] nvme nvme0: Abort status: 0x0
[  229.874092] nvme nvme0: Abort status: 0x0
[  229.874106] nvme nvme0: Abort status: 0x0
[  230.060698] nvme nvme0: EH 0: after shutdown, top eh: 1
[  290.840592] nvme nvme0: I/O 18 QID 0 timeout, disable controller
[  290.840746] nvme nvme0: EH 1: before shutdown

[  290.992650] [ cut here ]
[  290.992751] Trying to free already-free IRQ 133
[  290.992783] WARNING: CPU: 6 PID: 227 at 
/home/will/u04/source_code/linux-stable/kernel/irq/manage.c:1549 
__free_irq+0xf6/0x420
[  290.993394] CPU: 6 PID: 227 Comm: kworker/u16:4 Kdump: loaded Not tainted 
4.17.0-rc3+ #37
[  290.993402] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  290.993418] Workqueue: nvme-reset-wq nvme_eh_work
[  290.993436] RIP: 0010:__free_irq+0xf6/0x420
...
[  290.993541] Call Trace:
[  290.993581]  free_irq+0x4c/0xa0
[  290.993600]  pci_free_irq+0x18/0x30
[  290.993615]  nvme_dev_disable+0x20b/0x720
[  290.993745]  nvme_eh_work+0xdd/0x4b0
[  290.993866]  process_one_work+0x3ca/0xaa0
[  290.993960]  worker_thread+0x89/0x6c0
[  290.994017]  kthread+0x18d/0x1e0
[  290.994061]  ret_from_fork+0x24/0x30
[  338.912379] INFO: task kworker/u16:4:227 blocked for more than 30 seconds.
[  338.913348]   Tainted: GW 4.17.0-rc3+ #37
[  338.913549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  338.913809] kworker/u16:4   D0   227  2 0x8000
[  338.913842] Workqueue: nvme-reset-wq nvme_eh_work
[  338.913857] Call Trace:
[  338.914002]  schedule+0x52/0xe0
[  338.914019]  schedule_preempt_disabled+0x14/0x20
[  338.914029]  __mutex_lock+0x5b4/0xca0
[  338.914228]  nvme_eh_work+0x11a/0x4b0
[  338.914344]  process_one_work+0x3ca/0xaa0
[  338.914440]  worker_thread+0x89/0x6c0
[  338.914496]  kthread+0x18d/0x1e0
[  338.914542]  ret_from_fork+0x24/0x30
[  338.914648]

Here is the deadlock scenario.

nvme_eh_work // EH0
  -> nvme_reset_dev //hold reset_lock
    -> nvme_setup_io_queues
      -> nvme_create_io_queues
        -> nvme_create_queue
          -> set nvmeq->cq_vector
          -> adapter_alloc_cq
          -> adapter_alloc_sq
             irq has not been requested
             io timeout
                                nvme_eh_work //EH1
                                  -> nvme_dev_disable
                                    -> quiesce the adminq //--> here !
                                    -> nvme_suspend_queue
                                       print out warning Trying to free already-free IRQ 133
                                    -> nvme_cancel_request // complete the timeout admin request
                                  -> require reset_lock
          -> adapter_delete_cq
            -> adapter_delete_queue // submit to the adminq which has been quiesced.
              -> nvme_submit_sync_cmd
                -> blk_execute_rq
                  -> wait_for_completion_io_timeout
                     hang_check is true, so there is no hung task warning for this context

EH0 submits a cq delete admin command, but it will never be completed or timed
out because the admin request queue has been quiesced, so the reset_lock cannot
be released and EH1 cannot acquire the reset_lock to move things forward.

Thanks
Jianchao


Re: [PATCH] blk-mq: Rework blk-mq timeout handling again

2018-05-13 Thread jianchao.wang
Hi Bart

On 05/14/2018 12:03 PM, Bart Van Assche wrote:
> On Mon, 2018-05-14 at 09:37 +0800, jianchao.wang wrote:
>> In addition, on a 64bit system, how do you set up the timer with a 32bit 
>> deadline ?
> 
> If timeout handling occurs less than (1 << 31) / HZ seconds after a request 
> has been
> submitted then a 32-bit variable is sufficient to store the deadline. If you 
> want to
> verify the implementation, please have a look at the int32_t variables in
> blk_mq_check_expired().
> 
A 32-bit deadline is indeed enough to judge whether a request has timed out,
but how is the expires value determined for __blk_add_timer -> mod_timer ?
As we know, when a request is started, we need to arm a timer for it, and the
expires value is an 'unsigned long'. A sketch of what I mean is below.
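Just to make the question concrete, one way the 32-bit deadline could be turned
back into an expires value (purely my own sketch, not anything from the patch):

	static unsigned long blk_mq_rq_expires(struct request *rq)
	{
		u32 deadline = rq->das.deadline;	/* low 32 bits of the deadline */
		u32 delta = deadline - (u32)jiffies;	/* wrap-safe 32-bit distance */

		/* re-anchor the 32-bit deadline against the full jiffies value */
		return jiffies + delta;
	}

This only works while the deadline is less than 2^31 jiffies away, which is the
same assumption the 32-bit comparison in blk_mq_check_expired() relies on.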

Thanks
Jianchao 



Re: [PATCH] blk-mq: Rework blk-mq timeout handling again

2018-05-13 Thread jianchao.wang
Hi Bart

On 05/11/2018 11:26 PM, Bart Van Assche wrote:
> The bug is in the above trace_printk() call: blk_rq_deadline() must only be 
> used
> for the legacy block layer and not for blk-mq code. If you have a look at the 
> value
> of the das.deadline field then one can see that the value of that field is 
> correct:

blk_rq_deadline is used by __blk_add_timer, and __blk_add_timer is also used by
blk_mq_add_timer. This is why I used blk_rq_deadline to print out the value in
my test.
On a 64-bit system das.deadline will be a correct value, but what if jiffies
has crossed the 32-bit boundary ?

> 0x550c - 0x37c0 = 7500 = 30 * 250. Does that mean that HZ = 250 on 
> the system
> on which you ran this test?
> 

Yes, the HZ is 250

Thanks
Jianchao



Re: [PATCH] blk-mq: Rework blk-mq timeout handling again

2018-05-13 Thread jianchao.wang

Hi bart

On 05/11/2018 11:29 PM, Bart Van Assche wrote:
> On Fri, 2018-05-11 at 14:35 +0200, Christoph Hellwig wrote:
>>> It should be due to union blk_deadline_and_state. 
>>> +union blk_deadline_and_state {
>>> +   struct {
>>> +   uint32_t generation:30;
>>> +   uint32_t state:2;
>>> +   uint32_t deadline;
>>> +   };
>>> +   unsigned long legacy_deadline;
>>> +   uint64_t das;
>>> +};
>>
>> Yikes.  Or we just move it into a separate field.  This patch already
>> shrinks struct request a lot, so I'd rather do that to keep it simple.
> 
> Hello Christoph,
> 
> Are you perhaps referring to the legacy_deadline field? Have you noticed that
> Jianchao used a legacy block layer function in blk-mq code and that that is
> why a wrong value for the deadline appeared in the trace output?
> 

I used blk_rq_deadline because __blk_add_timer uses it:

-
/*
 * If the timer isn't already pending or this timeout is earlier
 * than an existing one, modify the timer. Round up to next nearest
 * second.
 */
expiry = blk_rq_timeout(round_jiffies_up(blk_rq_deadline(req)));
-
if (!timer_pending(&q->timeout) ||
time_before(expiry, q->timeout.expires)) {
unsigned long diff = q->timeout.expires - expiry;


This is also the reason why there was no timeout issue during the test:
blk_rq_timeout will return a reasonable value.

In addition, on a 64-bit system, how do you set up the timer with a 32-bit
deadline ?


Thanks
Jianchao


Re: [PATCH] blk-mq: Rework blk-mq timeout handling again

2018-05-11 Thread jianchao.wang
Hi bart

I added a debug log in blk_mq_add_timer as follows:

void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
		      enum mq_rq_state new)
{
	struct request_queue *q = req->q;

	if (!req->timeout)
		req->timeout = q->rq_timeout;
	if (!blk_mq_rq_set_deadline(req, jiffies + req->timeout, old, new))
		WARN_ON_ONCE(true);

	trace_printk("jiffies %lx to %x ldl %lx gen %u dl %x\n",
		     jiffies,
		     req->timeout,
		     blk_rq_deadline(req),
		     req->das.generation,
		     req->das.deadline);

	return __blk_add_timer(req);
}

And get log below:

 jbd2/sda2-8-320   [000] ...195.030824: blk_mq_add_timer: jiffies 
37c0 to 1d4c ldl 550c4000 gen 0 dl 550c
kworker/0:1H-136   [000] ...195.031822: blk_mq_add_timer: jiffies 
37c0 to 1d4c ldl 550c4000 gen 0 dl 550c
kworker/6:1H-244   [006] ...195.041695: blk_mq_add_timer: jiffies 
37c3 to 1d4c ldl 550f4000 gen 0 dl 550f
kworker/6:1H-244   [006] ...195.041954: blk_mq_add_timer: jiffies 
37c3 to 1d4c ldl 550f4000 gen 0 dl 550f

blk_rq_deadline returns 550c4000, which looks really crazy.
It should be due to the union blk_deadline_and_state, presumably because
blk_rq_deadline reads the legacy_deadline member, which aliases the packed
generation/state/deadline bits rather than holding a plain jiffies value.
+union blk_deadline_and_state {
+   struct {
+   uint32_t generation:30;
+   uint32_t state:2;
+   uint32_t deadline;
+   };
+   unsigned long legacy_deadline;
+   uint64_t das;
+};

And generation never changes.
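A tiny userspace demo of that aliasing (just an illustration of the union
quoted above, not kernel code):

	#include <stdio.h>
	#include <stdint.h>

	union blk_deadline_and_state {
		struct {
			uint32_t generation:30;
			uint32_t state:2;
			uint32_t deadline;
		};
		unsigned long legacy_deadline;
		uint64_t das;
	};

	int main(void)
	{
		union blk_deadline_and_state das = { .das = 0 };

		das.deadline = 0x550c;	/* jiffies 0x37c0 + timeout 0x1d4c */
		das.state = 1;		/* MQ_RQ_IN_FLIGHT */

		/* on x86-64 this prints 550c40000000: the deadline lands in
		 * the upper 32 bits, state/generation in the lower 32 bits */
		printf("legacy_deadline = %lx\n", das.legacy_deadline);
		return 0;
	}

So whatever legacy_deadline (and therefore blk_rq_deadline()) returns for a
blk-mq request is a packed value, not a plain jiffies deadline.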

Thanks
Jianchao



Re: [PATCH V4 0/7] nvme: pci: fix & improve timeout handling

2018-05-08 Thread jianchao.wang
Hi ming

I did some tests on my local machine.

[  598.828578] nvme nvme0: I/O 51 QID 4 timeout, disable controller

This should be a timeout on nvme_reset_dev->nvme_wait_freeze.

[  598.828743] nvme nvme0: EH 1: before shutdown
[  599.013586] nvme nvme0: EH 1: after shutdown
[  599.137197] nvme nvme0: EH 1: after recovery

So EH 1 has marked the state as LIVE.

[  599.137241] nvme nvme0: failed to mark controller state 1

So EH 0 failed to mark the state as LIVE, and the controller was removed.
This should not be the expected behaviour of nested EH.

[  599.137322] nvme nvme0: Removing after probe failure status: 0
[  599.326539] nvme nvme0: EH 0: after recovery
[  599.326760] nvme0n1: detected capacity change from 128035676160 to 0
[  599.457208] nvme nvme0: failed to set APST feature (-19)

nvme_reset_dev should identify whether it is nested.
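
A rough sketch of the kind of check I mean (nvme_reset_dev and friends are from this
patchset; the body below is hypothetical, just to show the idea):

static void nvme_reset_dev_mark_live(struct nvme_dev *dev)
{
	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_LIVE)) {
		/*
		 * A nested EH instance may already have recovered the
		 * controller and marked it LIVE; in that case the failed
		 * state change is not fatal and we must not remove the
		 * device.
		 */
		if (dev->ctrl.state == NVME_CTRL_LIVE)
			return;

		dev_warn(dev->ctrl.device, "failed to mark controller live\n");
		nvme_remove_dead_ctrl(dev, 0);
	}
}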

Thanks
Jianchao


Re: [PATCH V3 7/8] nvme: pci: recover controller reliably

2018-05-04 Thread jianchao.wang
Hi ming

On 05/04/2018 04:02 PM, Ming Lei wrote:
>> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing 
>> another interface.
>> Then it is more convenient to ensure that there will be only one resetting 
>> instance running.
>>
> But as you mentioned above, reset_work has to be splitted into two
> contexts for handling IO timeout during wait_freeze in reset_work,
> so single instance of nvme_reset_ctrl() may not work well.

I mean the EH kthread and the reset_work, both of which could reset the ctrl, rather
than the pre- and post-reset contexts.

Honestly, I doubt a bit whether it is worth trying to recover from [1].
The EH kthread solution could make things easier, but the code for recovering from [1]
has made things really complicated, and it is more difficult to unify nvme-pci, rdma
and fc.

How about just failing the reset, as in Keith's solution?

[1] IO timeout when nvme_reset_work or the new nvme_post_reset_dev invokes
nvme_wait_freeze.

Thanks
Jianchao


Re: [PATCH V3 7/8] nvme: pci: recover controller reliably

2018-05-04 Thread jianchao.wang
Oh sorry.

On 05/04/2018 02:10 PM, jianchao.wang wrote:
> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing 
> another interface.
> Then it is more convenient to ensure that there will be only one resetting 
> instance running.

The ctrl is still in the RESETTING state, so nvme_reset_ctrl cannot work.

Thanks
Jianchao


Re: [PATCH V3 7/8] nvme: pci: recover controller reliably

2018-05-04 Thread jianchao.wang
Hi ming

On 05/04/2018 12:24 PM, Ming Lei wrote:
>> Just invoke nvme_dev_disable in nvme_error_handler context and hand over the 
>> other things
>> to nvme_reset_work as the v2 patch series seems clearer.
> That way may not fix the current race: nvme_dev_disable() will
> quiesce/freeze queue again when resetting is in-progress, then
> nothing can move on.

With timeout quiescing and the nvme EH kthread, nvme_dev_disable could ensure that all
the in-flight requests are handled. Once we queue the reset_work in nvme_error_handler,
there will be no timeouts anymore.
If there is a timeout due to admin IO in reset_work, that is expected; reset_work will
be interrupted and fail.
If IO times out while reset_work waits for the freeze, things cannot move on because
reset_work itself is waiting. We have multiple methods to handle this:
1. As in Keith's solution, change the ctrl state to DELETING and fail the reset_work.
2. As in the current patch set, try to recover it, but then we have to split reset_work
   into two contexts.

Actually, there are many other places where nvme_dev_disable will be invoked.
> 
> One big change in V3 is to run nvme_dev_disable() & nvme_pre_reset_dev()
> in the EH thread, and make sure queues are started once
> nvme_pre_reset_dev() returns. But as you pointed, it may not work well
> enough if admin requests are timed-out in nvme_pre_reset_dev().
> 
>> The primary target of nvme_error_handler is to drain IOs and disable 
>> controller.
>> reset_work could take over the left things of the recovery work
> Draining IOs isn't necessary, we can remove freezing queues & wait_freeze()
> from nvme_dev_disable() & nvme_reset_work() for non-shutdown, that can be
> easy to fix all these races since the .eh_reset_work isn't needed any more,
> because quiescing is enough to prevent IOs from entering driver/device.
> 
> The only issue is that more requests may be failed when reset failed,
> so I don't want to try that, and just follows the current behaviour.
> 
> IMO the primary goal of nvme EH is to recover controller, that is
> what nvme_dev_disable() and resetting should do, if IO isn't drained,
> timeout still may be triggered.
> 
> The big trouble is about percpu_ref, once the q_usage_counter is killed
> to start freezing for preventing IO from entering block layer, we have
> to run blk_mq_freeze_queue_wait() before unfreezing the usage couner. If
> blk_mq_freeze_queue_wait() isn't needed, things can be much easier for us.


Sorry for my bad description. I mean nvme_dev_disable will drain all the in-flight
requests and requeue or fail them, so we will not get any timeouts again.

In fact, I have talked with Keith before about removing the queue freezing & wait_freeze
for the non-shutdown case. There is an obstacle: nvme_dev_add -> blk_mq_update_nr_hw_queues.
When the ctrl comes back, the number of hw queues may have changed. Even if we remove
the wait_freeze, blk_mq_update_nr_hw_queues will still freeze and wait itself.
On the other hand, if the queues are not frozen, the requests submitted to a dying hw
queue during the reset will be sacrificed.


> 
>> IMO, modify the nvme_reset_ctrl_sync in nvme_error_handler to 
>> nvme_reset_ctrl should be ok.
>> If the admin ios in reset_work timeout, the nvme_error_handler could also 
>> provide help to complete them.
>>
> Right, that is one goal of this patchset.
> 
>>> There are actually two issues here, both are covered in this patchset:
>>>
>>> 1) when nvme_wait_freeze() is for draining IO, controller error may
>>> happen again, this patch can shutdown & reset controller again to
>>> recover it.
>>>
>> We could split the nvme_reset_work into two contexts instead of introduce 
>> another nvme_eh_reset.
>> Otherwise, the code looks really complicated.
> Yeah, that is what I plan to do in V4, introduce .eh_pre_reset_work and
> .eh_post_reset_work, and run the following things in .eh_pre_reset_work:
> 
>   - nvme_pre_reset_dev()
>   - schedule .eh_post_reset_work
> 
> and run draining IO & updating controller state in .eh_post_reset_work.
> .eh_pre_reset_work will be scheduled in nvme_error_handler().
> 
> This way may address the issue of admin req timeout in nvme_pre_reset_dev(),
> what do you think of this approach?

nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another 
interface.
Then it is more convenient to ensure that there will be only one resetting 
instance running.

Thanks
Jianchao


Re: [PATCH V3 7/8] nvme: pci: recover controller reliably

2018-05-03 Thread jianchao.wang
Hi Ming

Thanks for your kind response.

On 05/03/2018 06:08 PM, Ming Lei wrote:
> nvme_eh_reset() can move on, if controller state is either CONNECTING or
> RESETTING, nvme_change_ctrl_state(>ctrl, NVME_CTRL_RESETTING) won't
> be called in nvme_eh_reset(), and nvme_pre_reset_dev() will be called
> directly, then follows scheduling of the eh_reset_work.

We have to consider another context, nvme_reset_work.
There are two contexts that will invoke nvme_pre_reset_dev.
nvme_eh_reset could invoke nvme_pre_reset_dev even if the state is RESETTING or 
CONNECTING.

nvme_error_handler                          nvme_reset_ctrl
                                              -> change state to RESETTING
                                              -> queue reset_work
                                                 nvme_reset_work
  -> nvme_dev_disable
  -> nvme_eh_reset
    -> nvme_pre_reset_dev                        -> nvme_pre_reset_dev
      -> change state to CONNECTING                -> change state fails
                                                   -> nvme_remove_dead_ctrl
There seem to be other race scenarios between them as well.

Just invoking nvme_dev_disable in the nvme_error_handler context and handing the rest
over to nvme_reset_work, as in the v2 patch series, seems clearer.
The primary target of nvme_error_handler is to drain IOs and disable the controller;
reset_work could take over the remaining recovery work.

IMO, changing the nvme_reset_ctrl_sync call in nvme_error_handler to nvme_reset_ctrl
should be OK.
If the admin IOs in reset_work time out, nvme_error_handler could also help to complete
them.

> There are actually two issues here, both are covered in this patchset:
> 
> 1) when nvme_wait_freeze() is for draining IO, controller error may
> happen again, this patch can shutdown & reset controller again to
> recover it.
>

We could split nvme_reset_work into two contexts instead of introducing another
nvme_eh_reset; otherwise, the code looks really complicated.
In addition, the fix for this issue could be deferred until after your EH kthread
solution is submitted.

Thanks
Jianchao

> 2) there is also another issue: queues may not be put into freeze state
> in nvme_dev_disable(), then nvme_wait_freeze() will wait forever. This
> issue is addressed by 4th patch.
> 
> Both two can be observed in blktests block/011 and verified by this patches.
> 
> Thanks for your great review, please let me know if there are other
> issues.


Re: [PATCH V3 7/8] nvme: pci: recover controller reliably

2018-05-03 Thread jianchao.wang
Hi ming

On 05/03/2018 11:17 AM, Ming Lei wrote:
>  static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
> @@ -1199,7 +1204,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
> request *req, bool reserved)
>   if (nvme_should_reset(dev, csts)) {
>   nvme_warn_reset(dev, csts);
>   nvme_dev_disable(dev, false, true);
> - nvme_reset_ctrl(>ctrl);
> + nvme_eh_reset(dev);
>   return BLK_EH_HANDLED;
>   }
>  
> @@ -1242,7 +1247,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
> request *req, bool reserved)
>"I/O %d QID %d timeout, reset controller\n",
>req->tag, nvmeq->qid);
>   nvme_dev_disable(dev, false, true);
> - nvme_reset_ctrl(>ctrl);
> + nvme_eh_reset(dev);

Without the 8th patch, invoking nvme_eh_reset in nvme_timeout is dangerous.
nvme_pre_reset_dev sends a lot of admin IO when initializing the controller; if those
admin IOs time out, nvme_timeout cannot handle them because the timeout work is
sleeping, waiting for the admin IOs.

In addition, even if we take the nvme_wait_freeze out of nvme_eh_reset and put it into
another context, the ctrl state is still CONNECTING, so nvme_eh_reset cannot move forward.

Actually, I reported this issue to Keith before. I hit an IO hang when the controller
died in nvme_reset_work -> nvme_wait_freeze. As you know, nvme_reset_work cannot be
scheduled again because it is waiting.
Here is Keith's commit for this:
http://lists.infradead.org/pipermail/linux-nvme/2018-February/015603.html

Thanks
Jianchao



 


Re: [PATCH V2 1/5] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()

2018-05-01 Thread jianchao.wang
Hi Ming

On 05/02/2018 11:33 AM, Ming Lei wrote:
> No, there isn't such race, the 'mod_timer' doesn't make a difference
> because 'q->timeout_off' will be visible in new work func after
> cancel_work_sync() returns. So even the timer is expired, work func
> still returns immediately.

Yes, you are right.
Even if the timer is set up, the timeout work will just return.

Thanks
Jianchao


Re: [PATCH V2 5/5] nvme: pci: simplify timeout handling

2018-05-01 Thread jianchao.wang
Hi Ming

On 05/02/2018 12:54 PM, Ming Lei wrote:
>> We need to return BLK_EH_RESET_TIMER in nvme_timeout then:
>> 1. defer the completion. we can't unmap the io request before close the 
>> controller totally, so not BLK_EH_HANDLED.
>> 2. nvme_cancel_request could complete it. blk_mq_complete_request is invoked 
>> by nvme_cancel_request, so not BLK_EH_NOT_HANDLED.
> We don't need to change return value of .timeout() any more after
> calling nvme_quiesce_timeout():
> 
> Thanks the single EH kthread, once nvme_quiesce_timeout() returns, all
> timed-out requests have been handled already. Some of them may be completed,
> and others may be handled as RESET_TIMER, but none of them are handled as
> NOT_HANDLED because nvme_timeout() won't return that value.
> 
> So the following blk_mq_tagset_busy_iter(nvme_cancel_request) can handle
> all in-flight requests because timeout is drained and quiesced.

The key point here is that we cannot unmap the IO requests before the controller has
been shut down completely. The nvme controller may still own the command after we
complete and unmap the IO request in nvme_timeout, and that would cause memory
corruption.

So we cannot just schedule the EH recovery kthread and then return BLK_EH_HANDLED.
We need to defer the completion of timed-out requests until the controller has been
shut down completely in nvme_dev_disable.
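
A minimal sketch of what I mean (nvme_eh_schedule() is from this patchset; the rest is
illustrative, not the actual patch):

static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	struct nvme_queue *nvmeq = iod->nvmeq;
	struct nvme_dev *dev = nvmeq->dev;

	/*
	 * Kick the EH kthread but keep the request alive: returning
	 * BLK_EH_RESET_TIMER leaves it for nvme_cancel_request() to
	 * complete after nvme_dev_disable() has shut the controller
	 * down completely.
	 */
	nvme_eh_schedule(dev);
	return BLK_EH_RESET_TIMER;
}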

Thanks
Jianchao


Re: [PATCH V2 5/5] nvme: pci: simplify timeout handling

2018-05-01 Thread jianchao.wang
Hi Ming

On 04/29/2018 11:41 PM, Ming Lei wrote:
> +
>  static enum blk_eh_timer_return nvme_timeout(struct request *req, bool 
> reserved)
>  {
>   struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> @@ -1197,8 +1297,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
> request *req, bool reserved)
>*/
>   if (nvme_should_reset(dev, csts)) {
>   nvme_warn_reset(dev, csts);
> - nvme_dev_disable(dev, false);
> - nvme_reset_ctrl(>ctrl);
> + nvme_eh_schedule(dev);
>   return BLK_EH_HANDLED;
>   }
>  
> @@ -1224,8 +1323,8 @@ static enum blk_eh_timer_return nvme_timeout(struct 
> request *req, bool reserved)
>   dev_warn(dev->ctrl.device,
>"I/O %d QID %d timeout, disable controller\n",
>req->tag, nvmeq->qid);
> - nvme_dev_disable(dev, false);
>   nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> + nvme_eh_schedule(dev);
>   return BLK_EH_HANDLED;
>   default:
>   break;
> @@ -1240,14 +1339,12 @@ static enum blk_eh_timer_return nvme_timeout(struct 
> request *req, bool reserved)
>   dev_warn(dev->ctrl.device,
>"I/O %d QID %d timeout, reset controller\n",
>req->tag, nvmeq->qid);
> - nvme_dev_disable(dev, false);
> - nvme_reset_ctrl(>ctrl);
> -
>   /*
>* Mark the request as handled, since the inline shutdown
>* forces all outstanding requests to complete.
>*/
>   nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> + nvme_eh_schedule(dev);
>   return BLK_EH_HANDLED;
>   }
>  
> @@ -2246,8 +2343,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
> shutdown)
>   if (pci_is_enabled(pdev)) {
>   u32 csts = readl(dev->bar + NVME_REG_CSTS);
>  
> - if (dev->ctrl.state == NVME_CTRL_LIVE ||
> - dev->ctrl.state == NVME_CTRL_RESETTING) {
> + if (shutdown && (dev->ctrl.state == NVME_CTRL_LIVE ||
> + dev->ctrl.state == NVME_CTRL_RESETTING)) {
>   nvme_start_freeze(>ctrl);
>   frozen = true;
>   }
> @@ -2281,11 +2378,23 @@ static void nvme_dev_disable(struct nvme_dev *dev, 
> bool shutdown)
>   for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
>   nvme_suspend_queue(>queues[i]);
>  
> + /* safe to sync timeout after queues are quiesced */
> + nvme_quiesce_timeout(>ctrl);
> + blk_quiesce_timeout(dev->ctrl.admin_q);
> +
>   nvme_pci_disable(dev);
>  
> + /*
> +  * Both timeout and interrupt handler have been drained, and all
> +  * in-flight requests will be canceled now.
> +  */
>   blk_mq_tagset_busy_iter(>tagset, nvme_cancel_request, >ctrl);
>   blk_mq_tagset_busy_iter(>admin_tagset, nvme_cancel_request, 
> >ctrl);


We need to return BLK_EH_RESET_TIMER in nvme_timeout so that:
1. The completion is deferred. We can't unmap the IO request before the controller is
   shut down completely, so not BLK_EH_HANDLED.
2. nvme_cancel_request can complete it. blk_mq_complete_request is invoked by
   nvme_cancel_request, so not BLK_EH_NOT_HANDLED.

Thanks
Jianchao





Re: [PATCH V2 1/5] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()

2018-05-01 Thread jianchao.wang
Hi ming

On 04/29/2018 11:41 PM, Ming Lei wrote:
>  
> +static void __blk_unquiesce_timeout(struct request_queue *q)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(q->queue_lock, flags);
> + q->timeout_off = false;
> + spin_unlock_irqrestore(q->queue_lock, flags);
> +}
> +
> +void blk_unquiesce_timeout(struct request_queue *q)
> +{
> + __blk_unquiesce_timeout(q);
> + mod_timer(>timeout, jiffies + q->rq_timeout);
> +}
> +EXPORT_SYMBOL(blk_unquiesce_timeout);
> +
> +void blk_quiesce_timeout(struct request_queue *q)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(q->queue_lock, flags);
> + q->timeout_off = true;
> + spin_unlock_irqrestore(q->queue_lock, flags);
> +
> + del_timer_sync(>timeout);
> + cancel_work_sync(>timeout_work);
> +}
> +EXPORT_SYMBOL(blk_quiesce_timeout);
> +
>  /**
>   * blk_sync_queue - cancel any pending callbacks on a queue
>   * @q: the queue
> @@ -408,8 +437,7 @@ EXPORT_SYMBOL(blk_stop_queue);
>   */
>  void blk_sync_queue(struct request_queue *q)
>  {
> - del_timer_sync(>timeout);
> - cancel_work_sync(>timeout_work);
> + blk_quiesce_timeout(q);
>  
>   if (q->mq_ops) {
>   struct blk_mq_hw_ctx *hctx;
> @@ -421,6 +449,8 @@ void blk_sync_queue(struct request_queue *q)
>   } else {
>   cancel_delayed_work_sync(>delay_work);
>   }
> +
> + __blk_unquiesce_timeout(q);
>  }
>  EXPORT_SYMBOL(blk_sync_queue);
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0dc9e341c2a7..890dd3b3e3e1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -901,6 +901,15 @@ static void blk_mq_timeout_work(struct work_struct *work)
>   };
>   struct blk_mq_hw_ctx *hctx;
>   int i;
> + bool timeout_off;
> + unsigned long flags;
> +
> + spin_lock_irqsave(q->queue_lock, flags);
> + timeout_off = q->timeout_off;
> + spin_unlock_irqrestore(q->queue_lock, flags);
> +
> + if (timeout_off)
> + return;

Looks like there is still a race, as follows:

blk_quiesce_timeout                      blk_mq_timeout_work
                                         -> timeout_off = q->timeout_off;
                                            // still false
-> q->timeout_off = true;
-> del_timer_sync(&q->timeout);
                                         -> mod_timer
-> cancel_work_sync(&q->timeout_work);


How about this:

Introduce a new request queue flag QUEUE_FLAG_TIMEOUT_QUIESCED

void blk_quiesce_queue_timeout(struct request_queue *q)
{
	unsigned long flags;

	spin_lock_irqsave(q->queue_lock, flags);
	queue_flag_set(QUEUE_FLAG_TIMEOUT_QUIESCED, q);
	spin_unlock_irqrestore(q->queue_lock, flags);

	synchronize_rcu_bh();
	cancel_work_sync(&q->timeout_work);
	/*
	 * The timeout work will not be invoked anymore.
	 */
	del_timer_sync(&q->timeout);
}

 void blk_set_queue_dying(struct request_queue *q)
 {
	spin_lock_irq(q->queue_lock);
@@ -867,6 +905,9 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
 {
	struct request_queue *q = from_timer(q, t, timeout);
 
+	if (blk_queue_timeout_quiesced(q))
+		return;

	kblockd_schedule_work(&q->timeout_work);
 }
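
For completeness, the flag and helper used above would be something like the following
(the bit number is just a placeholder):

#define QUEUE_FLAG_TIMEOUT_QUIESCED	29	/* timeout work is quiesced */

#define blk_queue_timeout_quiesced(q)	\
	test_bit(QUEUE_FLAG_TIMEOUT_QUIESCED, &(q)->queue_flags)

The unquiesce side would clear the flag under q->queue_lock and then mod_timer() the
timeout timer again.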

Thanks
Jianchao


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-04-28 Thread jianchao.wang
Hi ming

On 04/29/2018 09:36 AM, Ming Lei wrote:
> On Sun, Apr 29, 2018 at 6:27 AM, Ming Lei <tom.leim...@gmail.com> wrote:
>> On Sun, Apr 29, 2018 at 5:57 AM, Ming Lei <tom.leim...@gmail.com> wrote:
>>> On Sat, Apr 28, 2018 at 10:00 PM, jianchao.wang
>>> <jianchao.w.w...@oracle.com> wrote:
>>>> Hi ming
>>>>
>>>> On 04/27/2018 10:57 PM, Ming Lei wrote:
>>>>> I may not understand your point, once blk_sync_queue() returns, the
>>>>> timer itself is deactivated, meantime the synced .nvme_timeout() only
>>>>> returns EH_NOT_HANDLED before the deactivation.
>>>>>
>>>>> That means this timer won't be expired any more, so could you explain
>>>>> a bit why timeout can come again after blk_sync_queue() returns
>>>>
>>>> Please consider the following case:
>>>>
>>>> blk_sync_queue
>>>>   -> del_timer_sync
>>>>   blk_mq_timeout_work
>>>> -> blk_mq_check_expired // return the timeout 
>>>> value
>>>> -> blk_mq_terninate_expired
>>>>   -> .timeout //return EH_NOT_HANDLED
>>>> -> mod_timer // setup the timer again based on 
>>>> the result of blk_mq_check_expired
>>>>   -> cancel_work_sync
>>>> So after the blk_sync_queue, the timer may come back again, then the 
>>>> timeout work.
>>>
>>> OK, I was trying to avoid to use blk_abort_request(), but looks we
>>> may have to depend on it or similar way.
>>>
>>> BTW, that means blk_sync_queue() has been broken, even though the uses
>>> in blk_cleanup_queue().
>>>
>>> Another approach is to introduce one perpcu_ref of
>>> 'q->timeout_usage_counter' for
>>> syncing timeout, seems a bit over-kill too, but simpler in both theory
>>> and implement.
>>
>> Or one timout_mutex is enough.
> 
> Turns out it is SRCU.
> 
After splitting the timeout path into the timer and workqueue parts, if we don't drain
the in-flight requests, the request_queue->timeout timer and the timeout work look like
a 'chicken and egg' problem.
How about introducing a flag to disable triggering of the timeout work?


Thanks
Jianchao


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-04-28 Thread jianchao.wang
Hi Ming and Keith

Let me expand on the details a bit more here. :)

On 04/28/2018 09:35 PM, Keith Busch wrote:
>> Actually there isn't the case before, even for legacy path, one .timeout()
>> handles one request only.

Yes, .timeout should be invoked for every timed-out request, and in principle .timeout
should also handle only that one request.
However, nvme_timeout invokes nvme_dev_disable.

> That's not quite what I was talking about.
> 
> Before, only the command that was about to be sent to the driver's
> .timeout() was marked completed. The driver could (and did) compete
> other timed out commands in a single .timeout(), and the tag would
> clear, so we could hanlde all timeouts in a single .timeout().

I think Keith is saying that before the new blk-mq timeout implementation, the logic of
blk_mq_timeout_work was:

- get _only_ _one_ timed-out request
- mark it completed
- invoke .timeout (in nvme, that is nvme_timeout)
- nvme_dev_disable is then invoked, and the other requests can still be completed by
  blk_mq_complete_request because they have not been marked completed

> 
> Now, blk-mq marks all timed out commands as aborted prior to calling
> the driver's .timeout(). If the driver completes any of those commands,
> the tag does not clear, so the driver's .timeout() just gets to be called
> again for commands it already reaped.
> 

After the new blk-mq timeout implementation:

- the aborted_gstate of _all_ the timed-out requests is set
- .timeout is invoked one by one
- in the first timed-out request's .timeout (in nvme, nvme_timeout), nvme_dev_disable
  is invoked and tries to complete all the in-flight requests through
  blk_mq_complete_request
- but timed-out requests whose aborted_gstate has been set cannot be handled by
  blk_mq_complete_request, so some requests are leaked by nvme_dev_disable
- these residual timed-out requests will still be handled by blk_mq_timeout_work, which
  invokes .timeout for them one by one


Thanks
Jianchao


> 


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-04-28 Thread jianchao.wang
Hi ming

On 04/27/2018 10:57 PM, Ming Lei wrote:
> I may not understand your point, once blk_sync_queue() returns, the
> timer itself is deactivated, meantime the synced .nvme_timeout() only
> returns EH_NOT_HANDLED before the deactivation.
> 
> That means this timer won't be expired any more, so could you explain
> a bit why timeout can come again after blk_sync_queue() returns

Please consider the following case:

blk_sync_queue
  -> del_timer_sync
                        blk_mq_timeout_work
                          -> blk_mq_check_expired    // returns the timeout value
                          -> blk_mq_terminate_expired
                            -> .timeout              // returns EH_NOT_HANDLED
                          -> mod_timer               // sets up the timer again based on
                                                     // the result of blk_mq_check_expired
  -> cancel_work_sync

So after blk_sync_queue, the timer may come back again, and then the timeout work.

Thanks
Jianchao


Re: testing io.low limit for blk-throttle

2018-04-26 Thread jianchao.wang
Hi Tejun and Joseph

On 04/27/2018 02:32 AM, Tejun Heo wrote:
> Hello,
> 
> On Tue, Apr 24, 2018 at 02:12:51PM +0200, Paolo Valente wrote:
>> +Tejun (I guess he might be interested in the results below)
> 
> Our experiments didn't work out too well either.  At this point, it
> isn't clear whether io.low will ever leave experimental state.  We're
> trying to find a working solution.

Would you please take a look at the following two patches.

https://marc.info/?l=linux-block=152325456307423=2
https://marc.info/?l=linux-block=152325457607425=2

In addition, when I tested blk-throttle io.low on an NVMe card, I found that even if
the iops had been lower than the io.low limit for a while, the downgrade always failed
because the group was not considered idle, due to this check in throtl_tg_is_idle:

   (tg->latency_target && tg->bio_cnt &&
    tg->bad_bio_cnt * 5 < tg->bio_cnt)

The latency always looks fine even when the sum of the two groups' iops has reached the
top, so I disabled this check in my test; with that plus the 2 patches above, io.low
basically works.

My NVMe card's max bps is ~600M, and max iops is ~160k.
Here is my config:
io.low: riops=50000 wiops=50000 rbps=209715200 wbps=209715200 idle=200 latency=10
io.max: riops=150000
There are two cgroups in my test, and both of them have the same config.

In addition, I say "basically works" because the iops of the two cgroups jump up and
down. For example, with one fio test launched per cgroup, the iops wave as follows:

group0   30k  50k   70k   60k  40k
group1   120k 100k  80k   90k  110k

However, if I launch two fio tests in only one cgroup, the iops of the two tests stay
at about 70k~80k.

Could you help explain this scenario?

Thanks in advance
Jianchao


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-04-26 Thread jianchao.wang


On 04/26/2018 11:57 PM, Ming Lei wrote:
> Hi Jianchao,
> 
> On Thu, Apr 26, 2018 at 11:07:56PM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> Thanks for your wonderful solution. :)
>>
>> On 04/26/2018 08:39 PM, Ming Lei wrote:
>>> +/*
>>> + * This one is called after queues are quiesced, and no in-fligh timeout
>>> + * and nvme interrupt handling.
>>> + */
>>> +static void nvme_pci_cancel_request(struct request *req, void *data,
>>> +   bool reserved)
>>> +{
>>> +   /* make sure timed-out requests are covered too */
>>> +   if (req->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) {
>>> +   req->aborted_gstate = 0;
>>> +   req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
>>> +   }
>>> +
>>> +   nvme_cancel_request(req, data, reserved);
>>> +}
>>> +
>>>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>>  {
>>> int i;
>>> @@ -2223,10 +2316,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, 
>>> bool shutdown)
>>> for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
>>> nvme_suspend_queue(>queues[i]);
>>>  
>>> +   /*
>>> +* safe to sync timeout after queues are quiesced, then all
>>> +* requests(include the time-out ones) will be canceled.
>>> +*/
>>> +   nvme_sync_queues(>ctrl);
>>> +   blk_sync_queue(dev->ctrl.admin_q);
>>> +
>> Looks like blk_sync_queue cannot drain all the timeout work.
>>
>> blk_sync_queue
>>   -> del_timer_sync  
>>   blk_mq_timeout_work
>> -> mod_timer
>>   -> cancel_work_sync
>> the timeout work may come back again.
>> we may need to force all the in-flight requests to be timed out with 
>> blk_abort_request
>>
> 
> blk_abort_request() seems over-kill, we could avoid this race simply by
> returning EH_NOT_HANDLED if the controller is in-recovery.
Returning EH_NOT_HANDLED may not be enough.
Please consider the following scenario.

nvme_error_handler                        blk_mq_timeout_work
  -> nvme_dev_disable
    -> blk_sync_queue
                                          // timeout comes again due to the
                                          // scenario above
                                          -> blk_mq_check_expired
                                            -> set aborted_gstate
    -> nvme_pci_cancel_request
      // RQF_MQ_TIMEOUT_EXPIRED has not
      // been set
      -> nvme_cancel_request
        -> blk_mq_complete_request
          -> does nothing
                                          -> blk_mq_terminate_expired
                                            -> blk_mq_rq_timed_out
                                              -> set RQF_MQ_TIMEOUT_EXPIRED
                                              -> .timeout returns BLK_EH_NOT_HANDLED

Then the timed-out request is leaked.

> 
>>> nvme_pci_disable(dev);
>>
>> the interrupt will not come, but there maybe running one.
>> a synchronize_sched() here ?
> 
> We may cover this case by moving nvme_suspend_queue() before
> nvme_stop_queues().
> 
> Both two are very good catch, thanks!
> 


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-04-26 Thread jianchao.wang
Hi Ming

Thanks for your wonderful solution. :)

On 04/26/2018 08:39 PM, Ming Lei wrote:
> +/*
> + * This one is called after queues are quiesced, and no in-fligh timeout
> + * and nvme interrupt handling.
> + */
> +static void nvme_pci_cancel_request(struct request *req, void *data,
> + bool reserved)
> +{
> + /* make sure timed-out requests are covered too */
> + if (req->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) {
> + req->aborted_gstate = 0;
> + req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
> + }
> +
> + nvme_cancel_request(req, data, reserved);
> +}
> +
>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  {
>   int i;
> @@ -2223,10 +2316,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, 
> bool shutdown)
>   for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
>   nvme_suspend_queue(>queues[i]);
>  
> + /*
> +  * safe to sync timeout after queues are quiesced, then all
> +  * requests(include the time-out ones) will be canceled.
> +  */
> + nvme_sync_queues(>ctrl);
> + blk_sync_queue(dev->ctrl.admin_q);
> +
Looks like blk_sync_queue cannot drain all the timeout work.

blk_sync_queue
  -> del_timer_sync
                        blk_mq_timeout_work
                          -> mod_timer
  -> cancel_work_sync

The timeout work may come back again.
We may need to force all the in-flight requests to be timed out with
blk_abort_request, as sketched below.
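
Something along these lines is what I mean by forcing them to time out (illustrative
only; whether it is actually safe to do this here is exactly the question):

/* push every in-flight request into the timeout path */
static void nvme_force_timeout(struct request *req, void *data, bool reserved)
{
	blk_abort_request(req);
}

/*
 * called before tearing the queues down, e.g.:
 *   blk_mq_tagset_busy_iter(&dev->tagset, nvme_force_timeout, NULL);
 *   blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_force_timeout, NULL);
 */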

>   nvme_pci_disable(dev);

The interrupt will not come again, but there may still be one running.
A synchronize_sched() here?

Thanks
Jianchao
>  
> - blk_mq_tagset_busy_iter(>tagset, nvme_cancel_request, >ctrl);
> - blk_mq_tagset_busy_iter(>admin_tagset, nvme_cancel_request, 
> >ctrl);
> + blk_mq_tagset_busy_iter(>tagset, nvme_pci_cancel_request, 
> >ctrl);
> + blk_mq_tagset_busy_iter(>admin_tagset, nvme_pci_cancel_request, 
> >ctrl);


Re: testing io.low limit for blk-throttle

2018-04-23 Thread jianchao.wang
Hi Paolo

When I executed the script, I got this:
8:0 rbps=1000 wbps=0 riops=0 wiops=0 idle=0 latency=max

The idle is 0, so I'm afraid io.low will not work.
Please refer to the following code in tg_set_limit:

/* force user to configure all settings for low limit  */
if (!(tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW] ||
  tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW]) ||
tg->idletime_threshold_conf == DFL_IDLE_THRESHOLD ||//-> 
HERE
tg->latency_target_conf == DFL_LATENCY_TARGET) {
tg->bps[READ][LIMIT_LOW] = 0;
tg->bps[WRITE][LIMIT_LOW] = 0;
tg->iops[READ][LIMIT_LOW] = 0;
tg->iops[WRITE][LIMIT_LOW] = 0;
tg->idletime_threshold = DFL_IDLE_THRESHOLD;
tg->latency_target = DFL_LATENCY_TARGET;
} else if (index == LIMIT_LOW) {
tg->idletime_threshold = tg->idletime_threshold_conf;
tg->latency_target = tg->latency_target_conf;
}

blk_throtl_update_limit_valid(tg->td);
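
For comparison, a low-limit line that also configures idle and latency (the values
below are only examples), so that the branch above does not zero the low limits and
blk_throtl_update_limit_valid() keeps LIMIT_LOW enabled:

echo "8:0 rbps=10485760 wbps=10485760 riops=0 wiops=0 idle=200 latency=10" > io.low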


Thanks
Jianchao

On 04/23/2018 03:37 PM, Paolo Valente wrote:
> cd thr-lat-with-interference
> sudo ./thr-lat-with-interference.sh -b t -w 1 -W "1000 1000 
> 1000 1000 1000 1000" -n 6 -T "read read read read read read" 
> -R "0 0 0 0 0 0"


Re: testing io.low limit for blk-throttle

2018-04-23 Thread jianchao.wang
Hi Paolo

On 04/23/2018 01:32 PM, Paolo Valente wrote:
> Thanks for sharing this fix.  I tried it too, but nothing changes in
> my test :(> 

That's really sad.

> At this point, my doubt is still: am I getting io.low limit right?  I
> understand that an I/O-bound group should be guaranteed a rbps at
> least equal to the rbps set with io.low for that group (of course,
> provided that the sum of io.low limits is lower than the rate at which
> the device serves all the I/O generated by the groups).  Is this
> really what io.low shall guarantee?

I agree with your point about this even if I'm not qualified to judge it.

On the other hand, could you share your test case and blk-throttle config here?

Thanks
Jianchao


Re: testing io.low limit for blk-throttle

2018-04-22 Thread jianchao.wang
Hi Paolo

As I said, I have run into a similar scenario before.
After some debugging, I found 3 issues.

Here is my setup command:
mkdir test0 test1
echo "259:0 riops=150000" > test0/io.max 
echo "259:0 riops=150000" > test1/io.max 
echo "259:0 riops=150000" > test2/io.max 
echo "259:0 riops=50000 wiops=50000 rbps=209715200 wbps=209715200 idle=200 latency=10" > test0/io.low 
echo "259:0 riops=50000 wiops=50000 rbps=209715200 wbps=209715200 idle=200 latency=10" > test1/io.low 

My NVMe card's max bps is ~600M, and max iops is ~160k.
Two cgroups' io.low is bps 200M and 50k. io.max is iops 150k

1. I set up 2 cgroups, test0 and test1, with one process per cgroup.
Even if only the process in test0 does IO, its iops is just 50k.
This is fixed by the following patch:
https://marc.info/?l=linux-block&m=152325457607425&w=2

2. Let the processes in test0 and test1 both do IO.
Sometimes the iops of both cgroups is 50k; looking at the log, blk-throtl's upgrade
always fails.
This is fixed by the following patch:
https://marc.info/?l=linux-block&m=152325456307423&w=2

3. After applying patches 1 and 2, I still see that one cgroup's iops falls down to
30k~40k but blk-throtl doesn't downgrade. Even though the iops has been lower than the
io.low limit for some time, the cgroup is considered idle, so the downgrade fails.
More precisely, it is due to this code segment in throtl_tg_is_idle:

(tg->latency_target && tg->bio_cnt &&
tg->bad_bio_cnt * 5 < tg->bio_cnt)

I fixed it with the following patch, but I'm not sure about it, so I didn't submit it.
Please also try it. :)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b5ba845..c9a43a4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1819,7 +1819,7 @@ static unsigned long tg_last_low_overflow_time(struct 
throtl_grp *tg)
return ret;
 }
 
-static bool throtl_tg_is_idle(struct throtl_grp *tg)
+static bool throtl_tg_is_idle(struct throtl_grp *tg, bool latency)
 {
/*
 * cgroup is idle if:
@@ -1836,7 +1836,7 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
  tg->idletime_threshold == DFL_IDLE_THRESHOLD ||
  (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
  tg->avg_idletime > tg->idletime_threshold ||
- (tg->latency_target && tg->bio_cnt &&
+ (tg->latency_target && tg->bio_cnt && latency &&
tg->bad_bio_cnt * 5 < tg->bio_cnt);
throtl_log(>service_queue,
"avg_idle=%ld, idle_threshold=%ld, bad_bio=%d, total_bio=%d, 
is_idle=%d, scale=%d",
@@ -1867,7 +1867,7 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
 
if (time_after_eq(jiffies,
tg_last_low_overflow_time(tg) + tg->td->throtl_slice) &&
-   throtl_tg_is_idle(tg))
+   throtl_tg_is_idle(tg, true))
return true;
return false;
 }
@@ -1983,7 +1983,7 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
time_after_eq(now, tg_last_low_overflow_time(tg) +
td->throtl_slice) &&
-   (!throtl_tg_is_idle(tg) ||
+   (!throtl_tg_is_idle(tg, false) ||
     !list_empty(&tg_to_blkg(tg)->blkcg->css.children)))
return true;
return false;



On 04/22/2018 11:53 PM, Paolo Valente wrote:
> 
> 
>> Il giorno 22 apr 2018, alle ore 15:29, jianchao.wang 
>> <jianchao.w.w...@oracle.com> ha scritto:
>>
>> Hi Paolo
>>
>> I used to meet similar issue on io.low.
>> Can you try the following patch to see whether the issue could be fixed.
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dblock-26m-3D152325456307423-26w-3D2=DwIFAg=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ=asJMDy9zIe2AqRVpoLbe9RMjsdZOJZ0HrRWTM3CPZeA=AZ4kllxCfaXspjeSylBpK8K7ai6IPjSiffrGmzt4VEM=
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dblock-26m-3D152325457607425-26w-3D2=DwIFAg=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ=asJMDy9zIe2AqRVpoLbe9RMjsdZOJZ0HrRWTM3CPZeA=1EhsoSMte3kIxuVSYBFSE9W2jRKrIWI5z7-stlZ80H4=
>>
> 
> Just tried. Unfortunately, nothing seems to change :(
> 
> Thanks,
> Paolo
> 
>> Thanks
>> Jianchao
>>
>> On 04/22/2018 05:23 PM, Paolo Valente wrote:
>>> Hi Shaohua, all,
>>> at last, I started testing your io.low limit for blk-throttle.  One of
>>> the things I'm interested in is how good throttling is in achieving a
>>> high thr

Re: testing io.low limit for blk-throttle

2018-04-22 Thread jianchao.wang
Hi Paolo

I have run into a similar issue with io.low before.
Can you try the following patches to see whether the issue is fixed?
https://marc.info/?l=linux-block&m=152325456307423&w=2
https://marc.info/?l=linux-block&m=152325457607425&w=2

Thanks
Jianchao

On 04/22/2018 05:23 PM, Paolo Valente wrote:
> Hi Shaohua, all,
> at last, I started testing your io.low limit for blk-throttle.  One of
> the things I'm interested in is how good throttling is in achieving a
> high throughput in the presence of realistic, variable workloads.
> 
> However, I seem to have bumped into a totally different problem.  The
> io.low parameter doesn't seem to guarantee what I understand it is meant
> to guarantee: minimum per-group bandwidths.  For example, with
> - one group, the interfered, containing one process that does sequential
>   reads with fio
> - io.low set to 100MB/s for the interfered
> - six other groups, the interferers, with each interferer containing one
>   process doing sequential read with fio
> - io.low set to 10MB/s for each interferer
> - the workload executed on an SSD, with a 500MB/s of overall throughput
> the interfered gets only 75MB/s.
> 
> In particular, the throughput of the interfered becomes lower and
> lower as the number of interferers is increased.  So you can make it
> become even much lower than the 75MB/s in the example above.  There
> seems to be no control on bandwidth.
> 
> Am I doing something wrong?  Or did I simply misunderstand the goal of
> io.low, and the only parameter for guaranteeing the desired bandwidth to
> a group is io.max (to be used indirectly, by limiting the bandwidth of
> the interferers)?
> 
> If useful for you, you can reproduce the above test very quickly, by
> using the S suite [1] and typing:
> 
> cd thr-lat-with-interference
> sudo ./thr-lat-with-interference.sh -b t -w 1 -W "1000 1000 
> 1000 1000 1000 1000" -n 6 -T "read read read read read read" 
> -R "0 0 0 0 0 0"
> 
> Looking forward to your feedback,
> Paolo
> 
> [1] 
> 


Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-21 Thread jianchao.wang


On 04/21/2018 10:10 PM, Jens Axboe wrote:
> On 4/21/18 7:34 AM, jianchao.wang wrote:
>> Hi Bart
>>
>> Thanks for your kindly response.
>>
>> On 04/20/2018 10:11 PM, Bart Van Assche wrote:
>>> On Fri, 2018-04-20 at 14:55 +0800, jianchao.wang wrote:
>>>> Hi Bart
>>>>
>>>> On 04/20/2018 12:43 AM, Bart Van Assche wrote:
>>>>> Use the deadline instead of the request generation to detect whether
>>>>>   or not a request timer fired after reinitialization of a request
>>>>
>>>> Maybe use deadline to do this is not suitable.
>>>>
>>>> Let's think of the following scenario.
>>>>
>>>> T1/T2 times in nano seconds
>>>> J1/J2 times in jiffies
>>>>
>>>> rq is started at T1,J1
>>>> blk_mq_timeout_work
>>>>   -> blk_mq_check_expired
>>>> -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)
>>>>
>>>>  rq is completed and freed 
>>>>  rq is allocated and started 
>>>> again on T2
>>>>  rq->__deadline = J2 + 
>>>> MQ_RQ_IN_FLIGHT
>>>>   -> synchronize srcu/rcu
>>>>   -> blk_mq_terminate_expired
>>>> -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)
>>>>
>>>> 1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
>>>>if J2-J1 < 4 jiffies, we will get the same deadline value.
>>>>
>>>> 2. even if we do some bit shift when save and get deadline
>>>>if T2 - T1 < time of 1 jiffies, we sill get the same deadline.
>>>
>>> Hello Jianchao,
>>>
>>> How about using the upper 16 or 32 bits of rq->__deadline to store a 
>>> generation
>>> number? I don't know any block driver for which the product of (deadline in
>>> seconds) and HZ exceeds (1 << 32).
>>>
>> yes, we don't need so long timeout value.
>> However, req->__deadline is an absolute time, not a relative one, 
>> its type is unsigned long variable which is same with jiffies.
>> If we reserve some bits (not just 1 or 2 bits) of req->__deadline for 
>> generation number,
>> how to handle it when mod_timer ?
> 
> We don't need to support timeouts to the full granularity of jiffies.
> Most normal commands are in the 30-60s range, and some lower level
> commands may take minutes. We're well within the range of chopping
> off some bits and not having to worry about that at all.
> 
Yes.
So we would have a __deadline like below:

  ____________________ _____ _______
 |      deadline      | gen | state |
 |____________________|_____|_______|
                       \_____ ______/
                             V
               bits taken from the jiffies deadline,
               i.e. its granularity

Even though we usually have a 30~60s timeout, we should still support at least a 1s
granularity. How do we decide the granularity?
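
A minimal sketch of the kind of packing I mean (the field widths below are assumptions
for illustration, not something agreed in this thread):

#define RQ_STATE_BITS	2
#define RQ_GEN_BITS	6
#define RQ_DL_SHIFT	(RQ_STATE_BITS + RQ_GEN_BITS)

/* deadline granularity becomes (1 << RQ_DL_SHIFT) jiffies, roughly 1s at HZ=250 */
static inline unsigned long pack_deadline_and_state(unsigned long deadline,
						    unsigned int gen,
						    unsigned int state)
{
	return (deadline & ~((1UL << RQ_DL_SHIFT) - 1)) |
	       ((gen & ((1U << RQ_GEN_BITS) - 1)) << RQ_STATE_BITS) |
	       (state & ((1U << RQ_STATE_BITS) - 1));
}

static inline unsigned long unpack_deadline(unsigned long packed)
{
	/* round up so a packed deadline never fires early */
	return (packed & ~((1UL << RQ_DL_SHIFT) - 1)) + (1UL << RQ_DL_SHIFT);
}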

Thanks
Jianchao


Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-21 Thread jianchao.wang
Hi Bart

Thanks for your kind response.

On 04/20/2018 10:11 PM, Bart Van Assche wrote:
> On Fri, 2018-04-20 at 14:55 +0800, jianchao.wang wrote:
>> Hi Bart
>>
>> On 04/20/2018 12:43 AM, Bart Van Assche wrote:
>>> Use the deadline instead of the request generation to detect whether
>>>   or not a request timer fired after reinitialization of a request
>>
>> Maybe use deadline to do this is not suitable.
>>
>> Let's think of the following scenario.
>>
>> T1/T2 times in nano seconds
>> J1/J2 times in jiffies
>>
>> rq is started at T1,J1
>> blk_mq_timeout_work
>>   -> blk_mq_check_expired
>> -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)
>>
>>  rq is completed and freed 
>>  rq is allocated and started 
>> again on T2
>>  rq->__deadline = J2 + 
>> MQ_RQ_IN_FLIGHT
>>   -> synchronize srcu/rcu
>>   -> blk_mq_terminate_expired
>> -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)
>>
>> 1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
>>if J2-J1 < 4 jiffies, we will get the same deadline value.
>>
>> 2. even if we do some bit shift when save and get deadline
>>if T2 - T1 < time of 1 jiffies, we sill get the same deadline.
> 
> Hello Jianchao,
> 
> How about using the upper 16 or 32 bits of rq->__deadline to store a 
> generation
> number? I don't know any block driver for which the product of (deadline in
> seconds) and HZ exceeds (1 << 32).
> 
Yes, we don't need such a long timeout value.
However, req->__deadline is an absolute time, not a relative one, and its type is
unsigned long, the same as jiffies.
If we reserve some bits (not just 1 or 2) of req->__deadline for a generation number,
how do we handle it when calling mod_timer?

Thanks
Jianchao
 


Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-20 Thread jianchao.wang
Hi Bart

On 04/20/2018 12:43 AM, Bart Van Assche wrote:
> Use the deadline instead of the request generation to detect whether
>   or not a request timer fired after reinitialization of a request

Maybe using the deadline for this is not suitable.

Let's think of the following scenario.

T1/T2: times in nanoseconds
J1/J2: times in jiffies

rq is started at T1, J1
blk_mq_timeout_work
  -> blk_mq_check_expired
    -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)
                                        rq is completed and freed
                                        rq is allocated and started again at T2
                                        rq->__deadline = J2 + MQ_RQ_IN_FLIGHT
  -> synchronize srcu/rcu
  -> blk_mq_terminate_expired
    -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)

1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
   If J2 - J1 < 4 jiffies, we will get the same deadline value.

2. Even if we do some bit shifting when saving and reading the deadline,
   if T2 - T1 is less than 1 jiffy, we still get the same deadline.

Thanks
Jianchao


Re: [PATCH] blk-mq: start request gstate with gen 1

2018-04-17 Thread jianchao.wang
Hi Martin

On 04/17/2018 08:10 PM, Martin Steigerwald wrote:
> For testing it I add it to 4.16.2 with the patches I have already?

You could try applying only this patch as a test. :)

> 
> - '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a 
> request.mbox'
> 
> - '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into 
> rcu_read_{lock,unlock}().mbox'
> 
> - '[PATCH V4 1_2] blk-mq_set RQF_MQ_TIMEOUT_EXPIRED when the rq'\''s 
> timeout isn'\''t handled.mbox'
> 
> - '[PATCH V4 2_2] blk-mq_fix race between complete and 
> BLK_EH_RESET_TIMER.mbox


Thanks
Jianchao


Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER

2018-04-16 Thread jianchao.wang
Hi bart

Thanks for your kind response.
I have sent out the patch. Please refer to
https://marc.info/?l=linux-block&m=152393666517449&w=2

Thanks
Jianchao

On 04/17/2018 08:15 AM, Bart Van Assche wrote:
> On Tue, 2018-04-17 at 00:04 +0800, jianchao.wang wrote:
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 16e83e6..be9b435 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2077,6 +2077,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set 
>> *set, struct request *rq,
>>  
>> seqcount_init(>gstate_seq);
>> u64_stats_init(>aborted_gstate_sync);
>> +   WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
>> return 0;
>>  }
> 
> Hello Jianchao,
> 
> Your approach looks interesting to me. Can you send an official patch to Jens?
> 
> Thanks,
> 
> Bart.
> 
> 
> 
> 


Re: [PATCH] blk-mq: mark hctx RESTART when get budget fails

2018-04-16 Thread jianchao.wang
Hi Ming

Thanks for your kind response.

On 04/16/2018 04:15 PM, Ming Lei wrote:
>> -if (!blk_mq_get_dispatch_budget(hctx))
>> +if (!blk_mq_get_dispatch_budget(hctx)) {
>> +blk_mq_sched_mark_restart_hctx(hctx);
> The RESTART flag still may not take into effect if all requests are
> completed before calling blk_mq_sched_mark_restart_hctx().
> 

Yes, this is indeed a very tricky scenario.

Sincerely
Jianchao


Re: [PATCH V2] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread jianchao.wang
Hi Ming

On 04/12/2018 07:38 AM, Ming Lei wrote:
> +  *
> +  * Cover complete vs BLK_EH_RESET_TIMER race in slow path with
> +  * helding queue lock.
>*/
>   hctx_lock(hctx, _idx);
>   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>   __blk_mq_complete_request(rq);
> + else {
> + unsigned long flags;
> + bool need_complete = false;
> +
> + spin_lock_irqsave(q->queue_lock, flags);
> + if (!blk_mq_rq_aborted_gstate(rq))
> + need_complete = true;
> + else
> + blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_TIMEOUT);
> + spin_unlock_irqrestore(q->queue_lock, flags);

What if .timeout returns BLK_EH_HANDLED while this is going on?

timeout context                           irq context

.timeout()
                                          blk_mq_complete_request
                                            -> set state to MQ_RQ_COMPLETE_IN_TIMEOUT
__blk_mq_complete_request
  -> WARN_ON_ONCE(blk_mq_rq_state(rq)
                  != MQ_RQ_IN_FLIGHT);

Furthermore, upon blk_mq_free_request, the finally freed request may end up in
MQ_RQ_COMPLETE_IN_TIMEOUT instead of MQ_RQ_IDLE.

Thanks
Jianchao


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread jianchao.wang
Hi Bart

Thanks for your kind response.

On 04/10/2018 09:01 PM, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 15:59 +0800, jianchao.wang wrote:
>> If yes, how does the timeout handler get the freed request when the tag has 
>> been freed ?
> 
> Hello Jianchao,
> 
> Have you noticed that the timeout handler does not check whether or not the 
> request
> tag is freed? Additionally, I don't think it would be possible to add such a 
> check
> to the timeout code without introducing a new race condition.

Doesn't blk_mq_queue_tag_busy_iter only iterate over the tags that have been
allocated/set?
When a request is freed, its tag is cleared through
blk_mq_put_tag -> sbitmap_queue_clear.
Am I missing something else?

Thanks
Jianchao

> 
> 


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread jianchao.wang
Hi Bart

On 04/10/2018 09:34 AM, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout
> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs

Would you please give more detail here on why the request could be completed twice?

Is it the scenario you described below in
https://marc.info/?l=linux-block&m=151796816127318

The following race can occur between the code that resets the timer
and completion handling:
- The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
- A completion occurs and blk_mq_complete_request() calls
  __blk_mq_complete_request().
- The timeout code calls blk_add_timer() and that function sets the
  request deadline and adjusts the timer.
- __blk_mq_complete_request() frees the request tag.
- The timer fires and the timeout handler gets called for a freed
  request.
If yes, how does the timeout handler get the freed request when the tag has 
been freed ?

Thanks
Jianchao


Re: blk-mq and CPU hotplug error

2018-03-26 Thread jianchao.wang
Hi Jose

On 03/27/2018 05:25 AM, jos...@linux.vnet.ibm.com wrote:
> Hello everyone!
> 
> I'm running Ubuntu 18.04 (4.15.0-12-generic) in KVM/QEMU (powerpc64le).
> Everything looks good until I try to hotplug CPUs in my VM. As soon as I
> do that I get the following error in my VM dmesg:

Please refer to the following commits:
20e4d813 blk-mq: simplify queue mapping & schedule with each possisble CPU
84676c1 genirq/affinity: assign vectors to all possible CPUs

Thanks
Jianchao


Re: [PATCH 1/3] blk-mq: Allow PCI vector offset for mapping queues

2018-03-24 Thread jianchao.wang
Hi Keith

Thanks for your time and patch for this.

On 03/24/2018 06:19 AM, Keith Busch wrote:
> The PCI interrupt vectors intended to be associated with a queue may
> not start at 0. This patch adds an offset parameter so blk-mq may find
> the intended affinity mask. The default value is 0 so existing drivers
> that don't care about this parameter don't need to change.
> 
> Signed-off-by: Keith Busch 
> ---
>  block/blk-mq-pci.c | 12 ++--
>  include/linux/blk-mq-pci.h |  2 ++
>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
> index 76944e3271bf..1040a7705c13 100644
> --- a/block/blk-mq-pci.c
> +++ b/block/blk-mq-pci.c
> @@ -21,6 +21,7 @@
>   * blk_mq_pci_map_queues - provide a default queue mapping for PCI device
>   * @set: tagset to provide the mapping for
>   * @pdev:PCI device associated with @set.
> + * @offset:  PCI irq starting vector offset
>   *
>   * This function assumes the PCI device @pdev has at least as many available
>   * interrupt vectors as @set has queues.  It will then query the vector
> @@ -28,13 +29,14 @@
>   * that maps a queue to the CPUs that have irq affinity for the corresponding
>   * vector.
>   */
> -int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev)
> +int __blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev,
> + int offset)
>  {
>   const struct cpumask *mask;
>   unsigned int queue, cpu;
>  
>   for (queue = 0; queue < set->nr_hw_queues; queue++) {
> - mask = pci_irq_get_affinity(pdev, queue);
> + mask = pci_irq_get_affinity(pdev, queue + offset);
>   if (!mask)
>   goto fallback;
>  

Maybe we could provide a callback parameter for __blk_mq_pci_map_queues that gives the
mapping from hctx queue number to device-relative interrupt vector index, as sketched
below.
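
A rough sketch of what I mean (the callback type and the fallback below are just
illustrative assumptions, not a concrete proposal):

typedef unsigned int (blk_mq_pci_queue_to_vec_fn)(struct blk_mq_tag_set *set,
						  unsigned int queue);

int __blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev,
			    blk_mq_pci_queue_to_vec_fn *queue_to_vec)
{
	const struct cpumask *mask;
	unsigned int queue, cpu;

	for (queue = 0; queue < set->nr_hw_queues; queue++) {
		/* let the driver translate hctx number to vector index */
		mask = pci_irq_get_affinity(pdev, queue_to_vec(set, queue));
		if (!mask)
			goto fallback;

		for_each_cpu(cpu, mask)
			set->mq_map[cpu] = queue;
	}

	return 0;

fallback:
	WARN_ON_ONCE(set->nr_hw_queues > 1);
	for_each_possible_cpu(cpu)
		set->mq_map[cpu] = 0;
	return 0;
}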

Thanks
Jianchao






Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-31 Thread jianchao.wang
Sorry, Jens, I don't think I got the point.
Am I missing something?

On 02/01/2018 11:07 AM, Jens Axboe wrote:
> Yeah I agree, and my last patch missed that we do care about segments for
> discards. Below should be better...
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8452fc7164cc..055057bd727f 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -553,9 +553,8 @@ static bool req_no_special_merge(struct request *req)
>  static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
>   struct request *next)
>  {
> - int total_phys_segments;
> - unsigned int seg_size =
> - req->biotail->bi_seg_back_size + next->bio->bi_seg_front_size;
> + int total_phys_segments = req->nr_phys_segments +
> + next->nr_phys_segments;

For DISCARD reqs, the total_phys_segments is still zero here.

>  
>   /*
>* First check if the either of the requests are re-queued
> @@ -574,8 +573,15 @@ static int ll_merge_requests_fn(struct request_queue *q, 
> struct request *req,
>   blk_rq_get_max_sectors(req, blk_rq_pos(req)))
>   return 0;
>  
> - total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
> - if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
> + /*
> +  * If the requests aren't carrying any data payloads, we don't need
> +  * to look at the segment count
> +  */
> + if (bio_has_data(next->bio) &&
> + blk_phys_contig_segment(q, req->biotail, next->bio)) {
> + unsigned int seg_size = req->biotail->bi_seg_back_size +
> + next->bio->bi_seg_front_size;

Yes, total_phys_segments will not be decreased.

> +
>   if (req->nr_phys_segments == 1)
>   req->bio->bi_seg_front_size = seg_size;
>   if (next->nr_phys_segments == 1)
> @@ -584,7 +590,7 @@ static int ll_merge_requests_fn(struct request_queue *q, 
> struct request *req,
>   }
>  
>   if (total_phys_segments > queue_max_segments(q))
> - return 0;
> + return 0;
>  
>   if (blk_integrity_merge_rq(q, req, next) == false)
>   return 0;

But in the end, the merged DISCARD req's nr_phys_segments is still zero.
blk_rq_nr_discard_segments will return 1, but this req has two bios.
blk_rq_nr_discard_segments' comment says:
blk_rq_nr_discard_segments 's comment says
-- 
Each discard bio merged into a request is counted as one segment
--

Maybe the patch below should be applied on top of yours.

diff --git a/block/blk-core.c b/block/blk-core.c
index a2005a4..b444fb7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1763,7 +1763,6 @@ bool bio_attempt_discard_merge(struct request_queue *q, 
struct request *req,
req->biotail = bio;
req->__data_len += bio->bi_iter.bi_size;
req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
-   req->nr_phys_segments = segments + 1;
 
blk_account_io_start(req, false);
return true;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4f3df80..1af2138 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1312,7 +1312,13 @@ static inline unsigned short 
blk_rq_nr_phys_segments(struct request *rq)
  */
 static inline unsigned short blk_rq_nr_discard_segments(struct request *rq)
 {
-   return max_t(unsigned short, rq->nr_phys_segments, 1);
+   struct bio *bio;
+   unsigned short count = 0;
+
+   __rq_for_each_bio(bio, rq)
+   count++;
+
+   return count;
 }


Thanks
Jianchao


Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-31 Thread jianchao.wang
Hi Jens


On 01/31/2018 11:29 PM, Jens Axboe wrote:
> How about something like the below?
> 
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8452fc7164cc..cee102fb060e 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -574,8 +574,13 @@ static int ll_merge_requests_fn(struct request_queue *q, 
> struct request *req,
>   blk_rq_get_max_sectors(req, blk_rq_pos(req)))
>   return 0;
>  
> + /*
> +  * For DISCARDs, the segment count isn't interesting since
> +  * the requests have no data attached.
> +  */
>   total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
> - if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
> + if (total_phys_segments &&
> + blk_phys_contig_segment(q, req->biotail, next->bio)) {
>   if (req->nr_phys_segments == 1)
>   req->bio->bi_seg_front_size = seg_size;
>   if (next->nr_phys_segments == 1)

This patch avoids nr_phys_segments being set to the bogus 0x value, but the merged req
will then have two bios and zero nr_phys_segments.

We have to align with the DISCARD merging strategy.

Please refer to:
/*
 * Number of discard segments (or ranges) the driver needs to fill in.
 * Each discard bio merged into a request is counted as one segment.
 */
static inline unsigned short blk_rq_nr_discard_segments(struct request *rq)
{
   return max_t(unsigned short, rq->nr_phys_segments, 1);
}
bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
struct bio *bio)
{
unsigned short segments = blk_rq_nr_discard_segments(req);

if (segments >= queue_max_discard_segments(q))
goto no_merge;
if (blk_rq_sectors(req) + bio_sectors(bio) >
blk_rq_get_max_sectors(req, blk_rq_pos(req)))
goto no_merge;

req->biotail->bi_next = bio;
req->biotail = bio;
req->__data_len += bio->bi_iter.bi_size;
req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
req->nr_phys_segments = segments + 1;

blk_account_io_start(req, false);
return true;
no_merge:
req_set_nomerge(q, req);
return false;
}

So blk_rq_nr_discard_segments will end up with a wrong value.

Maybe we could change blk_rq_nr_discard_segments to iterate over and count the bios in
the request to decide the DISCARD request's segment count, drop the nr_phys_segments
manipulation in the DISCARD merging path, and apply your patch on top.

Thanks
Jianchao


Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-30 Thread jianchao.wang
Hi Jens

On 01/30/2018 11:57 PM, Jens Axboe wrote:
> On 1/30/18 8:41 AM, Jens Axboe wrote:
>> Hi,
>>
>> I just hit this on 4.15+ on the laptop, it's running Linus' git
>> as of yesterday, right after the block tree merge:
>>
>> commit 0a4b6e2f80aad46fb55a5cf7b1664c0aef030ee0
>> Merge: 9697e9da8429 796baeeef85a
>> Author: Linus Torvalds 
>> Date:   Mon Jan 29 11:51:49 2018 -0800
>>
>> Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block
>>
>> It's hitting this case:
>>
>> if (WARN_ON_ONCE(n != segments)) {   
>>
>>
>> in nvme, indicating that rq->nr_phys_segments does not equal the
>> number of bios in the request. Anyone seen this? I'll take a closer
>> look as time permits, full trace below.
>>
>>
>>  WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 
>> nvme_setup_cmd+0x3d3/0x430
>>  Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc 
>> snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat 
>> snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 
>> snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq 
>> x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb 
>> snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer 
>> videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 
>> crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore 
>> hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd 
>> intel_gtt
>>  CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U   4.15.0+ 
>> #176
>>  Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 
>> 12/19/2017
>>  RIP: 0010:nvme_setup_cmd+0x3d3/0x430
>>  RSP: 0018:880423e9f838 EFLAGS: 00010217
>>  RAX:  RBX: 880423e9f8c8 RCX: 0001
>>  RDX: 88022b200010 RSI: 0002 RDI: 327f
>>  RBP: 880421251400 R08: 88022b20 R09: 0009
>>  R10:  R11:  R12: 
>>  R13: 88042341e280 R14:  R15: 880421251440
>>  FS:  () GS:88044150() knlGS:
>>  CS:  0010 DS:  ES:  CR0: 80050033
>>  CR2: 55b684795030 CR3: 02e09006 CR4: 001606e0
>>  DR0:  DR1:  DR2: 
>>  DR3:  DR6: fffe0ff0 DR7: 0400
>>  Call Trace:
>>   nvme_queue_rq+0x40/0xa00
>>   ? __sbitmap_queue_get+0x24/0x90
>>   ? blk_mq_get_tag+0xa3/0x250
>>   ? wait_woken+0x80/0x80
>>   ? blk_mq_get_driver_tag+0x97/0xf0
>>   blk_mq_dispatch_rq_list+0x7b/0x4a0
>>   ? deadline_remove_request+0x49/0xb0
>>   blk_mq_do_dispatch_sched+0x4f/0xc0
>>   blk_mq_sched_dispatch_requests+0x106/0x170
>>   __blk_mq_run_hw_queue+0x53/0xa0
>>   __blk_mq_delay_run_hw_queue+0x83/0xa0
>>   blk_mq_run_hw_queue+0x6c/0xd0
>>   blk_mq_sched_insert_request+0x96/0x140
>>   __blk_mq_try_issue_directly+0x3d/0x190
>>   blk_mq_try_issue_directly+0x30/0x70
>>   blk_mq_make_request+0x1a4/0x6a0
>>   generic_make_request+0xfd/0x2f0
>>   ? submit_bio+0x5c/0x110
>>   submit_bio+0x5c/0x110
>>   ? __blkdev_issue_discard+0x152/0x200
>>   submit_bio_wait+0x43/0x60
>>   ext4_process_freed_data+0x1cd/0x440
>>   ? account_page_dirtied+0xe2/0x1a0
>>   ext4_journal_commit_callback+0x4a/0xc0
>>   jbd2_journal_commit_transaction+0x17e2/0x19e0
>>   ? kjournald2+0xb0/0x250
>>   kjournald2+0xb0/0x250
>>   ? wait_woken+0x80/0x80
>>   ? commit_timeout+0x10/0x10
>>   kthread+0x111/0x130
>>   ? kthread_create_worker_on_cpu+0x50/0x50
>>   ? do_group_exit+0x3a/0xa0
>>   ret_from_fork+0x1f/0x30
>>  Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 
>> 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 
>> 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff 
>>  ---[ end trace 50d361cc444506c8 ]---
>>  print_req_error: I/O error, dev nvme0n1, sector 847167488
> 
> Looking at the disassembly, 'n' is 2 and 'segments' is 0x.
> 

Let's consider the case following:

blk_mq_bio_to_request()
  -> blk_init_request_from_bio()
-> blk_rq_bio_prep()

if (bio_has_data(bio))
rq->nr_phys_segments = bio_phys_segments(q, bio);

static inline bool bio_has_data(struct bio *bio)
{
if (bio &&
bio->bi_iter.bi_size &&
bio_op(bio) != REQ_OP_DISCARD &&   //-> HERE !
bio_op(bio) != REQ_OP_SECURE_ERASE &&
bio_op(bio) != REQ_OP_WRITE_ZEROES)
return true;

return false;
}
For the DISCARD req, the nr_phys_segments is 0.

dd_insert_request()
  -> blk_mq_sched_try_insert_merge()
-> elv_attempt_insert_merge()
  -> blk_attempt_req_merge()
-> attempt_merge()
  -> ll_merge_requests_fn()

total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
// 0 + 0 for two DISCARD requests; the later total_phys_segments-- then
// underflows the unsigned short
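
Just to illustrate, a minimal userspace sketch of that underflow (not kernel
code, values assumed from the analysis above):

#include <stdio.h>

int main(void)
{
        /* both requests are DISCARDs: bio_has_data() is false, so each one
         * carries nr_phys_segments == 0 */
        unsigned short req_segs = 0, next_segs = 0;
        unsigned short total_phys_segments = req_segs + next_segs;

        /* without the "total_phys_segments &&" guard, blk_phys_contig_segment()
         * succeeds and the original ll_merge_requests_fn() decrements the sum */
        total_phys_segments--;

        printf("total_phys_segments = 0x%x\n", total_phys_segments);
        return 0;
}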

Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-18 Thread jianchao.wang
Hi ming

Sorry for the delayed report on this.

On 01/17/2018 05:57 PM, Ming Lei wrote:
> 2) hctx->next_cpu can become offline from online before __blk_mq_run_hw_queue
> is run, there isn't warning, but once the IO is submitted to hardware,
> after it is completed, how does the HBA/hw queue notify CPU since CPUs
> assigned to this hw queue(irq vector) are offline? blk-mq's timeout
> handler may cover that, but looks too tricky.

In theory, the irq affinity will be migrated to another cpu. This is done by
fixup_irqs() in the context of stop_machine.
However, in my test, I found this log:

[  267.161043] do_IRQ: 7.33 No irq handler for vector

Vector 33 is the one used by the nvme cq.
The irq seems to get lost, and sometimes an IO hang occurred.
It does not happen every time; I think that may be due to the nvme_process_cq
call in nvme_queue_rq.

I added a dump_stack() after the error log and got the following:
[  267.161043] do_IRQ: 7.33 No irq handler for vector migration/7
[  267.161045] CPU: 7 PID: 52 Comm: migration/7 Not tainted 4.15.0-rc7+ #27
[  267.161045] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  267.161046] Call Trace:
[  267.161047]  
[  267.161052]  dump_stack+0x7c/0xb5
[  267.161054]  do_IRQ+0xb9/0xf0
[  267.161056]  common_interrupt+0xa2/0xa2
[  267.161057]  
[  267.161059] RIP: 0010:multi_cpu_stop+0xb0/0x120
[  267.161060] RSP: 0018:bb6c81af7e70 EFLAGS: 0202 ORIG_RAX: 
ffde
[  267.161061] RAX: 0001 RBX: 0004 RCX: 
[  267.161062] RDX: 0006 RSI: 898c4591 RDI: 0202
[  267.161063] RBP: bb6c826e7c88 R08: 991abc1256bc R09: 0005
[  267.161063] R10: bb6c81af7db8 R11: 89c91d20 R12: 0001
[  267.161064] R13: bb6c826e7cac R14: 0003 R15: 
[  267.161067]  ? cpu_stop_queue_work+0x90/0x90
[  267.161068]  cpu_stopper_thread+0x83/0x100
[  267.161070]  smpboot_thread_fn+0x161/0x220
[  267.161072]  kthread+0xf5/0x130
[  267.161073]  ? sort_range+0x20/0x20
[  267.161074]  ? kthread_associate_blkcg+0xe0/0xe0
[  267.161076]  ret_from_fork+0x24/0x30

The irq arrived right after interrupts were re-enabled in multi_cpu_stop.

0x8112d655 is in multi_cpu_stop 
(/home/will/u04/source_code/linux-block/kernel/stop_machine.c:223).
218  */
219 touch_nmi_watchdog();
220 }
221 } while (curstate != MULTI_STOP_EXIT);
222 
223 local_irq_restore(flags);
224 return err;
225 }


Thanks
Jianchao


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-17 Thread jianchao.wang
Hi ming 

Thanks for your kindly response.

On 01/17/2018 02:22 PM, Ming Lei wrote:
> This warning can't be removed completely, for example, the CPU figured
> in blk_mq_hctx_next_cpu(hctx) can be put on again just after the
> following call returns and before __blk_mq_run_hw_queue() is scheduled
> to run.
> 
>   kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), 
> >run_work, msecs_to_jiffies(msecs))
We could check cpu_active in __blk_mq_run_hw_queue() to narrow the window.
There is a big gap between cpu_online and cpu_active; rebind_workers also runs
between them.
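
Roughly what I have in mind, as an untested sketch against the existing check
in __blk_mq_run_hw_queue():

/* untested sketch: check cpu_active() instead of cpu_online() to narrow the
 * window, since workers are rebound before the cpu becomes active again */
WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
        cpu_active(hctx->next_cpu));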

> 
> Just be curious how you trigger this issue? And is it triggered in CPU
> hotplug stress test? Or in a normal use case?

In fact, this comes from my own investigation into whether .queue_rq for one
hardware queue could be executed on a cpu it is not mapped to. I finally found
this hole during cpu hotplug.
I did the test on an NVMe device, which has a 1-to-1 mapping between cpu and hctx:
 - A special patch that could hold some requests on ctx->rq_list through .get_budget
 - A script that issues IOs with fio
 - A script that onlines/offlines the cpus continuously
At first there was just the warning above. Then, after this patch was introduced,
a panic came up.

Thanks
Jianchao


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-16 Thread jianchao.wang
Hi ming

Thanks for your kindly response.

On 01/17/2018 11:52 AM, Ming Lei wrote:
>> It is here.
>> __blk_mq_run_hw_queue()
>> 
>> WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
>> cpu_online(hctx->next_cpu));
> I think this warning is triggered after the CPU of hctx->next_cpu becomes
> online again, and it should have been dealt with by the following trick,
> and this patch is against the previous one, please test it and see if
> the warning can be fixed.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 23f0f3ddffcf..0620ccb65e4e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1452,6 +1452,9 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx 
> *hctx)
>   tried = true;
>   goto select_cpu;
>   }
> +
> + /* handle after this CPU of hctx->next_cpu becomes online again 
> */
> + hctx->next_cpu_batch = 1;
>   return WORK_CPU_UNBOUND;
>   }
>   return hctx->next_cpu;
> 

The WARNING could still be triggered.

[  282.194532] WARNING: CPU: 0 PID: 222 at 
/home/will/u04/source_code/linux-block/block/blk-mq.c:1315 
__blk_mq_run_hw_queue+0x92/0xa0
[  282.194534] Modules linked in: 
[  282.194561] CPU: 0 PID: 222 Comm: kworker/2:1H Tainted: GW
4.15.0-rc7+ #17
[  282.194562] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  282.194563] Workqueue: kblockd blk_mq_run_work_fn
[  282.194565] RIP: 0010:__blk_mq_run_hw_queue+0x92/0xa0
[  282.194565] RSP: 0018:96c50223be60 EFLAGS: 00010202
[  282.194566] RAX: 0001 RBX: 8a7110233800 RCX: 
[  282.194567] RDX: 8a711255f2d0 RSI:  RDI: 8a7110233800
[  282.194567] RBP: 8a7122ca2140 R08:  R09: 
[  282.194568] R10: 0263 R11:  R12: 8a7122ca8200
[  282.194568] R13:  R14: 08a7122ca820 R15: 8a70bcef0780
[  282.194569] FS:  () GS:8a7122c0() 
knlGS:
[  282.194569] CS:  0010 DS:  ES:  CR0: 80050033
[  282.194570] CR2: 555e14d89d60 CR3: 3c809002 CR4: 003606f0
[  282.194571] DR0:  DR1:  DR2: 
[  282.194571] DR3:  DR6: fffe0ff0 DR7: 0400
[  282.194571] Call Trace:
[  282.194576]  process_one_work+0x14b/0x420
[  282.194577]  worker_thread+0x47/0x420
[  282.194579]  kthread+0xf5/0x130
[  282.194581]  ? process_one_work+0x420/0x420
[  282.194581]  ? kthread_associate_blkcg+0xe0/0xe0
[  282.194583]  ret_from_fork+0x24/0x30
[  282.194585] Code: ff 48 89 df e8 e0 6b 00 00 8b 74 24 04 48 89 df e8 a4 fc 
ff ff 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 0e 48 83 c4 10 5b c3 <0f> ff 
eb b7 0f ff eb c1 e8 e1 02 c6 ff 90 0f 1f 44 00 00 48 8b 
[  282.194601] ---[ end trace c104f0a63d3df31b ]---

>> 
>>
>> To eliminate this risk totally, we could blk_mq_hctx_next_cpu return the cpu 
>>  even if the cpu is offlined and modify the cpu_online above to cpu_active.
>> The kworkers of the per-cpu pool must have be migrated back when the cpu is 
>> set active.
>> But there seems to be issues in DASD as your previous comment.
> Yes, we can't break DASD.
> 
>> That is the original version of this patch, and both Christian and Stefan
>> reported that system can't boot from DASD in this way[2], and I changed
>> to AND with cpu_online_mask, then their system can boot well
>> On the other hand, there is also risk in 
>>
>> @@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct 
>> request_queue *q,
>>  blk_queue_exit(q);
>>  return ERR_PTR(-EXDEV);
>>  }
>> -cpu = cpumask_first(alloc_data.hctx->cpumask);
>> +cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>>  alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
>>
>> what if the cpus in alloc_data.hctx->cpumask are all offlined ?
> This one is crazy, and is used by NVMe only, it should be fine if
> the passed 'hctx_idx' is retrieved by the current running CPU, such
> as the way of blk_mq_map_queue(). But if not, bad thing may happen.
Even if it is retrieved by the currently running cpu, the task could still be
migrated away when the cpu is offlined.

Thanks
Jianchao




Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-16 Thread jianchao.wang
Hi ming

Thanks for your patch and kindly response.

On 01/16/2018 11:32 PM, Ming Lei wrote:
> OK, I got it, and it should have been the only corner case in which
> all CPUs mapped to this hctx become offline, and I believe the following
> patch should address this case, could you give a test?
> 
> ---
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c376d1b6309a..23f0f3ddffcf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1416,21 +1416,44 @@ static void __blk_mq_run_hw_queue(struct 
> blk_mq_hw_ctx *hctx)
>   */
>  static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
>  {
> + bool tried = false;
> +
>   if (hctx->queue->nr_hw_queues == 1)
>   return WORK_CPU_UNBOUND;
>  
>   if (--hctx->next_cpu_batch <= 0) {
>   int next_cpu;
> +select_cpu:
>  
>   next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
>   cpu_online_mask);
>   if (next_cpu >= nr_cpu_ids)
>   next_cpu = 
> cpumask_first_and(hctx->cpumask,cpu_online_mask);
>  
> - hctx->next_cpu = next_cpu;
> + /*
> +  * No online CPU can be found here when running from
> +  * blk_mq_hctx_notify_dead(), so make sure hctx->next_cpu
> +  * is set correctly.
> +  */
> + if (next_cpu >= nr_cpu_ids)
> + hctx->next_cpu = cpumask_first_and(hctx->cpumask,
> + cpu_possible_mask);
> + else
> + hctx->next_cpu = next_cpu;
>   hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>   }
>  
> + /*
> +  * Do unbound schedule if we can't find a online CPU for this hctx,
> +  * and it should happen only if hctx->next_cpu is becoming DEAD.
> +  */
> + if (!cpu_online(hctx->next_cpu)) {
> + if (!tried) {
> + tried = true;
> + goto select_cpu;
> + }
> + return WORK_CPU_UNBOUND;
> + }
>   return hctx->next_cpu;
>  }

I have tested this patch. The panic was gone, but I got the following:

[  231.674464] WARNING: CPU: 0 PID: 263 at 
/home/will/u04/source_code/linux-block/block/blk-mq.c:1315 
__blk_mq_run_hw_queue+0x92/0xa0
[  231.674466] Modules linked in: .
[  231.674494] CPU: 0 PID: 263 Comm: kworker/2:1H Not tainted 4.15.0-rc7+ #12
[  231.674495] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  231.674496] Workqueue: kblockd blk_mq_run_work_fn
[  231.674498] RIP: 0010:__blk_mq_run_hw_queue+0x92/0xa0
[  231.674499] RSP: 0018:a9c801fcfe60 EFLAGS: 00010202
[  231.674500] RAX: 0001 RBX: 9c7c90231400 RCX: 
[  231.674500] RDX: 9c7c9255b0f8 RSI:  RDI: 9c7c90231400
[  231.674500] RBP: 9c7ca2ca2140 R08:  R09: 
[  231.674501] R10: 03cb R11:  R12: 9c7ca2ca8200
[  231.674501] R13:  R14: 09c7ca2ca820 R15: 9c7c3df25240
[  231.674502] FS:  () GS:9c7ca2c0() 
knlGS:
[  231.674502] CS:  0010 DS:  ES:  CR0: 80050033
[  231.674503] CR2: 01727008 CR3: 000336409003 CR4: 003606f0
[  231.674504] DR0:  DR1:  DR2: 
[  231.674504] DR3:  DR6: fffe0ff0 DR7: 0400
[  231.674504] Call Trace:
[  231.674509]  process_one_work+0x14b/0x420
[  231.674510]  worker_thread+0x47/0x420
[  231.674512]  kthread+0xf5/0x130
[  231.674514]  ? process_one_work+0x420/0x420
[  231.674515]  ? kthread_associate_blkcg+0xe0/0xe0
[  231.674517]  ? do_group_exit+0x3a/0xa0
[  231.674518]  ret_from_fork+0x24/0x30
[  231.674520] Code: ff 48 89 df e8 e0 6b 00 00 8b 74 24 04 48 89 df e8 a4 fc 
ff ff 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 0e 48 83 c4 10 5b c3 <0f> ff 
eb b7 0f ff eb c1 e8 e1 02 c6 ff 90 0f 1f 44 00 00 48 8b 
[  231.674537] ---[ end trace cc2de957e0e0fed4 ]---

It is here.
__blk_mq_run_hw_queue()

WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
cpu_online(hctx->next_cpu));


To eliminate this risk entirely, we could have blk_mq_hctx_next_cpu return the cpu
even if that cpu is offlined, and change the cpu_online check above to cpu_active.
The kworkers of the per-cpu pool must have been migrated back by the time the cpu
is set active.
But there seem to be issues with DASD, as your previous comment noted:

That is the original version of this patch, and both Christian and Stefan
reported that system can't boot from DASD in this way[2], and I changed
to AND with cpu_online_mask, then their system can boot well


On the other hand, there is also risk in 

@@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct 
request_queue *q,
blk_queue_exit(q);
return ERR_PTR(-EXDEV);
}
-   cpu = 

Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-16 Thread jianchao.wang
Hi minglei

On 01/16/2018 08:10 PM, Ming Lei wrote:
>>> -   next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
>>> +   next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
>>> +   cpu_online_mask);
>>> if (next_cpu >= nr_cpu_ids)
>>> -   next_cpu = cpumask_first(hctx->cpumask);
>>> +   next_cpu = 
>>> cpumask_first_and(hctx->cpumask,cpu_online_mask);
>> the next_cpu here could be >= nr_cpu_ids when none of the cpus on
>> hctx->cpumask is online.
> That supposes not happen because storage device(blk-mq hw queue) is
> generally C/S model, that means the queue becomes only active when
> there is online CPU mapped to it.
> 
> But it won't be true for non-block-IO queue, such as HPSA's queues[1], and
> network controller RX queues.
> 
> [1] 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dkernel-26m-3D151601867018444-26w-3D2=DwIBAg=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ=tCZdQH6JUW1dkNCN92ycoUoKfDU_qWj-7EsUoYpOeJ0=vgHC9sbjYQb7mtY9MUJzbVXyVEyjoNJPWEx4_rfrHxU=
> 
> One thing I am still not sure(but generic irq affinity supposes to deal with
> well) is that the CPU may become offline after the IO is just submitted,
> then where the IRQ controller delivers the interrupt of this hw queue
> to?
> 
>> This could be reproduced on NVMe with a patch that could hold some rqs on 
>> ctx->rq_list,
>> meanwhile a script online and offline the cpus. Then a panic occurred in 
>> __queue_work().
> That shouldn't happen, when CPU offline happens the rqs in ctx->rq_list
> are dispatched directly, please see blk_mq_hctx_notify_dead().

Yes, I know. blk_mq_hctx_notify_dead will be invoked after the cpu has been
set offline.
Please refer to the following diagram.

CPU A                                 CPU T
  _cpu_down()
      | kick
      v
                                      cpuhp_thread_fun (cpuhp/T kthread)
                                        AP_ACTIVE            (clear cpu_active_mask)
                                            |
                                            v
                                        AP_WORKQUEUE_ONLINE   (unbind workers)
                                            |
                                            v
                                        TEARDOWN_CPU          (stop_machine)
                                            | execute, cpuhp/T kthread is preempted
                                            v
                                      take_cpu_down (migration kthread)
                                        set_cpu_online(smp_processor_id(), false)
                                            (__cpu_disable)               --> Here !!!
                                        TEARDOWN_CPU continues:
                                            |
                                            v
      cpuhp/T kthread is                AP_SCHED_STARTING     (migrate_tasks)
      migrated away to CPU X                |
                                            v
                                        AP_OFFLINE
                                            |
                                            v
                                      do_idle (idle task)
                                        cpuhp_report_idle_dead
                                          complete st->done_down
  __cpu_die (cpuhp/T kthread, teardown_cpu)
    AP_OFFLINE
        |
        v
    BRINGUP_CPU
        |
        v
    BLK_MQ_DEAD                                               --> Here !!!
        |
        v
    OFFLINE

The cpu has already been cleared from cpu_online_mask by the time
blk_mq_hctx_notify_dead is invoked.
If the device is an NVMe device, which has only one cpu mapped to the hctx,
cpumask_first_and(hctx->cpumask, cpu_online_mask) will return a bad value.
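
A tiny userspace sketch of why (mask values are assumed; first_and() just
mirrors my understanding of cpumask_first_and() when the two masks no longer
intersect):

#include <stdio.h>

#define NR_CPU_IDS 8

/* return the first cpu set in both masks, or nr_cpu_ids if there is none */
static int first_and(unsigned int a, unsigned int b, int nr_cpu_ids)
{
        unsigned int both = a & b;

        for (int cpu = 0; cpu < nr_cpu_ids; cpu++)
                if (both & (1u << cpu))
                        return cpu;
        return nr_cpu_ids;
}

int main(void)
{
        unsigned int hctx_mask = 1u << 3;             /* hctx mapped to cpu 3 only (NVMe 1:1) */
        unsigned int online_mask = ~(1u << 3) & 0xff; /* cpu 3 already cleared from online mask */

        int next_cpu = first_and(hctx_mask, online_mask, NR_CPU_IDS);

        /* next_cpu == nr_cpu_ids: using it as a cpu index is the "bad value" */
        printf("next_cpu = %d, nr_cpu_ids = %d\n", next_cpu, NR_CPU_IDS);
        return 0;
}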

I even got a backtrace showing that the panic occurred in
blk_mq_hctx_notify_dead -> kblockd_schedule_delayed_work_on -> __queue_work.
Kdump doesn't work well on my machine, so I cannot share the backtrace here,
which is really sad.
I even added a BUG_ON as follows, and it could be triggered.

if (next_cpu >= nr_cpu_ids)
        next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
BUG_ON(next_cpu >= nr_cpu_ids);


Thanks
Jianchao
> 
>> maybe cpu_possible_mask here, the workers in the pool of the offlined cpu 
>> has been unbound.
>> It should be ok to queue on them.

