Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-26 Thread Ming Lei
Hi Jianchao,

On Fri, Jan 19, 2018 at 11:05:35AM +0800, jianchao.wang wrote:
> Hi ming
> 
> Sorry for the delayed report on this.
> 
> On 01/17/2018 05:57 PM, Ming Lei wrote:
> > 2) hctx->next_cpu can go from online to offline before __blk_mq_run_hw_queue
> > is run; there is no warning in that case, but once the IO is submitted to
> > hardware, after it is completed, how does the HBA/hw queue notify the CPU,
> > given that the CPUs assigned to this hw queue (irq vector) are offline?
> > blk-mq's timeout handler may cover that, but it looks too tricky.
> 
> In theory, the irq affinity will be migrated to another cpu. This is done by
> fixup_irqs() in the context of stop_machine.

Yes, but the other CPU has to belong to this irq's affinity, and if all
CPUs in the irq's affinity are DEAD, this irq vector will be shut down.
If there are in-flight IOs at that point (or IOs about to be issued),
their completions won't be delivered to any CPU, and for now we seem to
depend on the queue's timeout handler to deal with them.
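A rough sketch of the check this implies, using the generic irq APIs (this
is not blk-mq code; "irq" stands for the hw queue's vector number, which
the driver knows):

	#include <linux/irq.h>
	#include <linux/cpumask.h>

	/* true iff the vector can still deliver a completion to some CPU */
	static bool hw_queue_irq_has_online_cpu(unsigned int irq)
	{
		const struct cpumask *mask = irq_get_affinity_mask(irq);

		/*
		 * If no online CPU is left in the affinity mask, completions
		 * for in-flight requests won't be delivered anywhere, and
		 * only the timeout handler can reclaim those requests.
		 */
		return mask && cpumask_intersects(mask, cpu_online_mask);
	}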
> However, in my test, I found this log:
> 
> [  267.161043] do_IRQ: 7.33 No irq handler for vector
> 
> The 33 is the vector used by the nvme cq.
> The irq seems to get missed, and sometimes an IO hang occurred.

As I mentioned above, it shouldn't be strange to see this in a CPU
offline/online stress test.


-- 
Ming


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-18 Thread jianchao.wang
Hi Ming

Sorry for the delayed report on this.

On 01/17/2018 05:57 PM, Ming Lei wrote:
> 2) hctx->next_cpu can go from online to offline before __blk_mq_run_hw_queue
> is run; there is no warning in that case, but once the IO is submitted to
> hardware, after it is completed, how does the HBA/hw queue notify the CPU,
> given that the CPUs assigned to this hw queue (irq vector) are offline?
> blk-mq's timeout handler may cover that, but it looks too tricky.

In theory, the irq affinity will be migrated to another cpu. This is done by
fixup_irqs() in the context of stop_machine.
However, in my test, I found this log:

[  267.161043] do_IRQ: 7.33 No irq handler for vector

The 33 is the vector used by the nvme cq.
The irq seems to get missed, and sometimes an IO hang occurred.
It does not happen every time; I think that may be due to nvme_process_cq
in nvme_queue_rq.

I added a dump_stack() after the error log and got the following:
[  267.161043] do_IRQ: 7.33 No irq handler for vector migration/7
[  267.161045] CPU: 7 PID: 52 Comm: migration/7 Not tainted 4.15.0-rc7+ #27
[  267.161045] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  267.161046] Call Trace:
[  267.161047]  <IRQ>
[  267.161052]  dump_stack+0x7c/0xb5
[  267.161054]  do_IRQ+0xb9/0xf0
[  267.161056]  common_interrupt+0xa2/0xa2
[  267.161057]  </IRQ>
[  267.161059] RIP: 0010:multi_cpu_stop+0xb0/0x120
[  267.161060] RSP: 0018:bb6c81af7e70 EFLAGS: 0202 ORIG_RAX: ffde
[  267.161061] RAX: 0001 RBX: 0004 RCX: 
[  267.161062] RDX: 0006 RSI: 898c4591 RDI: 0202
[  267.161063] RBP: bb6c826e7c88 R08: 991abc1256bc R09: 0005
[  267.161063] R10: bb6c81af7db8 R11: 89c91d20 R12: 0001
[  267.161064] R13: bb6c826e7cac R14: 0003 R15: 
[  267.161067]  ? cpu_stop_queue_work+0x90/0x90
[  267.161068]  cpu_stopper_thread+0x83/0x100
[  267.161070]  smpboot_thread_fn+0x161/0x220
[  267.161072]  kthread+0xf5/0x130
[  267.161073]  ? sort_range+0x20/0x20
[  267.161074]  ? kthread_associate_blkcg+0xe0/0xe0
[  267.161076]  ret_from_fork+0x24/0x30

The irq occurred just after interrupts were re-enabled in multi_cpu_stop.

0x8112d655 is in multi_cpu_stop 
(/home/will/u04/source_code/linux-block/kernel/stop_machine.c:223).
218  */
219 touch_nmi_watchdog();
220 }
221 } while (curstate != MULTI_STOP_EXIT);
222 
223 local_irq_restore(flags);
224 return err;
225 }


Thanks
Jianchao


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-17 Thread Ming Lei
On Wed, Jan 17, 2018 at 11:07:48AM +0100, Christian Borntraeger wrote:
> 
> 
> On 01/17/2018 10:57 AM, Ming Lei wrote:
> > Hi Jianchao,
> > 
> > On Wed, Jan 17, 2018 at 04:09:11PM +0800, jianchao.wang wrote:
> >> Hi ming 
> >>
> >> Thanks for your kind response.
> >>
> >> On 01/17/2018 02:22 PM, Ming Lei wrote:
> >>> This warning can't be removed completely; for example, the CPU returned
> >>> by blk_mq_hctx_next_cpu(hctx) can come back online just after the
> >>> following call returns and before __blk_mq_run_hw_queue() is scheduled
> >>> to run.
> >>>
> >>>   kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> >>> &hctx->run_work, msecs_to_jiffies(msecs))
> >> We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
> >> There is a big gap between cpu_online and cpu_active. rebind_workers is 
> >> also between them.
> > 
> > This warning is harmless, and I guess you can't reproduce it without the
> > help of your special patch, :-) so the window shouldn't be a big deal.
> 
> FWIW, every WARN_ON is problematic since there are people running with 
> panic_on_warn.
> If a condition can happen, we should not use WARN_ON but something else.

Agree, printk() should be fine, IMO.
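A minimal sketch of that suggestion, assuming the WARN_ON condition from
__blk_mq_run_hw_queue() quoted earlier in this thread (illustrative only,
not the committed fix):

	/* rate-limited so a hotplug storm can't flood the log */
	if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
	    cpu_online(hctx->next_cpu))
		printk_ratelimited(KERN_INFO
			"blk-mq: hw queue %u run on CPU %d outside hctx->cpumask\n",
			hctx->queue_num, raw_smp_processor_id());

This keeps the diagnostic while staying safe for systems that run with
panic_on_warn.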


-- 
Ming


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-17 Thread Christian Borntraeger


On 01/17/2018 11:07 AM, Christian Borntraeger wrote:
> 
> 
> On 01/17/2018 10:57 AM, Ming Lei wrote:
>> Hi Jianchao,
>>
>> On Wed, Jan 17, 2018 at 04:09:11PM +0800, jianchao.wang wrote:
>>> Hi ming 
>>>
>>> Thanks for your kind response.
>>>
>>> On 01/17/2018 02:22 PM, Ming Lei wrote:
>>>> This warning can't be removed completely; for example, the CPU returned
>>>> by blk_mq_hctx_next_cpu(hctx) can come back online just after the
>>>> following call returns and before __blk_mq_run_hw_queue() is scheduled
>>>> to run.
>>>>
>>>>   kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
>>>> &hctx->run_work, msecs_to_jiffies(msecs))
>>> We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
>>> There is a big gap between cpu_online and cpu_active. rebind_workers is 
>>> also between them.
>>
>> This warning is harmless, and I guess you can't reproduce it without the
>> help of your special patch, :-) so the window shouldn't be a big deal.
> 
> FWIW, every WARN_ON is problematic since there are people running with 
> panic_on_warn.

To make it more clear: every WARN_ON that can happen in real life without
actually being an error is problematic.

> If a condition can happen, we should not use WARN_ON but something else.
> 



Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-17 Thread Christian Borntraeger


On 01/17/2018 10:57 AM, Ming Lei wrote:
> Hi Jianchao,
> 
> On Wed, Jan 17, 2018 at 04:09:11PM +0800, jianchao.wang wrote:
>> Hi ming 
>>
>> Thanks for your kind response.
>>
>> On 01/17/2018 02:22 PM, Ming Lei wrote:
>>> This warning can't be removed completely; for example, the CPU returned
>>> by blk_mq_hctx_next_cpu(hctx) can come back online just after the
>>> following call returns and before __blk_mq_run_hw_queue() is scheduled
>>> to run.
>>>
>>> kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
>>> &hctx->run_work, msecs_to_jiffies(msecs))
>> We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
>> There is a big gap between cpu_online and cpu_active. rebind_workers is also 
>> between them.
> 
> This warning is harmless, and I guess you can't reproduce it without the
> help of your special patch, :-) so the window shouldn't be a big deal.

FWIW, every WARN_ON is problematic since there are people running with
panic_on_warn.
If a condition can happen, we should not use WARN_ON but something else.



Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-17 Thread Ming Lei
Hi Jianchao,

On Wed, Jan 17, 2018 at 04:09:11PM +0800, jianchao.wang wrote:
> Hi ming 
> 
> Thanks for your kind response.
> 
> On 01/17/2018 02:22 PM, Ming Lei wrote:
> > This warning can't be removed completely; for example, the CPU returned
> > by blk_mq_hctx_next_cpu(hctx) can come back online just after the
> > following call returns and before __blk_mq_run_hw_queue() is scheduled
> > to run.
> > 
> > kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> > &hctx->run_work, msecs_to_jiffies(msecs))
> We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
> There is a big gap between cpu_online and cpu_active. rebind_workers is also 
> between them.

This warning is harmless, and I guess you can't reproduce it without the
help of your special patch, :-) so the window shouldn't be a big deal.

But the delay (msecs_to_jiffies(msecs)) passed to
kblockd_mod_delayed_work_on() can be a problem, because during that period:

1) hctx->next_cpu can go from offline to online before __blk_mq_run_hw_queue
is run; your warning is triggered, but it is harmless

2) hctx->next_cpu can go from online to offline before __blk_mq_run_hw_queue
is run; there is no warning in that case, but once the IO is submitted to
hardware, after it is completed, how does the HBA/hw queue notify the CPU,
given that the CPUs assigned to this hw queue (irq vector) are offline?
blk-mq's timeout handler may cover that, but it looks too tricky.

> 
> > 
> > Just curious, how do you trigger this issue? Is it triggered in a CPU
> > hotplug stress test, or in a normal use case?
> 
> In fact, this came out of my own investigation into whether the .queue_rq
> of one hardware queue could be executed on a cpu where it is not mapped.
> Finally, I found this hole with cpu hotplug.
> I did the test on an NVMe device, which has a 1-to-1 mapping between cpu
> and hctx:
>  - A special patch that holds some requests on ctx->rq_list through
> .get_budget
>  - A script that issues IOs with fio
>  - A script that onlines/offlines the cpus continuously

Thanks for sharing your reproduction approach.

Without a handler for CPU hotplug, it isn't easy to avoid the warning
completely in __blk_mq_run_hw_queue().
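What such a hotplug handler could look like, as a hedged sketch: the
callback name blk_mq_hctx_notify_online, the hctx->cpuhp_online node, and
the use of CPUHP_AP_ONLINE_DYN are hypothetical here, not existing blk-mq
code; only the cpuhp_* APIs themselves are real:

	static int blk_mq_hctx_notify_online(unsigned int cpu,
					     struct hlist_node *node)
	{
		struct blk_mq_hw_ctx *hctx =
			hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);

		/* force blk_mq_hctx_next_cpu() to re-evaluate on the next run */
		if (cpumask_test_cpu(cpu, hctx->cpumask))
			hctx->next_cpu_batch = 1;
		return 0;
	}

	/* one-time setup, e.g. from blk_mq_init(): */
	int online_state = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
				"block/mq:online",
				blk_mq_hctx_notify_online, NULL);

	/* then per hctx, e.g. from blk_mq_init_hctx(): */
	cpuhp_state_add_instance_nocalls(online_state, &hctx->cpuhp_online);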

> At first, there was just the warning above. Then, after this patch was
> introduced, the panic came up.

We have to fix the panic, so I will post the patch you tested in this thread.

Thanks,
Ming


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-17 Thread jianchao.wang
Hi Ming

Thanks for your kind response.

On 01/17/2018 02:22 PM, Ming Lei wrote:
> This warning can't be removed completely; for example, the CPU returned
> by blk_mq_hctx_next_cpu(hctx) can come back online just after the
> following call returns and before __blk_mq_run_hw_queue() is scheduled
> to run.
> 
> 	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> &hctx->run_work, msecs_to_jiffies(msecs))
We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window
(see the sketch below).
There is a big gap between cpu_online and cpu_active. rebind_workers is also
between them.
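A sketch of that idea, keeping the WARN_ON form only to show the changed
condition (cpu_active() is a real predicate; whether WARN_ON should remain
at all is discussed later in this thread):

	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
		cpu_active(hctx->next_cpu));

Since workers are rebound between the CPUHP online and active steps,
testing cpu_active() instead of cpu_online() would not fire in that gap.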

> 
> Just curious, how do you trigger this issue? Is it triggered in a CPU
> hotplug stress test, or in a normal use case?

In fact, this came out of my own investigation into whether the .queue_rq
of one hardware queue could be executed on a cpu where it is not mapped.
Finally, I found this hole with cpu hotplug.
I did the test on an NVMe device, which has a 1-to-1 mapping between cpu
and hctx:
 - A special patch that holds some requests on ctx->rq_list through .get_budget
 - A script that issues IOs with fio
 - A script that onlines/offlines the cpus continuously (a C sketch of such
   a toggle loop follows below)
At first, there was just the warning above. Then, after this patch was
introduced, the panic came up.
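For reference, a userspace sketch of the online/offline toggle loop in C
(the CPU range 1..3, iteration count, and sleep interval are arbitrary
assumptions; the sysfs path is the standard hotplug interface):

	#include <stdio.h>
	#include <unistd.h>

	/* write 0 or 1 to /sys/devices/system/cpu/cpuN/online */
	static void set_cpu(int cpu, int online)
	{
		char path[64];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/online", cpu);
		f = fopen(path, "w");
		if (!f)
			return;	/* e.g. cpu0 is usually not hot-removable */
		fprintf(f, "%d\n", online);
		fclose(f);
	}

	int main(void)
	{
		for (int i = 0; i < 1000; i++) {
			for (int cpu = 1; cpu <= 3; cpu++)
				set_cpu(cpu, 0);
			usleep(100 * 1000);
			for (int cpu = 1; cpu <= 3; cpu++)
				set_cpu(cpu, 1);
			usleep(100 * 1000);
		}
		return 0;
	}

Run it as root while fio is issuing IO to the NVMe device.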

Thanks
Jianchao


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-16 Thread Ming Lei
Hi Jianchao,

On Wed, Jan 17, 2018 at 01:24:23PM +0800, jianchao.wang wrote:
> Hi ming
> 
> Thanks for your kind response.
> 
> On 01/17/2018 11:52 AM, Ming Lei wrote:
> >> It is here.
> >> __blk_mq_run_hw_queue()
> >> 
> >> WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
> >> cpu_online(hctx->next_cpu));
> > I think this warning is triggered after the CPU of hctx->next_cpu becomes
> > online again, and it should have been dealt with by the following trick,
> > and this patch is against the previous one, please test it and see if
> > the warning can be fixed.
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 23f0f3ddffcf..0620ccb65e4e 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1452,6 +1452,9 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
> > tried = true;
> > goto select_cpu;
> > }
> > +
> > +   /* handle after this CPU of hctx->next_cpu becomes online again */
> > +   hctx->next_cpu_batch = 1;
> > return WORK_CPU_UNBOUND;
> > }
> > return hctx->next_cpu;
> > 
> 
> The WARNING could still be triggered.
> 
> [  282.194532] WARNING: CPU: 0 PID: 222 at /home/will/u04/source_code/linux-block/block/blk-mq.c:1315 __blk_mq_run_hw_queue+0x92/0xa0
> [  282.194534] Modules linked in: 
> [  282.194561] CPU: 0 PID: 222 Comm: kworker/2:1H Tainted: GW 4.15.0-rc7+ #17
> 

This warning can't be removed completely; for example, the CPU returned
by blk_mq_hctx_next_cpu(hctx) can come back online just after the
following call returns and before __blk_mq_run_hw_queue() is scheduled
to run.

	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
		&hctx->run_work, msecs_to_jiffies(msecs))

Just curious, how do you trigger this issue? Is it triggered in a CPU
hotplug stress test, or in a normal use case?

> >> 
> >>
> >> To eliminate this risk totally, we could have blk_mq_hctx_next_cpu return
> >> the cpu even if the cpu is offline, and change the cpu_online above to
> >> cpu_active.
> >> The kworkers of the per-cpu pool must have been migrated back by the time
> >> the cpu is set active.
> >> But there seem to be issues with DASD, as in your previous comment.
> > Yes, we can't break DASD.
> > 
> >> That is the original version of this patch, and both Christian and Stefan
> >> reported that the system can't boot from DASD in this way [2], so I changed
> >> it to AND with cpu_online_mask, and then their systems could boot well.
> >> On the other hand, there is also risk in 
> >>
> >> @@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
> >>blk_queue_exit(q);
> >>return ERR_PTR(-EXDEV);
> >>}
> >> -  cpu = cpumask_first(alloc_data.hctx->cpumask);
> >> +  cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> >>alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> >>
> >> what if the cpus in alloc_data.hctx->cpumask are all offline?
> > This one is crazy, and is used by NVMe only; it should be fine if the
> > passed 'hctx_idx' is derived from the currently running CPU, as
> > blk_mq_map_queue() does. But if not, bad things may happen.
> Even if it is derived from the currently running cpu, the task could still
> be migrated away when the cpu is offlined.
> 

I just checked the NVMe code; it looks like only nvmf_connect_io_queue()
passes one specific 'qid' and calls blk_mq_alloc_request_hctx(); for the
others, NVME_QID_ANY is passed, and then blk_mq_alloc_request() is called
in nvme_alloc_request().

CC NVMe list and guys.

Hello James, Keith, Christoph and Sagi,

It looks like nvmf_connect_io_queue() is run in the following fragile way:

for (i = 1; i < ctrl->ctrl.queue_count; i++) {
	...
	nvmf_connect_io_queue(&ctrl->ctrl, i);
	...
}

The hardware queue idx is passed to nvmf_connect_io_queue() directly, and
finally blk_mq_alloc_request_hctx() is called to allocate a request for the
queue, but all CPUs mapped to this hw queue may be offline while
blk_mq_alloc_request_hctx() is running, so what is the supposed way to
handle this situation?

Is it fine to simply return failure in blk_mq_alloc_request_hctx() in this
situation? But the CPU may come online again later...
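For concreteness, the simple-failure option would be a sketch like the
following, mirroring the -EXDEV path that already exists just above the
cpumask_first_and() call in blk_mq_alloc_request_hctx() (illustrative,
not a tested patch):

	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
	if (cpu >= nr_cpu_ids) {
		/* no online CPU is mapped to this hw queue right now */
		blk_queue_exit(q);
		return ERR_PTR(-EXDEV);
	}
	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);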

Thanks,
Ming


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-16 Thread jianchao.wang
Hi Ming

Thanks for your kind response.

On 01/17/2018 11:52 AM, Ming Lei wrote:
>> It is here.
>> __blk_mq_run_hw_queue()
>> 
>> WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
>> cpu_online(hctx->next_cpu));
> I think this warning is triggered after the CPU of hctx->next_cpu becomes
> online again, and it should have been dealt with by the following trick,
> and this patch is against the previous one, please test it and see if
> the warning can be fixed.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 23f0f3ddffcf..0620ccb65e4e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1452,6 +1452,9 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
>   tried = true;
>   goto select_cpu;
>   }
> +
> +	/* handle after this CPU of hctx->next_cpu becomes online again */
> + hctx->next_cpu_batch = 1;
>   return WORK_CPU_UNBOUND;
>   }
>   return hctx->next_cpu;
> 

The WARNING could still be triggered.

[  282.194532] WARNING: CPU: 0 PID: 222 at /home/will/u04/source_code/linux-block/block/blk-mq.c:1315 __blk_mq_run_hw_queue+0x92/0xa0
[  282.194534] Modules linked in: 
[  282.194561] CPU: 0 PID: 222 Comm: kworker/2:1H Tainted: GW 4.15.0-rc7+ #17
[  282.194562] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  282.194563] Workqueue: kblockd blk_mq_run_work_fn
[  282.194565] RIP: 0010:__blk_mq_run_hw_queue+0x92/0xa0
[  282.194565] RSP: 0018:96c50223be60 EFLAGS: 00010202
[  282.194566] RAX: 0001 RBX: 8a7110233800 RCX: 
[  282.194567] RDX: 8a711255f2d0 RSI:  RDI: 8a7110233800
[  282.194567] RBP: 8a7122ca2140 R08:  R09: 
[  282.194568] R10: 0263 R11:  R12: 8a7122ca8200
[  282.194568] R13:  R14: 08a7122ca820 R15: 8a70bcef0780
[  282.194569] FS:  () GS:8a7122c0() 
knlGS:
[  282.194569] CS:  0010 DS:  ES:  CR0: 80050033
[  282.194570] CR2: 555e14d89d60 CR3: 3c809002 CR4: 003606f0
[  282.194571] DR0:  DR1:  DR2: 
[  282.194571] DR3:  DR6: fffe0ff0 DR7: 0400
[  282.194571] Call Trace:
[  282.194576]  process_one_work+0x14b/0x420
[  282.194577]  worker_thread+0x47/0x420
[  282.194579]  kthread+0xf5/0x130
[  282.194581]  ? process_one_work+0x420/0x420
[  282.194581]  ? kthread_associate_blkcg+0xe0/0xe0
[  282.194583]  ret_from_fork+0x24/0x30
[  282.194585] Code: ff 48 89 df e8 e0 6b 00 00 8b 74 24 04 48 89 df e8 a4 fc 
ff ff 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 0e 48 83 c4 10 5b c3 <0f> ff 
eb b7 0f ff eb c1 e8 e1 02 c6 ff 90 0f 1f 44 00 00 48 8b 
[  282.194601] ---[ end trace c104f0a63d3df31b ]---

>> 
>>
>> To eliminate this risk totally, we could have blk_mq_hctx_next_cpu return
>> the cpu even if the cpu is offline, and change the cpu_online above to
>> cpu_active.
>> The kworkers of the per-cpu pool must have been migrated back by the time
>> the cpu is set active.
>> But there seem to be issues with DASD, as in your previous comment.
> Yes, we can't break DASD.
> 
>> That is the original version of this patch, and both Christian and Stefan
>> reported that the system can't boot from DASD in this way [2], so I changed
>> it to AND with cpu_online_mask, and then their systems could boot well.
>> On the other hand, there is also risk in 
>>
>> @@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>>  blk_queue_exit(q);
>>  return ERR_PTR(-EXDEV);
>>  }
>> -cpu = cpumask_first(alloc_data.hctx->cpumask);
>> +cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>>  alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
>>
>> what if the cpus in alloc_data.hctx->cpumask are all offline?
> This one is crazy, and is used by NVMe only; it should be fine if the
> passed 'hctx_idx' is derived from the currently running CPU, as
> blk_mq_map_queue() does. But if not, bad things may happen.
Even if it is derived from the currently running cpu, the task could still
be migrated away when the cpu is offlined.

Thanks
Jianchao




Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-16 Thread Ming Lei
Hi Jianchao,

On Wed, Jan 17, 2018 at 10:56:13AM +0800, jianchao.wang wrote:
> Hi ming
> 
> Thanks for your patch and kind response.

You are welcome!

> 
> On 01/16/2018 11:32 PM, Ming Lei wrote:
> > OK, I got it, and it should have been the only corner case in which
> > all CPUs mapped to this hctx become offline, and I believe the following
> > patch should address this case; could you give it a test?
> > 
> > ---
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index c376d1b6309a..23f0f3ddffcf 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1416,21 +1416,44 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
> >   */
> >  static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
> >  {
> > +   bool tried = false;
> > +
> > if (hctx->queue->nr_hw_queues == 1)
> > return WORK_CPU_UNBOUND;
> >  
> > if (--hctx->next_cpu_batch <= 0) {
> > int next_cpu;
> > +select_cpu:
> >  
> > next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
> > cpu_online_mask);
> > if (next_cpu >= nr_cpu_ids)
> > next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
> >  
> > -   hctx->next_cpu = next_cpu;
> > +   /*
> > +* No online CPU can be found here when running from
> > +* blk_mq_hctx_notify_dead(), so make sure hctx->next_cpu
> > +* is set correctly.
> > +*/
> > +   if (next_cpu >= nr_cpu_ids)
> > +   hctx->next_cpu = cpumask_first_and(hctx->cpumask,
> > +   cpu_possible_mask);
> > +   else
> > +   hctx->next_cpu = next_cpu;
> > hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
> > }
> >  
> > +   /*
> > +* Do unbound schedule if we can't find an online CPU for this hctx,
> > +* and it should happen only if hctx->next_cpu is becoming DEAD.
> > +*/
> > +   if (!cpu_online(hctx->next_cpu)) {
> > +   if (!tried) {
> > +   tried = true;
> > +   goto select_cpu;
> > +   }
> > +   return WORK_CPU_UNBOUND;
> > +   }
> > return hctx->next_cpu;
> >  }
> 
> I have tested this patch. The panic was gone, but I got the following:
> 
> [  231.674464] WARNING: CPU: 0 PID: 263 at /home/will/u04/source_code/linux-block/block/blk-mq.c:1315 __blk_mq_run_hw_queue+0x92/0xa0
>

..

> It is here.
> __blk_mq_run_hw_queue()
> 
> WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
> cpu_online(hctx->next_cpu));

I think this warning is triggered after the CPU of hctx->next_cpu becomes
online again, and it should have been dealt with by the following trick.
This patch is on top of the previous one; please test it and see if the
warning goes away.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 23f0f3ddffcf..0620ccb65e4e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1452,6 +1452,9 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
tried = true;
goto select_cpu;
}
+
+   /* handle after this CPU of hctx->next_cpu becomes online again */
+   hctx->next_cpu_batch = 1;
return WORK_CPU_UNBOUND;
}
return hctx->next_cpu;

> 
> 
> To eliminate this risk totally, we could have blk_mq_hctx_next_cpu return
> the cpu even if the cpu is offline, and change the cpu_online above to
> cpu_active.
> The kworkers of the per-cpu pool must have been migrated back by the time
> the cpu is set active.
> But there seem to be issues with DASD, as in your previous comment.

Yes, we can't break DASD.

> 
> That is the original version of this patch, and both Christian and Stefan
> reported that the system can't boot from DASD in this way [2], so I changed
> it to AND with cpu_online_mask, and then their systems could boot well.
> 
> 
> On the other hand, there is also risk in 
> 
> @@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>   blk_queue_exit(q);
>   return ERR_PTR(-EXDEV);
>   }
> - cpu = cpumask_first(alloc_data.hctx->cpumask);
> + cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>   alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> 
> what if the cpus in alloc_data.hctx->cpumask are all offline?

This one is crazy, and is used by NVMe only; it should be fine if the
passed 'hctx_idx' is derived from the currently running CPU, as
blk_mq_map_queue() does. But if not, bad things may happen.

Thanks,
Ming


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-16 Thread jianchao.wang
Hi Ming

Thanks for your patch and kind response.

On 01/16/2018 11:32 PM, Ming Lei wrote:
> OK, I got it, and it should have been the only corner case in which
> all CPUs mapped to this hctx become offline, and I believe the following
> patch should address this case; could you give it a test?
> 
> ---
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c376d1b6309a..23f0f3ddffcf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1416,21 +1416,44 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>   */
>  static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
>  {
> + bool tried = false;
> +
>   if (hctx->queue->nr_hw_queues == 1)
>   return WORK_CPU_UNBOUND;
>  
>   if (--hctx->next_cpu_batch <= 0) {
>   int next_cpu;
> +select_cpu:
>  
>   next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
>   cpu_online_mask);
>   if (next_cpu >= nr_cpu_ids)
> next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
>  
> - hctx->next_cpu = next_cpu;
> + /*
> +  * No online CPU can be found here when running from
> +  * blk_mq_hctx_notify_dead(), so make sure hctx->next_cpu
> +  * is set correctly.
> +  */
> + if (next_cpu >= nr_cpu_ids)
> + hctx->next_cpu = cpumask_first_and(hctx->cpumask,
> + cpu_possible_mask);
> + else
> + hctx->next_cpu = next_cpu;
>   hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>   }
>  
> + /*
> +  * Do unbound schedule if we can't find an online CPU for this hctx,
> +  * and it should happen only if hctx->next_cpu is becoming DEAD.
> +  */
> + if (!cpu_online(hctx->next_cpu)) {
> + if (!tried) {
> + tried = true;
> + goto select_cpu;
> + }
> + return WORK_CPU_UNBOUND;
> + }
>   return hctx->next_cpu;
>  }

I have tested this patch. The panic was gone, but I got the following:

[  231.674464] WARNING: CPU: 0 PID: 263 at /home/will/u04/source_code/linux-block/block/blk-mq.c:1315 __blk_mq_run_hw_queue+0x92/0xa0
[  231.674466] Modules linked in: .
[  231.674494] CPU: 0 PID: 263 Comm: kworker/2:1H Not tainted 4.15.0-rc7+ #12
[  231.674495] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  231.674496] Workqueue: kblockd blk_mq_run_work_fn
[  231.674498] RIP: 0010:__blk_mq_run_hw_queue+0x92/0xa0
[  231.674499] RSP: 0018:a9c801fcfe60 EFLAGS: 00010202
[  231.674500] RAX: 0001 RBX: 9c7c90231400 RCX: 
[  231.674500] RDX: 9c7c9255b0f8 RSI:  RDI: 9c7c90231400
[  231.674500] RBP: 9c7ca2ca2140 R08:  R09: 
[  231.674501] R10: 03cb R11:  R12: 9c7ca2ca8200
[  231.674501] R13:  R14: 09c7ca2ca820 R15: 9c7c3df25240
[  231.674502] FS:  () GS:9c7ca2c0() 
knlGS:
[  231.674502] CS:  0010 DS:  ES:  CR0: 80050033
[  231.674503] CR2: 01727008 CR3: 000336409003 CR4: 003606f0
[  231.674504] DR0:  DR1:  DR2: 
[  231.674504] DR3:  DR6: fffe0ff0 DR7: 0400
[  231.674504] Call Trace:
[  231.674509]  process_one_work+0x14b/0x420
[  231.674510]  worker_thread+0x47/0x420
[  231.674512]  kthread+0xf5/0x130
[  231.674514]  ? process_one_work+0x420/0x420
[  231.674515]  ? kthread_associate_blkcg+0xe0/0xe0
[  231.674517]  ? do_group_exit+0x3a/0xa0
[  231.674518]  ret_from_fork+0x24/0x30
[  231.674520] Code: ff 48 89 df e8 e0 6b 00 00 8b 74 24 04 48 89 df e8 a4 fc 
ff ff 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 0e 48 83 c4 10 5b c3 <0f> ff 
eb b7 0f ff eb c1 e8 e1 02 c6 ff 90 0f 1f 44 00 00 48 8b 
[  231.674537] ---[ end trace cc2de957e0e0fed4 ]---

It is here.
__blk_mq_run_hw_queue()

WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
cpu_online(hctx->next_cpu));


To eliminate this risk totally, we could have blk_mq_hctx_next_cpu return
the cpu even if the cpu is offline, and change the cpu_online above to
cpu_active.
The kworkers of the per-cpu pool must have been migrated back by the time
the cpu is set active.
But there seem to be issues with DASD, as in your previous comment.

That is the original version of this patch, and both Christian and Stefan
reported that the system can't boot from DASD in this way [2], so I changed
it to AND with cpu_online_mask, and then their systems could boot well.


On the other hand, there is also risk in 

@@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
	blk_queue_exit(q);
	return ERR_PTR(-EXDEV);
	}
-	cpu = cpumask_first(alloc_data.hctx->cpumask);
+	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

what if the cpus in alloc_data.hctx->cpumask are all offline?

Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-16 Thread Ming Lei
On Tue, Jan 16, 2018 at 10:31:42PM +0800, jianchao.wang wrote:
> Hi Ming Lei
> 
> On 01/16/2018 08:10 PM, Ming Lei wrote:
> >>> - next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
> >>> + next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
> >>> + cpu_online_mask);
> >>>   if (next_cpu >= nr_cpu_ids)
> >>> - next_cpu = cpumask_first(hctx->cpumask);
> >>> + next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
> >> the next_cpu here could be >= nr_cpu_ids when none of the CPUs in
> >> hctx->cpumask are online.
> > That is supposed not to happen, because a storage device (blk-mq hw queue)
> > generally follows a C/S model, which means the queue only becomes active
> > when an online CPU is mapped to it.
> > 
> > But that won't be true for non-block-IO queues, such as HPSA's queues [1]
> > and network controller RX queues.
> > 
> > [1] https://marc.info/?l=linux-kernel&m=151601867018444&w=2
> > 
> > One thing I am still not sure about (though generic irq affinity is
> > supposed to deal with it) is that the CPU may go offline right after the
> > IO is submitted; where does the IRQ controller then deliver the interrupt
> > of this hw queue?
> > 
> >> This could be reproduced on NVMe with a patch that holds some rqs on
> >> ctx->rq_list while a script onlines and offlines the cpus. Then a panic
> >> occurred in __queue_work().
> > That shouldn't happen: when CPU offline happens, the rqs in ctx->rq_list
> > are dispatched directly; please see blk_mq_hctx_notify_dead().
> 
> Yes, I know. blk_mq_hctx_notify_dead will be invoked after the cpu has
> been set offline.
> Please refer to the following diagram.
> 
> CPU A  CPU T
>  kick  
>   _cpu_down() ->   cpuhp_thread_fun (cpuhpT kthread)
>AP_ACTIVE   (clear cpu_active_mask)
>  |
>  v
>AP_WORKQUEUE_ONLINE (unbind workers)
>  |
>  v
>TEARDOWN_CPU(stop_machine)
> ,   | execute
>  \_ _ _ _ _ _   v
> preempt  V  take_cpu_down ( migration 
> kthread)
> 
> set_cpu_online(smp_processor_id(), false) (__cpu_disable)  --> Here !!!
> TEARDOWN_CPU
> |
>  cpuhpT kthead is|  v
>  migrated away   ,AP_SCHED_STARTING 
> (migrate_tasks)
>  _ _ _ _ _ _ _ _/   |
> V   v
>   CPU X   AP_OFFLINE
> 
> |
> ,
>  _ _ _ _ _ /
> V
>   do_idle (idle task)
>  <_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ cpuhp_report_idle_dead
>  complete st->done_down
>__cpu_die (cpuhpT kthread, teardown_cpu) 
> 
>  AP_OFFLINE
>|
>v
>  BRINGUP_CPU
>|
>v
>  BLK_MQ_DEAD---> Here !!!
>|
>v
>  OFFLINE
> 
> The cpu has been cleared from cpu_online_mask by the time
> blk_mq_hctx_notify_dead is invoked.
> If the device is an NVMe device, which has only one cpu mapped to the hctx,
> cpumask_first_and(hctx->cpumask,cpu_online_mask) will return a bad value.

Hi Jianchao,

OK, I got it, and it should have been the only corner case in which
all CPUs mapped to this hctx become offline, and I believe the following
patch should address this case; could you give it a test?

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c376d1b6309a..23f0f3ddffcf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1416,21 +1416,44 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
  */
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
+   bool tried = false;
+
if (hctx->queue->nr_hw_queues == 1)
return WORK_CPU_UNBOUND;
 
if (--hctx->next_cpu_batch <= 0) {
int next_cpu;
+select_cpu:
 
next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
cpu_online_mask);
if (next_cpu >= nr_cpu_ids)
next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);

+   /*
+* No online CPU can be found here when running from
+* blk_mq_hctx_notify_dead(), so make sure hctx->next_cpu
+* is set correctly.
+*/
+   if (next_cpu >= nr_cpu_ids)
+   hctx->next_cpu = cpumask_first_and(hctx->cpumask,
+   cpu_possible_mask);
+   else
+   hctx->next_cpu = next_cpu;
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}

+   /*
+* Do unbound schedule if we can't find an online CPU for this hctx,
+* and it should happen only if hctx->next_cpu is becoming DEAD.
+*/
+   if (!cpu_online(hctx->next_cpu)) {
+   if (!tried) {
+   tried = true;
+   goto select_cpu;
+   }
+   return WORK_CPU_UNBOUND;
+   }
return hctx->next_cpu;
 }

Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-16 Thread jianchao.wang
Hi Ming Lei

On 01/16/2018 08:10 PM, Ming Lei wrote:
>>> -   next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
>>> +   next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
>>> +   cpu_online_mask);
>>> if (next_cpu >= nr_cpu_ids)
>>> -   next_cpu = cpumask_first(hctx->cpumask);
>>> +   next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
>> the next_cpu here could be >= nr_cpu_ids when none of the CPUs in
>> hctx->cpumask are online.
> That is supposed not to happen, because a storage device (blk-mq hw queue)
> generally follows a C/S model, which means the queue only becomes active
> when an online CPU is mapped to it.
> 
> But that won't be true for non-block-IO queues, such as HPSA's queues [1]
> and network controller RX queues.
> 
> [1] https://marc.info/?l=linux-kernel&m=151601867018444&w=2
> 
> One thing I am still not sure about (though generic irq affinity is
> supposed to deal with it) is that the CPU may go offline right after the
> IO is submitted; where does the IRQ controller then deliver the interrupt
> of this hw queue?
> 
>> This could be reproduced on NVMe with a patch that holds some rqs on
>> ctx->rq_list while a script onlines and offlines the cpus. Then a panic
>> occurred in __queue_work().
> That shouldn't happen: when CPU offline happens, the rqs in ctx->rq_list
> are dispatched directly; please see blk_mq_hctx_notify_dead().

Yes, I know. blk_mq_hctx_notify_dead will be invoked after the cpu has
been set offline.
Please refer to the following diagram.

CPU A  CPU T
 kick
  _cpu_down() ->   cpuhp_thread_fun (cpuhpT kthread)
   AP_ACTIVE   (clear cpu_active_mask)
 |
 v
   AP_WORKQUEUE_ONLINE (unbind workers)
 |
 v
   TEARDOWN_CPU(stop_machine)
,   | execute
 \_ _ _ _ _ _   v
preempt  V  take_cpu_down (migration kthread)

set_cpu_online(smp_processor_id(), false) (__cpu_disable)  --> Here !!!
TEARDOWN_CPU
|
 cpuhpT kthread is|  v
 migrated away   ,AP_SCHED_STARTING (migrate_tasks)
 _ _ _ _ _ _ _ _/   |
V   v
  CPU X   AP_OFFLINE

|
,
 _ _ _ _ _ /
V
  do_idle (idle task)
 <_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ cpuhp_report_idle_dead
 complete st->done_down
   __cpu_die (cpuhpT kthread, teardown_cpu)

 AP_OFFLINE
   |
   v
 BRINGUP_CPU
   |
   v
 BLK_MQ_DEAD---> Here !!!
   |
   v
 OFFLINE

The cpu has been cleared from cpu_online_mask by the time
blk_mq_hctx_notify_dead is invoked.
If the device is an NVMe device, which has only one cpu mapped to the hctx,
cpumask_first_and(hctx->cpumask,cpu_online_mask) will return a bad value.

I even got a backtrace showing that the panic occurred in
blk_mq_hctx_notify_dead -> kblockd_schedule_delayed_work_on -> __queue_work.
Kdump doesn't work well on my machine, so I cannot share the backtrace here,
which is really sad.
I also added a BUG_ON as follows, and it could be triggered.

if (next_cpu >= nr_cpu_ids)
next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
BUG_ON(next_cpu >= nr_cpu_ids);


Thanks
Jianchao
> 
> maybe cpu_possible_mask here; the workers in the pool of the offlined cpu
> have been unbound.
> It should be ok to queue on them.


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-16 Thread Ming Lei
Hi Jianchao,

On Tue, Jan 16, 2018 at 06:12:09PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 01/12/2018 10:53 AM, Ming Lei wrote:
> > From: Christoph Hellwig 
> > 
> > The previous patch assigns interrupt vectors to all possible CPUs, so
> > now hctx can be mapped to possible CPUs, this patch applies this fact
> > to simplify queue mapping & schedule so that we don't need to handle
> > CPU hotplug for dealing with physical CPU plug & unplug. With this
> > simplification, we can work well on physical CPU plug & unplug, which
> > is a normal use case for VM at least.
> > 
> > Make sure we allocate blk_mq_ctx structures for all possible CPUs, and
> > set hctx->numa_node for possible CPUs which are mapped to this hctx. And
> > only choose the online CPUs for schedule.
> > 
> > Reported-by: Christian Borntraeger 
> > Tested-by: Christian Borntraeger 
> > Tested-by: Stefan Haberland 
> > Cc: Thomas Gleixner 
> > Signed-off-by: Christoph Hellwig 
> > (merged the three into one because any single one may not work, and fix
> > selecting online CPUs for scheduler)
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-mq.c | 19 ---
> >  1 file changed, 8 insertions(+), 11 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 8000ba6db07d..ef9beca2d117 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
> > blk_queue_exit(q);
> > return ERR_PTR(-EXDEV);
> > }
> > -   cpu = cpumask_first(alloc_data.hctx->cpumask);
> > +   cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> > alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> >  
> > rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> > @@ -1323,9 +1323,10 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
> > if (--hctx->next_cpu_batch <= 0) {
> > int next_cpu;
> >  
> > -   next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
> > +   next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
> > +   cpu_online_mask);
> > if (next_cpu >= nr_cpu_ids)
> > -   next_cpu = cpumask_first(hctx->cpumask);
> > +   next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
> 
> the next_cpu here could be >= nr_cpu_ids when none of the CPUs in
> hctx->cpumask are online.

That is supposed not to happen, because a storage device (blk-mq hw queue)
generally follows a C/S model, which means the queue only becomes active
when an online CPU is mapped to it.

But that won't be true for non-block-IO queues, such as HPSA's queues [1]
and network controller RX queues.

[1] https://marc.info/?l=linux-kernel&m=151601867018444&w=2

One thing I am still not sure about (though generic irq affinity is
supposed to deal with it) is that the CPU may go offline right after the
IO is submitted; where does the IRQ controller then deliver the interrupt
of this hw queue?

> This could be reproduced on NVMe with a patch that holds some rqs on
> ctx->rq_list while a script onlines and offlines the cpus. Then a panic
> occurred in __queue_work().

That shouldn't happen: when CPU offline happens, the rqs in ctx->rq_list
are dispatched directly; please see blk_mq_hctx_notify_dead().

> 
> maybe cpu_possible_mask here; the workers in the pool of the offlined cpu
> have been unbound.
> It should be ok to queue on them.

That is the original version of this patch, and both Christian and Stefan
reported that the system can't boot from DASD in this way [2], so I changed
it to AND with cpu_online_mask, and then their systems could boot well.

[2] https://marc.info/?l=linux-kernel&m=151256312722285&w=2

-- 
Ming


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-16 Thread jianchao.wang
Hi Ming

On 01/12/2018 10:53 AM, Ming Lei wrote:
> From: Christoph Hellwig 
> 
> The previous patch assigns interrupt vectors to all possible CPUs, so
> now hctx can be mapped to possible CPUs, this patch applies this fact
> to simplify queue mapping & schedule so that we don't need to handle
> CPU hotplug for dealing with physical CPU plug & unplug. With this
> simplification, we can work well on physical CPU plug & unplug, which
> is a normal use case for VM at least.
> 
> Make sure we allocate blk_mq_ctx structures for all possible CPUs, and
> set hctx->numa_node for possible CPUs which are mapped to this hctx. And
> only choose the online CPUs for schedule.
> 
> Reported-by: Christian Borntraeger 
> Tested-by: Christian Borntraeger 
> Tested-by: Stefan Haberland 
> Cc: Thomas Gleixner 
> Signed-off-by: Christoph Hellwig 
> (merged the three into one because any single one may not work, and fix
> selecting online CPUs for scheduler)
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq.c | 19 ---
>  1 file changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8000ba6db07d..ef9beca2d117 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>   blk_queue_exit(q);
>   return ERR_PTR(-EXDEV);
>   }
> - cpu = cpumask_first(alloc_data.hctx->cpumask);
> + cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>   alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
>  
>   rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> @@ -1323,9 +1323,10 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
>   if (--hctx->next_cpu_batch <= 0) {
>   int next_cpu;
>  
> - next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
> + next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
> + cpu_online_mask);
>   if (next_cpu >= nr_cpu_ids)
> - next_cpu = cpumask_first(hctx->cpumask);
> + next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);

the next_cpu here could be >= nr_cpu_ids when none of the CPUs in
hctx->cpumask are online.
This could be reproduced on NVMe with a patch that holds some rqs on
ctx->rq_list while a script onlines and offlines the cpus. Then a panic
occurred in __queue_work().

Maybe use cpu_possible_mask here; the workers in the pool of the offlined
cpu have been unbound, so it should be ok to queue work on them (see the
sketch below).
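The cpu_possible_mask variant suggested here, sketched against the hunk
quoted above (this is also the form that was later reported to break DASD
boot, per Ming's reply in this thread):

	next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
			cpu_possible_mask);
	if (next_cpu >= nr_cpu_ids)
		next_cpu = cpumask_first_and(hctx->cpumask, cpu_possible_mask);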

Thanks
Jianchao 


>   hctx->next_cpu = next_cpu;
>   hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
> @@ -2219,16 +2220,11 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
>   INIT_LIST_HEAD(&__ctx->rq_list);
>   __ctx->queue = q;
>  
> - /* If the cpu isn't present, the cpu is mapped to first hctx */
> - if (!cpu_present(i))
> - continue;
> -
> - hctx = blk_mq_map_queue(q, i);
> -
>   /*
>* Set local node, IFF we have more than one hw queue. If
>* not, we remain on the home node of the device
>*/
> + hctx = blk_mq_map_queue(q, i);
>   if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
>   hctx->numa_node = local_memory_node(cpu_to_node(i));
>   }
> @@ -2285,7 +2281,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>*
>* If the cpu isn't present, the cpu is mapped to first hctx.
>*/
> - for_each_present_cpu(i) {
> + for_each_possible_cpu(i) {
>   hctx_idx = q->mq_map[i];
>   /* unmapped hw queue can be remapped after CPU topo changed */
>   if (!set->tags[hctx_idx] &&
> @@ -2339,7 +2335,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>   /*
>* Initialize batch roundrobin counts
>*/
> - hctx->next_cpu = cpumask_first(hctx->cpumask);
> + hctx->next_cpu = cpumask_first_and(hctx->cpumask,
> + cpu_online_mask);
>   hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>   }
>  }
> 


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-16 Thread Stefan Haberland

Ming or Christoph,

would you mind sending this patch to stable #4.12? Or is the Fixes tag
enough to get this fixed in all related releases?


Regards,
Stefan



[PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-11 Thread Ming Lei
From: Christoph Hellwig 

The previous patch assigns interrupt vectors to all possible CPUs, so
now hctx can be mapped to possible CPUs, this patch applies this fact
to simplify queue mapping & schedule so that we don't need to handle
CPU hotplug for dealing with physical CPU plug & unplug. With this
simplification, we can work well on physical CPU plug & unplug, which
is a normal use case for VM at least.

Make sure we allocate blk_mq_ctx structures for all possible CPUs, and
set hctx->numa_node for possible CPUs which are mapped to this hctx. And
only choose the online CPUs for schedule.

Reported-by: Christian Borntraeger 
Tested-by: Christian Borntraeger 
Tested-by: Stefan Haberland 
Cc: Thomas Gleixner 
Signed-off-by: Christoph Hellwig 
(merged the three into one because any single one may not work, and fix
selecting online CPUs for scheduler)
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8000ba6db07d..ef9beca2d117 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
blk_queue_exit(q);
return ERR_PTR(-EXDEV);
}
-   cpu = cpumask_first(alloc_data.hctx->cpumask);
+   cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
 
rq = blk_mq_get_request(q, NULL, op, &alloc_data);
@@ -1323,9 +1323,10 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
if (--hctx->next_cpu_batch <= 0) {
int next_cpu;
 
-   next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
+   next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
+   cpu_online_mask);
if (next_cpu >= nr_cpu_ids)
-   next_cpu = cpumask_first(hctx->cpumask);
+   next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
 
hctx->next_cpu = next_cpu;
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
@@ -2219,16 +2220,11 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
INIT_LIST_HEAD(&__ctx->rq_list);
__ctx->queue = q;
 
-   /* If the cpu isn't present, the cpu is mapped to first hctx */
-   if (!cpu_present(i))
-   continue;
-
-   hctx = blk_mq_map_queue(q, i);
-
/*
 * Set local node, IFF we have more than one hw queue. If
 * not, we remain on the home node of the device
 */
+   hctx = blk_mq_map_queue(q, i);
if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
hctx->numa_node = local_memory_node(cpu_to_node(i));
}
@@ -2285,7 +2281,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 *
 * If the cpu isn't present, the cpu is mapped to first hctx.
 */
-   for_each_present_cpu(i) {
+   for_each_possible_cpu(i) {
hctx_idx = q->mq_map[i];
/* unmapped hw queue can be remapped after CPU topo changed */
if (!set->tags[hctx_idx] &&
@@ -2339,7 +2335,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
/*
 * Initialize batch roundrobin counts
 */
-   hctx->next_cpu = cpumask_first(hctx->cpumask);
+   hctx->next_cpu = cpumask_first_and(hctx->cpumask,
+   cpu_online_mask);
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}
 }
-- 
2.9.5