subject:"Hotplug"

On Mon, Oct 21, 2019 at 03:02:56PM +0100, John Garry wrote:
> On 21/10/2019 13:53, Ming Lei wrote:
> > On Mon, Oct 21, 2019 at 12:49:53PM +0100, John Garry wrote:
> > > > > > 
> > > > > 
> > > > > Yes, we share tags among all queues, but we generate the tag - known 
> > > > > as IPTT
> > > > > - in the LLDD now, as we can no longer use the request tag (as it is 
> > > > > not
> > > > > unique per all queues):
> > > > > 
> > > > > https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188
> > > > > 
> > > > > As I said, the branch is messy and I did have to fix 087b95af374.
> > > > 
> > > > Firstly this way may waste lots of memory, especially the queue depth is
> > > > big, such as, hisilicon V3's queue depth is 4096.
> > > > 
> > > > Secondly, you have to deal with queue busy efficiently and correctly,
> > > > for example, your real hw tags(IPTT) can be used up easily, and how
> > > > will you handle these dispatched request?
> > > 
> > > I have not seen scenario of exhausted IPTT. And IPTT count is same as SCSI
> > > host.can_queue, so SCSI midlayer should ensure that this does not occur.
> > 
> 
> Hi Ming,
> 
> > That check isn't correct, and each hw queue should have allowed
> > .can_queue in-flight requests.
> 
> There always seems to be some confusion or disagreement on this topic.
> 
> I work according to the comment in scsi_host.h:
> 
> "Note: it is assumed that each hardware queue has a queue depth of
>  can_queue. In other words, the total queue depth per host
>  is nr_hw_queues * can_queue."
> 
> So I set Scsi_host.can_queue = HISI_SAS_MAX_COMMANDS (=4096)

I believe all current drivers set .can_queue as single hw queue's depth.
If you set .can_queue as HISI_SAS_MAX_COMMANDS which is HBA's queue
depth, the hisilicon sas driver will HISI_SAS_MAX_COMMANDS * nr_hw_queues
in-flight requests.

> 
> > 
> > > 
> > > > 
> > > > Finally, you have to evaluate the performance effect, this is highly
> > > > related with how to deal with out-of-IPTT.
> > > 
> > > Some figures from our previous testing:
> > > 
> > > Managed interrupt without exposing multiple queues: 3M IOPs
> > > Managed interrupt with exposing multiple queues: 2.6M IOPs
> > 
> > Then you see the performance regression.
> 
> Let's discuss this when I send the patches, so we don't get sidetracked on
> this blk-mq improvement topic.

OK, what I meant is to use correct driver to test the patches, otherwise
it might be hard to investigate.


Thanks,
Ming

Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug

On 21/10/2019 13:53, Ming Lei wrote:

On Mon, Oct 21, 2019 at 12:49:53PM +0100, John Garry wrote:

Yes, we share tags among all queues, but we generate the tag - known as IPTT
- in the LLDD now, as we can no longer use the request tag (as it is not
unique per all queues):

https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188

As I said, the branch is messy and I did have to fix 087b95af374.

Firstly this way may waste lots of memory, especially the queue depth is
big, such as, hisilicon V3's queue depth is 4096.

Secondly, you have to deal with queue busy efficiently and correctly,
for example, your real hw tags(IPTT) can be used up easily, and how
will you handle these dispatched request?

I have not seen scenario of exhausted IPTT. And IPTT count is same as SCSI
host.can_queue, so SCSI midlayer should ensure that this does not occur.

Hi Ming,

That check isn't correct, and each hw queue should have allowed
.can_queue in-flight requests.

There always seems to be some confusion or disagreement on this topic.

I work according to the comment in scsi_host.h:

"Note: it is assumed that each hardware queue has a queue depth of
can_queue. In other words, the total queue depth per host
is nr_hw_queues * can_queue."

So I set Scsi_host.can_queue = HISI_SAS_MAX_COMMANDS (=4096)

Finally, you have to evaluate the performance effect, this is highly
related with how to deal with out-of-IPTT.

Some figures from our previous testing:

Managed interrupt without exposing multiple queues: 3M IOPs
Managed interrupt with exposing multiple queues: 2.6M IOPs

Then you see the performance regression.

Let's discuss this when I send the patches, so we don't get sidetracked
on this blk-mq improvement topic.

Thanks,
John

Thanks,
Ming

Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug

On Mon, Oct 21, 2019 at 12:49:53PM +0100, John Garry wrote:
> > > > 
> > > 
> > > Yes, we share tags among all queues, but we generate the tag - known as 
> > > IPTT
> > > - in the LLDD now, as we can no longer use the request tag (as it is not
> > > unique per all queues):
> > > 
> > > https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188
> > > 
> > > As I said, the branch is messy and I did have to fix 087b95af374.
> > 
> > Firstly this way may waste lots of memory, especially the queue depth is
> > big, such as, hisilicon V3's queue depth is 4096.
> > 
> > Secondly, you have to deal with queue busy efficiently and correctly,
> > for example, your real hw tags(IPTT) can be used up easily, and how
> > will you handle these dispatched request?
> 
> I have not seen scenario of exhausted IPTT. And IPTT count is same as SCSI
> host.can_queue, so SCSI midlayer should ensure that this does not occur.

That check isn't correct, and each hw queue should have allowed
.can_queue in-flight requests.

> 
> > 
> > Finally, you have to evaluate the performance effect, this is highly
> > related with how to deal with out-of-IPTT.
> 
> Some figures from our previous testing:
> 
> Managed interrupt without exposing multiple queues: 3M IOPs
> Managed interrupt with exposing multiple queues: 2.6M IOPs

Then you see the performance regression.


Thanks,
Ming

Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug

Yes, we share tags among all queues, but we generate the tag - known as IPTT
- in the LLDD now, as we can no longer use the request tag (as it is not
unique per all queues):

https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188

As I said, the branch is messy and I did have to fix 087b95af374.

Firstly this way may waste lots of memory, especially the queue depth is
big, such as, hisilicon V3's queue depth is 4096.

Secondly, you have to deal with queue busy efficiently and correctly,
for example, your real hw tags(IPTT) can be used up easily, and how
will you handle these dispatched request?

I have not seen scenario of exhausted IPTT. And IPTT count is same as
SCSI host.can_queue, so SCSI midlayer should ensure that this does not
occur.

Finally, you have to evaluate the performance effect, this is highly
related with how to deal with out-of-IPTT.

Some figures from our previous testing:

Managed interrupt without exposing multiple queues: 3M IOPs
Managed interrupt with exposing multiple queues: 2.6M IOPs
No managed interrupts: 500K IOPs.

Now my problem is that I hear some people are relying on the 3M
performance, even though it is experimental and unsafe. I need to follow
up on this. I really don't want to keep that code.

I'd suggest you to fix the stuff and post them out for review.

OK, I'll also check on adding that WARN you provided.

Thanks again,
John

Thanks,
Ming

Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug

On Mon, Oct 21, 2019 at 10:47:05AM +0100, John Garry wrote:
> On 21/10/2019 10:34, Ming Lei wrote:
> > On Mon, Oct 21, 2019 at 10:19:18AM +0100, John Garry wrote:
> > > On 20/10/2019 11:14, Ming Lei wrote:
> > > > > > ght? If so, I need to find some simple sysfs entry to
> > > > > > > > tell me of this occurrence, to trigger the capture. Or add 
> > > > > > > > something. My
> > > > > > > > script is pretty dump.
> > > > > > > > 
> > > > > > > > BTW, I did notice that we the dump_stack in 
> > > > > > > > __blk_mq_run_hw_queue()
> > > > > > > > pretty soon before the problem happens - maybe a clue or maybe 
> > > > > > > > coincidence.
> > > > > > > > 
> > > > > > 
> > > > > > I finally got to capture that debugfs dump at the point the SCSI IOs
> > > > > > timeout, as attached. Let me know if any problem receiving it.
> > > > > > 
> > > > > > Here's a kernel log snippet at that point (I added some print for 
> > > > > > the
> > > > > > timeout):
> > > > > > 
> > > > > > 609] psci: CPU6 killed.
> > > > > > [  547.722217] CPU5: shutdown
> > > > > > [  547.724926] psci: CPU5 killed.
> > > > > > [  547.749951] irq_shutdown
> > > > > > [  547.752701] IRQ 800: no longer affine to CPU4
> > > > > > [  547.757265] CPU4: shutdown
> > > > > > [  547.759971] psci: CPU4 killed.
> > > > > > [  547.790348] CPU3: shutdown
> > > > > > [  547.793052] psci: CPU3 killed.
> > > > > > [  547.818330] CPU2: shutdown
> > > > > > [  547.821033] psci: CPU2 killed.
> > > > > > [  547.854285] CPU1: shutdown
> > > > > > [  547.856989] psci: CPU1 killed.
> > > > > > [  575.925307] scsi_timeout req=0x0023b0dd9c00 reserved=0
> > > > > > [  575.930794] scsi_timeout req=0x0023b0df2700 reserved=0
> > > > > From the debugfs log, 66 requests are dumped, and 63 of them has
> > > > been submitted to device, and the other 3 is in ->dispatch list
> > > > via requeue after timeout is handled.
> > > > 
> > > 
> > > Hi Ming,
> > > 
> > > > You mentioned that:
> > > > 
> > > > " - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for 
> > > > when
> > > >  inflights rqs !=0, and I don't see them for this timeout"
> > > > 
> > > > There might be two reasons:
> > > > 
> > > > 1) You are still testing a multiple reply-queue device?
> > > 
> > > As before, I am testing by exposing mutliple queues to the SCSI midlayer. 
> > > I
> > > had to make this change locally, as on mainline we still only expose a
> > > single queue and use the internal reply queue when enabling managed
> > > interrupts.
> > > 
> > > As I
> > > > mentioned last times, it is hard to map reply-queue into blk-mq
> > > > hctx correctly.
> > > 
> > > Here's my branch, if you want to check:
> > > 
> > > https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4
> > > 
> > > It's a bit messy (sorry), but you can see that the reply-queue in the LLDD
> > > is removed in commit 087b95af374.
> > > 
> > > I am now thinking of actually making this change to the LLDD in mainline 
> > > to
> > > avoid any doubt in future.
> > 
> > As I mentioned last time, you do share tags among all MQ queues on your 
> > hardware
> > given your hardware is actually SQ HBA, so commit 087b95af374 is definitely
> > wrong, isn't it?
> > 
> 
> Yes, we share tags among all queues, but we generate the tag - known as IPTT
> - in the LLDD now, as we can no longer use the request tag (as it is not
> unique per all queues):
> 
> https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188
> 
> As I said, the branch is messy and I did have to fix 087b95af374.

Firstly this way may waste lots of memory, especially the queue depth is
big, such as, hisilicon V3's queue depth is 4096.

Secondly, you have to deal with queue busy efficiently and correctly,
for example, your real hw tags(IPTT) can be used up easily, and how
will you handle these dispatched request?

Finally, you have to evaluate the performance effect, this is highly
related with how to deal with out-of-IPTT.

I'd suggest you to fix the stuff and post them out for review.

Thanks,
Ming

Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug


On 21/10/2019 10:34, Ming Lei wrote:

On Mon, Oct 21, 2019 at 10:19:18AM +0100, John Garry wrote:

On 20/10/2019 11:14, Ming Lei wrote:

ght? If so, I need to find some simple sysfs entry to

tell me of this occurrence, to trigger the capture. Or add something. My
script is pretty dump.

BTW, I did notice that we the dump_stack in __blk_mq_run_hw_queue()
pretty soon before the problem happens - maybe a clue or maybe coincidence.



I finally got to capture that debugfs dump at the point the SCSI IOs
timeout, as attached. Let me know if any problem receiving it.

Here's a kernel log snippet at that point (I added some print for the
timeout):

609] psci: CPU6 killed.
[  547.722217] CPU5: shutdown
[  547.724926] psci: CPU5 killed.
[  547.749951] irq_shutdown
[  547.752701] IRQ 800: no longer affine to CPU4
[  547.757265] CPU4: shutdown
[  547.759971] psci: CPU4 killed.
[  547.790348] CPU3: shutdown
[  547.793052] psci: CPU3 killed.
[  547.818330] CPU2: shutdown
[  547.821033] psci: CPU2 killed.
[  547.854285] CPU1: shutdown
[  547.856989] psci: CPU1 killed.
[  575.925307] scsi_timeout req=0x0023b0dd9c00 reserved=0
[  575.930794] scsi_timeout req=0x0023b0df2700 reserved=0

From the debugfs log, 66 requests are dumped, and 63 of them has

been submitted to device, and the other 3 is in ->dispatch list
via requeue after timeout is handled.



Hi Ming,


You mentioned that:

" - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
 inflights rqs !=0, and I don't see them for this timeout"

There might be two reasons:

1) You are still testing a multiple reply-queue device?


As before, I am testing by exposing mutliple queues to the SCSI midlayer. I
had to make this change locally, as on mainline we still only expose a
single queue and use the internal reply queue when enabling managed
interrupts.

As I

mentioned last times, it is hard to map reply-queue into blk-mq
hctx correctly.


Here's my branch, if you want to check:

https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4

It's a bit messy (sorry), but you can see that the reply-queue in the LLDD
is removed in commit 087b95af374.

I am now thinking of actually making this change to the LLDD in mainline to
avoid any doubt in future.


As I mentioned last time, you do share tags among all MQ queues on your hardware
given your hardware is actually SQ HBA, so commit 087b95af374 is definitely
wrong, isn't it?



Yes, we share tags among all queues, but we generate the tag - known as 
IPTT - in the LLDD now, as we can no longer use the request tag (as it 
is not unique per all queues):


https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188

As I said, the branch is messy and I did have to fix 087b95af374.


It can be very hard to partition the single tags among multiple hctx.



I really do think now that I'll make this change on mainline to avoid 
doubt...


Thanks,
John



Thanks,
Ming


.

Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug

On Mon, Oct 21, 2019 at 10:19:18AM +0100, John Garry wrote:
> On 20/10/2019 11:14, Ming Lei wrote:
> > > > ght? If so, I need to find some simple sysfs entry to
> > > > > > tell me of this occurrence, to trigger the capture. Or add 
> > > > > > something. My
> > > > > > script is pretty dump.
> > > > > >
> > > > > > BTW, I did notice that we the dump_stack in __blk_mq_run_hw_queue()
> > > > > > pretty soon before the problem happens - maybe a clue or maybe 
> > > > > > coincidence.
> > > > > >
> > > >
> > > > I finally got to capture that debugfs dump at the point the SCSI IOs
> > > > timeout, as attached. Let me know if any problem receiving it.
> > > >
> > > > Here's a kernel log snippet at that point (I added some print for the
> > > > timeout):
> > > >
> > > > 609] psci: CPU6 killed.
> > > > [  547.722217] CPU5: shutdown
> > > > [  547.724926] psci: CPU5 killed.
> > > > [  547.749951] irq_shutdown
> > > > [  547.752701] IRQ 800: no longer affine to CPU4
> > > > [  547.757265] CPU4: shutdown
> > > > [  547.759971] psci: CPU4 killed.
> > > > [  547.790348] CPU3: shutdown
> > > > [  547.793052] psci: CPU3 killed.
> > > > [  547.818330] CPU2: shutdown
> > > > [  547.821033] psci: CPU2 killed.
> > > > [  547.854285] CPU1: shutdown
> > > > [  547.856989] psci: CPU1 killed.
> > > > [  575.925307] scsi_timeout req=0x0023b0dd9c00 reserved=0
> > > > [  575.930794] scsi_timeout req=0x0023b0df2700 reserved=0
> > > From the debugfs log, 66 requests are dumped, and 63 of them has
> > been submitted to device, and the other 3 is in ->dispatch list
> > via requeue after timeout is handled.
> > 
> 
> Hi Ming,
> 
> > You mentioned that:
> > 
> > " - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
> >  inflights rqs !=0, and I don't see them for this timeout"
> > 
> > There might be two reasons:
> > 
> > 1) You are still testing a multiple reply-queue device?
> 
> As before, I am testing by exposing mutliple queues to the SCSI midlayer. I
> had to make this change locally, as on mainline we still only expose a
> single queue and use the internal reply queue when enabling managed
> interrupts.
> 
> As I
> > mentioned last times, it is hard to map reply-queue into blk-mq
> > hctx correctly.
> 
> Here's my branch, if you want to check:
> 
> https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4
> 
> It's a bit messy (sorry), but you can see that the reply-queue in the LLDD
> is removed in commit 087b95af374.
> 
> I am now thinking of actually making this change to the LLDD in mainline to
> avoid any doubt in future.

As I mentioned last time, you do share tags among all MQ queues on your hardware
given your hardware is actually SQ HBA, so commit 087b95af374 is definitely
wrong, isn't it?

It can be very hard to partition the single tags among multiple hctx.


Thanks,
Ming

Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug

On 20/10/2019 11:14, Ming Lei wrote:

ght? If so, I need to find some simple sysfs entry to
> > tell me of this occurrence, to trigger the capture. Or add something. My
> > script is pretty dump.
> >
> > BTW, I did notice that we the dump_stack in __blk_mq_run_hw_queue()
> > pretty soon before the problem happens - maybe a clue or maybe coincidence.
> >

>
> I finally got to capture that debugfs dump at the point the SCSI IOs
> timeout, as attached. Let me know if any problem receiving it.
>
> Here's a kernel log snippet at that point (I added some print for the
> timeout):
>
> 609] psci: CPU6 killed.
> [  547.722217] CPU5: shutdown
> [  547.724926] psci: CPU5 killed.
> [  547.749951] irq_shutdown
> [  547.752701] IRQ 800: no longer affine to CPU4
> [  547.757265] CPU4: shutdown
> [  547.759971] psci: CPU4 killed.
> [  547.790348] CPU3: shutdown
> [  547.793052] psci: CPU3 killed.
> [  547.818330] CPU2: shutdown
> [  547.821033] psci: CPU2 killed.
> [  547.854285] CPU1: shutdown
> [  547.856989] psci: CPU1 killed.
> [  575.925307] scsi_timeout req=0x0023b0dd9c00 reserved=0
> [  575.930794] scsi_timeout req=0x0023b0df2700 reserved=0
From the debugfs log, 66 requests are dumped, and 63 of them has

been submitted to device, and the other 3 is in ->dispatch list
via requeue after timeout is handled.

Hi Ming,

You mentioned that:

" - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
 inflights rqs !=0, and I don't see them for this timeout"

There might be two reasons:

1) You are still testing a multiple reply-queue device?

As before, I am testing by exposing mutliple queues to the SCSI 
midlayer. I had to make this change locally, as on mainline we still 
only expose a single queue and use the internal reply queue when 
enabling managed interrupts.

As I

mentioned last times, it is hard to map reply-queue into blk-mq
hctx correctly.

Here's my branch, if you want to check:

https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4

It's a bit messy (sorry), but you can see that the reply-queue in the 
LLDD is removed in commit 087b95af374.

I am now thinking of actually making this change to the LLDD in mainline 
to avoid any doubt in future.

2) concurrent dispatch to device, which can be observed by the
following patch.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 06081966549f..3590f6f947eb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -679,6 +679,8 @@ void blk_mq_start_request(struct request *rq)
 {
struct request_queue *q = rq->q;

+   WARN_ON_ONCE(test_bit(BLK_MQ_S_INTERNAL_STOPPED, &rq->mq_hctx->state));
+
trace_block_rq_issue(q, rq);

if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {

However, I think it is hard to be 2#, since the current CPU is the last
CPU in hctx->cpu_mask.

I'll try it.

Thanks as always,
John

Thanks,
Ming

.

Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-20 Thread Ming Lei

On Thu, Oct 17, 2019 at 04:40:12PM +0100, John Garry wrote:
> On 16/10/2019 17:19, John Garry wrote:
> > On 16/10/2019 13:07, Ming Lei wrote:
> > > On Wed, Oct 16, 2019 at 09:58:27AM +0100, John Garry wrote:
> > > > On 14/10/2019 02:50, Ming Lei wrote:
> > > > > Hi,
> > > > > 
> > > > > Thomas mentioned:
> > > > > "
> > > > >  That was the constraint of managed interrupts from the very
> > > > > beginning:
> > > > > 
> > > > >   The driver/subsystem has to quiesce the interrupt line and the
> > > > > associated
> > > > >   queue _before_ it gets shutdown in CPU unplug and not fiddle
> > > > > with it
> > > > >   until it's restarted by the core when the CPU is plugged in
> > > > > again.
> > > > > "
> > > > > 
> > > > > But no drivers or blk-mq do that before one hctx becomes dead(all
> > > > > CPUs for one hctx are offline), and even it is worse, blk-mq stills
> > > > > tries
> > > > > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> > > > > 
> > > > > This patchset tries to address the issue by two stages:
> > > > > 
> > > > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> > > > > 
> > > > > - mark the hctx as internal stopped, and drain all in-flight requests
> > > > > if the hctx is going to be dead.
> > > > > 
> > > > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
> > > > > becomes dead
> > > > > 
> > > > > - steal bios from the request, and resubmit them via
> > > > > generic_make_request(),
> > > > > then these IO will be mapped to other live hctx for dispatch
> > > > > 
> > > > > Please comment & review, thanks!
> > > > > 
> > > > > John, I don't add your tested-by tag since V3 have some changes,
> > > > > and I appreciate if you may run your test on V3.
> > > > 
> > > > Hi Ming,
> > > > 
> > > > So I got around to doing some testing. The good news is that issue
> > > > which we
> > > > were experiencing in v3 series seems to have has gone away - alot more
> > > > stable.
> > > > 
> > > > However, unfortunately, I did notice some SCSI timeouts:
> > > > 
> > > > 15508.615074] CPU2: shutdown
> > > > [15508.617778] psci: CPU2 killed.
> > > > [15508.651220] CPU1: shutdown
> > > > [15508.653924] psci: CPU1 killed.
> > > > [15518.406229] sas: Enter sas_scsi_recover_host busy: 63 failed: 63
> > > > Jobs: 1 (f=1): [R] [0.0% done] [0[15518.412239] sas: sas_scsi_find_task:
> > > > aborting task 0xa7159744
> > > > KB/0KB/0KB /s] [0/0/0 iops] [eta [15518.421708] sas:
> > > > sas_eh_handle_sas_errors: task 0xa7159744 is done
> > > > [15518.431266] sas: sas_scsi_find_task: aborting task 0xd39731eb
> > > > [15518.442539] sas: sas_eh_handle_sas_errors: task 0xd39731eb is
> > > > done
> > > > [15518.449407] sas: sas_scsi_find_task: aborting task 0x9f77c9bd
> > > > [15518.455899] sas: sas_eh_handle_sas_errors: task 0x9f77c9bd is
> > > > done
> > > > 
> > > > A couple of things to note:
> > > > - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
> > > > inflights rqs !=0, and I don't see them for this timeout
> > > > - 0 datarate reported from fio
> > > > 
> > > > I'll have a look...
> > > 
> > > What is the output of the following command?
> > > 
> > > (cd /sys/kernel/debug/block/$SAS_DISK && find . -type f -exec grep -aH
> > > . {} \;)
> > I assume that you want this run at about the time SCSI EH kicks in for
> > the timeout, right? If so, I need to find some simple sysfs entry to
> > tell me of this occurrence, to trigger the capture. Or add something. My
> > script is pretty dump.
> > 
> > BTW, I did notice that we the dump_stack in __blk_mq_run_hw_queue()
> > pretty soon before the problem happens - maybe a clue or maybe coincidence.
> > 
> 
> I finally got to capture that debugfs dump at the point the SCSI IOs
> timeout, as attached. Let me know if any problem receiving it.
> 
> Here's a kernel log snippet at that point (I added some print for the
> timeout):
> 
> 609] psci: CPU6 killed.
> [  547.722217] CPU5: shutdown
> [  547.724926] psci: CPU5 killed.
> [  547.749951] irq_shutdown
> [  547.752701] IRQ 800: no longer affine to CPU4
> [  547.757265] CPU4: shutdown
> [  547.759971] psci: CPU4 killed.
> [  547.790348] CPU3: shutdown
> [  547.793052] psci: CPU3 killed.
> [  547.818330] CPU2: shutdown
> [  547.821033] psci: CPU2 killed.
> [  547.854285] CPU1: shutdown
> [  547.856989] psci: CPU1 killed.
> [  575.925307] scsi_timeout req=0x0023b0dd9c00 reserved=0
> [  575.930794] scsi_timeout req=0x0023b0df2700 reserved=0

>From the debugfs log, 66 requests are dumped, and 63 of them has
been submitted to device, and the other 3 is in ->dispatch list
via requeue after timeout is handled.

You mentioned that:

" - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
 inflights rqs !=0, and I don't see them for this timeout"

There might be two reasons:

1) You are still testing a multiple reply-queue device? As I
mentioned last times, it is hard to map

Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-16 Thread John Garry


On 16/10/2019 13:07, Ming Lei wrote:

On Wed, Oct 16, 2019 at 09:58:27AM +0100, John Garry wrote:

On 14/10/2019 02:50, Ming Lei wrote:

Hi,

Thomas mentioned:
"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!

John, I don't add your tested-by tag since V3 have some changes,
and I appreciate if you may run your test on V3.


Hi Ming,

So I got around to doing some testing. The good news is that issue which we
were experiencing in v3 series seems to have has gone away - alot more
stable.

However, unfortunately, I did notice some SCSI timeouts:

15508.615074] CPU2: shutdown
[15508.617778] psci: CPU2 killed.
[15508.651220] CPU1: shutdown
[15508.653924] psci: CPU1 killed.
[15518.406229] sas: Enter sas_scsi_recover_host busy: 63 failed: 63
Jobs: 1 (f=1): [R] [0.0% done] [0[15518.412239] sas: sas_scsi_find_task:
aborting task 0xa7159744
KB/0KB/0KB /s] [0/0/0 iops] [eta [15518.421708] sas:
sas_eh_handle_sas_errors: task 0xa7159744 is done
[15518.431266] sas: sas_scsi_find_task: aborting task 0xd39731eb
[15518.442539] sas: sas_eh_handle_sas_errors: task 0xd39731eb is
done
[15518.449407] sas: sas_scsi_find_task: aborting task 0x9f77c9bd
[15518.455899] sas: sas_eh_handle_sas_errors: task 0x9f77c9bd is
done

A couple of things to note:
- I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
inflights rqs !=0, and I don't see them for this timeout
- 0 datarate reported from fio

I'll have a look...


What is the output of the following command?

(cd /sys/kernel/debug/block/$SAS_DISK && find . -type f -exec grep -aH . {} \;)
I assume that you want this run at about the time SCSI EH kicks in for 
the timeout, right? If so, I need to find some simple sysfs entry to 
tell me of this occurrence, to trigger the capture. Or add something. My 
script is pretty dump.


BTW, I did notice that we the dump_stack in __blk_mq_run_hw_queue() 
pretty soon before the problem happens - maybe a clue or maybe coincidence.


Thanks,
John



Thanks,
Ming

.

Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-16 Thread Ming Lei

On Wed, Oct 16, 2019 at 09:58:27AM +0100, John Garry wrote:
> On 14/10/2019 02:50, Ming Lei wrote:
> > Hi,
> > 
> > Thomas mentioned:
> > "
> >  That was the constraint of managed interrupts from the very beginning:
> > 
> >   The driver/subsystem has to quiesce the interrupt line and the 
> > associated
> >   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> >   until it's restarted by the core when the CPU is plugged in again.
> > "
> > 
> > But no drivers or blk-mq do that before one hctx becomes dead(all
> > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> > 
> > This patchset tries to address the issue by two stages:
> > 
> > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> > 
> > - mark the hctx as internal stopped, and drain all in-flight requests
> > if the hctx is going to be dead.
> > 
> > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes 
> > dead
> > 
> > - steal bios from the request, and resubmit them via generic_make_request(),
> > then these IO will be mapped to other live hctx for dispatch
> > 
> > Please comment & review, thanks!
> > 
> > John, I don't add your tested-by tag since V3 have some changes,
> > and I appreciate if you may run your test on V3.
> 
> Hi Ming,
> 
> So I got around to doing some testing. The good news is that issue which we
> were experiencing in v3 series seems to have has gone away - alot more
> stable.
> 
> However, unfortunately, I did notice some SCSI timeouts:
> 
> 15508.615074] CPU2: shutdown
> [15508.617778] psci: CPU2 killed.
> [15508.651220] CPU1: shutdown
> [15508.653924] psci: CPU1 killed.
> [15518.406229] sas: Enter sas_scsi_recover_host busy: 63 failed: 63
> Jobs: 1 (f=1): [R] [0.0% done] [0[15518.412239] sas: sas_scsi_find_task:
> aborting task 0xa7159744
> KB/0KB/0KB /s] [0/0/0 iops] [eta [15518.421708] sas:
> sas_eh_handle_sas_errors: task 0xa7159744 is done
> [15518.431266] sas: sas_scsi_find_task: aborting task 0xd39731eb
> [15518.442539] sas: sas_eh_handle_sas_errors: task 0xd39731eb is
> done
> [15518.449407] sas: sas_scsi_find_task: aborting task 0x9f77c9bd
> [15518.455899] sas: sas_eh_handle_sas_errors: task 0x9f77c9bd is
> done
> 
> A couple of things to note:
> - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
> inflights rqs !=0, and I don't see them for this timeout
> - 0 datarate reported from fio
> 
> I'll have a look...

What is the output of the following command?

(cd /sys/kernel/debug/block/$SAS_DISK && find . -type f -exec grep -aH . {} \;)

Thanks,
Ming

Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-16 Thread John Garry


On 14/10/2019 02:50, Ming Lei wrote:

Hi,

Thomas mentioned:
"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!

John, I don't add your tested-by tag since V3 have some changes,
and I appreciate if you may run your test on V3.


Hi Ming,

So I got around to doing some testing. The good news is that issue which 
we were experiencing in v3 series seems to have has gone away - alot 
more stable.


However, unfortunately, I did notice some SCSI timeouts:

15508.615074] CPU2: shutdown
[15508.617778] psci: CPU2 killed.
[15508.651220] CPU1: shutdown
[15508.653924] psci: CPU1 killed.
[15518.406229] sas: Enter sas_scsi_recover_host busy: 63 failed: 63
Jobs: 1 (f=1): [R] [0.0% done] [0[15518.412239] sas: sas_scsi_find_task: 
aborting task 0xa7159744
KB/0KB/0KB /s] [0/0/0 iops] [eta [15518.421708] sas: 
sas_eh_handle_sas_errors: task 0xa7159744 is done

[15518.431266] sas: sas_scsi_find_task: aborting task 0xd39731eb
[15518.442539] sas: sas_eh_handle_sas_errors: task 0xd39731eb is 
done

[15518.449407] sas: sas_scsi_find_task: aborting task 0x9f77c9bd
[15518.455899] sas: sas_eh_handle_sas_errors: task 0x9f77c9bd is 
done


A couple of things to note:
- I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when 
inflights rqs !=0, and I don't see them for this timeout

- 0 datarate reported from fio

I'll have a look...

Thanks,
John



V4:
- resubmit IOs in dispatch list in case that this hctx is dead

V3:
- re-organize patch 2 & 3 a bit for addressing Hannes's comment
- fix patch 4 for avoiding potential deadlock, as found by Hannes

V2:
- patch4 & patch 5 in V1 have been merged to block tree, so remove
  them
- address comments from John Garry and Minwoo



Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: prepare for draining IO when hctx's all CPUs are offline
  blk-mq: stop to handle IO and drain IO before hctx becomes dead
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
is dead

 block/blk-mq-debugfs.c |   2 +
 block/blk-mq-tag.c |   2 +-
 block/blk-mq-tag.h |   2 +
 block/blk-mq.c | 137 ++---
 block/blk-mq.h |   3 +-
 drivers/block/loop.c   |   2 +-
 drivers/md/dm-rq.c |   2 +-
 include/linux/blk-mq.h |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 141 insertions(+), 15 deletions(-)

Cc: John Garry 
Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Christoph Hellwig 
Cc: Thomas Gleixner 
Cc: Keith Busch

Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-14 Thread John Garry


On 14/10/2019 02:25, Ming Lei wrote:

On Fri, Oct 11, 2019 at 10:10 PM John Garry  wrote:


On 11/10/2019 12:55, Ming Lei wrote:

On Fri, Oct 11, 2019 at 4:54 PM John Garry  wrote:


On 10/10/2019 12:21, John Garry wrote:




As discussed before, tags of hisilicon V3 is HBA wide. If you switch
to real hw queue, each hw queue has to own its independent tags.
However, that isn't supported by V3 hardware.


I am generating the tag internally in the driver now, so that hostwide
tags issue should not be an issue.

And, to be clear, I am not paying too much attention to performance, but
rather just hotplugging while running IO.

An update on testing:
I did some scripted overnight testing. The script essentially loops like
this:
- online all CPUS
- run fio binded on a limited bunch of CPUs to cover a hctx mask for 1
minute
- offline those CPUs
- wait 1 minute (> SCSI or NVMe timeout)
- and repeat

SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding
some fio processes never dying with IOPS @ 0. I don't see any NVMe
timeout reported. Did you do any NVMe testing of this sort?



Yeah, so for NVMe, I see some sort of regression, like this:
Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
1158037877d:17h:18m:22s]


I can reproduce this issue, and looks there are requests in ->dispatch.


OK, that may match with what I see:
- the problem occuring coincides with this callpath with
BLK_MQ_S_INTERNAL_STOPPED set:


Good catch, these requests should have been re-submitted in
blk_mq_hctx_notify_dead() too.

Will do it in V4.


OK, I'll have a look at v4 and retest - it may take a while as testing 
this is slow...


All the best,
John



Thanks,
Ming Lei

.

[PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-13 Thread Ming Lei

Hi,

Thomas mentioned:
"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!

John, I don't add your tested-by tag since V3 have some changes,
and I appreciate if you may run your test on V3.

V4:
- resubmit IOs in dispatch list in case that this hctx is dead 

V3:
- re-organize patch 2 & 3 a bit for addressing Hannes's comment
- fix patch 4 for avoiding potential deadlock, as found by Hannes

V2:
- patch4 & patch 5 in V1 have been merged to block tree, so remove
  them
- address comments from John Garry and Minwoo



Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: prepare for draining IO when hctx's all CPUs are offline
  blk-mq: stop to handle IO and drain IO before hctx becomes dead
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
is dead

 block/blk-mq-debugfs.c |   2 +
 block/blk-mq-tag.c |   2 +-
 block/blk-mq-tag.h |   2 +
 block/blk-mq.c | 137 ++---
 block/blk-mq.h |   3 +-
 drivers/block/loop.c   |   2 +-
 drivers/md/dm-rq.c |   2 +-
 include/linux/blk-mq.h |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 141 insertions(+), 15 deletions(-)

Cc: John Garry 
Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Christoph Hellwig 
Cc: Thomas Gleixner 
Cc: Keith Busch 

-- 
2.20.1

Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-13 Thread Ming Lei

On Fri, Oct 11, 2019 at 10:10 PM John Garry  wrote:
>
> On 11/10/2019 12:55, Ming Lei wrote:
> > On Fri, Oct 11, 2019 at 4:54 PM John Garry  wrote:
> >>
> >> On 10/10/2019 12:21, John Garry wrote:
> >>>
> 
>  As discussed before, tags of hisilicon V3 is HBA wide. If you switch
>  to real hw queue, each hw queue has to own its independent tags.
>  However, that isn't supported by V3 hardware.
> >>>
> >>> I am generating the tag internally in the driver now, so that hostwide
> >>> tags issue should not be an issue.
> >>>
> >>> And, to be clear, I am not paying too much attention to performance, but
> >>> rather just hotplugging while running IO.
> >>>
> >>> An update on testing:
> >>> I did some scripted overnight testing. The script essentially loops like
> >>> this:
> >>> - online all CPUS
> >>> - run fio binded on a limited bunch of CPUs to cover a hctx mask for 1
> >>> minute
> >>> - offline those CPUs
> >>> - wait 1 minute (> SCSI or NVMe timeout)
> >>> - and repeat
> >>>
> >>> SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding
> >>> some fio processes never dying with IOPS @ 0. I don't see any NVMe
> >>> timeout reported. Did you do any NVMe testing of this sort?
> >>>
> >>
> >> Yeah, so for NVMe, I see some sort of regression, like this:
> >> Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
> >> 1158037877d:17h:18m:22s]
> >
> > I can reproduce this issue, and looks there are requests in ->dispatch.
>
> OK, that may match with what I see:
> - the problem occuring coincides with this callpath with
> BLK_MQ_S_INTERNAL_STOPPED set:

Good catch, these requests should have been re-submitted in
blk_mq_hctx_notify_dead() too.

Will do it in V4.

Thanks,
Ming Lei

Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-11 Thread John Garry


On 11/10/2019 12:55, Ming Lei wrote:

On Fri, Oct 11, 2019 at 4:54 PM John Garry  wrote:


On 10/10/2019 12:21, John Garry wrote:




As discussed before, tags of hisilicon V3 is HBA wide. If you switch
to real hw queue, each hw queue has to own its independent tags.
However, that isn't supported by V3 hardware.


I am generating the tag internally in the driver now, so that hostwide
tags issue should not be an issue.

And, to be clear, I am not paying too much attention to performance, but
rather just hotplugging while running IO.

An update on testing:
I did some scripted overnight testing. The script essentially loops like
this:
- online all CPUS
- run fio binded on a limited bunch of CPUs to cover a hctx mask for 1
minute
- offline those CPUs
- wait 1 minute (> SCSI or NVMe timeout)
- and repeat

SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding
some fio processes never dying with IOPS @ 0. I don't see any NVMe
timeout reported. Did you do any NVMe testing of this sort?



Yeah, so for NVMe, I see some sort of regression, like this:
Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
1158037877d:17h:18m:22s]


I can reproduce this issue, and looks there are requests in ->dispatch.


OK, that may match with what I see:
- the problem occuring coincides with this callpath with 
BLK_MQ_S_INTERNAL_STOPPED set:


blk_mq_request_bypass_insert
(__)blk_mq_try_issue_list_directly
blk_mq_sched_insert_requests
blk_mq_flush_plug_list
blk_flush_plug_list
blk_finish_plug
blkdev_direct_IO
generic_file_read_iter
blkdev_read_iter
aio_read
io_submit_one

blk_mq_request_bypass_insert() adds to the dispatch list, and looking at 
debugfs, could this be that dispatched request sitting:

root@(none)$ more /sys/kernel/debug/block/nvme0n1/hctx18/dispatch
ac28511d {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, 
.tag=56, .internal_tag=-1}


So could there be some race here?


I am a bit busy this week, please feel free to investigate it and debugfs
can help you much. I may have time next week for looking this issue.



OK, appreciated

John


Thanks,
Ming Lei

Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-11 Thread Ming Lei

On Fri, Oct 11, 2019 at 4:54 PM John Garry  wrote:
>
> On 10/10/2019 12:21, John Garry wrote:
> >
> >>
> >> As discussed before, tags of hisilicon V3 is HBA wide. If you switch
> >> to real hw queue, each hw queue has to own its independent tags.
> >> However, that isn't supported by V3 hardware.
> >
> > I am generating the tag internally in the driver now, so that hostwide
> > tags issue should not be an issue.
> >
> > And, to be clear, I am not paying too much attention to performance, but
> > rather just hotplugging while running IO.
> >
> > An update on testing:
> > I did some scripted overnight testing. The script essentially loops like
> > this:
> > - online all CPUS
> > - run fio binded on a limited bunch of CPUs to cover a hctx mask for 1
> > minute
> > - offline those CPUs
> > - wait 1 minute (> SCSI or NVMe timeout)
> > - and repeat
> >
> > SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding
> > some fio processes never dying with IOPS @ 0. I don't see any NVMe
> > timeout reported. Did you do any NVMe testing of this sort?
> >
>
> Yeah, so for NVMe, I see some sort of regression, like this:
> Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
> 1158037877d:17h:18m:22s]

I can reproduce this issue, and looks there are requests in ->dispatch.
I am a bit busy this week, please feel free to investigate it and debugfs
can help you much. I may have time next week for looking this issue.

Thanks,
Ming Lei

Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-11 Thread John Garry


On 10/10/2019 12:21, John Garry wrote:




As discussed before, tags of hisilicon V3 is HBA wide. If you switch
to real hw queue, each hw queue has to own its independent tags.
However, that isn't supported by V3 hardware.


I am generating the tag internally in the driver now, so that hostwide
tags issue should not be an issue.

And, to be clear, I am not paying too much attention to performance, but
rather just hotplugging while running IO.

An update on testing:
I did some scripted overnight testing. The script essentially loops like
this:
- online all CPUS
- run fio binded on a limited bunch of CPUs to cover a hctx mask for 1
minute
- offline those CPUs
- wait 1 minute (> SCSI or NVMe timeout)
- and repeat

SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding
some fio processes never dying with IOPS @ 0. I don't see any NVMe
timeout reported. Did you do any NVMe testing of this sort?



Yeah, so for NVMe, I see some sort of regression, like this:
Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 
1158037877d:17h:18m:22s]


I have tested against vanilla 5.4 rc1 without problem.

If you can advise some debug to add, then I'd appreciate it. If not, 
I'll try to add some debug to the new paths introduced in this series to 
see if anything I hit coincides with the error state, so will at least 
have a hint...


Thanks,
John






See previous discussion:

https://marc.info/?t=15592886301&r=1&w=2


Thanks,
Ming

Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-10 Thread John Garry

On 10/10/2019 11:30, Ming Lei wrote:

Yes, hisi_sas. So, right, it is single queue today on mainline, but I
> manually made it multiqueue on my dev branch just to test this series.
> Otherwise I could not test it for that driver.
>
> My dev branch is here, if interested:
> https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq

Your conversion shouldn't work given you do not change .can_queue in the
patch of 'hisi_sas_v3: multiqueue support'.

Ah, I missed that, but I don't think that it will make a difference 
really since I'm only using a single disk, so I don't think that 
can_queue really comes into play. But

As discussed before, tags of hisilicon V3 is HBA wide. If you switch
to real hw queue, each hw queue has to own its independent tags.
However, that isn't supported by V3 hardware.

I am generating the tag internally in the driver now, so that hostwide 
tags issue should not be an issue.

And, to be clear, I am not paying too much attention to performance, but 
rather just hotplugging while running IO.

An update on testing:
I did some scripted overnight testing. The script essentially loops like 
this:

- online all CPUS
- run fio binded on a limited bunch of CPUs to cover a hctx mask for 1 
minute

- offline those CPUs
- wait 1 minute (> SCSI or NVMe timeout)
- and repeat

SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding 
some fio processes never dying with IOPS @ 0. I don't see any NVMe 
timeout reported. Did you do any NVMe testing of this sort?

Thanks,
John

See previous discussion:

https://marc.info/?t=15592886301&r=1&w=2

Thanks,
Ming

.

Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-10 Thread Ming Lei

On Wed, Oct 09, 2019 at 09:49:35AM +0100, John Garry wrote:
> > > > > - steal bios from the request, and resubmit them via
> > > > > generic_make_request(),
> > > > > then these IO will be mapped to other live hctx for dispatch
> > > > > 
> > > > > Please comment & review, thanks!
> > > > > 
> > > > > John, I don't add your tested-by tag since V3 have some changes,
> > > > > and I appreciate if you may run your test on V3.
> > > > > 
> > > > 
> > > > Will do, Thanks
> > > 
> > > Hi Ming,
> > > 
> > > I got this warning once:
> > > 
> > > [  162.558185] CPU10: shutdown
> > > [  162.560994] psci: CPU10 killed.
> > > [  162.593939] CPU9: shutdown
> > > [  162.596645] psci: CPU9 killed.
> > > [  162.625838] CPU8: shutdown
> > > [  162.628550] psci: CPU8 killed.
> > > [  162.685790] CPU7: shutdown
> > > [  162.688496] psci: CPU7 killed.
> > > [  162.725771] CPU6: shutdown
> > > [  162.728486] psci: CPU6 killed.
> > > [  162.753884] CPU5: shutdown
> > > [  162.756591] psci: CPU5 killed.
> > > [  162.785584] irq_shutdown
> > > [  162.788277] IRQ 800: no longer affine to CPU4
> > > [  162.793267] CPU4: shutdown
> > > [  162.795975] psci: CPU4 killed.
> > > [  162.849680] run queue from wrong CPU 13, hctx active
> > > [  162.849692] CPU3: shutdown
> > > [  162.854649] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted
> > > 5.4.0-rc1-00012-gad025dd3d001 #1098
> > > [  162.854653] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
> > > RC0 -
> > > V1.16.01 03/15/2019
> > > [  162.857362] psci: CPU3 killed.
> > > [  162.866039] Workqueue: kblockd blk_mq_run_work_fn
> > > [  162.882281] Call trace:
> > > [  162.884716]  dump_backtrace+0x0/0x150
> > > [  162.888365]  show_stack+0x14/0x20
> > > [  162.891668]  dump_stack+0xb0/0xf8
> > > [  162.894970]  __blk_mq_run_hw_queue+0x11c/0x128
> > > [  162.899400]  blk_mq_run_work_fn+0x1c/0x28
> > > [  162.903397]  process_one_work+0x1e0/0x358
> > > [  162.907393]  worker_thread+0x40/0x488
> > > [  162.911042]  kthread+0x118/0x120
> > > [  162.914257]  ret_from_fork+0x10/0x18
> > 
> > What is the HBA? If it is Hisilicon SAS, it isn't strange, because
> > this patch can't fix single hw queue with multiple private reply queue
> > yet, that can be one follow-up job of this patchset.
> > 
> 
> Yes, hisi_sas. So, right, it is single queue today on mainline, but I
> manually made it multiqueue on my dev branch just to test this series.
> Otherwise I could not test it for that driver.
> 
> My dev branch is here, if interested:
> https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq

Your conversion shouldn't work given you do not change .can_queue in the
patch of 'hisi_sas_v3: multiqueue support'.

As discussed before, tags of hisilicon V3 is HBA wide. If you switch
to real hw queue, each hw queue has to own its independent tags.
However, that isn't supported by V3 hardware.

See previous discussion:

https://marc.info/?t=15592886301&r=1&w=2


Thanks,
Ming

Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-09 Thread John Garry


- steal bios from the request, and resubmit them via
generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!

John, I don't add your tested-by tag since V3 have some changes,
and I appreciate if you may run your test on V3.



Will do, Thanks


Hi Ming,

I got this warning once:

[  162.558185] CPU10: shutdown
[  162.560994] psci: CPU10 killed.
[  162.593939] CPU9: shutdown
[  162.596645] psci: CPU9 killed.
[  162.625838] CPU8: shutdown
[  162.628550] psci: CPU8 killed.
[  162.685790] CPU7: shutdown
[  162.688496] psci: CPU7 killed.
[  162.725771] CPU6: shutdown
[  162.728486] psci: CPU6 killed.
[  162.753884] CPU5: shutdown
[  162.756591] psci: CPU5 killed.
[  162.785584] irq_shutdown
[  162.788277] IRQ 800: no longer affine to CPU4
[  162.793267] CPU4: shutdown
[  162.795975] psci: CPU4 killed.
[  162.849680] run queue from wrong CPU 13, hctx active
[  162.849692] CPU3: shutdown
[  162.854649] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted
5.4.0-rc1-00012-gad025dd3d001 #1098
[  162.854653] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
V1.16.01 03/15/2019
[  162.857362] psci: CPU3 killed.
[  162.866039] Workqueue: kblockd blk_mq_run_work_fn
[  162.882281] Call trace:
[  162.884716]  dump_backtrace+0x0/0x150
[  162.888365]  show_stack+0x14/0x20
[  162.891668]  dump_stack+0xb0/0xf8
[  162.894970]  __blk_mq_run_hw_queue+0x11c/0x128
[  162.899400]  blk_mq_run_work_fn+0x1c/0x28
[  162.903397]  process_one_work+0x1e0/0x358
[  162.907393]  worker_thread+0x40/0x488
[  162.911042]  kthread+0x118/0x120
[  162.914257]  ret_from_fork+0x10/0x18


What is the HBA? If it is Hisilicon SAS, it isn't strange, because
this patch can't fix single hw queue with multiple private reply queue
yet, that can be one follow-up job of this patchset.



Yes, hisi_sas. So, right, it is single queue today on mainline, but I 
manually made it multiqueue on my dev branch just to test this series. 
Otherwise I could not test it for that driver.


My dev branch is here, if interested:
https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq

I am also going to retest NVMe.

Thanks,
John


Thanks,
Ming

.

Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-09 Thread Ming Lei

On Tue, Oct 08, 2019 at 06:15:52PM +0100, John Garry wrote:
> On 08/10/2019 10:06, John Garry wrote:
> > On 08/10/2019 05:18, Ming Lei wrote:
> > > Hi,
> > > 
> > > Thomas mentioned:
> > > "
> > >  That was the constraint of managed interrupts from the very
> > > beginning:
> > > 
> > >   The driver/subsystem has to quiesce the interrupt line and the
> > > associated
> > >   queue _before_ it gets shutdown in CPU unplug and not fiddle
> > > with it
> > >   until it's restarted by the core when the CPU is plugged in again.
> > > "
> > > 
> > > But no drivers or blk-mq do that before one hctx becomes dead(all
> > > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> > > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> > > 
> > > This patchset tries to address the issue by two stages:
> > > 
> > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> > > 
> > > - mark the hctx as internal stopped, and drain all in-flight requests
> > > if the hctx is going to be dead.
> > > 
> > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
> > > becomes dead
> > > 
> > > - steal bios from the request, and resubmit them via
> > > generic_make_request(),
> > > then these IO will be mapped to other live hctx for dispatch
> > > 
> > > Please comment & review, thanks!
> > > 
> > > John, I don't add your tested-by tag since V3 have some changes,
> > > and I appreciate if you may run your test on V3.
> > > 
> > 
> > Will do, Thanks
> 
> Hi Ming,
> 
> I got this warning once:
> 
> [  162.558185] CPU10: shutdown
> [  162.560994] psci: CPU10 killed.
> [  162.593939] CPU9: shutdown
> [  162.596645] psci: CPU9 killed.
> [  162.625838] CPU8: shutdown
> [  162.628550] psci: CPU8 killed.
> [  162.685790] CPU7: shutdown
> [  162.688496] psci: CPU7 killed.
> [  162.725771] CPU6: shutdown
> [  162.728486] psci: CPU6 killed.
> [  162.753884] CPU5: shutdown
> [  162.756591] psci: CPU5 killed.
> [  162.785584] irq_shutdown
> [  162.788277] IRQ 800: no longer affine to CPU4
> [  162.793267] CPU4: shutdown
> [  162.795975] psci: CPU4 killed.
> [  162.849680] run queue from wrong CPU 13, hctx active
> [  162.849692] CPU3: shutdown
> [  162.854649] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted
> 5.4.0-rc1-00012-gad025dd3d001 #1098
> [  162.854653] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
> V1.16.01 03/15/2019
> [  162.857362] psci: CPU3 killed.
> [  162.866039] Workqueue: kblockd blk_mq_run_work_fn
> [  162.882281] Call trace:
> [  162.884716]  dump_backtrace+0x0/0x150
> [  162.888365]  show_stack+0x14/0x20
> [  162.891668]  dump_stack+0xb0/0xf8
> [  162.894970]  __blk_mq_run_hw_queue+0x11c/0x128
> [  162.899400]  blk_mq_run_work_fn+0x1c/0x28
> [  162.903397]  process_one_work+0x1e0/0x358
> [  162.907393]  worker_thread+0x40/0x488
> [  162.911042]  kthread+0x118/0x120
> [  162.914257]  ret_from_fork+0x10/0x18

What is the HBA? If it is Hisilicon SAS, it isn't strange, because
this patch can't fix single hw queue with multiple private reply queue
yet, that can be one follow-up job of this patchset.

Thanks,
Ming

Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-08 Thread John Garry


On 08/10/2019 10:06, John Garry wrote:

On 08/10/2019 05:18, Ming Lei wrote:

Hi,

Thomas mentioned:
"
 That was the constraint of managed interrupts from the very
beginning:

  The driver/subsystem has to quiesce the interrupt line and the
associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle
with it
  until it's restarted by the core when the CPU is plugged in again.
"

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
becomes dead

- steal bios from the request, and resubmit them via
generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!

John, I don't add your tested-by tag since V3 have some changes,
and I appreciate if you may run your test on V3.



Will do, Thanks


Hi Ming,

I got this warning once:

[  162.558185] CPU10: shutdown
[  162.560994] psci: CPU10 killed.
[  162.593939] CPU9: shutdown
[  162.596645] psci: CPU9 killed.
[  162.625838] CPU8: shutdown
[  162.628550] psci: CPU8 killed.
[  162.685790] CPU7: shutdown
[  162.688496] psci: CPU7 killed.
[  162.725771] CPU6: shutdown
[  162.728486] psci: CPU6 killed.
[  162.753884] CPU5: shutdown
[  162.756591] psci: CPU5 killed.
[  162.785584] irq_shutdown
[  162.788277] IRQ 800: no longer affine to CPU4
[  162.793267] CPU4: shutdown
[  162.795975] psci: CPU4 killed.
[  162.849680] run queue from wrong CPU 13, hctx active
[  162.849692] CPU3: shutdown
[  162.854649] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  162.854653] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019

[  162.857362] psci: CPU3 killed.
[  162.866039] Workqueue: kblockd blk_mq_run_work_fn
[  162.882281] Call trace:
[  162.884716]  dump_backtrace+0x0/0x150
[  162.888365]  show_stack+0x14/0x20
[  162.891668]  dump_stack+0xb0/0xf8
[  162.894970]  __blk_mq_run_hw_queue+0x11c/0x128
[  162.899400]  blk_mq_run_work_fn+0x1c/0x28
[  162.903397]  process_one_work+0x1e0/0x358
[  162.907393]  worker_thread+0x40/0x488
[  162.911042]  kthread+0x118/0x120
[  162.914257]  ret_from_fork+0x10/0x18
[  162.917834] run queue from wrong CPU 13, hctx active
[  162.922789] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  162.931472] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019

[  162.939983] Workqueue: kblockd blk_mq_run_work_fn
[  162.944674] Call trace:
[  162.947107]  dump_backtrace+0x0/0x150
[  162.950755]  show_stack+0x14/0x20
[  162.954057]  dump_stack+0xb0/0xf8
[  162.957359]  __blk_mq_run_hw_queue+0x11c/0x128
[  162.961788]  blk_mq_run_work_fn+0x1c/0x28
[  162.965784]  process_one_work+0x1e0/0x358
[  162.969780]  worker_thread+0x40/0x488
[  162.973429]  kthread+0x118/0x120
[  162.976644]  ret_from_fork+0x10/0x18
[  162.980214] run queue from wrong CPU 13, hctx active
[  162.985171] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  162.993853] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019

[  163.002366] Workqueue: kblockd blk_mq_run_work_fn
[  163.007057] Call trace:
[  163.009490]  dump_backtrace+0x0/0x150
[  163.013138]  show_stack+0x14/0x20
[  163.016440]  dump_stack+0xb0/0xf8
[  163.019742]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.024172]  blk_mq_run_work_fn+0x1c/0x28
[  163.028167]  process_one_work+0x1e0/0x358
[  163.032163]  worker_thread+0x238/0x488
[  163.035899]  kthread+0x118/0x120
[  163.039113]  ret_from_fork+0x10/0x18
[  163.042736] run queue from wrong CPU 13, hctx active
[  163.047692] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.056374] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019

[  163.064885] Workqueue: kblockd blk_mq_run_work_fn
[  163.069575] Call trace:
[  163.072008]  dump_backtrace+0x0/0x150
[  163.075656]  show_stack+0x14/0x20
[  163.078958]  dump_stack+0xb0/0xf8
[  163.082260]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.086690]  blk_mq_run_work_fn+0x1c/0x28
[  163.090686]  process_one_work+0x1e0/0x358
[  163.094681]  worker_thread+0x238/0x488
[  163.098417]  kthread+0x118/0x120
[  163.101631]  ret_from_fork+0x10/0x18
[  163.105200] run queue from wrong CPU 13, hctx active
[  163.110534] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098

[  163.111852] CPU2: shutdown
[  163.119218] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019

[  163.119222] Workqueue:

Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-08 Thread John Garry


On 08/10/2019 05:18, Ming Lei wrote:

Hi,

Thomas mentioned:
"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!

John, I don't add your tested-by tag since V3 have some changes,
and I appreciate if you may run your test on V3.



Will do, Thanks


V3:
- re-organize patch 2 & 3 a bit for addressing Hannes's comment
- fix patch 4 for avoiding potential deadlock, as found by Hannes

V2:
- patch4 & patch 5 in V1 have been merged to block tree, so remove
  them
- addres

[PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-07 Thread Ming Lei

Hi,

Thomas mentioned:
"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!

John, I don't add your tested-by tag since V3 have some changes,
and I appreciate if you may run your test on V3.

V3:
- re-organize patch 2 & 3 a bit for addressing Hannes's comment
- fix patch 4 for avoiding potential deadlock, as found by Hannes

V2:
- patch4 & patch 5 in V1 have been merged to block tree, so remove
  them
- address comments from John Garry and Minwoo


Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: prepare for draining IO when hctx's all CPUs are offline
  blk-mq: stop to handle IO and drain IO before hctx becomes dead
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
is dead

 block/blk-mq-debugfs.c |   2 +
 block/blk-mq-tag.c |   2 +-
 block/blk-mq-tag.h |   2 +
 block/blk-mq.c | 135 +
 block/blk-mq.h |   3 +-
 drivers/block/loop.c   |   2 +-
 drivers/md/dm-rq.c |   2 +-
 include/linux/blk-mq.h |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 138 insertions(+), 16 deletions(-)

Cc: John Garry 
Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Christoph Hellwig 
Cc: Thomas Gleixner 
Cc: Keith Busch 
-- 
2.20.1

Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-05 Thread Ming Lei

On Wed, Oct 02, 2019 at 08:36:52AM -0600, Jens Axboe wrote:
> On 10/2/19 3:56 AM, John Garry wrote:
> > On 22/08/2019 18:39, John Garry wrote:
> >> On 12/08/2019 14:43, Ming Lei wrote:
> >>> Hi,
> >>>
> >>> Thomas mentioned:
> >>>  "
> >>>   That was the constraint of managed interrupts from the very
> >>> beginning:
> >>>
> >>>The driver/subsystem has to quiesce the interrupt line and the
> >>> associated
> >>>queue _before_ it gets shutdown in CPU unplug and not fiddle
> >>> with it
> >>>until it's restarted by the core when the CPU is plugged in again.
> >>>  "
> >>>
> >>> But no drivers or blk-mq do that before one hctx becomes dead(all
> >>> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> >>> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> >>>
> >>> This patchset tries to address the issue by two stages:
> >>>
> >>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> >>>
> >>> - mark the hctx as internal stopped, and drain all in-flight requests
> >>> if the hctx is going to be dead.
> >>>
> >>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
> >>> becomes dead
> >>>
> >>> - steal bios from the request, and resubmit them via
> >>> generic_make_request(),
> >>> then these IO will be mapped to other live hctx for dispatch
> >>>
> >>> Please comment & review, thanks!
> >>>
> >>> V2:
> >>>  - patch4 & patch 5 in V1 have been merged to block tree, so remove
> >>>them
> >>>  - address comments from John Garry and Minwoo
> >>>
> >>>
> >>> Ming Lei (5):
> >>>blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
> >>>blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
> >>>blk-mq: stop to handle IO before hctx's all CPUs become offline
> >>>blk-mq: re-submit IO in case that hctx is dead
> >>>blk-mq: handle requests dispatched from IO scheduler in case that hctx
> >>>  is dead
> >>
> >> Hi Ming,
> >>
> >> This looks to fix the hotplug issue for me.
> >>
> >> Previously I could manufacture a scenario while running fio where I got
> >> IO timeouts, like this:
> >>
> >> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
> >> [  296.897627] process 891 (fio) no longer affine to cpu0
> >> [  296.898488] process 893 (fio) no longer affine to cpu0
> >> [  296.910270] process 890 (fio) no longer affine to cpu0
> >> [  296.927322] IRQ 775: no longer affine to CPU0
> >> [  296.932762] CPU0: shutdown
> >> [  296.935469] psci: CPU0 killed.
> >> root@(none)$ [  326.971962] sas: Enter sas_scsi_recover_host busy: 61
> >> failed: 61
> >> [  326.977978] sas: sas_scsi_find_task: aborting task 0xe2cdc79b
> >> root@(none)$ [  333.047964] hisi_sas_v3_hw :74:02.0: internal task
> >> abort: timeout and not done.
> >> [  333.055616] hisi_sas_v3_hw :74:02.0: abort task: internal abort (-5)
> >> [  333.062306] sas: sas_scsi_find_task: querying task 0xe2cdc79b
> >> [  333.068776] sas: sas_scsi_find_task: task 0xe2cdc79b not at LU
> >> [  333.075295] sas: task 0xe2cdc79b is not at LU: I_T recover
> >> [  333.081464] sas: I_T nexus reset for dev 5000c500a7b95a49
> >>
> >> Please notice the 30-second delay for the SCSI IO timeout.
> >>
> >> And now I don't see it; here's a sample for irq shutdown:
> >> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
> >> [  344.608148] process 849 (fio) no longer affine to cpu0
> >> [  344.608639] process 848 (fio) no longer affine to cpu0
> >> [  344.609454] process 850 (fio) no longer affine to cpu0
> >> [  344.643481] process 847 (fio) no longer affine to cpu0
> >> [  346.213842] IRQ 775: no longer affine to CPU0
> >> [  346.219712] CPU0: shutdown
> >> [  346.222425] psci: CPU0 killed.
> >>
> >> Please notice the ~1.5s pause, which would be the queue draining.
> >>
> >> So FWIW:
> >> Tested-by: John Garry 
> >>
> >> JFYI, I tested on 5.3-rc5 and cherry-picked
> >> https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6
> >> and "blk-mq: add callback of .cleanup_rq".
> >>
> >> Cheers,
> >> John
> > 
> > Hi Jens,
> > 
> > I don't mean to be pushy, but can we consider to get these patches from
> > Ming merged?
> > 
> > As above, I tested on my SCSI driver and it works. I also tested on an
> > NVMe disk, and it solves the condition which generates this message:
> > root@(none)$ echo 0 > /sys/devices/system/cpu/cpu2/online
> > [  465.635960] CPU2: shutdown
> > [  465.638662] psci: CPU2 killed.
> > [  111.381653] nvme nvme0: I/O 705 QID 18 timeout, completion polled
> > 
> > (that's on top off v5.4-rc1)
> 
> Ming, can you repost the series?

It has been resent out just now.

Thanks,
Ming

[PATCH V2 RESEND 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-05 Thread Ming Lei

Hi,

Thomas mentioned:
"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!

V2:
- patch4 & patch 5 in V1 have been merged to block tree, so remove
  them
- address comments from John Garry and Minwoo

Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  blk-mq: stop to handle IO before hctx's all CPUs become offline
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
is dead

 block/blk-mq-debugfs.c |   2 +
 block/blk-mq-tag.c |   2 +-
 block/blk-mq-tag.h |   2 +
 block/blk-mq.c | 143 +
 block/blk-mq.h |   3 +-
 drivers/block/loop.c   |   2 +-
 drivers/md/dm-rq.c |   2 +-
 include/linux/blk-mq.h |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 146 insertions(+), 16 deletions(-)

Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Christoph Hellwig 
Cc: Thomas Gleixner 
Cc: Keith Busch 
-- 
2.20.1

Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-02 Thread Jens Axboe

On 10/2/19 3:56 AM, John Garry wrote:
> On 22/08/2019 18:39, John Garry wrote:
>> On 12/08/2019 14:43, Ming Lei wrote:
>>> Hi,
>>>
>>> Thomas mentioned:
>>>  "
>>>   That was the constraint of managed interrupts from the very
>>> beginning:
>>>
>>>The driver/subsystem has to quiesce the interrupt line and the
>>> associated
>>>queue _before_ it gets shutdown in CPU unplug and not fiddle
>>> with it
>>>until it's restarted by the core when the CPU is plugged in again.
>>>  "
>>>
>>> But no drivers or blk-mq do that before one hctx becomes dead(all
>>> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
>>> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
>>>
>>> This patchset tries to address the issue by two stages:
>>>
>>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>>>
>>> - mark the hctx as internal stopped, and drain all in-flight requests
>>> if the hctx is going to be dead.
>>>
>>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
>>> becomes dead
>>>
>>> - steal bios from the request, and resubmit them via
>>> generic_make_request(),
>>> then these IO will be mapped to other live hctx for dispatch
>>>
>>> Please comment & review, thanks!
>>>
>>> V2:
>>>  - patch4 & patch 5 in V1 have been merged to block tree, so remove
>>>them
>>>  - address comments from John Garry and Minwoo
>>>
>>>
>>> Ming Lei (5):
>>>blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
>>>blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
>>>blk-mq: stop to handle IO before hctx's all CPUs become offline
>>>blk-mq: re-submit IO in case that hctx is dead
>>>blk-mq: handle requests dispatched from IO scheduler in case that hctx
>>>  is dead
>>
>> Hi Ming,
>>
>> This looks to fix the hotplug issue for me.
>>
>> Previously I could manufacture a scenario while running fio where I got
>> IO timeouts, like this:
>>
>> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
>> [  296.897627] process 891 (fio) no longer affine to cpu0
>> [  296.898488] process 893 (fio) no longer affine to cpu0
>> [  296.910270] process 890 (fio) no longer affine to cpu0
>> [  296.927322] IRQ 775: no longer affine to CPU0
>> [  296.932762] CPU0: shutdown
>> [  296.935469] psci: CPU0 killed.
>> root@(none)$ [  326.971962] sas: Enter sas_scsi_recover_host busy: 61
>> failed: 61
>> [  326.977978] sas: sas_scsi_find_task: aborting task 0xe2cdc79b
>> root@(none)$ [  333.047964] hisi_sas_v3_hw :74:02.0: internal task
>> abort: timeout and not done.
>> [  333.055616] hisi_sas_v3_hw :74:02.0: abort task: internal abort (-5)
>> [  333.062306] sas: sas_scsi_find_task: querying task 0xe2cdc79b
>> [  333.068776] sas: sas_scsi_find_task: task 0xe2cdc79b not at LU
>> [  333.075295] sas: task 0xe2cdc79b is not at LU: I_T recover
>> [  333.081464] sas: I_T nexus reset for dev 5000c500a7b95a49
>>
>> Please notice the 30-second delay for the SCSI IO timeout.
>>
>> And now I don't see it; here's a sample for irq shutdown:
>> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
>> [  344.608148] process 849 (fio) no longer affine to cpu0
>> [  344.608639] process 848 (fio) no longer affine to cpu0
>> [  344.609454] process 850 (fio) no longer affine to cpu0
>> [  344.643481] process 847 (fio) no longer affine to cpu0
>> [  346.213842] IRQ 775: no longer affine to CPU0
>> [  346.219712] CPU0: shutdown
>> [  346.222425] psci: CPU0 killed.
>>
>> Please notice the ~1.5s pause, which would be the queue draining.
>>
>> So FWIW:
>> Tested-by: John Garry 
>>
>> JFYI, I tested on 5.3-rc5 and cherry-picked
>> https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6
>> and "blk-mq: add callback of .cleanup_rq".
>>
>> Cheers,
>> John
> 
> Hi Jens,
> 
> I don't mean to be pushy, but can we consider to get these patches from
> Ming merged?
> 
> As above, I tested on my SCSI driver and it works. I also tested on an
> NVMe disk, and it solves the condition which generates this message:
> root@(none)$ echo 0 > /sys/devices/system/cpu/cpu2/online
> [  465.635960] CPU2: shutdown
> [  465.638662] psci: CPU2 killed.
> [  111.381653] nvme nvme0: I/O 705 QID 18 timeout, completion polled
> 
> (that's on top off v5.4-rc1)

Ming, can you repost the series?

-- 
Jens Axboe

Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-10-02 Thread John Garry


On 22/08/2019 18:39, John Garry wrote:

On 12/08/2019 14:43, Ming Lei wrote:

Hi,

Thomas mentioned:
"
 That was the constraint of managed interrupts from the very
beginning:

  The driver/subsystem has to quiesce the interrupt line and the
associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle
with it
  until it's restarted by the core when the CPU is plugged in again.
"

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
becomes dead

- steal bios from the request, and resubmit them via
generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!

V2:
- patch4 & patch 5 in V1 have been merged to block tree, so remove
  them
- address comments from John Garry and Minwoo


Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  blk-mq: stop to handle IO before hctx's all CPUs become offline
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
is dead


Hi Ming,

This looks to fix the hotplug issue for me.

Previously I could manufacture a scenario while running fio where I got
IO timeouts, like this:

root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
[  296.897627] process 891 (fio) no longer affine to cpu0
[  296.898488] process 893 (fio) no longer affine to cpu0
[  296.910270] process 890 (fio) no longer affine to cpu0
[  296.927322] IRQ 775: no longer affine to CPU0
[  296.932762] CPU0: shutdown
[  296.935469] psci: CPU0 killed.
root@(none)$ [  326.971962] sas: Enter sas_scsi_recover_host busy: 61
failed: 61
[  326.977978] sas: sas_scsi_find_task: aborting task 0xe2cdc79b
root@(none)$ [  333.047964] hisi_sas_v3_hw :74:02.0: internal task
abort: timeout and not done.
[  333.055616] hisi_sas_v3_hw :74:02.0: abort task: internal abort (-5)
[  333.062306] sas: sas_scsi_find_task: querying task 0xe2cdc79b
[  333.068776] sas: sas_scsi_find_task: task 0xe2cdc79b not at LU
[  333.075295] sas: task 0xe2cdc79b is not at LU: I_T recover
[  333.081464] sas: I_T nexus reset for dev 5000c500a7b95a49

Please notice the 30-second delay for the SCSI IO timeout.

And now I don't see it; here's a sample for irq shutdown:
root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
[  344.608148] process 849 (fio) no longer affine to cpu0
[  344.608639] process 848 (fio) no longer affine to cpu0
[  344.609454] process 850 (fio) no longer affine to cpu0
[  344.643481] process 847 (fio) no longer affine to cpu0
[  346.213842] IRQ 775: no longer affine to CPU0
[  346.219712] CPU0: shutdown
[  346.222425] psci: CPU0 killed.

Please notice the ~1.5s pause, which would be the queue draining.

So FWIW:
Tested-by: John Garry 

JFYI, I tested on 5.3-rc5 and cherry-picked
https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6
and "blk-mq: add callback of .cleanup_rq".

Cheers,
John


Hi Jens,

I don't mean to be pushy, but can we consider to get these patches from 
Ming merged?


As above, I tested on my SCSI driver and it works. I also tested on an 
NVMe disk, and it solves the condition which generates this message:

root@(none)$ echo 0 > /sys/devices/system/cpu/cpu2/online
[  465.635960] CPU2: shutdown
[  465.638662] psci: CPU2 killed.
[  111.381653] nvme nvme0: I/O 705 QID 18 timeout, completion polled

(that's on top off v5.4-rc1)

Thanks,
John







 block/blk-mq-debugfs.c |   2 +

Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-08-22 Thread John Garry


On 12/08/2019 14:43, Ming Lei wrote:

Hi,

Thomas mentioned:
"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!

V2:
- patch4 & patch 5 in V1 have been merged to block tree, so remove
  them
- address comments from John Garry and Minwoo


Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  blk-mq: stop to handle IO before hctx's all CPUs become offline
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
is dead


Hi Ming,

This looks to fix the hotplug issue for me.

Previously I could manufacture a scenario while running fio where I got 
IO timeouts, like this:


root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
[  296.897627] process 891 (fio) no longer affine to cpu0
[  296.898488] process 893 (fio) no longer affine to cpu0
[  296.910270] process 890 (fio) no longer affine to cpu0
[  296.927322] IRQ 775: no longer affine to CPU0
[  296.932762] CPU0: shutdown
[  296.935469] psci: CPU0 killed.
root@(none)$ [  326.971962] sas: Enter sas_scsi_recover_host busy: 61 
failed: 61

[  326.977978] sas: sas_scsi_find_task: aborting task 0xe2cdc79b
root@(none)$ [  333.047964] hisi_sas_v3_hw :74:02.0: internal task 
abort: timeout and not done.

[  333.055616] hisi_sas_v3_hw :74:02.0: abort task: internal abort (-5)
[  333.062306] sas: sas_scsi_find_task: querying task 0xe2cdc79b
[  333.068776] sas: sas_scsi_find_task: task 0xe2cdc79b not at LU
[  333.075295] sas: task 0xe2cdc79b is not at LU: I_T recover
[  333.081464] sas: I_T nexus reset for dev 5000c500a7b95a49

Please notice the 30-second delay for the SCSI IO timeout.

And now I don't see it; here's a sample for irq shutdown:
root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
[  344.608148] process 849 (fio) no longer affine to cpu0
[  344.608639] process 848 (fio) no longer affine to cpu0
[  344.609454] process 850 (fio) no longer affine to cpu0
[  344.643481] process 847 (fio) no longer affine to cpu0
[  346.213842] IRQ 775: no longer affine to CPU0
[  346.219712] CPU0: shutdown
[  346.222425] psci: CPU0 killed.

Please notice the ~1.5s pause, which would be the queue draining.

So FWIW:
Tested-by: John Garry 

JFYI, I tested on 5.3-rc5 and cherry-picked 
https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6 
and "blk-mq: add callback of .cleanup_rq".


Cheers,
John



 block/blk-mq-debugfs.c |   2 +
 block/blk-mq-tag.c |   2 +-
 block/blk-mq-tag.h |   2 +
 block/blk-mq.c | 143 +
 block/blk-mq.h |   3 +-
 drivers/block/loop.c   |   2 +-
 drivers/md/dm-rq.c |   2 +-
 include/linux/blk-mq.h |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 146 insertions(+), 16 deletions(-)

Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Christoph Hellwig 
Cc: Thomas Gleixner 
Cc: Keith Busch

Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-08-12 Thread Ming Lei

On Mon, Aug 12, 2019 at 05:21:44PM +0100, John Garry wrote:
> On 12/08/2019 14:46, Ming Lei wrote:
> > Hi John,
> > 
> > On Mon, Aug 12, 2019 at 09:43:07PM +0800, Ming Lei wrote:
> > > Hi,
> > > 
> > > Thomas mentioned:
> > > "
> > >  That was the constraint of managed interrupts from the very 
> > > beginning:
> > > 
> > >   The driver/subsystem has to quiesce the interrupt line and the 
> > > associated
> > >   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> > >   until it's restarted by the core when the CPU is plugged in again.
> > > "
> > > 
> > > But no drivers or blk-mq do that before one hctx becomes dead(all
> > > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> > > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> > > 
> > > This patchset tries to address the issue by two stages:
> > > 
> > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> > > 
> > > - mark the hctx as internal stopped, and drain all in-flight requests
> > > if the hctx is going to be dead.
> > > 
> > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes 
> > > dead
> > > 
> > > - steal bios from the request, and resubmit them via 
> > > generic_make_request(),
> > > then these IO will be mapped to other live hctx for dispatch
> > > 
> > > Please comment & review, thanks!
> > > 
> > > V2:
> > >   - patch4 & patch 5 in V1 have been merged to block tree, so remove
> > > them
> > >   - address comments from John Garry and Minwoo
> > > 
> > > 
> > > Ming Lei (5):
> > >   blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
> > >   blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
> > >   blk-mq: stop to handle IO before hctx's all CPUs become offline
> > >   blk-mq: re-submit IO in case that hctx is dead
> > >   blk-mq: handle requests dispatched from IO scheduler in case that hctx
> > > is dead
> > > 
> > >  block/blk-mq-debugfs.c |   2 +
> > >  block/blk-mq-tag.c |   2 +-
> > >  block/blk-mq-tag.h |   2 +
> > >  block/blk-mq.c | 143 +
> > >  block/blk-mq.h |   3 +-
> > >  drivers/block/loop.c   |   2 +-
> > >  drivers/md/dm-rq.c |   2 +-
> > >  include/linux/blk-mq.h |   5 ++
> > >  include/linux/cpuhotplug.h |   1 +
> > >  9 files changed, 146 insertions(+), 16 deletions(-)
> > > 
> > > Cc: Bart Van Assche 
> > > Cc: Hannes Reinecke 
> > > Cc: Christoph Hellwig 
> > > Cc: Thomas Gleixner 
> > > Cc: Keith Busch 
> > > --
> > > 2.20.1
> > > 
> > 
> > Sorry for forgetting to Cc you.
> 
> Already subscribed :)
> 
> I don't mean to hijack this thread, but JFYI we're getting around to test
> https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2 - unfortunately
> we're still seeing a performance regression. I can't see where it's coming
> from. We're double-checking the test though.

host-tag patchset is only for several particular drivers which use
private reply queue as completion queue.

This patchset is for handling generic blk-mq CPU hotplug issue, and
the several particular scsi drivers(hisi_sas_v3, hpsa, megaraid_sas and
mp3sas) won't be covered so far.

I'd suggest to move on for generic blk-mq devices first given now blk-mq
is the only request IO path now.

There are at least two choices for us to handle drivers/devices with
private completion queue:

1) host-tags
- performance issue shouldn't be hard to solve, given it is same with
with single tags in theory, and just corner cases is there.

What I am not glad with this approach is that blk-mq-tag code becomes mess.

2) private callback
- we could define private callback to drain each completion queue in
  driver simply.
- problem is that the four drivers have to duplicate the same job


Thanks,
Ming

Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-08-12 Thread John Garry


On 12/08/2019 14:46, Ming Lei wrote:

Hi John,

On Mon, Aug 12, 2019 at 09:43:07PM +0800, Ming Lei wrote:

Hi,

Thomas mentioned:
"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!

V2:
- patch4 & patch 5 in V1 have been merged to block tree, so remove
  them
- address comments from John Garry and Minwoo


Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  blk-mq: stop to handle IO before hctx's all CPUs become offline
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
is dead

 block/blk-mq-debugfs.c |   2 +
 block/blk-mq-tag.c |   2 +-
 block/blk-mq-tag.h |   2 +
 block/blk-mq.c | 143 +
 block/blk-mq.h |   3 +-
 drivers/block/loop.c   |   2 +-
 drivers/md/dm-rq.c |   2 +-
 include/linux/blk-mq.h |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 146 insertions(+), 16 deletions(-)

Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Christoph Hellwig 
Cc: Thomas Gleixner 
Cc: Keith Busch 
--
2.20.1



Sorry for forgetting to Cc you.


Already subscribed :)

I don't mean to hijack this thread, but JFYI we're getting around to 
test https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2 - 
unfortunately we're still seeing a performance regression. I can't see 
where it's coming from. We're double-checking the test though.


Thanks,
John




Thanks,
Ming

.

Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-08-12 Thread Ming Lei

Hi John,

On Mon, Aug 12, 2019 at 09:43:07PM +0800, Ming Lei wrote:
> Hi,
> 
> Thomas mentioned:
> "
>  That was the constraint of managed interrupts from the very beginning:
> 
>   The driver/subsystem has to quiesce the interrupt line and the 
> associated
>   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>   until it's restarted by the core when the CPU is plugged in again.
> "
> 
> But no drivers or blk-mq do that before one hctx becomes dead(all
> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> 
> This patchset tries to address the issue by two stages:
> 
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> 
> - mark the hctx as internal stopped, and drain all in-flight requests
> if the hctx is going to be dead.
> 
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> 
> - steal bios from the request, and resubmit them via generic_make_request(),
> then these IO will be mapped to other live hctx for dispatch
> 
> Please comment & review, thanks!
> 
> V2:
>   - patch4 & patch 5 in V1 have been merged to block tree, so remove
> them
>   - address comments from John Garry and Minwoo
> 
> 
> Ming Lei (5):
>   blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
>   blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
>   blk-mq: stop to handle IO before hctx's all CPUs become offline
>   blk-mq: re-submit IO in case that hctx is dead
>   blk-mq: handle requests dispatched from IO scheduler in case that hctx
> is dead
> 
>  block/blk-mq-debugfs.c |   2 +
>  block/blk-mq-tag.c |   2 +-
>  block/blk-mq-tag.h |   2 +
>  block/blk-mq.c | 143 +
>  block/blk-mq.h |   3 +-
>  drivers/block/loop.c   |   2 +-
>  drivers/md/dm-rq.c |   2 +-
>  include/linux/blk-mq.h |   5 ++
>  include/linux/cpuhotplug.h |   1 +
>  9 files changed, 146 insertions(+), 16 deletions(-)
> 
> Cc: Bart Van Assche 
> Cc: Hannes Reinecke 
> Cc: Christoph Hellwig 
> Cc: Thomas Gleixner 
> Cc: Keith Busch 
> -- 
> 2.20.1
> 

Sorry for forgetting to Cc you.


Thanks,
Ming

[PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug

2019-08-12 Thread Ming Lei

Hi,

Thomas mentioned:
"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!

V2:
- patch4 & patch 5 in V1 have been merged to block tree, so remove
  them
- address comments from John Garry and Minwoo


Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  blk-mq: stop to handle IO before hctx's all CPUs become offline
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
is dead

 block/blk-mq-debugfs.c |   2 +
 block/blk-mq-tag.c |   2 +-
 block/blk-mq-tag.h |   2 +
 block/blk-mq.c | 143 +
 block/blk-mq.h |   3 +-
 drivers/block/loop.c   |   2 +-
 drivers/md/dm-rq.c |   2 +-
 include/linux/blk-mq.h |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 146 insertions(+), 16 deletions(-)

Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Christoph Hellwig 
Cc: Thomas Gleixner 
Cc: Keith Busch 
-- 
2.20.1

Re: [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug

2019-07-16 Thread John Garry


在 16/07/2019 15:18, Ming Lei 写道:

So you have plans to post an updated "[PATCH 0/9] blk-mq/scsi: convert
private reply queue into blk_mq hw queue" then?


Hi Ming,


V2 has been in the following tree for a while:

https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2

It works, however the implementation is a bit ugly even though the idea
is simple.


Yeah, sorry to say that I agree - the BLK_MQ_F_TAG_HCTX_SHARED checks 
look bolted on.




So I think we may need to think of it further, for better implementation or
approach.


Understood.

But at least we may test it to ensure no performance regression.

Thanks,
John



Thanks,
Ming Lei

Re: [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug

2019-07-16 Thread Ming Lei

Hi John,

On Tue, Jul 16, 2019 at 2:55 PM John Garry  wrote:
>
> 在 12/07/2019 10:47, Ming Lei 写道:
> > Hi,
> >
> > Thomas mentioned:
> >  "
> >   That was the constraint of managed interrupts from the very beginning:
> >
> >The driver/subsystem has to quiesce the interrupt line and the 
> > associated
> >queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> >until it's restarted by the core when the CPU is plugged in again.
> >  "
> >
> > But no drivers or blk-mq do that before one hctx becomes dead(all
> > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> >
> > This patchset tries to address the issue by two stages:
> >
> > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> >
> > - mark the hctx as internal stopped, and drain all in-flight requests
> > if the hctx is going to be dead.
> >
> > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes 
> > dead
> >
> > - steal bios from the request, and resubmit them via generic_make_request(),
> > then these IO will be mapped to other live hctx for dispatch
> >
> > Please comment & review, thanks!
> >
>
> Hi Ming,
>
> FWIW, to me this series looks reasonable.

Thanks!

>
> So you have plans to post an updated "[PATCH 0/9] blk-mq/scsi: convert
> private reply queue into blk_mq hw queue" then?

V2 has been in the following tree for a while:

https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2

It works, however the implementation is a bit ugly even though the idea
is simple.

So I think we may need to think of it further, for better implementation or
approach.

Thanks,
Ming Lei

Re: [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug

2019-07-15 Thread John Garry


在 12/07/2019 10:47, Ming Lei 写道:

Hi,

Thomas mentioned:
 "
  That was the constraint of managed interrupts from the very beginning:
 
   The driver/subsystem has to quiesce the interrupt line and the associated

   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
   until it's restarted by the core when the CPU is plugged in again.
 "

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!



Hi Ming,

FWIW, to me this series looks reasonable.

So you have plans to post an updated "[PATCH 0/9] blk-mq/scsi: convert 
private reply queue into blk_mq hw queue" then?


All the best,
John



Ming Lei (7):
   blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
   blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
   blk-mq: stop to handle IO before hctx's all CPUs become offline
   blk-mq: add callback of .free_request
   SCSI: implement .free_request callback
   blk-mq: re-submit IO in case that hctx is dead
   blk-mq: handle requests dispatched from IO scheduler in case that hctx
 is dead

  block/blk-mq-debugfs.c |   2 +
  block/blk-mq-tag.c |   2 +-
  block/blk-mq-tag.h |   2 +
  block/blk-mq.c | 147 ++---
  block/blk-mq.h |   3 +-
  drivers/block/loop.c   |   2 +-
  drivers/md/dm-rq.c |   2 +-
  drivers/scsi/scsi_lib.c|  13 
  include/linux/blk-mq.h |  12 +++
  include/linux/cpuhotplug.h |   1 +
  10 files changed, 170 insertions(+), 16 deletions(-)

Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Christoph Hellwig 
Cc: Thomas Gleixner 
Cc: Keith Busch

[RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug

2019-07-11 Thread Ming Lei

Hi,

Thomas mentioned:
"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!


Ming Lei (7):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  blk-mq: stop to handle IO before hctx's all CPUs become offline
  blk-mq: add callback of .free_request
  SCSI: implement .free_request callback
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
is dead

 block/blk-mq-debugfs.c |   2 +
 block/blk-mq-tag.c |   2 +-
 block/blk-mq-tag.h |   2 +
 block/blk-mq.c | 147 ++---
 block/blk-mq.h |   3 +-
 drivers/block/loop.c   |   2 +-
 drivers/md/dm-rq.c |   2 +-
 drivers/scsi/scsi_lib.c|  13 
 include/linux/blk-mq.h |  12 +++
 include/linux/cpuhotplug.h |   1 +
 10 files changed, 170 insertions(+), 16 deletions(-)

Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Christoph Hellwig 
Cc: Thomas Gleixner 
Cc: Keith Busch 


-- 
2.20.1

RE: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet mail)

2019-06-24 Thread 曾文斌

Hi Ming,

> -Original Message-
> From: Ming Lei 
> Sent: Tuesday, June 25, 2019 10:27 AM
> To: wenbinzeng(曾文斌) 
> Cc: Wenbin Zeng ; ax...@kernel.dk; 
> keith.bu...@intel.com;
> h...@suse.com; osan...@fb.com; s...@grimberg.me; bvanass...@acm.org;
> linux-block@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet 
> mail)
> 
> On Tue, Jun 25, 2019 at 02:14:46AM +, wenbinzeng(曾文斌) wrote:
> > Hi Ming,
> >
> > > -Original Message-
> > > From: Ming Lei 
> > > Sent: Tuesday, June 25, 2019 9:55 AM
> > > To: Wenbin Zeng 
> > > Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com;
> > > osan...@fb.com; s...@grimberg.me; bvanass...@acm.org;
> > > linux-block@vger.kernel.org; linux-ker...@vger.kernel.org;
> > > wenbinzeng(曾文斌) 
> > > Subject: Re: [PATCH] blk-mq: update hctx->cpumask at
> > > cpu-hotplug(Internet mail)
> > >
> > > On Mon, Jun 24, 2019 at 11:24:07PM +0800, Wenbin Zeng wrote:
> > > > Currently hctx->cpumask is not updated when hot-plugging new cpus,
> > > > as there are many chances kblockd_mod_delayed_work_on() getting
> > > > called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run
> > >
> > > There are only two cases in which WORK_CPU_UNBOUND is applied:
> > >
> > > 1) single hw queue
> > >
> > > 2) multiple hw queue, and all CPUs in this hctx become offline
> > >
> > > For 1), all CPUs can be found in hctx->cpumask.
> > >
> > > > on the newly-plugged cpus, consequently __blk_mq_run_hw_queue()
> > > > reporting excessive "run queue from wrong CPU" messages because
> > > > cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false.
> > >
> > > The message means CPU hotplug race is triggered.
> > >
> > > Yeah, there is big problem in blk_mq_hctx_notify_dead() which is
> > > called after one CPU is dead, but still run this hw queue to
> > > dispatch request, and all CPUs in this hctx might become offline.
> > >
> > > We have some discussion before on this issue:
> > >
> > > https://lore.kernel.org/linux-block/CACVXFVN729SgFQGUgmu1iN7P6Mv5+pu
> > > E78STz8hj
> > > 9j5bs82...@mail.gmail.com/
> > >
> >
> > There is another scenario, you can reproduce it by hot-plugging cpus to kvm 
> > guests
> via qemu monitor (I believe virsh setvcpus --live can do the same thing), for 
> example:
> > (qemu) cpu-add 1
> > (qemu) cpu-add 2
> > (qemu) cpu-add 3
> >
> > In such scenario, cpu 1, 2 and 3 are not visible at boot, hctx->cpumask 
> > doesn't
> get synced when these cpus are added.
> 
> It is CPU cold-plug, we suppose to support it.
> 
> The new added CPUs should be visible to hctx, since we spread queues among all
> possible CPUs(), please see blk_mq_map_queues() and 
> irq_build_affinity_masks(),
> which is like static allocation on CPU resources.
> 
> Otherwise, you might use an old kernel or there is bug somewhere.

It turns out that I was using old kernel, version 4.14, I tested the latest 
version, it works well as you said. Thank you very much.

> 
> >
> > > >
> > > > This patch added a cpu-hotplug handler into blk-mq, updating
> > > > hctx->cpumask at cpu-hotplug.
> > >
> > > This way isn't correct, hctx->cpumask should be kept as sync with
> > > queue mapping.
> >
> > Please advise what should I do to deal with the above situation? Thanks a 
> > lot.
> 
> As I shared in last email, there is one approach discussed, which seems 
> doable.
> 
> Thanks,
> Ming

Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet mail)

2019-06-24 Thread Dongli Zhang




On 6/25/19 10:27 AM, Ming Lei wrote:
> On Tue, Jun 25, 2019 at 02:14:46AM +, wenbinzeng(曾文斌) wrote:
>> Hi Ming,
>>
>>> -Original Message-
>>> From: Ming Lei 
>>> Sent: Tuesday, June 25, 2019 9:55 AM
>>> To: Wenbin Zeng 
>>> Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com; osan...@fb.com;
>>> s...@grimberg.me; bvanass...@acm.org; linux-block@vger.kernel.org;
>>> linux-ker...@vger.kernel.org; wenbinzeng(曾文斌) 
>>> Subject: Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet 
>>> mail)
>>>
>>> On Mon, Jun 24, 2019 at 11:24:07PM +0800, Wenbin Zeng wrote:
>>>> Currently hctx->cpumask is not updated when hot-plugging new cpus,
>>>> as there are many chances kblockd_mod_delayed_work_on() getting
>>>> called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run
>>>
>>> There are only two cases in which WORK_CPU_UNBOUND is applied:
>>>
>>> 1) single hw queue
>>>
>>> 2) multiple hw queue, and all CPUs in this hctx become offline
>>>
>>> For 1), all CPUs can be found in hctx->cpumask.
>>>
>>>> on the newly-plugged cpus, consequently __blk_mq_run_hw_queue()
>>>> reporting excessive "run queue from wrong CPU" messages because
>>>> cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false.
>>>
>>> The message means CPU hotplug race is triggered.
>>>
>>> Yeah, there is big problem in blk_mq_hctx_notify_dead() which is called
>>> after one CPU is dead, but still run this hw queue to dispatch request,
>>> and all CPUs in this hctx might become offline.
>>>
>>> We have some discussion before on this issue:
>>>
>>> https://lore.kernel.org/linux-block/CACVXFVN729SgFQGUgmu1iN7P6Mv5+puE78STz8hj
>>> 9j5bs82...@mail.gmail.com/
>>>
>>
>> There is another scenario, you can reproduce it by hot-plugging cpus to kvm 
>> guests via qemu monitor (I believe virsh setvcpus --live can do the same 
>> thing), for example:
>> (qemu) cpu-add 1
>> (qemu) cpu-add 2
>> (qemu) cpu-add 3
>>
>> In such scenario, cpu 1, 2 and 3 are not visible at boot, hctx->cpumask 
>> doesn't get synced when these cpus are added.

Here is how I play with it with the most recent qemu and linux.

Boot VM with 1 out of 4 vcpu online.

# qemu-system-x86_64 -hda disk.img \
-smp 1,maxcpus=4 \
-m 4096M -enable-kvm \
-device nvme,drive=lightnvme,serial=deadbeaf1 \
-drive file=nvme.img,if=none,id=lightnvme \
-vnc :0 \
-kernel /.../mainline-linux/arch/x86_64/boot/bzImage \
-append "root=/dev/sda1 init=/sbin/init text" \
-monitor stdio -net nic -net user,hostfwd=tcp::5022-:22


As Ming mentioned, after boot:

# cat /proc/cpuinfo  | grep processor
processor   : 0

# cat /sys/block/nvme0n1/mq/0/cpu_list
0
# cat /sys/block/nvme0n1/mq/1/cpu_list
1
# cat /sys/block/nvme0n1/mq/2/cpu_list
2
# cat /sys/block/nvme0n1/mq/3/cpu_list
3

# cat /proc/interrupts | grep nvme
 24: 11   PCI-MSI 65536-edge  nvme0q0
 25: 78   PCI-MSI 65537-edge  nvme0q1
 26:  0   PCI-MSI 65538-edge  nvme0q2
 27:  0   PCI-MSI 65539-edge  nvme0q3
 28:  0   PCI-MSI 65540-edge  nvme0q4

I hotplug with "device_add
qemu64-x86_64-cpu,id=core1,socket-id=1,core-id=0,thread-id=0" in qemu monitor.

Dongli Zhang

> 
> It is CPU cold-plug, we suppose to support it.
> 
> The new added CPUs should be visible to hctx, since we spread queues
> among all possible CPUs(), please see blk_mq_map_queues() and
> irq_build_affinity_masks(), which is like static allocation on CPU
> resources.
> 
> Otherwise, you might use an old kernel or there is bug somewhere.
> 
>>
>>>>
>>>> This patch added a cpu-hotplug handler into blk-mq, updating
>>>> hctx->cpumask at cpu-hotplug.
>>>
>>> This way isn't correct, hctx->cpumask should be kept as sync with
>>> queue mapping.
>>
>> Please advise what should I do to deal with the above situation? Thanks a 
>> lot.
> 
> As I shared in last email, there is one approach discussed, which seems
> doable.
> 
> Thanks,
> Ming
>

RE: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet mail)

2019-06-24 Thread 曾文斌

Hi Dongli,

> -Original Message-
> From: Dongli Zhang 
> Sent: Tuesday, June 25, 2019 9:30 AM
> To: Wenbin Zeng 
> Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com; 
> ming@redhat.com;
> osan...@fb.com; s...@grimberg.me; bvanass...@acm.org;
> linux-block@vger.kernel.org; linux-ker...@vger.kernel.org; wenbinzeng(曾文斌)
> 
> Subject: Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet 
> mail)
> 
> Hi Wenbin,
> 
> On 6/24/19 11:24 PM, Wenbin Zeng wrote:
> > Currently hctx->cpumask is not updated when hot-plugging new cpus,
> > as there are many chances kblockd_mod_delayed_work_on() getting
> > called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run
> > on the newly-plugged cpus, consequently __blk_mq_run_hw_queue()
> > reporting excessive "run queue from wrong CPU" messages because
> > cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false.
> >
> > This patch added a cpu-hotplug handler into blk-mq, updating
> > hctx->cpumask at cpu-hotplug.
> >
> > Signed-off-by: Wenbin Zeng 
> > ---
> >  block/blk-mq.c | 29 +
> >  include/linux/blk-mq.h |  1 +
> >  2 files changed, 30 insertions(+)
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index ce0f5f4..2e465fc 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -39,6 +39,8 @@
> >  #include "blk-mq-sched.h"
> >  #include "blk-rq-qos.h"
> >
> > +static enum cpuhp_state cpuhp_blk_mq_online;
> > +
> >  static void blk_mq_poll_stats_start(struct request_queue *q);
> >  static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
> >
> > @@ -2215,6 +2217,21 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, 
> > struct
> blk_mq_tags *tags,
> > return -ENOMEM;
> >  }
> >
> > +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node 
> > *node)
> > +{
> > +   struct blk_mq_hw_ctx *hctx;
> > +
> > +   hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);
> > +
> > +   if (!cpumask_test_cpu(cpu, hctx->cpumask)) {
> > +   mutex_lock(&hctx->queue->sysfs_lock);
> > +   cpumask_set_cpu(cpu, hctx->cpumask);
> > +   mutex_unlock(&hctx->queue->sysfs_lock);
> > +   }
> > +
> > +   return 0;
> > +}
> > +
> 
> As this callback is registered for each hctx, when a cpu is online, it is 
> called
> for each hctx.
> 
> Just taking a 4-queue nvme as example (regardless about other block like 
> loop).
> Suppose cpu=2 (out of 0, 1, 2 and 3) is offline. When we online cpu=2,
> 
> blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=3
> blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=2
> blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=1
> blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=0
> 
> There is no need to set cpu 2 for blk_mq_hw_ctx->queue_num=[3, 1, 0]. I am
> afraid this patch would erroneously set cpumask for 
> blk_mq_hw_ctx->queue_num=[3,
> 1, 0].
> 
> I used to submit the below patch explaining above for removing a cpu and it is
> unfortunately not merged yet.
> 
> https://patchwork.kernel.org/patch/10889307/
> 
> 
> Another thing is during initialization, the hctx->cpumask should already been
> set and even the cpu is offline. Would you please explain the case 
> hctx->cpumask
> is not set correctly, e.g., how to reproduce with a kvm guest running
> scsi/virtio/nvme/loop?

My scenario is:
A kvm guest started with single cpu, during initialization only one cpu was 
visible by kernel.
After boot, I hot-add some cpus via qemu monitor (I believe virsh setvcpus 
--live can do the same thing), for example:
(qemu) cpu-add 1
(qemu) cpu-add 2
(qemu) cpu-add 3

In such scenario, hctx->cpumask doesn't get updated when these cpus are added.

> 
> Dongli Zhang

Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet mail)

2019-06-24 Thread Ming Lei

On Tue, Jun 25, 2019 at 02:14:46AM +, wenbinzeng(曾文斌) wrote:
> Hi Ming,
> 
> > -Original Message-
> > From: Ming Lei 
> > Sent: Tuesday, June 25, 2019 9:55 AM
> > To: Wenbin Zeng 
> > Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com; osan...@fb.com;
> > s...@grimberg.me; bvanass...@acm.org; linux-block@vger.kernel.org;
> > linux-ker...@vger.kernel.org; wenbinzeng(曾文斌) 
> > Subject: Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet 
> > mail)
> > 
> > On Mon, Jun 24, 2019 at 11:24:07PM +0800, Wenbin Zeng wrote:
> > > Currently hctx->cpumask is not updated when hot-plugging new cpus,
> > > as there are many chances kblockd_mod_delayed_work_on() getting
> > > called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run
> > 
> > There are only two cases in which WORK_CPU_UNBOUND is applied:
> > 
> > 1) single hw queue
> > 
> > 2) multiple hw queue, and all CPUs in this hctx become offline
> > 
> > For 1), all CPUs can be found in hctx->cpumask.
> > 
> > > on the newly-plugged cpus, consequently __blk_mq_run_hw_queue()
> > > reporting excessive "run queue from wrong CPU" messages because
> > > cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false.
> > 
> > The message means CPU hotplug race is triggered.
> > 
> > Yeah, there is big problem in blk_mq_hctx_notify_dead() which is called
> > after one CPU is dead, but still run this hw queue to dispatch request,
> > and all CPUs in this hctx might become offline.
> > 
> > We have some discussion before on this issue:
> > 
> > https://lore.kernel.org/linux-block/CACVXFVN729SgFQGUgmu1iN7P6Mv5+puE78STz8hj
> > 9j5bs82...@mail.gmail.com/
> > 
> 
> There is another scenario, you can reproduce it by hot-plugging cpus to kvm 
> guests via qemu monitor (I believe virsh setvcpus --live can do the same 
> thing), for example:
> (qemu) cpu-add 1
> (qemu) cpu-add 2
> (qemu) cpu-add 3
> 
> In such scenario, cpu 1, 2 and 3 are not visible at boot, hctx->cpumask 
> doesn't get synced when these cpus are added.

It is CPU cold-plug, we suppose to support it.

The new added CPUs should be visible to hctx, since we spread queues
among all possible CPUs(), please see blk_mq_map_queues() and
irq_build_affinity_masks(), which is like static allocation on CPU
resources.

Otherwise, you might use an old kernel or there is bug somewhere.

> 
> > >
> > > This patch added a cpu-hotplug handler into blk-mq, updating
> > > hctx->cpumask at cpu-hotplug.
> > 
> > This way isn't correct, hctx->cpumask should be kept as sync with
> > queue mapping.
> 
> Please advise what should I do to deal with the above situation? Thanks a lot.

As I shared in last email, there is one approach discussed, which seems
doable.

Thanks,
Ming

RE: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet mail)

2019-06-24 Thread 曾文斌

Hi Ming,

> -Original Message-
> From: Ming Lei 
> Sent: Tuesday, June 25, 2019 9:55 AM
> To: Wenbin Zeng 
> Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com; osan...@fb.com;
> s...@grimberg.me; bvanass...@acm.org; linux-block@vger.kernel.org;
> linux-ker...@vger.kernel.org; wenbinzeng(曾文斌) 
> Subject: Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet 
> mail)
> 
> On Mon, Jun 24, 2019 at 11:24:07PM +0800, Wenbin Zeng wrote:
> > Currently hctx->cpumask is not updated when hot-plugging new cpus,
> > as there are many chances kblockd_mod_delayed_work_on() getting
> > called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run
> 
> There are only two cases in which WORK_CPU_UNBOUND is applied:
> 
> 1) single hw queue
> 
> 2) multiple hw queue, and all CPUs in this hctx become offline
> 
> For 1), all CPUs can be found in hctx->cpumask.
> 
> > on the newly-plugged cpus, consequently __blk_mq_run_hw_queue()
> > reporting excessive "run queue from wrong CPU" messages because
> > cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false.
> 
> The message means CPU hotplug race is triggered.
> 
> Yeah, there is big problem in blk_mq_hctx_notify_dead() which is called
> after one CPU is dead, but still run this hw queue to dispatch request,
> and all CPUs in this hctx might become offline.
> 
> We have some discussion before on this issue:
> 
> https://lore.kernel.org/linux-block/CACVXFVN729SgFQGUgmu1iN7P6Mv5+puE78STz8hj
> 9j5bs82...@mail.gmail.com/
> 

There is another scenario, you can reproduce it by hot-plugging cpus to kvm 
guests via qemu monitor (I believe virsh setvcpus --live can do the same 
thing), for example:
(qemu) cpu-add 1
(qemu) cpu-add 2
(qemu) cpu-add 3

In such scenario, cpu 1, 2 and 3 are not visible at boot, hctx->cpumask doesn't 
get synced when these cpus are added.

> >
> > This patch added a cpu-hotplug handler into blk-mq, updating
> > hctx->cpumask at cpu-hotplug.
> 
> This way isn't correct, hctx->cpumask should be kept as sync with
> queue mapping.

Please advise what should I do to deal with the above situation? Thanks a lot.

> 
> Thanks,
> Ming

Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug

2019-06-24 Thread Ming Lei

On Mon, Jun 24, 2019 at 11:24:07PM +0800, Wenbin Zeng wrote:
> Currently hctx->cpumask is not updated when hot-plugging new cpus,
> as there are many chances kblockd_mod_delayed_work_on() getting
> called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run

There are only two cases in which WORK_CPU_UNBOUND is applied:

1) single hw queue

2) multiple hw queue, and all CPUs in this hctx become offline

For 1), all CPUs can be found in hctx->cpumask.

> on the newly-plugged cpus, consequently __blk_mq_run_hw_queue()
> reporting excessive "run queue from wrong CPU" messages because
> cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false.

The message means CPU hotplug race is triggered.

Yeah, there is big problem in blk_mq_hctx_notify_dead() which is called
after one CPU is dead, but still run this hw queue to dispatch request,
and all CPUs in this hctx might become offline.

We have some discussion before on this issue:

https://lore.kernel.org/linux-block/cacvxfvn729sgfqgugmu1in7p6mv5+pue78stz8hj9j5bs82...@mail.gmail.com/

> 
> This patch added a cpu-hotplug handler into blk-mq, updating
> hctx->cpumask at cpu-hotplug.

This way isn't correct, hctx->cpumask should be kept as sync with
queue mapping.

Thanks,
Ming

Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug

2019-06-24 Thread Dongli Zhang

Hi Wenbin,

On 6/24/19 11:24 PM, Wenbin Zeng wrote:
> Currently hctx->cpumask is not updated when hot-plugging new cpus,
> as there are many chances kblockd_mod_delayed_work_on() getting
> called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run
> on the newly-plugged cpus, consequently __blk_mq_run_hw_queue()
> reporting excessive "run queue from wrong CPU" messages because
> cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false.
> 
> This patch added a cpu-hotplug handler into blk-mq, updating
> hctx->cpumask at cpu-hotplug.
> 
> Signed-off-by: Wenbin Zeng 
> ---
>  block/blk-mq.c | 29 +
>  include/linux/blk-mq.h |  1 +
>  2 files changed, 30 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ce0f5f4..2e465fc 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -39,6 +39,8 @@
>  #include "blk-mq-sched.h"
>  #include "blk-rq-qos.h"
>  
> +static enum cpuhp_state cpuhp_blk_mq_online;
> +
>  static void blk_mq_poll_stats_start(struct request_queue *q);
>  static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
>  
> @@ -2215,6 +2217,21 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, 
> struct blk_mq_tags *tags,
>   return -ENOMEM;
>  }
>  
> +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node 
> *node)
> +{
> + struct blk_mq_hw_ctx *hctx;
> +
> + hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);
> +
> + if (!cpumask_test_cpu(cpu, hctx->cpumask)) {
> + mutex_lock(&hctx->queue->sysfs_lock);
> + cpumask_set_cpu(cpu, hctx->cpumask);
> + mutex_unlock(&hctx->queue->sysfs_lock);
> + }
> +
> + return 0;
> +}
> +

As this callback is registered for each hctx, when a cpu is online, it is called
for each hctx.

Just taking a 4-queue nvme as example (regardless about other block like loop).
Suppose cpu=2 (out of 0, 1, 2 and 3) is offline. When we online cpu=2,

blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=3
blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=2
blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=1
blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=0

There is no need to set cpu 2 for blk_mq_hw_ctx->queue_num=[3, 1, 0]. I am
afraid this patch would erroneously set cpumask for blk_mq_hw_ctx->queue_num=[3,
1, 0].

I used to submit the below patch explaining above for removing a cpu and it is
unfortunately not merged yet.

https://patchwork.kernel.org/patch/10889307/


Another thing is during initialization, the hctx->cpumask should already been
set and even the cpu is offline. Would you please explain the case hctx->cpumask
is not set correctly, e.g., how to reproduce with a kvm guest running
scsi/virtio/nvme/loop?

Dongli Zhang

[PATCH] blk-mq: update hctx->cpumask at cpu-hotplug

2019-06-24 Thread Wenbin Zeng

Currently hctx->cpumask is not updated when hot-plugging new cpus,
as there are many chances kblockd_mod_delayed_work_on() getting
called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run
on the newly-plugged cpus, consequently __blk_mq_run_hw_queue()
reporting excessive "run queue from wrong CPU" messages because
cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false.

This patch added a cpu-hotplug handler into blk-mq, updating
hctx->cpumask at cpu-hotplug.

Signed-off-by: Wenbin Zeng 
---
 block/blk-mq.c | 29 +
 include/linux/blk-mq.h |  1 +
 2 files changed, 30 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ce0f5f4..2e465fc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -39,6 +39,8 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
+static enum cpuhp_state cpuhp_blk_mq_online;
+
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
@@ -2215,6 +2217,21 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct 
blk_mq_tags *tags,
return -ENOMEM;
 }
 
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+   struct blk_mq_hw_ctx *hctx;
+
+   hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);
+
+   if (!cpumask_test_cpu(cpu, hctx->cpumask)) {
+   mutex_lock(&hctx->queue->sysfs_lock);
+   cpumask_set_cpu(cpu, hctx->cpumask);
+   mutex_unlock(&hctx->queue->sysfs_lock);
+   }
+
+   return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2251,6 +2268,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, 
struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+   if (cpuhp_blk_mq_online > 0)
+   cpuhp_state_remove_instance_nocalls(cpuhp_blk_mq_online,
+   &hctx->cpuhp_online);
cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
&hctx->cpuhp_dead);
 }
@@ -2310,6 +2330,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
hctx->queue_num = hctx_idx;
 
+   if (cpuhp_blk_mq_online > 0)
+   cpuhp_state_add_instance_nocalls(cpuhp_blk_mq_online,
+&hctx->cpuhp_online);
cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
hctx->tags = set->tags[hctx_idx];
@@ -3544,6 +3567,12 @@ unsigned int blk_mq_rq_cpu(struct request *rq)
 
 static int __init blk_mq_init(void)
 {
+   cpuhp_blk_mq_online = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
+   "block/mq:online",
+   blk_mq_hctx_notify_online, NULL);
+   if (cpuhp_blk_mq_online <= 0)
+   pr_warn("blk_mq_init: failed to setup cpu online callback\n");
+
cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
blk_mq_hctx_notify_dead);
return 0;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 15d1aa5..5241659 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -58,6 +58,7 @@ struct blk_mq_hw_ctx {
 
atomic_tnr_active;
 
+   struct hlist_node   cpuhp_online;
struct hlist_node   cpuhp_dead;
struct kobject  kobj;
 
-- 
1.8.3.1

Re: Hotplug

2018-04-11 Thread Jens Axboe

On 4/11/18 10:14 AM, Jan Kara wrote:
> Hello,
> 
> On Wed 11-04-18 08:11:13, Jens Axboe wrote:
>> On 4/11/18 7:58 AM, Jan Kara wrote:
>>> On Tue 10-04-18 11:17:46, Jens Axboe wrote:
>>>> Been running some tests and I keep running into issues with hotplug.
>>>> This looks similar to what Bart posted the other day, but it looks
>>>> more deeply rooted than just having to protect the queue in
>>>> generic_make_request_checks(). The test run is blktests,
>>>> block/001. Current -git doesn't survive it. I've seen at least two
>>>> different oopses, pasted below.
>>>>
>>>> [  102.163442] NULL pointer dereference at 0010
>>>> [  102.163444] PGD 0 P4D 0 
>>>> [  102.163447] Oops:  [#1] PREEMPT SMP
>>>> [  102.163449] Modules linked in:
>>>> [  102.175540] sr 12:0:0:0: [sr2] scsi-1 drive
>>>> [  102.180112]  scsi_debug crc_t10dif crct10dif_generic crct10dif_common 
>>>> nvme nvme_core sb_edac xl
>>>> [  102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2
>>>> [  102.191896]  sr_mod cdrom btrfs xor zstd_decompress zstd_compress 
>>>> xxhash lzo_compress zlib_defc
>>>> [  102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5
>>>> [  102.203475]  igb ahci libahci i2c_algo_bit libata dca [last unloaded: 
>>>> crc_t10dif]
>>>> [  102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ 
>>>> #650
>>>> [  102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
>>>> 11/09/2016
>>>> [  102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
>>>> [  102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292
>>>> [  102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: 
>>>> 883ff357bb66
>>>> [  102.373220] RDX: 0003 RSI: 7530 RDI: 
>>>> 881fea631000
>>>> [  102.381705] RBP:  R08: 881fe4d38400 R09: 
>>>> 
>>>> [  102.390185] R10:  R11: 01b6 R12: 
>>>> 085d
>>>> [  102.398671] R13: 085d R14: 883ffd9b3790 R15: 
>>>> 
>>>> [  102.407156] FS:  7f7dc8e6d8c0() GS:883fff34() 
>>>> knlGS:
>>>> [  102.417138] CS:  0010 DS:  ES:  CR0: 80050033
>>>> [  102.424066] CR2: 0010 CR3: 003ffda98005 CR4: 
>>>> 003606e0
>>>> [  102.432545] DR0:  DR1:  DR2: 
>>>> 
>>>> [  102.441024] DR3:  DR6: fffe0ff0 DR7: 
>>>> 0400
>>>> [  102.449502] Call Trace:
>>>> [  102.452744]  ? __invalidate_device+0x48/0x60
>>>> [  102.458022]  check_disk_change+0x4c/0x60
>>>> [  102.462900]  sr_block_open+0x16/0xd0 [sr_mod]
>>>> [  102.468270]  __blkdev_get+0xb9/0x450
>>>> [  102.472774]  ? iget5_locked+0x1c0/0x1e0
>>>> [  102.477568]  blkdev_get+0x11e/0x320
>>>> [  102.481969]  ? bdget+0x11d/0x150
>>>> [  102.486083]  ? _raw_spin_unlock+0xa/0x20
>>>> [  102.490968]  ? bd_acquire+0xc0/0xc0
>>>> [  102.495368]  do_dentry_open+0x1b0/0x320
>>>> [  102.500159]  ? inode_permission+0x24/0xc0
>>>> [  102.505140]  path_openat+0x4e6/0x1420
>>>> [  102.509741]  ? cpumask_any_but+0x1f/0x40
>>>> [  102.514630]  ? flush_tlb_mm_range+0xa0/0x120
>>>> [  102.519903]  do_filp_open+0x8c/0xf0
>>>> [  102.524305]  ? __seccomp_filter+0x28/0x230
>>>> [  102.529389]  ? _raw_spin_unlock+0xa/0x20
>>>> [  102.534283]  ? __handle_mm_fault+0x7d6/0x9b0
>>>> [  102.539559]  ? list_lru_add+0xa8/0xc0
>>>> [  102.544157]  ? _raw_spin_unlock+0xa/0x20
>>>> [  102.549047]  ? __alloc_fd+0xaf/0x160
>>>> [  102.553549]  ? do_sys_open+0x1a6/0x230
>>>> [  102.558244]  do_sys_open+0x1a6/0x230
>>>> [  102.562742]  do_syscall_64+0x5a/0x100
>>>> [  102.567336]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>>
>>> Interesting. Thinking out loud: This is cd->device dereference I guess
>>> which means disk->private_data was NULL. That gets set in sr_probe()
>>> together with disk->fops which are certainly set as they must have led us
>>> to the crashing function sr_block_revalidate_disk(). So likely
>>> disk->private_data go

Re: Hotplug

2018-04-11 Thread Jan Kara

Hello,

On Wed 11-04-18 08:11:13, Jens Axboe wrote:
> On 4/11/18 7:58 AM, Jan Kara wrote:
> > On Tue 10-04-18 11:17:46, Jens Axboe wrote:
> >> Been running some tests and I keep running into issues with hotplug.
> >> This looks similar to what Bart posted the other day, but it looks
> >> more deeply rooted than just having to protect the queue in
> >> generic_make_request_checks(). The test run is blktests,
> >> block/001. Current -git doesn't survive it. I've seen at least two
> >> different oopses, pasted below.
> >>
> >> [  102.163442] NULL pointer dereference at 0010
> >> [  102.163444] PGD 0 P4D 0 
> >> [  102.163447] Oops:  [#1] PREEMPT SMP
> >> [  102.163449] Modules linked in:
> >> [  102.175540] sr 12:0:0:0: [sr2] scsi-1 drive
> >> [  102.180112]  scsi_debug crc_t10dif crct10dif_generic crct10dif_common 
> >> nvme nvme_core sb_edac xl
> >> [  102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2
> >> [  102.191896]  sr_mod cdrom btrfs xor zstd_decompress zstd_compress 
> >> xxhash lzo_compress zlib_defc
> >> [  102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5
> >> [  102.203475]  igb ahci libahci i2c_algo_bit libata dca [last unloaded: 
> >> crc_t10dif]
> >> [  102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ 
> >> #650
> >> [  102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
> >> 11/09/2016
> >> [  102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
> >> [  102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292
> >> [  102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: 
> >> 883ff357bb66
> >> [  102.373220] RDX: 0003 RSI: 7530 RDI: 
> >> 881fea631000
> >> [  102.381705] RBP:  R08: 881fe4d38400 R09: 
> >> 
> >> [  102.390185] R10:  R11: 01b6 R12: 
> >> 085d
> >> [  102.398671] R13: 085d R14: 883ffd9b3790 R15: 
> >> 
> >> [  102.407156] FS:  7f7dc8e6d8c0() GS:883fff34() 
> >> knlGS:
> >> [  102.417138] CS:  0010 DS:  ES:  CR0: 80050033
> >> [  102.424066] CR2: 0010 CR3: 003ffda98005 CR4: 
> >> 003606e0
> >> [  102.432545] DR0:  DR1:  DR2: 
> >> 
> >> [  102.441024] DR3:  DR6: fffe0ff0 DR7: 
> >> 0400
> >> [  102.449502] Call Trace:
> >> [  102.452744]  ? __invalidate_device+0x48/0x60
> >> [  102.458022]  check_disk_change+0x4c/0x60
> >> [  102.462900]  sr_block_open+0x16/0xd0 [sr_mod]
> >> [  102.468270]  __blkdev_get+0xb9/0x450
> >> [  102.472774]  ? iget5_locked+0x1c0/0x1e0
> >> [  102.477568]  blkdev_get+0x11e/0x320
> >> [  102.481969]  ? bdget+0x11d/0x150
> >> [  102.486083]  ? _raw_spin_unlock+0xa/0x20
> >> [  102.490968]  ? bd_acquire+0xc0/0xc0
> >> [  102.495368]  do_dentry_open+0x1b0/0x320
> >> [  102.500159]  ? inode_permission+0x24/0xc0
> >> [  102.505140]  path_openat+0x4e6/0x1420
> >> [  102.509741]  ? cpumask_any_but+0x1f/0x40
> >> [  102.514630]  ? flush_tlb_mm_range+0xa0/0x120
> >> [  102.519903]  do_filp_open+0x8c/0xf0
> >> [  102.524305]  ? __seccomp_filter+0x28/0x230
> >> [  102.529389]  ? _raw_spin_unlock+0xa/0x20
> >> [  102.534283]  ? __handle_mm_fault+0x7d6/0x9b0
> >> [  102.539559]  ? list_lru_add+0xa8/0xc0
> >> [  102.544157]  ? _raw_spin_unlock+0xa/0x20
> >> [  102.549047]  ? __alloc_fd+0xaf/0x160
> >> [  102.553549]  ? do_sys_open+0x1a6/0x230
> >> [  102.558244]  do_sys_open+0x1a6/0x230
> >> [  102.562742]  do_syscall_64+0x5a/0x100
> >> [  102.567336]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > 
> > Interesting. Thinking out loud: This is cd->device dereference I guess
> > which means disk->private_data was NULL. That gets set in sr_probe()
> > together with disk->fops which are certainly set as they must have led us
> > to the crashing function sr_block_revalidate_disk(). So likely
> > disk->private_data got already cleared. That happens in sr_kref_release()
> > and the fact that that function got called means struct scsi_cd went away -
> > so sr_remove() must have been called as well. That all seems possible like:
> > 
> > CPU1CPU2
> >

Re: Hotplug

2018-04-11 Thread Jens Axboe

On 4/11/18 7:58 AM, Jan Kara wrote:
> Hi,
> 
> On Tue 10-04-18 11:17:46, Jens Axboe wrote:
>> Been running some tests and I keep running into issues with hotplug.
>> This looks similar to what Bart posted the other day, but it looks
>> more deeply rooted than just having to protect the queue in
>> generic_make_request_checks(). The test run is blktests,
>> block/001. Current -git doesn't survive it. I've seen at least two
>> different oopses, pasted below.
>>
>> [  102.163442] NULL pointer dereference at 0010
>> [  102.163444] PGD 0 P4D 0 
>> [  102.163447] Oops:  [#1] PREEMPT SMP
>> [  102.163449] Modules linked in:
>> [  102.175540] sr 12:0:0:0: [sr2] scsi-1 drive
>> [  102.180112]  scsi_debug crc_t10dif crct10dif_generic crct10dif_common 
>> nvme nvme_core sb_edac xl
>> [  102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2
>> [  102.191896]  sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash 
>> lzo_compress zlib_defc
>> [  102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5
>> [  102.203475]  igb ahci libahci i2c_algo_bit libata dca [last unloaded: 
>> crc_t10dif]
>> [  102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650
>> [  102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
>> 11/09/2016
>> [  102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
>> [  102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292
>> [  102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: 
>> 883ff357bb66
>> [  102.373220] RDX: 0003 RSI: 7530 RDI: 
>> 881fea631000
>> [  102.381705] RBP:  R08: 881fe4d38400 R09: 
>> 
>> [  102.390185] R10:  R11: 01b6 R12: 
>> 085d
>> [  102.398671] R13: 085d R14: 883ffd9b3790 R15: 
>> 
>> [  102.407156] FS:  7f7dc8e6d8c0() GS:883fff34() 
>> knlGS:
>> [  102.417138] CS:  0010 DS:  ES:  CR0: 80050033
>> [  102.424066] CR2: 0010 CR3: 003ffda98005 CR4: 
>> 003606e0
>> [  102.432545] DR0:  DR1:  DR2: 
>> 
>> [  102.441024] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [  102.449502] Call Trace:
>> [  102.452744]  ? __invalidate_device+0x48/0x60
>> [  102.458022]  check_disk_change+0x4c/0x60
>> [  102.462900]  sr_block_open+0x16/0xd0 [sr_mod]
>> [  102.468270]  __blkdev_get+0xb9/0x450
>> [  102.472774]  ? iget5_locked+0x1c0/0x1e0
>> [  102.477568]  blkdev_get+0x11e/0x320
>> [  102.481969]  ? bdget+0x11d/0x150
>> [  102.486083]  ? _raw_spin_unlock+0xa/0x20
>> [  102.490968]  ? bd_acquire+0xc0/0xc0
>> [  102.495368]  do_dentry_open+0x1b0/0x320
>> [  102.500159]  ? inode_permission+0x24/0xc0
>> [  102.505140]  path_openat+0x4e6/0x1420
>> [  102.509741]  ? cpumask_any_but+0x1f/0x40
>> [  102.514630]  ? flush_tlb_mm_range+0xa0/0x120
>> [  102.519903]  do_filp_open+0x8c/0xf0
>> [  102.524305]  ? __seccomp_filter+0x28/0x230
>> [  102.529389]  ? _raw_spin_unlock+0xa/0x20
>> [  102.534283]  ? __handle_mm_fault+0x7d6/0x9b0
>> [  102.539559]  ? list_lru_add+0xa8/0xc0
>> [  102.544157]  ? _raw_spin_unlock+0xa/0x20
>> [  102.549047]  ? __alloc_fd+0xaf/0x160
>> [  102.553549]  ? do_sys_open+0x1a6/0x230
>> [  102.558244]  do_sys_open+0x1a6/0x230
>> [  102.562742]  do_syscall_64+0x5a/0x100
>> [  102.567336]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> 
> Interesting. Thinking out loud: This is cd->device dereference I guess
> which means disk->private_data was NULL. That gets set in sr_probe()
> together with disk->fops which are certainly set as they must have led us
> to the crashing function sr_block_revalidate_disk(). So likely
> disk->private_data got already cleared. That happens in sr_kref_release()
> and the fact that that function got called means struct scsi_cd went away -
> so sr_remove() must have been called as well. That all seems possible like:
> 
> CPU1  CPU2
> sr_probe()
>   __blkdev_get()
> disk = bdev_get_gendisk();
> 
> sr_remove()
>   del_gendisk()
>   ...
>   kref_put(&cd->kref, sr_kref_release);
> disk->private_data = NULL;
> put_disk(disk);
> kfree(cd);
> if (disk->fops->open) {
>   ret = disk->fops->open(bdev, mode); => sr_block_open
> check_disk_change(bdev);
>

Re: Hotplug

2018-04-11 Thread Jens Axboe

On 4/11/18 8:06 AM, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 15:58 +0200, Jan Kara wrote:
>> I'm not really sure where this is crashing and 'Code' line is incomplete to
>> tell me.
> 
> Hello Jan,
> 
> The following patch should fix this crash:
> https://www.mail-archive.com/linux-block@vger.kernel.org/msg20209.html.

Yeah, I forgot the link in my reply, thanks.

-- 
Jens Axboe

Re: Hotplug

2018-04-11 Thread Bart Van Assche

On Wed, 2018-04-11 at 15:58 +0200, Jan Kara wrote:
> I'm not really sure where this is crashing and 'Code' line is incomplete to
> tell me.

Hello Jan,

The following patch should fix this crash:
https://www.mail-archive.com/linux-block@vger.kernel.org/msg20209.html.

Thanks,

Bart.

Re: Hotplug

2018-04-11 Thread Jan Kara

Hi,

On Tue 10-04-18 11:17:46, Jens Axboe wrote:
> Been running some tests and I keep running into issues with hotplug.
> This looks similar to what Bart posted the other day, but it looks
> more deeply rooted than just having to protect the queue in
> generic_make_request_checks(). The test run is blktests,
> block/001. Current -git doesn't survive it. I've seen at least two
> different oopses, pasted below.
> 
> [  102.163442] NULL pointer dereference at 0010
> [  102.163444] PGD 0 P4D 0 
> [  102.163447] Oops:  [#1] PREEMPT SMP
> [  102.163449] Modules linked in:
> [  102.175540] sr 12:0:0:0: [sr2] scsi-1 drive
> [  102.180112]  scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme 
> nvme_core sb_edac xl
> [  102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2
> [  102.191896]  sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash 
> lzo_compress zlib_defc
> [  102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5
> [  102.203475]  igb ahci libahci i2c_algo_bit libata dca [last unloaded: 
> crc_t10dif]
> [  102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650
> [  102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
> 11/09/2016
> [  102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
> [  102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292
> [  102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: 
> 883ff357bb66
> [  102.373220] RDX: 0003 RSI: 7530 RDI: 
> 881fea631000
> [  102.381705] RBP:  R08: 881fe4d38400 R09: 
> 
> [  102.390185] R10:  R11: 01b6 R12: 
> 085d
> [  102.398671] R13: 085d R14: 883ffd9b3790 R15: 
> 
> [  102.407156] FS:  7f7dc8e6d8c0() GS:883fff34() 
> knlGS:
> [  102.417138] CS:  0010 DS:  ES:  CR0: 80050033
> [  102.424066] CR2: 0010 CR3: 003ffda98005 CR4: 
> 003606e0
> [  102.432545] DR0:  DR1:  DR2: 
> 
> [  102.441024] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  102.449502] Call Trace:
> [  102.452744]  ? __invalidate_device+0x48/0x60
> [  102.458022]  check_disk_change+0x4c/0x60
> [  102.462900]  sr_block_open+0x16/0xd0 [sr_mod]
> [  102.468270]  __blkdev_get+0xb9/0x450
> [  102.472774]  ? iget5_locked+0x1c0/0x1e0
> [  102.477568]  blkdev_get+0x11e/0x320
> [  102.481969]  ? bdget+0x11d/0x150
> [  102.486083]  ? _raw_spin_unlock+0xa/0x20
> [  102.490968]  ? bd_acquire+0xc0/0xc0
> [  102.495368]  do_dentry_open+0x1b0/0x320
> [  102.500159]  ? inode_permission+0x24/0xc0
> [  102.505140]  path_openat+0x4e6/0x1420
> [  102.509741]  ? cpumask_any_but+0x1f/0x40
> [  102.514630]  ? flush_tlb_mm_range+0xa0/0x120
> [  102.519903]  do_filp_open+0x8c/0xf0
> [  102.524305]  ? __seccomp_filter+0x28/0x230
> [  102.529389]  ? _raw_spin_unlock+0xa/0x20
> [  102.534283]  ? __handle_mm_fault+0x7d6/0x9b0
> [  102.539559]  ? list_lru_add+0xa8/0xc0
> [  102.544157]  ? _raw_spin_unlock+0xa/0x20
> [  102.549047]  ? __alloc_fd+0xaf/0x160
> [  102.553549]  ? do_sys_open+0x1a6/0x230
> [  102.558244]  do_sys_open+0x1a6/0x230
> [  102.562742]  do_syscall_64+0x5a/0x100
> [  102.567336]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Interesting. Thinking out loud: This is cd->device dereference I guess
which means disk->private_data was NULL. That gets set in sr_probe()
together with disk->fops which are certainly set as they must have led us
to the crashing function sr_block_revalidate_disk(). So likely
disk->private_data got already cleared. That happens in sr_kref_release()
and the fact that that function got called means struct scsi_cd went away -
so sr_remove() must have been called as well. That all seems possible like:

CPU1CPU2
sr_probe()
__blkdev_get()
  disk = bdev_get_gendisk();

sr_remove()
  del_gendisk()
  ...
  kref_put(&cd->kref, sr_kref_release);
disk->private_data = NULL;
put_disk(disk);
kfree(cd);
  if (disk->fops->open) {
ret = disk->fops->open(bdev, mode); => sr_block_open
  check_disk_change(bdev);
sr_block_revalidate_disk()
  CRASH

And I think the problem is in sr_block_revalidate_disk() itself as the
scsi_cd() call is not guaranteed that the caller holds reference to 'cd'
and thus that 'cd' does not disappear under it. IMHO it needs to use
scsi_cd_get() to get struct scsi_cd from gendisk. Am I missing something?

> and this one, similar to Barts:
> 
> [ 4676.73806

Hotplug

2018-04-10 Thread Jens Axboe

Hi,

Been running some tests and I keep running into issues with hotplug.
This looks similar to what Bart posted the other day, but it looks
more deeply rooted than just having to protect the queue in
generic_make_request_checks(). The test run is blktests,
block/001. Current -git doesn't survive it. I've seen at least two
different oopses, pasted below.

[  102.163442] NULL pointer dereference at 0010
[  102.163444] PGD 0 P4D 0 
[  102.163447] Oops:  [#1] PREEMPT SMP
[  102.163449] Modules linked in:
[  102.175540] sr 12:0:0:0: [sr2] scsi-1 drive
[  102.180112]  scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme 
nvme_core sb_edac xl
[  102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2
[  102.191896]  sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash 
lzo_compress zlib_defc
[  102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5
[  102.203475]  igb ahci libahci i2c_algo_bit libata dca [last unloaded: 
crc_t10dif]
[  102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650
[  102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
11/09/2016
[  102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
[  102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292
[  102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: 883ff357bb66
[  102.373220] RDX: 0003 RSI: 7530 RDI: 881fea631000
[  102.381705] RBP:  R08: 881fe4d38400 R09: 
[  102.390185] R10:  R11: 01b6 R12: 085d
[  102.398671] R13: 085d R14: 883ffd9b3790 R15: 
[  102.407156] FS:  7f7dc8e6d8c0() GS:883fff34() 
knlGS:
[  102.417138] CS:  0010 DS:  ES:  CR0: 80050033
[  102.424066] CR2: 0010 CR3: 003ffda98005 CR4: 003606e0
[  102.432545] DR0:  DR1:  DR2: 
[  102.441024] DR3:  DR6: fffe0ff0 DR7: 0400
[  102.449502] Call Trace:
[  102.452744]  ? __invalidate_device+0x48/0x60
[  102.458022]  check_disk_change+0x4c/0x60
[  102.462900]  sr_block_open+0x16/0xd0 [sr_mod]
[  102.468270]  __blkdev_get+0xb9/0x450
[  102.472774]  ? iget5_locked+0x1c0/0x1e0
[  102.477568]  blkdev_get+0x11e/0x320
[  102.481969]  ? bdget+0x11d/0x150
[  102.486083]  ? _raw_spin_unlock+0xa/0x20
[  102.490968]  ? bd_acquire+0xc0/0xc0
[  102.495368]  do_dentry_open+0x1b0/0x320
[  102.500159]  ? inode_permission+0x24/0xc0
[  102.505140]  path_openat+0x4e6/0x1420
[  102.509741]  ? cpumask_any_but+0x1f/0x40
[  102.514630]  ? flush_tlb_mm_range+0xa0/0x120
[  102.519903]  do_filp_open+0x8c/0xf0
[  102.524305]  ? __seccomp_filter+0x28/0x230
[  102.529389]  ? _raw_spin_unlock+0xa/0x20
[  102.534283]  ? __handle_mm_fault+0x7d6/0x9b0
[  102.539559]  ? list_lru_add+0xa8/0xc0
[  102.544157]  ? _raw_spin_unlock+0xa/0x20
[  102.549047]  ? __alloc_fd+0xaf/0x160
[  102.553549]  ? do_sys_open+0x1a6/0x230
[  102.558244]  do_sys_open+0x1a6/0x230
[  102.562742]  do_syscall_64+0x5a/0x100
[  102.567336]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

and this one, similar to Barts:

[ 4676.738069] NULL pointer dereference at 0154
[ 4676.738071] PGD 0 P4D 0 
[ 4676.738073] Oops:  [#1] PREEMPT SMP
[ 4676.738075] Modules linked in: scsi_debug crc_t10dif nvme nvme_core loop 
configfs crct10dif_ge]
[ 4676.859272] CPU: 10 PID: 16598 Comm: systemd-udevd Not tainted 4.16.0+ #651
[ 4676.867525] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
11/09/2016
[ 4676.876765] RIP: 0010:blk_throtl_bio+0x45/0x9b0
[ 4676.882296] RSP: 0018:881ff0c8bb38 EFLAGS: 00010246
[ 4676.888610] RAX:  RBX: 881ffa273a40 RCX: 
[ 4676.897059] RDX: 881ffa273a40 RSI:  RDI: 
[ 4676.905507] RBP:  R08: 881ffa273ac0 R09: 881ff123f458
[ 4676.913955] R10: 881ff0c8bc70 R11: 1000 R12: 
[ 4676.922402] R13: 82600980 R14: 88208113 R15: 
[ 4676.930856] FS:  7fe63e5228c0() GS:881fff74() 
knlGS:
[ 4676.940773] CS:  0010 DS:  ES:  CR0: 80050033
[ 4676.947667] CR2: 0154 CR3: 001fed2a0003 CR4: 003606e0
[ 4676.956116] DR0:  DR1:  DR2: 
[ 4676.964568] DR3:  DR6: fffe0ff0 DR7: 0400
[ 4676.973021] Call Trace:
[ 4676.976229]  generic_make_request_checks+0x640/0x660
[ 4676.982245]  ? kmem_cache_alloc+0x37/0x190
[ 4676.987295]  generic_make_request+0x29/0x2f0
[ 4676.992534]  ? submit_bio+0x5c/0x110
[ 4676.996998]  ? bio_alloc_bioset+0x99/0x1c0
[ 4677.002050]  submit_bio+0x5c/0x110
[ 4677.006317]  ? guard_bio_eod+0x42/0x120
[ 4677.011074]  submit_bh_wbc.isra.49+0x132/0x160
[ 4677.016517]  ? bh_uptodate_or_lock+0x90/0x90
[ 4

Re: blk-mq and CPU hotplug error

2018-03-26 Thread jianchao.wang

Hi Jose

On 03/27/2018 05:25 AM, jos...@linux.vnet.ibm.com wrote:
> Hello everyone!
> 
> I'm running Ubuntu 18.04 (4.15.0-12-generic) in KVM/QEMU (powerpc64le).
> Everything looks good until I try to hotplug CPUs in my VM. As soon as I
> do that I get the following error in my VM dmesg:

Please refer to the following commit:
20e4d813 blk-mq: simplify queue mapping & schedule with each possisble CPU
84676c1 genirq/affinity: assign vectors to all possible CPUs

Thanks
Jianchao

blk-mq and CPU hotplug error

2018-03-26 Thread joserz

Hello everyone!

I'm running Ubuntu 18.04 (4.15.0-12-generic) in KVM/QEMU (powerpc64le).
Everything looks good until I try to hotplug CPUs in my VM. As soon as I
do that I get the following error in my VM dmesg:

[  763.629425] WARNING: CPU: 34 PID: 2276 at 
/build/linux-zBVy54/linux-4.15.0/block/blk-mq.c:1211 
__blk_mq_run_hw_queue+0xf0/0x140
...
[  763.629497] CPU: 34 PID: 2276 Comm: kworker/34:1H Not tainted 
4.15.0-12-generic #13-Ubuntu
[  763.629501] Workqueue: kblockd blk_mq_run_work_fn

See that blk-mq complains about CPU 34. This CPU is one of the
hotplugged CPUs.

Looking at mq/0/cpu_list, I see the my new CPUs aren't in that list:

# cat /sys/block/vda/mq/0/cpu_list
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 
22, 23, 24, 25, 26, 27, 28, 29, 30, 31

But the real problem comes after it, if I try to run any command that
performs intensive IO operations, like:

# dd if=/dev/random of=test.img bs=1M count=100

I start to get errors like:

[  967.661543] INFO: task kworker/u128:3:454 blocked for more than 120 seconds.
[  967.661656]   Tainted: GW4.15.0-12-generic #13-Ubuntu
[  967.661705] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  967.661762] kworker/u128:3  D0   454  2 0x0800
[  967.661802] Workqueue: writeback wb_workfn (flush-252:0)
[  967.661805] Call Trace:
[  967.661808] [c00267e5f1d0] [0022] 0x22 (unreliable)
[  967.661813] [c00267e5f3a0] [c001c220] __switch_to+0x2a0/0x4d0
[  967.661823] [c00267e5f400] [c0d035e4] __schedule+0x2a4/0xaf0
[  967.661825] [c00267e5f4d0] [c0d03e70] schedule+0x40/0xc0
[  967.661834] [c00267e5f4f0] [c014f77c] io_schedule+0x2c/0x50
[  967.661838] [c00267e5f520] [c02d810c] __lock_page+0x14c/0x1c0
[  967.661842] [c00267e5f5c0] [c04c9288] 
mpage_prepare_extent_to_map+0x2c8/0x3d0
[  967.661844] [c00267e5f6c0] [c04d16bc] ext4_writepages+0x49c/0xfd0
[  967.661847] [c00267e5f870] [c02f1a9c] do_writepages+0x4c/0x130
[  967.661849] [c00267e5f8e0] [c041a370] 
__writeback_single_inode+0x70/0x570
[  967.661852] [c00267e5f940] [c041af0c] 
writeback_sb_inodes+0x26c/0x5d0
[  967.661854] [c00267e5fa50] [c041b308] 
__writeback_inodes_wb+0x98/0x110
[  967.661856] [c00267e5fab0] [c041b72c] wb_writeback+0x29c/0x460
[  967.661859] [c00267e5fb80] [c041c38c] wb_workfn+0x1ec/0x600
[  967.661862] [c00267e5fc90] [c0131538] 
process_one_work+0x298/0x5a0
[  967.661864] [c00267e5fd20] [c01318d8] worker_thread+0x98/0x630
[  967.661866] [c00267e5fdc0] [c013a508] kthread+0x1a8/0x1b0
[  967.661869] [c00267e5fe30] [c000b528] 
ret_from_kernel_thread+0x5c/0xb4
[  967.661872] INFO: task jbd2/vda2-8:579 blocked for more than 120 seconds.
[  967.661924]   Tainted: GW4.15.0-12-generic #13-Ubuntu
[  967.661973] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  967.662035] jbd2/vda2-8 D0   579  2 0x0800
[  967.662037] Call Trace:
...snip

Which leads to an unresponsive VM that forces me to reboot it. After
searching the internet I changed the IO scheduler:

# cat /sys/block/vda/queue/scheduler
[none]

# modprobe bfq
# echo bfq > /sys/block/vda/queue/scheduler

# cat /sys/block/vda/queue/scheduler
[bfq] none

which seems to be enough (it still throws that WARNING, but no
hung_task_timeout_secs so far).

Now I'm not sure on how to attack the problem. At first I thought that
CPU hotplug IRQ should also update the cpu_list but now I think that the
problem could be in the queue, maybe it should give up scheduling tasks
on the failing CPU and try another one. What do you people think?

Any suggestion on how to debug it is welcome as well.

Thank you!

Jose R. Ziviani

Re: [PATCH V2 0/2] blk-mq: two patches related with CPU hotplug

2018-01-17 Thread Jens Axboe

On 1/17/18 9:41 AM, Ming Lei wrote:
> Hi Jens,
> 
> The 1st one fixes one regression caused by 20e4d81393 ("blk-mq: simplify
> queue mapping & schedule with each possisble CPU").
> 
> The 2nd one add comments for current two races, and convert the WARN_ON
> into printk(KERN_WARN) with dump_stack() since this warning is harmless.
> 
> V2:
>   - fix comment and commit log in patch 1
>   - use dump_stack() with printk to replace WARN_ON() in patch 2

Applied, thanks.

-- 
Jens Axboe

[PATCH V2 0/2] blk-mq: two patches related with CPU hotplug

2018-01-17 Thread Ming Lei

Hi Jens,

The 1st one fixes one regression caused by 20e4d81393 ("blk-mq: simplify
queue mapping & schedule with each possisble CPU").

The 2nd one add comments for current two races, and convert the WARN_ON
into printk(KERN_WARN) with dump_stack() since this warning is harmless.

V2:
- fix comment and commit log in patch 1
- use dump_stack() with printk to replace WARN_ON() in patch 2

Thanks,

Ming Lei (2):
  blk-mq: make sure hctx->next_cpu is set correctly
  blk-mq: avoid one WARN_ON in __blk_mq_run_hw_queue to printk

 block/blk-mq.c | 52 
 1 file changed, 48 insertions(+), 4 deletions(-)

-- 
2.9.5

[PATCH 0/2] blk-mq: two patches related with CPU hotplug

2018-01-17 Thread Ming Lei

Hi Jens,

The 1st one fixes one regression caused by 20e4d81393 ("blk-mq: simplify
queue mapping & schedule with each possisble CPU").

The 2nd one add commens about current race, and convert the WARN_ON
into printk(KERN_WARN) since this warning is harmless.

Thanks,

Ming Lei (2):
  blk-mq: make sure hctx->next_cpu is set correctly
  blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk

 block/blk-mq.c | 51 +++
 1 file changed, 47 insertions(+), 4 deletions(-)

-- 
2.9.5

Re: [PATCH 0/2] blk-mq: support physical CPU hotplug

2018-01-12 Thread Jens Axboe

On 1/11/18 7:53 PM, Ming Lei wrote:
> Hi,
> 
> This two patches support physical CPU hotplug, so that we can make blk-mq
> scale well when new physical CPU is added or removed, and this use case
> is normal for VM world.
> 
> Also this patchset fixes the following warning reported by Christian
> Borntraeger:
> 
>   https://marc.info/?l=linux-block&m=151092973417143&w=2
> 
> Christoph Hellwig (2):
>   genirq/affinity: assign vectors to all possible CPUs
>   blk-mq: simplify queue mapping & schedule with each possisble CPU

Applied, thanks Ming.

-- 
Jens Axboe

Re: [PATCH 0/2] blk-mq: support physical CPU hotplug

2018-01-12 Thread Johannes Thumshirn

On Fri, Jan 12, 2018 at 09:12:24AM +0100, Christian Borntraeger wrote:
> I think we also need cc stable for this. The bug was introduced with 
>  
> commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream.
> ("blk-mq: Create hctx for each present CPU")
> 
> and that was even backported into 4.12 stable.

Yes, and Ming (or Jens?), can you please add a Fixes tag as well? It helps
with identifying patches which fix backports.

Thanks,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [PATCH 0/2] blk-mq: support physical CPU hotplug

2018-01-12 Thread Christian Borntraeger

I think we also need cc stable for this. The bug was introduced with 
 
commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream.
("blk-mq: Create hctx for each present CPU")

and that was even backported into 4.12 stable.



On 01/12/2018 03:53 AM, Ming Lei wrote:
> Hi,
> 
> This two patches support physical CPU hotplug, so that we can make blk-mq
> scale well when new physical CPU is added or removed, and this use case
> is normal for VM world.
> 
> Also this patchset fixes the following warning reported by Christian
> Borntraeger:
> 
>   https://marc.info/?l=linux-block&m=151092973417143&w=2
> 
> Christoph Hellwig (2):
>   genirq/affinity: assign vectors to all possible CPUs
>   blk-mq: simplify queue mapping & schedule with each possisble CPU
> 
>  block/blk-mq.c| 19 ---
>  kernel/irq/affinity.c | 30 +++---
>  2 files changed, 23 insertions(+), 26 deletions(-)
>

[PATCH 0/2] blk-mq: support physical CPU hotplug

2018-01-11 Thread Ming Lei

Hi,

This two patches support physical CPU hotplug, so that we can make blk-mq
scale well when new physical CPU is added or removed, and this use case
is normal for VM world.

Also this patchset fixes the following warning reported by Christian
Borntraeger:

https://marc.info/?l=linux-block&m=151092973417143&w=2

Christoph Hellwig (2):
  genirq/affinity: assign vectors to all possible CPUs
  blk-mq: simplify queue mapping & schedule with each possisble CPU

 block/blk-mq.c| 19 ---
 kernel/irq/affinity.c | 30 +++---
 2 files changed, 23 insertions(+), 26 deletions(-)

-- 
2.9.5

Re: [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events

2017-06-16 Thread Thomas Gleixner

On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> +static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
> + unsigned int cpu)
> +{
> + const struct cpumask *affinity;
> + struct irq_data *data;
> + struct irq_chip *chip;
> + unsigned long flags;
> + cpumask_var_t mask;
> +
> + if (!desc)
> + return;
> + if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
> + return;

That's silly. Why want you to do that for each irq descriptor? That's
outright silly. Either you allocated it in irq_affinity_[on|off]line_cpu()
once or just make it a static cpumask. Lemme fix that up.

Thanks,

tglx

Re: [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events

2017-06-16 Thread Thomas Gleixner

On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> +static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
> + unsigned int cpu)
> +{
> +
> + cpumask_and(mask, affinity, cpu_online_mask);
> + cpumask_set_cpu(cpu, mask);
> + if (irqd_has_set(data, IRQD_AFFINITY_SUSPENDED)) {
> + irq_startup(desc, false);
> + irqd_clear(data, IRQD_AFFINITY_SUSPENDED);
> + } else {
> + irq_affinity_set(irq, desc, mask);

I don't think this should be forced. In that case we really can use the
regular mechanism.

Thanks,

tglx

[PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events

2017-06-03 Thread Christoph Hellwig

Remove a CPU from the affinity mask when it goes offline and add it
back when it returns.  In case the vetor was assigned only to the CPU
going offline it will be shutdown and re-started when the CPU
reappears.

Signed-off-by: Christoph Hellwig 
---
 arch/x86/kernel/irq.c  |   3 +-
 include/linux/cpuhotplug.h |   1 +
 include/linux/irq.h|   9 
 kernel/cpu.c   |   6 +++
 kernel/irq/affinity.c  | 127 -
 5 files changed, 144 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index a54eac5d81b3..72c35ed534f1 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -453,7 +453,8 @@ void fixup_irqs(void)
 
data = irq_desc_get_irq_data(desc);
affinity = irq_data_get_affinity_mask(data);
-   if (!irq_has_action(irq) || irqd_is_per_cpu(data) ||
+   if (irqd_affinity_is_managed(data) ||
+   !irq_has_action(irq) || irqd_is_per_cpu(data) ||
cpumask_subset(affinity, cpu_online_mask)) {
raw_spin_unlock(&desc->lock);
continue;
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 0f2a80377520..c15f22c54535 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -124,6 +124,7 @@ enum cpuhp_state {
CPUHP_AP_ONLINE_IDLE,
CPUHP_AP_SMPBOOT_THREADS,
CPUHP_AP_X86_VDSO_VMA_ONLINE,
+   CPUHP_AP_IRQ_AFFINITY_ONLINE,
CPUHP_AP_PERF_ONLINE,
CPUHP_AP_PERF_X86_ONLINE,
CPUHP_AP_PERF_X86_UNCORE_ONLINE,
diff --git a/include/linux/irq.h b/include/linux/irq.h
index f887351aa80e..ae15b8582685 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -216,6 +216,7 @@ enum {
IRQD_WAKEUP_ARMED   = (1 << 19),
IRQD_FORWARDED_TO_VCPU  = (1 << 20),
IRQD_AFFINITY_MANAGED   = (1 << 21),
+   IRQD_AFFINITY_SUSPENDED = (1 << 22),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -329,6 +330,11 @@ static inline void irqd_clr_activated(struct irq_data *d)
__irqd_to_state(d) &= ~IRQD_ACTIVATED;
 }
 
+static inline bool irqd_affinity_is_suspended(struct irq_data *d)
+{
+   return __irqd_to_state(d) & IRQD_AFFINITY_SUSPENDED;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
@@ -1025,4 +1031,7 @@ int __ipi_send_mask(struct irq_desc *desc, const struct 
cpumask *dest);
 int ipi_send_single(unsigned int virq, unsigned int cpu);
 int ipi_send_mask(unsigned int virq, const struct cpumask *dest);
 
+int irq_affinity_online_cpu(unsigned int cpu);
+int irq_affinity_offline_cpu(unsigned int cpu);
+
 #endif /* _LINUX_IRQ_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 9ae6fbe5b5cf..ef0c5b63ca0d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #define CREATE_TRACE_POINTS
@@ -1252,6 +1253,11 @@ static struct cpuhp_step cpuhp_ap_states[] = {
.startup.single = smpboot_unpark_threads,
.teardown.single= NULL,
},
+   [CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
+   .name   = "irq/affinity:online",
+   .startup.single = irq_affinity_online_cpu,
+   .teardown.single= irq_affinity_offline_cpu,
+   },
[CPUHP_AP_PERF_ONLINE] = {
.name   = "perf:online",
.startup.single = perf_event_init_cpu,
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 337e6ffba93f..e27ecfb4866f 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -1,4 +1,7 @@
-
+/*
+ * Copyright (C) 2016 Thomas Gleixner.
+ * Copyright (C) 2016-2017 Christoph Hellwig.
+ */
 #include 
 #include 
 #include 
@@ -227,3 +230,125 @@ int irq_calc_affinity_vectors(int maxvec, const struct 
irq_affinity *affd)
 
return min_t(int, cpumask_weight(cpu_present_mask), vecs) + resv;
 }
+
+static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
+   unsigned int cpu)
+{
+   const struct cpumask *affinity;
+   struct irq_data *data;
+   struct irq_chip *chip;
+   unsigned long flags;
+   cpumask_var_t mask;
+
+   if (!desc)
+   return;
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+   return;
+
+   raw_spin_lock_irqsave(&desc->lock, flags);
+
+   data = irq_desc_get_irq_data(desc);
+   affinity = irq_data_get_affinity_mask(data);
+   if (!irqd_affinity_is_managed(data) ||
+   !irq_has_action(irq) ||
+   !cpumask_test_cpu(cpu, affinity))
+   goto out_free_cpumask;
+
+   /*
+* The interrupt descriptor might have been cleaned up
+* already, but it is not yet removed from the

[PATCH 4/7] genirq/affinity: update CPU affinity for CPU hotplug events

2017-05-19 Thread Christoph Hellwig

Remove a CPU from the affinity mask when it goes offline and add it
back when it returns.  In case the vetor was assigned only to the CPU
going offline it will be shutdown and re-started when the CPU
reappears.

Signed-off-by: Christoph Hellwig 
---
 arch/x86/kernel/irq.c  |   3 +-
 include/linux/cpuhotplug.h |   1 +
 include/linux/irq.h|   9 
 kernel/cpu.c   |   6 +++
 kernel/irq/affinity.c  | 129 -
 5 files changed, 146 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index a54eac5d81b3..72c35ed534f1 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -453,7 +453,8 @@ void fixup_irqs(void)
 
data = irq_desc_get_irq_data(desc);
affinity = irq_data_get_affinity_mask(data);
-   if (!irq_has_action(irq) || irqd_is_per_cpu(data) ||
+   if (irqd_affinity_is_managed(data) ||
+   !irq_has_action(irq) || irqd_is_per_cpu(data) ||
cpumask_subset(affinity, cpu_online_mask)) {
raw_spin_unlock(&desc->lock);
continue;
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 0f2a80377520..c15f22c54535 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -124,6 +124,7 @@ enum cpuhp_state {
CPUHP_AP_ONLINE_IDLE,
CPUHP_AP_SMPBOOT_THREADS,
CPUHP_AP_X86_VDSO_VMA_ONLINE,
+   CPUHP_AP_IRQ_AFFINITY_ONLINE,
CPUHP_AP_PERF_ONLINE,
CPUHP_AP_PERF_X86_ONLINE,
CPUHP_AP_PERF_X86_UNCORE_ONLINE,
diff --git a/include/linux/irq.h b/include/linux/irq.h
index f887351aa80e..ae15b8582685 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -216,6 +216,7 @@ enum {
IRQD_WAKEUP_ARMED   = (1 << 19),
IRQD_FORWARDED_TO_VCPU  = (1 << 20),
IRQD_AFFINITY_MANAGED   = (1 << 21),
+   IRQD_AFFINITY_SUSPENDED = (1 << 22),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -329,6 +330,11 @@ static inline void irqd_clr_activated(struct irq_data *d)
__irqd_to_state(d) &= ~IRQD_ACTIVATED;
 }
 
+static inline bool irqd_affinity_is_suspended(struct irq_data *d)
+{
+   return __irqd_to_state(d) & IRQD_AFFINITY_SUSPENDED;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
@@ -1025,4 +1031,7 @@ int __ipi_send_mask(struct irq_desc *desc, const struct 
cpumask *dest);
 int ipi_send_single(unsigned int virq, unsigned int cpu);
 int ipi_send_mask(unsigned int virq, const struct cpumask *dest);
 
+int irq_affinity_online_cpu(unsigned int cpu);
+int irq_affinity_offline_cpu(unsigned int cpu);
+
 #endif /* _LINUX_IRQ_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 9ae6fbe5b5cf..ef0c5b63ca0d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #define CREATE_TRACE_POINTS
@@ -1252,6 +1253,11 @@ static struct cpuhp_step cpuhp_ap_states[] = {
.startup.single = smpboot_unpark_threads,
.teardown.single= NULL,
},
+   [CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
+   .name   = "irq/affinity:online",
+   .startup.single = irq_affinity_online_cpu,
+   .teardown.single= irq_affinity_offline_cpu,
+   },
[CPUHP_AP_PERF_ONLINE] = {
.name   = "perf:online",
.startup.single = perf_event_init_cpu,
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index d58431f59f7c..809a7d241eff 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -1,8 +1,13 @@
-
+/*
+ * Copyright (C) 2016 Thomas Gleixner.
+ * Copyright (C) 2016-2017 Christoph Hellwig.
+ */
 #include 
 #include 
 #include 
 #include 
+#include 
+#include "internals.h"
 
 static cpumask_var_t node_to_present_cpumask[MAX_NUMNODES] __read_mostly;
 
@@ -176,6 +181,128 @@ bool irq_affinity_set(int irq, struct irq_desc *desc, 
const cpumask_t *mask)
return ret;
 }
 
+static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
+   unsigned int cpu)
+{
+   const struct cpumask *affinity;
+   struct irq_data *data;
+   struct irq_chip *chip;
+   unsigned long flags;
+   cpumask_var_t mask;
+
+   if (!desc)
+   return;
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+   return;
+
+   raw_spin_lock_irqsave(&desc->lock, flags);
+
+   data = irq_desc_get_irq_data(desc);
+   affinity = irq_data_get_affinity_mask(data);
+   if (!irqd_affinity_is_managed(data) ||
+   !irq_has_action(irq) ||
+   !cpumask_test_cpu(cpu, affinity))
+   goto out_free_cpumask;
+
+   /*
+* The interrupt descriptor might have

Re: [PATCH 3/6] genirq/affinity: update CPU affinity for CPU hotplug events

2017-02-10 Thread Thomas Gleixner

On Fri, 3 Feb 2017, Christoph Hellwig wrote:
> @@ -127,6 +127,7 @@ enum cpuhp_state {
>   CPUHP_AP_ONLINE_IDLE,
>   CPUHP_AP_SMPBOOT_THREADS,
>   CPUHP_AP_X86_VDSO_VMA_ONLINE,
> + CPUHP_AP_IRQ_AFFINIY_ONLINE,

s/AFFINIY/AFFINITY/ perhaps?

> +static void __irq_affinity_set(unsigned int irq, struct irq_desc *desc,
> + cpumask_t *mask)

static void __irq_affinity_set(unsigned int irq, struct irq_desc *desc,
   cpumask_t *mask)

Please

> +{
> + struct irq_data *data = irq_desc_get_irq_data(desc);
> + struct irq_chip *chip = irq_data_get_irq_chip(data);
> + int ret;
> +
> + if (!irqd_can_move_in_process_context(data) && chip->irq_mask)
> + chip->irq_mask(data);
> + ret = chip->irq_set_affinity(data, mask, true);
> + WARN_ON_ONCE(ret);
> +
> + /*
> +  * We unmask if the irq was not marked masked by the core code.
> +  * That respects the lazy irq disable behaviour.
> +  */
> + if (!irqd_can_move_in_process_context(data) &&
> + !irqd_irq_masked(data) && chip->irq_unmask)
> + chip->irq_unmask(data);
> +}

This looks very familiar. arch/x86/kernel/irq.c comes to mind

> +
> +static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
> + unsigned int cpu)
> +{
> + const struct cpumask *affinity;
> + struct irq_data *data;
> + struct irq_chip *chip;
> + unsigned long flags;
> + cpumask_var_t mask;
> +
> + if (!desc)
> + return;
> +
> + raw_spin_lock_irqsave(&desc->lock, flags);
> +
> + data = irq_desc_get_irq_data(desc);
> + affinity = irq_data_get_affinity_mask(data);
> + if (!irqd_affinity_is_managed(data) ||
> + !irq_has_action(irq) ||
> + !cpumask_test_cpu(cpu, affinity))
> + goto out_unlock;
> +
> + /*
> +  * The interrupt descriptor might have been cleaned up
> +  * already, but it is not yet removed from the radix tree
> +  */
> + chip = irq_data_get_irq_chip(data);
> + if (!chip)
> + goto out_unlock;
> +
> + if (WARN_ON_ONCE(!chip->irq_set_affinity))
> + goto out_unlock;
> +
> + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {

You really want to allocate that _BEFORE_ locking desc->lock. GFP_KERNEL
inside the lock held region is wrong and shows that this was never tested :)

And no, we don't want GFP_ATOMIC here. You can allocate is once at the call
site and hand it in, so you avoid the alloc/free dance when iterating over
a large number of descriptors.

> + pr_err("failed to allocate memory for cpumask\n");
> + goto out_unlock;
> + }
> +
> + cpumask_and(mask, affinity, cpu_online_mask);
> + cpumask_set_cpu(cpu, mask);
> + if (irqd_has_set(data, IRQD_AFFINITY_SUSPENDED)) {
> + irq_startup(desc, false);
> + irqd_clear(data, IRQD_AFFINITY_SUSPENDED);
> + } else {
> + __irq_affinity_set(irq, desc, mask);
> + }
> +
> + free_cpumask_var(mask);
> +out_unlock:
> + raw_spin_unlock_irqrestore(&desc->lock, flags);
> +}



> +int irq_affinity_online_cpu(unsigned int cpu)
> +{
> + struct irq_desc *desc;
> + unsigned int irq;
> +
> + for_each_irq_desc(irq, desc)
> + irq_affinity_online_irq(irq, desc, cpu);

That lacks protection against concurrent irq setup/teardown. Wants to be
protected with irq_lock_sparse()

> + return 0;
> +}
> +
> +static void irq_affinity_offline_irq(unsigned int irq, struct irq_desc *desc,
> + unsigned int cpu)
> +{
> + const struct cpumask *affinity;
> + struct irq_data *data;
> + struct irq_chip *chip;
> + unsigned long flags;
> + cpumask_var_t mask;
> +
> + if (!desc)
> + return;
> +
> + raw_spin_lock_irqsave(&desc->lock, flags);
> +
> + data = irq_desc_get_irq_data(desc);
> + affinity = irq_data_get_affinity_mask(data);
> + if (!irqd_affinity_is_managed(data) ||
> + !irq_has_action(irq) ||
> + irqd_has_set(data, IRQD_AFFINITY_SUSPENDED) ||
> + !cpumask_test_cpu(cpu, affinity))
> + goto out_unlock;
> +
> + /*
> +  * Complete the irq move. This cpu is going down and for
> +  * non intr-remapping case, we can't wait till this interrupt
> +  * arrives at this cpu before completing the irq move.
> +  */
> + irq_force_complete_move(desc);

Hmm. That's what we do in x86 when the cpu is really dying, i.e. before it
really goes away. It's the last resort we have.

So if a move is pending, then you force it here and then you call
__irq_affinity_set() further down, which queues another pending move, which
then gets cleaned up in the cpu dying code.

If a move is pending, then you should first verify whether the pending
affinity mask is different from the one you are going to set. If it's the
same, you can just let the final cleanup code do its job. If not, then you
need to

Re: [PATCH 3/6] genirq/affinity: update CPU affinity for CPU hotplug events

2017-02-03 Thread kbuild test robot

Hi Christoph,

[auto build test ERROR on block/for-next]
[also build test ERROR on v4.10-rc6]
[cannot apply to next-20170203]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Christoph-Hellwig/genirq-allow-assigning-affinity-to-present-but-not-online-CPUs/20170203-224056
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git 
for-next
config: arm-socfpga_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm 

All error/warnings (new ones prefixed by >>):

   In file included from kernel/irq/internals.h:8:0,
from kernel/irq/affinity.c:9:
>> include/linux/irqdesc.h:52:25: error: field 'irq_common_data' has incomplete 
>> type
 struct irq_common_data irq_common_data;
^~~
>> include/linux/irqdesc.h:53:19: error: field 'irq_data' has incomplete type
 struct irq_data  irq_data;
  ^~~~
>> include/linux/irqdesc.h:55:2: error: unknown type name 'irq_flow_handler_t'
 irq_flow_handler_t handle_irq;
 ^~
   In file included from include/linux/interrupt.h:5:0,
from kernel/irq/affinity.c:5:
   include/linux/irqdesc.h: In function 'irq_data_to_desc':
>> include/linux/irqdesc.h:111:26: error: dereferencing pointer to incomplete 
>> type 'struct irq_data'
 return container_of(data->common, struct irq_desc, irq_common_data);
 ^
   include/linux/kernel.h:850:49: note: in definition of macro 'container_of'
 const typeof( ((type *)0)->member ) *__mptr = (ptr); \
^~~
   In file included from kernel/irq/internals.h:8:0,
from kernel/irq/affinity.c:9:
   include/linux/irqdesc.h: In function 'generic_handle_irq_desc':
>> include/linux/irqdesc.h:150:2: error: called object is not a function or 
>> function pointer
 desc->handle_irq(desc);
 ^~~~
   include/linux/irqdesc.h: At top level:
   include/linux/irqdesc.h:194:8: error: unknown type name 'irq_flow_handler_t'
   irq_flow_handler_t handler)
   ^~
   include/linux/irqdesc.h:215:6: error: unknown type name 'irq_flow_handler_t'
 irq_flow_handler_t handler, const char *name)
 ^~
   include/linux/irqdesc.h: In function 'irq_balancing_disabled':
>> include/linux/irqdesc.h:229:38: error: 'IRQ_NO_BALANCING_MASK' undeclared 
>> (first use in this function)
 return desc->status_use_accessors & IRQ_NO_BALANCING_MASK;
 ^
   include/linux/irqdesc.h:229:38: note: each undeclared identifier is reported 
only once for each function it appears in
   include/linux/irqdesc.h: In function 'irq_is_percpu':
>> include/linux/irqdesc.h:237:38: error: 'IRQ_PER_CPU' undeclared (first use 
>> in this function)
 return desc->status_use_accessors & IRQ_PER_CPU;
 ^~~
   In file included from kernel/irq/internals.h:62:0,
from kernel/irq/affinity.c:9:
   kernel/irq/debug.h: In function 'print_irq_desc':
>> kernel/irq/debug.h:16:28: warning: format '%p' expects argument of type 
>> 'void *', but argument 2 has type 'int' [-Wformat=]
 printk("->handle_irq():  %p, ", desc->handle_irq);
   ^
>> kernel/irq/debug.h:26:7: error: 'IRQ_LEVEL' undeclared (first use in this 
>> function)
 ___P(IRQ_LEVEL);
  ^
   kernel/irq/debug.h:7:50: note: in definition of macro '___P'
#define ___P(f) if (desc->status_use_accessors & f) printk("%14s set\n", #f)
 ^
>> kernel/irq/debug.h:27:7: error: 'IRQ_PER_CPU' undeclared (first use in this 
>> function)
 ___P(IRQ_PER_CPU);
  ^
   kernel/irq/debug.h:7:50: note: in definition of macro '___P'
#define ___P(f) if (desc->status_use_accessors & f) printk("%14s set\n", #f)
 ^
>> kernel/irq/debug.h:28:7: error: 'IRQ_NOPROBE' undeclared (first use in this 
>> function)
 ___P(IRQ_NOPROBE);
  ^
   kernel/irq/debug.h:7:50: note: in definition of macro '___P'
#define ___P(f) if (desc->status_use_accessors & f) printk("%14s set\n", #f)
 ^
>> kernel/irq/debug.h:29:7: error: 'IRQ_NOREQUEST' undeclared (first use in 
>> this function)
 ___P(IRQ_NOREQUEST);
  ^
   kernel/irq/debug.h:7:50: note: in definition of macro '___P'
#define ___P(f) if (desc->status_use_accessors & f) printk("%14s set\n", #f)

Re: [PATCH 3/6] genirq/affinity: update CPU affinity for CPU hotplug events

2017-02-03 Thread kbuild test robot

Hi Christoph,

[auto build test ERROR on block/for-next]
[also build test ERROR on v4.10-rc6]
[cannot apply to next-20170203]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Christoph-Hellwig/genirq-allow-assigning-affinity-to-present-but-not-online-CPUs/20170203-224056
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git 
for-next
config: ia64-allmodconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 6.2.0
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=ia64 

All errors (new ones prefixed by >>):

   kernel/irq/affinity.c: In function 'irq_affinity_offline_irq':
>> kernel/irq/affinity.c:264:2: error: implicit declaration of function 
>> 'irq_force_complete_move' [-Werror=implicit-function-declaration]
 irq_force_complete_move(desc);
 ^~~
   cc1: some warnings being treated as errors

vim +/irq_force_complete_move +264 kernel/irq/affinity.c

   258  
   259  /*
   260   * Complete the irq move. This cpu is going down and for
   261   * non intr-remapping case, we can't wait till this interrupt
   262   * arrives at this cpu before completing the irq move.
   263   */
 > 264  irq_force_complete_move(desc);
   265  
   266  /*
   267   * The interrupt descriptor might have been cleaned up

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip

[PATCH 3/6] genirq/affinity: update CPU affinity for CPU hotplug events

2017-02-03 Thread Christoph Hellwig

Remove a CPU from the affinity mask when it goes offline and add it
back when it returns.  In case the vetor was assigned only to the CPU
going offline it will be shutdown and re-started when the CPU
reappears.

Signed-off-by: Christoph Hellwig 
---
 arch/x86/kernel/irq.c  |   3 +-
 include/linux/cpuhotplug.h |   1 +
 include/linux/irq.h|   9 +++
 kernel/cpu.c   |   6 ++
 kernel/irq/affinity.c  | 157 -
 5 files changed, 174 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 7c6e9ffe4424..285ef40ae290 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -449,7 +449,8 @@ void fixup_irqs(void)
 
data = irq_desc_get_irq_data(desc);
affinity = irq_data_get_affinity_mask(data);
-   if (!irq_has_action(irq) || irqd_is_per_cpu(data) ||
+   if (irqd_affinity_is_managed(data) ||
+   !irq_has_action(irq) || irqd_is_per_cpu(data) ||
cpumask_subset(affinity, cpu_online_mask)) {
raw_spin_unlock(&desc->lock);
continue;
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index d936a0021839..63406ae5b2df 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -127,6 +127,7 @@ enum cpuhp_state {
CPUHP_AP_ONLINE_IDLE,
CPUHP_AP_SMPBOOT_THREADS,
CPUHP_AP_X86_VDSO_VMA_ONLINE,
+   CPUHP_AP_IRQ_AFFINIY_ONLINE,
CPUHP_AP_PERF_ONLINE,
CPUHP_AP_PERF_X86_ONLINE,
CPUHP_AP_PERF_X86_UNCORE_ONLINE,
diff --git a/include/linux/irq.h b/include/linux/irq.h
index e79875574b39..4b2a542b2591 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -214,6 +214,7 @@ enum {
IRQD_WAKEUP_ARMED   = (1 << 19),
IRQD_FORWARDED_TO_VCPU  = (1 << 20),
IRQD_AFFINITY_MANAGED   = (1 << 21),
+   IRQD_AFFINITY_SUSPENDED = (1 << 22),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -312,6 +313,11 @@ static inline bool irqd_affinity_is_managed(struct 
irq_data *d)
return __irqd_to_state(d) & IRQD_AFFINITY_MANAGED;
 }
 
+static inline bool irqd_affinity_is_suspended(struct irq_data *d)
+{
+   return __irqd_to_state(d) & IRQD_AFFINITY_SUSPENDED;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
@@ -989,4 +995,7 @@ int __ipi_send_mask(struct irq_desc *desc, const struct 
cpumask *dest);
 int ipi_send_single(unsigned int virq, unsigned int cpu);
 int ipi_send_mask(unsigned int virq, const struct cpumask *dest);
 
+int irq_affinity_online_cpu(unsigned int cpu);
+int irq_affinity_offline_cpu(unsigned int cpu);
+
 #endif /* _LINUX_IRQ_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 0a5f630f5c54..fe19af6a896b 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #define CREATE_TRACE_POINTS
@@ -1248,6 +1249,11 @@ static struct cpuhp_step cpuhp_ap_states[] = {
.startup.single = smpboot_unpark_threads,
.teardown.single= NULL,
},
+   [CPUHP_AP_IRQ_AFFINIY_ONLINE] = {
+   .name   = "irq/affinity:online",
+   .startup.single = irq_affinity_online_cpu,
+   .teardown.single= irq_affinity_offline_cpu,
+   },
[CPUHP_AP_PERF_ONLINE] = {
.name   = "perf:online",
.startup.single = perf_event_init_cpu,
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 6cd20a569359..74006167892d 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -1,8 +1,12 @@
-
+/*
+ * Copyright (C) 2016 Thomas Gleixner.
+ * Copyright (C) 2016-2017 Christoph Hellwig.
+ */
 #include 
 #include 
 #include 
 #include 
+#include "internals.h"
 
 static cpumask_var_t node_to_present_cpumask[MAX_NUMNODES] __read_mostly;
 
@@ -148,6 +152,157 @@ int irq_calc_affinity_vectors(int maxvec, const struct 
irq_affinity *affd)
return min_t(int, cpumask_weight(cpu_present_mask), vecs) + resv;
 }
 
+static void __irq_affinity_set(unsigned int irq, struct irq_desc *desc,
+   cpumask_t *mask)
+{
+   struct irq_data *data = irq_desc_get_irq_data(desc);
+   struct irq_chip *chip = irq_data_get_irq_chip(data);
+   int ret;
+
+   if (!irqd_can_move_in_process_context(data) && chip->irq_mask)
+   chip->irq_mask(data);
+   ret = chip->irq_set_affinity(data, mask, true);
+   WARN_ON_ONCE(ret);
+
+   /*
+* We unmask if the irq was not marked masked by the core code.
+* That respects the lazy irq disable behaviour.
+*/
+   if (!irqd_can_move_in_process_context(data) &&
+   !irqd_irq_masked(data) && chip->irq_unmask)
+   chip->irq_unmask(

Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page

2017-01-16 Thread John Hubbard




On 01/16/2017 04:58 AM, Anshuman Khandual wrote:

On 01/13/2017 04:13 AM, Dan Williams wrote:

Back when we were first attempting to support DMA for DAX mappings of
persistent memory the plan was to forgo 'struct page' completely and
develop a pfn-to-scatterlist capability for the dma-mapping-api. That
effort died in this thread:

https://lkml.org/lkml/2015/8/14/3

...where we learned that the dependencies on struct page for dma
mapping are deeper than a PFN_PHYS() conversion for some
architectures. That was the moment we pivoted to ZONE_DEVICE and
arranged for a 'struct page' to be available for any persistent memory
range that needs to be the target of DMA. ZONE_DEVICE enables any
device-driver that can target "System RAM" to also be able to target
persistent memory through a DAX mapping.

Since that time the "page-less" DAX path has continued to mature [1]
without growing new dependencies on struct page, but at the same time
continuing to rely on ZONE_DEVICE to satisfy get_user_pages().

Peer-to-peer DMA appears to be evolving from a niche embedded use case
to something general purpose platforms will need to comprehend. The
"map_peer_resource" [2] approach looks to be headed to the same
destination as the pfn-to-scatterlist effort. It's difficult to avoid
'struct page' for describing DMA operations without custom driver
code.

With that background, a statement and a question to discuss at LSF/MM:

General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
requires pfn_to_page() support across the entire physical address
range mapped.

Is ZONE_DEVICE the proper vehicle for this? We've already seen that it
collides with platform alignment assumptions [3], and if there's a
wider effort to rework memory hotplug [4] it seems DMA support should
be part of the discussion.


I had experimented with ZONE_DEVICE representation from migration point of
view. Tried migration of both anonymous pages as well as file cache pages
into and away from ZONE_DEVICE memory. Learned that the lack of 'page->lru'
element in the struct page of the ZONE_DEVICE memory makes it difficult
for it to represent file backed mapping in it's present form. But given


That reminds me: while testing out HMM in our device driver, we had some early difficulties with the 
LRU system (including pagevec) in general. For example, sometimes HMM was forced to say "I cannot 
migrate your page range, because a page is still on the very most recently used list". If the number 
of pages was very small, then *all* the pages might be on that list. :)  HMM avoids the problem by 
forcing it, but it reminds me that the LRU and pagevec were never really intended to intersect with 
device memory.


Another point that may seem unrelated at first: using struct pages and pfns to back device memory is 
still under discussion:


   a)  Need to avoid using pfns that can ever be needed for other hotpluggable 
memory

   b) *Very* hard to justify adding any fields to struct page, or flags for it, 
of course.

...but given this new-ish requirement to support these types of devices, maybe (b) actually makes 
sense. Something to discuss.


thanks,
John Hubbard
NVIDIA



that ZONE_DEVICE was created to enable direct mapping (DAX) bypassing page
cache, it came as no surprise. My objective has been how ZONE_DEVICE can
accommodate movable coherent device memory. In our HMM discussions I had
brought to the attention how ZONE_DEVICE going forward should evolve to
represent all these three types of device memory.

* Unmovable addressable device memory   (persistent memory)
* Movable addressable device memory (similar memory represented as CDM)
* Movable un-addressable device memory  (similar memory represented as HMM)

I would like to attend to discuss on the road map for ZONE_DEVICE, struct
pages and device memory in general.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: mailto:"d...@kvack.org";> em...@kvack.org 



---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page

2017-01-16 Thread Anshuman Khandual

On 01/13/2017 04:13 AM, Dan Williams wrote:
> Back when we were first attempting to support DMA for DAX mappings of
> persistent memory the plan was to forgo 'struct page' completely and
> develop a pfn-to-scatterlist capability for the dma-mapping-api. That
> effort died in this thread:
> 
> https://lkml.org/lkml/2015/8/14/3
> 
> ...where we learned that the dependencies on struct page for dma
> mapping are deeper than a PFN_PHYS() conversion for some
> architectures. That was the moment we pivoted to ZONE_DEVICE and
> arranged for a 'struct page' to be available for any persistent memory
> range that needs to be the target of DMA. ZONE_DEVICE enables any
> device-driver that can target "System RAM" to also be able to target
> persistent memory through a DAX mapping.
> 
> Since that time the "page-less" DAX path has continued to mature [1]
> without growing new dependencies on struct page, but at the same time
> continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
> 
> Peer-to-peer DMA appears to be evolving from a niche embedded use case
> to something general purpose platforms will need to comprehend. The
> "map_peer_resource" [2] approach looks to be headed to the same
> destination as the pfn-to-scatterlist effort. It's difficult to avoid
> 'struct page' for describing DMA operations without custom driver
> code.
> 
> With that background, a statement and a question to discuss at LSF/MM:
> 
> General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
> requires pfn_to_page() support across the entire physical address
> range mapped.
> 
> Is ZONE_DEVICE the proper vehicle for this? We've already seen that it
> collides with platform alignment assumptions [3], and if there's a
> wider effort to rework memory hotplug [4] it seems DMA support should
> be part of the discussion.

I had experimented with ZONE_DEVICE representation from migration point of
view. Tried migration of both anonymous pages as well as file cache pages
into and away from ZONE_DEVICE memory. Learned that the lack of 'page->lru'
element in the struct page of the ZONE_DEVICE memory makes it difficult
for it to represent file backed mapping in it's present form. But given
that ZONE_DEVICE was created to enable direct mapping (DAX) bypassing page
cache, it came as no surprise. My objective has been how ZONE_DEVICE can
accommodate movable coherent device memory. In our HMM discussions I had
brought to the attention how ZONE_DEVICE going forward should evolve to
represent all these three types of device memory.

* Unmovable addressable device memory   (persistent memory)
* Movable addressable device memory (similar memory represented as CDM)
* Movable un-addressable device memory  (similar memory represented as HMM)

I would like to attend to discuss on the road map for ZONE_DEVICE, struct
pages and device memory in general. 

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page

2017-01-12 Thread Dan Williams

On Thu, Jan 12, 2017 at 3:14 PM, Jerome Glisse  wrote:
> On Thu, Jan 12, 2017 at 02:43:03PM -0800, Dan Williams wrote:
>> Back when we were first attempting to support DMA for DAX mappings of
>> persistent memory the plan was to forgo 'struct page' completely and
>> develop a pfn-to-scatterlist capability for the dma-mapping-api. That
>> effort died in this thread:
>>
>> https://lkml.org/lkml/2015/8/14/3
>>
>> ...where we learned that the dependencies on struct page for dma
>> mapping are deeper than a PFN_PHYS() conversion for some
>> architectures. That was the moment we pivoted to ZONE_DEVICE and
>> arranged for a 'struct page' to be available for any persistent memory
>> range that needs to be the target of DMA. ZONE_DEVICE enables any
>> device-driver that can target "System RAM" to also be able to target
>> persistent memory through a DAX mapping.
>>
>> Since that time the "page-less" DAX path has continued to mature [1]
>> without growing new dependencies on struct page, but at the same time
>> continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
>>
>> Peer-to-peer DMA appears to be evolving from a niche embedded use case
>> to something general purpose platforms will need to comprehend. The
>> "map_peer_resource" [2] approach looks to be headed to the same
>> destination as the pfn-to-scatterlist effort. It's difficult to avoid
>> 'struct page' for describing DMA operations without custom driver
>> code.
>>
>> With that background, a statement and a question to discuss at LSF/MM:
>>
>> General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
>> requires pfn_to_page() support across the entire physical address
>> range mapped.
>
> Note that in my case it is even worse. The pfn of the page does not
> correspond to anything so it need to go through a special function
> to find if a page can be mapped for another device and to provide a
> valid pfn at which the page can be access by other device.

I still haven't quite wrapped my head about how these pfn ranges are
created. Would this be a use case for a new pfn_t flag? It doesn't
sound like something we'd want to risk describing with raw 'unsigned
long' pfns.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page

2017-01-12 Thread Jerome Glisse

On Thu, Jan 12, 2017 at 02:43:03PM -0800, Dan Williams wrote:
> Back when we were first attempting to support DMA for DAX mappings of
> persistent memory the plan was to forgo 'struct page' completely and
> develop a pfn-to-scatterlist capability for the dma-mapping-api. That
> effort died in this thread:
> 
> https://lkml.org/lkml/2015/8/14/3
> 
> ...where we learned that the dependencies on struct page for dma
> mapping are deeper than a PFN_PHYS() conversion for some
> architectures. That was the moment we pivoted to ZONE_DEVICE and
> arranged for a 'struct page' to be available for any persistent memory
> range that needs to be the target of DMA. ZONE_DEVICE enables any
> device-driver that can target "System RAM" to also be able to target
> persistent memory through a DAX mapping.
> 
> Since that time the "page-less" DAX path has continued to mature [1]
> without growing new dependencies on struct page, but at the same time
> continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
> 
> Peer-to-peer DMA appears to be evolving from a niche embedded use case
> to something general purpose platforms will need to comprehend. The
> "map_peer_resource" [2] approach looks to be headed to the same
> destination as the pfn-to-scatterlist effort. It's difficult to avoid
> 'struct page' for describing DMA operations without custom driver
> code.
> 
> With that background, a statement and a question to discuss at LSF/MM:
> 
> General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
> requires pfn_to_page() support across the entire physical address
> range mapped.

Note that in my case it is even worse. The pfn of the page does not
correspond to anything so it need to go through a special function
to find if a page can be mapped for another device and to provide a
valid pfn at which the page can be access by other device.

Basicly the PCIE bar is like a window into the device memory that is
dynamicly remap to specific page of the device memory. Not all device
memory can be expose through PCIE bar because of PCIE issues.

> 
> Is ZONE_DEVICE the proper vehicle for this? We've already seen that it
> collides with platform alignment assumptions [3], and if there's a
> wider effort to rework memory hotplug [4] it seems DMA support should
> be part of the discussion.

Obvioulsy i would like to join this discussion :)

Cheers,
Jérôme
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page

2017-01-12 Thread Dan Williams

Back when we were first attempting to support DMA for DAX mappings of
persistent memory the plan was to forgo 'struct page' completely and
develop a pfn-to-scatterlist capability for the dma-mapping-api. That
effort died in this thread:

https://lkml.org/lkml/2015/8/14/3

...where we learned that the dependencies on struct page for dma
mapping are deeper than a PFN_PHYS() conversion for some
architectures. That was the moment we pivoted to ZONE_DEVICE and
arranged for a 'struct page' to be available for any persistent memory
range that needs to be the target of DMA. ZONE_DEVICE enables any
device-driver that can target "System RAM" to also be able to target
persistent memory through a DAX mapping.

Since that time the "page-less" DAX path has continued to mature [1]
without growing new dependencies on struct page, but at the same time
continuing to rely on ZONE_DEVICE to satisfy get_user_pages().

Peer-to-peer DMA appears to be evolving from a niche embedded use case
to something general purpose platforms will need to comprehend. The
"map_peer_resource" [2] approach looks to be headed to the same
destination as the pfn-to-scatterlist effort. It's difficult to avoid
'struct page' for describing DMA operations without custom driver
code.

With that background, a statement and a question to discuss at LSF/MM:

General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
requires pfn_to_page() support across the entire physical address
range mapped.

Is ZONE_DEVICE the proper vehicle for this? We've already seen that it
collides with platform alignment assumptions [3], and if there's a
wider effort to rework memory hotplug [4] it seems DMA support should
be part of the discussion.

---

This topic focuses on the mechanism to enable pfn_to_page() for an
arbitrary physical address range, and the proposed peer-to-peer DMA
topic [5] touches on the userspace presentation of this mechanism. I
might be good to combine these topics if there's interest? In any
event, I'm interested in both as well Michal's concern about memory
hotplug in general.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-November/007672.html
[2]: http://www.spinics.net/lists/linux-pci/msg44560.html
[3]: https://lkml.org/lkml/2016/12/1/740
[4]: http://www.spinics.net/lists/linux-mm/msg119369.html
[5]: http://marc.info/?l=linux-mm&m=148156541804940&w=2
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] blk-mq: Drop explicit timeout sync in hotplug

2016-11-29 Thread Jens Axboe


On 11/28/2016 10:01 AM, Gabriel Krisman Bertazi wrote:

Sorry for the dup.  Missed linux-block address.


8


After commit 287922eb0b18 ("block: defer timeouts to a workqueue"),
deleting the timeout work after freezing the queue shouldn't be
necessary, since the synchronization is already enforced by the
acquisition of a q_usage_counter reference in blk_mq_timeout_work.


Added for 4.10, thanks.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] blk-mq: Drop explicit timeout sync in hotplug

2016-11-28 Thread Ming Lei

On Tue, Nov 29, 2016 at 1:01 AM, Gabriel Krisman Bertazi
 wrote:
> Sorry for the dup.  Missed linux-block address.
>
>>8
>
> After commit 287922eb0b18 ("block: defer timeouts to a workqueue"),
> deleting the timeout work after freezing the queue shouldn't be
> necessary, since the synchronization is already enforced by the
> acquisition of a q_usage_counter reference in blk_mq_timeout_work.
>
> Signed-off-by: Gabriel Krisman Bertazi 
> Cc: Ming Lei 
> ---
>  block/blk-mq.c | 9 +
>  1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 9d4a1d630d0b..bac12caece06 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2220,16 +2220,9 @@ static void blk_mq_queue_reinit_work(void)
>  */
> list_for_each_entry(q, &all_q_list, all_q_node)
> blk_mq_freeze_queue_start(q);
> -   list_for_each_entry(q, &all_q_list, all_q_node) {
> +   list_for_each_entry(q, &all_q_list, all_q_node)
> blk_mq_freeze_queue_wait(q);
>
> -   /*
> -* timeout handler can't touch hw queue during the
> -* reinitialization
> -*/
> -   del_timer_sync(&q->timeout);
> -   }
> -
> list_for_each_entry(q, &all_q_list, all_q_node)
> blk_mq_queue_reinit(q, &cpuhp_online_new);

Reviewed-by: Ming Lei 

Thanks,
Ming
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] blk-mq: Drop explicit timeout sync in hotplug

2016-11-28 Thread Gabriel Krisman Bertazi

Sorry for the dup.  Missed linux-block address.

>8

After commit 287922eb0b18 ("block: defer timeouts to a workqueue"),
deleting the timeout work after freezing the queue shouldn't be
necessary, since the synchronization is already enforced by the
acquisition of a q_usage_counter reference in blk_mq_timeout_work.

Signed-off-by: Gabriel Krisman Bertazi 
Cc: Ming Lei 
---
 block/blk-mq.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9d4a1d630d0b..bac12caece06 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2220,16 +2220,9 @@ static void blk_mq_queue_reinit_work(void)
 */
list_for_each_entry(q, &all_q_list, all_q_node)
blk_mq_freeze_queue_start(q);
-   list_for_each_entry(q, &all_q_list, all_q_node) {
+   list_for_each_entry(q, &all_q_list, all_q_node)
blk_mq_freeze_queue_wait(q);
 
-   /*
-* timeout handler can't touch hw queue during the
-* reinitialization
-*/
-   del_timer_sync(&q->timeout);
-   }
-
list_for_each_entry(q, &all_q_list, all_q_node)
blk_mq_queue_reinit(q, &cpuhp_online_new);
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[GIT PULL] blk-mq CPU hotplug

2016-10-07 Thread Jens Axboe


Hi Linus,

Last topical branch, this is the conversion of blk-mq to the new hotplug
state machine. This should merge cleanly.

Please pull!


  git://git.kernel.dk/linux-block.git for-4.9/block-smp



Jens Axboe (1):
  Merge branch 'smp/for-block' of git://git.kernel.org/.../tip/tip 
into for-4.9/block-smp


Sebastian Andrzej Siewior (2):
  blk-mq: Convert to new hotplug state machine
  blk-mq: fixup "Convert to new hotplug state machine"

Thomas Gleixner (1):
  blk-mq/cpu-notif: Convert to new hotplug state machine

 arch/x86/kernel/smpboot.c|  11 --
 block/Makefile   |   2 +-
 block/blk-mq-cpu.c   |  67 -
 block/blk-mq.c   | 123 +++-
 block/blk-mq.h   |   7 -
 include/linux/blk-mq.h   |   8 +-
 include/linux/cpuhotplug.h   | 112 +-
 include/trace/events/cpuhp.h |  28 
 kernel/cpu.c | 344 
+++

 9 files changed, 444 insertions(+), 258 deletions(-)
 delete mode 100644 block/blk-mq-cpu.c


--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch V2 0/2] block/mq: Convert to new hotplug state machine

2016-09-22 Thread Jens Axboe


On 09/22/2016 07:57 AM, Christoph Hellwing wrote:

On Tue, Sep 20, 2016 at 09:48:39PM -0600, Jens Axboe wrote:

On 09/20/2016 09:21 AM, Thomas Gleixner wrote:

The following series converts block/mq to the new hotplug state
machine. The series is against block.git/for-next and depends on

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/for-block

This branch contains the necessary infrastructure for multi-instance
callbacks which allows us to remove blk-mq-cpu.c completely. It can be
pulled into the block tree.


Thanks, I'll pull this in for 4.9.


I still can't find it in your tree despite other commits today.


Didn't get to it yesterday, done now.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch V2 0/2] block/mq: Convert to new hotplug state machine

2016-09-22 Thread Christoph Hellwing

On Tue, Sep 20, 2016 at 09:48:39PM -0600, Jens Axboe wrote:
> On 09/20/2016 09:21 AM, Thomas Gleixner wrote:
>> The following series converts block/mq to the new hotplug state
>> machine. The series is against block.git/for-next and depends on
>>
>>git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/for-block
>>
>> This branch contains the necessary infrastructure for multi-instance
>> callbacks which allows us to remove blk-mq-cpu.c completely. It can be
>> pulled into the block tree.
>
> Thanks, I'll pull this in for 4.9.

I still can't find it in your tree despite other commits today.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Query] increased latency observed in cpu hotplug path

2016-09-22 Thread Khan, Imran

On 9/21/2016 9:26 PM, Akinobu Mita wrote:
> 2016-09-21 23:06 GMT+09:00 Khan, Imran :
> 
>>>> commit 084ee793ec1ff4e2171b481642bfbdaa2e10664c
>>>> Author: Imran Khan 
>>>> Date:   Thu Sep 15 16:44:02 2016 +0530
>>>>
>>>> blk-mq: use static mapping
>>>>
>>>> blk-mq layer performs a remapping between s/w and h/w contexts and also
>>>> between h/w contexts and CPUs, whenever a CPU hotplug event happens.
>>>> This remapping has to wait for queue freezing which may take tens of
>>>> miliseconds, resulting in a high latency in CPU hotplug path.
>>>> This patch makes the above mentioned mappings static so that we can
>>>> avoid remapping when CPU hotplug event happens and this results in
>>>> improved CPU hotplug latency.
>>>>
>>>> Change-Id: Idf38cb6c4e78c91fda3c86608c6d0441f01ab435
>>>> Signed-off-by: Imran Khan 
>>>>
>>>> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
>>>> index 8764c24..85fdb88 100644
>>>> --- a/block/blk-mq-cpumap.c
>>>> +++ b/block/blk-mq-cpumap.c
>>>> @@ -52,18 +52,13 @@ int blk_mq_update_queue_map(unsigned int *map, 
>>>> unsigned int nr_queues,
>>>>
>>>>  queue = 0;
>>>>  for_each_possible_cpu(i) {
>>>> -if (!cpumask_test_cpu(i, online_mask)) {
>>>> -map[i] = 0;
>>>> -continue;
>>>> -}
>>>> -
>>>>  /*
>>>>   * Easy case - we have equal or more hardware queues. Or
>>>>   * there are no thread siblings to take into account. Do
>>>>   * 1:1 if enough, or sequential mapping if less.
>>>>   */
>>>> -if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) {
>>>> -map[i] = cpu_to_queue_index(nr_cpus, nr_queues, 
>>>> queue);
>>>> +if (nr_queues >= nr_cpu_ids) {
>>>> +map[i] = cpu_to_queue_index(nr_cpu_ids, nr_queues, 
>>>> queue);
>>>>  queue++;
>>>>  continue;
>>>>  }
>>>> @@ -75,7 +70,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned 
>>>> int nr_queues,
>>>>   */
>>>>  first_sibling = get_first_sibling(i);
>>>>  if (first_sibling == i) {
>>>> -map[i] = cpu_to_queue_index(nr_uniq_cpus, nr_queues,
>>>> +map[i] = cpu_to_queue_index(nr_cpu_ids, nr_queues,
>>>>  queue);
>>>>  queue++;
>>>>  } else
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index 6d6f8fe..0753fdf 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -1783,10 +1783,6 @@ static void blk_mq_init_cpu_queues(struct 
>>>> request_queue *q,
>>>>  INIT_LIST_HEAD(&__ctx->rq_list);
>>>>  __ctx->queue = q;
>>>>
>>>> -/* If the cpu isn't online, the cpu is mapped to first hctx */
>>>> -if (!cpu_online(i))
>>>> -continue;
>>>> -
>>>>  hctx = q->mq_ops->map_queue(q, i);
>>>>
>>>>  /*
>>>> @@ -1820,12 +1816,9 @@ static void blk_mq_map_swqueue(struct request_queue 
>>>> *q,
>>>>   * Map software to hardware queues
>>>>   */
>>>>  queue_for_each_ctx(q, ctx, i) {
>>>> -/* If the cpu isn't online, the cpu is mapped to first hctx */
>>>> -if (!cpumask_test_cpu(i, online_mask))
>>>> -continue;
>>>> -
>>>>  hctx = q->mq_ops->map_queue(q, i);
>>>> -cpumask_set_cpu(i, hctx->cpumask);
>>>> +if (cpumask_test_cpu(i, online_mask))
>>>> +cpumask_set_cpu(i, hctx->cpumask);
>>>>  ctx->index_hw = hctx->nr_ctx;
>>>>  hctx->ctxs[hctx->nr_ctx++] = ctx;
>>>>  }
>>>> @@ -1863,17 +1856,22 @@ static void blk_mq_map_swqueue(struct 
>>>> reque

Re: [patch V2 0/2] block/mq: Convert to new hotplug state machine

2016-09-20 Thread Jens Axboe


On 09/20/2016 09:21 AM, Thomas Gleixner wrote:

The following series converts block/mq to the new hotplug state
machine. The series is against block.git/for-next and depends on

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/for-block

This branch contains the necessary infrastructure for multi-instance
callbacks which allows us to remove blk-mq-cpu.c completely. It can be
pulled into the block tree.


Thanks, I'll pull this in for 4.9.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch V2 1/2] blk/mq/cpu-notif: Convert to hotplug state machine

2016-09-20 Thread Thomas Gleixner

Replace the block-mq notifier list management with the multi instance
facility in the cpu hotplug state machine.

Signed-off-by: Thomas Gleixner 
Cc: Jens Axboe 
Cc: Peter Zijlstra 
Cc: linux-block@vger.kernel.org
Cc: r...@linutronix.de
Cc: Christoph Hellwing 

---

 block/Makefile |2 -
 block/blk-mq-cpu.c |   67 -
 block/blk-mq.c |   36 +-
 block/blk-mq.h |7 -
 include/linux/blk-mq.h |8 -
 5 files changed, 15 insertions(+), 105 deletions(-)

--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-lib.o blk-mq.o blk-mq-tag.o \
-   blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
+   blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
 
--- a/block/blk-mq-cpu.c
+++ /dev/null
@@ -1,67 +0,0 @@
-/*
- * CPU notifier helper code for blk-mq
- *
- * Copyright (C) 2013-2014 Jens Axboe
- */
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-#include "blk-mq.h"
-
-static LIST_HEAD(blk_mq_cpu_notify_list);
-static DEFINE_RAW_SPINLOCK(blk_mq_cpu_notify_lock);
-
-static int blk_mq_main_cpu_notify(struct notifier_block *self,
- unsigned long action, void *hcpu)
-{
-   unsigned int cpu = (unsigned long) hcpu;
-   struct blk_mq_cpu_notifier *notify;
-   int ret = NOTIFY_OK;
-
-   raw_spin_lock(&blk_mq_cpu_notify_lock);
-
-   list_for_each_entry(notify, &blk_mq_cpu_notify_list, list) {
-   ret = notify->notify(notify->data, action, cpu);
-   if (ret != NOTIFY_OK)
-   break;
-   }
-
-   raw_spin_unlock(&blk_mq_cpu_notify_lock);
-   return ret;
-}
-
-void blk_mq_register_cpu_notifier(struct blk_mq_cpu_notifier *notifier)
-{
-   BUG_ON(!notifier->notify);
-
-   raw_spin_lock(&blk_mq_cpu_notify_lock);
-   list_add_tail(¬ifier->list, &blk_mq_cpu_notify_list);
-   raw_spin_unlock(&blk_mq_cpu_notify_lock);
-}
-
-void blk_mq_unregister_cpu_notifier(struct blk_mq_cpu_notifier *notifier)
-{
-   raw_spin_lock(&blk_mq_cpu_notify_lock);
-   list_del(¬ifier->list);
-   raw_spin_unlock(&blk_mq_cpu_notify_lock);
-}
-
-void blk_mq_init_cpu_notifier(struct blk_mq_cpu_notifier *notifier,
- int (*fn)(void *, unsigned long, unsigned int),
- void *data)
-{
-   notifier->notify = fn;
-   notifier->data = data;
-}
-
-void __init blk_mq_cpu_init(void)
-{
-   hotcpu_notifier(blk_mq_main_cpu_notify, 0);
-}
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1575,11 +1575,13 @@ static struct blk_mq_tags *blk_mq_init_r
  * software queue to the hw queue dispatch list, and ensure that it
  * gets run.
  */
-static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
+static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
+   struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
LIST_HEAD(tmp);
 
+   hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 
spin_lock(&ctx->lock);
@@ -1590,30 +1592,20 @@ static int blk_mq_hctx_cpu_offline(struc
spin_unlock(&ctx->lock);
 
if (list_empty(&tmp))
-   return NOTIFY_OK;
+   return 0;
 
spin_lock(&hctx->lock);
list_splice_tail_init(&tmp, &hctx->dispatch);
spin_unlock(&hctx->lock);
 
blk_mq_run_hw_queue(hctx, true);
-   return NOTIFY_OK;
+   return 0;
 }
 
-static int blk_mq_hctx_notify(void *data, unsigned long action,
- unsigned int cpu)
+static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
-   struct blk_mq_hw_ctx *hctx = data;
-
-   if (action == CPU_DEAD || action == CPU_DEAD_FROZEN)
-   return blk_mq_hctx_cpu_offline(hctx, cpu);
-
-   /*
-* In case of CPU online, tags may be reallocated
-* in blk_mq_map_swqueue() after mapping is updated.
-*/
-
-   return NOTIFY_OK;
+   cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
+   &hctx->cpuhp_dead);
 }
 
 /* hctx->ctxs will be freed in queue's release handler */
@@ -1633,7 +1625,7 @@ static void blk_mq_exit_hctx(struct requ
if (set->ops->exit_hctx)
set->ops->exit_hctx(hctx, hctx_idx);
 
-   blk_mq_unregister_cpu_notif

[patch V2 2/2] blk/mq: Convert to hotplug state machine

2016-09-20 Thread Thomas Gleixner

From: Sebastian Andrzej Siewior 

Install the callbacks via the state machine so we can phase out the cpu
hotplug notifiers mess.


Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Thomas Gleixner 
Cc: Jens Axboe 
Cc: Peter Zijlstra 
Cc: linux-block@vger.kernel.org
Cc: r...@linutronix.de
Cc: Christoph Hellwing 
Link: http://lkml.kernel.org/r/20160919212601.180033...@linutronix.de
Signed-off-by: Thomas Gleixner 

---

 block/blk-mq.c |   87 -
 1 file changed, 43 insertions(+), 44 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2090,50 +2090,18 @@ static void blk_mq_queue_reinit(struct r
blk_mq_sysfs_register(q);
 }
 
-static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
- unsigned long action, void *hcpu)
+/*
+ * New online cpumask which is going to be set in this hotplug event.
+ * Declare this cpumasks as global as cpu-hotplug operation is invoked
+ * one-by-one and dynamically allocating this could result in a failure.
+ */
+static struct cpumask cpuhp_online_new;
+
+static void blk_mq_queue_reinit_work(void)
 {
struct request_queue *q;
-   int cpu = (unsigned long)hcpu;
-   /*
-* New online cpumask which is going to be set in this hotplug event.
-* Declare this cpumasks as global as cpu-hotplug operation is invoked
-* one-by-one and dynamically allocating this could result in a failure.
-*/
-   static struct cpumask online_new;
-
-   /*
-* Before hotadded cpu starts handling requests, new mappings must
-* be established.  Otherwise, these requests in hw queue might
-* never be dispatched.
-*
-* For example, there is a single hw queue (hctx) and two CPU queues
-* (ctx0 for CPU0, and ctx1 for CPU1).
-*
-* Now CPU1 is just onlined and a request is inserted into
-* ctx1->rq_list and set bit0 in pending bitmap as ctx1->index_hw is
-* still zero.
-*
-* And then while running hw queue, flush_busy_ctxs() finds bit0 is
-* set in pending bitmap and tries to retrieve requests in
-* hctx->ctxs[0]->rq_list.  But htx->ctxs[0] is a pointer to ctx0,
-* so the request in ctx1->rq_list is ignored.
-*/
-   switch (action & ~CPU_TASKS_FROZEN) {
-   case CPU_DEAD:
-   case CPU_UP_CANCELED:
-   cpumask_copy(&online_new, cpu_online_mask);
-   break;
-   case CPU_UP_PREPARE:
-   cpumask_copy(&online_new, cpu_online_mask);
-   cpumask_set_cpu(cpu, &online_new);
-   break;
-   default:
-   return NOTIFY_OK;
-   }
 
mutex_lock(&all_q_mutex);
-
/*
 * We need to freeze and reinit all existing queues.  Freezing
 * involves synchronous wait for an RCU grace period and doing it
@@ -2154,13 +2122,43 @@ static int blk_mq_queue_reinit_notify(st
}
 
list_for_each_entry(q, &all_q_list, all_q_node)
-   blk_mq_queue_reinit(q, &online_new);
+   blk_mq_queue_reinit(q, &cpuhp_online_new);
 
list_for_each_entry(q, &all_q_list, all_q_node)
blk_mq_unfreeze_queue(q);
 
mutex_unlock(&all_q_mutex);
-   return NOTIFY_OK;
+}
+
+static int blk_mq_queue_reinit_dead(unsigned int cpu)
+{
+   cpumask_clear_cpu(cpu, &cpuhp_online_new);
+   blk_mq_queue_reinit_work();
+   return 0;
+}
+
+/*
+ * Before hotadded cpu starts handling requests, new mappings must be
+ * established.  Otherwise, these requests in hw queue might never be
+ * dispatched.
+ *
+ * For example, there is a single hw queue (hctx) and two CPU queues (ctx0
+ * for CPU0, and ctx1 for CPU1).
+ *
+ * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
+ * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
+ *
+ * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in
+ * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
+ * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list
+ * is ignored.
+ */
+static int blk_mq_queue_reinit_prepare(unsigned int cpu)
+{
+   cpumask_copy(&cpuhp_online_new, cpu_online_mask);
+   cpumask_set_cpu(cpu, &cpuhp_online_new);
+   blk_mq_queue_reinit_work();
+   return 0;
 }
 
 static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
@@ -2382,8 +2380,9 @@ static int __init blk_mq_init(void)
cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
blk_mq_hctx_notify_dead);
 
-   hotcpu_notifier(blk_mq_queue_reinit_notify, 0);
-
+   cpuhp_setup_state_nocalls(CPUHP_BLK_MQ_PREPARE, "block/mq:prepare",
+ blk_mq_queue_reinit_prepare,
+

[patch V2 0/2] block/mq: Convert to new hotplug state machine

2016-09-20 Thread Thomas Gleixner

The following series converts block/mq to the new hotplug state
machine. The series is against block.git/for-next and depends on

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/for-block

This branch contains the necessary infrastructure for multi-instance
callbacks which allows us to remove blk-mq-cpu.c completely. It can be
pulled into the block tree.

Changes vs. V1:

  Use the multi instance callbacks and remove the private notifier handling
  in the block code.

Thanks,

tglx
---
 a/block/blk-mq-cpu.c |   67 -
 b/include/linux/blk-mq.h |8 ---
 block/Makefile   |2 
 block/blk-mq.c   |  123 +--
 block/blk-mq.h   |7 --
 5 files changed, 58 insertions(+), 149 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 2/3] blk/mq/cpu-notif: Convert to hotplug state machine

On Tue, 20 Sep 2016, Christoph Hellwing wrote:

> On Mon, Sep 19, 2016 at 09:28:20PM -, Thomas Gleixner wrote:
> > From: Sebastian Andrzej Siewior 
> > 
> > Install the callbacks via the state machine so we can phase out the cpu
> > hotplug notifiers..
> 
> Didn't Sebastian come up with a version of the hotplug state callbacks
> that can be per-object?  This seems to be a perfect candidate for that.

Indeed. I wrote that myself and forgot about it already. :(

So yes, we can use that and get rid of blk-mq-cpu.c completely.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 2/3] blk/mq/cpu-notif: Convert to hotplug state machine

2016-09-19 Thread Christoph Hellwing

On Mon, Sep 19, 2016 at 09:28:20PM -, Thomas Gleixner wrote:
> From: Sebastian Andrzej Siewior 
> 
> Install the callbacks via the state machine so we can phase out the cpu
> hotplug notifiers..

Didn't Sebastian come up with a version of the hotplug state callbacks
that can be per-object?  This seems to be a perfect candidate for that.

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 3/3] blk/mq: Convert to hotplug state machine

From: Sebastian Andrzej Siewior 

Install the callbacks via the state machine so we can phase out the cpu
hotplug notifiers mess.

Signed-off-by: Sebastian Andrzej Siewior 
Cc: Peter Zijlstra 
Cc: Jens Axboe 
Cc: r...@linutronix.de
Signed-off-by: Thomas Gleixner 

---

 block/blk-mq.c |   87 -
 1 file changed, 43 insertions(+), 44 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2090,50 +2090,18 @@ static void blk_mq_queue_reinit(struct r
blk_mq_sysfs_register(q);
 }
 
-static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
- unsigned long action, void *hcpu)
+/*
+ * New online cpumask which is going to be set in this hotplug event.
+ * Declare this cpumasks as global as cpu-hotplug operation is invoked
+ * one-by-one and dynamically allocating this could result in a failure.
+ */
+static struct cpumask cpuhp_online_new;
+
+static void blk_mq_queue_reinit_work(void)
 {
struct request_queue *q;
-   int cpu = (unsigned long)hcpu;
-   /*
-* New online cpumask which is going to be set in this hotplug event.
-* Declare this cpumasks as global as cpu-hotplug operation is invoked
-* one-by-one and dynamically allocating this could result in a failure.
-*/
-   static struct cpumask online_new;
-
-   /*
-* Before hotadded cpu starts handling requests, new mappings must
-* be established.  Otherwise, these requests in hw queue might
-* never be dispatched.
-*
-* For example, there is a single hw queue (hctx) and two CPU queues
-* (ctx0 for CPU0, and ctx1 for CPU1).
-*
-* Now CPU1 is just onlined and a request is inserted into
-* ctx1->rq_list and set bit0 in pending bitmap as ctx1->index_hw is
-* still zero.
-*
-* And then while running hw queue, flush_busy_ctxs() finds bit0 is
-* set in pending bitmap and tries to retrieve requests in
-* hctx->ctxs[0]->rq_list.  But htx->ctxs[0] is a pointer to ctx0,
-* so the request in ctx1->rq_list is ignored.
-*/
-   switch (action & ~CPU_TASKS_FROZEN) {
-   case CPU_DEAD:
-   case CPU_UP_CANCELED:
-   cpumask_copy(&online_new, cpu_online_mask);
-   break;
-   case CPU_UP_PREPARE:
-   cpumask_copy(&online_new, cpu_online_mask);
-   cpumask_set_cpu(cpu, &online_new);
-   break;
-   default:
-   return NOTIFY_OK;
-   }
 
mutex_lock(&all_q_mutex);
-
/*
 * We need to freeze and reinit all existing queues.  Freezing
 * involves synchronous wait for an RCU grace period and doing it
@@ -2154,13 +2122,43 @@ static int blk_mq_queue_reinit_notify(st
}
 
list_for_each_entry(q, &all_q_list, all_q_node)
-   blk_mq_queue_reinit(q, &online_new);
+   blk_mq_queue_reinit(q, &cpuhp_online_new);
 
list_for_each_entry(q, &all_q_list, all_q_node)
blk_mq_unfreeze_queue(q);
 
mutex_unlock(&all_q_mutex);
-   return NOTIFY_OK;
+}
+
+static int blk_mq_queue_reinit_dead(unsigned int cpu)
+{
+   cpumask_clear_cpu(cpu, &cpuhp_online_new);
+   blk_mq_queue_reinit_work();
+   return 0;
+}
+
+/*
+ * Before hotadded cpu starts handling requests, new mappings must be
+ * established.  Otherwise, these requests in hw queue might never be
+ * dispatched.
+ *
+ * For example, there is a single hw queue (hctx) and two CPU queues (ctx0
+ * for CPU0, and ctx1 for CPU1).
+ *
+ * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
+ * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
+ *
+ * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in
+ * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
+ * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list
+ * is ignored.
+ */
+static int blk_mq_queue_reinit_prepare(unsigned int cpu)
+{
+   cpumask_copy(&cpuhp_online_new, cpu_online_mask);
+   cpumask_set_cpu(cpu, &cpuhp_online_new);
+   blk_mq_queue_reinit_work();
+   return 0;
 }
 
 static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
@@ -2381,8 +2379,9 @@ static int __init blk_mq_init(void)
 {
blk_mq_cpu_init();
 
-   hotcpu_notifier(blk_mq_queue_reinit_notify, 0);
-
+   cpuhp_setup_state_nocalls(CPUHP_BLK_MQ_PREPARE, "block/mq:prepare",
+ blk_mq_queue_reinit_prepare,
+ blk_mq_queue_reinit_dead);
return 0;
 }
 subsys_initcall(blk_mq_init);


--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 1/3] blk/mq: Reserve hotplug states for block multiqueue

This patch only reserves two CPU hotplug states for block/mq so the block tree
can apply the conversion patches.

Signed-off-by: Sebastian Andrzej Siewior 
Cc: Peter Zijlstra 
Cc: Jens Axboe 
Cc: r...@linutronix.de
Signed-off-by: Thomas Gleixner 

---
 include/linux/cpuhotplug.h |2 ++
 1 file changed, 2 insertions(+)

--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -14,6 +14,7 @@ enum cpuhp_state {
CPUHP_PERF_SUPERH,
CPUHP_X86_HPET_DEAD,
CPUHP_X86_APB_DEAD,
+   CPUHP_BLK_MQ_DEAD,
CPUHP_WORKQUEUE_PREP,
CPUHP_POWER_NUMA_PREPARE,
CPUHP_HRTIMERS_PREPARE,
@@ -22,6 +23,7 @@ enum cpuhp_state {
CPUHP_SMPCFD_PREPARE,
CPUHP_RCUTREE_PREP,
CPUHP_NOTIFY_PREPARE,
+   CPUHP_BLK_MQ_PREPARE,
CPUHP_TIMERS_DEAD,
CPUHP_BRINGUP_CPU,
CPUHP_AP_IDLE_DEAD,


--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 0/3] block/mq: Convert to the new hotplug state machine

The following series converts block/mq to the new hotplug state
machine. Patch 1/3 reserves the states for the block layer and is already 
applied to

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/for-block

to avoid merge conflicts. This branch can be pulled into the block layer
instead of applying patch 1/3 manually,

Thanks,

tglx
---
 include/linux/blk-mq.h |2 
 block/blk-mq-cpu.c |   15 ++
 block/blk-mq.c |  108 -
 block/blk-mq.h |2 
 include/linux/cpuhotplug.h |2 
 5 files changed, 59 insertions(+), 70 deletions(-)




--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 2/3] blk/mq/cpu-notif: Convert to hotplug state machine