Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
On Tue, Oct 22, 2019 at 12:19:17PM +0100, John Garry wrote: > On 22/10/2019 01:16, Ming Lei wrote: > > On Mon, Oct 21, 2019 at 03:02:56PM +0100, John Garry wrote: > > > On 21/10/2019 13:53, Ming Lei wrote: > > > > On Mon, Oct 21, 2019 at 12:49:53PM +0100, John Garry wrote: > > > > > > > > > > > > > > > > > > > > > > Yes, we share tags among all queues, but we generate the tag - > > > > > > > known as IPTT > > > > > > > - in the LLDD now, as we can no longer use the request tag (as it > > > > > > > is not > > > > > > > unique per all queues): > > > > > > > > > > > > > > https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188 > > > > > > > > > > > > > > As I said, the branch is messy and I did have to fix 087b95af374. > > > > > > > > > > > > Firstly this way may waste lots of memory, especially the queue > > > > > > depth is > > > > > > big, such as, hisilicon V3's queue depth is 4096. > > > > > > > > > > > > Secondly, you have to deal with queue busy efficiently and > > > > > > correctly, > > > > > > for example, your real hw tags(IPTT) can be used up easily, and how > > > > > > will you handle these dispatched request? > > > > > > > > > > I have not seen scenario of exhausted IPTT. And IPTT count is same as > > > > > SCSI > > > > > host.can_queue, so SCSI midlayer should ensure that this does not > > > > > occur. > > > > > > > > > > Hi Ming, > > Hi Ming, > > > > > > > > That check isn't correct, and each hw queue should have allowed > > > > .can_queue in-flight requests. > > > > > > There always seems to be some confusion or disagreement on this topic. > > > > > > I work according to the comment in scsi_host.h: > > > > > > "Note: it is assumed that each hardware queue has a queue depth of > > > can_queue. In other words, the total queue depth per host > > > is nr_hw_queues * can_queue." 
> > > So I set Scsi_host.can_queue = HISI_SAS_MAX_COMMANDS (=4096)
> >
> > I believe all current drivers set .can_queue as a single hw queue's depth.
> > If you set .can_queue as HISI_SAS_MAX_COMMANDS, which is the HBA's queue
> > depth, the hisilicon sas driver will allow HISI_SAS_MAX_COMMANDS * nr_hw_queues
> > in-flight requests.
>
> Yeah, but the SCSI host should still limit max IOs over all queues to
> .can_queue:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scsi/scsi_mid_low_api.txt#n1083

That limit is actually from the legacy single-queue era; you should see that I am removing it:

https://lore.kernel.org/linux-scsi/20191009093241.21481-2-ming@redhat.com/

With this change, IOPS can be improved considerably on some fast SCSI storage.

Thanks,
Ming
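The disagreement above is over whether .can_queue caps the whole host or each hw queue. Per the scsi_host.h comment quoted earlier, blk-mq sizes every hw queue's tag set to can_queue, so the worst case is multiplicative. A tiny userspace sketch of that arithmetic (the 16-queue count is an assumption for illustration, not taken from the driver):

```c
#include <assert.h>

/* Model of the point under discussion: blk-mq sizes every hw queue's
 * tag set to .can_queue, so a host that sets can_queue to the whole
 * HBA depth (HISI_SAS_MAX_COMMANDS = 4096) can see up to
 * nr_hw_queues * can_queue commands in flight, while the hardware
 * only has 4096 IPTT entries. */
static unsigned int worst_case_inflight(unsigned int can_queue,
                                        unsigned int nr_hw_queues)
{
        return can_queue * nr_hw_queues;
}
```

This is only the bookkeeping behind the argument, not kernel code: with 16 hw queues the midlayer would permit 65536 in-flight commands against 4096 hardware tags.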
Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 22/10/2019 01:16, Ming Lei wrote:

On Mon, Oct 21, 2019 at 03:02:56PM +0100, John Garry wrote:

On 21/10/2019 13:53, Ming Lei wrote:

On Mon, Oct 21, 2019 at 12:49:53PM +0100, John Garry wrote:

Yes, we share tags among all queues, but we generate the tag - known as IPTT - in the LLDD now, as we can no longer use the request tag (as it is not unique per all queues):

https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188

As I said, the branch is messy and I did have to fix 087b95af374.

Firstly this way may waste lots of memory, especially when the queue depth is big; for example, hisilicon V3's queue depth is 4096.

Secondly, you have to deal with queue busy efficiently and correctly; for example, your real hw tags (IPTT) can be used up easily, and how will you handle these dispatched requests?

I have not seen a scenario of exhausted IPTT. And the IPTT count is the same as SCSI host.can_queue, so the SCSI midlayer should ensure that this does not occur.

Hi Ming,

That check isn't correct, and each hw queue should have allowed .can_queue in-flight requests.

There always seems to be some confusion or disagreement on this topic.

I work according to the comment in scsi_host.h:

"Note: it is assumed that each hardware queue has a queue depth of can_queue. In other words, the total queue depth per host is nr_hw_queues * can_queue."

So I set Scsi_host.can_queue = HISI_SAS_MAX_COMMANDS (=4096)

I believe all current drivers set .can_queue as a single hw queue's depth. If you set .can_queue as HISI_SAS_MAX_COMMANDS, which is the HBA's queue depth, the hisilicon sas driver will allow HISI_SAS_MAX_COMMANDS * nr_hw_queues in-flight requests.
Yeah, but the SCSI host should still limit max IOs over all queues to .can_queue:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scsi/scsi_mid_low_api.txt#n1083

Finally, you have to evaluate the performance effect; this is highly related to how you deal with running out of IPTT.

Some figures from our previous testing:

Managed interrupt without exposing multiple queues: 3M IOPS
Managed interrupt with exposing multiple queues: 2.6M IOPS

Then you see the performance regression.

Let's discuss this when I send the patches, so we don't get sidetracked on this blk-mq improvement topic.

OK, what I meant is to use the correct driver to test the patches, otherwise it might be hard to investigate.

Of course. I'm working on this now, and it looks like it will turn out complicated... you'll see.

BTW, I reran the test and never saw the new WARN trigger (while SCSI timeouts did occur).

Thanks again,
John

Thanks,
Ming
Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
On Mon, Oct 21, 2019 at 03:02:56PM +0100, John Garry wrote: > On 21/10/2019 13:53, Ming Lei wrote: > > On Mon, Oct 21, 2019 at 12:49:53PM +0100, John Garry wrote: > > > > > > > > > > > > > > > > Yes, we share tags among all queues, but we generate the tag - known > > > > > as IPTT > > > > > - in the LLDD now, as we can no longer use the request tag (as it is > > > > > not > > > > > unique per all queues): > > > > > > > > > > https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188 > > > > > > > > > > As I said, the branch is messy and I did have to fix 087b95af374. > > > > > > > > Firstly this way may waste lots of memory, especially the queue depth is > > > > big, such as, hisilicon V3's queue depth is 4096. > > > > > > > > Secondly, you have to deal with queue busy efficiently and correctly, > > > > for example, your real hw tags(IPTT) can be used up easily, and how > > > > will you handle these dispatched request? > > > > > > I have not seen scenario of exhausted IPTT. And IPTT count is same as SCSI > > > host.can_queue, so SCSI midlayer should ensure that this does not occur. > > > > Hi Ming, > > > That check isn't correct, and each hw queue should have allowed > > .can_queue in-flight requests. > > There always seems to be some confusion or disagreement on this topic. > > I work according to the comment in scsi_host.h: > > "Note: it is assumed that each hardware queue has a queue depth of > can_queue. In other words, the total queue depth per host > is nr_hw_queues * can_queue." > > So I set Scsi_host.can_queue = HISI_SAS_MAX_COMMANDS (=4096) I believe all current drivers set .can_queue as single hw queue's depth. If you set .can_queue as HISI_SAS_MAX_COMMANDS which is HBA's queue depth, the hisilicon sas driver will HISI_SAS_MAX_COMMANDS * nr_hw_queues in-flight requests. 
> > > > > > > > > > > > > > Finally, you have to evaluate the performance effect, this is highly > > > > related with how to deal with out-of-IPTT. > > > > > > Some figures from our previous testing: > > > > > > Managed interrupt without exposing multiple queues: 3M IOPs > > > Managed interrupt with exposing multiple queues: 2.6M IOPs > > > > Then you see the performance regression. > > Let's discuss this when I send the patches, so we don't get sidetracked on > this blk-mq improvement topic. OK, what I meant is to use correct driver to test the patches, otherwise it might be hard to investigate. Thanks, Ming
Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 21/10/2019 13:53, Ming Lei wrote: On Mon, Oct 21, 2019 at 12:49:53PM +0100, John Garry wrote: Yes, we share tags among all queues, but we generate the tag - known as IPTT - in the LLDD now, as we can no longer use the request tag (as it is not unique per all queues): https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188 As I said, the branch is messy and I did have to fix 087b95af374. Firstly this way may waste lots of memory, especially the queue depth is big, such as, hisilicon V3's queue depth is 4096. Secondly, you have to deal with queue busy efficiently and correctly, for example, your real hw tags(IPTT) can be used up easily, and how will you handle these dispatched request? I have not seen scenario of exhausted IPTT. And IPTT count is same as SCSI host.can_queue, so SCSI midlayer should ensure that this does not occur. Hi Ming, That check isn't correct, and each hw queue should have allowed .can_queue in-flight requests. There always seems to be some confusion or disagreement on this topic. I work according to the comment in scsi_host.h: "Note: it is assumed that each hardware queue has a queue depth of can_queue. In other words, the total queue depth per host is nr_hw_queues * can_queue." So I set Scsi_host.can_queue = HISI_SAS_MAX_COMMANDS (=4096) Finally, you have to evaluate the performance effect, this is highly related with how to deal with out-of-IPTT. Some figures from our previous testing: Managed interrupt without exposing multiple queues: 3M IOPs Managed interrupt with exposing multiple queues: 2.6M IOPs Then you see the performance regression. Let's discuss this when I send the patches, so we don't get sidetracked on this blk-mq improvement topic. Thanks, John Thanks, Ming
Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
On Mon, Oct 21, 2019 at 12:49:53PM +0100, John Garry wrote: > > > > > > > > > > Yes, we share tags among all queues, but we generate the tag - known as > > > IPTT > > > - in the LLDD now, as we can no longer use the request tag (as it is not > > > unique per all queues): > > > > > > https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188 > > > > > > As I said, the branch is messy and I did have to fix 087b95af374. > > > > Firstly this way may waste lots of memory, especially the queue depth is > > big, such as, hisilicon V3's queue depth is 4096. > > > > Secondly, you have to deal with queue busy efficiently and correctly, > > for example, your real hw tags(IPTT) can be used up easily, and how > > will you handle these dispatched request? > > I have not seen scenario of exhausted IPTT. And IPTT count is same as SCSI > host.can_queue, so SCSI midlayer should ensure that this does not occur. That check isn't correct, and each hw queue should have allowed .can_queue in-flight requests. > > > > > Finally, you have to evaluate the performance effect, this is highly > > related with how to deal with out-of-IPTT. > > Some figures from our previous testing: > > Managed interrupt without exposing multiple queues: 3M IOPs > Managed interrupt with exposing multiple queues: 2.6M IOPs Then you see the performance regression. Thanks, Ming
Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
Yes, we share tags among all queues, but we generate the tag - known as IPTT - in the LLDD now, as we can no longer use the request tag (as it is not unique per all queues):

https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188

As I said, the branch is messy and I did have to fix 087b95af374.

Firstly this way may waste lots of memory, especially when the queue depth is big; for example, hisilicon V3's queue depth is 4096.

Secondly, you have to deal with queue busy efficiently and correctly; for example, your real hw tags (IPTT) can be used up easily, and how will you handle these dispatched requests?

I have not seen a scenario of exhausted IPTT. And the IPTT count is the same as SCSI host.can_queue, so the SCSI midlayer should ensure that this does not occur.

Finally, you have to evaluate the performance effect; this is highly related to how you deal with running out of IPTT.

Some figures from our previous testing:

Managed interrupt without exposing multiple queues: 3M IOPS
Managed interrupt with exposing multiple queues: 2.6M IOPS
No managed interrupts: 500K IOPS

Now my problem is that I hear some people are relying on the 3M performance, even though it is experimental and unsafe. I need to follow up on this. I really don't want to keep that code.

I'd suggest you fix the stuff and post it out for review.

OK, I'll also check on adding that WARN you provided.

Thanks again,
John

Thanks,
Ming
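The IPTT scheme John describes - a host-wide tag pool allocated in the LLDD because blk-mq request tags are no longer unique across queues - and the exhaustion case Ming raises can be sketched in userspace like this. A flat bitmap stands in for the kernel's sbitmap, and all names are invented for illustration:

```c
#include <stdbool.h>
#include <string.h>

/* Userspace model of a host-wide IPTT allocator shared by all hw
 * queues. The LLDD allocates its own per-host tag because blk-mq
 * request tags are only unique within one hw queue. The real driver
 * would use sbitmap under appropriate locking; this linear scan is
 * only an illustration of the allocate/exhaust/free lifecycle. */
#define NR_IPTT 4096

static bool iptt_used[NR_IPTT];

/* Returns an IPTT, or -1 when the pool is exhausted - the "queue busy"
 * case: the driver must then push back (e.g. return a host-busy status
 * so the midlayer requeues the command). */
static int iptt_alloc(void)
{
        for (int i = 0; i < NR_IPTT; i++) {
                if (!iptt_used[i]) {
                        iptt_used[i] = true;
                        return i;
                }
        }
        return -1;
}

static void iptt_free(int iptt)
{
        iptt_used[iptt] = false;
}
```

The sketch makes Ming's point concrete: if the midlayer can issue more than NR_IPTT commands host-wide, iptt_alloc() fails and every such dispatch needs an efficient busy/requeue path.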
Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
On Mon, Oct 21, 2019 at 10:47:05AM +0100, John Garry wrote: > On 21/10/2019 10:34, Ming Lei wrote: > > On Mon, Oct 21, 2019 at 10:19:18AM +0100, John Garry wrote: > > > On 20/10/2019 11:14, Ming Lei wrote: > > > > > > ght? If so, I need to find some simple sysfs entry to > > > > > > > > tell me of this occurrence, to trigger the capture. Or add > > > > > > > > something. My > > > > > > > > script is pretty dump. > > > > > > > > > > > > > > > > BTW, I did notice that we the dump_stack in > > > > > > > > __blk_mq_run_hw_queue() > > > > > > > > pretty soon before the problem happens - maybe a clue or maybe > > > > > > > > coincidence. > > > > > > > > > > > > > > > > > > > > I finally got to capture that debugfs dump at the point the SCSI IOs > > > > > > timeout, as attached. Let me know if any problem receiving it. > > > > > > > > > > > > Here's a kernel log snippet at that point (I added some print for > > > > > > the > > > > > > timeout): > > > > > > > > > > > > 609] psci: CPU6 killed. > > > > > > [ 547.722217] CPU5: shutdown > > > > > > [ 547.724926] psci: CPU5 killed. > > > > > > [ 547.749951] irq_shutdown > > > > > > [ 547.752701] IRQ 800: no longer affine to CPU4 > > > > > > [ 547.757265] CPU4: shutdown > > > > > > [ 547.759971] psci: CPU4 killed. > > > > > > [ 547.790348] CPU3: shutdown > > > > > > [ 547.793052] psci: CPU3 killed. > > > > > > [ 547.818330] CPU2: shutdown > > > > > > [ 547.821033] psci: CPU2 killed. > > > > > > [ 547.854285] CPU1: shutdown > > > > > > [ 547.856989] psci: CPU1 killed. > > > > > > [ 575.925307] scsi_timeout req=0x0023b0dd9c00 reserved=0 > > > > > > [ 575.930794] scsi_timeout req=0x0023b0df2700 reserved=0 > > > > > From the debugfs log, 66 requests are dumped, and 63 of them has > > > > been submitted to device, and the other 3 is in ->dispatch list > > > > via requeue after timeout is handled. 
> > > > > > > > > > Hi Ming, > > > > > > > You mentioned that: > > > > > > > > " - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for > > > > when > > > > inflights rqs !=0, and I don't see them for this timeout" > > > > > > > > There might be two reasons: > > > > > > > > 1) You are still testing a multiple reply-queue device? > > > > > > As before, I am testing by exposing mutliple queues to the SCSI midlayer. > > > I > > > had to make this change locally, as on mainline we still only expose a > > > single queue and use the internal reply queue when enabling managed > > > interrupts. > > > > > > As I > > > > mentioned last times, it is hard to map reply-queue into blk-mq > > > > hctx correctly. > > > > > > Here's my branch, if you want to check: > > > > > > https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4 > > > > > > It's a bit messy (sorry), but you can see that the reply-queue in the LLDD > > > is removed in commit 087b95af374. > > > > > > I am now thinking of actually making this change to the LLDD in mainline > > > to > > > avoid any doubt in future. > > > > As I mentioned last time, you do share tags among all MQ queues on your > > hardware > > given your hardware is actually SQ HBA, so commit 087b95af374 is definitely > > wrong, isn't it? > > > > Yes, we share tags among all queues, but we generate the tag - known as IPTT > - in the LLDD now, as we can no longer use the request tag (as it is not > unique per all queues): > > https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188 > > As I said, the branch is messy and I did have to fix 087b95af374. Firstly this way may waste lots of memory, especially the queue depth is big, such as, hisilicon V3's queue depth is 4096. 
Secondly, you have to deal with queue busy efficiently and correctly, for example, your real hw tags(IPTT) can be used up easily, and how will you handle these dispatched request? Finally, you have to evaluate the performance effect, this is highly related with how to deal with out-of-IPTT. I'd suggest you to fix the stuff and post them out for review. Thanks, Ming
Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 21/10/2019 10:34, Ming Lei wrote:

On Mon, Oct 21, 2019 at 10:19:18AM +0100, John Garry wrote:

On 20/10/2019 11:14, Ming Lei wrote:

...right? If so, I need to find some simple sysfs entry to tell me of this occurrence, to trigger the capture. Or add something. My script is pretty dumb.

BTW, I did notice that we hit the dump_stack in __blk_mq_run_hw_queue() pretty soon before the problem happens - maybe a clue or maybe coincidence.

I finally got to capture that debugfs dump at the point the SCSI IOs timeout, as attached. Let me know if any problem receiving it.

Here's a kernel log snippet at that point (I added some print for the timeout):

609] psci: CPU6 killed.
[ 547.722217] CPU5: shutdown
[ 547.724926] psci: CPU5 killed.
[ 547.749951] irq_shutdown
[ 547.752701] IRQ 800: no longer affine to CPU4
[ 547.757265] CPU4: shutdown
[ 547.759971] psci: CPU4 killed.
[ 547.790348] CPU3: shutdown
[ 547.793052] psci: CPU3 killed.
[ 547.818330] CPU2: shutdown
[ 547.821033] psci: CPU2 killed.
[ 547.854285] CPU1: shutdown
[ 547.856989] psci: CPU1 killed.
[ 575.925307] scsi_timeout req=0x0023b0dd9c00 reserved=0
[ 575.930794] scsi_timeout req=0x0023b0df2700 reserved=0

From the debugfs log, 66 requests are dumped; 63 of them have been submitted to the device, and the other 3 are in the ->dispatch list via requeue after the timeout is handled.

Hi Ming,

You mentioned that:

" - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when inflights rqs !=0, and I don't see them for this timeout"

There might be two reasons:

1) You are still testing a multiple reply-queue device?

As before, I am testing by exposing multiple queues to the SCSI midlayer. I had to make this change locally, as on mainline we still only expose a single queue and use the internal reply queue when enabling managed interrupts.

As I mentioned last time, it is hard to map the reply-queue into blk-mq hctx correctly.
Here's my branch, if you want to check:

https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4

It's a bit messy (sorry), but you can see that the reply-queue in the LLDD is removed in commit 087b95af374.

I am now thinking of actually making this change to the LLDD in mainline to avoid any doubt in future.

As I mentioned last time, you do share tags among all MQ queues on your hardware given your hardware is actually an SQ HBA, so commit 087b95af374 is definitely wrong, isn't it?

Yes, we share tags among all queues, but we generate the tag - known as IPTT - in the LLDD now, as we can no longer use the request tag (as it is not unique per all queues):

https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188

As I said, the branch is messy and I did have to fix 087b95af374.

It can be very hard to partition the single tags among multiple hctx.

I really do think now that I'll make this change on mainline to avoid doubt...

Thanks,
John

Thanks,
Ming
Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
On Mon, Oct 21, 2019 at 10:19:18AM +0100, John Garry wrote: > On 20/10/2019 11:14, Ming Lei wrote: > > > > ght? If so, I need to find some simple sysfs entry to > > > > > > tell me of this occurrence, to trigger the capture. Or add > > > > > > something. My > > > > > > script is pretty dump. > > > > > > > > > > > > BTW, I did notice that we the dump_stack in __blk_mq_run_hw_queue() > > > > > > pretty soon before the problem happens - maybe a clue or maybe > > > > > > coincidence. > > > > > > > > > > > > > > I finally got to capture that debugfs dump at the point the SCSI IOs > > > > timeout, as attached. Let me know if any problem receiving it. > > > > > > > > Here's a kernel log snippet at that point (I added some print for the > > > > timeout): > > > > > > > > 609] psci: CPU6 killed. > > > > [ 547.722217] CPU5: shutdown > > > > [ 547.724926] psci: CPU5 killed. > > > > [ 547.749951] irq_shutdown > > > > [ 547.752701] IRQ 800: no longer affine to CPU4 > > > > [ 547.757265] CPU4: shutdown > > > > [ 547.759971] psci: CPU4 killed. > > > > [ 547.790348] CPU3: shutdown > > > > [ 547.793052] psci: CPU3 killed. > > > > [ 547.818330] CPU2: shutdown > > > > [ 547.821033] psci: CPU2 killed. > > > > [ 547.854285] CPU1: shutdown > > > > [ 547.856989] psci: CPU1 killed. > > > > [ 575.925307] scsi_timeout req=0x0023b0dd9c00 reserved=0 > > > > [ 575.930794] scsi_timeout req=0x0023b0df2700 reserved=0 > > > From the debugfs log, 66 requests are dumped, and 63 of them has > > been submitted to device, and the other 3 is in ->dispatch list > > via requeue after timeout is handled. > > > > Hi Ming, > > > You mentioned that: > > > > " - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when > > inflights rqs !=0, and I don't see them for this timeout" > > > > There might be two reasons: > > > > 1) You are still testing a multiple reply-queue device? > > As before, I am testing by exposing mutliple queues to the SCSI midlayer. 
I > had to make this change locally, as on mainline we still only expose a > single queue and use the internal reply queue when enabling managed > interrupts. > > As I > > mentioned last times, it is hard to map reply-queue into blk-mq > > hctx correctly. > > Here's my branch, if you want to check: > > https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4 > > It's a bit messy (sorry), but you can see that the reply-queue in the LLDD > is removed in commit 087b95af374. > > I am now thinking of actually making this change to the LLDD in mainline to > avoid any doubt in future. As I mentioned last time, you do share tags among all MQ queues on your hardware given your hardware is actually SQ HBA, so commit 087b95af374 is definitely wrong, isn't it? It can be very hard to partition the single tags among multiple hctx. Thanks, Ming
Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 20/10/2019 11:14, Ming Lei wrote: ght? If so, I need to find some simple sysfs entry to > > tell me of this occurrence, to trigger the capture. Or add something. My > > script is pretty dump. > > > > BTW, I did notice that we the dump_stack in __blk_mq_run_hw_queue() > > pretty soon before the problem happens - maybe a clue or maybe coincidence. > > > > I finally got to capture that debugfs dump at the point the SCSI IOs > timeout, as attached. Let me know if any problem receiving it. > > Here's a kernel log snippet at that point (I added some print for the > timeout): > > 609] psci: CPU6 killed. > [ 547.722217] CPU5: shutdown > [ 547.724926] psci: CPU5 killed. > [ 547.749951] irq_shutdown > [ 547.752701] IRQ 800: no longer affine to CPU4 > [ 547.757265] CPU4: shutdown > [ 547.759971] psci: CPU4 killed. > [ 547.790348] CPU3: shutdown > [ 547.793052] psci: CPU3 killed. > [ 547.818330] CPU2: shutdown > [ 547.821033] psci: CPU2 killed. > [ 547.854285] CPU1: shutdown > [ 547.856989] psci: CPU1 killed. > [ 575.925307] scsi_timeout req=0x0023b0dd9c00 reserved=0 > [ 575.930794] scsi_timeout req=0x0023b0df2700 reserved=0 From the debugfs log, 66 requests are dumped, and 63 of them has been submitted to device, and the other 3 is in ->dispatch list via requeue after timeout is handled. Hi Ming, You mentioned that: " - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when inflights rqs !=0, and I don't see them for this timeout" There might be two reasons: 1) You are still testing a multiple reply-queue device? As before, I am testing by exposing mutliple queues to the SCSI midlayer. I had to make this change locally, as on mainline we still only expose a single queue and use the internal reply queue when enabling managed interrupts. As I mentioned last times, it is hard to map reply-queue into blk-mq hctx correctly. 
Here's my branch, if you want to check:

https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4

It's a bit messy (sorry), but you can see that the reply-queue in the LLDD is removed in commit 087b95af374.

I am now thinking of actually making this change to the LLDD in mainline to avoid any doubt in future.

2) concurrent dispatch to device, which can be observed by the following patch.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 06081966549f..3590f6f947eb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -679,6 +679,8 @@ void blk_mq_start_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
+	WARN_ON_ONCE(test_bit(BLK_MQ_S_INTERNAL_STOPPED, &rq->mq_hctx->state));
+
 	trace_block_rq_issue(q, rq);
 
 	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {

However, I think it is unlikely to be 2), since the current CPU is the last CPU in hctx->cpu_mask.

I'll try it. Thanks as always, John

Thanks,
Ming
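The WARN in the patch above fires if a request is started on an hctx that has already been marked internal-stopped, i.e. if dispatch races with hctx shutdown. A userspace sketch of the invariant being checked (names invented; the real code operates on struct blk_mq_hw_ctx state bits):

```c
#include <stdbool.h>

/* Minimal model of the invariant behind the WARN_ON_ONCE above: once a
 * dying hctx is marked "internal stopped", no new request may be started
 * on it; dispatch must fail and the caller must requeue elsewhere. */
struct model_hctx {
        bool internal_stopped;  /* models BLK_MQ_S_INTERNAL_STOPPED */
        int inflight;
};

/* Returns true if the request was started, false if dispatch raced with
 * hctx shutdown - the condition the kernel patch would WARN about. */
static bool start_request(struct model_hctx *h)
{
        if (h->internal_stopped)
                return false;
        h->inflight++;
        return true;
}

static void complete_request(struct model_hctx *h)
{
        h->inflight--;
}
```

In this model the "concurrent dispatch" reason 2) above corresponds to start_request() being reachable after internal_stopped is set, which is exactly what the added WARN would catch.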
Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
On Thu, Oct 17, 2019 at 04:40:12PM +0100, John Garry wrote: > On 16/10/2019 17:19, John Garry wrote: > > On 16/10/2019 13:07, Ming Lei wrote: > > > On Wed, Oct 16, 2019 at 09:58:27AM +0100, John Garry wrote: > > > > On 14/10/2019 02:50, Ming Lei wrote: > > > > > Hi, > > > > > > > > > > Thomas mentioned: > > > > > " > > > > > That was the constraint of managed interrupts from the very > > > > > beginning: > > > > > > > > > > The driver/subsystem has to quiesce the interrupt line and the > > > > > associated > > > > > queue _before_ it gets shutdown in CPU unplug and not fiddle > > > > > with it > > > > > until it's restarted by the core when the CPU is plugged in > > > > > again. > > > > > " > > > > > > > > > > But no drivers or blk-mq do that before one hctx becomes dead(all > > > > > CPUs for one hctx are offline), and even it is worse, blk-mq stills > > > > > tries > > > > > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead(). > > > > > > > > > > This patchset tries to address the issue by two stages: > > > > > > > > > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE > > > > > > > > > > - mark the hctx as internal stopped, and drain all in-flight requests > > > > > if the hctx is going to be dead. > > > > > > > > > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx > > > > > becomes dead > > > > > > > > > > - steal bios from the request, and resubmit them via > > > > > generic_make_request(), > > > > > then these IO will be mapped to other live hctx for dispatch > > > > > > > > > > Please comment & review, thanks! > > > > > > > > > > John, I don't add your tested-by tag since V3 have some changes, > > > > > and I appreciate if you may run your test on V3. > > > > > > > > Hi Ming, > > > > > > > > So I got around to doing some testing. The good news is that issue > > > > which we > > > > were experiencing in v3 series seems to have has gone away - alot more > > > > stable. 
> > > > > > > > However, unfortunately, I did notice some SCSI timeouts: > > > > > > > > 15508.615074] CPU2: shutdown > > > > [15508.617778] psci: CPU2 killed. > > > > [15508.651220] CPU1: shutdown > > > > [15508.653924] psci: CPU1 killed. > > > > [15518.406229] sas: Enter sas_scsi_recover_host busy: 63 failed: 63 > > > > Jobs: 1 (f=1): [R] [0.0% done] [0[15518.412239] sas: sas_scsi_find_task: > > > > aborting task 0xa7159744 > > > > KB/0KB/0KB /s] [0/0/0 iops] [eta [15518.421708] sas: > > > > sas_eh_handle_sas_errors: task 0xa7159744 is done > > > > [15518.431266] sas: sas_scsi_find_task: aborting task 0xd39731eb > > > > [15518.442539] sas: sas_eh_handle_sas_errors: task 0xd39731eb is > > > > done > > > > [15518.449407] sas: sas_scsi_find_task: aborting task 0x9f77c9bd > > > > [15518.455899] sas: sas_eh_handle_sas_errors: task 0x9f77c9bd is > > > > done > > > > > > > > A couple of things to note: > > > > - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when > > > > inflights rqs !=0, and I don't see them for this timeout > > > > - 0 datarate reported from fio > > > > > > > > I'll have a look... > > > > > > What is the output of the following command? > > > > > > (cd /sys/kernel/debug/block/$SAS_DISK && find . -type f -exec grep -aH > > > . {} \;) > > I assume that you want this run at about the time SCSI EH kicks in for > > the timeout, right? If so, I need to find some simple sysfs entry to > > tell me of this occurrence, to trigger the capture. Or add something. My > > script is pretty dump. > > > > BTW, I did notice that we the dump_stack in __blk_mq_run_hw_queue() > > pretty soon before the problem happens - maybe a clue or maybe coincidence. > > > > I finally got to capture that debugfs dump at the point the SCSI IOs > timeout, as attached. Let me know if any problem receiving it. > > Here's a kernel log snippet at that point (I added some print for the > timeout): > > 609] psci: CPU6 killed. 
> [ 547.722217] CPU5: shutdown
> [ 547.724926] psci: CPU5 killed.
> [ 547.749951] irq_shutdown
> [ 547.752701] IRQ 800: no longer affine to CPU4
> [ 547.757265] CPU4: shutdown
> [ 547.759971] psci: CPU4 killed.
> [ 547.790348] CPU3: shutdown
> [ 547.793052] psci: CPU3 killed.
> [ 547.818330] CPU2: shutdown
> [ 547.821033] psci: CPU2 killed.
> [ 547.854285] CPU1: shutdown
> [ 547.856989] psci: CPU1 killed.
> [ 575.925307] scsi_timeout req=0x0023b0dd9c00 reserved=0
> [ 575.930794] scsi_timeout req=0x0023b0df2700 reserved=0

From the debugfs log, 66 requests are dumped; 63 of them have been submitted to the device, and the other 3 are in the ->dispatch list via requeue after the timeout is handled.

You mentioned that:

" - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when inflights rqs !=0, and I don't see them for this timeout"

There might be two reasons:

1) You are still testing a multiple reply-queue device? As I mentioned last time, it is hard to map
Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 16/10/2019 13:07, Ming Lei wrote:

On Wed, Oct 16, 2019 at 09:58:27AM +0100, John Garry wrote:

On 14/10/2019 02:50, Ming Lei wrote:

Hi,

Thomas mentioned:

"
That was the constraint of managed interrupts from the very beginning:

The driver/subsystem has to quiesce the interrupt line and the associated queue _before_ it gets shutdown in CPU unplug and not fiddle with it until it's restarted by the core when the CPU is plugged in again.
"

But no drivers or blk-mq do that before one hctx becomes dead (all CPUs for one hctx are offline), and even worse, blk-mq still tries to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue in two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(), then these IOs will be mapped to other live hctx for dispatch

Please comment & review, thanks!

John, I don't add your tested-by tag since V3 has some changes, and I'd appreciate it if you could run your test on V3.

Hi Ming,

So I got around to doing some testing. The good news is that the issue which we were experiencing in the v3 series seems to have gone away - a lot more stable.

However, unfortunately, I did notice some SCSI timeouts:

15508.615074] CPU2: shutdown
[15508.617778] psci: CPU2 killed.
[15508.651220] CPU1: shutdown
[15508.653924] psci: CPU1 killed.
[15518.406229] sas: Enter sas_scsi_recover_host busy: 63 failed: 63 Jobs: 1 (f=1): [R] [0.0% done] [0[15518.412239] sas: sas_scsi_find_task: aborting task 0xa7159744 KB/0KB/0KB /s] [0/0/0 iops] [eta [15518.421708] sas: sas_eh_handle_sas_errors: task 0xa7159744 is done [15518.431266] sas: sas_scsi_find_task: aborting task 0xd39731eb [15518.442539] sas: sas_eh_handle_sas_errors: task 0xd39731eb is done [15518.449407] sas: sas_scsi_find_task: aborting task 0x9f77c9bd [15518.455899] sas: sas_eh_handle_sas_errors: task 0x9f77c9bd is done A couple of things to note: - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when inflight rqs != 0, and I don't see them for this timeout - 0 datarate reported from fio I'll have a look... What is the output of the following command? (cd /sys/kernel/debug/block/$SAS_DISK && find . -type f -exec grep -aH . {} \;) I assume that you want this run at about the time SCSI EH kicks in for the timeout, right? If so, I need to find some simple sysfs entry to tell me of this occurrence, to trigger the capture. Or add something. My script is pretty dumb. BTW, I did notice that we hit the dump_stack() in __blk_mq_run_hw_queue() pretty soon before the problem happens - maybe a clue or maybe a coincidence. Thanks, John Thanks, Ming
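Since there is no obvious sysfs event to key the capture off, one low-tech option is a rolling snapshot loop: dump the debugfs tree every second and keep only the last few copies, so the state from just before the timeout is still on disk once SCSI EH kicks in. This is only a sketch, not a tested script for this setup — `$SAS_DISK`, the output directory, and the retention count are all assumptions:

```shell
#!/bin/bash
# Rolling capture of blk-mq debugfs state for one disk, keeping only
# the most recent $KEEP snapshots; adjust SAS_DISK/OUT for the setup.
SAS_DISK=${SAS_DISK:-sdb}
OUT=${OUT:-/tmp/blkmq-snap}
KEEP=5

snap_name() {   # snap_name <seq> -> zero-padded snapshot dir name
    printf 'snap-%05d' "$1"
}

capture_loop() {
    local i=0 d
    mkdir -p "$OUT"
    while true; do
        d="$OUT/$(snap_name "$i")"
        mkdir -p "$d"
        # Same dump Ming asked for, redirected into the snapshot dir
        (cd "/sys/kernel/debug/block/$SAS_DISK" && \
            find . -type f -exec grep -aH . {} \;) > "$d/state" 2>/dev/null
        rm -rf "$OUT/$(snap_name $((i - KEEP)))"   # drop the oldest snapshot
        i=$((i + 1))
        sleep 1
    done
}

if [ "${1:-}" = --run ]; then
    capture_loop
fi
```

Run it alongside the hotplug test with `--run`; after a timeout, the surviving snapshots cover the seconds leading up to EH.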
Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
On Wed, Oct 16, 2019 at 09:58:27AM +0100, John Garry wrote: > On 14/10/2019 02:50, Ming Lei wrote: > > Hi, > > > > Thomas mentioned: > > " > > That was the constraint of managed interrupts from the very beginning: > > > > The driver/subsystem has to quiesce the interrupt line and the > > associated > > queue _before_ it gets shutdown in CPU unplug and not fiddle with it > > until it's restarted by the core when the CPU is plugged in again. > > " > > > > But no drivers or blk-mq do that before one hctx becomes dead(all > > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries > > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead(). > > > > This patchset tries to address the issue by two stages: > > > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE > > > > - mark the hctx as internal stopped, and drain all in-flight requests > > if the hctx is going to be dead. > > > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes > > dead > > > > - steal bios from the request, and resubmit them via generic_make_request(), > > then these IO will be mapped to other live hctx for dispatch > > > > Please comment & review, thanks! > > > > John, I don't add your tested-by tag since V3 have some changes, > > and I appreciate if you may run your test on V3. > > Hi Ming, > > So I got around to doing some testing. The good news is that issue which we > were experiencing in v3 series seems to have has gone away - alot more > stable. > > However, unfortunately, I did notice some SCSI timeouts: > > 15508.615074] CPU2: shutdown > [15508.617778] psci: CPU2 killed. > [15508.651220] CPU1: shutdown > [15508.653924] psci: CPU1 killed. 
> [15518.406229] sas: Enter sas_scsi_recover_host busy: 63 failed: 63 > Jobs: 1 (f=1): [R] [0.0% done] [0[15518.412239] sas: sas_scsi_find_task: > aborting task 0xa7159744 > KB/0KB/0KB /s] [0/0/0 iops] [eta [15518.421708] sas: > sas_eh_handle_sas_errors: task 0xa7159744 is done > [15518.431266] sas: sas_scsi_find_task: aborting task 0xd39731eb > [15518.442539] sas: sas_eh_handle_sas_errors: task 0xd39731eb is > done > [15518.449407] sas: sas_scsi_find_task: aborting task 0x9f77c9bd > [15518.455899] sas: sas_eh_handle_sas_errors: task 0x9f77c9bd is > done > > A couple of things to note: > - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when > inflights rqs !=0, and I don't see them for this timeout > - 0 datarate reported from fio > > I'll have a look... What is the output of the following command? (cd /sys/kernel/debug/block/$SAS_DISK && find . -type f -exec grep -aH . {} \;) Thanks, Ming
Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 14/10/2019 02:50, Ming Lei wrote: Hi, Thomas mentioned: " That was the constraint of managed interrupts from the very beginning: The driver/subsystem has to quiesce the interrupt line and the associated queue _before_ it gets shutdown in CPU unplug and not fiddle with it until it's restarted by the core when the CPU is plugged in again. " But no drivers or blk-mq do that before one hctx becomes dead (all CPUs for one hctx are offline), and even worse, blk-mq still tries to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead(). This patchset tries to address the issue in two stages: 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE - mark the hctx as internal stopped, and drain all in-flight requests if the hctx is going to be dead. 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead - steal bios from the requests, and resubmit them via generic_make_request(); these IOs will then be mapped to other live hctxs for dispatch Please comment & review, thanks! John, I don't add your tested-by tag since V4 has some changes, and I'd appreciate it if you could run your test on V4. Hi Ming, So I got around to doing some testing. The good news is that the issue which we were experiencing in the v3 series seems to have gone away - a lot more stable. However, unfortunately, I did notice some SCSI timeouts: [15508.615074] CPU2: shutdown [15508.617778] psci: CPU2 killed. [15508.651220] CPU1: shutdown [15508.653924] psci: CPU1 killed. 
[15518.406229] sas: Enter sas_scsi_recover_host busy: 63 failed: 63 Jobs: 1 (f=1): [R] [0.0% done] [0[15518.412239] sas: sas_scsi_find_task: aborting task 0xa7159744 KB/0KB/0KB /s] [0/0/0 iops] [eta [15518.421708] sas: sas_eh_handle_sas_errors: task 0xa7159744 is done [15518.431266] sas: sas_scsi_find_task: aborting task 0xd39731eb [15518.442539] sas: sas_eh_handle_sas_errors: task 0xd39731eb is done [15518.449407] sas: sas_scsi_find_task: aborting task 0x9f77c9bd [15518.455899] sas: sas_eh_handle_sas_errors: task 0x9f77c9bd is done A couple of things to note: - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when inflights rqs !=0, and I don't see them for this timeout - 0 datarate reported from fio I'll have a look... Thanks, John V4: - resubmit IOs in dispatch list in case that this hctx is dead V3: - re-organize patch 2 & 3 a bit for addressing Hannes's comment - fix patch 4 for avoiding potential deadlock, as found by Hannes V2: - patch4 & patch 5 in V1 have been merged to block tree, so remove them - address comments from John Garry and Minwoo Ming Lei (5): blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED blk-mq: prepare for draining IO when hctx's all CPUs are offline blk-mq: stop to handle IO and drain IO before hctx becomes dead blk-mq: re-submit IO in case that hctx is dead blk-mq: handle requests dispatched from IO scheduler in case that hctx is dead block/blk-mq-debugfs.c | 2 + block/blk-mq-tag.c | 2 +- block/blk-mq-tag.h | 2 + block/blk-mq.c | 137 ++--- block/blk-mq.h | 3 +- drivers/block/loop.c | 2 +- drivers/md/dm-rq.c | 2 +- include/linux/blk-mq.h | 5 ++ include/linux/cpuhotplug.h | 1 + 9 files changed, 141 insertions(+), 15 deletions(-) Cc: John Garry Cc: Bart Van Assche Cc: Hannes Reinecke Cc: Christoph Hellwig Cc: Thomas Gleixner Cc: Keith Busch
Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 14/10/2019 02:25, Ming Lei wrote: On Fri, Oct 11, 2019 at 10:10 PM John Garry wrote: On 11/10/2019 12:55, Ming Lei wrote: On Fri, Oct 11, 2019 at 4:54 PM John Garry wrote: On 10/10/2019 12:21, John Garry wrote: As discussed before, tags of hisilicon V3 are HBA wide. If you switch to real hw queues, each hw queue has to own its independent tags. However, that isn't supported by V3 hardware. I am generating the tag internally in the driver now, so the hostwide tags issue should not be a problem. And, to be clear, I am not paying too much attention to performance, but rather just hotplugging while running IO. An update on testing: I did some scripted overnight testing. The script essentially loops like this: - online all CPUs - run fio bound to a limited bunch of CPUs to cover a hctx mask for 1 minute - offline those CPUs - wait 1 minute (> SCSI or NVMe timeout) - and repeat SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding some fio processes never dying with IOPS @ 0. I don't see any NVMe timeout reported. Did you do any NVMe testing of this sort? Yeah, so for NVMe, I see some sort of regression, like this: Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158037877d:17h:18m:22s] I can reproduce this issue, and it looks like there are requests in ->dispatch. OK, that may match with what I see: - the problem occurring coincides with this callpath with BLK_MQ_S_INTERNAL_STOPPED set: Good catch, these requests should have been re-submitted in blk_mq_hctx_notify_dead() too. Will do it in V4. OK, I'll have a look at v4 and retest - it may take a while as testing this is slow... All the best, John Thanks, Ming Lei
[PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
Hi, Thomas mentioned: " That was the constraint of managed interrupts from the very beginning: The driver/subsystem has to quiesce the interrupt line and the associated queue _before_ it gets shutdown in CPU unplug and not fiddle with it until it's restarted by the core when the CPU is plugged in again. " But no drivers or blk-mq do that before one hctx becomes dead (all CPUs for one hctx are offline), and even worse, blk-mq still tries to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead(). This patchset tries to address the issue in two stages: 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE - mark the hctx as internal stopped, and drain all in-flight requests if the hctx is going to be dead. 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead - steal bios from the requests, and resubmit them via generic_make_request(); these IOs will then be mapped to other live hctxs for dispatch Please comment & review, thanks! John, I don't add your tested-by tag since V4 has some changes, and I'd appreciate it if you could run your test on V4. 
V4: - resubmit IOs in dispatch list in case that this hctx is dead V3: - re-organize patch 2 & 3 a bit for addressing Hannes's comment - fix patch 4 for avoiding potential deadlock, as found by Hannes V2: - patch4 & patch 5 in V1 have been merged to block tree, so remove them - address comments from John Garry and Minwoo Ming Lei (5): blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED blk-mq: prepare for draining IO when hctx's all CPUs are offline blk-mq: stop to handle IO and drain IO before hctx becomes dead blk-mq: re-submit IO in case that hctx is dead blk-mq: handle requests dispatched from IO scheduler in case that hctx is dead block/blk-mq-debugfs.c | 2 + block/blk-mq-tag.c | 2 +- block/blk-mq-tag.h | 2 + block/blk-mq.c | 137 ++--- block/blk-mq.h | 3 +- drivers/block/loop.c | 2 +- drivers/md/dm-rq.c | 2 +- include/linux/blk-mq.h | 5 ++ include/linux/cpuhotplug.h | 1 + 9 files changed, 141 insertions(+), 15 deletions(-) Cc: John Garry Cc: Bart Van Assche Cc: Hannes Reinecke Cc: Christoph Hellwig Cc: Thomas Gleixner Cc: Keith Busch -- 2.20.1
Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
On Fri, Oct 11, 2019 at 10:10 PM John Garry wrote: > > On 11/10/2019 12:55, Ming Lei wrote: > > On Fri, Oct 11, 2019 at 4:54 PM John Garry wrote: > >> > >> On 10/10/2019 12:21, John Garry wrote: > >>> > > As discussed before, tags of hisilicon V3 is HBA wide. If you switch > to real hw queue, each hw queue has to own its independent tags. > However, that isn't supported by V3 hardware. > >>> > >>> I am generating the tag internally in the driver now, so that hostwide > >>> tags issue should not be an issue. > >>> > >>> And, to be clear, I am not paying too much attention to performance, but > >>> rather just hotplugging while running IO. > >>> > >>> An update on testing: > >>> I did some scripted overnight testing. The script essentially loops like > >>> this: > >>> - online all CPUS > >>> - run fio binded on a limited bunch of CPUs to cover a hctx mask for 1 > >>> minute > >>> - offline those CPUs > >>> - wait 1 minute (> SCSI or NVMe timeout) > >>> - and repeat > >>> > >>> SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding > >>> some fio processes never dying with IOPS @ 0. I don't see any NVMe > >>> timeout reported. Did you do any NVMe testing of this sort? > >>> > >> > >> Yeah, so for NVMe, I see some sort of regression, like this: > >> Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta > >> 1158037877d:17h:18m:22s] > > > > I can reproduce this issue, and looks there are requests in ->dispatch. > > OK, that may match with what I see: > - the problem occuring coincides with this callpath with > BLK_MQ_S_INTERNAL_STOPPED set: Good catch, these requests should have been re-submitted in blk_mq_hctx_notify_dead() too. Will do it in V4. Thanks, Ming Lei
Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 11/10/2019 12:55, Ming Lei wrote: On Fri, Oct 11, 2019 at 4:54 PM John Garry wrote: On 10/10/2019 12:21, John Garry wrote: As discussed before, tags of hisilicon V3 are HBA wide. If you switch to real hw queues, each hw queue has to own its independent tags. However, that isn't supported by V3 hardware. I am generating the tag internally in the driver now, so the hostwide tags issue should not be a problem. And, to be clear, I am not paying too much attention to performance, but rather just hotplugging while running IO. An update on testing: I did some scripted overnight testing. The script essentially loops like this: - online all CPUs - run fio bound to a limited bunch of CPUs to cover a hctx mask for 1 minute - offline those CPUs - wait 1 minute (> SCSI or NVMe timeout) - and repeat SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding some fio processes never dying with IOPS @ 0. I don't see any NVMe timeout reported. Did you do any NVMe testing of this sort? Yeah, so for NVMe, I see some sort of regression, like this: Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158037877d:17h:18m:22s] I can reproduce this issue, and it looks like there are requests in ->dispatch. OK, that may match with what I see: - the problem occurring coincides with this callpath with BLK_MQ_S_INTERNAL_STOPPED set: blk_mq_request_bypass_insert (__)blk_mq_try_issue_list_directly blk_mq_sched_insert_requests blk_mq_flush_plug_list blk_flush_plug_list blk_finish_plug blkdev_direct_IO generic_file_read_iter blkdev_read_iter aio_read io_submit_one blk_mq_request_bypass_insert() adds to the dispatch list, and looking at debugfs, could this be that dispatched request sitting there: root@(none)$ more /sys/kernel/debug/block/nvme0n1/hctx18/dispatch ac28511d {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=56, .internal_tag=-1} So could there be some race here? I am a bit busy this week, please feel free to investigate it; debugfs can help you a lot. 
I may have time next week to look into this issue. OK, appreciated. John Thanks, Ming Lei
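For checking specifically whether requests are stranded in the per-hctx ->dispatch lists (rather than dumping the whole debugfs tree), a small helper along these lines can be looped while the test runs. A sketch only — the disk name and the debugfs mount point are assumptions:

```shell
#!/bin/bash
# Print any requests currently sitting in a queue's per-hctx ->dispatch
# lists via blk-mq debugfs; empty output means nothing is stranded.
DEBUGFS=${DEBUGFS:-/sys/kernel/debug}

dispatch_glob() {   # dispatch_glob <disk> -> glob covering all hctx dispatch files
    echo "$DEBUGFS/block/$1/hctx*/dispatch"
}

dump_dispatch() {
    local f
    for f in $(dispatch_glob "$1"); do   # glob expands here if files exist
        [ -e "$f" ] && grep -aH . "$f" || true
    done
    return 0
}

if [ "${1:-}" = --run ]; then
    dump_dispatch "${2:-nvme0n1}"
fi
```

With `--run nvme0n1` this would have shown the stuck `hctx18` request above (plus any others) in one shot, with the owning hctx in the file path.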
Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
On Fri, Oct 11, 2019 at 4:54 PM John Garry wrote: > > On 10/10/2019 12:21, John Garry wrote: > > > >> > >> As discussed before, tags of hisilicon V3 is HBA wide. If you switch > >> to real hw queue, each hw queue has to own its independent tags. > >> However, that isn't supported by V3 hardware. > > > > I am generating the tag internally in the driver now, so that hostwide > > tags issue should not be an issue. > > > > And, to be clear, I am not paying too much attention to performance, but > > rather just hotplugging while running IO. > > > > An update on testing: > > I did some scripted overnight testing. The script essentially loops like > > this: > > - online all CPUS > > - run fio binded on a limited bunch of CPUs to cover a hctx mask for 1 > > minute > > - offline those CPUs > > - wait 1 minute (> SCSI or NVMe timeout) > > - and repeat > > > > SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding > > some fio processes never dying with IOPS @ 0. I don't see any NVMe > > timeout reported. Did you do any NVMe testing of this sort? > > > > Yeah, so for NVMe, I see some sort of regression, like this: > Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta > 1158037877d:17h:18m:22s] I can reproduce this issue, and looks there are requests in ->dispatch. I am a bit busy this week, please feel free to investigate it and debugfs can help you much. I may have time next week for looking this issue. Thanks, Ming Lei
Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 10/10/2019 12:21, John Garry wrote: As discussed before, tags of hisilicon V3 are HBA wide. If you switch to real hw queues, each hw queue has to own its independent tags. However, that isn't supported by V3 hardware. I am generating the tag internally in the driver now, so the hostwide tags issue should not be a problem. And, to be clear, I am not paying too much attention to performance, but rather just hotplugging while running IO. An update on testing: I did some scripted overnight testing. The script essentially loops like this: - online all CPUs - run fio bound to a limited bunch of CPUs to cover a hctx mask for 1 minute - offline those CPUs - wait 1 minute (> SCSI or NVMe timeout) - and repeat SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding some fio processes never dying with IOPS @ 0. I don't see any NVMe timeout reported. Did you do any NVMe testing of this sort? Yeah, so for NVMe, I see some sort of regression, like this: Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158037877d:17h:18m:22s] I have tested against vanilla 5.4 rc1 without problems. If you can advise some debug to add, then I'd appreciate it. If not, I'll try to add some debug to the new paths introduced in this series to see if anything I hit coincides with the error state, so I will at least have a hint... Thanks, John See previous discussion: https://marc.info/?t=15592886301&r=1&w=2 Thanks, Ming
Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 10/10/2019 11:30, Ming Lei wrote: Yes, hisi_sas. So, right, it is single queue today on mainline, but I manually made it multiqueue on my dev branch just to test this series. Otherwise I could not test it for that driver. My dev branch is here, if interested: https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq Your conversion shouldn't work given you do not change .can_queue in the patch of 'hisi_sas_v3: multiqueue support'. Ah, I missed that, but I don't think that it will make a difference really, since I'm only using a single disk, so I don't think that can_queue really comes into play. But as discussed before, tags of hisilicon V3 are HBA wide. If you switch to real hw queues, each hw queue has to own its independent tags. However, that isn't supported by V3 hardware. I am generating the tag internally in the driver now, so the hostwide tags issue should not be a problem. And, to be clear, I am not paying too much attention to performance, but rather just hotplugging while running IO. An update on testing: I did some scripted overnight testing. The script essentially loops like this: - online all CPUs - run fio bound to a limited bunch of CPUs to cover a hctx mask for 1 minute - offline those CPUs - wait 1 minute (> SCSI or NVMe timeout) - and repeat SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding some fio processes never dying with IOPS @ 0. I don't see any NVMe timeout reported. Did you do any NVMe testing of this sort? Thanks, John See previous discussion: https://marc.info/?t=15592886301&r=1&w=2 Thanks, Ming
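For reference, the overnight loop described in this message could look roughly like the script below. This is a sketch rather than the actual script used — the CPU list, device, and fio options are assumptions, and onlining/offlining CPUs via sysfs needs root:

```shell
#!/bin/bash
# Stress loop: online a set of CPUs, run fio bound to them, offline
# them while IO is in flight, wait longer than the request timeout,
# and repeat.
CPUS=${CPUS:-1,2,3,4,5}   # CPUs covering one hctx's mask (assumption)
DEV=${DEV:-/dev/sdb}      # disk under test (assumption)
WAIT=${WAIT:-60}          # must exceed the SCSI/NVMe request timeout

expand_cpulist() {        # expand "1,4-6" into "1 4 5 6"
    local part out=""
    for part in $(echo "$1" | tr ',' ' '); do
        case $part in
            *-*) out="$out $(seq "${part%-*}" "${part#*-}")" ;;
            *)   out="$out $part" ;;
        esac
    done
    echo $out             # unquoted: collapse whitespace to single spaces
}

hotplug_loop() {
    local c
    while true; do
        for c in $(expand_cpulist "$CPUS"); do     # online all CPUs
            echo 1 > "/sys/devices/system/cpu/cpu$c/online"
        done
        taskset -c "$CPUS" fio --name=hp --filename="$DEV" \
            --rw=randread --ioengine=libaio --iodepth=32 --direct=1 \
            --runtime=60 --time_based &            # IO bound to those CPUs
        sleep 5
        for c in $(expand_cpulist "$CPUS"); do     # offline them under IO
            echo 0 > "/sys/devices/system/cpu/cpu$c/online"
        done
        wait                                       # let fio finish or hang
        sleep "$WAIT"                              # exceed request timeout
    done
}

if [ "${1:-}" = --run ]; then hotplug_loop; fi
```

A hung `wait` here is exactly the "fio processes never dying with IOPS @ 0" symptom: if resubmission after the hctx goes dead is broken, the bound fio job never completes.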
Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
On Wed, Oct 09, 2019 at 09:49:35AM +0100, John Garry wrote: > > > > > - steal bios from the request, and resubmit them via > > > > > generic_make_request(), > > > > > then these IO will be mapped to other live hctx for dispatch > > > > > > > > > > Please comment & review, thanks! > > > > > > > > > > John, I don't add your tested-by tag since V3 have some changes, > > > > > and I appreciate if you may run your test on V3. > > > > > > > > > > > > > Will do, Thanks > > > > > > Hi Ming, > > > > > > I got this warning once: > > > > > > [ 162.558185] CPU10: shutdown > > > [ 162.560994] psci: CPU10 killed. > > > [ 162.593939] CPU9: shutdown > > > [ 162.596645] psci: CPU9 killed. > > > [ 162.625838] CPU8: shutdown > > > [ 162.628550] psci: CPU8 killed. > > > [ 162.685790] CPU7: shutdown > > > [ 162.688496] psci: CPU7 killed. > > > [ 162.725771] CPU6: shutdown > > > [ 162.728486] psci: CPU6 killed. > > > [ 162.753884] CPU5: shutdown > > > [ 162.756591] psci: CPU5 killed. > > > [ 162.785584] irq_shutdown > > > [ 162.788277] IRQ 800: no longer affine to CPU4 > > > [ 162.793267] CPU4: shutdown > > > [ 162.795975] psci: CPU4 killed. > > > [ 162.849680] run queue from wrong CPU 13, hctx active > > > [ 162.849692] CPU3: shutdown > > > [ 162.854649] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted > > > 5.4.0-rc1-00012-gad025dd3d001 #1098 > > > [ 162.854653] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI > > > RC0 - > > > V1.16.01 03/15/2019 > > > [ 162.857362] psci: CPU3 killed. 
> > > [ 162.866039] Workqueue: kblockd blk_mq_run_work_fn > > > [ 162.882281] Call trace: > > > [ 162.884716] dump_backtrace+0x0/0x150 > > > [ 162.888365] show_stack+0x14/0x20 > > > [ 162.891668] dump_stack+0xb0/0xf8 > > > [ 162.894970] __blk_mq_run_hw_queue+0x11c/0x128 > > > [ 162.899400] blk_mq_run_work_fn+0x1c/0x28 > > > [ 162.903397] process_one_work+0x1e0/0x358 > > > [ 162.907393] worker_thread+0x40/0x488 > > > [ 162.911042] kthread+0x118/0x120 > > > [ 162.914257] ret_from_fork+0x10/0x18 > > > > What is the HBA? If it is Hisilicon SAS, it isn't strange, because > > this patch can't fix single hw queue with multiple private reply queue > > yet, that can be one follow-up job of this patchset. > > > > Yes, hisi_sas. So, right, it is single queue today on mainline, but I > manually made it multiqueue on my dev branch just to test this series. > Otherwise I could not test it for that driver. > > My dev branch is here, if interested: > https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq Your conversion shouldn't work given you do not change .can_queue in the patch of 'hisi_sas_v3: multiqueue support'. As discussed before, tags of hisilicon V3 is HBA wide. If you switch to real hw queue, each hw queue has to own its independent tags. However, that isn't supported by V3 hardware. See previous discussion: https://marc.info/?t=15592886301&r=1&w=2 Thanks, Ming
Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
- steal bios from the request, and resubmit them via generic_make_request(), then these IO will be mapped to other live hctx for dispatch Please comment & review, thanks! John, I don't add your tested-by tag since V3 have some changes, and I appreciate if you may run your test on V3. Will do, Thanks Hi Ming, I got this warning once: [ 162.558185] CPU10: shutdown [ 162.560994] psci: CPU10 killed. [ 162.593939] CPU9: shutdown [ 162.596645] psci: CPU9 killed. [ 162.625838] CPU8: shutdown [ 162.628550] psci: CPU8 killed. [ 162.685790] CPU7: shutdown [ 162.688496] psci: CPU7 killed. [ 162.725771] CPU6: shutdown [ 162.728486] psci: CPU6 killed. [ 162.753884] CPU5: shutdown [ 162.756591] psci: CPU5 killed. [ 162.785584] irq_shutdown [ 162.788277] IRQ 800: no longer affine to CPU4 [ 162.793267] CPU4: shutdown [ 162.795975] psci: CPU4 killed. [ 162.849680] run queue from wrong CPU 13, hctx active [ 162.849692] CPU3: shutdown [ 162.854649] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 5.4.0-rc1-00012-gad025dd3d001 #1098 [ 162.854653] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019 [ 162.857362] psci: CPU3 killed. [ 162.866039] Workqueue: kblockd blk_mq_run_work_fn [ 162.882281] Call trace: [ 162.884716] dump_backtrace+0x0/0x150 [ 162.888365] show_stack+0x14/0x20 [ 162.891668] dump_stack+0xb0/0xf8 [ 162.894970] __blk_mq_run_hw_queue+0x11c/0x128 [ 162.899400] blk_mq_run_work_fn+0x1c/0x28 [ 162.903397] process_one_work+0x1e0/0x358 [ 162.907393] worker_thread+0x40/0x488 [ 162.911042] kthread+0x118/0x120 [ 162.914257] ret_from_fork+0x10/0x18 What is the HBA? If it is Hisilicon SAS, it isn't strange, because this patch can't fix single hw queue with multiple private reply queue yet, that can be one follow-up job of this patchset. Yes, hisi_sas. So, right, it is single queue today on mainline, but I manually made it multiqueue on my dev branch just to test this series. Otherwise I could not test it for that driver. 
My dev branch is here, if interested: https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq I am also going to retest NVMe. Thanks, John Thanks, Ming .
Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
On Tue, Oct 08, 2019 at 06:15:52PM +0100, John Garry wrote: > On 08/10/2019 10:06, John Garry wrote: > > On 08/10/2019 05:18, Ming Lei wrote: > > > Hi, > > > > > > Thomas mentioned: > > > " > > > That was the constraint of managed interrupts from the very > > > beginning: > > > > > > The driver/subsystem has to quiesce the interrupt line and the > > > associated > > > queue _before_ it gets shutdown in CPU unplug and not fiddle > > > with it > > > until it's restarted by the core when the CPU is plugged in again. > > > " > > > > > > But no drivers or blk-mq do that before one hctx becomes dead(all > > > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries > > > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead(). > > > > > > This patchset tries to address the issue by two stages: > > > > > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE > > > > > > - mark the hctx as internal stopped, and drain all in-flight requests > > > if the hctx is going to be dead. > > > > > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx > > > becomes dead > > > > > > - steal bios from the request, and resubmit them via > > > generic_make_request(), > > > then these IO will be mapped to other live hctx for dispatch > > > > > > Please comment & review, thanks! > > > > > > John, I don't add your tested-by tag since V3 have some changes, > > > and I appreciate if you may run your test on V3. > > > > > > > Will do, Thanks > > Hi Ming, > > I got this warning once: > > [ 162.558185] CPU10: shutdown > [ 162.560994] psci: CPU10 killed. > [ 162.593939] CPU9: shutdown > [ 162.596645] psci: CPU9 killed. > [ 162.625838] CPU8: shutdown > [ 162.628550] psci: CPU8 killed. > [ 162.685790] CPU7: shutdown > [ 162.688496] psci: CPU7 killed. > [ 162.725771] CPU6: shutdown > [ 162.728486] psci: CPU6 killed. > [ 162.753884] CPU5: shutdown > [ 162.756591] psci: CPU5 killed. 
> [ 162.785584] irq_shutdown > [ 162.788277] IRQ 800: no longer affine to CPU4 > [ 162.793267] CPU4: shutdown > [ 162.795975] psci: CPU4 killed. > [ 162.849680] run queue from wrong CPU 13, hctx active > [ 162.849692] CPU3: shutdown > [ 162.854649] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted > 5.4.0-rc1-00012-gad025dd3d001 #1098 > [ 162.854653] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - > V1.16.01 03/15/2019 > [ 162.857362] psci: CPU3 killed. > [ 162.866039] Workqueue: kblockd blk_mq_run_work_fn > [ 162.882281] Call trace: > [ 162.884716] dump_backtrace+0x0/0x150 > [ 162.888365] show_stack+0x14/0x20 > [ 162.891668] dump_stack+0xb0/0xf8 > [ 162.894970] __blk_mq_run_hw_queue+0x11c/0x128 > [ 162.899400] blk_mq_run_work_fn+0x1c/0x28 > [ 162.903397] process_one_work+0x1e0/0x358 > [ 162.907393] worker_thread+0x40/0x488 > [ 162.911042] kthread+0x118/0x120 > [ 162.914257] ret_from_fork+0x10/0x18 What is the HBA? If it is Hisilicon SAS, it isn't strange, because this patch can't fix single hw queue with multiple private reply queue yet, that can be one follow-up job of this patchset. Thanks, Ming
Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 08/10/2019 10:06, John Garry wrote: On 08/10/2019 05:18, Ming Lei wrote: Hi, Thomas mentioned: " That was the constraint of managed interrupts from the very beginning: The driver/subsystem has to quiesce the interrupt line and the associated queue _before_ it gets shutdown in CPU unplug and not fiddle with it until it's restarted by the core when the CPU is plugged in again. " But no drivers or blk-mq do that before one hctx becomes dead(all CPUs for one hctx are offline), and even it is worse, blk-mq stills tries to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead(). This patchset tries to address the issue by two stages: 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE - mark the hctx as internal stopped, and drain all in-flight requests if the hctx is going to be dead. 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead - steal bios from the request, and resubmit them via generic_make_request(), then these IO will be mapped to other live hctx for dispatch Please comment & review, thanks! John, I don't add your tested-by tag since V3 have some changes, and I appreciate if you may run your test on V3. Will do, Thanks Hi Ming, I got this warning once: [ 162.558185] CPU10: shutdown [ 162.560994] psci: CPU10 killed. [ 162.593939] CPU9: shutdown [ 162.596645] psci: CPU9 killed. [ 162.625838] CPU8: shutdown [ 162.628550] psci: CPU8 killed. [ 162.685790] CPU7: shutdown [ 162.688496] psci: CPU7 killed. [ 162.725771] CPU6: shutdown [ 162.728486] psci: CPU6 killed. [ 162.753884] CPU5: shutdown [ 162.756591] psci: CPU5 killed. [ 162.785584] irq_shutdown [ 162.788277] IRQ 800: no longer affine to CPU4 [ 162.793267] CPU4: shutdown [ 162.795975] psci: CPU4 killed. 
[ 162.849680] run queue from wrong CPU 13, hctx active [ 162.849692] CPU3: shutdown [ 162.854649] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 5.4.0-rc1-00012-gad025dd3d001 #1098 [ 162.854653] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019 [ 162.857362] psci: CPU3 killed. [ 162.866039] Workqueue: kblockd blk_mq_run_work_fn [ 162.882281] Call trace: [ 162.884716] dump_backtrace+0x0/0x150 [ 162.888365] show_stack+0x14/0x20 [ 162.891668] dump_stack+0xb0/0xf8 [ 162.894970] __blk_mq_run_hw_queue+0x11c/0x128 [ 162.899400] blk_mq_run_work_fn+0x1c/0x28 [ 162.903397] process_one_work+0x1e0/0x358 [ 162.907393] worker_thread+0x40/0x488 [ 162.911042] kthread+0x118/0x120 [ 162.914257] ret_from_fork+0x10/0x18 [ 162.917834] run queue from wrong CPU 13, hctx active [ 162.922789] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 5.4.0-rc1-00012-gad025dd3d001 #1098 [ 162.931472] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019 [ 162.939983] Workqueue: kblockd blk_mq_run_work_fn [ 162.944674] Call trace: [ 162.947107] dump_backtrace+0x0/0x150 [ 162.950755] show_stack+0x14/0x20 [ 162.954057] dump_stack+0xb0/0xf8 [ 162.957359] __blk_mq_run_hw_queue+0x11c/0x128 [ 162.961788] blk_mq_run_work_fn+0x1c/0x28 [ 162.965784] process_one_work+0x1e0/0x358 [ 162.969780] worker_thread+0x40/0x488 [ 162.973429] kthread+0x118/0x120 [ 162.976644] ret_from_fork+0x10/0x18 [ 162.980214] run queue from wrong CPU 13, hctx active [ 162.985171] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 5.4.0-rc1-00012-gad025dd3d001 #1098 [ 162.993853] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019 [ 163.002366] Workqueue: kblockd blk_mq_run_work_fn [ 163.007057] Call trace: [ 163.009490] dump_backtrace+0x0/0x150 [ 163.013138] show_stack+0x14/0x20 [ 163.016440] dump_stack+0xb0/0xf8 [ 163.019742] __blk_mq_run_hw_queue+0x11c/0x128 [ 163.024172] blk_mq_run_work_fn+0x1c/0x28 [ 163.028167] process_one_work+0x1e0/0x358 [ 
163.032163] worker_thread+0x238/0x488 [ 163.035899] kthread+0x118/0x120 [ 163.039113] ret_from_fork+0x10/0x18 [ 163.042736] run queue from wrong CPU 13, hctx active [ 163.047692] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 5.4.0-rc1-00012-gad025dd3d001 #1098 [ 163.056374] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019 [ 163.064885] Workqueue: kblockd blk_mq_run_work_fn [ 163.069575] Call trace: [ 163.072008] dump_backtrace+0x0/0x150 [ 163.075656] show_stack+0x14/0x20 [ 163.078958] dump_stack+0xb0/0xf8 [ 163.082260] __blk_mq_run_hw_queue+0x11c/0x128 [ 163.086690] blk_mq_run_work_fn+0x1c/0x28 [ 163.090686] process_one_work+0x1e0/0x358 [ 163.094681] worker_thread+0x238/0x488 [ 163.098417] kthread+0x118/0x120 [ 163.101631] ret_from_fork+0x10/0x18 [ 163.105200] run queue from wrong CPU 13, hctx active [ 163.110534] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 5.4.0-rc1-00012-gad025dd3d001 #1098 [ 163.111852] CPU2: shutdown [ 163.119218] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019 [ 163.119222] Workqueue:
Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 08/10/2019 05:18, Ming Lei wrote:

Hi,

Thomas mentioned:
"
That was the constraint of managed interrupts from the very beginning:

   The driver/subsystem has to quiesce the interrupt line and the associated
   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
   until it's restarted by the core when the CPU is plugged in again.
"

But no driver nor blk-mq does that before one hctx becomes dead (all CPUs for one hctx are offline); even worse, blk-mq still tries to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue in two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request and resubmit them via generic_make_request(); these IOs will then be mapped to other live hctxs for dispatch.

Please comment & review, thanks!

John, I didn't add your tested-by tag since V3 has some changes, and I'd appreciate it if you could run your test on V3.

Will do, Thanks

V3:
- re-organize patch 2 & 3 a bit for addressing Hannes's comment
- fix patch 4 for avoiding potential deadlock, as found by Hannes

V2:
- patch 4 & patch 5 in V1 have been merged to the block tree, so remove them
- address comments from John Garry and Minwoo
[PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
Hi,

Thomas mentioned:
"
That was the constraint of managed interrupts from the very beginning:

   The driver/subsystem has to quiesce the interrupt line and the associated
   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
   until it's restarted by the core when the CPU is plugged in again.
"

But no driver nor blk-mq does that before one hctx becomes dead (all CPUs for one hctx are offline); even worse, blk-mq still tries to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue in two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request and resubmit them via generic_make_request(); these IOs will then be mapped to other live hctxs for dispatch.

Please comment & review, thanks!

John, I didn't add your tested-by tag since V3 has some changes, and I'd appreciate it if you could run your test on V3.
V3:
- re-organize patch 2 & 3 a bit for addressing Hannes's comment
- fix patch 4 for avoiding potential deadlock, as found by Hannes

V2:
- patch 4 & patch 5 in V1 have been merged to the block tree, so remove them
- address comments from John Garry and Minwoo

Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: prepare for draining IO when hctx's all CPUs are offline
  blk-mq: stop to handle IO and drain IO before hctx becomes dead
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx is dead

 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   2 +-
 block/blk-mq-tag.h         |   2 +
 block/blk-mq.c             | 135 +
 block/blk-mq.h             |   3 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 138 insertions(+), 16 deletions(-)

Cc: John Garry
Cc: Bart Van Assche
Cc: Hannes Reinecke
Cc: Christoph Hellwig
Cc: Thomas Gleixner
Cc: Keith Busch
--
2.20.1
Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug
On Wed, Oct 02, 2019 at 08:36:52AM -0600, Jens Axboe wrote: > On 10/2/19 3:56 AM, John Garry wrote: > > On 22/08/2019 18:39, John Garry wrote: > >> On 12/08/2019 14:43, Ming Lei wrote: > >>> Hi, > >>> > >>> Thomas mentioned: > >>> " > >>> That was the constraint of managed interrupts from the very > >>> beginning: > >>> > >>>The driver/subsystem has to quiesce the interrupt line and the > >>> associated > >>>queue _before_ it gets shutdown in CPU unplug and not fiddle > >>> with it > >>>until it's restarted by the core when the CPU is plugged in again. > >>> " > >>> > >>> But no drivers or blk-mq do that before one hctx becomes dead(all > >>> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries > >>> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead(). > >>> > >>> This patchset tries to address the issue by two stages: > >>> > >>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE > >>> > >>> - mark the hctx as internal stopped, and drain all in-flight requests > >>> if the hctx is going to be dead. > >>> > >>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx > >>> becomes dead > >>> > >>> - steal bios from the request, and resubmit them via > >>> generic_make_request(), > >>> then these IO will be mapped to other live hctx for dispatch > >>> > >>> Please comment & review, thanks! > >>> > >>> V2: > >>> - patch4 & patch 5 in V1 have been merged to block tree, so remove > >>>them > >>> - address comments from John Garry and Minwoo > >>> > >>> > >>> Ming Lei (5): > >>>blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED > >>>blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ > >>>blk-mq: stop to handle IO before hctx's all CPUs become offline > >>>blk-mq: re-submit IO in case that hctx is dead > >>>blk-mq: handle requests dispatched from IO scheduler in case that hctx > >>> is dead > >> > >> Hi Ming, > >> > >> This looks to fix the hotplug issue for me. 
> >> > >> Previously I could manufacture a scenario while running fio where I got > >> IO timeouts, like this: > >> > >> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online > >> [ 296.897627] process 891 (fio) no longer affine to cpu0 > >> [ 296.898488] process 893 (fio) no longer affine to cpu0 > >> [ 296.910270] process 890 (fio) no longer affine to cpu0 > >> [ 296.927322] IRQ 775: no longer affine to CPU0 > >> [ 296.932762] CPU0: shutdown > >> [ 296.935469] psci: CPU0 killed. > >> root@(none)$ [ 326.971962] sas: Enter sas_scsi_recover_host busy: 61 > >> failed: 61 > >> [ 326.977978] sas: sas_scsi_find_task: aborting task 0xe2cdc79b > >> root@(none)$ [ 333.047964] hisi_sas_v3_hw :74:02.0: internal task > >> abort: timeout and not done. > >> [ 333.055616] hisi_sas_v3_hw :74:02.0: abort task: internal abort (-5) > >> [ 333.062306] sas: sas_scsi_find_task: querying task 0xe2cdc79b > >> [ 333.068776] sas: sas_scsi_find_task: task 0xe2cdc79b not at LU > >> [ 333.075295] sas: task 0xe2cdc79b is not at LU: I_T recover > >> [ 333.081464] sas: I_T nexus reset for dev 5000c500a7b95a49 > >> > >> Please notice the 30-second delay for the SCSI IO timeout. > >> > >> And now I don't see it; here's a sample for irq shutdown: > >> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online > >> [ 344.608148] process 849 (fio) no longer affine to cpu0 > >> [ 344.608639] process 848 (fio) no longer affine to cpu0 > >> [ 344.609454] process 850 (fio) no longer affine to cpu0 > >> [ 344.643481] process 847 (fio) no longer affine to cpu0 > >> [ 346.213842] IRQ 775: no longer affine to CPU0 > >> [ 346.219712] CPU0: shutdown > >> [ 346.222425] psci: CPU0 killed. > >> > >> Please notice the ~1.5s pause, which would be the queue draining. > >> > >> So FWIW: > >> Tested-by: John Garry > >> > >> JFYI, I tested on 5.3-rc5 and cherry-picked > >> https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6 > >> and "blk-mq: add callback of .cleanup_rq". 
> >> > >> Cheers, > >> John > > > > Hi Jens, > > > > I don't mean to be pushy, but can we consider to get these patches from > > Ming merged? > > > > As above, I tested on my SCSI driver and it works. I also tested on an > > NVMe disk, and it solves the condition which generates this message: > > root@(none)$ echo 0 > /sys/devices/system/cpu/cpu2/online > > [ 465.635960] CPU2: shutdown > > [ 465.638662] psci: CPU2 killed. > > [ 111.381653] nvme nvme0: I/O 705 QID 18 timeout, completion polled > > > > (that's on top off v5.4-rc1) > > Ming, can you repost the series? It has been resent out just now. Thanks, Ming
[PATCH V2 RESEND 0/5] blk-mq: improvement on handling IO during CPU hotplug
Hi,

Thomas mentioned:
"
That was the constraint of managed interrupts from the very beginning:

   The driver/subsystem has to quiesce the interrupt line and the associated
   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
   until it's restarted by the core when the CPU is plugged in again.
"

But no driver nor blk-mq does that before one hctx becomes dead (all CPUs for one hctx are offline); even worse, blk-mq still tries to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue in two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request and resubmit them via generic_make_request(); these IOs will then be mapped to other live hctxs for dispatch.

Please comment & review, thanks!

V2:
- patch 4 & patch 5 in V1 have been merged to the block tree, so remove them
- address comments from John Garry and Minwoo

Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  blk-mq: stop to handle IO before hctx's all CPUs become offline
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx is dead

 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   2 +-
 block/blk-mq-tag.h         |   2 +
 block/blk-mq.c             | 143 +
 block/blk-mq.h             |   3 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 146 insertions(+), 16 deletions(-)

Cc: Bart Van Assche
Cc: Hannes Reinecke
Cc: Christoph Hellwig
Cc: Thomas Gleixner
Cc: Keith Busch
--
2.20.1
Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 10/2/19 3:56 AM, John Garry wrote: > On 22/08/2019 18:39, John Garry wrote: >> On 12/08/2019 14:43, Ming Lei wrote: >>> Hi, >>> >>> Thomas mentioned: >>> " >>> That was the constraint of managed interrupts from the very >>> beginning: >>> >>>The driver/subsystem has to quiesce the interrupt line and the >>> associated >>>queue _before_ it gets shutdown in CPU unplug and not fiddle >>> with it >>>until it's restarted by the core when the CPU is plugged in again. >>> " >>> >>> But no drivers or blk-mq do that before one hctx becomes dead(all >>> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries >>> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead(). >>> >>> This patchset tries to address the issue by two stages: >>> >>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE >>> >>> - mark the hctx as internal stopped, and drain all in-flight requests >>> if the hctx is going to be dead. >>> >>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx >>> becomes dead >>> >>> - steal bios from the request, and resubmit them via >>> generic_make_request(), >>> then these IO will be mapped to other live hctx for dispatch >>> >>> Please comment & review, thanks! >>> >>> V2: >>> - patch4 & patch 5 in V1 have been merged to block tree, so remove >>>them >>> - address comments from John Garry and Minwoo >>> >>> >>> Ming Lei (5): >>>blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED >>>blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ >>>blk-mq: stop to handle IO before hctx's all CPUs become offline >>>blk-mq: re-submit IO in case that hctx is dead >>>blk-mq: handle requests dispatched from IO scheduler in case that hctx >>> is dead >> >> Hi Ming, >> >> This looks to fix the hotplug issue for me. 
>> >> Previously I could manufacture a scenario while running fio where I got >> IO timeouts, like this: >> >> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online >> [ 296.897627] process 891 (fio) no longer affine to cpu0 >> [ 296.898488] process 893 (fio) no longer affine to cpu0 >> [ 296.910270] process 890 (fio) no longer affine to cpu0 >> [ 296.927322] IRQ 775: no longer affine to CPU0 >> [ 296.932762] CPU0: shutdown >> [ 296.935469] psci: CPU0 killed. >> root@(none)$ [ 326.971962] sas: Enter sas_scsi_recover_host busy: 61 >> failed: 61 >> [ 326.977978] sas: sas_scsi_find_task: aborting task 0xe2cdc79b >> root@(none)$ [ 333.047964] hisi_sas_v3_hw :74:02.0: internal task >> abort: timeout and not done. >> [ 333.055616] hisi_sas_v3_hw :74:02.0: abort task: internal abort (-5) >> [ 333.062306] sas: sas_scsi_find_task: querying task 0xe2cdc79b >> [ 333.068776] sas: sas_scsi_find_task: task 0xe2cdc79b not at LU >> [ 333.075295] sas: task 0xe2cdc79b is not at LU: I_T recover >> [ 333.081464] sas: I_T nexus reset for dev 5000c500a7b95a49 >> >> Please notice the 30-second delay for the SCSI IO timeout. >> >> And now I don't see it; here's a sample for irq shutdown: >> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online >> [ 344.608148] process 849 (fio) no longer affine to cpu0 >> [ 344.608639] process 848 (fio) no longer affine to cpu0 >> [ 344.609454] process 850 (fio) no longer affine to cpu0 >> [ 344.643481] process 847 (fio) no longer affine to cpu0 >> [ 346.213842] IRQ 775: no longer affine to CPU0 >> [ 346.219712] CPU0: shutdown >> [ 346.222425] psci: CPU0 killed. >> >> Please notice the ~1.5s pause, which would be the queue draining. >> >> So FWIW: >> Tested-by: John Garry >> >> JFYI, I tested on 5.3-rc5 and cherry-picked >> https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6 >> and "blk-mq: add callback of .cleanup_rq". 
>>
>> Cheers,
>> John
>
> Hi Jens,
>
> I don't mean to be pushy, but can we consider getting these patches from
> Ming merged?
>
> As above, I tested on my SCSI driver and it works. I also tested on an
> NVMe disk, and it solves the condition which generates this message:
> root@(none)$ echo 0 > /sys/devices/system/cpu/cpu2/online
> [ 465.635960] CPU2: shutdown
> [ 465.638662] psci: CPU2 killed.
> [ 111.381653] nvme nvme0: I/O 705 QID 18 timeout, completion polled
>
> (that's on top of v5.4-rc1)

Ming, can you repost the series?

--
Jens Axboe
Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 22/08/2019 18:39, John Garry wrote: On 12/08/2019 14:43, Ming Lei wrote: Hi, Thomas mentioned: " That was the constraint of managed interrupts from the very beginning: The driver/subsystem has to quiesce the interrupt line and the associated queue _before_ it gets shutdown in CPU unplug and not fiddle with it until it's restarted by the core when the CPU is plugged in again. " But no drivers or blk-mq do that before one hctx becomes dead(all CPUs for one hctx are offline), and even it is worse, blk-mq stills tries to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead(). This patchset tries to address the issue by two stages: 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE - mark the hctx as internal stopped, and drain all in-flight requests if the hctx is going to be dead. 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead - steal bios from the request, and resubmit them via generic_make_request(), then these IO will be mapped to other live hctx for dispatch Please comment & review, thanks! V2: - patch4 & patch 5 in V1 have been merged to block tree, so remove them - address comments from John Garry and Minwoo Ming Lei (5): blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ blk-mq: stop to handle IO before hctx's all CPUs become offline blk-mq: re-submit IO in case that hctx is dead blk-mq: handle requests dispatched from IO scheduler in case that hctx is dead Hi Ming, This looks to fix the hotplug issue for me. Previously I could manufacture a scenario while running fio where I got IO timeouts, like this: root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online [ 296.897627] process 891 (fio) no longer affine to cpu0 [ 296.898488] process 893 (fio) no longer affine to cpu0 [ 296.910270] process 890 (fio) no longer affine to cpu0 [ 296.927322] IRQ 775: no longer affine to CPU0 [ 296.932762] CPU0: shutdown [ 296.935469] psci: CPU0 killed. 
root@(none)$ [ 326.971962] sas: Enter sas_scsi_recover_host busy: 61 failed: 61 [ 326.977978] sas: sas_scsi_find_task: aborting task 0xe2cdc79b root@(none)$ [ 333.047964] hisi_sas_v3_hw :74:02.0: internal task abort: timeout and not done. [ 333.055616] hisi_sas_v3_hw :74:02.0: abort task: internal abort (-5) [ 333.062306] sas: sas_scsi_find_task: querying task 0xe2cdc79b [ 333.068776] sas: sas_scsi_find_task: task 0xe2cdc79b not at LU [ 333.075295] sas: task 0xe2cdc79b is not at LU: I_T recover [ 333.081464] sas: I_T nexus reset for dev 5000c500a7b95a49 Please notice the 30-second delay for the SCSI IO timeout. And now I don't see it; here's a sample for irq shutdown: root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online [ 344.608148] process 849 (fio) no longer affine to cpu0 [ 344.608639] process 848 (fio) no longer affine to cpu0 [ 344.609454] process 850 (fio) no longer affine to cpu0 [ 344.643481] process 847 (fio) no longer affine to cpu0 [ 346.213842] IRQ 775: no longer affine to CPU0 [ 346.219712] CPU0: shutdown [ 346.222425] psci: CPU0 killed. Please notice the ~1.5s pause, which would be the queue draining. So FWIW: Tested-by: John Garry JFYI, I tested on 5.3-rc5 and cherry-picked https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6 and "blk-mq: add callback of .cleanup_rq". Cheers, John Hi Jens, I don't mean to be pushy, but can we consider to get these patches from Ming merged? As above, I tested on my SCSI driver and it works. I also tested on an NVMe disk, and it solves the condition which generates this message: root@(none)$ echo 0 > /sys/devices/system/cpu/cpu2/online [ 465.635960] CPU2: shutdown [ 465.638662] psci: CPU2 killed. [ 111.381653] nvme nvme0: I/O 705 QID 18 timeout, completion polled (that's on top off v5.4-rc1) Thanks, John block/blk-mq-debugfs.c | 2 +
Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 12/08/2019 14:43, Ming Lei wrote: Hi, Thomas mentioned: " That was the constraint of managed interrupts from the very beginning: The driver/subsystem has to quiesce the interrupt line and the associated queue _before_ it gets shutdown in CPU unplug and not fiddle with it until it's restarted by the core when the CPU is plugged in again. " But no drivers or blk-mq do that before one hctx becomes dead(all CPUs for one hctx are offline), and even it is worse, blk-mq stills tries to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead(). This patchset tries to address the issue by two stages: 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE - mark the hctx as internal stopped, and drain all in-flight requests if the hctx is going to be dead. 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead - steal bios from the request, and resubmit them via generic_make_request(), then these IO will be mapped to other live hctx for dispatch Please comment & review, thanks! V2: - patch4 & patch 5 in V1 have been merged to block tree, so remove them - address comments from John Garry and Minwoo Ming Lei (5): blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ blk-mq: stop to handle IO before hctx's all CPUs become offline blk-mq: re-submit IO in case that hctx is dead blk-mq: handle requests dispatched from IO scheduler in case that hctx is dead Hi Ming, This looks to fix the hotplug issue for me. Previously I could manufacture a scenario while running fio where I got IO timeouts, like this: root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online [ 296.897627] process 891 (fio) no longer affine to cpu0 [ 296.898488] process 893 (fio) no longer affine to cpu0 [ 296.910270] process 890 (fio) no longer affine to cpu0 [ 296.927322] IRQ 775: no longer affine to CPU0 [ 296.932762] CPU0: shutdown [ 296.935469] psci: CPU0 killed. 
root@(none)$ [ 326.971962] sas: Enter sas_scsi_recover_host busy: 61 failed: 61 [ 326.977978] sas: sas_scsi_find_task: aborting task 0xe2cdc79b root@(none)$ [ 333.047964] hisi_sas_v3_hw :74:02.0: internal task abort: timeout and not done. [ 333.055616] hisi_sas_v3_hw :74:02.0: abort task: internal abort (-5) [ 333.062306] sas: sas_scsi_find_task: querying task 0xe2cdc79b [ 333.068776] sas: sas_scsi_find_task: task 0xe2cdc79b not at LU [ 333.075295] sas: task 0xe2cdc79b is not at LU: I_T recover [ 333.081464] sas: I_T nexus reset for dev 5000c500a7b95a49 Please notice the 30-second delay for the SCSI IO timeout. And now I don't see it; here's a sample for irq shutdown: root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online [ 344.608148] process 849 (fio) no longer affine to cpu0 [ 344.608639] process 848 (fio) no longer affine to cpu0 [ 344.609454] process 850 (fio) no longer affine to cpu0 [ 344.643481] process 847 (fio) no longer affine to cpu0 [ 346.213842] IRQ 775: no longer affine to CPU0 [ 346.219712] CPU0: shutdown [ 346.222425] psci: CPU0 killed. Please notice the ~1.5s pause, which would be the queue draining. So FWIW: Tested-by: John Garry JFYI, I tested on 5.3-rc5 and cherry-picked https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6 and "blk-mq: add callback of .cleanup_rq". Cheers, John block/blk-mq-debugfs.c | 2 + block/blk-mq-tag.c | 2 +- block/blk-mq-tag.h | 2 + block/blk-mq.c | 143 + block/blk-mq.h | 3 +- drivers/block/loop.c | 2 +- drivers/md/dm-rq.c | 2 +- include/linux/blk-mq.h | 5 ++ include/linux/cpuhotplug.h | 1 + 9 files changed, 146 insertions(+), 16 deletions(-) Cc: Bart Van Assche Cc: Hannes Reinecke Cc: Christoph Hellwig Cc: Thomas Gleixner Cc: Keith Busch
Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug
On Mon, Aug 12, 2019 at 05:21:44PM +0100, John Garry wrote: > On 12/08/2019 14:46, Ming Lei wrote: > > Hi John, > > > > On Mon, Aug 12, 2019 at 09:43:07PM +0800, Ming Lei wrote: > > > Hi, > > > > > > Thomas mentioned: > > > " > > > That was the constraint of managed interrupts from the very > > > beginning: > > > > > > The driver/subsystem has to quiesce the interrupt line and the > > > associated > > > queue _before_ it gets shutdown in CPU unplug and not fiddle with it > > > until it's restarted by the core when the CPU is plugged in again. > > > " > > > > > > But no drivers or blk-mq do that before one hctx becomes dead(all > > > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries > > > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead(). > > > > > > This patchset tries to address the issue by two stages: > > > > > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE > > > > > > - mark the hctx as internal stopped, and drain all in-flight requests > > > if the hctx is going to be dead. > > > > > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes > > > dead > > > > > > - steal bios from the request, and resubmit them via > > > generic_make_request(), > > > then these IO will be mapped to other live hctx for dispatch > > > > > > Please comment & review, thanks! 
> > > > > > V2: > > > - patch4 & patch 5 in V1 have been merged to block tree, so remove > > > them > > > - address comments from John Garry and Minwoo > > > > > > > > > Ming Lei (5): > > > blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED > > > blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ > > > blk-mq: stop to handle IO before hctx's all CPUs become offline > > > blk-mq: re-submit IO in case that hctx is dead > > > blk-mq: handle requests dispatched from IO scheduler in case that hctx > > > is dead > > > > > > block/blk-mq-debugfs.c | 2 + > > > block/blk-mq-tag.c | 2 +- > > > block/blk-mq-tag.h | 2 + > > > block/blk-mq.c | 143 + > > > block/blk-mq.h | 3 +- > > > drivers/block/loop.c | 2 +- > > > drivers/md/dm-rq.c | 2 +- > > > include/linux/blk-mq.h | 5 ++ > > > include/linux/cpuhotplug.h | 1 + > > > 9 files changed, 146 insertions(+), 16 deletions(-) > > > > > > Cc: Bart Van Assche > > > Cc: Hannes Reinecke > > > Cc: Christoph Hellwig > > > Cc: Thomas Gleixner > > > Cc: Keith Busch > > > -- > > > 2.20.1 > > > > > > > Sorry for forgetting to Cc you. > > Already subscribed :) > > I don't mean to hijack this thread, but JFYI we're getting around to test > https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2 - unfortunately > we're still seeing a performance regression. I can't see where it's coming > from. We're double-checking the test though. host-tag patchset is only for several particular drivers which use private reply queue as completion queue. This patchset is for handling generic blk-mq CPU hotplug issue, and the several particular scsi drivers(hisi_sas_v3, hpsa, megaraid_sas and mp3sas) won't be covered so far. I'd suggest to move on for generic blk-mq devices first given now blk-mq is the only request IO path now. 
There are at least two choices for handling drivers/devices with a private completion queue:

1) host-tags
- the performance issue shouldn't be hard to solve, given it is the same as single tags in theory, and only corner cases remain. What I am not glad about with this approach is that the blk-mq-tag code becomes a mess.

2) private callback
- we could simply define a private callback to drain each completion queue in the driver.
- the problem is that the four drivers have to duplicate the same job.

Thanks,
Ming
Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug
On 12/08/2019 14:46, Ming Lei wrote: Hi John, On Mon, Aug 12, 2019 at 09:43:07PM +0800, Ming Lei wrote: Hi, Thomas mentioned: " That was the constraint of managed interrupts from the very beginning: The driver/subsystem has to quiesce the interrupt line and the associated queue _before_ it gets shutdown in CPU unplug and not fiddle with it until it's restarted by the core when the CPU is plugged in again. " But no drivers or blk-mq do that before one hctx becomes dead(all CPUs for one hctx are offline), and even it is worse, blk-mq stills tries to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead(). This patchset tries to address the issue by two stages: 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE - mark the hctx as internal stopped, and drain all in-flight requests if the hctx is going to be dead. 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead - steal bios from the request, and resubmit them via generic_make_request(), then these IO will be mapped to other live hctx for dispatch Please comment & review, thanks! V2: - patch4 & patch 5 in V1 have been merged to block tree, so remove them - address comments from John Garry and Minwoo Ming Lei (5): blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ blk-mq: stop to handle IO before hctx's all CPUs become offline blk-mq: re-submit IO in case that hctx is dead blk-mq: handle requests dispatched from IO scheduler in case that hctx is dead block/blk-mq-debugfs.c | 2 + block/blk-mq-tag.c | 2 +- block/blk-mq-tag.h | 2 + block/blk-mq.c | 143 + block/blk-mq.h | 3 +- drivers/block/loop.c | 2 +- drivers/md/dm-rq.c | 2 +- include/linux/blk-mq.h | 5 ++ include/linux/cpuhotplug.h | 1 + 9 files changed, 146 insertions(+), 16 deletions(-) Cc: Bart Van Assche Cc: Hannes Reinecke Cc: Christoph Hellwig Cc: Thomas Gleixner Cc: Keith Busch -- 2.20.1 Sorry for forgetting to Cc you. 
Already subscribed :)

I don't mean to hijack this thread, but JFYI we're getting around to testing https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2 - unfortunately we're still seeing a performance regression. I can't see where it's coming from. We're double-checking the test though.

Thanks,
John

Thanks,
Ming
Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug
Hi John, On Mon, Aug 12, 2019 at 09:43:07PM +0800, Ming Lei wrote: > Hi, > > Thomas mentioned: > " > That was the constraint of managed interrupts from the very beginning: > > The driver/subsystem has to quiesce the interrupt line and the > associated > queue _before_ it gets shutdown in CPU unplug and not fiddle with it > until it's restarted by the core when the CPU is plugged in again. > " > > But no drivers or blk-mq do that before one hctx becomes dead(all > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead(). > > This patchset tries to address the issue by two stages: > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE > > - mark the hctx as internal stopped, and drain all in-flight requests > if the hctx is going to be dead. > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead > > - steal bios from the request, and resubmit them via generic_make_request(), > then these IO will be mapped to other live hctx for dispatch > > Please comment & review, thanks! 
> > V2: > - patch4 & patch 5 in V1 have been merged to block tree, so remove > them > - address comments from John Garry and Minwoo > > > Ming Lei (5): > blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED > blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ > blk-mq: stop to handle IO before hctx's all CPUs become offline > blk-mq: re-submit IO in case that hctx is dead > blk-mq: handle requests dispatched from IO scheduler in case that hctx > is dead > > block/blk-mq-debugfs.c | 2 + > block/blk-mq-tag.c | 2 +- > block/blk-mq-tag.h | 2 + > block/blk-mq.c | 143 + > block/blk-mq.h | 3 +- > drivers/block/loop.c | 2 +- > drivers/md/dm-rq.c | 2 +- > include/linux/blk-mq.h | 5 ++ > include/linux/cpuhotplug.h | 1 + > 9 files changed, 146 insertions(+), 16 deletions(-) > > Cc: Bart Van Assche > Cc: Hannes Reinecke > Cc: Christoph Hellwig > Cc: Thomas Gleixner > Cc: Keith Busch > -- > 2.20.1 > Sorry for forgetting to Cc you. Thanks, Ming
[PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug
Hi,

Thomas mentioned:
"
That was the constraint of managed interrupts from the very beginning:

   The driver/subsystem has to quiesce the interrupt line and the associated
   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
   until it's restarted by the core when the CPU is plugged in again.
"

But no driver nor blk-mq does that before one hctx becomes dead (all CPUs for one hctx are offline); even worse, blk-mq still tries to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue in two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request and resubmit them via generic_make_request(); these IOs will then be mapped to other live hctxs for dispatch.

Please comment & review, thanks!

V2:
- patch 4 & patch 5 in V1 have been merged to the block tree, so remove them
- address comments from John Garry and Minwoo

Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  blk-mq: stop to handle IO before hctx's all CPUs become offline
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx is dead

 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   2 +-
 block/blk-mq-tag.h         |   2 +
 block/blk-mq.c             | 143 +
 block/blk-mq.h             |   3 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 146 insertions(+), 16 deletions(-)

Cc: Bart Van Assche
Cc: Hannes Reinecke
Cc: Christoph Hellwig
Cc: Thomas Gleixner
Cc: Keith Busch
--
2.20.1
Re: [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug
On 16/07/2019 15:18, Ming Lei wrote:
>> So you have plans to post an updated "[PATCH 0/9] blk-mq/scsi: convert
>> private reply queue into blk_mq hw queue" then?

Hi Ming,

> V2 has been in the following tree for a while:
>
> https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2
>
> It works, however the implementation is a bit ugly even though the
> idea is simple.

Yeah, sorry to say that I agree - the BLK_MQ_F_TAG_HCTX_SHARED checks
look bolted on.

> So I think we may need to think of it further, for better
> implementation or approach.

Understood. But at least we may test it to ensure no performance
regression.

Thanks,
John

> Thanks,
> Ming Lei
Re: [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug
Hi John, On Tue, Jul 16, 2019 at 2:55 PM John Garry wrote: > > 在 12/07/2019 10:47, Ming Lei 写道: > > Hi, > > > > Thomas mentioned: > > " > > That was the constraint of managed interrupts from the very beginning: > > > >The driver/subsystem has to quiesce the interrupt line and the > > associated > >queue _before_ it gets shutdown in CPU unplug and not fiddle with it > >until it's restarted by the core when the CPU is plugged in again. > > " > > > > But no drivers or blk-mq do that before one hctx becomes dead(all > > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries > > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead(). > > > > This patchset tries to address the issue by two stages: > > > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE > > > > - mark the hctx as internal stopped, and drain all in-flight requests > > if the hctx is going to be dead. > > > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes > > dead > > > > - steal bios from the request, and resubmit them via generic_make_request(), > > then these IO will be mapped to other live hctx for dispatch > > > > Please comment & review, thanks! > > > > Hi Ming, > > FWIW, to me this series looks reasonable. Thanks! > > So you have plans to post an updated "[PATCH 0/9] blk-mq/scsi: convert > private reply queue into blk_mq hw queue" then? V2 has been in the following tree for a while: https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2 It works, however the implementation is a bit ugly even though the idea is simple. So I think we may need to think of it further, for better implementation or approach. Thanks, Ming Lei
Re: [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug
On 12/07/2019 10:47, Ming Lei wrote:
> Hi,
>
> Thomas mentioned:
> "
> That was the constraint of managed interrupts from the very beginning:
>
>  The driver/subsystem has to quiesce the interrupt line and the associated
>  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>  until it's restarted by the core when the CPU is plugged in again.
> "
>
> But no drivers or blk-mq do that before one hctx becomes dead(all
> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
>
> This patchset tries to address the issue by two stages:
>
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>
> - mark the hctx as internal stopped, and drain all in-flight requests
>   if the hctx is going to be dead.
>
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes
>    dead
>
> - steal bios from the request, and resubmit them via generic_make_request(),
>   then these IO will be mapped to other live hctx for dispatch
>
> Please comment & review, thanks!

Hi Ming,

FWIW, to me this series looks reasonable.

So you have plans to post an updated "[PATCH 0/9] blk-mq/scsi: convert
private reply queue into blk_mq hw queue" then?
All the best, John Ming Lei (7): blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ blk-mq: stop to handle IO before hctx's all CPUs become offline blk-mq: add callback of .free_request SCSI: implement .free_request callback blk-mq: re-submit IO in case that hctx is dead blk-mq: handle requests dispatched from IO scheduler in case that hctx is dead block/blk-mq-debugfs.c | 2 + block/blk-mq-tag.c | 2 +- block/blk-mq-tag.h | 2 + block/blk-mq.c | 147 ++--- block/blk-mq.h | 3 +- drivers/block/loop.c | 2 +- drivers/md/dm-rq.c | 2 +- drivers/scsi/scsi_lib.c| 13 include/linux/blk-mq.h | 12 +++ include/linux/cpuhotplug.h | 1 + 10 files changed, 170 insertions(+), 16 deletions(-) Cc: Bart Van Assche Cc: Hannes Reinecke Cc: Christoph Hellwig Cc: Thomas Gleixner Cc: Keith Busch
[RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug
Hi,

Thomas mentioned:

"
That was the constraint of managed interrupts from the very beginning:

 The driver/subsystem has to quiesce the interrupt line and the associated
 queue _before_ it gets shutdown in CPU unplug and not fiddle with it
 until it's restarted by the core when the CPU is plugged in again.
"

But neither drivers nor blk-mq do that before one hctx becomes dead (that
is, before all CPUs mapped to the hctx are offline). Even worse, blk-mq
still tries to run the hw queue after the hctx is dead, see
blk_mq_hctx_notify_dead().

This patchset tries to address the issue in two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

	- mark the hctx as internal stopped, and drain all in-flight
	  requests if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
   becomes dead

	- steal bios from the requests and resubmit them via
	  generic_make_request(), so these IOs will be mapped to other
	  live hctxs for dispatch.

Please comment & review, thanks!

Ming Lei (7):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  blk-mq: stop to handle IO before hctx's all CPUs become offline
  blk-mq: add callback of .free_request
  SCSI: implement .free_request callback
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
    is dead

 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   2 +-
 block/blk-mq-tag.h         |   2 +
 block/blk-mq.c             | 147 ++---
 block/blk-mq.h             |   3 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 drivers/scsi/scsi_lib.c    |  13
 include/linux/blk-mq.h     |  12 +++
 include/linux/cpuhotplug.h |   1 +
 10 files changed, 170 insertions(+), 16 deletions(-)

Cc: Bart Van Assche
Cc: Hannes Reinecke
Cc: Christoph Hellwig
Cc: Thomas Gleixner
Cc: Keith Busch
--
2.20.1
RE: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet mail)
Hi Ming, > -Original Message- > From: Ming Lei > Sent: Tuesday, June 25, 2019 10:27 AM > To: wenbinzeng(曾文斌) > Cc: Wenbin Zeng ; ax...@kernel.dk; > keith.bu...@intel.com; > h...@suse.com; osan...@fb.com; s...@grimberg.me; bvanass...@acm.org; > linux-block@vger.kernel.org; linux-ker...@vger.kernel.org > Subject: Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet > mail) > > On Tue, Jun 25, 2019 at 02:14:46AM +, wenbinzeng(曾文斌) wrote: > > Hi Ming, > > > > > -Original Message- > > > From: Ming Lei > > > Sent: Tuesday, June 25, 2019 9:55 AM > > > To: Wenbin Zeng > > > Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com; > > > osan...@fb.com; s...@grimberg.me; bvanass...@acm.org; > > > linux-block@vger.kernel.org; linux-ker...@vger.kernel.org; > > > wenbinzeng(曾文斌) > > > Subject: Re: [PATCH] blk-mq: update hctx->cpumask at > > > cpu-hotplug(Internet mail) > > > > > > On Mon, Jun 24, 2019 at 11:24:07PM +0800, Wenbin Zeng wrote: > > > > Currently hctx->cpumask is not updated when hot-plugging new cpus, > > > > as there are many chances kblockd_mod_delayed_work_on() getting > > > > called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run > > > > > > There are only two cases in which WORK_CPU_UNBOUND is applied: > > > > > > 1) single hw queue > > > > > > 2) multiple hw queue, and all CPUs in this hctx become offline > > > > > > For 1), all CPUs can be found in hctx->cpumask. > > > > > > > on the newly-plugged cpus, consequently __blk_mq_run_hw_queue() > > > > reporting excessive "run queue from wrong CPU" messages because > > > > cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false. > > > > > > The message means CPU hotplug race is triggered. > > > > > > Yeah, there is big problem in blk_mq_hctx_notify_dead() which is > > > called after one CPU is dead, but still run this hw queue to > > > dispatch request, and all CPUs in this hctx might become offline. 
> > > > > > We have some discussion before on this issue: > > > > > > https://lore.kernel.org/linux-block/CACVXFVN729SgFQGUgmu1iN7P6Mv5+pu > > > E78STz8hj > > > 9j5bs82...@mail.gmail.com/ > > > > > > > There is another scenario, you can reproduce it by hot-plugging cpus to kvm > > guests > via qemu monitor (I believe virsh setvcpus --live can do the same thing), for > example: > > (qemu) cpu-add 1 > > (qemu) cpu-add 2 > > (qemu) cpu-add 3 > > > > In such scenario, cpu 1, 2 and 3 are not visible at boot, hctx->cpumask > > doesn't > get synced when these cpus are added. > > It is CPU cold-plug, we suppose to support it. > > The new added CPUs should be visible to hctx, since we spread queues among all > possible CPUs(), please see blk_mq_map_queues() and > irq_build_affinity_masks(), > which is like static allocation on CPU resources. > > Otherwise, you might use an old kernel or there is bug somewhere. It turns out that I was using old kernel, version 4.14, I tested the latest version, it works well as you said. Thank you very much. > > > > > > > > > > > This patch added a cpu-hotplug handler into blk-mq, updating > > > > hctx->cpumask at cpu-hotplug. > > > > > > This way isn't correct, hctx->cpumask should be kept as sync with > > > queue mapping. > > > > Please advise what should I do to deal with the above situation? Thanks a > > lot. > > As I shared in last email, there is one approach discussed, which seems > doable. > > Thanks, > Ming
Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet mail)
On 6/25/19 10:27 AM, Ming Lei wrote: > On Tue, Jun 25, 2019 at 02:14:46AM +, wenbinzeng(曾文斌) wrote: >> Hi Ming, >> >>> -Original Message- >>> From: Ming Lei >>> Sent: Tuesday, June 25, 2019 9:55 AM >>> To: Wenbin Zeng >>> Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com; osan...@fb.com; >>> s...@grimberg.me; bvanass...@acm.org; linux-block@vger.kernel.org; >>> linux-ker...@vger.kernel.org; wenbinzeng(曾文斌) >>> Subject: Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet >>> mail) >>> >>> On Mon, Jun 24, 2019 at 11:24:07PM +0800, Wenbin Zeng wrote: >>>> Currently hctx->cpumask is not updated when hot-plugging new cpus, >>>> as there are many chances kblockd_mod_delayed_work_on() getting >>>> called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run >>> >>> There are only two cases in which WORK_CPU_UNBOUND is applied: >>> >>> 1) single hw queue >>> >>> 2) multiple hw queue, and all CPUs in this hctx become offline >>> >>> For 1), all CPUs can be found in hctx->cpumask. >>> >>>> on the newly-plugged cpus, consequently __blk_mq_run_hw_queue() >>>> reporting excessive "run queue from wrong CPU" messages because >>>> cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false. >>> >>> The message means CPU hotplug race is triggered. >>> >>> Yeah, there is big problem in blk_mq_hctx_notify_dead() which is called >>> after one CPU is dead, but still run this hw queue to dispatch request, >>> and all CPUs in this hctx might become offline. 
>>>
>>> We have some discussion before on this issue:
>>>
>>> https://lore.kernel.org/linux-block/CACVXFVN729SgFQGUgmu1iN7P6Mv5+puE78STz8hj9j5bs82...@mail.gmail.com/
>>>
>>
>> There is another scenario, you can reproduce it by hot-plugging cpus to kvm
>> guests via qemu monitor (I believe virsh setvcpus --live can do the same
>> thing), for example:
>> (qemu) cpu-add 1
>> (qemu) cpu-add 2
>> (qemu) cpu-add 3
>>
>> In such scenario, cpu 1, 2 and 3 are not visible at boot, hctx->cpumask
>> doesn't get synced when these cpus are added.

Here is how I play with it with the most recent qemu and linux.

Boot the VM with 1 out of 4 vcpus online:

# qemu-system-x86_64 -hda disk.img \
	-smp 1,maxcpus=4 \
	-m 4096M -enable-kvm \
	-device nvme,drive=lightnvme,serial=deadbeaf1 \
	-drive file=nvme.img,if=none,id=lightnvme \
	-vnc :0 \
	-kernel /.../mainline-linux/arch/x86_64/boot/bzImage \
	-append "root=/dev/sda1 init=/sbin/init text" \
	-monitor stdio -net nic -net user,hostfwd=tcp::5022-:22

As Ming mentioned, after boot:

# cat /proc/cpuinfo | grep processor
processor	: 0
# cat /sys/block/nvme0n1/mq/0/cpu_list
0
# cat /sys/block/nvme0n1/mq/1/cpu_list
1
# cat /sys/block/nvme0n1/mq/2/cpu_list
2
# cat /sys/block/nvme0n1/mq/3/cpu_list
3
# cat /proc/interrupts | grep nvme
 24:	11	PCI-MSI 65536-edge	nvme0q0
 25:	78	PCI-MSI 65537-edge	nvme0q1
 26:	0	PCI-MSI 65538-edge	nvme0q2
 27:	0	PCI-MSI 65539-edge	nvme0q3
 28:	0	PCI-MSI 65540-edge	nvme0q4

I hotplug with "device_add qemu64-x86_64-cpu,id=core1,socket-id=1,core-id=0,thread-id=0"
in the qemu monitor.

Dongli Zhang

>
> It is CPU cold-plug, we suppose to support it.
>
> The new added CPUs should be visible to hctx, since we spread queues
> among all possible CPUs(), please see blk_mq_map_queues() and
> irq_build_affinity_masks(), which is like static allocation on CPU
> resources.
>
> Otherwise, you might use an old kernel or there is bug somewhere.
> >> >>>> >>>> This patch added a cpu-hotplug handler into blk-mq, updating >>>> hctx->cpumask at cpu-hotplug. >>> >>> This way isn't correct, hctx->cpumask should be kept as sync with >>> queue mapping. >> >> Please advise what should I do to deal with the above situation? Thanks a >> lot. > > As I shared in last email, there is one approach discussed, which seems > doable. > > Thanks, > Ming >
RE: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet mail)
Hi Dongli, > -Original Message- > From: Dongli Zhang > Sent: Tuesday, June 25, 2019 9:30 AM > To: Wenbin Zeng > Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com; > ming@redhat.com; > osan...@fb.com; s...@grimberg.me; bvanass...@acm.org; > linux-block@vger.kernel.org; linux-ker...@vger.kernel.org; wenbinzeng(曾文斌) > > Subject: Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet > mail) > > Hi Wenbin, > > On 6/24/19 11:24 PM, Wenbin Zeng wrote: > > Currently hctx->cpumask is not updated when hot-plugging new cpus, > > as there are many chances kblockd_mod_delayed_work_on() getting > > called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run > > on the newly-plugged cpus, consequently __blk_mq_run_hw_queue() > > reporting excessive "run queue from wrong CPU" messages because > > cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false. > > > > This patch added a cpu-hotplug handler into blk-mq, updating > > hctx->cpumask at cpu-hotplug. > > > > Signed-off-by: Wenbin Zeng > > --- > > block/blk-mq.c | 29 + > > include/linux/blk-mq.h | 1 + > > 2 files changed, 30 insertions(+) > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > index ce0f5f4..2e465fc 100644 > > --- a/block/blk-mq.c > > +++ b/block/blk-mq.c > > @@ -39,6 +39,8 @@ > > #include "blk-mq-sched.h" > > #include "blk-rq-qos.h" > > > > +static enum cpuhp_state cpuhp_blk_mq_online; > > + > > static void blk_mq_poll_stats_start(struct request_queue *q); > > static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb); > > > > @@ -2215,6 +2217,21 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, > > struct > blk_mq_tags *tags, > > return -ENOMEM; > > } > > > > +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node > > *node) > > +{ > > + struct blk_mq_hw_ctx *hctx; > > + > > + hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online); > > + > > + if (!cpumask_test_cpu(cpu, hctx->cpumask)) { > > + mutex_lock(&hctx->queue->sysfs_lock); > > 
+ cpumask_set_cpu(cpu, hctx->cpumask); > > + mutex_unlock(&hctx->queue->sysfs_lock); > > + } > > + > > + return 0; > > +} > > + > > As this callback is registered for each hctx, when a cpu is online, it is > called > for each hctx. > > Just taking a 4-queue nvme as example (regardless about other block like > loop). > Suppose cpu=2 (out of 0, 1, 2 and 3) is offline. When we online cpu=2, > > blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=3 > blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=2 > blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=1 > blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=0 > > There is no need to set cpu 2 for blk_mq_hw_ctx->queue_num=[3, 1, 0]. I am > afraid this patch would erroneously set cpumask for > blk_mq_hw_ctx->queue_num=[3, > 1, 0]. > > I used to submit the below patch explaining above for removing a cpu and it is > unfortunately not merged yet. > > https://patchwork.kernel.org/patch/10889307/ > > > Another thing is during initialization, the hctx->cpumask should already been > set and even the cpu is offline. Would you please explain the case > hctx->cpumask > is not set correctly, e.g., how to reproduce with a kvm guest running > scsi/virtio/nvme/loop? My scenario is: A kvm guest started with single cpu, during initialization only one cpu was visible by kernel. After boot, I hot-add some cpus via qemu monitor (I believe virsh setvcpus --live can do the same thing), for example: (qemu) cpu-add 1 (qemu) cpu-add 2 (qemu) cpu-add 3 In such scenario, hctx->cpumask doesn't get updated when these cpus are added. > > Dongli Zhang
Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet mail)
On Tue, Jun 25, 2019 at 02:14:46AM +, wenbinzeng(曾文斌) wrote: > Hi Ming, > > > -Original Message- > > From: Ming Lei > > Sent: Tuesday, June 25, 2019 9:55 AM > > To: Wenbin Zeng > > Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com; osan...@fb.com; > > s...@grimberg.me; bvanass...@acm.org; linux-block@vger.kernel.org; > > linux-ker...@vger.kernel.org; wenbinzeng(曾文斌) > > Subject: Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet > > mail) > > > > On Mon, Jun 24, 2019 at 11:24:07PM +0800, Wenbin Zeng wrote: > > > Currently hctx->cpumask is not updated when hot-plugging new cpus, > > > as there are many chances kblockd_mod_delayed_work_on() getting > > > called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run > > > > There are only two cases in which WORK_CPU_UNBOUND is applied: > > > > 1) single hw queue > > > > 2) multiple hw queue, and all CPUs in this hctx become offline > > > > For 1), all CPUs can be found in hctx->cpumask. > > > > > on the newly-plugged cpus, consequently __blk_mq_run_hw_queue() > > > reporting excessive "run queue from wrong CPU" messages because > > > cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false. > > > > The message means CPU hotplug race is triggered. > > > > Yeah, there is big problem in blk_mq_hctx_notify_dead() which is called > > after one CPU is dead, but still run this hw queue to dispatch request, > > and all CPUs in this hctx might become offline. > > > > We have some discussion before on this issue: > > > > https://lore.kernel.org/linux-block/CACVXFVN729SgFQGUgmu1iN7P6Mv5+puE78STz8hj > > 9j5bs82...@mail.gmail.com/ > > > > There is another scenario, you can reproduce it by hot-plugging cpus to kvm > guests via qemu monitor (I believe virsh setvcpus --live can do the same > thing), for example: > (qemu) cpu-add 1 > (qemu) cpu-add 2 > (qemu) cpu-add 3 > > In such scenario, cpu 1, 2 and 3 are not visible at boot, hctx->cpumask > doesn't get synced when these cpus are added. 
That is CPU cold-plug, which we are supposed to support.

The newly added CPUs should already be visible to the hctx, since we
spread queues among all possible CPUs; please see blk_mq_map_queues()
and irq_build_affinity_masks(), which is effectively a static
allocation over CPU resources.

Otherwise, you might be running an old kernel, or there is a bug
somewhere.

>
> > >
> > > This patch added a cpu-hotplug handler into blk-mq, updating
> > > hctx->cpumask at cpu-hotplug.
> >
> > This way isn't correct, hctx->cpumask should be kept as sync with
> > queue mapping.
>
> Please advise what should I do to deal with the above situation? Thanks a lot.

As I shared in the last email, there is one approach discussed, which
seems doable.

Thanks,
Ming
RE: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet mail)
Hi Ming, > -Original Message- > From: Ming Lei > Sent: Tuesday, June 25, 2019 9:55 AM > To: Wenbin Zeng > Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com; osan...@fb.com; > s...@grimberg.me; bvanass...@acm.org; linux-block@vger.kernel.org; > linux-ker...@vger.kernel.org; wenbinzeng(曾文斌) > Subject: Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug(Internet > mail) > > On Mon, Jun 24, 2019 at 11:24:07PM +0800, Wenbin Zeng wrote: > > Currently hctx->cpumask is not updated when hot-plugging new cpus, > > as there are many chances kblockd_mod_delayed_work_on() getting > > called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run > > There are only two cases in which WORK_CPU_UNBOUND is applied: > > 1) single hw queue > > 2) multiple hw queue, and all CPUs in this hctx become offline > > For 1), all CPUs can be found in hctx->cpumask. > > > on the newly-plugged cpus, consequently __blk_mq_run_hw_queue() > > reporting excessive "run queue from wrong CPU" messages because > > cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false. > > The message means CPU hotplug race is triggered. > > Yeah, there is big problem in blk_mq_hctx_notify_dead() which is called > after one CPU is dead, but still run this hw queue to dispatch request, > and all CPUs in this hctx might become offline. > > We have some discussion before on this issue: > > https://lore.kernel.org/linux-block/CACVXFVN729SgFQGUgmu1iN7P6Mv5+puE78STz8hj > 9j5bs82...@mail.gmail.com/ > There is another scenario, you can reproduce it by hot-plugging cpus to kvm guests via qemu monitor (I believe virsh setvcpus --live can do the same thing), for example: (qemu) cpu-add 1 (qemu) cpu-add 2 (qemu) cpu-add 3 In such scenario, cpu 1, 2 and 3 are not visible at boot, hctx->cpumask doesn't get synced when these cpus are added. > > > > This patch added a cpu-hotplug handler into blk-mq, updating > > hctx->cpumask at cpu-hotplug. 
> > This way isn't correct, hctx->cpumask should be kept as sync with > queue mapping. Please advise what should I do to deal with the above situation? Thanks a lot. > > Thanks, > Ming
Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug
On Mon, Jun 24, 2019 at 11:24:07PM +0800, Wenbin Zeng wrote:
> Currently hctx->cpumask is not updated when hot-plugging new cpus,
> as there are many chances kblockd_mod_delayed_work_on() getting
> called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run

There are only two cases in which WORK_CPU_UNBOUND is applied:

1) single hw queue

2) multiple hw queues, and all CPUs in this hctx become offline

For 1), all CPUs can be found in hctx->cpumask.

> on the newly-plugged cpus, consequently __blk_mq_run_hw_queue()
> reporting excessive "run queue from wrong CPU" messages because
> cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false.

The message means a CPU hotplug race is triggered.

Yeah, there is a big problem in blk_mq_hctx_notify_dead(): it is called
after one CPU is dead, but still runs this hw queue to dispatch requests,
and all CPUs in this hctx might become offline.

We have some discussion before on this issue:

https://lore.kernel.org/linux-block/cacvxfvn729sgfqgugmu1in7p6mv5+pue78stz8hj9j5bs82...@mail.gmail.com/

>
> This patch added a cpu-hotplug handler into blk-mq, updating
> hctx->cpumask at cpu-hotplug.

This way isn't correct; hctx->cpumask should be kept in sync with the
queue mapping.

Thanks,
Ming
Re: [PATCH] blk-mq: update hctx->cpumask at cpu-hotplug
Hi Wenbin, On 6/24/19 11:24 PM, Wenbin Zeng wrote: > Currently hctx->cpumask is not updated when hot-plugging new cpus, > as there are many chances kblockd_mod_delayed_work_on() getting > called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run > on the newly-plugged cpus, consequently __blk_mq_run_hw_queue() > reporting excessive "run queue from wrong CPU" messages because > cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false. > > This patch added a cpu-hotplug handler into blk-mq, updating > hctx->cpumask at cpu-hotplug. > > Signed-off-by: Wenbin Zeng > --- > block/blk-mq.c | 29 + > include/linux/blk-mq.h | 1 + > 2 files changed, 30 insertions(+) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index ce0f5f4..2e465fc 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -39,6 +39,8 @@ > #include "blk-mq-sched.h" > #include "blk-rq-qos.h" > > +static enum cpuhp_state cpuhp_blk_mq_online; > + > static void blk_mq_poll_stats_start(struct request_queue *q); > static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb); > > @@ -2215,6 +2217,21 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, > struct blk_mq_tags *tags, > return -ENOMEM; > } > > +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node > *node) > +{ > + struct blk_mq_hw_ctx *hctx; > + > + hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online); > + > + if (!cpumask_test_cpu(cpu, hctx->cpumask)) { > + mutex_lock(&hctx->queue->sysfs_lock); > + cpumask_set_cpu(cpu, hctx->cpumask); > + mutex_unlock(&hctx->queue->sysfs_lock); > + } > + > + return 0; > +} > + As this callback is registered for each hctx, when a cpu is online, it is called for each hctx. Just taking a 4-queue nvme as example (regardless about other block like loop). Suppose cpu=2 (out of 0, 1, 2 and 3) is offline. 
When we online cpu=2, blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=3 blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=2 blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=1 blk_mq_hctx_notify_online() called: cpu=2 and blk_mq_hw_ctx->queue_num=0 There is no need to set cpu 2 for blk_mq_hw_ctx->queue_num=[3, 1, 0]. I am afraid this patch would erroneously set cpumask for blk_mq_hw_ctx->queue_num=[3, 1, 0]. I used to submit the below patch explaining above for removing a cpu and it is unfortunately not merged yet. https://patchwork.kernel.org/patch/10889307/ Another thing is during initialization, the hctx->cpumask should already been set and even the cpu is offline. Would you please explain the case hctx->cpumask is not set correctly, e.g., how to reproduce with a kvm guest running scsi/virtio/nvme/loop? Dongli Zhang
[PATCH] blk-mq: update hctx->cpumask at cpu-hotplug
Currently hctx->cpumask is not updated when hot-plugging new cpus,
as there are many chances kblockd_mod_delayed_work_on() getting
called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run
on the newly-plugged cpus, consequently __blk_mq_run_hw_queue()
reporting excessive "run queue from wrong CPU" messages because
cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) returns false.

This patch added a cpu-hotplug handler into blk-mq, updating
hctx->cpumask at cpu-hotplug.

Signed-off-by: Wenbin Zeng
---
 block/blk-mq.c         | 29 +
 include/linux/blk-mq.h |  1 +
 2 files changed, 30 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ce0f5f4..2e465fc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -39,6 +39,8 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
+static enum cpuhp_state cpuhp_blk_mq_online;
+
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
@@ -2215,6 +2217,21 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask)) {
+		mutex_lock(&hctx->queue->sysfs_lock);
+		cpumask_set_cpu(cpu, hctx->cpumask);
+		mutex_unlock(&hctx->queue->sysfs_lock);
+	}
+
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2251,6 +2268,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+	if (cpuhp_blk_mq_online > 0)
+		cpuhp_state_remove_instance_nocalls(cpuhp_blk_mq_online,
+						    &hctx->cpuhp_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
 }
@@ -2310,6 +2330,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	hctx->queue_num = hctx_idx;
 
+	if (cpuhp_blk_mq_online > 0)
+		cpuhp_state_add_instance_nocalls(cpuhp_blk_mq_online,
+						 &hctx->cpuhp_online);
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
@@ -3544,6 +3567,12 @@ unsigned int blk_mq_rq_cpu(struct request *rq)
 
 static int __init blk_mq_init(void)
 {
+	cpuhp_blk_mq_online = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
+						      "block/mq:online",
+						      blk_mq_hctx_notify_online,
+						      NULL);
+	if (cpuhp_blk_mq_online <= 0)
+		pr_warn("blk_mq_init: failed to setup cpu online callback\n");
+
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
 	return 0;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 15d1aa5..5241659 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -58,6 +58,7 @@ struct blk_mq_hw_ctx {
 	atomic_t		nr_active;
 
+	struct hlist_node	cpuhp_online;
 	struct hlist_node	cpuhp_dead;
 
 	struct kobject		kobj;
--
1.8.3.1
Re: Hotplug
On 4/11/18 10:14 AM, Jan Kara wrote: > Hello, > > On Wed 11-04-18 08:11:13, Jens Axboe wrote: >> On 4/11/18 7:58 AM, Jan Kara wrote: >>> On Tue 10-04-18 11:17:46, Jens Axboe wrote: >>>> Been running some tests and I keep running into issues with hotplug. >>>> This looks similar to what Bart posted the other day, but it looks >>>> more deeply rooted than just having to protect the queue in >>>> generic_make_request_checks(). The test run is blktests, >>>> block/001. Current -git doesn't survive it. I've seen at least two >>>> different oopses, pasted below. >>>> >>>> [ 102.163442] NULL pointer dereference at 0010 >>>> [ 102.163444] PGD 0 P4D 0 >>>> [ 102.163447] Oops: [#1] PREEMPT SMP >>>> [ 102.163449] Modules linked in: >>>> [ 102.175540] sr 12:0:0:0: [sr2] scsi-1 drive >>>> [ 102.180112] scsi_debug crc_t10dif crct10dif_generic crct10dif_common >>>> nvme nvme_core sb_edac xl >>>> [ 102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2 >>>> [ 102.191896] sr_mod cdrom btrfs xor zstd_decompress zstd_compress >>>> xxhash lzo_compress zlib_defc >>>> [ 102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5 >>>> [ 102.203475] igb ahci libahci i2c_algo_bit libata dca [last unloaded: >>>> crc_t10dif] >>>> [ 102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ >>>> #650 >>>> [ 102.203487] Hardware name: Dell Inc. 
PowerEdge T630/0NT78X, BIOS 2.3.4 >>>> 11/09/2016 >>>> [ 102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod] >>>> [ 102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292 >>>> [ 102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: >>>> 883ff357bb66 >>>> [ 102.373220] RDX: 0003 RSI: 7530 RDI: >>>> 881fea631000 >>>> [ 102.381705] RBP: R08: 881fe4d38400 R09: >>>> >>>> [ 102.390185] R10: R11: 01b6 R12: >>>> 085d >>>> [ 102.398671] R13: 085d R14: 883ffd9b3790 R15: >>>> >>>> [ 102.407156] FS: 7f7dc8e6d8c0() GS:883fff34() >>>> knlGS: >>>> [ 102.417138] CS: 0010 DS: ES: CR0: 80050033 >>>> [ 102.424066] CR2: 0010 CR3: 003ffda98005 CR4: >>>> 003606e0 >>>> [ 102.432545] DR0: DR1: DR2: >>>> >>>> [ 102.441024] DR3: DR6: fffe0ff0 DR7: >>>> 0400 >>>> [ 102.449502] Call Trace: >>>> [ 102.452744] ? __invalidate_device+0x48/0x60 >>>> [ 102.458022] check_disk_change+0x4c/0x60 >>>> [ 102.462900] sr_block_open+0x16/0xd0 [sr_mod] >>>> [ 102.468270] __blkdev_get+0xb9/0x450 >>>> [ 102.472774] ? iget5_locked+0x1c0/0x1e0 >>>> [ 102.477568] blkdev_get+0x11e/0x320 >>>> [ 102.481969] ? bdget+0x11d/0x150 >>>> [ 102.486083] ? _raw_spin_unlock+0xa/0x20 >>>> [ 102.490968] ? bd_acquire+0xc0/0xc0 >>>> [ 102.495368] do_dentry_open+0x1b0/0x320 >>>> [ 102.500159] ? inode_permission+0x24/0xc0 >>>> [ 102.505140] path_openat+0x4e6/0x1420 >>>> [ 102.509741] ? cpumask_any_but+0x1f/0x40 >>>> [ 102.514630] ? flush_tlb_mm_range+0xa0/0x120 >>>> [ 102.519903] do_filp_open+0x8c/0xf0 >>>> [ 102.524305] ? __seccomp_filter+0x28/0x230 >>>> [ 102.529389] ? _raw_spin_unlock+0xa/0x20 >>>> [ 102.534283] ? __handle_mm_fault+0x7d6/0x9b0 >>>> [ 102.539559] ? list_lru_add+0xa8/0xc0 >>>> [ 102.544157] ? _raw_spin_unlock+0xa/0x20 >>>> [ 102.549047] ? __alloc_fd+0xaf/0x160 >>>> [ 102.553549] ? do_sys_open+0x1a6/0x230 >>>> [ 102.558244] do_sys_open+0x1a6/0x230 >>>> [ 102.562742] do_syscall_64+0x5a/0x100 >>>> [ 102.567336] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 >>> >>> Interesting. 
Thinking out loud: This is cd->device dereference I guess >>> which means disk->private_data was NULL. That gets set in sr_probe() >>> together with disk->fops which are certainly set as they must have led us >>> to the crashing function sr_block_revalidate_disk(). So likely >>> disk->private_data got already cleared.
Re: Hotplug
Hello, On Wed 11-04-18 08:11:13, Jens Axboe wrote: > On 4/11/18 7:58 AM, Jan Kara wrote: > > On Tue 10-04-18 11:17:46, Jens Axboe wrote: > >> Been running some tests and I keep running into issues with hotplug. > >> This looks similar to what Bart posted the other day, but it looks > >> more deeply rooted than just having to protect the queue in > >> generic_make_request_checks(). The test run is blktests, > >> block/001. Current -git doesn't survive it. I've seen at least two > >> different oopses, pasted below. > >> > >> [ 102.163442] NULL pointer dereference at 0010 > >> [ 102.163444] PGD 0 P4D 0 > >> [ 102.163447] Oops: [#1] PREEMPT SMP > >> [ 102.163449] Modules linked in: > >> [ 102.175540] sr 12:0:0:0: [sr2] scsi-1 drive > >> [ 102.180112] scsi_debug crc_t10dif crct10dif_generic crct10dif_common > >> nvme nvme_core sb_edac xl > >> [ 102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2 > >> [ 102.191896] sr_mod cdrom btrfs xor zstd_decompress zstd_compress > >> xxhash lzo_compress zlib_defc > >> [ 102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5 > >> [ 102.203475] igb ahci libahci i2c_algo_bit libata dca [last unloaded: > >> crc_t10dif] > >> [ 102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ > >> #650 > >> [ 102.203487] Hardware name: Dell Inc. 
PowerEdge T630/0NT78X, BIOS 2.3.4 > >> 11/09/2016 > >> [ 102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod] > >> [ 102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292 > >> [ 102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: > >> 883ff357bb66 > >> [ 102.373220] RDX: 0003 RSI: 7530 RDI: > >> 881fea631000 > >> [ 102.381705] RBP: R08: 881fe4d38400 R09: > >> > >> [ 102.390185] R10: R11: 01b6 R12: > >> 085d > >> [ 102.398671] R13: 085d R14: 883ffd9b3790 R15: > >> > >> [ 102.407156] FS: 7f7dc8e6d8c0() GS:883fff34() > >> knlGS: > >> [ 102.417138] CS: 0010 DS: ES: CR0: 80050033 > >> [ 102.424066] CR2: 0010 CR3: 003ffda98005 CR4: > >> 003606e0 > >> [ 102.432545] DR0: DR1: DR2: > >> > >> [ 102.441024] DR3: DR6: fffe0ff0 DR7: > >> 0400 > >> [ 102.449502] Call Trace: > >> [ 102.452744] ? __invalidate_device+0x48/0x60 > >> [ 102.458022] check_disk_change+0x4c/0x60 > >> [ 102.462900] sr_block_open+0x16/0xd0 [sr_mod] > >> [ 102.468270] __blkdev_get+0xb9/0x450 > >> [ 102.472774] ? iget5_locked+0x1c0/0x1e0 > >> [ 102.477568] blkdev_get+0x11e/0x320 > >> [ 102.481969] ? bdget+0x11d/0x150 > >> [ 102.486083] ? _raw_spin_unlock+0xa/0x20 > >> [ 102.490968] ? bd_acquire+0xc0/0xc0 > >> [ 102.495368] do_dentry_open+0x1b0/0x320 > >> [ 102.500159] ? inode_permission+0x24/0xc0 > >> [ 102.505140] path_openat+0x4e6/0x1420 > >> [ 102.509741] ? cpumask_any_but+0x1f/0x40 > >> [ 102.514630] ? flush_tlb_mm_range+0xa0/0x120 > >> [ 102.519903] do_filp_open+0x8c/0xf0 > >> [ 102.524305] ? __seccomp_filter+0x28/0x230 > >> [ 102.529389] ? _raw_spin_unlock+0xa/0x20 > >> [ 102.534283] ? __handle_mm_fault+0x7d6/0x9b0 > >> [ 102.539559] ? list_lru_add+0xa8/0xc0 > >> [ 102.544157] ? _raw_spin_unlock+0xa/0x20 > >> [ 102.549047] ? __alloc_fd+0xaf/0x160 > >> [ 102.553549] ? do_sys_open+0x1a6/0x230 > >> [ 102.558244] do_sys_open+0x1a6/0x230 > >> [ 102.562742] do_syscall_64+0x5a/0x100 > >> [ 102.567336] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > > > > Interesting. 
Thinking out loud: This is cd->device dereference I guess > > which means disk->private_data was NULL. That gets set in sr_probe() > > together with disk->fops which are certainly set as they must have led us > > to the crashing function sr_block_revalidate_disk(). So likely > > disk->private_data got already cleared. That happens in sr_kref_release() > > and the fact that that function got called means struct scsi_cd went away - > > so sr_remove() must have been called as well. That all seems possible like: > > > > CPU1CPU2 > >
Re: Hotplug
On 4/11/18 7:58 AM, Jan Kara wrote: > Hi, > > On Tue 10-04-18 11:17:46, Jens Axboe wrote: >> Been running some tests and I keep running into issues with hotplug. >> This looks similar to what Bart posted the other day, but it looks >> more deeply rooted than just having to protect the queue in >> generic_make_request_checks(). The test run is blktests, >> block/001. Current -git doesn't survive it. I've seen at least two >> different oopses, pasted below. >> >> [ 102.163442] NULL pointer dereference at 0010 >> [ 102.163444] PGD 0 P4D 0 >> [ 102.163447] Oops: [#1] PREEMPT SMP >> [ 102.163449] Modules linked in: >> [ 102.175540] sr 12:0:0:0: [sr2] scsi-1 drive >> [ 102.180112] scsi_debug crc_t10dif crct10dif_generic crct10dif_common >> nvme nvme_core sb_edac xl >> [ 102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2 >> [ 102.191896] sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash >> lzo_compress zlib_defc >> [ 102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5 >> [ 102.203475] igb ahci libahci i2c_algo_bit libata dca [last unloaded: >> crc_t10dif] >> [ 102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650 >> [ 102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 >> 11/09/2016 >> [ 102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod] >> [ 102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292 >> [ 102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: >> 883ff357bb66 >> [ 102.373220] RDX: 0003 RSI: 7530 RDI: >> 881fea631000 >> [ 102.381705] RBP: R08: 881fe4d38400 R09: >> >> [ 102.390185] R10: R11: 01b6 R12: >> 085d >> [ 102.398671] R13: 085d R14: 883ffd9b3790 R15: >> >> [ 102.407156] FS: 7f7dc8e6d8c0() GS:883fff34() >> knlGS: >> [ 102.417138] CS: 0010 DS: ES: CR0: 80050033 >> [ 102.424066] CR2: 0010 CR3: 003ffda98005 CR4: >> 003606e0 >> [ 102.432545] DR0: DR1: DR2: >> >> [ 102.441024] DR3: DR6: fffe0ff0 DR7: >> 0400 >> [ 102.449502] Call Trace: >> [ 102.452744] ? 
__invalidate_device+0x48/0x60 >> [ 102.458022] check_disk_change+0x4c/0x60 >> [ 102.462900] sr_block_open+0x16/0xd0 [sr_mod] >> [ 102.468270] __blkdev_get+0xb9/0x450 >> [ 102.472774] ? iget5_locked+0x1c0/0x1e0 >> [ 102.477568] blkdev_get+0x11e/0x320 >> [ 102.481969] ? bdget+0x11d/0x150 >> [ 102.486083] ? _raw_spin_unlock+0xa/0x20 >> [ 102.490968] ? bd_acquire+0xc0/0xc0 >> [ 102.495368] do_dentry_open+0x1b0/0x320 >> [ 102.500159] ? inode_permission+0x24/0xc0 >> [ 102.505140] path_openat+0x4e6/0x1420 >> [ 102.509741] ? cpumask_any_but+0x1f/0x40 >> [ 102.514630] ? flush_tlb_mm_range+0xa0/0x120 >> [ 102.519903] do_filp_open+0x8c/0xf0 >> [ 102.524305] ? __seccomp_filter+0x28/0x230 >> [ 102.529389] ? _raw_spin_unlock+0xa/0x20 >> [ 102.534283] ? __handle_mm_fault+0x7d6/0x9b0 >> [ 102.539559] ? list_lru_add+0xa8/0xc0 >> [ 102.544157] ? _raw_spin_unlock+0xa/0x20 >> [ 102.549047] ? __alloc_fd+0xaf/0x160 >> [ 102.553549] ? do_sys_open+0x1a6/0x230 >> [ 102.558244] do_sys_open+0x1a6/0x230 >> [ 102.562742] do_syscall_64+0x5a/0x100 >> [ 102.567336] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > > Interesting. Thinking out loud: This is cd->device dereference I guess > which means disk->private_data was NULL. That gets set in sr_probe() > together with disk->fops which are certainly set as they must have led us > to the crashing function sr_block_revalidate_disk(). So likely > disk->private_data got already cleared. That happens in sr_kref_release() > and the fact that that function got called means struct scsi_cd went away - > so sr_remove() must have been called as well. That all seems possible like: > > CPU1 CPU2 > sr_probe() > __blkdev_get() > disk = bdev_get_gendisk(); > > sr_remove() > del_gendisk() > ... > kref_put(&cd->kref, sr_kref_release); > disk->private_data = NULL; > put_disk(disk); > kfree(cd); > if (disk->fops->open) { > ret = disk->fops->open(bdev, mode); => sr_block_open > check_disk_change(bdev); >
Re: Hotplug
On 4/11/18 8:06 AM, Bart Van Assche wrote: > On Wed, 2018-04-11 at 15:58 +0200, Jan Kara wrote: >> I'm not really sure where this is crashing and 'Code' line is incomplete to >> tell me. > > Hello Jan, > > The following patch should fix this crash: > https://www.mail-archive.com/linux-block@vger.kernel.org/msg20209.html. Yeah, I forgot the link in my reply, thanks. -- Jens Axboe
Re: Hotplug
On Wed, 2018-04-11 at 15:58 +0200, Jan Kara wrote: > I'm not really sure where this is crashing and 'Code' line is incomplete to > tell me. Hello Jan, The following patch should fix this crash: https://www.mail-archive.com/linux-block@vger.kernel.org/msg20209.html. Thanks, Bart.
Re: Hotplug
Hi, On Tue 10-04-18 11:17:46, Jens Axboe wrote: > Been running some tests and I keep running into issues with hotplug. > This looks similar to what Bart posted the other day, but it looks > more deeply rooted than just having to protect the queue in > generic_make_request_checks(). The test run is blktests, > block/001. Current -git doesn't survive it. I've seen at least two > different oopses, pasted below. > > [ 102.163442] NULL pointer dereference at 0010 > [ 102.163444] PGD 0 P4D 0 > [ 102.163447] Oops: [#1] PREEMPT SMP > [ 102.163449] Modules linked in: > [ 102.175540] sr 12:0:0:0: [sr2] scsi-1 drive > [ 102.180112] scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme > nvme_core sb_edac xl > [ 102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2 > [ 102.191896] sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash > lzo_compress zlib_defc > [ 102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5 > [ 102.203475] igb ahci libahci i2c_algo_bit libata dca [last unloaded: > crc_t10dif] > [ 102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650 > [ 102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 > 11/09/2016 > [ 102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod] > [ 102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292 > [ 102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: > 883ff357bb66 > [ 102.373220] RDX: 0003 RSI: 7530 RDI: > 881fea631000 > [ 102.381705] RBP: R08: 881fe4d38400 R09: > > [ 102.390185] R10: R11: 01b6 R12: > 085d > [ 102.398671] R13: 085d R14: 883ffd9b3790 R15: > > [ 102.407156] FS: 7f7dc8e6d8c0() GS:883fff34() > knlGS: > [ 102.417138] CS: 0010 DS: ES: CR0: 80050033 > [ 102.424066] CR2: 0010 CR3: 003ffda98005 CR4: > 003606e0 > [ 102.432545] DR0: DR1: DR2: > > [ 102.441024] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 102.449502] Call Trace: > [ 102.452744] ? 
__invalidate_device+0x48/0x60 > [ 102.458022] check_disk_change+0x4c/0x60 > [ 102.462900] sr_block_open+0x16/0xd0 [sr_mod] > [ 102.468270] __blkdev_get+0xb9/0x450 > [ 102.472774] ? iget5_locked+0x1c0/0x1e0 > [ 102.477568] blkdev_get+0x11e/0x320 > [ 102.481969] ? bdget+0x11d/0x150 > [ 102.486083] ? _raw_spin_unlock+0xa/0x20 > [ 102.490968] ? bd_acquire+0xc0/0xc0 > [ 102.495368] do_dentry_open+0x1b0/0x320 > [ 102.500159] ? inode_permission+0x24/0xc0 > [ 102.505140] path_openat+0x4e6/0x1420 > [ 102.509741] ? cpumask_any_but+0x1f/0x40 > [ 102.514630] ? flush_tlb_mm_range+0xa0/0x120 > [ 102.519903] do_filp_open+0x8c/0xf0 > [ 102.524305] ? __seccomp_filter+0x28/0x230 > [ 102.529389] ? _raw_spin_unlock+0xa/0x20 > [ 102.534283] ? __handle_mm_fault+0x7d6/0x9b0 > [ 102.539559] ? list_lru_add+0xa8/0xc0 > [ 102.544157] ? _raw_spin_unlock+0xa/0x20 > [ 102.549047] ? __alloc_fd+0xaf/0x160 > [ 102.553549] ? do_sys_open+0x1a6/0x230 > [ 102.558244] do_sys_open+0x1a6/0x230 > [ 102.562742] do_syscall_64+0x5a/0x100 > [ 102.567336] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Interesting. Thinking out loud: This is cd->device dereference I guess which means disk->private_data was NULL. That gets set in sr_probe() together with disk->fops which are certainly set as they must have led us to the crashing function sr_block_revalidate_disk(). So likely disk->private_data got already cleared. That happens in sr_kref_release() and the fact that that function got called means struct scsi_cd went away - so sr_remove() must have been called as well. That all seems possible like: CPU1CPU2 sr_probe() __blkdev_get() disk = bdev_get_gendisk(); sr_remove() del_gendisk() ... 
kref_put(&cd->kref, sr_kref_release); disk->private_data = NULL; put_disk(disk); kfree(cd); if (disk->fops->open) { ret = disk->fops->open(bdev, mode); => sr_block_open check_disk_change(bdev); sr_block_revalidate_disk() CRASH And I think the problem is in sr_block_revalidate_disk() itself, since scsi_cd() does not guarantee that the caller holds a reference to 'cd', and thus 'cd' can disappear under it. IMHO it needs to use scsi_cd_get() to get struct scsi_cd from gendisk. Am I missing something? > and this one, similar to Barts: > > [ 4676.73806
Hotplug
Hi, Been running some tests and I keep running into issues with hotplug. This looks similar to what Bart posted the other day, but it looks more deeply rooted than just having to protect the queue in generic_make_request_checks(). The test run is blktests, block/001. Current -git doesn't survive it. I've seen at least two different oopses, pasted below. [ 102.163442] NULL pointer dereference at 0010 [ 102.163444] PGD 0 P4D 0 [ 102.163447] Oops: [#1] PREEMPT SMP [ 102.163449] Modules linked in: [ 102.175540] sr 12:0:0:0: [sr2] scsi-1 drive [ 102.180112] scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme nvme_core sb_edac xl [ 102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2 [ 102.191896] sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash lzo_compress zlib_defc [ 102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5 [ 102.203475] igb ahci libahci i2c_algo_bit libata dca [last unloaded: crc_t10dif] [ 102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650 [ 102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016 [ 102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod] [ 102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292 [ 102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: 883ff357bb66 [ 102.373220] RDX: 0003 RSI: 7530 RDI: 881fea631000 [ 102.381705] RBP: R08: 881fe4d38400 R09: [ 102.390185] R10: R11: 01b6 R12: 085d [ 102.398671] R13: 085d R14: 883ffd9b3790 R15: [ 102.407156] FS: 7f7dc8e6d8c0() GS:883fff34() knlGS: [ 102.417138] CS: 0010 DS: ES: CR0: 80050033 [ 102.424066] CR2: 0010 CR3: 003ffda98005 CR4: 003606e0 [ 102.432545] DR0: DR1: DR2: [ 102.441024] DR3: DR6: fffe0ff0 DR7: 0400 [ 102.449502] Call Trace: [ 102.452744] ? __invalidate_device+0x48/0x60 [ 102.458022] check_disk_change+0x4c/0x60 [ 102.462900] sr_block_open+0x16/0xd0 [sr_mod] [ 102.468270] __blkdev_get+0xb9/0x450 [ 102.472774] ? iget5_locked+0x1c0/0x1e0 [ 102.477568] blkdev_get+0x11e/0x320 [ 102.481969] ? 
bdget+0x11d/0x150 [ 102.486083] ? _raw_spin_unlock+0xa/0x20 [ 102.490968] ? bd_acquire+0xc0/0xc0 [ 102.495368] do_dentry_open+0x1b0/0x320 [ 102.500159] ? inode_permission+0x24/0xc0 [ 102.505140] path_openat+0x4e6/0x1420 [ 102.509741] ? cpumask_any_but+0x1f/0x40 [ 102.514630] ? flush_tlb_mm_range+0xa0/0x120 [ 102.519903] do_filp_open+0x8c/0xf0 [ 102.524305] ? __seccomp_filter+0x28/0x230 [ 102.529389] ? _raw_spin_unlock+0xa/0x20 [ 102.534283] ? __handle_mm_fault+0x7d6/0x9b0 [ 102.539559] ? list_lru_add+0xa8/0xc0 [ 102.544157] ? _raw_spin_unlock+0xa/0x20 [ 102.549047] ? __alloc_fd+0xaf/0x160 [ 102.553549] ? do_sys_open+0x1a6/0x230 [ 102.558244] do_sys_open+0x1a6/0x230 [ 102.562742] do_syscall_64+0x5a/0x100 [ 102.567336] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 and this one, similar to Barts: [ 4676.738069] NULL pointer dereference at 0154 [ 4676.738071] PGD 0 P4D 0 [ 4676.738073] Oops: [#1] PREEMPT SMP [ 4676.738075] Modules linked in: scsi_debug crc_t10dif nvme nvme_core loop configfs crct10dif_ge] [ 4676.859272] CPU: 10 PID: 16598 Comm: systemd-udevd Not tainted 4.16.0+ #651 [ 4676.867525] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016 [ 4676.876765] RIP: 0010:blk_throtl_bio+0x45/0x9b0 [ 4676.882296] RSP: 0018:881ff0c8bb38 EFLAGS: 00010246 [ 4676.888610] RAX: RBX: 881ffa273a40 RCX: [ 4676.897059] RDX: 881ffa273a40 RSI: RDI: [ 4676.905507] RBP: R08: 881ffa273ac0 R09: 881ff123f458 [ 4676.913955] R10: 881ff0c8bc70 R11: 1000 R12: [ 4676.922402] R13: 82600980 R14: 88208113 R15: [ 4676.930856] FS: 7fe63e5228c0() GS:881fff74() knlGS: [ 4676.940773] CS: 0010 DS: ES: CR0: 80050033 [ 4676.947667] CR2: 0154 CR3: 001fed2a0003 CR4: 003606e0 [ 4676.956116] DR0: DR1: DR2: [ 4676.964568] DR3: DR6: fffe0ff0 DR7: 0400 [ 4676.973021] Call Trace: [ 4676.976229] generic_make_request_checks+0x640/0x660 [ 4676.982245] ? kmem_cache_alloc+0x37/0x190 [ 4676.987295] generic_make_request+0x29/0x2f0 [ 4676.992534] ? submit_bio+0x5c/0x110 [ 4676.996998] ? 
bio_alloc_bioset+0x99/0x1c0 [ 4677.002050] submit_bio+0x5c/0x110 [ 4677.006317] ? guard_bio_eod+0x42/0x120 [ 4677.011074] submit_bh_wbc.isra.49+0x132/0x160 [ 4677.016517] ? bh_uptodate_or_lock+0x90/0x90 [ 4
Re: blk-mq and CPU hotplug error
Hi Jose On 03/27/2018 05:25 AM, jos...@linux.vnet.ibm.com wrote: > Hello everyone! > > I'm running Ubuntu 18.04 (4.15.0-12-generic) in KVM/QEMU (powerpc64le). > Everything looks good until I try to hotplug CPUs in my VM. As soon as I > do that I get the following error in my VM dmesg: Please refer to the following commits: 20e4d813 blk-mq: simplify queue mapping & schedule with each possisble CPU 84676c1 genirq/affinity: assign vectors to all possible CPUs Thanks Jianchao
blk-mq and CPU hotplug error
Hello everyone! I'm running Ubuntu 18.04 (4.15.0-12-generic) in KVM/QEMU (powerpc64le). Everything looks good until I try to hotplug CPUs in my VM. As soon as I do that I get the following error in my VM dmesg: [ 763.629425] WARNING: CPU: 34 PID: 2276 at /build/linux-zBVy54/linux-4.15.0/block/blk-mq.c:1211 __blk_mq_run_hw_queue+0xf0/0x140 ... [ 763.629497] CPU: 34 PID: 2276 Comm: kworker/34:1H Not tainted 4.15.0-12-generic #13-Ubuntu [ 763.629501] Workqueue: kblockd blk_mq_run_work_fn See that blk-mq complains about CPU 34. This CPU is one of the hotplugged CPUs. Looking at mq/0/cpu_list, I see the my new CPUs aren't in that list: # cat /sys/block/vda/mq/0/cpu_list 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 But the real problem comes after it, if I try to run any command that performs intensive IO operations, like: # dd if=/dev/random of=test.img bs=1M count=100 I start to get errors like: [ 967.661543] INFO: task kworker/u128:3:454 blocked for more than 120 seconds. [ 967.661656] Tainted: GW4.15.0-12-generic #13-Ubuntu [ 967.661705] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 967.661762] kworker/u128:3 D0 454 2 0x0800 [ 967.661802] Workqueue: writeback wb_workfn (flush-252:0) [ 967.661805] Call Trace: [ 967.661808] [c00267e5f1d0] [0022] 0x22 (unreliable) [ 967.661813] [c00267e5f3a0] [c001c220] __switch_to+0x2a0/0x4d0 [ 967.661823] [c00267e5f400] [c0d035e4] __schedule+0x2a4/0xaf0 [ 967.661825] [c00267e5f4d0] [c0d03e70] schedule+0x40/0xc0 [ 967.661834] [c00267e5f4f0] [c014f77c] io_schedule+0x2c/0x50 [ 967.661838] [c00267e5f520] [c02d810c] __lock_page+0x14c/0x1c0 [ 967.661842] [c00267e5f5c0] [c04c9288] mpage_prepare_extent_to_map+0x2c8/0x3d0 [ 967.661844] [c00267e5f6c0] [c04d16bc] ext4_writepages+0x49c/0xfd0 [ 967.661847] [c00267e5f870] [c02f1a9c] do_writepages+0x4c/0x130 [ 967.661849] [c00267e5f8e0] [c041a370] __writeback_single_inode+0x70/0x570 [ 967.661852] [c00267e5f940] [c041af0c] writeback_sb_inodes+0x26c/0x5d0 [ 967.661854] [c00267e5fa50] [c041b308] __writeback_inodes_wb+0x98/0x110 [ 967.661856] [c00267e5fab0] [c041b72c] wb_writeback+0x29c/0x460 [ 967.661859] [c00267e5fb80] [c041c38c] wb_workfn+0x1ec/0x600 [ 967.661862] [c00267e5fc90] [c0131538] process_one_work+0x298/0x5a0 [ 967.661864] [c00267e5fd20] [c01318d8] worker_thread+0x98/0x630 [ 967.661866] [c00267e5fdc0] [c013a508] kthread+0x1a8/0x1b0 [ 967.661869] [c00267e5fe30] [c000b528] ret_from_kernel_thread+0x5c/0xb4 [ 967.661872] INFO: task jbd2/vda2-8:579 blocked for more than 120 seconds. [ 967.661924] Tainted: GW4.15.0-12-generic #13-Ubuntu [ 967.661973] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 967.662035] jbd2/vda2-8 D0 579 2 0x0800 [ 967.662037] Call Trace: ...snip Which leads to an unresponsive VM that forces me to reboot it. After searching the internet I changed the IO scheduler: # cat /sys/block/vda/queue/scheduler [none] # modprobe bfq # echo bfq > /sys/block/vda/queue/scheduler # cat /sys/block/vda/queue/scheduler [bfq] none which seems to be enough (it still throws that WARNING, but no hung_task_timeout_secs so far). 
Now I'm not sure how to attack the problem. At first I thought that the CPU hotplug IRQ should also update the cpu_list, but now I think the problem could be in the queue: maybe it should give up scheduling tasks on the failing CPU and try another one. What do you people think? Any suggestion on how to debug it is welcome as well. Thank you! Jose R. Ziviani
Re: [PATCH V2 0/2] blk-mq: two patches related with CPU hotplug
On 1/17/18 9:41 AM, Ming Lei wrote: > Hi Jens, > > The 1st one fixes one regression caused by 20e4d81393 ("blk-mq: simplify > queue mapping & schedule with each possisble CPU"). > > The 2nd one add comments for current two races, and convert the WARN_ON > into printk(KERN_WARN) with dump_stack() since this warning is harmless. > > V2: > - fix comment and commit log in patch 1 > - use dump_stack() with printk to replace WARN_ON() in patch 2 Applied, thanks. -- Jens Axboe
[PATCH V2 0/2] blk-mq: two patches related with CPU hotplug
Hi Jens, The 1st one fixes one regression caused by 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU"). The 2nd one adds comments for the current two races, and converts the WARN_ON into printk(KERN_WARN) with dump_stack() since this warning is harmless. V2: - fix comment and commit log in patch 1 - use dump_stack() with printk to replace WARN_ON() in patch 2 Thanks, Ming Lei (2): blk-mq: make sure hctx->next_cpu is set correctly blk-mq: avoid one WARN_ON in __blk_mq_run_hw_queue to printk block/blk-mq.c | 52 1 file changed, 48 insertions(+), 4 deletions(-) -- 2.9.5
[PATCH 0/2] blk-mq: two patches related with CPU hotplug
Hi Jens, The 1st one fixes one regression caused by 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU"). The 2nd one adds comments about the current race, and converts the WARN_ON into printk(KERN_WARN) since this warning is harmless. Thanks, Ming Lei (2): blk-mq: make sure hctx->next_cpu is set correctly blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk block/blk-mq.c | 51 +++ 1 file changed, 47 insertions(+), 4 deletions(-) -- 2.9.5
Re: [PATCH 0/2] blk-mq: support physical CPU hotplug
On 1/11/18 7:53 PM, Ming Lei wrote: > Hi, > > This two patches support physical CPU hotplug, so that we can make blk-mq > scale well when new physical CPU is added or removed, and this use case > is normal for VM world. > > Also this patchset fixes the following warning reported by Christian > Borntraeger: > > https://marc.info/?l=linux-block&m=151092973417143&w=2 > > Christoph Hellwig (2): > genirq/affinity: assign vectors to all possible CPUs > blk-mq: simplify queue mapping & schedule with each possisble CPU Applied, thanks Ming. -- Jens Axboe
Re: [PATCH 0/2] blk-mq: support physical CPU hotplug
On Fri, Jan 12, 2018 at 09:12:24AM +0100, Christian Borntraeger wrote: > I think we also need cc stable for this. The bug was introduced with > > commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream. > ("blk-mq: Create hctx for each present CPU") > > and that was even backported into 4.12 stable. Yes, and Ming (or Jens?), can you please add a Fixes tag as well? It helps with identifying patches which fix backports. Thanks, Johannes -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
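For reference, the conventional commit-message trailers Johannes and Christian are asking for look like the following; the hash and title are the ones Christian identified, and the version annotation follows from his note that the bug was backported into 4.12 stable:

```
Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
Cc: stable@vger.kernel.org # 4.12+
```

The Fixes: tag lets the stable maintainers' tooling match this fix against trees carrying the offending commit, which is exactly the backport-identification problem Johannes mentions.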
Re: [PATCH 0/2] blk-mq: support physical CPU hotplug
I think we also need cc stable for this. The bug was introduced with commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream. ("blk-mq: Create hctx for each present CPU") and that was even backported into 4.12 stable. On 01/12/2018 03:53 AM, Ming Lei wrote: > Hi, > > This two patches support physical CPU hotplug, so that we can make blk-mq > scale well when new physical CPU is added or removed, and this use case > is normal for VM world. > > Also this patchset fixes the following warning reported by Christian > Borntraeger: > > https://marc.info/?l=linux-block&m=151092973417143&w=2 > > Christoph Hellwig (2): > genirq/affinity: assign vectors to all possible CPUs > blk-mq: simplify queue mapping & schedule with each possisble CPU > > block/blk-mq.c| 19 --- > kernel/irq/affinity.c | 30 +++--- > 2 files changed, 23 insertions(+), 26 deletions(-) >
[PATCH 0/2] blk-mq: support physical CPU hotplug
Hi, These two patches support physical CPU hotplug, so that blk-mq can scale well when a new physical CPU is added or removed; this use case is common in the VM world. Also this patchset fixes the following warning reported by Christian Borntraeger: https://marc.info/?l=linux-block&m=151092973417143&w=2 Christoph Hellwig (2): genirq/affinity: assign vectors to all possible CPUs blk-mq: simplify queue mapping & schedule with each possisble CPU block/blk-mq.c | 19 --- kernel/irq/affinity.c | 30 +++--- 2 files changed, 23 insertions(+), 26 deletions(-) -- 2.9.5
Re: [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events
On Sat, 3 Jun 2017, Christoph Hellwig wrote: > +static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc, > + unsigned int cpu) > +{ > + const struct cpumask *affinity; > + struct irq_data *data; > + struct irq_chip *chip; > + unsigned long flags; > + cpumask_var_t mask; > + > + if (!desc) > + return; > + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) > + return; Why do you want to do that for each irq descriptor? That's outright silly. Either you allocate it once in irq_affinity_[on|off]line_cpu() or just make it a static cpumask. Lemme fix that up. Thanks, tglx
Re: [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events
On Sat, 3 Jun 2017, Christoph Hellwig wrote: > +static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc, > + unsigned int cpu) > +{ > + > + cpumask_and(mask, affinity, cpu_online_mask); > + cpumask_set_cpu(cpu, mask); > + if (irqd_has_set(data, IRQD_AFFINITY_SUSPENDED)) { > + irq_startup(desc, false); > + irqd_clear(data, IRQD_AFFINITY_SUSPENDED); > + } else { > + irq_affinity_set(irq, desc, mask); I don't think this should be forced. In that case we really can use the regular mechanism. Thanks, tglx
[PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events
Remove a CPU from the affinity mask when it goes offline and add it back when it returns. In case the vetor was assigned only to the CPU going offline it will be shutdown and re-started when the CPU reappears. Signed-off-by: Christoph Hellwig --- arch/x86/kernel/irq.c | 3 +- include/linux/cpuhotplug.h | 1 + include/linux/irq.h| 9 kernel/cpu.c | 6 +++ kernel/irq/affinity.c | 127 - 5 files changed, 144 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index a54eac5d81b3..72c35ed534f1 100644 --- a/arch/x86/kernel/irq.c +++ b/arch/x86/kernel/irq.c @@ -453,7 +453,8 @@ void fixup_irqs(void) data = irq_desc_get_irq_data(desc); affinity = irq_data_get_affinity_mask(data); - if (!irq_has_action(irq) || irqd_is_per_cpu(data) || + if (irqd_affinity_is_managed(data) || + !irq_has_action(irq) || irqd_is_per_cpu(data) || cpumask_subset(affinity, cpu_online_mask)) { raw_spin_unlock(&desc->lock); continue; diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index 0f2a80377520..c15f22c54535 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -124,6 +124,7 @@ enum cpuhp_state { CPUHP_AP_ONLINE_IDLE, CPUHP_AP_SMPBOOT_THREADS, CPUHP_AP_X86_VDSO_VMA_ONLINE, + CPUHP_AP_IRQ_AFFINITY_ONLINE, CPUHP_AP_PERF_ONLINE, CPUHP_AP_PERF_X86_ONLINE, CPUHP_AP_PERF_X86_UNCORE_ONLINE, diff --git a/include/linux/irq.h b/include/linux/irq.h index f887351aa80e..ae15b8582685 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -216,6 +216,7 @@ enum { IRQD_WAKEUP_ARMED = (1 << 19), IRQD_FORWARDED_TO_VCPU = (1 << 20), IRQD_AFFINITY_MANAGED = (1 << 21), + IRQD_AFFINITY_SUSPENDED = (1 << 22), }; #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors) @@ -329,6 +330,11 @@ static inline void irqd_clr_activated(struct irq_data *d) __irqd_to_state(d) &= ~IRQD_ACTIVATED; } +static inline bool irqd_affinity_is_suspended(struct irq_data *d) +{ + return __irqd_to_state(d) & IRQD_AFFINITY_SUSPENDED; +} + #undef 
__irqd_to_state static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d) @@ -1025,4 +1031,7 @@ int __ipi_send_mask(struct irq_desc *desc, const struct cpumask *dest); int ipi_send_single(unsigned int virq, unsigned int cpu); int ipi_send_mask(unsigned int virq, const struct cpumask *dest); +int irq_affinity_online_cpu(unsigned int cpu); +int irq_affinity_offline_cpu(unsigned int cpu); + #endif /* _LINUX_IRQ_H */ diff --git a/kernel/cpu.c b/kernel/cpu.c index 9ae6fbe5b5cf..ef0c5b63ca0d 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #define CREATE_TRACE_POINTS @@ -1252,6 +1253,11 @@ static struct cpuhp_step cpuhp_ap_states[] = { .startup.single = smpboot_unpark_threads, .teardown.single= NULL, }, + [CPUHP_AP_IRQ_AFFINITY_ONLINE] = { + .name = "irq/affinity:online", + .startup.single = irq_affinity_online_cpu, + .teardown.single= irq_affinity_offline_cpu, + }, [CPUHP_AP_PERF_ONLINE] = { .name = "perf:online", .startup.single = perf_event_init_cpu, diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index 337e6ffba93f..e27ecfb4866f 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -1,4 +1,7 @@ - +/* + * Copyright (C) 2016 Thomas Gleixner. + * Copyright (C) 2016-2017 Christoph Hellwig. 
+ */ #include #include #include @@ -227,3 +230,125 @@ int irq_calc_affinity_vectors(int maxvec, const struct irq_affinity *affd) return min_t(int, cpumask_weight(cpu_present_mask), vecs) + resv; } + +static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc, + unsigned int cpu) +{ + const struct cpumask *affinity; + struct irq_data *data; + struct irq_chip *chip; + unsigned long flags; + cpumask_var_t mask; + + if (!desc) + return; + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) + return; + + raw_spin_lock_irqsave(&desc->lock, flags); + + data = irq_desc_get_irq_data(desc); + affinity = irq_data_get_affinity_mask(data); + if (!irqd_affinity_is_managed(data) || + !irq_has_action(irq) || + !cpumask_test_cpu(cpu, affinity)) + goto out_free_cpumask; + + /* +* The interrupt descriptor might have been cleaned up +* already, but it is not yet removed from the
[PATCH 4/7] genirq/affinity: update CPU affinity for CPU hotplug events
Remove a CPU from the affinity mask when it goes offline and add it back
when it returns. In case the vector was assigned only to the CPU going
offline it will be shut down and re-started when the CPU reappears.

Signed-off-by: Christoph Hellwig
---
 arch/x86/kernel/irq.c      |   3 +-
 include/linux/cpuhotplug.h |   1 +
 include/linux/irq.h        |   9
 kernel/cpu.c               |   6 +++
 kernel/irq/affinity.c      | 129 -
 5 files changed, 146 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index a54eac5d81b3..72c35ed534f1 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -453,7 +453,8 @@ void fixup_irqs(void)
 		data = irq_desc_get_irq_data(desc);
 		affinity = irq_data_get_affinity_mask(data);
 
-		if (!irq_has_action(irq) || irqd_is_per_cpu(data) ||
+		if (irqd_affinity_is_managed(data) ||
+		    !irq_has_action(irq) || irqd_is_per_cpu(data) ||
 		    cpumask_subset(affinity, cpu_online_mask)) {
 			raw_spin_unlock(&desc->lock);
 			continue;
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 0f2a80377520..c15f22c54535 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -124,6 +124,7 @@ enum cpuhp_state {
 	CPUHP_AP_ONLINE_IDLE,
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
+	CPUHP_AP_IRQ_AFFINITY_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
 	CPUHP_AP_PERF_X86_ONLINE,
 	CPUHP_AP_PERF_X86_UNCORE_ONLINE,
diff --git a/include/linux/irq.h b/include/linux/irq.h
index f887351aa80e..ae15b8582685 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -216,6 +216,7 @@ enum {
 	IRQD_WAKEUP_ARMED		= (1 << 19),
 	IRQD_FORWARDED_TO_VCPU		= (1 << 20),
 	IRQD_AFFINITY_MANAGED		= (1 << 21),
+	IRQD_AFFINITY_SUSPENDED		= (1 << 22),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -329,6 +330,11 @@ static inline void irqd_clr_activated(struct irq_data *d)
 {
 	__irqd_to_state(d) &= ~IRQD_ACTIVATED;
 }
 
+static inline bool irqd_affinity_is_suspended(struct irq_data *d)
+{
+	return __irqd_to_state(d) & IRQD_AFFINITY_SUSPENDED;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
@@ -1025,4 +1031,7 @@ int __ipi_send_mask(struct irq_desc *desc, const struct cpumask *dest);
 int ipi_send_single(unsigned int virq, unsigned int cpu);
 int ipi_send_mask(unsigned int virq, const struct cpumask *dest);
 
+int irq_affinity_online_cpu(unsigned int cpu);
+int irq_affinity_offline_cpu(unsigned int cpu);
+
 #endif /* _LINUX_IRQ_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 9ae6fbe5b5cf..ef0c5b63ca0d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -27,6 +27,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #define CREATE_TRACE_POINTS
@@ -1252,6 +1253,11 @@ static struct cpuhp_step cpuhp_ap_states[] = {
 		.startup.single		= smpboot_unpark_threads,
 		.teardown.single	= NULL,
 	},
+	[CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
+		.name			= "irq/affinity:online",
+		.startup.single		= irq_affinity_online_cpu,
+		.teardown.single	= irq_affinity_offline_cpu,
+	},
 	[CPUHP_AP_PERF_ONLINE] = {
 		.name			= "perf:online",
 		.startup.single		= perf_event_init_cpu,
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index d58431f59f7c..809a7d241eff 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -1,8 +1,13 @@
-
+/*
+ * Copyright (C) 2016 Thomas Gleixner.
+ * Copyright (C) 2016-2017 Christoph Hellwig.
+ */
 #include
 #include
 #include
 #include
+#include
+#include "internals.h"
 
 static cpumask_var_t node_to_present_cpumask[MAX_NUMNODES] __read_mostly;
 
@@ -176,6 +181,128 @@ bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask)
 	return ret;
 }
 
+static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
+	unsigned int cpu)
+{
+	const struct cpumask *affinity;
+	struct irq_data *data;
+	struct irq_chip *chip;
+	unsigned long flags;
+	cpumask_var_t mask;
+
+	if (!desc)
+		return;
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+		return;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	data = irq_desc_get_irq_data(desc);
+	affinity = irq_data_get_affinity_mask(data);
+	if (!irqd_affinity_is_managed(data) ||
+	    !irq_has_action(irq) ||
+	    !cpumask_test_cpu(cpu, affinity))
+		goto out_free_cpumask;
+
+	/*
+	 * The interrupt descriptor might have
Re: [PATCH 3/6] genirq/affinity: update CPU affinity for CPU hotplug events
On Fri, 3 Feb 2017, Christoph Hellwig wrote:

> @@ -127,6 +127,7 @@ enum cpuhp_state {
>  	CPUHP_AP_ONLINE_IDLE,
>  	CPUHP_AP_SMPBOOT_THREADS,
>  	CPUHP_AP_X86_VDSO_VMA_ONLINE,
> +	CPUHP_AP_IRQ_AFFINIY_ONLINE,

s/AFFINIY/AFFINITY/ perhaps?

> +static void __irq_affinity_set(unsigned int irq, struct irq_desc *desc,
> +	cpumask_t *mask)

  static void __irq_affinity_set(unsigned int irq, struct irq_desc *desc,
                                 cpumask_t *mask)

Please

> +{
> +	struct irq_data *data = irq_desc_get_irq_data(desc);
> +	struct irq_chip *chip = irq_data_get_irq_chip(data);
> +	int ret;
> +
> +	if (!irqd_can_move_in_process_context(data) && chip->irq_mask)
> +		chip->irq_mask(data);
> +	ret = chip->irq_set_affinity(data, mask, true);
> +	WARN_ON_ONCE(ret);
> +
> +	/*
> +	 * We unmask if the irq was not marked masked by the core code.
> +	 * That respects the lazy irq disable behaviour.
> +	 */
> +	if (!irqd_can_move_in_process_context(data) &&
> +	    !irqd_irq_masked(data) && chip->irq_unmask)
> +		chip->irq_unmask(data);
> +}

This looks very familiar. arch/x86/kernel/irq.c comes to mind

> +
> +static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
> +	unsigned int cpu)
> +{
> +	const struct cpumask *affinity;
> +	struct irq_data *data;
> +	struct irq_chip *chip;
> +	unsigned long flags;
> +	cpumask_var_t mask;
> +
> +	if (!desc)
> +		return;
> +
> +	raw_spin_lock_irqsave(&desc->lock, flags);
> +
> +	data = irq_desc_get_irq_data(desc);
> +	affinity = irq_data_get_affinity_mask(data);
> +	if (!irqd_affinity_is_managed(data) ||
> +	    !irq_has_action(irq) ||
> +	    !cpumask_test_cpu(cpu, affinity))
> +		goto out_unlock;
> +
> +	/*
> +	 * The interrupt descriptor might have been cleaned up
> +	 * already, but it is not yet removed from the radix tree
> +	 */
> +	chip = irq_data_get_irq_chip(data);
> +	if (!chip)
> +		goto out_unlock;
> +
> +	if (WARN_ON_ONCE(!chip->irq_set_affinity))
> +		goto out_unlock;
> +
> +	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {

You really want to allocate that _BEFORE_ locking desc->lock. GFP_KERNEL
inside the lock held region is wrong and shows that this was never
tested :) And no, we don't want GFP_ATOMIC here. You can allocate it
once at the call site and hand it in, so you avoid the alloc/free dance
when iterating over a large number of descriptors.

> +		pr_err("failed to allocate memory for cpumask\n");
> +		goto out_unlock;
> +	}
> +
> +	cpumask_and(mask, affinity, cpu_online_mask);
> +	cpumask_set_cpu(cpu, mask);
> +	if (irqd_has_set(data, IRQD_AFFINITY_SUSPENDED)) {
> +		irq_startup(desc, false);
> +		irqd_clear(data, IRQD_AFFINITY_SUSPENDED);
> +	} else {
> +		__irq_affinity_set(irq, desc, mask);
> +	}
> +
> +	free_cpumask_var(mask);
> +out_unlock:
> +	raw_spin_unlock_irqrestore(&desc->lock, flags);
> +}

> +int irq_affinity_online_cpu(unsigned int cpu)
> +{
> +	struct irq_desc *desc;
> +	unsigned int irq;
> +
> +	for_each_irq_desc(irq, desc)
> +		irq_affinity_online_irq(irq, desc, cpu);

That lacks protection against concurrent irq setup/teardown. Wants to be
protected with irq_lock_sparse()

> +	return 0;
> +}
> +
> +static void irq_affinity_offline_irq(unsigned int irq, struct irq_desc *desc,
> +	unsigned int cpu)
> +{
> +	const struct cpumask *affinity;
> +	struct irq_data *data;
> +	struct irq_chip *chip;
> +	unsigned long flags;
> +	cpumask_var_t mask;
> +
> +	if (!desc)
> +		return;
> +
> +	raw_spin_lock_irqsave(&desc->lock, flags);
> +
> +	data = irq_desc_get_irq_data(desc);
> +	affinity = irq_data_get_affinity_mask(data);
> +	if (!irqd_affinity_is_managed(data) ||
> +	    !irq_has_action(irq) ||
> +	    irqd_has_set(data, IRQD_AFFINITY_SUSPENDED) ||
> +	    !cpumask_test_cpu(cpu, affinity))
> +		goto out_unlock;
> +
> +	/*
> +	 * Complete the irq move. This cpu is going down and for
> +	 * non intr-remapping case, we can't wait till this interrupt
> +	 * arrives at this cpu before completing the irq move.
> +	 */
> +	irq_force_complete_move(desc);

Hmm. That's what we do in x86 when the cpu is really dying, i.e. before
it really goes away. It's the last resort we have. So if a move is
pending, then you force it here and then you call __irq_affinity_set()
further down, which queues another pending move, which then gets cleaned
up in the cpu dying code.

If a move is pending, then you should first verify whether the pending
affinity mask is different from the one you are going to set. If it's
the same, you can just let the final cleanup code do its job. If not,
then you need to
Re: [PATCH 3/6] genirq/affinity: update CPU affinity for CPU hotplug events
Hi Christoph,

[auto build test ERROR on block/for-next]
[also build test ERROR on v4.10-rc6]
[cannot apply to next-20170203]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Christoph-Hellwig/genirq-allow-assigning-affinity-to-present-but-not-online-CPUs/20170203-224056
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: arm-socfpga_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm

All error/warnings (new ones prefixed by >>):

   In file included from kernel/irq/internals.h:8:0,
                    from kernel/irq/affinity.c:9:
>> include/linux/irqdesc.h:52:25: error: field 'irq_common_data' has incomplete type
      struct irq_common_data irq_common_data;
                             ^~~
>> include/linux/irqdesc.h:53:19: error: field 'irq_data' has incomplete type
      struct irq_data irq_data;
                      ^~~~
>> include/linux/irqdesc.h:55:2: error: unknown type name 'irq_flow_handler_t'
      irq_flow_handler_t handle_irq;
      ^~
   In file included from include/linux/interrupt.h:5:0,
                    from kernel/irq/affinity.c:5:
   include/linux/irqdesc.h: In function 'irq_data_to_desc':
>> include/linux/irqdesc.h:111:26: error: dereferencing pointer to incomplete type 'struct irq_data'
      return container_of(data->common, struct irq_desc, irq_common_data);
                              ^
   include/linux/kernel.h:850:49: note: in definition of macro 'container_of'
     const typeof( ((type *)0)->member ) *__mptr = (ptr); \
                                                    ^~~
   In file included from kernel/irq/internals.h:8:0,
                    from kernel/irq/affinity.c:9:
   include/linux/irqdesc.h: In function 'generic_handle_irq_desc':
>> include/linux/irqdesc.h:150:2: error: called object is not a function or function pointer
      desc->handle_irq(desc);
      ^~~~
   include/linux/irqdesc.h: At top level:
   include/linux/irqdesc.h:194:8: error: unknown type name 'irq_flow_handler_t'
           irq_flow_handler_t handler)
           ^~
   include/linux/irqdesc.h:215:6: error: unknown type name 'irq_flow_handler_t'
         irq_flow_handler_t handler, const char *name)
         ^~
   include/linux/irqdesc.h: In function 'irq_balancing_disabled':
>> include/linux/irqdesc.h:229:38: error: 'IRQ_NO_BALANCING_MASK' undeclared (first use in this function)
      return desc->status_use_accessors & IRQ_NO_BALANCING_MASK;
                                          ^
   include/linux/irqdesc.h:229:38: note: each undeclared identifier is reported only once for each function it appears in
   include/linux/irqdesc.h: In function 'irq_is_percpu':
>> include/linux/irqdesc.h:237:38: error: 'IRQ_PER_CPU' undeclared (first use in this function)
      return desc->status_use_accessors & IRQ_PER_CPU;
                                          ^~~
   In file included from kernel/irq/internals.h:62:0,
                    from kernel/irq/affinity.c:9:
   kernel/irq/debug.h: In function 'print_irq_desc':
>> kernel/irq/debug.h:16:28: warning: format '%p' expects argument of type 'void *', but argument 2 has type 'int' [-Wformat=]
      printk("->handle_irq(): %p, ", desc->handle_irq);
                              ^
>> kernel/irq/debug.h:26:7: error: 'IRQ_LEVEL' undeclared (first use in this function)
      ___P(IRQ_LEVEL);
           ^
   kernel/irq/debug.h:7:50: note: in definition of macro '___P'
    #define ___P(f) if (desc->status_use_accessors & f) printk("%14s set\n", #f)
                                                     ^
>> kernel/irq/debug.h:27:7: error: 'IRQ_PER_CPU' undeclared (first use in this function)
      ___P(IRQ_PER_CPU);
           ^
   kernel/irq/debug.h:7:50: note: in definition of macro '___P'
    #define ___P(f) if (desc->status_use_accessors & f) printk("%14s set\n", #f)
                                                     ^
>> kernel/irq/debug.h:28:7: error: 'IRQ_NOPROBE' undeclared (first use in this function)
      ___P(IRQ_NOPROBE);
           ^
   kernel/irq/debug.h:7:50: note: in definition of macro '___P'
    #define ___P(f) if (desc->status_use_accessors & f) printk("%14s set\n", #f)
                                                     ^
>> kernel/irq/debug.h:29:7: error: 'IRQ_NOREQUEST' undeclared (first use in this function)
      ___P(IRQ_NOREQUEST);
           ^
   kernel/irq/debug.h:7:50: note: in definition of macro '___P'
    #define ___P(f) if (desc->status_use_accessors & f) printk("%14s set\n", #f)
Re: [PATCH 3/6] genirq/affinity: update CPU affinity for CPU hotplug events
Hi Christoph,

[auto build test ERROR on block/for-next]
[also build test ERROR on v4.10-rc6]
[cannot apply to next-20170203]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Christoph-Hellwig/genirq-allow-assigning-affinity-to-present-but-not-online-CPUs/20170203-224056
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: ia64-allmodconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 6.2.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=ia64

All errors (new ones prefixed by >>):

   kernel/irq/affinity.c: In function 'irq_affinity_offline_irq':
>> kernel/irq/affinity.c:264:2: error: implicit declaration of function 'irq_force_complete_move' [-Werror=implicit-function-declaration]
     irq_force_complete_move(desc);
     ^~~
   cc1: some warnings being treated as errors

vim +/irq_force_complete_move +264 kernel/irq/affinity.c

   258
   259		/*
   260		 * Complete the irq move. This cpu is going down and for
   261		 * non intr-remapping case, we can't wait till this interrupt
   262		 * arrives at this cpu before completing the irq move.
   263		 */
 > 264		irq_force_complete_move(desc);
   265
   266		/*
   267		 * The interrupt descriptor might have been cleaned up

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
[PATCH 3/6] genirq/affinity: update CPU affinity for CPU hotplug events
Remove a CPU from the affinity mask when it goes offline and add it back
when it returns. In case the vector was assigned only to the CPU going
offline it will be shut down and re-started when the CPU reappears.

Signed-off-by: Christoph Hellwig
---
 arch/x86/kernel/irq.c      |   3 +-
 include/linux/cpuhotplug.h |   1 +
 include/linux/irq.h        |   9 +++
 kernel/cpu.c               |   6 ++
 kernel/irq/affinity.c      | 157 -
 5 files changed, 174 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 7c6e9ffe4424..285ef40ae290 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -449,7 +449,8 @@ void fixup_irqs(void)
 		data = irq_desc_get_irq_data(desc);
 		affinity = irq_data_get_affinity_mask(data);
 
-		if (!irq_has_action(irq) || irqd_is_per_cpu(data) ||
+		if (irqd_affinity_is_managed(data) ||
+		    !irq_has_action(irq) || irqd_is_per_cpu(data) ||
 		    cpumask_subset(affinity, cpu_online_mask)) {
 			raw_spin_unlock(&desc->lock);
 			continue;
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index d936a0021839..63406ae5b2df 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -127,6 +127,7 @@ enum cpuhp_state {
 	CPUHP_AP_ONLINE_IDLE,
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
+	CPUHP_AP_IRQ_AFFINIY_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
 	CPUHP_AP_PERF_X86_ONLINE,
 	CPUHP_AP_PERF_X86_UNCORE_ONLINE,
diff --git a/include/linux/irq.h b/include/linux/irq.h
index e79875574b39..4b2a542b2591 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -214,6 +214,7 @@ enum {
 	IRQD_WAKEUP_ARMED		= (1 << 19),
 	IRQD_FORWARDED_TO_VCPU		= (1 << 20),
 	IRQD_AFFINITY_MANAGED		= (1 << 21),
+	IRQD_AFFINITY_SUSPENDED		= (1 << 22),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -312,6 +313,11 @@ static inline bool irqd_affinity_is_managed(struct irq_data *d)
 {
 	return __irqd_to_state(d) & IRQD_AFFINITY_MANAGED;
 }
 
+static inline bool irqd_affinity_is_suspended(struct irq_data *d)
+{
+	return __irqd_to_state(d) & IRQD_AFFINITY_SUSPENDED;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
@@ -989,4 +995,7 @@ int __ipi_send_mask(struct irq_desc *desc, const struct cpumask *dest);
 int ipi_send_single(unsigned int virq, unsigned int cpu);
 int ipi_send_mask(unsigned int virq, const struct cpumask *dest);
 
+int irq_affinity_online_cpu(unsigned int cpu);
+int irq_affinity_offline_cpu(unsigned int cpu);
+
 #endif /* _LINUX_IRQ_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 0a5f630f5c54..fe19af6a896b 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -25,6 +25,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #define CREATE_TRACE_POINTS
@@ -1248,6 +1249,11 @@ static struct cpuhp_step cpuhp_ap_states[] = {
 		.startup.single		= smpboot_unpark_threads,
 		.teardown.single	= NULL,
 	},
+	[CPUHP_AP_IRQ_AFFINIY_ONLINE] = {
+		.name			= "irq/affinity:online",
+		.startup.single		= irq_affinity_online_cpu,
+		.teardown.single	= irq_affinity_offline_cpu,
+	},
 	[CPUHP_AP_PERF_ONLINE] = {
 		.name			= "perf:online",
 		.startup.single		= perf_event_init_cpu,
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 6cd20a569359..74006167892d 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -1,8 +1,12 @@
-
+/*
+ * Copyright (C) 2016 Thomas Gleixner.
+ * Copyright (C) 2016-2017 Christoph Hellwig.
+ */
 #include
 #include
 #include
 #include
+#include "internals.h"
 
 static cpumask_var_t node_to_present_cpumask[MAX_NUMNODES] __read_mostly;
 
@@ -148,6 +152,157 @@ int irq_calc_affinity_vectors(int maxvec, const struct irq_affinity *affd)
 	return min_t(int, cpumask_weight(cpu_present_mask), vecs) + resv;
 }
 
+static void __irq_affinity_set(unsigned int irq, struct irq_desc *desc,
+	cpumask_t *mask)
+{
+	struct irq_data *data = irq_desc_get_irq_data(desc);
+	struct irq_chip *chip = irq_data_get_irq_chip(data);
+	int ret;
+
+	if (!irqd_can_move_in_process_context(data) && chip->irq_mask)
+		chip->irq_mask(data);
+	ret = chip->irq_set_affinity(data, mask, true);
+	WARN_ON_ONCE(ret);
+
+	/*
+	 * We unmask if the irq was not marked masked by the core code.
+	 * That respects the lazy irq disable behaviour.
+	 */
+	if (!irqd_can_move_in_process_context(data) &&
+	    !irqd_irq_masked(data) && chip->irq_unmask)
+		chip->irq_unmask(
Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page
On 01/16/2017 04:58 AM, Anshuman Khandual wrote:
> On 01/13/2017 04:13 AM, Dan Williams wrote:
>> Back when we were first attempting to support DMA for DAX mappings of persistent memory the plan was to forgo 'struct page' completely and develop a pfn-to-scatterlist capability for the dma-mapping-api. That effort died in this thread:
>>
>> https://lkml.org/lkml/2015/8/14/3
>>
>> ...where we learned that the dependencies on struct page for dma mapping are deeper than a PFN_PHYS() conversion for some architectures. That was the moment we pivoted to ZONE_DEVICE and arranged for a 'struct page' to be available for any persistent memory range that needs to be the target of DMA. ZONE_DEVICE enables any device-driver that can target "System RAM" to also be able to target persistent memory through a DAX mapping.
>>
>> Since that time the "page-less" DAX path has continued to mature [1] without growing new dependencies on struct page, but at the same time continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
>>
>> Peer-to-peer DMA appears to be evolving from a niche embedded use case to something general purpose platforms will need to comprehend. The "map_peer_resource" [2] approach looks to be headed to the same destination as the pfn-to-scatterlist effort. It's difficult to avoid 'struct page' for describing DMA operations without custom driver code.
>>
>> With that background, a statement and a question to discuss at LSF/MM:
>>
>> General purpose DMA, i.e. any DMA setup through the dma-mapping-api, requires pfn_to_page() support across the entire physical address range mapped.
>>
>> Is ZONE_DEVICE the proper vehicle for this? We've already seen that it collides with platform alignment assumptions [3], and if there's a wider effort to rework memory hotplug [4] it seems DMA support should be part of the discussion.
>
> I had experimented with ZONE_DEVICE representation from migration point of view. Tried migration of both anonymous pages as well as file cache pages into and away from ZONE_DEVICE memory. Learned that the lack of 'page->lru' element in the struct page of the ZONE_DEVICE memory makes it difficult for it to represent file backed mapping in it's present form. But given

That reminds me: while testing out HMM in our device driver, we had some early difficulties with the LRU system (including pagevec) in general. For example, sometimes HMM was forced to say "I cannot migrate your page range, because a page is still on the very most recently used list". If the number of pages was very small, then *all* the pages might be on that list. :)

HMM avoids the problem by forcing it, but it reminds me that the LRU and pagevec were never really intended to intersect with device memory.

Another point that may seem unrelated at first: using struct pages and pfns to back device memory is still under discussion:

a) Need to avoid using pfns that can ever be needed for other hotpluggable memory

b) *Very* hard to justify adding any fields to struct page, or flags for it, of course.

...but given this new-ish requirement to support these types of devices, maybe (b) actually makes sense. Something to discuss.

thanks,
John Hubbard
NVIDIA

> that ZONE_DEVICE was created to enable direct mapping (DAX) bypassing page cache, it came as no surprise. My objective has been how ZONE_DEVICE can accommodate movable coherent device memory. In our HMM discussions I had brought to the attention how ZONE_DEVICE going forward should evolve to represent all these three types of device memory.
>
> * Unmovable addressable device memory (persistent memory)
> * Movable addressable device memory (similar memory represented as CDM)
> * Movable un-addressable device memory (similar memory represented as HMM)
>
> I would like to attend to discuss on the road map for ZONE_DEVICE, struct pages and device memory in general.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page
On 01/13/2017 04:13 AM, Dan Williams wrote:
> Back when we were first attempting to support DMA for DAX mappings of persistent memory the plan was to forgo 'struct page' completely and develop a pfn-to-scatterlist capability for the dma-mapping-api. That effort died in this thread:
>
> https://lkml.org/lkml/2015/8/14/3
>
> ...where we learned that the dependencies on struct page for dma mapping are deeper than a PFN_PHYS() conversion for some architectures. That was the moment we pivoted to ZONE_DEVICE and arranged for a 'struct page' to be available for any persistent memory range that needs to be the target of DMA. ZONE_DEVICE enables any device-driver that can target "System RAM" to also be able to target persistent memory through a DAX mapping.
>
> Since that time the "page-less" DAX path has continued to mature [1] without growing new dependencies on struct page, but at the same time continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
>
> Peer-to-peer DMA appears to be evolving from a niche embedded use case to something general purpose platforms will need to comprehend. The "map_peer_resource" [2] approach looks to be headed to the same destination as the pfn-to-scatterlist effort. It's difficult to avoid 'struct page' for describing DMA operations without custom driver code.
>
> With that background, a statement and a question to discuss at LSF/MM:
>
> General purpose DMA, i.e. any DMA setup through the dma-mapping-api, requires pfn_to_page() support across the entire physical address range mapped.
>
> Is ZONE_DEVICE the proper vehicle for this? We've already seen that it collides with platform alignment assumptions [3], and if there's a wider effort to rework memory hotplug [4] it seems DMA support should be part of the discussion.

I had experimented with ZONE_DEVICE representation from migration point of view. Tried migration of both anonymous pages as well as file cache pages into and away from ZONE_DEVICE memory. Learned that the lack of 'page->lru' element in the struct page of the ZONE_DEVICE memory makes it difficult for it to represent file backed mapping in it's present form. But given that ZONE_DEVICE was created to enable direct mapping (DAX) bypassing page cache, it came as no surprise.

My objective has been how ZONE_DEVICE can accommodate movable coherent device memory. In our HMM discussions I had brought to the attention how ZONE_DEVICE going forward should evolve to represent all these three types of device memory.

* Unmovable addressable device memory (persistent memory)
* Movable addressable device memory (similar memory represented as CDM)
* Movable un-addressable device memory (similar memory represented as HMM)

I would like to attend to discuss on the road map for ZONE_DEVICE, struct pages and device memory in general.
Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page
On Thu, Jan 12, 2017 at 3:14 PM, Jerome Glisse wrote:
> On Thu, Jan 12, 2017 at 02:43:03PM -0800, Dan Williams wrote:
>> Back when we were first attempting to support DMA for DAX mappings of persistent memory the plan was to forgo 'struct page' completely and develop a pfn-to-scatterlist capability for the dma-mapping-api. That effort died in this thread:
>>
>> https://lkml.org/lkml/2015/8/14/3
>>
>> ...where we learned that the dependencies on struct page for dma mapping are deeper than a PFN_PHYS() conversion for some architectures. That was the moment we pivoted to ZONE_DEVICE and arranged for a 'struct page' to be available for any persistent memory range that needs to be the target of DMA. ZONE_DEVICE enables any device-driver that can target "System RAM" to also be able to target persistent memory through a DAX mapping.
>>
>> Since that time the "page-less" DAX path has continued to mature [1] without growing new dependencies on struct page, but at the same time continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
>>
>> Peer-to-peer DMA appears to be evolving from a niche embedded use case to something general purpose platforms will need to comprehend. The "map_peer_resource" [2] approach looks to be headed to the same destination as the pfn-to-scatterlist effort. It's difficult to avoid 'struct page' for describing DMA operations without custom driver code.
>>
>> With that background, a statement and a question to discuss at LSF/MM:
>>
>> General purpose DMA, i.e. any DMA setup through the dma-mapping-api, requires pfn_to_page() support across the entire physical address range mapped.
>
> Note that in my case it is even worse. The pfn of the page does not correspond to anything so it need to go through a special function to find if a page can be mapped for another device and to provide a valid pfn at which the page can be access by other device.

I still haven't quite wrapped my head about how these pfn ranges are created. Would this be a use case for a new pfn_t flag? It doesn't sound like something we'd want to risk describing with raw 'unsigned long' pfns.
Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page
On Thu, Jan 12, 2017 at 02:43:03PM -0800, Dan Williams wrote:
> Back when we were first attempting to support DMA for DAX mappings of persistent memory the plan was to forgo 'struct page' completely and develop a pfn-to-scatterlist capability for the dma-mapping-api. That effort died in this thread:
>
> https://lkml.org/lkml/2015/8/14/3
>
> ...where we learned that the dependencies on struct page for dma mapping are deeper than a PFN_PHYS() conversion for some architectures. That was the moment we pivoted to ZONE_DEVICE and arranged for a 'struct page' to be available for any persistent memory range that needs to be the target of DMA. ZONE_DEVICE enables any device-driver that can target "System RAM" to also be able to target persistent memory through a DAX mapping.
>
> Since that time the "page-less" DAX path has continued to mature [1] without growing new dependencies on struct page, but at the same time continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
>
> Peer-to-peer DMA appears to be evolving from a niche embedded use case to something general purpose platforms will need to comprehend. The "map_peer_resource" [2] approach looks to be headed to the same destination as the pfn-to-scatterlist effort. It's difficult to avoid 'struct page' for describing DMA operations without custom driver code.
>
> With that background, a statement and a question to discuss at LSF/MM:
>
> General purpose DMA, i.e. any DMA setup through the dma-mapping-api, requires pfn_to_page() support across the entire physical address range mapped.

Note that in my case it is even worse. The pfn of the page does not correspond to anything so it need to go through a special function to find if a page can be mapped for another device and to provide a valid pfn at which the page can be access by other device. Basically the PCIE bar is like a window into the device memory that is dynamically remapped to specific pages of the device memory. Not all device memory can be exposed through the PCIE bar because of PCIE issues.

> Is ZONE_DEVICE the proper vehicle for this? We've already seen that it collides with platform alignment assumptions [3], and if there's a wider effort to rework memory hotplug [4] it seems DMA support should be part of the discussion.

Obviously i would like to join this discussion :)

Cheers,
Jérôme
[LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page
Back when we were first attempting to support DMA for DAX mappings of persistent memory the plan was to forgo 'struct page' completely and develop a pfn-to-scatterlist capability for the dma-mapping-api. That effort died in this thread:

https://lkml.org/lkml/2015/8/14/3

...where we learned that the dependencies on struct page for dma mapping are deeper than a PFN_PHYS() conversion for some architectures. That was the moment we pivoted to ZONE_DEVICE and arranged for a 'struct page' to be available for any persistent memory range that needs to be the target of DMA. ZONE_DEVICE enables any device-driver that can target "System RAM" to also be able to target persistent memory through a DAX mapping.

Since that time the "page-less" DAX path has continued to mature [1] without growing new dependencies on struct page, but at the same time continuing to rely on ZONE_DEVICE to satisfy get_user_pages().

Peer-to-peer DMA appears to be evolving from a niche embedded use case to something general purpose platforms will need to comprehend. The "map_peer_resource" [2] approach looks to be headed to the same destination as the pfn-to-scatterlist effort. It's difficult to avoid 'struct page' for describing DMA operations without custom driver code.

With that background, a statement and a question to discuss at LSF/MM:

General purpose DMA, i.e. any DMA setup through the dma-mapping-api, requires pfn_to_page() support across the entire physical address range mapped.

Is ZONE_DEVICE the proper vehicle for this? We've already seen that it collides with platform alignment assumptions [3], and if there's a wider effort to rework memory hotplug [4] it seems DMA support should be part of the discussion.

---

This topic focuses on the mechanism to enable pfn_to_page() for an arbitrary physical address range, and the proposed peer-to-peer DMA topic [5] touches on the userspace presentation of this mechanism. It might be good to combine these topics if there's interest? In any event, I'm interested in both, as well as Michal's concern about memory hotplug in general.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-November/007672.html
[2]: http://www.spinics.net/lists/linux-pci/msg44560.html
[3]: https://lkml.org/lkml/2016/12/1/740
[4]: http://www.spinics.net/lists/linux-mm/msg119369.html
[5]: http://marc.info/?l=linux-mm&m=148156541804940&w=2
Re: [PATCH] blk-mq: Drop explicit timeout sync in hotplug
On 11/28/2016 10:01 AM, Gabriel Krisman Bertazi wrote: Sorry for the dup. Missed linux-block address. >8 After commit 287922eb0b18 ("block: defer timeouts to a workqueue"), deleting the timeout work after freezing the queue shouldn't be necessary, since the synchronization is already enforced by the acquisition of a q_usage_counter reference in blk_mq_timeout_work. Added for 4.10, thanks. -- Jens Axboe
Re: [PATCH] blk-mq: Drop explicit timeout sync in hotplug
On Tue, Nov 29, 2016 at 1:01 AM, Gabriel Krisman Bertazi wrote: > Sorry for the dup. Missed linux-block address. > >>8 > > After commit 287922eb0b18 ("block: defer timeouts to a workqueue"), > deleting the timeout work after freezing the queue shouldn't be > necessary, since the synchronization is already enforced by the > acquisition of a q_usage_counter reference in blk_mq_timeout_work. > > Signed-off-by: Gabriel Krisman Bertazi > Cc: Ming Lei > --- > block/blk-mq.c | 9 + > 1 file changed, 1 insertion(+), 8 deletions(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 9d4a1d630d0b..bac12caece06 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -2220,16 +2220,9 @@ static void blk_mq_queue_reinit_work(void) > */ > list_for_each_entry(q, &all_q_list, all_q_node) > blk_mq_freeze_queue_start(q); > - list_for_each_entry(q, &all_q_list, all_q_node) { > + list_for_each_entry(q, &all_q_list, all_q_node) > blk_mq_freeze_queue_wait(q); > > - /* > -* timeout handler can't touch hw queue during the > -* reinitialization > -*/ > - del_timer_sync(&q->timeout); > - } > - > list_for_each_entry(q, &all_q_list, all_q_node) > blk_mq_queue_reinit(q, &cpuhp_online_new); Reviewed-by: Ming Lei Thanks, Ming -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] blk-mq: Drop explicit timeout sync in hotplug
Sorry for the dup. Missed linux-block address. >8 After commit 287922eb0b18 ("block: defer timeouts to a workqueue"), deleting the timeout work after freezing the queue shouldn't be necessary, since the synchronization is already enforced by the acquisition of a q_usage_counter reference in blk_mq_timeout_work. Signed-off-by: Gabriel Krisman Bertazi Cc: Ming Lei --- block/blk-mq.c | 9 + 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 9d4a1d630d0b..bac12caece06 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2220,16 +2220,9 @@ static void blk_mq_queue_reinit_work(void) */ list_for_each_entry(q, &all_q_list, all_q_node) blk_mq_freeze_queue_start(q); - list_for_each_entry(q, &all_q_list, all_q_node) { + list_for_each_entry(q, &all_q_list, all_q_node) blk_mq_freeze_queue_wait(q); - /* -* timeout handler can't touch hw queue during the -* reinitialization -*/ - del_timer_sync(&q->timeout); - } - list_for_each_entry(q, &all_q_list, all_q_node) blk_mq_queue_reinit(q, &cpuhp_online_new); -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GIT PULL] blk-mq CPU hotplug
Hi Linus, Last topical branch, this is the conversion of blk-mq to the new hotplug state machine. This should merge cleanly. Please pull! git://git.kernel.dk/linux-block.git for-4.9/block-smp Jens Axboe (1): Merge branch 'smp/for-block' of git://git.kernel.org/.../tip/tip into for-4.9/block-smp Sebastian Andrzej Siewior (2): blk-mq: Convert to new hotplug state machine blk-mq: fixup "Convert to new hotplug state machine" Thomas Gleixner (1): blk-mq/cpu-notif: Convert to new hotplug state machine arch/x86/kernel/smpboot.c| 11 -- block/Makefile | 2 +- block/blk-mq-cpu.c | 67 - block/blk-mq.c | 123 +++- block/blk-mq.h | 7 - include/linux/blk-mq.h | 8 +- include/linux/cpuhotplug.h | 112 +- include/trace/events/cpuhp.h | 28 kernel/cpu.c | 344 +++ 9 files changed, 444 insertions(+), 258 deletions(-) delete mode 100644 block/blk-mq-cpu.c -- Jens Axboe
Re: [patch V2 0/2] block/mq: Convert to new hotplug state machine
On 09/22/2016 07:57 AM, Christoph Hellwig wrote: On Tue, Sep 20, 2016 at 09:48:39PM -0600, Jens Axboe wrote: On 09/20/2016 09:21 AM, Thomas Gleixner wrote: The following series converts block/mq to the new hotplug state machine. The series is against block.git/for-next and depends on git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/for-block This branch contains the necessary infrastructure for multi-instance callbacks which allows us to remove blk-mq-cpu.c completely. It can be pulled into the block tree. Thanks, I'll pull this in for 4.9. I still can't find it in your tree despite other commits today. Didn't get to it yesterday, done now. -- Jens Axboe
Re: [patch V2 0/2] block/mq: Convert to new hotplug state machine
On Tue, Sep 20, 2016 at 09:48:39PM -0600, Jens Axboe wrote: > On 09/20/2016 09:21 AM, Thomas Gleixner wrote: >> The following series converts block/mq to the new hotplug state >> machine. The series is against block.git/for-next and depends on >> >>git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/for-block >> >> This branch contains the necessary infrastructure for multi-instance >> callbacks which allows us to remove blk-mq-cpu.c completely. It can be >> pulled into the block tree. > > Thanks, I'll pull this in for 4.9. I still can't find it in your tree despite other commits today.
Re: [Query] increased latency observed in cpu hotplug path
On 9/21/2016 9:26 PM, Akinobu Mita wrote: > 2016-09-21 23:06 GMT+09:00 Khan, Imran : > >>>> commit 084ee793ec1ff4e2171b481642bfbdaa2e10664c >>>> Author: Imran Khan >>>> Date: Thu Sep 15 16:44:02 2016 +0530 >>>> >>>> blk-mq: use static mapping >>>> >>>> blk-mq layer performs a remapping between s/w and h/w contexts and also >>>> between h/w contexts and CPUs, whenever a CPU hotplug event happens. >>>> This remapping has to wait for queue freezing which may take tens of >>>> miliseconds, resulting in a high latency in CPU hotplug path. >>>> This patch makes the above mentioned mappings static so that we can >>>> avoid remapping when CPU hotplug event happens and this results in >>>> improved CPU hotplug latency. >>>> >>>> Change-Id: Idf38cb6c4e78c91fda3c86608c6d0441f01ab435 >>>> Signed-off-by: Imran Khan >>>> >>>> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c >>>> index 8764c24..85fdb88 100644 >>>> --- a/block/blk-mq-cpumap.c >>>> +++ b/block/blk-mq-cpumap.c >>>> @@ -52,18 +52,13 @@ int blk_mq_update_queue_map(unsigned int *map, >>>> unsigned int nr_queues, >>>> >>>> queue = 0; >>>> for_each_possible_cpu(i) { >>>> -if (!cpumask_test_cpu(i, online_mask)) { >>>> -map[i] = 0; >>>> -continue; >>>> -} >>>> - >>>> /* >>>> * Easy case - we have equal or more hardware queues. Or >>>> * there are no thread siblings to take into account. Do >>>> * 1:1 if enough, or sequential mapping if less. 
>>>> */ >>>> -if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) { >>>> -map[i] = cpu_to_queue_index(nr_cpus, nr_queues, >>>> queue); >>>> +if (nr_queues >= nr_cpu_ids) { >>>> +map[i] = cpu_to_queue_index(nr_cpu_ids, nr_queues, >>>> queue); >>>> queue++; >>>> continue; >>>> } >>>> @@ -75,7 +70,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned >>>> int nr_queues, >>>> */ >>>> first_sibling = get_first_sibling(i); >>>> if (first_sibling == i) { >>>> -map[i] = cpu_to_queue_index(nr_uniq_cpus, nr_queues, >>>> +map[i] = cpu_to_queue_index(nr_cpu_ids, nr_queues, >>>> queue); >>>> queue++; >>>> } else >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>>> index 6d6f8fe..0753fdf 100644 >>>> --- a/block/blk-mq.c >>>> +++ b/block/blk-mq.c >>>> @@ -1783,10 +1783,6 @@ static void blk_mq_init_cpu_queues(struct >>>> request_queue *q, >>>> INIT_LIST_HEAD(&__ctx->rq_list); >>>> __ctx->queue = q; >>>> >>>> -/* If the cpu isn't online, the cpu is mapped to first hctx */ >>>> -if (!cpu_online(i)) >>>> -continue; >>>> - >>>> hctx = q->mq_ops->map_queue(q, i); >>>> >>>> /* >>>> @@ -1820,12 +1816,9 @@ static void blk_mq_map_swqueue(struct request_queue >>>> *q, >>>> * Map software to hardware queues >>>> */ >>>> queue_for_each_ctx(q, ctx, i) { >>>> -/* If the cpu isn't online, the cpu is mapped to first hctx */ >>>> -if (!cpumask_test_cpu(i, online_mask)) >>>> -continue; >>>> - >>>> hctx = q->mq_ops->map_queue(q, i); >>>> -cpumask_set_cpu(i, hctx->cpumask); >>>> +if (cpumask_test_cpu(i, online_mask)) >>>> +cpumask_set_cpu(i, hctx->cpumask); >>>> ctx->index_hw = hctx->nr_ctx; >>>> hctx->ctxs[hctx->nr_ctx++] = ctx; >>>> } >>>> @@ -1863,17 +1856,22 @@ static void blk_mq_map_swqueue(struct >>>> reque
Re: [patch V2 0/2] block/mq: Convert to new hotplug state machine
On 09/20/2016 09:21 AM, Thomas Gleixner wrote: The following series converts block/mq to the new hotplug state machine. The series is against block.git/for-next and depends on git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/for-block This branch contains the necessary infrastructure for multi-instance callbacks which allows us to remove blk-mq-cpu.c completely. It can be pulled into the block tree. Thanks, I'll pull this in for 4.9. -- Jens Axboe
[patch V2 1/2] blk/mq/cpu-notif: Convert to hotplug state machine
Replace the block-mq notifier list management with the multi instance facility in the cpu hotplug state machine. Signed-off-by: Thomas Gleixner Cc: Jens Axboe Cc: Peter Zijlstra Cc: linux-block@vger.kernel.org Cc: r...@linutronix.de Cc: Christoph Hellwing --- block/Makefile |2 - block/blk-mq-cpu.c | 67 - block/blk-mq.c | 36 +- block/blk-mq.h |7 - include/linux/blk-mq.h |8 - 5 files changed, 15 insertions(+), 105 deletions(-) --- a/block/Makefile +++ b/block/Makefile @@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-flush.o blk-settings.o blk-ioc.o blk-map.o \ blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \ blk-lib.o blk-mq.o blk-mq-tag.o \ - blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \ + blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \ genhd.o scsi_ioctl.o partition-generic.o ioprio.o \ badblocks.o partitions/ --- a/block/blk-mq-cpu.c +++ /dev/null @@ -1,67 +0,0 @@ -/* - * CPU notifier helper code for blk-mq - * - * Copyright (C) 2013-2014 Jens Axboe - */ -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include "blk-mq.h" - -static LIST_HEAD(blk_mq_cpu_notify_list); -static DEFINE_RAW_SPINLOCK(blk_mq_cpu_notify_lock); - -static int blk_mq_main_cpu_notify(struct notifier_block *self, - unsigned long action, void *hcpu) -{ - unsigned int cpu = (unsigned long) hcpu; - struct blk_mq_cpu_notifier *notify; - int ret = NOTIFY_OK; - - raw_spin_lock(&blk_mq_cpu_notify_lock); - - list_for_each_entry(notify, &blk_mq_cpu_notify_list, list) { - ret = notify->notify(notify->data, action, cpu); - if (ret != NOTIFY_OK) - break; - } - - raw_spin_unlock(&blk_mq_cpu_notify_lock); - return ret; -} - -void blk_mq_register_cpu_notifier(struct blk_mq_cpu_notifier *notifier) -{ - BUG_ON(!notifier->notify); - - raw_spin_lock(&blk_mq_cpu_notify_lock); - list_add_tail(¬ifier->list, &blk_mq_cpu_notify_list); - raw_spin_unlock(&blk_mq_cpu_notify_lock); -} - -void blk_mq_unregister_cpu_notifier(struct blk_mq_cpu_notifier *notifier) -{ 
- raw_spin_lock(&blk_mq_cpu_notify_lock); - list_del(¬ifier->list); - raw_spin_unlock(&blk_mq_cpu_notify_lock); -} - -void blk_mq_init_cpu_notifier(struct blk_mq_cpu_notifier *notifier, - int (*fn)(void *, unsigned long, unsigned int), - void *data) -{ - notifier->notify = fn; - notifier->data = data; -} - -void __init blk_mq_cpu_init(void) -{ - hotcpu_notifier(blk_mq_main_cpu_notify, 0); -} --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1575,11 +1575,13 @@ static struct blk_mq_tags *blk_mq_init_r * software queue to the hw queue dispatch list, and ensure that it * gets run. */ -static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu) +static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node) { + struct blk_mq_hw_ctx *hctx; struct blk_mq_ctx *ctx; LIST_HEAD(tmp); + hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead); ctx = __blk_mq_get_ctx(hctx->queue, cpu); spin_lock(&ctx->lock); @@ -1590,30 +1592,20 @@ static int blk_mq_hctx_cpu_offline(struc spin_unlock(&ctx->lock); if (list_empty(&tmp)) - return NOTIFY_OK; + return 0; spin_lock(&hctx->lock); list_splice_tail_init(&tmp, &hctx->dispatch); spin_unlock(&hctx->lock); blk_mq_run_hw_queue(hctx, true); - return NOTIFY_OK; + return 0; } -static int blk_mq_hctx_notify(void *data, unsigned long action, - unsigned int cpu) +static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx) { - struct blk_mq_hw_ctx *hctx = data; - - if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) - return blk_mq_hctx_cpu_offline(hctx, cpu); - - /* -* In case of CPU online, tags may be reallocated -* in blk_mq_map_swqueue() after mapping is updated. -*/ - - return NOTIFY_OK; + cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD, + &hctx->cpuhp_dead); } /* hctx->ctxs will be freed in queue's release handler */ @@ -1633,7 +1625,7 @@ static void blk_mq_exit_hctx(struct requ if (set->ops->exit_hctx) set->ops->exit_hctx(hctx, hctx_idx); - blk_mq_unregister_cpu_notif
[patch V2 2/2] blk/mq: Convert to hotplug state machine
From: Sebastian Andrzej Siewior Install the callbacks via the state machine so we can phase out the cpu hotplug notifiers mess. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Cc: Jens Axboe Cc: Peter Zijlstra Cc: linux-block@vger.kernel.org Cc: r...@linutronix.de Cc: Christoph Hellwing Link: http://lkml.kernel.org/r/20160919212601.180033...@linutronix.de Signed-off-by: Thomas Gleixner --- block/blk-mq.c | 87 - 1 file changed, 43 insertions(+), 44 deletions(-) --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2090,50 +2090,18 @@ static void blk_mq_queue_reinit(struct r blk_mq_sysfs_register(q); } -static int blk_mq_queue_reinit_notify(struct notifier_block *nb, - unsigned long action, void *hcpu) +/* + * New online cpumask which is going to be set in this hotplug event. + * Declare this cpumasks as global as cpu-hotplug operation is invoked + * one-by-one and dynamically allocating this could result in a failure. + */ +static struct cpumask cpuhp_online_new; + +static void blk_mq_queue_reinit_work(void) { struct request_queue *q; - int cpu = (unsigned long)hcpu; - /* -* New online cpumask which is going to be set in this hotplug event. -* Declare this cpumasks as global as cpu-hotplug operation is invoked -* one-by-one and dynamically allocating this could result in a failure. -*/ - static struct cpumask online_new; - - /* -* Before hotadded cpu starts handling requests, new mappings must -* be established. Otherwise, these requests in hw queue might -* never be dispatched. -* -* For example, there is a single hw queue (hctx) and two CPU queues -* (ctx0 for CPU0, and ctx1 for CPU1). -* -* Now CPU1 is just onlined and a request is inserted into -* ctx1->rq_list and set bit0 in pending bitmap as ctx1->index_hw is -* still zero. -* -* And then while running hw queue, flush_busy_ctxs() finds bit0 is -* set in pending bitmap and tries to retrieve requests in -* hctx->ctxs[0]->rq_list. 
But htx->ctxs[0] is a pointer to ctx0, -* so the request in ctx1->rq_list is ignored. -*/ - switch (action & ~CPU_TASKS_FROZEN) { - case CPU_DEAD: - case CPU_UP_CANCELED: - cpumask_copy(&online_new, cpu_online_mask); - break; - case CPU_UP_PREPARE: - cpumask_copy(&online_new, cpu_online_mask); - cpumask_set_cpu(cpu, &online_new); - break; - default: - return NOTIFY_OK; - } mutex_lock(&all_q_mutex); - /* * We need to freeze and reinit all existing queues. Freezing * involves synchronous wait for an RCU grace period and doing it @@ -2154,13 +2122,43 @@ static int blk_mq_queue_reinit_notify(st } list_for_each_entry(q, &all_q_list, all_q_node) - blk_mq_queue_reinit(q, &online_new); + blk_mq_queue_reinit(q, &cpuhp_online_new); list_for_each_entry(q, &all_q_list, all_q_node) blk_mq_unfreeze_queue(q); mutex_unlock(&all_q_mutex); - return NOTIFY_OK; +} + +static int blk_mq_queue_reinit_dead(unsigned int cpu) +{ + cpumask_clear_cpu(cpu, &cpuhp_online_new); + blk_mq_queue_reinit_work(); + return 0; +} + +/* + * Before hotadded cpu starts handling requests, new mappings must be + * established. Otherwise, these requests in hw queue might never be + * dispatched. + * + * For example, there is a single hw queue (hctx) and two CPU queues (ctx0 + * for CPU0, and ctx1 for CPU1). + * + * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list + * and set bit0 in pending bitmap as ctx1->index_hw is still zero. + * + * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in + * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list. + * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list + * is ignored. 
+ */ +static int blk_mq_queue_reinit_prepare(unsigned int cpu) +{ + cpumask_copy(&cpuhp_online_new, cpu_online_mask); + cpumask_set_cpu(cpu, &cpuhp_online_new); + blk_mq_queue_reinit_work(); + return 0; } static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set) @@ -2382,8 +2380,9 @@ static int __init blk_mq_init(void) cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL, blk_mq_hctx_notify_dead); - hotcpu_notifier(blk_mq_queue_reinit_notify, 0); - + cpuhp_setup_state_nocalls(CPUHP_BLK_MQ_PREPARE, "block/mq:prepare", + blk_mq_queue_reinit_prepare, +
[patch V2 0/2] block/mq: Convert to new hotplug state machine
The following series converts block/mq to the new hotplug state machine. The series is against block.git/for-next and depends on git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/for-block This branch contains the necessary infrastructure for multi-instance callbacks which allows us to remove blk-mq-cpu.c completely. It can be pulled into the block tree. Changes vs. V1: Use the multi instance callbacks and remove the private notifier handling in the block code. Thanks, tglx --- a/block/blk-mq-cpu.c | 67 - b/include/linux/blk-mq.h |8 --- block/Makefile |2 block/blk-mq.c | 123 +-- block/blk-mq.h |7 -- 5 files changed, 58 insertions(+), 149 deletions(-)
Re: [patch 2/3] blk/mq/cpu-notif: Convert to hotplug state machine
On Tue, 20 Sep 2016, Christoph Hellwig wrote: > On Mon, Sep 19, 2016 at 09:28:20PM -, Thomas Gleixner wrote: > > From: Sebastian Andrzej Siewior > > > > Install the callbacks via the state machine so we can phase out the cpu > > hotplug notifiers.. > > Didn't Sebastian come up with a version of the hotplug state callbacks > that can be per-object? This seems to be a perfect candidate for that. Indeed. I wrote that myself and forgot about it already. :( So yes, we can use that and get rid of blk-mq-cpu.c completely. Thanks, tglx
Re: [patch 2/3] blk/mq/cpu-notif: Convert to hotplug state machine
On Mon, Sep 19, 2016 at 09:28:20PM -, Thomas Gleixner wrote: > From: Sebastian Andrzej Siewior > > Install the callbacks via the state machine so we can phase out the cpu > hotplug notifiers.. Didn't Sebastian come up with a version of the hotplug state callbacks that can be per-object? This seems to be a perfect candidate for that.
[patch 3/3] blk/mq: Convert to hotplug state machine
From: Sebastian Andrzej Siewior Install the callbacks via the state machine so we can phase out the cpu hotplug notifiers mess. Signed-off-by: Sebastian Andrzej Siewior Cc: Peter Zijlstra Cc: Jens Axboe Cc: r...@linutronix.de Signed-off-by: Thomas Gleixner --- block/blk-mq.c | 87 - 1 file changed, 43 insertions(+), 44 deletions(-) --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2090,50 +2090,18 @@ static void blk_mq_queue_reinit(struct r blk_mq_sysfs_register(q); } -static int blk_mq_queue_reinit_notify(struct notifier_block *nb, - unsigned long action, void *hcpu) +/* + * New online cpumask which is going to be set in this hotplug event. + * Declare this cpumasks as global as cpu-hotplug operation is invoked + * one-by-one and dynamically allocating this could result in a failure. + */ +static struct cpumask cpuhp_online_new; + +static void blk_mq_queue_reinit_work(void) { struct request_queue *q; - int cpu = (unsigned long)hcpu; - /* -* New online cpumask which is going to be set in this hotplug event. -* Declare this cpumasks as global as cpu-hotplug operation is invoked -* one-by-one and dynamically allocating this could result in a failure. -*/ - static struct cpumask online_new; - - /* -* Before hotadded cpu starts handling requests, new mappings must -* be established. Otherwise, these requests in hw queue might -* never be dispatched. -* -* For example, there is a single hw queue (hctx) and two CPU queues -* (ctx0 for CPU0, and ctx1 for CPU1). -* -* Now CPU1 is just onlined and a request is inserted into -* ctx1->rq_list and set bit0 in pending bitmap as ctx1->index_hw is -* still zero. -* -* And then while running hw queue, flush_busy_ctxs() finds bit0 is -* set in pending bitmap and tries to retrieve requests in -* hctx->ctxs[0]->rq_list. But htx->ctxs[0] is a pointer to ctx0, -* so the request in ctx1->rq_list is ignored. 
-*/ - switch (action & ~CPU_TASKS_FROZEN) { - case CPU_DEAD: - case CPU_UP_CANCELED: - cpumask_copy(&online_new, cpu_online_mask); - break; - case CPU_UP_PREPARE: - cpumask_copy(&online_new, cpu_online_mask); - cpumask_set_cpu(cpu, &online_new); - break; - default: - return NOTIFY_OK; - } mutex_lock(&all_q_mutex); - /* * We need to freeze and reinit all existing queues. Freezing * involves synchronous wait for an RCU grace period and doing it @@ -2154,13 +2122,43 @@ static int blk_mq_queue_reinit_notify(st } list_for_each_entry(q, &all_q_list, all_q_node) - blk_mq_queue_reinit(q, &online_new); + blk_mq_queue_reinit(q, &cpuhp_online_new); list_for_each_entry(q, &all_q_list, all_q_node) blk_mq_unfreeze_queue(q); mutex_unlock(&all_q_mutex); - return NOTIFY_OK; +} + +static int blk_mq_queue_reinit_dead(unsigned int cpu) +{ + cpumask_clear_cpu(cpu, &cpuhp_online_new); + blk_mq_queue_reinit_work(); + return 0; +} + +/* + * Before hotadded cpu starts handling requests, new mappings must be + * established. Otherwise, these requests in hw queue might never be + * dispatched. + * + * For example, there is a single hw queue (hctx) and two CPU queues (ctx0 + * for CPU0, and ctx1 for CPU1). + * + * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list + * and set bit0 in pending bitmap as ctx1->index_hw is still zero. + * + * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in + * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list. + * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list + * is ignored. 
+ */ +static int blk_mq_queue_reinit_prepare(unsigned int cpu) +{ + cpumask_copy(&cpuhp_online_new, cpu_online_mask); + cpumask_set_cpu(cpu, &cpuhp_online_new); + blk_mq_queue_reinit_work(); + return 0; } static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set) @@ -2381,8 +2379,9 @@ static int __init blk_mq_init(void) { blk_mq_cpu_init(); - hotcpu_notifier(blk_mq_queue_reinit_notify, 0); - + cpuhp_setup_state_nocalls(CPUHP_BLK_MQ_PREPARE, "block/mq:prepare", + blk_mq_queue_reinit_prepare, + blk_mq_queue_reinit_dead); return 0; } subsys_initcall(blk_mq_init); -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 1/3] blk/mq: Reserve hotplug states for block multiqueue
This patch only reserves two CPU hotplug states for block/mq so the block tree can apply the conversion patches. Signed-off-by: Sebastian Andrzej Siewior Cc: Peter Zijlstra Cc: Jens Axboe Cc: r...@linutronix.de Signed-off-by: Thomas Gleixner --- include/linux/cpuhotplug.h |2 ++ 1 file changed, 2 insertions(+) --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -14,6 +14,7 @@ enum cpuhp_state { CPUHP_PERF_SUPERH, CPUHP_X86_HPET_DEAD, CPUHP_X86_APB_DEAD, + CPUHP_BLK_MQ_DEAD, CPUHP_WORKQUEUE_PREP, CPUHP_POWER_NUMA_PREPARE, CPUHP_HRTIMERS_PREPARE, @@ -22,6 +23,7 @@ enum cpuhp_state { CPUHP_SMPCFD_PREPARE, CPUHP_RCUTREE_PREP, CPUHP_NOTIFY_PREPARE, + CPUHP_BLK_MQ_PREPARE, CPUHP_TIMERS_DEAD, CPUHP_BRINGUP_CPU, CPUHP_AP_IDLE_DEAD,
[patch 0/3] block/mq: Convert to the new hotplug state machine
The following series converts block/mq to the new hotplug state machine. Patch 1/3 reserves the states for the block layer and is already applied to git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/for-block to avoid merge conflicts. This branch can be pulled into the block layer instead of applying patch 1/3 manually. Thanks, tglx --- include/linux/blk-mq.h |2 block/blk-mq-cpu.c | 15 ++ block/blk-mq.c | 108 - block/blk-mq.h |2 include/linux/cpuhotplug.h |2 5 files changed, 59 insertions(+), 70 deletions(-)
[patch 2/3] blk/mq/cpu-notif: Convert to hotplug state machine
From: Sebastian Andrzej Siewior Install the callbacks via the state machine so we can phase out the cpu hotplug notifiers.. Signed-off-by: Sebastian Andrzej Siewior Cc: Peter Zijlstra Cc: Jens Axboe Cc: r...@linutronix.de Signed-off-by: Thomas Gleixner --- block/blk-mq-cpu.c | 15 +++ block/blk-mq.c | 21 + block/blk-mq.h |2 +- include/linux/blk-mq.h |2 +- 4 files changed, 14 insertions(+), 26 deletions(-) --- a/block/blk-mq-cpu.c +++ b/block/blk-mq-cpu.c @@ -18,18 +18,16 @@ static LIST_HEAD(blk_mq_cpu_notify_list); static DEFINE_RAW_SPINLOCK(blk_mq_cpu_notify_lock); -static int blk_mq_main_cpu_notify(struct notifier_block *self, - unsigned long action, void *hcpu) +static int blk_mq_cpu_dead(unsigned int cpu) { - unsigned int cpu = (unsigned long) hcpu; struct blk_mq_cpu_notifier *notify; - int ret = NOTIFY_OK; + int ret; raw_spin_lock(&blk_mq_cpu_notify_lock); list_for_each_entry(notify, &blk_mq_cpu_notify_list, list) { - ret = notify->notify(notify->data, action, cpu); - if (ret != NOTIFY_OK) + ret = notify->notify(notify->data, cpu); + if (ret) break; } @@ -54,7 +52,7 @@ void blk_mq_unregister_cpu_notifier(stru } void blk_mq_init_cpu_notifier(struct blk_mq_cpu_notifier *notifier, - int (*fn)(void *, unsigned long, unsigned int), + int (*fn)(void *, unsigned int), void *data) { notifier->notify = fn; @@ -63,5 +61,6 @@ void blk_mq_init_cpu_notifier(struct blk void __init blk_mq_cpu_init(void) { - hotcpu_notifier(blk_mq_main_cpu_notify, 0); + cpuhp_setup_state_nocalls(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL, + blk_mq_cpu_dead); } --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1590,30 +1590,19 @@ static int blk_mq_hctx_cpu_offline(struc spin_unlock(&ctx->lock); if (list_empty(&tmp)) - return NOTIFY_OK; + return 0; spin_lock(&hctx->lock); list_splice_tail_init(&tmp, &hctx->dispatch); spin_unlock(&hctx->lock); blk_mq_run_hw_queue(hctx, true); - return NOTIFY_OK; + return 0; } -static int blk_mq_hctx_notify(void *data, unsigned long action, - unsigned int cpu) +static 
int blk_mq_hctx_notify_dead(void *hctx, unsigned int cpu) { - struct blk_mq_hw_ctx *hctx = data; - - if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) - return blk_mq_hctx_cpu_offline(hctx, cpu); - - /* -* In case of CPU online, tags may be reallocated -* in blk_mq_map_swqueue() after mapping is updated. -*/ - - return NOTIFY_OK; + return blk_mq_hctx_cpu_offline(hctx, cpu); } /* hctx->ctxs will be freed in queue's release handler */ @@ -1681,7 +1670,7 @@ static int blk_mq_init_hctx(struct reque hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED; blk_mq_init_cpu_notifier(&hctx->cpu_notifier, - blk_mq_hctx_notify, hctx); + blk_mq_hctx_notify_dead, hctx); blk_mq_register_cpu_notifier(&hctx->cpu_notifier); hctx->tags = set->tags[hctx_idx]; --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -34,7 +34,7 @@ void blk_mq_wake_waiters(struct request_ */ struct blk_mq_cpu_notifier; void blk_mq_init_cpu_notifier(struct blk_mq_cpu_notifier *notifier, - int (*fn)(void *, unsigned long, unsigned int), + int (*fn)(void *, unsigned int), void *data); void blk_mq_register_cpu_notifier(struct blk_mq_cpu_notifier *notifier); void blk_mq_unregister_cpu_notifier(struct blk_mq_cpu_notifier *notifier); --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -10,7 +10,7 @@ struct blk_flush_queue; struct blk_mq_cpu_notifier { struct list_head list; void *data; - int (*notify)(void *data, unsigned long action, unsigned int cpu); + int (*notify)(void *data, unsigned int cpu); }; struct blk_mq_hw_ctx { -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 06/13] blk-mq: don't redistribute hardware queues on a CPU hotplug event
Currently blk-mq will totally remap hardware contexts when a CPU hotplug event happens, which causes major havoc for drivers, as they are never told about this remapping. E.g. any carefully sorted out CPU affinity will just be completely messed up. The rebuild also doesn't really help for the common case of cpu hotplug, which is soft onlining / offlining of cpus - in this case we should just leave the queue and irq mapping as is. If it actually worked it would have helped in the case of physical cpu hotplug, although for that we'd need a way to actually notify the driver. Note that drivers may already be able to accommodate such a topology change on their own, e.g. using the reset_controller sysfs file in NVMe will cause the driver to get things right for this case. With the rebuild removed we will simply retain the queue mapping for a soft offlined CPU that will work when it comes back online, and will map any newly onlined CPU to queue 0 until the driver initiates a rebuild of the queue map. Signed-off-by: Christoph Hellwig --- block/blk-mq.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 13f5a6c..b29e7b2 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2147,8 +2147,6 @@ static void blk_mq_queue_reinit(struct request_queue *q, blk_mq_sysfs_unregister(q); - blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask); - /* * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe * we should change hctx numa_node according to new topology (this -- 2.1.4
Re: [Query] increased latency observed in cpu hotplug path
On 8/16/2016 11:23 AM, Khan, Imran wrote: > On 8/5/2016 12:49 PM, Khan, Imran wrote: >> On 8/1/2016 2:58 PM, Khan, Imran wrote: >>> On 7/30/2016 7:54 AM, Akinobu Mita wrote: >>>> 2016-07-28 22:18 GMT+09:00 Khan, Imran : >>>>> >>>>> Hi, >>>>> >>>>> Recently we have observed some increased latency in CPU hotplug >>>>> event in CPU online path. For online latency we see that block >>>>> layer is executing notification handler for CPU_UP_PREPARE event >>>>> and this in turn waits for RCU grace period resulting (sometimes) >>>>> in an execution time of 15-20 ms for this notification handler. >>>>> This change was not there in 3.18 kernel but is present in 4.4 >>>>> kernel and was introduced by following commit: >>>>> >>>>> >>>>> commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba >>>>> Author: Akinobu Mita >>>>> Date: Sun Sep 27 02:09:23 2015 +0900 >>>>> >>>>> blk-mq: avoid inserting requests before establishing new mapping >>>> >>>> ... >>>> >>>>> Upon reverting this commit I could see an improvement of 15-20 ms in >>>>> online latency. So I am looking for some help in analyzing the effects >>>>> of reverting this or should some other approach to reduce the online >>>>> latency must be taken. >>>> >>>> Can you observe the difference in online latency by removing >>>> get_online_cpus() and put_online_cpus() pair in >>>> blk_mq_init_allocated_queue() >>>> instead of full reverting the commit? >>>> >>> Hi Akinobu, >>> I tried your suggestion but could not achieve any improvement. Actually the >>> snippet that is causing the change in latency is the following one : >>> >>> list_for_each_entry(q, &all_q_list, all_q_node) { >>> blk_mq_freeze_queue_wait(q); >>> >>> /* >>> * timeout handler can't touch hw queue during the >>> * reinitialization >>> */ >>> del_timer_sync(&q->timeout); >>> } >>> >>> I understand that this is getting executed now for CPU_UP_PREPARE as well >>> resulting in >>> increased latency in the cpu online path. 
I am trying to reduce this >>> latency while keeping the >>> purpose of this commit intact. I would welcome further suggestions/feedback >>> in this regard. >>> >> Hi Akinobu, >> >> I am not able to reduce the cpu online latency with this patch, could you >> please let me know what >> functionality will be broken, if we avoid this patch in our kernel. Also if >> you have some other >> suggestions towards improving this patch please let me know. >> > After moving the remapping of queues to block layer's kworker I see that > online latency has improved > while offline latency remains the same. As the freezing of queues happens in > the context of block layer's > worker, I think it would be better to do the remapping in the same context > and then go ahead with freezing. > In this regard I have made following change: > > commit b2131b86eeef4c5b1f8adaf7a53606301aa6b624 > Author: Imran Khan > Date: Fri Aug 12 19:59:47 2016 +0530 > > blk-mq: Move block queue remapping from cpu hotplug path > > During a cpu hotplug, the hardware and software contexts mappings > need to be updated in order to take into account requests > submitted for the hotadded CPU. But if this mapping is done > in hotplug notifier, it deteriorates the hotplug latency. > So move the block queue remapping to block layer worker which > results in significant improvements in hotplug latency. > > Change-Id: I01ac83178ce95c3a4e3b7b1b286eda65ff34e8c4 > Signed-off-by: Imran Khan > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 6d6f8fe..06fcf89 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -22,7 +22,11 @@ > #include > #include > #include > - > +#include > +#include > +#include > +#include > +#include > #include > > #include > @@ -32,10 +36,18 @@ > > static DEFINE_MUTEX(all_q_mutex); > static LIST_HEAD(all_q_list); > - > static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx); > > /* > + * New online cpumask which is going to be set in this hotplug event. > + * Declare t
Re: [Query] increased latency observed in cpu hotplug path
On 8/5/2016 12:49 PM, Khan, Imran wrote: > On 8/1/2016 2:58 PM, Khan, Imran wrote: >> On 7/30/2016 7:54 AM, Akinobu Mita wrote: >>> 2016-07-28 22:18 GMT+09:00 Khan, Imran : >>>> >>>> Hi, >>>> >>>> Recently we have observed some increased latency in CPU hotplug >>>> event in CPU online path. For online latency we see that block >>>> layer is executing notification handler for CPU_UP_PREPARE event >>>> and this in turn waits for RCU grace period resulting (sometimes) >>>> in an execution time of 15-20 ms for this notification handler. >>>> This change was not there in 3.18 kernel but is present in 4.4 >>>> kernel and was introduced by following commit: >>>> >>>> >>>> commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba >>>> Author: Akinobu Mita >>>> Date: Sun Sep 27 02:09:23 2015 +0900 >>>> >>>> blk-mq: avoid inserting requests before establishing new mapping >>> >>> ... >>> >>>> Upon reverting this commit I could see an improvement of 15-20 ms in >>>> online latency. So I am looking for some help in analyzing the effects >>>> of reverting this or should some other approach to reduce the online >>>> latency must be taken. >>> >>> Can you observe the difference in online latency by removing >>> get_online_cpus() and put_online_cpus() pair in >>> blk_mq_init_allocated_queue() >>> instead of full reverting the commit? >>> >> Hi Akinobu, >> I tried your suggestion but could not achieve any improvement. Actually the >> snippet that is causing the change in latency is the following one : >> >> list_for_each_entry(q, &all_q_list, all_q_node) { >> blk_mq_freeze_queue_wait(q); >> >> /* >> * timeout handler can't touch hw queue during the >> * reinitialization >> */ >> del_timer_sync(&q->timeout); >> } >> >> I understand that this is getting executed now for CPU_UP_PREPARE as well >> resulting in >> increased latency in the cpu online path. I am trying to reduce this latency >> while keeping the >> purpose of this commit intact. 
>> I would welcome further suggestions/feedback in this regard.
>>
> Hi Akinobu,
>
> I am not able to reduce the cpu online latency with this patch, could you
> please let me know what functionality will be broken, if we avoid this
> patch in our kernel. Also if you have some other suggestions towards
> improving this patch please let me know.
>
After moving the remapping of queues to block layer's kworker I see that
online latency has improved while offline latency remains the same. As the
freezing of queues happens in the context of block layer's worker, I think
it would be better to do the remapping in the same context and then go
ahead with freezing. In this regard I have made following change:

commit b2131b86eeef4c5b1f8adaf7a53606301aa6b624
Author: Imran Khan
Date:   Fri Aug 12 19:59:47 2016 +0530

    blk-mq: Move block queue remapping from cpu hotplug path

    During a cpu hotplug, the hardware and software contexts mappings
    need to be updated in order to take into account requests
    submitted for the hotadded CPU. But if this mapping is done
    in hotplug notifier, it deteriorates the hotplug latency.
    So move the block queue remapping to block layer worker which
    results in significant improvements in hotplug latency.

    Change-Id: I01ac83178ce95c3a4e3b7b1b286eda65ff34e8c4
    Signed-off-by: Imran Khan

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6d6f8fe..06fcf89 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -22,7 +22,11 @@
 #include
 #include
 #include
-
+#include
+#include
+#include
+#include
+#include
 #include

 #include
@@ -32,10 +36,18 @@

 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
-
 static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx);

 /*
+ * New online cpumask which is going to be set in this hotplug event.
+ * Declare this cpumasks as global as cpu-hotplug operation is invoked
+ * one-by-one and dynamically allocating this could result in a failure.
+ */
+static struct cpumask online_new;
+
+static struct work_struct blk_mq_remap_work;
+
+/*
  * Check if any of the ctx's have pending work in this hardware queue
  */
 static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
@@ -2125,14 +2137,7 @@ static void blk_mq_queue_reinit(struct request_queue *q,

 static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
Re: [Query] increased latency observed in cpu hotplug path
On 8/1/2016 2:58 PM, Khan, Imran wrote: > On 7/30/2016 7:54 AM, Akinobu Mita wrote: >> 2016-07-28 22:18 GMT+09:00 Khan, Imran : >>> >>> Hi, >>> >>> Recently we have observed some increased latency in CPU hotplug >>> event in CPU online path. For online latency we see that block >>> layer is executing notification handler for CPU_UP_PREPARE event >>> and this in turn waits for RCU grace period resulting (sometimes) >>> in an execution time of 15-20 ms for this notification handler. >>> This change was not there in 3.18 kernel but is present in 4.4 >>> kernel and was introduced by following commit: >>> >>> >>> commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba >>> Author: Akinobu Mita >>> Date: Sun Sep 27 02:09:23 2015 +0900 >>> >>> blk-mq: avoid inserting requests before establishing new mapping >> >> ... >> >>> Upon reverting this commit I could see an improvement of 15-20 ms in >>> online latency. So I am looking for some help in analyzing the effects >>> of reverting this or should some other approach to reduce the online >>> latency must be taken. >> >> Can you observe the difference in online latency by removing >> get_online_cpus() and put_online_cpus() pair in blk_mq_init_allocated_queue() >> instead of full reverting the commit? >> > Hi Akinobu, > I tried your suggestion but could not achieve any improvement. Actually the > snippet that is causing the change in latency is the following one : > > list_for_each_entry(q, &all_q_list, all_q_node) { > blk_mq_freeze_queue_wait(q); > > /* > * timeout handler can't touch hw queue during the > * reinitialization > */ > del_timer_sync(&q->timeout); > } > > I understand that this is getting executed now for CPU_UP_PREPARE as well > resulting in > increased latency in the cpu online path. I am trying to reduce this latency > while keeping the > purpose of this commit intact. I would welcome further suggestions/feedback > in this regard. 
>
Hi Akinobu,

I am not able to reduce the cpu online latency with this patch, could you
please let me know what functionality will be broken, if we avoid this patch
in our kernel. Also if you have some other suggestions towards improving
this patch please let me know.

--
Imran Khan
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of the Code Aurora Forum, hosted by The Linux Foundation
Re: [Query] increased latency observed in cpu hotplug path
On 7/30/2016 7:54 AM, Akinobu Mita wrote:
> 2016-07-28 22:18 GMT+09:00 Khan, Imran :
>>
>> Hi,
>>
>> Recently we have observed some increased latency in CPU hotplug
>> event in CPU online path. For online latency we see that block
>> layer is executing notification handler for CPU_UP_PREPARE event
>> and this in turn waits for RCU grace period resulting (sometimes)
>> in an execution time of 15-20 ms for this notification handler.
>> This change was not there in 3.18 kernel but is present in 4.4
>> kernel and was introduced by following commit:
>>
>> commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba
>> Author: Akinobu Mita
>> Date: Sun Sep 27 02:09:23 2015 +0900
>>
>>     blk-mq: avoid inserting requests before establishing new mapping
>
> ...
>
>> Upon reverting this commit I could see an improvement of 15-20 ms in
>> online latency. So I am looking for some help in analyzing the effects
>> of reverting this or should some other approach to reduce the online
>> latency must be taken.
>
> Can you observe the difference in online latency by removing
> get_online_cpus() and put_online_cpus() pair in
> blk_mq_init_allocated_queue() instead of full reverting the commit?
>
Hi Akinobu,
I tried your suggestion but could not achieve any improvement. Actually the
snippet that is causing the change in latency is the following one :

	list_for_each_entry(q, &all_q_list, all_q_node) {
		blk_mq_freeze_queue_wait(q);

		/*
		 * timeout handler can't touch hw queue during the
		 * reinitialization
		 */
		del_timer_sync(&q->timeout);
	}

I understand that this is getting executed now for CPU_UP_PREPARE as well
resulting in increased latency in the cpu online path. I am trying to reduce
this latency while keeping the purpose of this commit intact. I would
welcome further suggestions/feedback in this regard.
--
Imran Khan
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc.
is a member of the Code Aurora Forum, hosted by The Linux Foundation