Re: system hung up when offlining CPUs
On 11/01/2017 01:47 AM, Thomas Gleixner wrote:
> On Mon, 30 Oct 2017, Shivasharan Srikanteshwara wrote:
>
>> In the managed-interrupts case, interrupts which were affine to the
>> offlined CPU are not getting migrated to another available CPU. But the
>> documentation at the link below says that "all interrupts" are migrated
>> to a new CPU. So not all interrupts are getting migrated to a new CPU
>> then.
>
> Correct.
>
>> https://www.kernel.org/doc/html/v4.11/core-api/cpu_hotplug.html#the-offline-case
>> "- All interrupts targeted to this CPU are migrated to a new CPU"
>
> Well, documentation is not always up to date :)
>
>> Once the last CPU in the affinity mask is offlined and a particular IRQ
>> is shut down, is there a way currently for the device driver to get a
>> callback to complete all outstanding requests on that queue?
>
> No, and I have no idea how the other drivers deal with that.
>
> The way you can do that is to have your own hotplug callback which is
> invoked when the cpu goes down, but way before the interrupt is shut
> down, which is one of the last steps. Ideally this would be a callback
> in the generic block code which then calls out to all instances like
> it's done for the cpu dead state.
>
In principle, yes, that could be (and, in fact, might already have been)
moved to the block layer for blk-mq, as blk-mq has full control over the
individual queues and hence can ensure that queues with dead/removed CPUs
are properly handled.

Here, OTOH, we are dealing with the legacy sq implementation (or, to be
precise, a blk-mq implementation utilizing only a single queue), so any of
this handling needs to be implemented in the driver.

So what would need to be done here is to implement a hotplug callback in
the driver which removes the CPU from the list/bitmap of valid CPUs. The
driver could then validate the CPU number against this bitmap upon I/O
submission (instead of just using raw_smp_processor_id()), and could set
the queue ID to '0' if an invalid CPU was found.

With that the driver should be able to ensure that no new I/O will be
submitted which will hit the dead CPU, so with a bit of luck this might
already solve the problem.

Alternatively I could resurrect my patchset converting the driver to
blk-mq, which got vetoed the last time ...

Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
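[A minimal sketch of the scheme Hannes outlines above: a hotplug callback
maintaining a bitmap of valid CPUs, checked at I/O submission time. All
mydrv_* names are hypothetical; cpuhp_setup_state(), the cpumask helpers
and raw_smp_processor_id() are the kernel APIs assumed here.]

#include <linux/cpu.h>
#include <linux/cpuhotplug.h>
#include <linux/cpumask.h>
#include <linux/smp.h>

static struct cpumask mydrv_valid_cpus;	/* CPUs whose queues may be used */

static int mydrv_cpu_online(unsigned int cpu)
{
	cpumask_set_cpu(cpu, &mydrv_valid_cpus);
	return 0;
}

static int mydrv_cpu_offline(unsigned int cpu)
{
	/* Runs while @cpu goes down, before its interrupts are shut down:
	 * stop routing new I/O here and drain whatever is outstanding. */
	cpumask_clear_cpu(cpu, &mydrv_valid_cpus);
	return 0;
}

static int mydrv_register_hotplug(void)
{
	int ret;

	cpumask_copy(&mydrv_valid_cpus, cpu_online_mask);
	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "scsi/mydrv:online",
				mydrv_cpu_online, mydrv_cpu_offline);
	return ret < 0 ? ret : 0;
}

/* At submission time: fall back to queue 0 if the current CPU's queue
 * is no longer valid, instead of using the raw CPU number directly. */
static u32 mydrv_pick_queue(void)
{
	unsigned int cpu = raw_smp_processor_id();

	return cpumask_test_cpu(cpu, &mydrv_valid_cpus) ? cpu : 0;
}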
RE: system hung up when offlining CPUs
On Mon, 30 Oct 2017, Shivasharan Srikanteshwara wrote:
> In the managed-interrupts case, interrupts which were affine to the
> offlined CPU are not getting migrated to another available CPU. But the
> documentation at the link below says that "all interrupts" are migrated
> to a new CPU. So not all interrupts are getting migrated to a new CPU
> then.

Correct.

> https://www.kernel.org/doc/html/v4.11/core-api/cpu_hotplug.html#the-offline-case
> "- All interrupts targeted to this CPU are migrated to a new CPU"

Well, documentation is not always up to date :)

> Once the last CPU in the affinity mask is offlined and a particular IRQ
> is shut down, is there a way currently for the device driver to get a
> callback to complete all outstanding requests on that queue?

No, and I have no idea how the other drivers deal with that.

The way you can do that is to have your own hotplug callback which is
invoked when the cpu goes down, but way before the interrupt is shut down,
which is one of the last steps. Ideally this would be a callback in the
generic block code which then calls out to all instances like it's done
for the cpu dead state.

Jens, Christoph?

Thanks,

	tglx
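[As context for the "callback in the generic block code" idea: blk-mq
already hooks the CPU dead state via a multi-instance hotplug state, and a
per-queue offline callback could plausibly be registered the same way. A
rough sketch, not an actual patch; all example_* names are illustrative,
while the cpuhp_* multi-instance APIs are real.]

#include <linux/cpuhotplug.h>
#include <linux/list.h>

/* Each queue embeds an hlist_node so it can register as an instance. */
struct example_queue {
	struct hlist_node cpuhp_node;
	/* ... queue state ... */
};

static enum cpuhp_state example_state;

static int example_queue_offline(unsigned int cpu, struct hlist_node *node)
{
	struct example_queue *q =
		hlist_entry(node, struct example_queue, cpuhp_node);

	/* Invoked per registered queue while @cpu goes down, before its
	 * interrupts are shut down: quiesce and drain @q here. */
	(void)q;
	return 0;
}

static int example_hotplug_init(void)
{
	int ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
					  "block/example:offline",
					  NULL, example_queue_offline);
	if (ret < 0)
		return ret;
	example_state = ret;
	return 0;
}

/* Each queue instance then registers itself: */
static int example_add_queue(struct example_queue *q)
{
	return cpuhp_state_add_instance(example_state, &q->cpuhp_node);
}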
RE: system hung up when offlining CPUs
> -Original Message- > From: Thomas Gleixner [mailto:t...@linutronix.de] > Sent: Tuesday, October 17, 2017 1:57 AM > To: YASUAKI ISHIMATSU > Cc: Kashyap Desai; Hannes Reinecke; Marc Zyngier; Christoph Hellwig; > ax...@kernel.dk; m...@ellerman.id.au; keith.bu...@intel.com; > pet...@infradead.org; LKML; linux-s...@vger.kernel.org; Sumit Saxena; > Shivasharan Srikanteshwara > Subject: Re: system hung up when offlining CPUs > > Yasuaki, > > On Mon, 16 Oct 2017, YASUAKI ISHIMATSU wrote: > > > Hi Thomas, > > > > > Can you please apply the patch below on top of Linus tree and retest? > > > > > > Please send me the outputs I asked you to provide last time in any > > > case (success or fail). > > > > The issue still occurs even if I applied your patch to linux 4.14.0-rc4. > > Thanks for testing. > > > --- > > [ ...] INFO: task setroubleshootd:4972 blocked for more than 120 seconds. > > [ ...] Not tainted 4.14.0-rc4.thomas.with.irqdebug+ #6 > > [ ...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this > message. > > [ ...] setroubleshootd D0 4972 1 0x0080 > > [ ...] Call Trace: > > [ ...] __schedule+0x28d/0x890 > > [ ...] ? release_pages+0x16f/0x3f0 > > [ ...] schedule+0x36/0x80 > > [ ...] io_schedule+0x16/0x40 > > [ ...] wait_on_page_bit+0x107/0x150 > > [ ...] ? page_cache_tree_insert+0xb0/0xb0 [ ...] > > truncate_inode_pages_range+0x3dd/0x7d0 > > [ ...] ? schedule_hrtimeout_range_clock+0xad/0x140 > > [ ...] ? remove_wait_queue+0x59/0x60 > > [ ...] ? down_write+0x12/0x40 > > [ ...] ? unmap_mapping_range+0x75/0x130 [ ...] > > truncate_pagecache+0x47/0x60 [ ...] truncate_setsize+0x32/0x40 [ ...] > > xfs_setattr_size+0x100/0x300 [xfs] [ ...] > > xfs_vn_setattr_size+0x40/0x90 [xfs] [ ...] xfs_vn_setattr+0x87/0xa0 > > [xfs] [ ...] notify_change+0x266/0x440 [ ...] do_truncate+0x75/0xc0 > > [ ...] path_openat+0xaba/0x13b0 [ ...] ? > > mem_cgroup_commit_charge+0x31/0x130 > > [ ...] do_filp_open+0x91/0x100 > > [ ...] ? __alloc_fd+0x46/0x170 > > [ ...] do_sys_open+0x124/0x210 > > [ ...] SyS_open+0x1e/0x20 > > [ ...] do_syscall_64+0x67/0x1b0 > > [ ...] entry_SYSCALL64_slow_path+0x25/0x25 > > This is definitely a driver issue. The driver requests an affinity managed > interrupt. Affinity managed interrupts are different from non managed > interrupts in several ways: > > Non-Managed interrupts: > > 1) At setup time the default interrupt affinity is assigned to each > interrupt. The effective affinity is usually a subset of the online > CPUs. > > 2) User space can modify the affinity of the interrupt > > 3) If a CPU in the affinity mask goes offline and there are still online > CPUs in the affinity mask then the effective affinity is moved to a > subset of the online CPUs in the affinity mask. > > If the last CPU in the affinity mask of an interrupt goes offline then > the hotplug code breaks the affinity and makes it affine to the online > CPUs. The effective affinity is a subset of the new affinity setting, > > Managed interrupts: > > 1) At setup time the interrupts of a multiqueue device are evenly spread > over the possible CPUs. If all CPUs in the affinity mask of a given > interrupt are offline at request_irq() time, the interrupt stays shut > down. If the first CPU in the affinity mask comes online later the > interrupt is started up. > > 2) User space cannot modify the affinity of the interrupt > > 3) If a CPU in the affinity mask goes offline and there are still online > CPUs in the affinity mask then the effective affinity is moved a subset > of the online CPUs in the affinity mask. I.e. 
the same as with Non-Managed interrupts.
>
>    If the last CPU in the affinity mask of a managed interrupt goes
>    offline then the interrupt is shut down. If the first CPU in the
>    affinity mask becomes online again then the interrupt is started up
>    again.

Hi Thomas,

Thanks for the detailed explanation about the behavior of managed
interrupts. This helped me to understand the issue better.
This is the first time I am looking into the CPU hotplug system, so my
input is very preliminary. Please bear with my understanding and correct
me where required.
This issue is reproducible on our local setup as well, with managed
interrupts. I have a few queries on the requirements for the device driver
that you have mentioned.
In the managed-interrupts case, interrupts which were affine to the
offlined CPU are not getting migrated to another available CPU. But the
documentation at the link below says that "all
Re: system hung up when offlining CPUs
Yasuaki,

On Mon, 16 Oct 2017, YASUAKI ISHIMATSU wrote:
> Hi Thomas,
>
>> Can you please apply the patch below on top of Linus tree and retest?
>>
>> Please send me the outputs I asked you to provide last time in any
>> case (success or fail).
>
> The issue still occurs even if I applied your patch to linux 4.14.0-rc4.

Thanks for testing.

> ---
> [ ...] INFO: task setroubleshootd:4972 blocked for more than 120 seconds.
> [ ...] Not tainted 4.14.0-rc4.thomas.with.irqdebug+ #6
> [ ...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ ...] setroubleshootd D0 4972 1 0x0080
> [ ...] Call Trace:
> [ ...] __schedule+0x28d/0x890
> [ ...] ? release_pages+0x16f/0x3f0
> [ ...] schedule+0x36/0x80
> [ ...] io_schedule+0x16/0x40
> [ ...] wait_on_page_bit+0x107/0x150
> [ ...] ? page_cache_tree_insert+0xb0/0xb0
> [ ...] truncate_inode_pages_range+0x3dd/0x7d0
> [ ...] ? schedule_hrtimeout_range_clock+0xad/0x140
> [ ...] ? remove_wait_queue+0x59/0x60
> [ ...] ? down_write+0x12/0x40
> [ ...] ? unmap_mapping_range+0x75/0x130
> [ ...] truncate_pagecache+0x47/0x60
> [ ...] truncate_setsize+0x32/0x40
> [ ...] xfs_setattr_size+0x100/0x300 [xfs]
> [ ...] xfs_vn_setattr_size+0x40/0x90 [xfs]
> [ ...] xfs_vn_setattr+0x87/0xa0 [xfs]
> [ ...] notify_change+0x266/0x440
> [ ...] do_truncate+0x75/0xc0
> [ ...] path_openat+0xaba/0x13b0
> [ ...] ? mem_cgroup_commit_charge+0x31/0x130
> [ ...] do_filp_open+0x91/0x100
> [ ...] ? __alloc_fd+0x46/0x170
> [ ...] do_sys_open+0x124/0x210
> [ ...] SyS_open+0x1e/0x20
> [ ...] do_syscall_64+0x67/0x1b0
> [ ...] entry_SYSCALL64_slow_path+0x25/0x25

This is definitely a driver issue. The driver requests an affinity managed
interrupt. Affinity managed interrupts are different from non-managed
interrupts in several ways:

 Non-Managed interrupts:

 1) At setup time the default interrupt affinity is assigned to each
    interrupt. The effective affinity is usually a subset of the online
    CPUs.

 2) User space can modify the affinity of the interrupt.

 3) If a CPU in the affinity mask goes offline and there are still online
    CPUs in the affinity mask then the effective affinity is moved to a
    subset of the online CPUs in the affinity mask.

    If the last CPU in the affinity mask of an interrupt goes offline then
    the hotplug code breaks the affinity and makes it affine to the online
    CPUs. The effective affinity is a subset of the new affinity setting.

 Managed interrupts:

 1) At setup time the interrupts of a multiqueue device are evenly spread
    over the possible CPUs. If all CPUs in the affinity mask of a given
    interrupt are offline at request_irq() time, the interrupt stays shut
    down. If the first CPU in the affinity mask comes online later the
    interrupt is started up.

 2) User space cannot modify the affinity of the interrupt.

 3) If a CPU in the affinity mask goes offline and there are still online
    CPUs in the affinity mask then the effective affinity is moved to a
    subset of the online CPUs in the affinity mask, i.e. the same as with
    Non-Managed interrupts.

    If the last CPU in the affinity mask of a managed interrupt goes
    offline then the interrupt is shut down. If the first CPU in the
    affinity mask becomes online again then the interrupt is started up
    again.

So this has consequences:

 1) The device driver has to make sure that no requests are targeted at a
    queue whose interrupt is affine to offline CPUs and therefore shut
    down. If the driver ignores that then this queue will not deliver an
    interrupt simply because that interrupt is shut down.

 2) When the last CPU in the affinity mask of a queue interrupt goes
    offline the device driver has to make sure that all outstanding
    requests in the queue which have not yet delivered their interrupt
    are completed. This is required because when the CPU is finally
    offline the interrupt is shut down and won't deliver any more
    interrupts.

    If that does not happen then the not yet completed requests will try
    to send the completion interrupt, which obviously is not delivered
    because it is shut down.

It's hard to tell from the debug information which of the constraints (#1
or #2 or both) has been violated by the driver (or the device
hardware/firmware), but the effect that the task which submitted the I/O
operation hangs after an offline operation points clearly in that
direction.

The irq core code is doing what is expected and I have no clue about that
megasas driver/hardware, so I have to punt and redirect you to the SCSI
and megasas people.

Thanks,

	tglx
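[For reference, whether a device's interrupts fall in the "Managed"
category above is decided when the driver allocates its vectors; a minimal
illustration, where the example_* name is hypothetical and
pci_alloc_irq_vectors() is the real API.]

#include <linux/pci.h>

/*
 * Passing PCI_IRQ_AFFINITY asks the core to spread the vectors over the
 * possible CPUs and manage them across hotplug as described above;
 * without the flag the vectors behave as the "Non-Managed" case.
 */
static int example_setup_queue_irqs(struct pci_dev *pdev, int nr_queues)
{
	return pci_alloc_irq_vectors(pdev, 1, nr_queues,
				     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
}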
Re: system hung up when offlining CPUs
Hi Thomas,

> Can you please apply the patch below on top of Linus tree and retest?
>
> Please send me the outputs I asked you to provide last time in any case
> (success or fail).

The issue still occurs even if I applied your patch to linux 4.14.0-rc4.

---
[ ...] INFO: task setroubleshootd:4972 blocked for more than 120 seconds.
[ ...] Not tainted 4.14.0-rc4.thomas.with.irqdebug+ #6
[ ...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ ...] setroubleshootd D0 4972 1 0x0080
[ ...] Call Trace:
[ ...] __schedule+0x28d/0x890
[ ...] ? release_pages+0x16f/0x3f0
[ ...] schedule+0x36/0x80
[ ...] io_schedule+0x16/0x40
[ ...] wait_on_page_bit+0x107/0x150
[ ...] ? page_cache_tree_insert+0xb0/0xb0
[ ...] truncate_inode_pages_range+0x3dd/0x7d0
[ ...] ? schedule_hrtimeout_range_clock+0xad/0x140
[ ...] ? remove_wait_queue+0x59/0x60
[ ...] ? down_write+0x12/0x40
[ ...] ? unmap_mapping_range+0x75/0x130
[ ...] truncate_pagecache+0x47/0x60
[ ...] truncate_setsize+0x32/0x40
[ ...] xfs_setattr_size+0x100/0x300 [xfs]
[ ...] xfs_vn_setattr_size+0x40/0x90 [xfs]
[ ...] xfs_vn_setattr+0x87/0xa0 [xfs]
[ ...] notify_change+0x266/0x440
[ ...] do_truncate+0x75/0xc0
[ ...] path_openat+0xaba/0x13b0
[ ...] ? mem_cgroup_commit_charge+0x31/0x130
[ ...] do_filp_open+0x91/0x100
[ ...] ? __alloc_fd+0x46/0x170
[ ...] do_sys_open+0x124/0x210
[ ...] SyS_open+0x1e/0x20
[ ...] do_syscall_64+0x67/0x1b0
[ ...] entry_SYSCALL64_slow_path+0x25/0x25
[ ...] RIP: 0033:0x7f275e2365bd
[ ...] RSP: 002b:7ffe29337da0 EFLAGS: 0293 ORIG_RAX: 0002
[ ...] RAX: ffda RBX: 040aea00 RCX: 7f275e2365bd
[ ...] RDX: 01b6 RSI: 0241 RDI: 040ae840
[ ...] RBP: 7ffe29337e00 R08: 040aea06 R09: 0240
[ ...] R10: 0024 R11: 0293 R12: 040eb660
[ ...] R13: 0004 R14: 040ae840 R15: 0186a0a0
[ ...] sd 0:2:0:0: [sda] tag#0 task abort called for scmd(9b4bf2306160)
[ ...] sd 0:2:0:0: [sda] tag#0 CDB: Write(10) 2a 00 0b 3a 82 a0 00 00 20 00
[ ...] sd 0:2:0:0: task abort: FAILED scmd(9b4bf2306160)
[ ...] sd 0:2:0:0: target reset called for scmd(9b4bf2306160)
[ ...] sd 0:2:0:0: [sda] tag#0 megasas: target reset FAILED!!
[ ...] sd 0:2:0:0: [sda] tag#0 Controller reset is requested due to IO timeout
[ ...] SCSI command pointer: (9b4bf2306160) SCSI host state: 5 SCSI
---

I could not prepare the same environment I reported. So I reproduced the
issue on the following megasas environment.

---
IRQ  affinity_list  IRQ_TYPE
 34  0-1            IR-PCI-MSI 1048576-edge megasas
 35  2-3            IR-PCI-MSI 1048577-edge megasas
 36  4-5            IR-PCI-MSI 1048578-edge megasas
 37  6-7            IR-PCI-MSI 1048579-edge megasas
 38  8-9            IR-PCI-MSI 1048580-edge megasas
 39  10-11          IR-PCI-MSI 1048581-edge megasas
 40  12-13          IR-PCI-MSI 1048582-edge megasas
 41  14-15          IR-PCI-MSI 1048583-edge megasas
 42  16-17          IR-PCI-MSI 1048584-edge megasas
 43  18-19          IR-PCI-MSI 1048585-edge megasas
 44  20-21          IR-PCI-MSI 1048586-edge megasas
 45  22-23          IR-PCI-MSI 1048587-edge megasas
 46  24-25          IR-PCI-MSI 1048588-edge megasas
 47  26-27          IR-PCI-MSI 1048589-edge megasas
 48  28-29          IR-PCI-MSI 1048590-edge megasas
 49  30-31          IR-PCI-MSI 1048591-edge megasas
 50  32-33          IR-PCI-MSI 1048592-edge megasas
 51  34-35          IR-PCI-MSI 1048593-edge megasas
 52  36-37          IR-PCI-MSI 1048594-edge megasas
 53  38-39          IR-PCI-MSI 1048595-edge megasas
 54  40-41          IR-PCI-MSI 1048596-edge megasas
 55  42-43          IR-PCI-MSI 1048597-edge megasas
 56  44-45          IR-PCI-MSI 1048598-edge megasas
 57  46-47          IR-PCI-MSI 1048599-edge megasas
 58  48-49          IR-PCI-MSI 1048600-edge megasas
 59  50-51          IR-PCI-MSI 1048601-edge megasas
 60  52-53          IR-PCI-MSI 1048602-edge megasas
 61  54-55          IR-PCI-MSI 1048603-edge megasas
 62  56-57          IR-PCI-MSI 1048604-edge megasas
 63  58-59          IR-PCI-MSI 1048605-edge megasas
 64  60-61          IR-PCI-MSI 1048606-edge megasas
 65  62-63          IR-PCI-MSI 1048607-edge megasas
 66  64-65          IR-PCI-MSI 1048608-edge megasas
 67  66-67          IR-PCI-MSI 1048609-edge megasas
 68  68-69          IR-PCI-MSI 1048610-edge megasas
 69  70-71          IR-PCI-MSI 1048611-edge megasas
 70  72-73          IR-PCI-MSI 1048612-edge megasas
 71  74-75          IR-PCI-MSI 1048613-edge megasas
 72  76-77          IR-PCI-MSI 1048614-edge megasas
 73  78-79          IR-PCI-MSI 1048615-edge megasas
 74  80-81          IR-PCI-MSI 1048616-edge megasas
 75  82-83          IR-PCI-MSI 1048617-edge megasas
 76  84-85          IR-PCI-MSI 1048618-edge megasas
 77  86-87          IR-PCI-MSI 1048619-edge megasas
 78  88-89          IR-PCI-MSI 1048620-edge megasas
 79  90-91          IR-PCI-MSI 1048621-edge megasas
 80  92-93          IR-PCI-MSI 1048622-edge megasas
 81  94-95          I
Re: system hung up when offlining CPUs
Hi Thomas, Sorry for the late reply. I'll apply the patches and retest in this week. Please wait a while. Thanks, Yasuaki Ishimatsu On 10/04/2017 05:04 PM, Thomas Gleixner wrote: > On Tue, 3 Oct 2017, Thomas Gleixner wrote: >> Can you please apply the debug patch below. > > I found an issue with managed interrupts when the affinity mask of an > managed interrupt spawns multiple CPUs. Explanation in the changelog > below. I'm not sure that this cures the problems you have, but at least I > could prove that it's not doing what it should do. The failure I'm seing is > fixed, but I can't test that megasas driver due to -ENOHARDWARE. > > Can you please apply the patch below on top of Linus tree and retest? > > Please send me the outputs I asked you to provide last time in any case > (success or fail). > > @block/scsi folks: Can you please run that through your tests as well? > > Thanks, > > tglx > > 8<--- > Subject: genirq/cpuhotplug: Enforce affinity setting on startup of managed > irqs > From: Thomas Gleixner > Date: Wed, 04 Oct 2017 21:07:38 +0200 > > Managed interrupts can end up in a stale state on CPU hotplug. If the > interrupt is not targeting a single CPU, i.e. the affinity mask spawns > multiple CPUs then the following can happen: > > After boot: > > dstate: 0x01601200 > IRQD_ACTIVATED > IRQD_IRQ_STARTED > IRQD_SINGLE_TARGET > IRQD_AFFINITY_SET > IRQD_AFFINITY_MANAGED > node: 0 > affinity: 24-31 > effectiv: 24 > pending: 0 > > After offlining CPU 31 - 24 > > dstate: 0x01a31000 > IRQD_IRQ_DISABLED > IRQD_IRQ_MASKED > IRQD_SINGLE_TARGET > IRQD_AFFINITY_SET > IRQD_AFFINITY_MANAGED > IRQD_MANAGED_SHUTDOWN > node: 0 > affinity: 24-31 > effectiv: 24 > pending: 0 > > Now CPU 25 gets onlined again, so it should get the effective interrupt > affinity for this interruopt, but due to the x86 interrupt affinity setter > restrictions this ends up after restarting the interrupt with: > > dstate: 0x01601300 > IRQD_ACTIVATED > IRQD_IRQ_STARTED > IRQD_SINGLE_TARGET > IRQD_AFFINITY_SET > IRQD_SETAFFINITY_PENDING > IRQD_AFFINITY_MANAGED > node: 0 > affinity: 24-31 > effectiv: 24 > pending: 24-31 > > So the interrupt is still affine to CPU 24, which was the last CPU to go > offline of that affinity set and the move to an online CPU within 24-31, > in this case 25, is pending. This mechanism is x86/ia64 specific as those > architectures cannot move interrupts from thread context and do this when > an interrupt is actually handled. So the move is set to pending. > > Whats worse is that offlining CPU 25 again results in: > > dstate: 0x01601300 > IRQD_ACTIVATED > IRQD_IRQ_STARTED > IRQD_SINGLE_TARGET > IRQD_AFFINITY_SET > IRQD_SETAFFINITY_PENDING > IRQD_AFFINITY_MANAGED > node: 0 > affinity: 24-31 > effectiv: 24 > pending: 24-31 > > This means the interrupt has not been shut down, because the outgoing CPU > is not in the effective affinity mask, but of course nothing notices that > the effective affinity mask is pointing at an offline CPU. > > In the case of restarting a managed interrupt the move restriction does not > apply, so the affinity setting can be made unconditional. This needs to be > done _before_ the interrupt is started up as otherwise the condition for > moving it from thread context would not longer be fulfilled. 
> > With that change applied onlining CPU 25 after offlining 31-24 results in: > > dstate: 0x01600200 > IRQD_ACTIVATED > IRQD_IRQ_STARTED > IRQD_SINGLE_TARGET > IRQD_AFFINITY_MANAGED > node: 0 > affinity: 24-31 > effectiv: 25 > pending: > > And after offlining CPU 25: > > dstate: 0x01a3 > IRQD_IRQ_DISABLED > IRQD_IRQ_MASKED > IRQD_SINGLE_TARGET > IRQD_AFFINITY_MANAGED > IRQD_MANAGED_SHUTDOWN > node: 0 > affinity: 24-31 > effectiv: 25 > pending: > > which is the correct and expected result. > > To complete that, add some debug code to catch this kind of situation in > the cpu offline code and warn about interrupt chips which allow affinity > setting and do not update the effective affinity mask if that feature is > enabled. > > Reported-by: YASUAKI ISHIMATSU > Signed-off-by: Thomas Gleixner > > --- > kernel/irq/chip.c |2 +- > kernel/irq/cpuhotplug.c | 28 +++- > kernel/irq/manage.c | 17 + > 3 files changed, 45 insertions(+), 2 deletions(-) > > --- a/kernel/irq/chip.c > +++ b/kernel/irq/chip.c > @@ -265,8 +265,8 @@ int irq_startup(struct irq_desc *desc, b > irq_setup_affinity(desc); > break; > case IR
Re: system hung up when offlining CPUs
On Mon, 2 Oct 2017, YASUAKI ISHIMATSU wrote:
> We are talking about megasas driver.
> So I added linux-scsi and maintainers of megasas into the thread.

Another question: Is this the in-tree megasas driver, and are you
observing this on Linus' latest tree, i.e. 4.14-rc3+?

Thanks,

	tglx
Re: system hung up when offlining CPUs
On Tue, 3 Oct 2017, Thomas Gleixner wrote:
> Can you please apply the debug patch below.

I found an issue with managed interrupts when the affinity mask of a
managed interrupt spans multiple CPUs. Explanation in the changelog
below. I'm not sure that this cures the problems you have, but at least I
could prove that it's not doing what it should do. The failure I'm seeing
is fixed, but I can't test that megasas driver due to -ENOHARDWARE.

Can you please apply the patch below on top of Linus tree and retest?

Please send me the outputs I asked you to provide last time in any case
(success or fail).

@block/scsi folks: Can you please run that through your tests as well?

Thanks,

	tglx

8<---
Subject: genirq/cpuhotplug: Enforce affinity setting on startup of managed irqs
From: Thomas Gleixner
Date: Wed, 04 Oct 2017 21:07:38 +0200

Managed interrupts can end up in a stale state on CPU hotplug. If the
interrupt is not targeting a single CPU, i.e. the affinity mask spans
multiple CPUs, then the following can happen:

After boot:

        dstate:   0x01601200
                    IRQD_ACTIVATED
                    IRQD_IRQ_STARTED
                    IRQD_SINGLE_TARGET
                    IRQD_AFFINITY_SET
                    IRQD_AFFINITY_MANAGED
        node:     0
        affinity: 24-31
        effectiv: 24
        pending:  0

After offlining CPU 31 - 24:

        dstate:   0x01a31000
                    IRQD_IRQ_DISABLED
                    IRQD_IRQ_MASKED
                    IRQD_SINGLE_TARGET
                    IRQD_AFFINITY_SET
                    IRQD_AFFINITY_MANAGED
                    IRQD_MANAGED_SHUTDOWN
        node:     0
        affinity: 24-31
        effectiv: 24
        pending:  0

Now CPU 25 gets onlined again, so it should get the effective interrupt
affinity for this interrupt, but due to the x86 interrupt affinity setter
restrictions this ends up after restarting the interrupt with:

        dstate:   0x01601300
                    IRQD_ACTIVATED
                    IRQD_IRQ_STARTED
                    IRQD_SINGLE_TARGET
                    IRQD_AFFINITY_SET
                    IRQD_SETAFFINITY_PENDING
                    IRQD_AFFINITY_MANAGED
        node:     0
        affinity: 24-31
        effectiv: 24
        pending:  24-31

So the interrupt is still affine to CPU 24, which was the last CPU to go
offline of that affinity set, and the move to an online CPU within 24-31,
in this case 25, is pending. This mechanism is x86/ia64 specific as those
architectures cannot move interrupts from thread context and do this when
an interrupt is actually handled. So the move is set to pending.

What's worse is that offlining CPU 25 again results in:

        dstate:   0x01601300
                    IRQD_ACTIVATED
                    IRQD_IRQ_STARTED
                    IRQD_SINGLE_TARGET
                    IRQD_AFFINITY_SET
                    IRQD_SETAFFINITY_PENDING
                    IRQD_AFFINITY_MANAGED
        node:     0
        affinity: 24-31
        effectiv: 24
        pending:  24-31

This means the interrupt has not been shut down, because the outgoing CPU
is not in the effective affinity mask, but of course nothing notices that
the effective affinity mask is pointing at an offline CPU.

In the case of restarting a managed interrupt the move restriction does
not apply, so the affinity setting can be made unconditional. This needs
to be done _before_ the interrupt is started up as otherwise the condition
for moving it from thread context would no longer be fulfilled.

With that change applied, onlining CPU 25 after offlining 31-24 results
in:

        dstate:   0x01600200
                    IRQD_ACTIVATED
                    IRQD_IRQ_STARTED
                    IRQD_SINGLE_TARGET
                    IRQD_AFFINITY_MANAGED
        node:     0
        affinity: 24-31
        effectiv: 25
        pending:

And after offlining CPU 25:

        dstate:   0x01a3
                    IRQD_IRQ_DISABLED
                    IRQD_IRQ_MASKED
                    IRQD_SINGLE_TARGET
                    IRQD_AFFINITY_MANAGED
                    IRQD_MANAGED_SHUTDOWN
        node:     0
        affinity: 24-31
        effectiv: 25
        pending:

which is the correct and expected result.

To complete that, add some debug code to catch this kind of situation in
the cpu offline code and warn about interrupt chips which allow affinity
setting and do not update the effective affinity mask if that feature is
enabled.

Reported-by: YASUAKI ISHIMATSU
Signed-off-by: Thomas Gleixner

---
 kernel/irq/chip.c       |    2 +-
 kernel/irq/cpuhotplug.c |   28 +++-
 kernel/irq/manage.c     |   17 +
 3 files changed, 45 insertions(+), 2 deletions(-)

--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -265,8 +265,8 @@ int irq_startup(struct irq_desc *desc, b
 		irq_setup_affinity(desc);
 		break;
 	case IRQ_STARTUP_MANAGED:
+		irq_do_set_affinity(d, aff, false);
 		ret = __irq_startup(desc);
-		irq_set_affinity_locked(d, aff, false);
 		break;
 	case IRQ_STARTUP_ABORT:
 		return 0;
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -18,8 +18,34 @@
 static inline bool irq_needs_fixup(struct irq_data *d)
 {
 	const struct cpumask *m = ir
Re: system hung up when offlining CPUs
On Mon, 2 Oct 2017, YASUAKI ISHIMATSU wrote:
> On 09/16/2017 11:02 AM, Thomas Gleixner wrote:
>> Which driver are we talking about?
>
> We are talking about megasas driver.

Can you please apply the debug patch below. After booting enable stack
traces for the tracer:

# echo 1 >/sys/kernel/debug/tracing/options/stacktrace

Then offline CPUs 24-29. After that do

# cat /sys/kernel/debug/tracing/trace >somefile

Please compress the file and upload it to some place or if you have no
place to upload it then send it to me in private mail.

Thanks,

	tglx

8<
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -171,11 +171,16 @@ void irq_set_thread_affinity(struct irq_
 int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 			bool force)
 {
+	const struct cpumask *eff = irq_data_get_effective_affinity_mask(data);
 	struct irq_desc *desc = irq_data_to_desc(data);
 	struct irq_chip *chip = irq_data_get_irq_chip(data);
 	int ret;
 
 	ret = chip->irq_set_affinity(data, mask, force);
+
+	trace_printk("irq: %u ret %d mask: %*pbl eff: %*pbl\n", data->irq, ret,
+		     cpumask_pr_args(mask), cpumask_pr_args(eff));
+
 	switch (ret) {
 	case IRQ_SET_MASK_OK:
 	case IRQ_SET_MASK_OK_DONE:
Re: system hung up when offlining CPUs
On 09/16/2017 11:02 AM, Thomas Gleixner wrote: > On Sat, 16 Sep 2017, Thomas Gleixner wrote: >> On Thu, 14 Sep 2017, YASUAKI ISHIMATSU wrote: >>> Here are one irq's info of megasas: >>> >>> - Before offline CPU >>> /proc/irq/70/smp_affinity_list >>> 24-29 >>> >>> /proc/irq/70/effective_affinity >>> ,,,,,,,,,,,,,,,3f00 >>> >>> /sys/kernel/debug/irq/irqs/70 >>> handler: handle_edge_irq >>> status: 0x4000 >>> istate: 0x >>> ddepth: 0 >>> wdepth: 0 >>> dstate: 0x00609200 >>> IRQD_ACTIVATED >>> IRQD_IRQ_STARTED >>> IRQD_MOVE_PCNTXT >>> IRQD_AFFINITY_SET >>> IRQD_AFFINITY_MANAGED >> >> So this uses managed affinity, which means that once the last CPU in the >> affinity mask goes offline, the interrupt is shut down by the irq core >> code, which is the case: >> >>> dstate: 0x00a39000 >>> IRQD_IRQ_DISABLED >>> IRQD_IRQ_MASKED >>> IRQD_MOVE_PCNTXT >>> IRQD_AFFINITY_SET >>> IRQD_AFFINITY_MANAGED >>> IRQD_MANAGED_SHUTDOWN <--- >> >> So the irq core code works as expected, but something in the >> driver/scsi/block stack seems to fiddle with that shut down queue. >> >> I only can tell about the inner workings of the irq code, but I have no >> clue about the rest. > > Though there is something wrong here: > >> affinity: 24-29 >> effectiv: 24-29 > > and after offlining: > >> affinity: 29 >> effectiv: 29 > > But that should be: > > affinity: 24-29 > effectiv: 29 > > because the irq core code preserves 'affinity'. It merily updates > 'effective', which is where your interrupts are routed to. > > Is the driver issuing any set_affinity() calls? If so, that's wrong. > > Which driver are we talking about? We are talking about megasas driver. So I added linux-scsi and maintainers of megasas into the thread. Thanks, Yasuaki Ishimatsu > > Thanks, > > tglx >
Re: system hung up when offlining CPUs
On Sat, 16 Sep 2017, Thomas Gleixner wrote:
> On Thu, 14 Sep 2017, YASUAKI ISHIMATSU wrote:
>> Here are one irq's info of megasas:
>>
>> - Before offline CPU
>> /proc/irq/70/smp_affinity_list
>> 24-29
>>
>> /proc/irq/70/effective_affinity
>> ,,,,,,,,,,,,,,,3f00
>>
>> /sys/kernel/debug/irq/irqs/70
>> handler: handle_edge_irq
>> status: 0x4000
>> istate: 0x
>> ddepth: 0
>> wdepth: 0
>> dstate: 0x00609200
>>   IRQD_ACTIVATED
>>   IRQD_IRQ_STARTED
>>   IRQD_MOVE_PCNTXT
>>   IRQD_AFFINITY_SET
>>   IRQD_AFFINITY_MANAGED
>
> So this uses managed affinity, which means that once the last CPU in the
> affinity mask goes offline, the interrupt is shut down by the irq core
> code, which is the case:
>
>> dstate: 0x00a39000
>>   IRQD_IRQ_DISABLED
>>   IRQD_IRQ_MASKED
>>   IRQD_MOVE_PCNTXT
>>   IRQD_AFFINITY_SET
>>   IRQD_AFFINITY_MANAGED
>>   IRQD_MANAGED_SHUTDOWN  <---
>
> So the irq core code works as expected, but something in the
> driver/scsi/block stack seems to fiddle with that shut down queue.
>
> I only can tell about the inner workings of the irq code, but I have no
> clue about the rest.

Though there is something wrong here:

> affinity: 24-29
> effectiv: 24-29

and after offlining:

> affinity: 29
> effectiv: 29

But that should be:

  affinity: 24-29
  effectiv: 29

because the irq core code preserves 'affinity'. It merely updates
'effective', which is where your interrupts are routed to.

Is the driver issuing any set_affinity() calls? If so, that's wrong.

Which driver are we talking about?

Thanks,

	tglx
Re: system hung up when offlining CPUs
On Thu, 14 Sep 2017, YASUAKI ISHIMATSU wrote: > On 09/13/2017 09:33 AM, Thomas Gleixner wrote: > >> Question - "what happens once __cpu_disable is called and some of the > >> queued > >> interrupt has affinity to that particular CPU ?" > >> I assume ideally those pending/queued Interrupt should be migrated to > >> remaining online CPUs. It should not be unhandled if we want to avoid such > >> IO timeout. > > > > Can you please provide the following information, before and after > > offlining the last CPU in the affinity set: > > > > # cat /proc/irq/$IRQNUM/smp_affinity_list > > # cat /proc/irq/$IRQNUM/effective_affinity > > # cat /sys/kernel/debug/irq/irqs/$IRQNUM > > > > The last one requires: CONFIG_GENERIC_IRQ_DEBUGFS=y > > Here are one irq's info of megasas: > > - Before offline CPU > /proc/irq/70/smp_affinity_list > 24-29 > > /proc/irq/70/effective_affinity > ,,,,,,,,,,,,,,,3f00 > > /sys/kernel/debug/irq/irqs/70 > handler: handle_edge_irq > status: 0x4000 > istate: 0x > ddepth: 0 > wdepth: 0 > dstate: 0x00609200 > IRQD_ACTIVATED > IRQD_IRQ_STARTED > IRQD_MOVE_PCNTXT > IRQD_AFFINITY_SET > IRQD_AFFINITY_MANAGED So this uses managed affinity, which means that once the last CPU in the affinity mask goes offline, the interrupt is shut down by the irq core code, which is the case: > dstate: 0x00a39000 > IRQD_IRQ_DISABLED > IRQD_IRQ_MASKED > IRQD_MOVE_PCNTXT > IRQD_AFFINITY_SET > IRQD_AFFINITY_MANAGED > IRQD_MANAGED_SHUTDOWN <--- So the irq core code works as expected, but something in the driver/scsi/block stack seems to fiddle with that shut down queue. I only can tell about the inner workings of the irq code, but I have no clue about the rest. Thanks, tglx
Re: system hung up when offlining CPUs
On 09/13/2017 09:33 AM, Thomas Gleixner wrote: > On Wed, 13 Sep 2017, Kashyap Desai wrote: >>> On 09/12/2017 08:15 PM, YASUAKI ISHIMATSU wrote: + linux-scsi and maintainers of megasas > > In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline > CPU#24-29, I/O does not work, showing the following messages. > > > >>> This indeed looks like a problem. >>> We're going to great lengths to submit and complete I/O on the same CPU, >>> so >>> if the CPU is offlined while I/O is in flight we won't be getting a >>> completion for >>> this particular I/O. >>> However, the megasas driver should be able to cope with this situation; >>> after >>> all, the firmware maintains completions queues, so it would be dead easy >>> to >>> look at _other_ completions queues, too, if a timeout occurs. >> In case of IO timeout, megaraid_sas driver is checking other queues as well. >> That is why IO was completed in this case and further IOs were resumed. >> >> Driver complete commands as below code executed from >> megasas_wait_for_outstanding_fusion(). >> for (MSIxIndex = 0 ; MSIxIndex < count; MSIxIndex++) >> complete_cmd_fusion(instance, MSIxIndex); >> >> Because of above code executed in driver, we see only one print as below in >> this logs. >> megaraid_sas :02:00.0: [ 0]waiting for 2 commands to complete for scsi0 >> >> As per below link CPU hotplug will take care- "All interrupts targeted to >> this CPU are migrated to a new CPU" >> https://www.kernel.org/doc/html/v4.11/core-api/cpu_hotplug.html >> >> BTW - We are also able reproduce this issue locally. Reason for IO timeout >> is -" IO is completed, but corresponding interrupt did not arrived on Online >> CPU. Either missed due to CPU is in transient state of being OFFLINED. I am >> not sure which component should take care this." >> >> Question - "what happens once __cpu_disable is called and some of the queued >> interrupt has affinity to that particular CPU ?" >> I assume ideally those pending/queued Interrupt should be migrated to >> remaining online CPUs. It should not be unhandled if we want to avoid such >> IO timeout. 
> Can you please provide the following information, before and after
> offlining the last CPU in the affinity set:
>
> # cat /proc/irq/$IRQNUM/smp_affinity_list
> # cat /proc/irq/$IRQNUM/effective_affinity
> # cat /sys/kernel/debug/irq/irqs/$IRQNUM
>
> The last one requires: CONFIG_GENERIC_IRQ_DEBUGFS=y

Here are one irq's info of megasas:

- Before offline CPU
/proc/irq/70/smp_affinity_list
24-29

/proc/irq/70/effective_affinity
,,,,,,,,,,,,,,,3f00

/sys/kernel/debug/irq/irqs/70
handler:  handle_edge_irq
status:   0x4000
istate:   0x
ddepth:   0
wdepth:   0
dstate:   0x00609200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_MOVE_PCNTXT
            IRQD_AFFINITY_SET
            IRQD_AFFINITY_MANAGED
node:     1
affinity: 24-29
effectiv: 24-29
pending:
domain:  INTEL-IR-MSI-0-2
 hwirq:   0x100018
 chip:    IR-PCI-MSI
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-0
     hwirq:   0x40
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x46
         chip:    APIC
          flags:   0x0

- After offline CPU#24-29
/proc/irq/70/smp_affinity_list
29

/proc/irq/70/effective_affinity
,,,,,,,,,,,,,,,2000

/sys/kernel/debug/irq/irqs/70
handler:  handle_edge_irq
status:   0x4000
istate:   0x
ddepth:   1
wdepth:   0
dstate:   0x00a39000
            IRQD_IRQ_DISABLED
            IRQD_IRQ_MASKED
            IRQD_MOVE_PCNTXT
            IRQD_AFFINITY_SET
            IRQD_AFFINITY_MANAGED
            IRQD_MANAGED_SHUTDOWN
node:     1
affinity: 29
effectiv: 29
pending:
domain:  INTEL-IR-MSI-0-2
 hwirq:   0x100018
 chip:    IR-PCI-MSI
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-0
     hwirq:   0x40
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x46
         chip:    APIC
          flags:   0x0

Thanks,
Yasuaki Ishimatsu

>
> Thanks,
>
> 	tglx
>
RE: system hung up when offlining CPUs
On Wed, 13 Sep 2017, Kashyap Desai wrote: > > On 09/12/2017 08:15 PM, YASUAKI ISHIMATSU wrote: > > > + linux-scsi and maintainers of megasas > > >> In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline > > >> CPU#24-29, I/O does not work, showing the following messages. > > This indeed looks like a problem. > > We're going to great lengths to submit and complete I/O on the same CPU, > > so > > if the CPU is offlined while I/O is in flight we won't be getting a > > completion for > > this particular I/O. > > However, the megasas driver should be able to cope with this situation; > > after > > all, the firmware maintains completions queues, so it would be dead easy > > to > > look at _other_ completions queues, too, if a timeout occurs. > In case of IO timeout, megaraid_sas driver is checking other queues as well. > That is why IO was completed in this case and further IOs were resumed. > > Driver complete commands as below code executed from > megasas_wait_for_outstanding_fusion(). > for (MSIxIndex = 0 ; MSIxIndex < count; MSIxIndex++) > complete_cmd_fusion(instance, MSIxIndex); > > Because of above code executed in driver, we see only one print as below in > this logs. > megaraid_sas :02:00.0: [ 0]waiting for 2 commands to complete for scsi0 > > As per below link CPU hotplug will take care- "All interrupts targeted to > this CPU are migrated to a new CPU" > https://www.kernel.org/doc/html/v4.11/core-api/cpu_hotplug.html > > BTW - We are also able reproduce this issue locally. Reason for IO timeout > is -" IO is completed, but corresponding interrupt did not arrived on Online > CPU. Either missed due to CPU is in transient state of being OFFLINED. I am > not sure which component should take care this." > > Question - "what happens once __cpu_disable is called and some of the queued > interrupt has affinity to that particular CPU ?" > I assume ideally those pending/queued Interrupt should be migrated to > remaining online CPUs. It should not be unhandled if we want to avoid such > IO timeout. Can you please provide the following information, before and after offlining the last CPU in the affinity set: # cat /proc/irq/$IRQNUM/smp_affinity_list # cat /proc/irq/$IRQNUM/effective_affinity # cat /sys/kernel/debug/irq/irqs/$IRQNUM The last one requires: CONFIG_GENERIC_IRQ_DEBUGFS=y Thanks, tglx
RE: system hung up when offlining CPUs
> > On 09/12/2017 08:15 PM, YASUAKI ISHIMATSU wrote: > > + linux-scsi and maintainers of megasas > > > > When offlining CPU, I/O stops. Do you have any ideas? > > > > On 09/07/2017 04:23 PM, YASUAKI ISHIMATSU wrote: > >> Hi Mark and Christoph, > >> > >> Sorry for the late reply. I appreciated that you fixed the issue on kvm > environment. > >> But the issue still occurs on physical server. > >> > >> Here ares irq information that I summarized megasas irqs from > >> /proc/interrupts and /proc/irq/*/smp_affinity_list on my server: > >> > >> --- > >> IRQ affinity_list IRQ_TYPE > >> 420-5IR-PCI-MSI 1048576-edge megasas > >> 430-5IR-PCI-MSI 1048577-edge megasas > >> 440-5IR-PCI-MSI 1048578-edge megasas > >> 450-5IR-PCI-MSI 1048579-edge megasas > >> 460-5IR-PCI-MSI 1048580-edge megasas > >> 470-5IR-PCI-MSI 1048581-edge megasas > >> 480-5IR-PCI-MSI 1048582-edge megasas > >> 490-5IR-PCI-MSI 1048583-edge megasas > >> 500-5IR-PCI-MSI 1048584-edge megasas > >> 510-5IR-PCI-MSI 1048585-edge megasas > >> 520-5IR-PCI-MSI 1048586-edge megasas > >> 530-5IR-PCI-MSI 1048587-edge megasas > >> 540-5IR-PCI-MSI 1048588-edge megasas > >> 550-5IR-PCI-MSI 1048589-edge megasas > >> 560-5IR-PCI-MSI 1048590-edge megasas > >> 570-5IR-PCI-MSI 1048591-edge megasas > >> 580-5IR-PCI-MSI 1048592-edge megasas > >> 590-5IR-PCI-MSI 1048593-edge megasas > >> 600-5IR-PCI-MSI 1048594-edge megasas > >> 610-5IR-PCI-MSI 1048595-edge megasas > >> 620-5IR-PCI-MSI 1048596-edge megasas > >> 630-5IR-PCI-MSI 1048597-edge megasas > >> 640-5IR-PCI-MSI 1048598-edge megasas > >> 650-5IR-PCI-MSI 1048599-edge megasas > >> 66 24-29IR-PCI-MSI 1048600-edge megasas > >> 67 24-29IR-PCI-MSI 1048601-edge megasas > >> 68 24-29IR-PCI-MSI 1048602-edge megasas > >> 69 24-29IR-PCI-MSI 1048603-edge megasas > >> 70 24-29IR-PCI-MSI 1048604-edge megasas > >> 71 24-29IR-PCI-MSI 1048605-edge megasas > >> 72 24-29IR-PCI-MSI 1048606-edge megasas > >> 73 24-29IR-PCI-MSI 1048607-edge megasas > >> 74 24-29IR-PCI-MSI 1048608-edge megasas > >> 75 24-29IR-PCI-MSI 1048609-edge megasas > >> 76 24-29IR-PCI-MSI 1048610-edge megasas > >> 77 24-29IR-PCI-MSI 1048611-edge megasas > >> 78 24-29IR-PCI-MSI 1048612-edge megasas > >> 79 24-29IR-PCI-MSI 1048613-edge megasas > >> 80 24-29IR-PCI-MSI 1048614-edge megasas > >> 81 24-29IR-PCI-MSI 1048615-edge megasas > >> 82 24-29IR-PCI-MSI 1048616-edge megasas > >> 83 24-29IR-PCI-MSI 1048617-edge megasas > >> 84 24-29IR-PCI-MSI 1048618-edge megasas > >> 85 24-29IR-PCI-MSI 1048619-edge megasas > >> 86 24-29IR-PCI-MSI 1048620-edge megasas > >> 87 24-29IR-PCI-MSI 1048621-edge megasas > >> 88 24-29IR-PCI-MSI 1048622-edge megasas > >> 89 24-29IR-PCI-MSI 1048623-edge megasas > >> --- > >> > >> In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline > >> CPU#24-29, I/O does not work, showing the following messages. > >> > >> --- > >> [...] sd 0:2:0:0: [sda] tag#1 task abort called for > >> scmd(8820574d7560) [...] sd 0:2:0:0: [sda] tag#1 CDB: Read(10) 28 > >> 00 0d e8 cf 78 00 00 08 00 [...] sd 0:2:0:0: task abort: FAILED > >> scmd(8820574d7560) [...] sd 0:2:0:0: [sda] tag#0 task abort > >> called for scmd(882057426560) [...] sd 0:2:0:0: [sda] tag#0 CDB: > >> Write(10) 2a 00 0d 58 37 00 00 00 08 00 [...] sd 0:2:0:0: task abort: > >> FAILED scmd(882057426560) [...] sd 0:2:0:0: target reset called > >> for scmd(8820574d7560) [...] sd 0:2:0:0: [sda] tag#1 megasas: > >> target > reset FAILED!! > >> [...] sd 0:2:0:0: [sda] tag#0 Controller reset is requested due to IO > >> timeout > >> [...] 
SCSI command pointer: (882057426560) SCSI host state: 5 > >> SCSI > >> [...] IO request frame: > >> [...] > >> > >> [...] > >> [...] megaraid_sas :02:00.0: [ 0]waiting for 2 commands to > >> complete for scsi0 [...] INFO: task auditd:1200 blocked for more than > >> 120 > seconds. > >> [...] Not tainted 4.13.0+ #15 > >> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this > message. > >> [...] auditd D0 1200 1 0x > >> [...] Call Trace: > >> [...] __schedule+0x28d/0x890 > >> [...] schedule+0x36/0x80 > >> [...] io_schedule+0x16/0x40 > >> [...] wait_on_page_bit_common+0x109/0x1c0 > >> [...] ? page_cache_tree_insert+0xf0/0xf0 [...] > >> __filemap_fdatawait_range+0x127/0x190 > >> [...] ? __filemap_fdatawrite_range+0xd1/0x100 > >> [...] file_write_and_wait_range+0x60/0xb0 > >> [...] xfs_file_fsync+0x67/0x1d0 [xfs] [...] > >> vfs_fsync_range+0x3d/0x
Re: system hung up when offlining CPUs
On 09/12/2017 08:15 PM, YASUAKI ISHIMATSU wrote: > + linux-scsi and maintainers of megasas > > When offlining CPU, I/O stops. Do you have any ideas? > > On 09/07/2017 04:23 PM, YASUAKI ISHIMATSU wrote: >> Hi Mark and Christoph, >> >> Sorry for the late reply. I appreciated that you fixed the issue on kvm >> environment. >> But the issue still occurs on physical server. >> >> Here ares irq information that I summarized megasas irqs from >> /proc/interrupts >> and /proc/irq/*/smp_affinity_list on my server: >> >> --- >> IRQ affinity_list IRQ_TYPE >> 420-5IR-PCI-MSI 1048576-edge megasas >> 430-5IR-PCI-MSI 1048577-edge megasas >> 440-5IR-PCI-MSI 1048578-edge megasas >> 450-5IR-PCI-MSI 1048579-edge megasas >> 460-5IR-PCI-MSI 1048580-edge megasas >> 470-5IR-PCI-MSI 1048581-edge megasas >> 480-5IR-PCI-MSI 1048582-edge megasas >> 490-5IR-PCI-MSI 1048583-edge megasas >> 500-5IR-PCI-MSI 1048584-edge megasas >> 510-5IR-PCI-MSI 1048585-edge megasas >> 520-5IR-PCI-MSI 1048586-edge megasas >> 530-5IR-PCI-MSI 1048587-edge megasas >> 540-5IR-PCI-MSI 1048588-edge megasas >> 550-5IR-PCI-MSI 1048589-edge megasas >> 560-5IR-PCI-MSI 1048590-edge megasas >> 570-5IR-PCI-MSI 1048591-edge megasas >> 580-5IR-PCI-MSI 1048592-edge megasas >> 590-5IR-PCI-MSI 1048593-edge megasas >> 600-5IR-PCI-MSI 1048594-edge megasas >> 610-5IR-PCI-MSI 1048595-edge megasas >> 620-5IR-PCI-MSI 1048596-edge megasas >> 630-5IR-PCI-MSI 1048597-edge megasas >> 640-5IR-PCI-MSI 1048598-edge megasas >> 650-5IR-PCI-MSI 1048599-edge megasas >> 66 24-29IR-PCI-MSI 1048600-edge megasas >> 67 24-29IR-PCI-MSI 1048601-edge megasas >> 68 24-29IR-PCI-MSI 1048602-edge megasas >> 69 24-29IR-PCI-MSI 1048603-edge megasas >> 70 24-29IR-PCI-MSI 1048604-edge megasas >> 71 24-29IR-PCI-MSI 1048605-edge megasas >> 72 24-29IR-PCI-MSI 1048606-edge megasas >> 73 24-29IR-PCI-MSI 1048607-edge megasas >> 74 24-29IR-PCI-MSI 1048608-edge megasas >> 75 24-29IR-PCI-MSI 1048609-edge megasas >> 76 24-29IR-PCI-MSI 1048610-edge megasas >> 77 24-29IR-PCI-MSI 1048611-edge megasas >> 78 24-29IR-PCI-MSI 1048612-edge megasas >> 79 24-29IR-PCI-MSI 1048613-edge megasas >> 80 24-29IR-PCI-MSI 1048614-edge megasas >> 81 24-29IR-PCI-MSI 1048615-edge megasas >> 82 24-29IR-PCI-MSI 1048616-edge megasas >> 83 24-29IR-PCI-MSI 1048617-edge megasas >> 84 24-29IR-PCI-MSI 1048618-edge megasas >> 85 24-29IR-PCI-MSI 1048619-edge megasas >> 86 24-29IR-PCI-MSI 1048620-edge megasas >> 87 24-29IR-PCI-MSI 1048621-edge megasas >> 88 24-29IR-PCI-MSI 1048622-edge megasas >> 89 24-29IR-PCI-MSI 1048623-edge megasas >> --- >> >> In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline CPU#24-29, >> I/O does not work, showing the following messages. >> >> --- >> [...] sd 0:2:0:0: [sda] tag#1 task abort called for scmd(8820574d7560) >> [...] sd 0:2:0:0: [sda] tag#1 CDB: Read(10) 28 00 0d e8 cf 78 00 00 08 00 >> [...] sd 0:2:0:0: task abort: FAILED scmd(8820574d7560) >> [...] sd 0:2:0:0: [sda] tag#0 task abort called for scmd(882057426560) >> [...] sd 0:2:0:0: [sda] tag#0 CDB: Write(10) 2a 00 0d 58 37 00 00 00 08 00 >> [...] sd 0:2:0:0: task abort: FAILED scmd(882057426560) >> [...] sd 0:2:0:0: target reset called for scmd(8820574d7560) >> [...] sd 0:2:0:0: [sda] tag#1 megasas: target reset FAILED!! >> [...] sd 0:2:0:0: [sda] tag#0 Controller reset is requested due to IO timeout >> [...] SCSI command pointer: (882057426560) SCSI host state: 5 SCSI >> [...] IO request frame: >> [...] >> >> [...] >> [...] megaraid_sas :02:00.0: [ 0]waiting for 2 commands to complete for >> scsi0 >> [...] 
INFO: task auditd:1200 blocked for more than 120 seconds. >> [...] Not tainted 4.13.0+ #15 >> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this >> message. >> [...] auditd D0 1200 1 0x >> [...] Call Trace: >> [...] __schedule+0x28d/0x890 >> [...] schedule+0x36/0x80 >> [...] io_schedule+0x16/0x40 >> [...] wait_on_page_bit_common+0x109/0x1c0 >> [...] ? page_cache_tree_insert+0xf0/0xf0 >> [...] __filemap_fdatawait_range+0x127/0x190 >> [...] ? __filemap_fdatawrite_range+0xd1/0x100 >> [...] file_write_and_wait_range+0x60/0xb0 >> [...] xfs_file_fsync+0x67/0x1d0 [xfs] >> [...] vfs_fsync_range+0x3d/0xb0 >> [...] do_fsync+0x3d/0x70 >> [...] SyS_fsync+0x10/0x20 >> [...] entry_SYSCALL_64_fastpath+0x1a/0xa5 >> [...] RIP: 0033:0x7f0bd9633d2d >> [...] RSP: 002b:7f0bd751ed30 EFLAGS: 0293 ORIG_RAX:
Re: system hung up when offlining CPUs
+ linux-scsi and maintainers of megasas When offlining CPU, I/O stops. Do you have any ideas? On 09/07/2017 04:23 PM, YASUAKI ISHIMATSU wrote: > Hi Mark and Christoph, > > Sorry for the late reply. I appreciated that you fixed the issue on kvm > environment. > But the issue still occurs on physical server. > > Here ares irq information that I summarized megasas irqs from /proc/interrupts > and /proc/irq/*/smp_affinity_list on my server: > > --- > IRQ affinity_list IRQ_TYPE > 420-5IR-PCI-MSI 1048576-edge megasas > 430-5IR-PCI-MSI 1048577-edge megasas > 440-5IR-PCI-MSI 1048578-edge megasas > 450-5IR-PCI-MSI 1048579-edge megasas > 460-5IR-PCI-MSI 1048580-edge megasas > 470-5IR-PCI-MSI 1048581-edge megasas > 480-5IR-PCI-MSI 1048582-edge megasas > 490-5IR-PCI-MSI 1048583-edge megasas > 500-5IR-PCI-MSI 1048584-edge megasas > 510-5IR-PCI-MSI 1048585-edge megasas > 520-5IR-PCI-MSI 1048586-edge megasas > 530-5IR-PCI-MSI 1048587-edge megasas > 540-5IR-PCI-MSI 1048588-edge megasas > 550-5IR-PCI-MSI 1048589-edge megasas > 560-5IR-PCI-MSI 1048590-edge megasas > 570-5IR-PCI-MSI 1048591-edge megasas > 580-5IR-PCI-MSI 1048592-edge megasas > 590-5IR-PCI-MSI 1048593-edge megasas > 600-5IR-PCI-MSI 1048594-edge megasas > 610-5IR-PCI-MSI 1048595-edge megasas > 620-5IR-PCI-MSI 1048596-edge megasas > 630-5IR-PCI-MSI 1048597-edge megasas > 640-5IR-PCI-MSI 1048598-edge megasas > 650-5IR-PCI-MSI 1048599-edge megasas > 66 24-29IR-PCI-MSI 1048600-edge megasas > 67 24-29IR-PCI-MSI 1048601-edge megasas > 68 24-29IR-PCI-MSI 1048602-edge megasas > 69 24-29IR-PCI-MSI 1048603-edge megasas > 70 24-29IR-PCI-MSI 1048604-edge megasas > 71 24-29IR-PCI-MSI 1048605-edge megasas > 72 24-29IR-PCI-MSI 1048606-edge megasas > 73 24-29IR-PCI-MSI 1048607-edge megasas > 74 24-29IR-PCI-MSI 1048608-edge megasas > 75 24-29IR-PCI-MSI 1048609-edge megasas > 76 24-29IR-PCI-MSI 1048610-edge megasas > 77 24-29IR-PCI-MSI 1048611-edge megasas > 78 24-29IR-PCI-MSI 1048612-edge megasas > 79 24-29IR-PCI-MSI 1048613-edge megasas > 80 24-29IR-PCI-MSI 1048614-edge megasas > 81 24-29IR-PCI-MSI 1048615-edge megasas > 82 24-29IR-PCI-MSI 1048616-edge megasas > 83 24-29IR-PCI-MSI 1048617-edge megasas > 84 24-29IR-PCI-MSI 1048618-edge megasas > 85 24-29IR-PCI-MSI 1048619-edge megasas > 86 24-29IR-PCI-MSI 1048620-edge megasas > 87 24-29IR-PCI-MSI 1048621-edge megasas > 88 24-29IR-PCI-MSI 1048622-edge megasas > 89 24-29IR-PCI-MSI 1048623-edge megasas > --- > > In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline CPU#24-29, > I/O does not work, showing the following messages. > > --- > [...] sd 0:2:0:0: [sda] tag#1 task abort called for scmd(8820574d7560) > [...] sd 0:2:0:0: [sda] tag#1 CDB: Read(10) 28 00 0d e8 cf 78 00 00 08 00 > [...] sd 0:2:0:0: task abort: FAILED scmd(8820574d7560) > [...] sd 0:2:0:0: [sda] tag#0 task abort called for scmd(882057426560) > [...] sd 0:2:0:0: [sda] tag#0 CDB: Write(10) 2a 00 0d 58 37 00 00 00 08 00 > [...] sd 0:2:0:0: task abort: FAILED scmd(882057426560) > [...] sd 0:2:0:0: target reset called for scmd(8820574d7560) > [...] sd 0:2:0:0: [sda] tag#1 megasas: target reset FAILED!! > [...] sd 0:2:0:0: [sda] tag#0 Controller reset is requested due to IO timeout > [...] SCSI command pointer: (882057426560) SCSI host state: 5 SCSI > [...] IO request frame: > [...] > > [...] > [...] megaraid_sas :02:00.0: [ 0]waiting for 2 commands to complete for > scsi0 > [...] INFO: task auditd:1200 blocked for more than 120 seconds. > [...] Not tainted 4.13.0+ #15 > [...] 
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this > message. > [...] auditd D0 1200 1 0x > [...] Call Trace: > [...] __schedule+0x28d/0x890 > [...] schedule+0x36/0x80 > [...] io_schedule+0x16/0x40 > [...] wait_on_page_bit_common+0x109/0x1c0 > [...] ? page_cache_tree_insert+0xf0/0xf0 > [...] __filemap_fdatawait_range+0x127/0x190 > [...] ? __filemap_fdatawrite_range+0xd1/0x100 > [...] file_write_and_wait_range+0x60/0xb0 > [...] xfs_file_fsync+0x67/0x1d0 [xfs] > [...] vfs_fsync_range+0x3d/0xb0 > [...] do_fsync+0x3d/0x70 > [...] SyS_fsync+0x10/0x20 > [...] entry_SYSCALL_64_fastpath+0x1a/0xa5 > [...] RIP: 0033:0x7f0bd9633d2d > [...] RSP: 002b:7f0bd751ed30 EFLAGS: 0293 ORIG_RAX: 004a > [...] RAX: ffda RBX: 5590566d0080 RCX: 7f0bd9633d2d > [...] RDX: 5590566d1260 RSI: RDI:
Re: system hung up when offlining CPUs
Hi Mark and Christoph,

Sorry for the late reply. I appreciated that you fixed the issue in the
kvm environment. But the issue still occurs on a physical server.

Here is the IRQ information for megasas that I summarized from
/proc/interrupts and /proc/irq/*/smp_affinity_list on my server:

---
IRQ  affinity_list  IRQ_TYPE
 42  0-5            IR-PCI-MSI 1048576-edge megasas
 43  0-5            IR-PCI-MSI 1048577-edge megasas
 44  0-5            IR-PCI-MSI 1048578-edge megasas
 45  0-5            IR-PCI-MSI 1048579-edge megasas
 46  0-5            IR-PCI-MSI 1048580-edge megasas
 47  0-5            IR-PCI-MSI 1048581-edge megasas
 48  0-5            IR-PCI-MSI 1048582-edge megasas
 49  0-5            IR-PCI-MSI 1048583-edge megasas
 50  0-5            IR-PCI-MSI 1048584-edge megasas
 51  0-5            IR-PCI-MSI 1048585-edge megasas
 52  0-5            IR-PCI-MSI 1048586-edge megasas
 53  0-5            IR-PCI-MSI 1048587-edge megasas
 54  0-5            IR-PCI-MSI 1048588-edge megasas
 55  0-5            IR-PCI-MSI 1048589-edge megasas
 56  0-5            IR-PCI-MSI 1048590-edge megasas
 57  0-5            IR-PCI-MSI 1048591-edge megasas
 58  0-5            IR-PCI-MSI 1048592-edge megasas
 59  0-5            IR-PCI-MSI 1048593-edge megasas
 60  0-5            IR-PCI-MSI 1048594-edge megasas
 61  0-5            IR-PCI-MSI 1048595-edge megasas
 62  0-5            IR-PCI-MSI 1048596-edge megasas
 63  0-5            IR-PCI-MSI 1048597-edge megasas
 64  0-5            IR-PCI-MSI 1048598-edge megasas
 65  0-5            IR-PCI-MSI 1048599-edge megasas
 66  24-29          IR-PCI-MSI 1048600-edge megasas
 67  24-29          IR-PCI-MSI 1048601-edge megasas
 68  24-29          IR-PCI-MSI 1048602-edge megasas
 69  24-29          IR-PCI-MSI 1048603-edge megasas
 70  24-29          IR-PCI-MSI 1048604-edge megasas
 71  24-29          IR-PCI-MSI 1048605-edge megasas
 72  24-29          IR-PCI-MSI 1048606-edge megasas
 73  24-29          IR-PCI-MSI 1048607-edge megasas
 74  24-29          IR-PCI-MSI 1048608-edge megasas
 75  24-29          IR-PCI-MSI 1048609-edge megasas
 76  24-29          IR-PCI-MSI 1048610-edge megasas
 77  24-29          IR-PCI-MSI 1048611-edge megasas
 78  24-29          IR-PCI-MSI 1048612-edge megasas
 79  24-29          IR-PCI-MSI 1048613-edge megasas
 80  24-29          IR-PCI-MSI 1048614-edge megasas
 81  24-29          IR-PCI-MSI 1048615-edge megasas
 82  24-29          IR-PCI-MSI 1048616-edge megasas
 83  24-29          IR-PCI-MSI 1048617-edge megasas
 84  24-29          IR-PCI-MSI 1048618-edge megasas
 85  24-29          IR-PCI-MSI 1048619-edge megasas
 86  24-29          IR-PCI-MSI 1048620-edge megasas
 87  24-29          IR-PCI-MSI 1048621-edge megasas
 88  24-29          IR-PCI-MSI 1048622-edge megasas
 89  24-29          IR-PCI-MSI 1048623-edge megasas
---

In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline CPU#24-29,
I/O does not work, showing the following messages.

---
[...] sd 0:2:0:0: [sda] tag#1 task abort called for scmd(8820574d7560)
[...] sd 0:2:0:0: [sda] tag#1 CDB: Read(10) 28 00 0d e8 cf 78 00 00 08 00
[...] sd 0:2:0:0: task abort: FAILED scmd(8820574d7560)
[...] sd 0:2:0:0: [sda] tag#0 task abort called for scmd(882057426560)
[...] sd 0:2:0:0: [sda] tag#0 CDB: Write(10) 2a 00 0d 58 37 00 00 00 08 00
[...] sd 0:2:0:0: task abort: FAILED scmd(882057426560)
[...] sd 0:2:0:0: target reset called for scmd(8820574d7560)
[...] sd 0:2:0:0: [sda] tag#1 megasas: target reset FAILED!!
[...] sd 0:2:0:0: [sda] tag#0 Controller reset is requested due to IO timeout
[...] SCSI command pointer: (882057426560) SCSI host state: 5 SCSI
[...] IO request frame:
[...]
[...]
[...] megaraid_sas :02:00.0: [ 0]waiting for 2 commands to complete for scsi0
[...] INFO: task auditd:1200 blocked for more than 120 seconds.
[...] Not tainted 4.13.0+ #15
[...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[...] auditd D0 1200 1 0x
[...] Call Trace:
[...] __schedule+0x28d/0x890
[...] schedule+0x36/0x80
[...] io_schedule+0x16/0x40
[...] wait_on_page_bit_common+0x109/0x1c0
[...] ? page_cache_tree_insert+0xf0/0xf0
[...] __filemap_fdatawait_range+0x127/0x190
[...] ? __filemap_fdatawrite_range+0xd1/0x100
[...] file_write_and_wait_range+0x60/0xb0
[...] xfs_file_fsync+0x67/0x1d0 [xfs]
[...] vfs_fsync_range+0x3d/0xb0
[...] do_fsync+0x3d/0x70
[...] SyS_fsync+0x10/0x20
[...] entry_SYSCALL_64_fastpath+0x1a/0xa5
[...] RIP: 0033:0x7f0bd9633d2d
[...] RSP: 002b:7f0bd751ed30 EFLAGS: 0293 ORIG_RAX: 004a
[...] RAX: ffda RBX: 5590566d0080 RCX: 7f0bd9633d2d
[...] RDX: 5590566d1260 RSI: RDI: 0005
[...] RBP: R08: R09: 0017
[...] R10: R11: 0293 R12:
[...] R13: 7f0bd751f9c0 R14: 7f0bd751f700 R15:
---

Thanks,
Yasuaki Ishimatsu

On 08/21/2017 09:37 AM, Marc Zyngier wrote:
> On 21/08/17 14:18, Christoph Hellwig wrote:
>> Can you
Re: system hung up when offlining CPUs
On 21/08/17 14:18, Christoph Hellwig wrote:
> Can you try the patch below please?
>
> ---
> From d5f59cb7a629de8439b318e1384660e6b56e7dd8 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig
> Date: Mon, 21 Aug 2017 14:24:11 +0200
> Subject: virtio_pci: fix cpu affinity support
>
> Commit 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for
> virtqueues"") removed the adjustment of the pre_vectors for the virtio
> MSI-X vector allocation which was added in commit fb5e31d9 ("virtio:
> allow drivers to request IRQ affinity when creating VQs"). This will
> lead to an incorrect assignment of MSI-X vectors, and potential
> deadlocks when offlining cpus.
>
> Signed-off-by: Christoph Hellwig
> Fixes: 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for virtqueues")
> Reported-by: YASUAKI ISHIMATSU

Just gave it a go on an arm64 VM, and the behaviour seems much saner (the
virtio queue affinity now spans the whole system).

Tested-by: Marc Zyngier

Thanks,

	M.
--
Jazz is not dead. It just smells funny...
Re: system hung up when offlining CPUs
Can you try the patch below please?

---
From d5f59cb7a629de8439b318e1384660e6b56e7dd8 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Mon, 21 Aug 2017 14:24:11 +0200
Subject: virtio_pci: fix cpu affinity support

Commit 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for
virtqueues"") removed the adjustment of the pre_vectors for the virtio
MSI-X vector allocation which was added in commit fb5e31d9 ("virtio:
allow drivers to request IRQ affinity when creating VQs"). This will
lead to an incorrect assignment of MSI-X vectors, and potential
deadlocks when offlining cpus.

Signed-off-by: Christoph Hellwig
Fixes: 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for virtqueues")
Reported-by: YASUAKI ISHIMATSU
---
 drivers/virtio/virtio_pci_common.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index 007a4f366086..1c4797e53f68 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -107,6 +107,7 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 	const char *name = dev_name(&vp_dev->vdev.dev);
+	unsigned flags = PCI_IRQ_MSIX;
 	unsigned i, v;
 	int err = -ENOMEM;
 
@@ -126,10 +127,13 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
 			GFP_KERNEL))
 		goto error;
 
+	if (desc) {
+		flags |= PCI_IRQ_AFFINITY;
+		desc->pre_vectors++;	/* virtio config vector */
+	}
+
 	err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors,
-					     nvectors, PCI_IRQ_MSIX |
-					     (desc ? PCI_IRQ_AFFINITY : 0),
-					     desc);
+					     nvectors, flags, desc);
 	if (err < 0)
 		goto error;
 	vp_dev->msix_enabled = 1;
-- 
2.11.0
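To make the pre_vectors semantics concrete, here is a minimal,
self-contained sketch (hypothetical driver and function names, not the
virtio code itself) of how a PCI driver reserves vector 0 for a
non-queue interrupt while letting the remaining vectors be spread
across online CPUs as managed interrupts:

---
#include <linux/pci.h>
#include <linux/interrupt.h>

/*
 * Hypothetical driver fragment illustrating the pre_vectors semantics
 * the patch above relies on: vector 0 (the config interrupt) stays out
 * of the managed affinity spread; vectors 1..nvectors-1 are spread
 * over the online CPUs.
 */
static int example_alloc_vectors(struct pci_dev *pdev, int nvectors)
{
	struct irq_affinity desc = {
		.pre_vectors = 1,	/* reserve vector 0 for config */
	};
	int err;

	err = pci_alloc_irq_vectors_affinity(pdev, nvectors, nvectors,
					     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					     &desc);
	if (err < 0)
		return err;

	/* Linux IRQ number of the (non-managed) config vector. */
	return pci_irq_vector(pdev, 0);
}
---

With .pre_vectors = 1, only the queue vectors get managed affinity;
the config vector keeps a regular, migratable affinity, which is what
a vector that is not tied to any particular CPU's queue needs.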
Re: system hung up when offlining CPUs
Hi Marc,

in general the driver should know not to use the queue / irq, as blk-mq
will never schedule I/O to queues that have no online cpus.

The real bug seems to be that we're using affinity for a device that
only has one real queue (as the config queue should not have affinity).

Let me dig into what's going on here with virtio.
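To illustrate the first point, here is a conceptual sketch (all names
made up, not the actual blk-mq code): each submitting CPU indexes a
CPU-to-queue map that is rebuilt around CPU hotplug, so a hardware
queue whose mapped CPUs are all offline simply stops receiving new I/O:

---
/*
 * Conceptual sketch only -- not the real blk-mq implementation.
 * example_dev, mq_map and queues are hypothetical names.
 */
struct example_hw_queue;

struct example_dev {
	unsigned int *mq_map;		/* per-CPU queue index, rebuilt on hotplug */
	struct example_hw_queue **queues;
};

static struct example_hw_queue *
example_queue_for_cpu(struct example_dev *dev, unsigned int cpu)
{
	/*
	 * Because mq_map[] is recomputed as CPUs go on/offline, the
	 * submitting CPU is never steered to a queue whose CPUs are
	 * all dead -- which is why the driver "should know" not to
	 * use such a queue.
	 */
	return dev->queues[dev->mq_map[cpu]];
}
---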
Re: system hung up when offlining CPUs
+ Christoph, since he's the one who came up with the idea

On 09/08/17 20:09, YASUAKI ISHIMATSU wrote:
> Hi Marc,
>
> On 08/09/2017 07:42 AM, Marc Zyngier wrote:
>> On Tue, 8 Aug 2017 15:25:35 -0400
>> YASUAKI ISHIMATSU wrote:
>>
>>> Hi Thomas,
>>>
>>> When offlining all CPUs except cpu0, the system hung up with the
>>> following message.
>>>
>>> [...] INFO: task kworker/u384:1:1234 blocked for more than 120 seconds.
>>> [...]       Not tainted 4.12.0-rc6+ #19
>>> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [...] kworker/u384:1  D    0  1234      2 0x
>>> [...] Workqueue: writeback wb_workfn (flush-253:0)
>>> [...] Call Trace:
>>> [...]  __schedule+0x28a/0x880
>>> [...]  schedule+0x36/0x80
>>> [...]  schedule_timeout+0x249/0x300
>>> [...]  ? __schedule+0x292/0x880
>>> [...]  __down_common+0xfc/0x132
>>> [...]  ? _xfs_buf_find+0x2bb/0x510 [xfs]
>>> [...]  __down+0x1d/0x1f
>>> [...]  down+0x41/0x50
>>> [...]  xfs_buf_lock+0x3c/0xf0 [xfs]
>>> [...]  _xfs_buf_find+0x2bb/0x510 [xfs]
>>> [...]  xfs_buf_get_map+0x2a/0x280 [xfs]
>>> [...]  xfs_buf_read_map+0x2d/0x180 [xfs]
>>> [...]  xfs_trans_read_buf_map+0xf5/0x310 [xfs]
>>> [...]  xfs_btree_read_buf_block.constprop.35+0x78/0xc0 [xfs]
>>> [...]  xfs_btree_lookup_get_block+0x88/0x160 [xfs]
>>> [...]  xfs_btree_lookup+0xd0/0x3b0 [xfs]
>>> [...]  ? xfs_allocbt_init_cursor+0x41/0xe0 [xfs]
>>> [...]  xfs_alloc_ag_vextent_near+0xaf/0xaa0 [xfs]
>>> [...]  xfs_alloc_ag_vextent+0x13c/0x150 [xfs]
>>> [...]  xfs_alloc_vextent+0x425/0x590 [xfs]
>>> [...]  xfs_bmap_btalloc+0x448/0x770 [xfs]
>>> [...]  xfs_bmap_alloc+0xe/0x10 [xfs]
>>> [...]  xfs_bmapi_write+0x61d/0xc10 [xfs]
>>> [...]  ? kmem_zone_alloc+0x96/0x100 [xfs]
>>> [...]  xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
>>> [...]  xfs_map_blocks+0x1e8/0x260 [xfs]
>>> [...]  xfs_do_writepage+0x1ca/0x680 [xfs]
>>> [...]  write_cache_pages+0x26f/0x510
>>> [...]  ? xfs_vm_set_page_dirty+0x1d0/0x1d0 [xfs]
>>> [...]  ? blk_mq_dispatch_rq_list+0x305/0x410
>>> [...]  ? deadline_remove_request+0x7d/0xc0
>>> [...]  xfs_vm_writepages+0xb6/0xd0 [xfs]
>>> [...]  do_writepages+0x1c/0x70
>>> [...]  __writeback_single_inode+0x45/0x320
>>> [...]  writeback_sb_inodes+0x280/0x570
>>> [...]  __writeback_inodes_wb+0x8c/0xc0
>>> [...]  wb_writeback+0x276/0x310
>>> [...]  ? get_nr_dirty_inodes+0x4d/0x80
>>> [...]  wb_workfn+0x2d4/0x3b0
>>> [...]  process_one_work+0x149/0x360
>>> [...]  worker_thread+0x4d/0x3c0
>>> [...]  kthread+0x109/0x140
>>> [...]  ? rescuer_thread+0x380/0x380
>>> [...]  ? kthread_park+0x60/0x60
>>> [...]  ret_from_fork+0x25/0x30
>>>
>>> I bisected the upstream kernel and found that the following commit
>>> led to the issue.
>>>
>>> commit c5cb83bb337c25caae995d992d1cdf9b317f83de
>>> Author: Thomas Gleixner
>>> Date:   Tue Jun 20 01:37:51 2017 +0200
>>>
>>>     genirq/cpuhotplug: Handle managed IRQs on CPU hotplug
>>
>> Can you please post your /proc/interrupts and details of which
>> interrupt you think goes wrong? This backtrace is not telling us much
>> in terms of where to start looking...
>
> Thank you for the advice.
>
> The issue is easily reproduced on a physical/virtual machine by offlining
> all CPUs except cpu0. Here is my /proc/interrupts on a kvm guest before
> reproducing the issue. When offlining cpu1 the issue occurred, but when
> offlining cpu0 it didn't.
>
>            CPU0       CPU1
>   0:        127          0   IO-APIC    2-edge      timer
>   1:         10          0   IO-APIC    1-edge      i8042
>   4:        227          0   IO-APIC    4-edge      ttyS0
>   6:          3          0   IO-APIC    6-edge      floppy
>   8:          0          0   IO-APIC    8-edge      rtc0
>   9:          0          0   IO-APIC    9-fasteoi   acpi
>  10:      10822          0   IO-APIC   10-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, virtio3
>  11:         23          0   IO-APIC   11-fasteoi   uhci_hcd:usb3, uhci_hcd:usb4, qxl
>  12:         15          0   IO-APIC   12-edge      i8042
>  14:        218          0   IO-APIC   14-edge      ata_piix
>  15:          0          0   IO-APIC   15-edge      ata_piix
>  24:          0          0   PCI-MSI 49152-edge     virtio0-config
>  25:        359          0   PCI-MSI 49153-edge     virtio0-input.0
>  26:          1          0   PCI-MSI 49154-edge     virtio0-output.0
>  27:          0          0   PCI-MSI 114688-edge    virtio2-config
>  28:          1       3639   PCI-MSI 114689-edge    virtio2-req.0
>  29:          0          0   PCI-MSI 98304-edge     virtio1-config
>  30:          4          0   PCI-MSI 98305-edge     virtio1-virtqueues
>  31:        189          0   PCI-MSI 65536-edge     snd_hda_intel:card0
> NMI:          0          0   Non-maskable interrupts
> LOC:      16115      12845   Local timer interrupts
> SPU:          0          0   Spurious interrupts
> PMI:          0          0   Performance monitoring interrupts
> IWI:          0          0   IRQ work interrupts
> RTR:
Re: system hung up when offlining CPUs
Hi Marc,

On 08/09/2017 07:42 AM, Marc Zyngier wrote:
> On Tue, 8 Aug 2017 15:25:35 -0400
> YASUAKI ISHIMATSU wrote:
>
>> Hi Thomas,
>>
>> When offlining all CPUs except cpu0, the system hung up with the
>> following message.
>>
>> [...] INFO: task kworker/u384:1:1234 blocked for more than 120 seconds.
>> [...]       Not tainted 4.12.0-rc6+ #19
>> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [...] kworker/u384:1  D    0  1234      2 0x
>> [...] Workqueue: writeback wb_workfn (flush-253:0)
>> [...] Call Trace:
>> [...]  __schedule+0x28a/0x880
>> [...]  schedule+0x36/0x80
>> [...]  schedule_timeout+0x249/0x300
>> [...]  ? __schedule+0x292/0x880
>> [...]  __down_common+0xfc/0x132
>> [...]  ? _xfs_buf_find+0x2bb/0x510 [xfs]
>> [...]  __down+0x1d/0x1f
>> [...]  down+0x41/0x50
>> [...]  xfs_buf_lock+0x3c/0xf0 [xfs]
>> [...]  _xfs_buf_find+0x2bb/0x510 [xfs]
>> [...]  xfs_buf_get_map+0x2a/0x280 [xfs]
>> [...]  xfs_buf_read_map+0x2d/0x180 [xfs]
>> [...]  xfs_trans_read_buf_map+0xf5/0x310 [xfs]
>> [...]  xfs_btree_read_buf_block.constprop.35+0x78/0xc0 [xfs]
>> [...]  xfs_btree_lookup_get_block+0x88/0x160 [xfs]
>> [...]  xfs_btree_lookup+0xd0/0x3b0 [xfs]
>> [...]  ? xfs_allocbt_init_cursor+0x41/0xe0 [xfs]
>> [...]  xfs_alloc_ag_vextent_near+0xaf/0xaa0 [xfs]
>> [...]  xfs_alloc_ag_vextent+0x13c/0x150 [xfs]
>> [...]  xfs_alloc_vextent+0x425/0x590 [xfs]
>> [...]  xfs_bmap_btalloc+0x448/0x770 [xfs]
>> [...]  xfs_bmap_alloc+0xe/0x10 [xfs]
>> [...]  xfs_bmapi_write+0x61d/0xc10 [xfs]
>> [...]  ? kmem_zone_alloc+0x96/0x100 [xfs]
>> [...]  xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
>> [...]  xfs_map_blocks+0x1e8/0x260 [xfs]
>> [...]  xfs_do_writepage+0x1ca/0x680 [xfs]
>> [...]  write_cache_pages+0x26f/0x510
>> [...]  ? xfs_vm_set_page_dirty+0x1d0/0x1d0 [xfs]
>> [...]  ? blk_mq_dispatch_rq_list+0x305/0x410
>> [...]  ? deadline_remove_request+0x7d/0xc0
>> [...]  xfs_vm_writepages+0xb6/0xd0 [xfs]
>> [...]  do_writepages+0x1c/0x70
>> [...]  __writeback_single_inode+0x45/0x320
>> [...]  writeback_sb_inodes+0x280/0x570
>> [...]  __writeback_inodes_wb+0x8c/0xc0
>> [...]  wb_writeback+0x276/0x310
>> [...]  ? get_nr_dirty_inodes+0x4d/0x80
>> [...]  wb_workfn+0x2d4/0x3b0
>> [...]  process_one_work+0x149/0x360
>> [...]  worker_thread+0x4d/0x3c0
>> [...]  kthread+0x109/0x140
>> [...]  ? rescuer_thread+0x380/0x380
>> [...]  ? kthread_park+0x60/0x60
>> [...]  ret_from_fork+0x25/0x30
>>
>> I bisected the upstream kernel and found that the following commit
>> led to the issue.
>>
>> commit c5cb83bb337c25caae995d992d1cdf9b317f83de
>> Author: Thomas Gleixner
>> Date:   Tue Jun 20 01:37:51 2017 +0200
>>
>>     genirq/cpuhotplug: Handle managed IRQs on CPU hotplug
>
> Can you please post your /proc/interrupts and details of which
> interrupt you think goes wrong? This backtrace is not telling us much
> in terms of where to start looking...

Thank you for the advice.

The issue is easily reproduced on a physical/virtual machine by offlining
all CPUs except cpu0. Here is my /proc/interrupts on a kvm guest before
reproducing the issue. When offlining cpu1 the issue occurred, but when
offlining cpu0 it didn't.

           CPU0       CPU1
  0:        127          0   IO-APIC    2-edge      timer
  1:         10          0   IO-APIC    1-edge      i8042
  4:        227          0   IO-APIC    4-edge      ttyS0
  6:          3          0   IO-APIC    6-edge      floppy
  8:          0          0   IO-APIC    8-edge      rtc0
  9:          0          0   IO-APIC    9-fasteoi   acpi
 10:      10822          0   IO-APIC   10-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, virtio3
 11:         23          0   IO-APIC   11-fasteoi   uhci_hcd:usb3, uhci_hcd:usb4, qxl
 12:         15          0   IO-APIC   12-edge      i8042
 14:        218          0   IO-APIC   14-edge      ata_piix
 15:          0          0   IO-APIC   15-edge      ata_piix
 24:          0          0   PCI-MSI 49152-edge     virtio0-config
 25:        359          0   PCI-MSI 49153-edge     virtio0-input.0
 26:          1          0   PCI-MSI 49154-edge     virtio0-output.0
 27:          0          0   PCI-MSI 114688-edge    virtio2-config
 28:          1       3639   PCI-MSI 114689-edge    virtio2-req.0
 29:          0          0   PCI-MSI 98304-edge     virtio1-config
 30:          4          0   PCI-MSI 98305-edge     virtio1-virtqueues
 31:        189          0   PCI-MSI 65536-edge     snd_hda_intel:card0
NMI:          0          0   Non-maskable interrupts
LOC:      16115      12845   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
IWI:          0          0   IRQ work interrupts
RTR:          0          0   APIC ICR read retries
RES:       3016       2135   Rescheduling interrupts
CAL:    3666557              Function call interrupts
TLB:         65         12   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:
Re: system hung up when offlining CPUs
On Tue, 8 Aug 2017 15:25:35 -0400
YASUAKI ISHIMATSU wrote:

> Hi Thomas,
>
> When offlining all CPUs except cpu0, the system hung up with the
> following message.
>
> [...] INFO: task kworker/u384:1:1234 blocked for more than 120 seconds.
> [...]       Not tainted 4.12.0-rc6+ #19
> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [...] kworker/u384:1  D    0  1234      2 0x
> [...] Workqueue: writeback wb_workfn (flush-253:0)
> [...] Call Trace:
> [...]  __schedule+0x28a/0x880
> [...]  schedule+0x36/0x80
> [...]  schedule_timeout+0x249/0x300
> [...]  ? __schedule+0x292/0x880
> [...]  __down_common+0xfc/0x132
> [...]  ? _xfs_buf_find+0x2bb/0x510 [xfs]
> [...]  __down+0x1d/0x1f
> [...]  down+0x41/0x50
> [...]  xfs_buf_lock+0x3c/0xf0 [xfs]
> [...]  _xfs_buf_find+0x2bb/0x510 [xfs]
> [...]  xfs_buf_get_map+0x2a/0x280 [xfs]
> [...]  xfs_buf_read_map+0x2d/0x180 [xfs]
> [...]  xfs_trans_read_buf_map+0xf5/0x310 [xfs]
> [...]  xfs_btree_read_buf_block.constprop.35+0x78/0xc0 [xfs]
> [...]  xfs_btree_lookup_get_block+0x88/0x160 [xfs]
> [...]  xfs_btree_lookup+0xd0/0x3b0 [xfs]
> [...]  ? xfs_allocbt_init_cursor+0x41/0xe0 [xfs]
> [...]  xfs_alloc_ag_vextent_near+0xaf/0xaa0 [xfs]
> [...]  xfs_alloc_ag_vextent+0x13c/0x150 [xfs]
> [...]  xfs_alloc_vextent+0x425/0x590 [xfs]
> [...]  xfs_bmap_btalloc+0x448/0x770 [xfs]
> [...]  xfs_bmap_alloc+0xe/0x10 [xfs]
> [...]  xfs_bmapi_write+0x61d/0xc10 [xfs]
> [...]  ? kmem_zone_alloc+0x96/0x100 [xfs]
> [...]  xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
> [...]  xfs_map_blocks+0x1e8/0x260 [xfs]
> [...]  xfs_do_writepage+0x1ca/0x680 [xfs]
> [...]  write_cache_pages+0x26f/0x510
> [...]  ? xfs_vm_set_page_dirty+0x1d0/0x1d0 [xfs]
> [...]  ? blk_mq_dispatch_rq_list+0x305/0x410
> [...]  ? deadline_remove_request+0x7d/0xc0
> [...]  xfs_vm_writepages+0xb6/0xd0 [xfs]
> [...]  do_writepages+0x1c/0x70
> [...]  __writeback_single_inode+0x45/0x320
> [...]  writeback_sb_inodes+0x280/0x570
> [...]  __writeback_inodes_wb+0x8c/0xc0
> [...]  wb_writeback+0x276/0x310
> [...]  ? get_nr_dirty_inodes+0x4d/0x80
> [...]  wb_workfn+0x2d4/0x3b0
> [...]  process_one_work+0x149/0x360
> [...]  worker_thread+0x4d/0x3c0
> [...]  kthread+0x109/0x140
> [...]  ? rescuer_thread+0x380/0x380
> [...]  ? kthread_park+0x60/0x60
> [...]  ret_from_fork+0x25/0x30
>
> I bisected the upstream kernel and found that the following commit
> led to the issue.
>
> commit c5cb83bb337c25caae995d992d1cdf9b317f83de
> Author: Thomas Gleixner
> Date:   Tue Jun 20 01:37:51 2017 +0200
>
>     genirq/cpuhotplug: Handle managed IRQs on CPU hotplug

Can you please post your /proc/interrupts and details of which
interrupt you think goes wrong? This backtrace is not telling us much
in terms of where to start looking...

Thanks,

	M.
--
Without deviation from the norm, progress is not possible.