Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC
Success! This resolved the issue. Thanks Thomas! Rick On 05/17/18 08:36, Thomas Gleixner wrote: > Rick, > > On Wed, 16 May 2018, Rick Warner wrote: > >> I've attached the dmesg output with the kernel parameter and supplied patch. > Thanks for providing the data. I think I know what's wrong here. Does the > patch below fix your problem? > > While you test, I go looking for a big brown paperbag. > > Thanks, > > tglx > > 8< > > --- a/arch/x86/kernel/apic/x2apic_cluster.c > +++ b/arch/x86/kernel/apic/x2apic_cluster.c > @@ -124,6 +124,7 @@ static void init_x2apic_ldr(void) > goto update; > } > cmsk = cluster_hotplug_mask; > + cmsk->clusterid = cluster; > cluster_hotplug_mask = NULL; > update: > this_cpu_write(cluster_masks, cmsk);
Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC
Rick, On Wed, 16 May 2018, Rick Warner wrote: > I've attached the dmesg output with the kernel parameter and supplied patch. Thanks for providing the data. I think I know what's wrong here. Does the patch below fix your problem? While you test, I go looking for a big brown paperbag. Thanks, tglx 8< --- a/arch/x86/kernel/apic/x2apic_cluster.c +++ b/arch/x86/kernel/apic/x2apic_cluster.c @@ -124,6 +124,7 @@ static void init_x2apic_ldr(void) goto update; } cmsk = cluster_hotplug_mask; + cmsk->clusterid = cluster; cluster_hotplug_mask = NULL; update: this_cpu_write(cluster_masks, cmsk);
Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC
Thanks Thomas! I've attached the dmesg output with the kernel parameter and supplied patch. Here is the excerpt of the ftrace dump: Dumping ftrace buffer: - swapper/-1 0d... 463331us : __x2apic_send_IPI_mask: To: swapper/-1 0d... 46us : __x2apic_send_IPI_mask: CPU: 0 cluster: 0-1 swapper/-1 0d... 463334us : __x2apic_send_IPI_mask: dest 00010001 --> tmpmsk swapper/-1 0d... 504423us : __x2apic_send_IPI_mask: To: swapper/-1 0d... 504424us : __x2apic_send_IPI_mask: CPU: 1 cluster: 0-1 swapper/-1 0d... 504425us : __x2apic_send_IPI_mask: dest 0400 --> tmpmsk swapper/-1 0d... 504425us : __x2apic_send_IPI_mask: CPU: 2 cluster: 2 swapper/-1 0d... 504425us : __x2apic_send_IPI_mask: dest 00010010 --> tmpmsk swapper/-1 0d... 504425us : __x2apic_send_IPI_mask: CPU: 3 cluster: 3 swapper/-1 0d... 504426us : __x2apic_send_IPI_mask: dest 00010040 --> tmpmsk swapper/-1 0d... 504430us : __x2apic_send_IPI_mask: To: swapper/-1 0d... 504430us : __x2apic_send_IPI_mask: CPU: 1 cluster: 0-1 swapper/-1 0d... 504431us : __x2apic_send_IPI_mask: dest 0400 --> tmpmsk swapper/-1 0d... 504431us : __x2apic_send_IPI_mask: CPU: 2 cluster: 2 swapper/-1 0d... 504431us : __x2apic_send_IPI_mask: dest 00010010 --> tmpmsk swapper/-1 0d... 504431us : __x2apic_send_IPI_mask: CPU: 3 cluster: 3 swapper/-1 0d... 504431us : __x2apic_send_IPI_mask: dest 00010040 --> tmpmsk swapper/-1 0d... 504433us : __x2apic_send_IPI_mask: To: swapper/-1 0d... 504433us : __x2apic_send_IPI_mask: CPU: 1 cluster: 0-1 swapper/-1 0d... 504433us : __x2apic_send_IPI_mask: dest 0400 --> tmpmsk swapper/-1 0d... 504434us : __x2apic_send_IPI_mask: CPU: 2 cluster: 2 swapper/-1 0d... 504434us : __x2apic_send_IPI_mask: dest 00010010 --> tmpmsk swapper/-1 0d... 504434us : __x2apic_send_IPI_mask: CPU: 3 cluster: 3 swapper/-1 0d... 504434us : __x2apic_send_IPI_mask: dest 00010040 --> tmpmsk swapper/-1 3d... 1813404us : __x2apic_send_IPI_mask: To: swapper/-1 3d... 1813405us : __x2apic_send_IPI_mask: CPU: 0 cluster: 0-1 swapper/-1 3d... 1813406us : __x2apic_send_IPI_mask: dest 00010001 --> tmpmsk swapper/-1 3 1879964us : on_each_cpu_cond: 0-1,3 swapper/-1 3 1879972us : on_each_cpu_cond: 0-1,3 swapper/-1 3 1879975us : on_each_cpu_cond: 0-1,3 swapper/-1 3d... 1879976us : __x2apic_send_IPI_mask: To: swapper/-1 3d... 1879977us : __x2apic_send_IPI_mask: CPU: 0 cluster: 0-1 swapper/-1 3d... 1879977us : __x2apic_send_IPI_mask: dest 00010401 --> tmpmsk -0 0d.h. 61903171us : __x2apic_send_IPI_mask: To: -0 0d.h. 61903172us : __x2apic_send_IPI_mask: CPU: 3 cluster: 3 -0 0d.h. 61903172us : __x2apic_send_IPI_mask: dest 00010040 --> tmpmsk -0 0d.h. 61904223us : __x2apic_send_IPI_mask: To: -0 0d.h. 61904223us : __x2apic_send_IPI_mask: CPU: 1 cluster: 0-1 -0 0d.h. 61904223us : __x2apic_send_IPI_mask: dest 0400 --> tmpmsk -0 0d.h. 61904223us : __x2apic_send_IPI_mask: CPU: 2 cluster: 2 -0 0d.h. 61904224us : __x2apic_send_IPI_mask: dest 00010010 --> tmpmsk -0 0d.h. 61904224us : __x2apic_send_IPI_mask: CPU: 3 cluster: 3 -0 0d.h. 61904224us : __x2apic_send_IPI_mask: dest 00010040 --> tmpmsk -0 0d.h. 62924392us : __x2apic_send_IPI_mask: To: -0 0d.h. 62924393us : __x2apic_send_IPI_mask: CPU: 3 cluster: 3 -0 0d.h. 62924394us : __x2apic_send_IPI_mask: dest 00010040 --> tmpmsk Thanks, Rick On 05/16/18 10:50, Thomas Gleixner wrote: > Rick, > > On Tue, 15 May 2018, Thomas Gleixner wrote: >> I can't spot an immediate fail with that commit, but I'll have a look >> tomorrow for instrumenting this with tracepoints which can be dumped from >> the stall detector. > can you please give the patch below a try? I assume you have a serial > console, otherwise this is going to be tedious. > > Please add the following to the kernel command line: > > ftrace_dump_on_oops > > Note, the box will panic after the first stall and spill out the trace > buffer over serial, which might take a while. > > Thanks, > > tglx > > 8< > > --- a/arch/x86/kernel/apic/x2apic_cluster.c > +++ b/arch/x86/kernel/apic/x2apic_cluster.c > @@ -52,20 +52,28 @@ static void > if (apic_dest != APIC_DEST_ALLINC) > cpumask_clear_cpu(smp_processor_id(), tmpmsk); > > + trace_printk("To: %*pbl\n", cpumask_pr_args(tmpmsk)); > + > /* Collapse cpus in a cluster so a single IPI per cluster is sent */ > for_each_cpu(cpu, tmpmsk) { > struct cluster_mask *cmsk = per_cpu(cluster_masks, cpu); > > dest = 0; > + trace_printk("CPU: %u cluster: %*pbl\n", cpu, > + cpumask_pr_args(&cmsk->mask)); > for_each_cpu_and(clustercpu, tmpmsk, &cmsk->mask) > de
Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC
Rick, On Tue, 15 May 2018, Thomas Gleixner wrote: > I can't spot an immediate fail with that commit, but I'll have a look > tomorrow for instrumenting this with tracepoints which can be dumped from > the stall detector. can you please give the patch below a try? I assume you have a serial console, otherwise this is going to be tedious. Please add the following to the kernel command line: ftrace_dump_on_oops Note, the box will panic after the first stall and spill out the trace buffer over serial, which might take a while. Thanks, tglx 8< --- a/arch/x86/kernel/apic/x2apic_cluster.c +++ b/arch/x86/kernel/apic/x2apic_cluster.c @@ -52,20 +52,28 @@ static void if (apic_dest != APIC_DEST_ALLINC) cpumask_clear_cpu(smp_processor_id(), tmpmsk); + trace_printk("To: %*pbl\n", cpumask_pr_args(tmpmsk)); + /* Collapse cpus in a cluster so a single IPI per cluster is sent */ for_each_cpu(cpu, tmpmsk) { struct cluster_mask *cmsk = per_cpu(cluster_masks, cpu); dest = 0; + trace_printk("CPU: %u cluster: %*pbl\n", cpu, +cpumask_pr_args(&cmsk->mask)); for_each_cpu_and(clustercpu, tmpmsk, &cmsk->mask) dest |= per_cpu(x86_cpu_to_logical_apicid, clustercpu); - if (!dest) + if (!dest) { + trace_printk("dest = 0!?\n"); continue; + } __x2apic_send_IPI_dest(dest, vector, apic->dest_logical); /* Remove cluster CPUs from tmpmask */ cpumask_andnot(tmpmsk, tmpmsk, &cmsk->mask); + trace_printk("dest %08x --> tmpmsk %*pbl\n", dest, +cpumask_pr_args(tmpmsk)); } local_irq_restore(flags); --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -124,7 +124,7 @@ int rcu_num_lvls __read_mostly = RCU_NUM int num_rcu_lvl[] = NUM_RCU_LVL_INIT; int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ /* panic() on RCU Stall sysctl. */ -int sysctl_panic_on_rcu_stall __read_mostly; +int sysctl_panic_on_rcu_stall __read_mostly = 1; /* * The rcu_scheduler_active variable is initialized to the value --- a/kernel/smp.c +++ b/kernel/smp.c @@ -681,6 +681,7 @@ void on_each_cpu_cond(bool (*cond_func)( for_each_online_cpu(cpu) if (cond_func(cpu, info)) cpumask_set_cpu(cpu, cpus); + trace_printk("%*pbl\n", cpumask_pr_args(cpus)); on_each_cpu_mask(cpus, func, info, wait); preempt_enable(); free_cpumask_var(cpus);
Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC
On Tue, 15 May 2018, Rick Warner wrote: > > I've discovered that some new Supermicro skylake systems will hang/stall > > while booting the 4.15 kernel when extended APIC (x2apic) is enabled in > > the BIOS. The issue happens on specific CPUs only and follows the CPUs. > > > > We had (4) quad socket systems with Xeon 6134 CPUs; 2 out of 4 were > > exhibiting this behavior. We replaced 2 CPUs at that time and the > > behavior was eliminated. Those systems were then shipped to our customer > > (we are an HPC system integrator). > > > > Now, we have 5 single socket systems with 5122 CPUs. 2 out of the 5 are > > hanging. If we swap the CPUs from the hanging systems with working > > systems, the behavior follows the CPU. That's weird. > > I've done a git bisect between 4.14 and 4.15 and found this commit is > > triggering the issue: > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=023a611748fd58d46c8aa049cf4f22ebada983f5 > > Interesting. > > I've attached a dmesg log captured via serial console from a system > > exhibiting this problem. Here is an excerpt from it where the problems > > start: > > NMI backtrace for cpu 34 > > RIP: 0010:smp_call_function_many+0x1f1/0x204 So this waits for the IPI to be handled on some other CPU(s). > > RSP: :c90f3af0 EFLAGS: 0202 ORIG_RAX: ff11 > > RAX: 0001 RBX: 880c110a0488 RCX: 0001 > > RDX: 880c10e64440 RSI: RDI: 880c110a0488 > > RBP: 880c110a0480 R08: fffe R09: 0003 > > R10: R11: ea00c03c1a60 R12: 0001 > > R13: 880c110a04b8 R14: 00020440 R15: 81ed5400 > > ? slub_cpu_dead+0xa0/0xa0 > > ? slub_cpu_dead+0xa0/0xa0 > > ? __mmu_notifier_mm_destroy+0x32/0x32 > > on_each_cpu_mask+0x23/0x53 > > ? slub_cpu_dead+0xa0/0xa0 > > on_each_cpu_cond+0x7c/0x8b > > __kmem_cache_shrink+0x3c/0x237 > > ? acpi_ps_delete_parse_tree+0x2d/0x59 > > ? set_debug_rodata+0x11/0x11 > > ? acpi_os_purge_cache+0xa/0xd > > acpi_os_purge_cache+0xa/0xd > > acpi_purge_cached_objects+0x29/0x38 > > acpi_initialize_objects+0x46/0x4f > > ? acpi_sleep_init+0xd6/0xd6 > > acpi_init+0xb6/0x324 > > ? scan_for_dmi_ipmi+0x15/0xec > > ? acpi_sleep_init+0xd6/0xd6 > > do_one_initcall+0x89/0x128 > > ? set_debug_rodata+0x11/0x11 > > ? set_debug_rodata+0x11/0x11 > > kernel_init_freeable+0x112/0x18e > > ? rest_init+0xaa/0xaa > > kernel_init+0xa/0xf0 > > ret_from_fork+0x35/0x40 > > If any other information is needed, please let me know. I've reported > > this issue to Supermicro already and they believe it is an issue with > > the kernel opposed to an issue specific to their systems. I don't have > > any other brand Xeon skylake systems with extended APIC support that I > > can try this with. I can't spot an immediate fail with that commit, but I'll have a look tomorrow for instrumenting this with tracepoints which can be dumped from the stall detector. Thanks, tglx