Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC

2018-05-17 Thread Rick Warner
Success!

This resolved the issue.  Thanks Thomas!

Rick

On 05/17/18 08:36, Thomas Gleixner wrote:
> Rick,
>
> On Wed, 16 May 2018, Rick Warner wrote:
>
>> I've attached the dmesg output with the kernel parameter and supplied patch.
> Thanks for providing the data. I think I know what's wrong here. Does the
> patch below fix your problem?
>
> While you test, I go looking for a big brown paperbag.
>
> Thanks,
>
>   tglx
>
> 8<
>
> --- a/arch/x86/kernel/apic/x2apic_cluster.c
> +++ b/arch/x86/kernel/apic/x2apic_cluster.c
> @@ -124,6 +124,7 @@ static void init_x2apic_ldr(void)
>   goto update;
>   }
>   cmsk = cluster_hotplug_mask;
> + cmsk->clusterid = cluster;
>   cluster_hotplug_mask = NULL;
>  update:
>   this_cpu_write(cluster_masks, cmsk);



Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC

2018-05-17 Thread Thomas Gleixner
Rick,

On Wed, 16 May 2018, Rick Warner wrote:

> I've attached the dmesg output with the kernel parameter and supplied patch.

Thanks for providing the data. I think I know what's wrong here. Does the
patch below fix your problem?

While you test, I go looking for a big brown paperbag.

Thanks,

tglx

8<

--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -124,6 +124,7 @@ static void init_x2apic_ldr(void)
goto update;
}
cmsk = cluster_hotplug_mask;
+   cmsk->clusterid = cluster;
cluster_hotplug_mask = NULL;
 update:
this_cpu_write(cluster_masks, cmsk);


Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC

2018-05-16 Thread Rick Warner
Thanks Thomas!

I've attached the dmesg output with the kernel parameter and supplied patch.

Here is the excerpt of the ftrace dump:
Dumping ftrace buffer:
-
swapper/-1   0d... 463331us : __x2apic_send_IPI_mask: To:
swapper/-1   0d... 46us : __x2apic_send_IPI_mask: CPU: 0
cluster: 0-1
swapper/-1   0d... 463334us : __x2apic_send_IPI_mask: dest 00010001
--> tmpmsk
swapper/-1   0d... 504423us : __x2apic_send_IPI_mask: To:
swapper/-1   0d... 504424us : __x2apic_send_IPI_mask: CPU: 1
cluster: 0-1
swapper/-1   0d... 504425us : __x2apic_send_IPI_mask: dest 0400
--> tmpmsk
swapper/-1   0d... 504425us : __x2apic_send_IPI_mask: CPU: 2 cluster: 2
swapper/-1   0d... 504425us : __x2apic_send_IPI_mask: dest 00010010
--> tmpmsk
swapper/-1   0d... 504425us : __x2apic_send_IPI_mask: CPU: 3 cluster: 3
swapper/-1   0d... 504426us : __x2apic_send_IPI_mask: dest 00010040
--> tmpmsk
swapper/-1   0d... 504430us : __x2apic_send_IPI_mask: To:
swapper/-1   0d... 504430us : __x2apic_send_IPI_mask: CPU: 1
cluster: 0-1
swapper/-1   0d... 504431us : __x2apic_send_IPI_mask: dest 0400
--> tmpmsk
swapper/-1   0d... 504431us : __x2apic_send_IPI_mask: CPU: 2 cluster: 2
swapper/-1   0d... 504431us : __x2apic_send_IPI_mask: dest 00010010
--> tmpmsk
swapper/-1   0d... 504431us : __x2apic_send_IPI_mask: CPU: 3 cluster: 3
swapper/-1   0d... 504431us : __x2apic_send_IPI_mask: dest 00010040
--> tmpmsk
swapper/-1   0d... 504433us : __x2apic_send_IPI_mask: To:
swapper/-1   0d... 504433us : __x2apic_send_IPI_mask: CPU: 1
cluster: 0-1
swapper/-1   0d... 504433us : __x2apic_send_IPI_mask: dest 0400
--> tmpmsk
swapper/-1   0d... 504434us : __x2apic_send_IPI_mask: CPU: 2 cluster: 2
swapper/-1   0d... 504434us : __x2apic_send_IPI_mask: dest 00010010
--> tmpmsk
swapper/-1   0d... 504434us : __x2apic_send_IPI_mask: CPU: 3 cluster: 3
swapper/-1   0d... 504434us : __x2apic_send_IPI_mask: dest 00010040
--> tmpmsk
swapper/-1   3d... 1813404us : __x2apic_send_IPI_mask: To:
swapper/-1   3d... 1813405us : __x2apic_send_IPI_mask: CPU: 0
cluster: 0-1
swapper/-1   3d... 1813406us : __x2apic_send_IPI_mask: dest 00010001
--> tmpmsk
swapper/-1   3 1879964us : on_each_cpu_cond: 0-1,3
swapper/-1   3 1879972us : on_each_cpu_cond: 0-1,3
swapper/-1   3 1879975us : on_each_cpu_cond: 0-1,3
swapper/-1   3d... 1879976us : __x2apic_send_IPI_mask: To:
swapper/-1   3d... 1879977us : __x2apic_send_IPI_mask: CPU: 0
cluster: 0-1
swapper/-1   3d... 1879977us : __x2apic_send_IPI_mask: dest 00010401
--> tmpmsk
  -0   0d.h. 61903171us : __x2apic_send_IPI_mask: To:
  -0   0d.h. 61903172us : __x2apic_send_IPI_mask: CPU: 3
cluster: 3
  -0   0d.h. 61903172us : __x2apic_send_IPI_mask: dest
00010040 --> tmpmsk
  -0   0d.h. 61904223us : __x2apic_send_IPI_mask: To:
  -0   0d.h. 61904223us : __x2apic_send_IPI_mask: CPU: 1
cluster: 0-1
  -0   0d.h. 61904223us : __x2apic_send_IPI_mask: dest
0400 --> tmpmsk
  -0   0d.h. 61904223us : __x2apic_send_IPI_mask: CPU: 2
cluster: 2
  -0   0d.h. 61904224us : __x2apic_send_IPI_mask: dest
00010010 --> tmpmsk
  -0   0d.h. 61904224us : __x2apic_send_IPI_mask: CPU: 3
cluster: 3
  -0   0d.h. 61904224us : __x2apic_send_IPI_mask: dest
00010040 --> tmpmsk
  -0   0d.h. 62924392us : __x2apic_send_IPI_mask: To:
  -0   0d.h. 62924393us : __x2apic_send_IPI_mask: CPU: 3
cluster: 3
  -0   0d.h. 62924394us : __x2apic_send_IPI_mask: dest
00010040 --> tmpmsk

Thanks,
Rick

On 05/16/18 10:50, Thomas Gleixner wrote:
> Rick,
>
> On Tue, 15 May 2018, Thomas Gleixner wrote:
>> I can't spot an immediate fail with that commit, but I'll have a look
>> tomorrow for instrumenting this with tracepoints which can be dumped from
>> the stall detector.
> can you please give the patch below a try? I assume you have a serial
> console, otherwise this is going to be tedious.
>
> Please add the following to the kernel command line:
>
>   ftrace_dump_on_oops
>
> Note, the box will panic after the first stall and spill out the trace
> buffer over serial, which might take a while.
>
> Thanks,
>
>   tglx
>
> 8<
>
> --- a/arch/x86/kernel/apic/x2apic_cluster.c
> +++ b/arch/x86/kernel/apic/x2apic_cluster.c
> @@ -52,20 +52,28 @@ static void
>   if (apic_dest != APIC_DEST_ALLINC)
>   cpumask_clear_cpu(smp_processor_id(), tmpmsk);
>  
> + trace_printk("To: %*pbl\n", cpumask_pr_args(tmpmsk));
> +
>   /* Collapse cpus in a cluster so a single IPI per cluster is sent */
>   for_each_cpu(cpu, tmpmsk) {
>   struct cluster_mask *cmsk = per_cpu(cluster_masks, cpu);
>  
>   dest = 0;
> + trace_printk("CPU: %u cluster: %*pbl\n", cpu,
> +  cpumask_pr_args(&cmsk->mask));
>   for_each_cpu_and(clustercpu, tmpmsk, &cmsk->mask)
>   de

Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC

2018-05-16 Thread Thomas Gleixner
Rick,

On Tue, 15 May 2018, Thomas Gleixner wrote:
> I can't spot an immediate fail with that commit, but I'll have a look
> tomorrow for instrumenting this with tracepoints which can be dumped from
> the stall detector.

can you please give the patch below a try? I assume you have a serial
console, otherwise this is going to be tedious.

Please add the following to the kernel command line:

  ftrace_dump_on_oops

Note, the box will panic after the first stall and spill out the trace
buffer over serial, which might take a while.

Thanks,

tglx

8<

--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -52,20 +52,28 @@ static void
if (apic_dest != APIC_DEST_ALLINC)
cpumask_clear_cpu(smp_processor_id(), tmpmsk);
 
+   trace_printk("To: %*pbl\n", cpumask_pr_args(tmpmsk));
+
/* Collapse cpus in a cluster so a single IPI per cluster is sent */
for_each_cpu(cpu, tmpmsk) {
struct cluster_mask *cmsk = per_cpu(cluster_masks, cpu);
 
dest = 0;
+   trace_printk("CPU: %u cluster: %*pbl\n", cpu,
+cpumask_pr_args(&cmsk->mask));
for_each_cpu_and(clustercpu, tmpmsk, &cmsk->mask)
dest |= per_cpu(x86_cpu_to_logical_apicid, clustercpu);
 
-   if (!dest)
+   if (!dest) {
+   trace_printk("dest = 0!?\n");
continue;
+   }
 
__x2apic_send_IPI_dest(dest, vector, apic->dest_logical);
/* Remove cluster CPUs from tmpmask */
cpumask_andnot(tmpmsk, tmpmsk, &cmsk->mask);
+   trace_printk("dest %08x --> tmpmsk %*pbl\n", dest,
+cpumask_pr_args(tmpmsk));
}
 
local_irq_restore(flags);
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -124,7 +124,7 @@ int rcu_num_lvls __read_mostly = RCU_NUM
 int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
 int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. 
*/
 /* panic() on RCU Stall sysctl. */
-int sysctl_panic_on_rcu_stall __read_mostly;
+int sysctl_panic_on_rcu_stall __read_mostly = 1;
 
 /*
  * The rcu_scheduler_active variable is initialized to the value
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -681,6 +681,7 @@ void on_each_cpu_cond(bool (*cond_func)(
for_each_online_cpu(cpu)
if (cond_func(cpu, info))
cpumask_set_cpu(cpu, cpus);
+   trace_printk("%*pbl\n", cpumask_pr_args(cpus));
on_each_cpu_mask(cpus, func, info, wait);
preempt_enable();
free_cpumask_var(cpus);


Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC

2018-05-15 Thread Thomas Gleixner
On Tue, 15 May 2018, Rick Warner wrote:
> > I've discovered that some new Supermicro skylake systems will hang/stall
> > while booting the 4.15 kernel when extended APIC (x2apic) is enabled in
> > the BIOS. The issue happens on specific CPUs only and follows the CPUs.
> > 
> > We had (4) quad socket systems with Xeon 6134 CPUs; 2 out of 4 were
> > exhibiting this behavior.  We replaced 2 CPUs at that time and the
> > behavior was eliminated. Those systems were then shipped to our customer
> > (we are an HPC system integrator).
> > 
> > Now, we have 5 single socket systems with 5122 CPUs.  2 out of the 5 are
> > hanging.  If we swap the CPUs from the hanging systems with working
> > systems, the behavior follows the CPU.

That's weird.

> > I've done a git bisect between 4.14 and 4.15 and found this commit is
> > triggering the issue:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=023a611748fd58d46c8aa049cf4f22ebada983f5
> >

Interesting.

> > I've attached a dmesg log captured via serial console from a system
> > exhibiting this problem.  Here is an excerpt from it where the problems
> > start:

> > NMI backtrace for cpu 34

> > RIP: 0010:smp_call_function_many+0x1f1/0x204

So this waits for the IPI to be handled on some other CPU(s).

> > RSP: :c90f3af0 EFLAGS: 0202 ORIG_RAX: ff11
> > RAX: 0001 RBX: 880c110a0488 RCX: 0001
> > RDX: 880c10e64440 RSI:  RDI: 880c110a0488
> > RBP: 880c110a0480 R08: fffe R09: 0003
> > R10:  R11: ea00c03c1a60 R12: 0001
> > R13: 880c110a04b8 R14: 00020440 R15: 81ed5400
> >   ? slub_cpu_dead+0xa0/0xa0
> >   ? slub_cpu_dead+0xa0/0xa0
> >   ? __mmu_notifier_mm_destroy+0x32/0x32
> >   on_each_cpu_mask+0x23/0x53
> >   ? slub_cpu_dead+0xa0/0xa0
> >   on_each_cpu_cond+0x7c/0x8b
> >   __kmem_cache_shrink+0x3c/0x237
> >   ? acpi_ps_delete_parse_tree+0x2d/0x59
> >   ? set_debug_rodata+0x11/0x11
> >   ? acpi_os_purge_cache+0xa/0xd
> >   acpi_os_purge_cache+0xa/0xd
> >   acpi_purge_cached_objects+0x29/0x38
> >   acpi_initialize_objects+0x46/0x4f
> >   ? acpi_sleep_init+0xd6/0xd6
> >   acpi_init+0xb6/0x324
> >   ? scan_for_dmi_ipmi+0x15/0xec
> >   ? acpi_sleep_init+0xd6/0xd6
> >   do_one_initcall+0x89/0x128
> >   ? set_debug_rodata+0x11/0x11
> >   ? set_debug_rodata+0x11/0x11
> >   kernel_init_freeable+0x112/0x18e
> >   ? rest_init+0xaa/0xaa
> >   kernel_init+0xa/0xf0
> >   ret_from_fork+0x35/0x40

> > If any other information is needed, please let me know.  I've reported
> > this issue to Supermicro already and they believe it is an issue with
> > the kernel opposed to an issue specific to their systems.  I don't have
> > any other brand Xeon skylake systems with extended APIC support that I
> > can try this with.

I can't spot an immediate fail with that commit, but I'll have a look
tomorrow for instrumenting this with tracepoints which can be dumped from
the stall detector.

Thanks,

tglx