Re: [BUG] Kernel splat when taking CPUs offline

2015-07-08 Thread Viresh Kumar
On 09-07-15, 00:25, Steven Rostedt wrote:
> On Thu, 9 Jul 2015 09:34:45 +0530
> Viresh Kumar  wrote:
> 
> 
> > I think it might be related to what I chased down yesterday:
> > 
> > http://marc.info/?l=linux-kernel&m=143633485824975&w=2
> > 
> > @Steven: Can you please give this a try?
> > 
> 
> Yes, that seems to fix my issue as well.
> 
> Tested-by: Steven Rostedt 

Awesome, so the problem was that cpufreq_set_policy() was failing
because of the latest bug I planted :), and that caused ->exit() to be
called without freeing the policy completely. (I have fixed that as
well in a separate patch.)

And so you are hitting a policy which has already exited. Sorry about
that :)
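
For readers following along, here is a minimal userspace sketch of that
failure mode. It is not the kernel code: policy_model, registered,
driver_exit_model() and policy_online_model() are hypothetical names made
up for illustration, and the "full_cleanup" flag only stands in for the
separate fix mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cpufreq policy; not the kernel struct. */
struct policy_model {
	void *driver_data;	/* allocated at init, released by exit */
};

/* What a later lookup would find for this CPU. */
static struct policy_model *registered;

/* Models the driver's ->exit(): releases the driver-private data only. */
static void driver_exit_model(struct policy_model *p)
{
	free(p->driver_data);
	p->driver_data = NULL;
}

/*
 * Models bringing a policy online. When setting the policy fails, the
 * error path always calls exit; it drops the policy itself only if
 * full_cleanup is true. Returns 0 on success, -1 on failure.
 */
static int policy_online_model(bool set_policy_fails, bool full_cleanup)
{
	struct policy_model *p = calloc(1, sizeof(*p));

	if (p == NULL)
		return -1;
	p->driver_data = malloc(16);
	if (p->driver_data == NULL) {
		free(p);
		return -1;
	}
	registered = p;
	if (!set_policy_fails)
		return 0;

	driver_exit_model(p);
	if (full_cleanup) {
		registered = NULL;
		free(p);
	}
	return -1;
}

void stale_policy_demo(void)
{
	/* Incomplete cleanup: the policy is still found, its data is gone. */
	assert(policy_online_model(true, false) == -1);
	assert(registered != NULL && registered->driver_data == NULL);
	free(registered);
	registered = NULL;

	/* Complete cleanup: nothing stale is left behind. */
	assert(policy_online_model(true, true) == -1);
	assert(registered == NULL);
}
```

Calling stale_policy_demo() from a main() exercises both error paths; the
first one leaves behind exactly the kind of "already exited" policy
described above.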

-- 
viresh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] Kernel splat when taking CPUs offline

2015-07-08 Thread Steven Rostedt
On Thu, 9 Jul 2015 09:34:45 +0530
Viresh Kumar  wrote:


> I think it might be related to what I chased down yesterday:
> 
> http://marc.info/?l=linux-kernel&m=143633485824975&w=2
> 
> @Steven: Can you please give this a try?
> 

Yes, that seems to fix my issue as well.

Tested-by: Steven Rostedt 

Thanks!

-- Steve


Re: [BUG] Kernel splat when taking CPUs offline

2015-07-08 Thread Viresh Kumar
On 09-07-15, 02:13, Rafael J. Wysocki wrote:
> So the cpufreq driver's ->get() callback returns 0 for the given CPU and
> that's what triggers the WARN_ON().  And it most likely returns 0, because
> its internal data structure for that CPU is not present.
> 
> I *guess* that before the above commit policy was NULL in
> cpufreq_update_policy() and we didn't get to the point where ->get()
> was called.

I am not sure that behavior should have changed at all. Earlier we
were clearing the per-cpu cpufreq_cpu_data for offline CPUs, and so
policy would have been NULL for offline CPUs.

Now that per-cpu variable isn't cleared, but cpufreq_cpu_get() does
check whether the CPU is part of policy->cpus, i.e. whether it is
online. And so policy should still be NULL for offline CPUs.
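
To make that check concrete, here is a small userspace sketch of the
lookup, assuming a fixed four-CPU machine. The names policy_model,
cpu_data_model and policy_get_model() are hypothetical; this is a model of
the behavior described above, not the cpufreq core code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4

/* Hypothetical stand-in for a cpufreq policy; not the kernel struct. */
struct policy_model {
	bool cpus[NR_CPUS];	/* models policy->cpus: CPUs currently online */
};

/* Models the per-cpu policy pointer, which is not cleared on offline. */
static struct policy_model *cpu_data_model[NR_CPUS];

/* Return the policy only if the CPU is still in its online mask. */
static struct policy_model *policy_get_model(unsigned int cpu)
{
	struct policy_model *policy;

	if (cpu >= NR_CPUS)
		return NULL;
	policy = cpu_data_model[cpu];
	if (policy == NULL || !policy->cpus[cpu])
		return NULL;
	return policy;
}

void policy_lookup_demo(void)
{
	/* CPUs 0 and 1 are online; CPU 2 went offline but keeps its pointer. */
	static struct policy_model shared = {
		.cpus = { true, true, false, false },
	};

	cpu_data_model[1] = &shared;
	cpu_data_model[2] = &shared;

	assert(policy_get_model(1) == &shared);	/* online: policy returned */
	assert(policy_get_model(2) == NULL);	/* offline: treated as absent */
}
```

With this model an offline CPU still yields NULL even though its per-cpu
pointer was never cleared, which is why the lookup change alone should not
have altered the behavior.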

I think it might be related to what I chased down yesterday:

http://marc.info/?l=linux-kernel&m=143633485824975&w=2

@Steven: Can you please give this a try?

-- 
viresh


Re: [BUG] Kernel splat when taking CPUs offline

2015-07-08 Thread Steven Rostedt
On Thu, 09 Jul 2015 02:13:45 +0200
"Rafael J. Wysocki"  wrote:


> > Initializing CPU#1
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 1609 at /home/rostedt/work/git/linux-trace.git/drivers/cpufreq/cpufreq.c:2350 cpufreq_update_policy+0xc8/0x139()
> 
> So the cpufreq driver's ->get() callback returns 0 for the given CPU and
> that's what triggers the WARN_ON().  And it most likely returns 0, because
> its internal data structure for that CPU is not present.
> 
> I *guess* that before the above commit policy was NULL in
> cpufreq_update_policy() and we didn't get to the point where ->get()
> was called.

Just some more info. That ->get() is get_cur_freq_on_cpu() (I added a
printk to find out what that was).

Also, after adding more printk()s (patch with the printks added below), I got this:

 # trace-cmd start -p mmiotrace  # offlines all but one CPU
 # trace-cmd start -p nop        # onlines the CPUs
 # trace-cmd start -p mmiotrace  # again offlines all but one CPU
 # trace-cmd start -p nop        # again onlines the CPUs

produces:


in mmio_trace_init
mmiotrace: Disabling non-boot CPUs...
smpboot: CPU 1 is now offline
exit free f252c180 (1)
mmiotrace: CPU1 is down.
Broke affinity for irq 28
smpboot: CPU 2 is now offline
exit free f252c260 (2)
mmiotrace: CPU2 is down.
Broke affinity for irq 4
Broke affinity for irq 25
Broke affinity for irq 26
Broke affinity for irq 27
Broke affinity for irq 28
smpboot: CPU 3 is now offline
exit free f252c280 (3)
mmiotrace: CPU3 is down.
mmiotrace: enabled.
in mmio_trace_start
in mmio_trace_reset
mmiotrace: Re-enabling CPUs...
x86: Booting SMP configuration:
smpboot: Booting Node 0 Processor 1 APIC 0x2
Initializing CPU#1
INIT data = f05a6b40 (1)
data=f05a6b40
data-acpi_data=f3539634
data-freq_table_data=f2073b00
exit free f05a6b40 (1)
mmiotrace: enabled CPU1.
smpboot: Booting Node 0 Processor 2 APIC 0x1
Initializing CPU#2
INIT data = efe567a0 (2)
data=efe567a0
data-acpi_data=f368b634
data-freq_table_data=ef849100
exit free efe567a0 (2)
mmiotrace: enabled CPU2.
smpboot: Booting Node 0 Processor 3 APIC 0x3
Initializing CPU#3
INIT data = efe56760 (3)
data=efe56760
data-acpi_data=f37dd634
data-freq_table_data=ef840600
exit free efe56760 (3)
mmiotrace: enabled CPU3.
mmiotrace: disabled.
in mmio_trace_init
mmiotrace: Disabling non-boot CPUs...
cpufreq: __cpufreq_remove_dev_prepare: Failed to stop governor
smpboot: CPU 1 is now offline
mmiotrace: CPU1 is down.
cpufreq: __cpufreq_remove_dev_prepare: Failed to stop governor
Broke affinity for irq 28
smpboot: CPU 2 is now offline
mmiotrace: CPU2 is down.
cpufreq: __cpufreq_remove_dev_prepare: Failed to stop governor
Broke affinity for irq 28
smpboot: CPU 3 is now offline
mmiotrace: CPU3 is down.
mmiotrace: enabled.
in mmio_trace_start
in mmio_trace_reset
mmiotrace: Re-enabling CPUs...
x86: Booting SMP configuration:
smpboot: Booting Node 0 Processor 1 APIC 0x2
Initializing CPU#1
get=get_cur_freq_on_cpu+0x0/0xe9
data=  (null)
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1994 at /home/rostedt/work/git/linux-trace.git/drivers/cpufreq/cpufreq.c:2351 cpufreq_update_policy+0xe8/0x159()
Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 
ip6table_filter ip6_tables ipv6 microcode r8169 ppdev parport_pc parport
CPU: 0 PID: 1994 Comm: trace-cmd Not tainted 4.2.0-rc1-test+ #30
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
   efa11b54 c0cd0386 c10d4414 efa11b84 c0440fbe c101046c
  07ca c10d4414 092f c0a6db4a c0a6db4a f146cc00 
 efa11d60 efa11b94 c0440ff7 0009  efa11d6c c0a6db4a c10d4e15
Call Trace:
 [] dump_stack+0x41/0x52
 [] warn_slowpath_common+0x9d/0xb4
 [] ? cpufreq_update_policy+0xe8/0x159
 [] ? cpufreq_update_policy+0xe8/0x159
 [] warn_slowpath_null+0x22/0x24
 [] cpufreq_update_policy+0xe8/0x159
 [] ? extract_freq+0xa1/0xa1
 [] ? cpufreq_update_policy+0x159/0x159
 [] ? cpufreq_update_policy+0x3b/0x159
 [] ? cpufreq_freq_transition_begin+0x97/0xd9
 [] ? __wake_up+0x1a/0x47
 [] acpi_processor_ppc_has_changed+0x54/0x5d
 [] acpi_cpu_soft_notify+0xb0/0xf1
 [] ? compute_batch_value+0xd/0x22
 [] ? percpu_counter_hotcpu_callback+0x11/0x80
 [] notifier_call_chain+0x68/0x91
 [] __raw_notifier_call_chain+0x1e/0x23
 [] __cpu_notify+0x24/0x39
 [] _cpu_up+0xef/0x105
 [] cpu_up+0x4e/0x5f
 [] ? find_next_bit+0x1a/0x20
 [] disable_mmiotrace+0xd4/0x13e
 [] mmio_trace_reset+0x36/0x5e
 [] tracing_set_tracer+0xb1/0x155
 [] ? _copy_from_user+0x42/0x57
 [] tracing_set_trace_write+0x6a/0x80
 [] ? handle_mm_fault+0x75b/0xc42
 [] ? file_start_write+0x27/0x29
 [] ? tracing_set_tracer+0x155/0x155
 [] __vfs_write+0x24/0x9b
 [] ? file_start_write+0x27/0x29
 [] ? rw_verify_area+0xce/0xef
 [] ? __do_page_fault+0x2be/0x3be
 [] vfs_write+0x7a/0xc4
 [] SyS_write+0x54/0x7f
 [] sysenter_do_call+0x12/0x12
---[ end trace 47cc28ca9538eb2d ]---
mmiotrace: enabled CPU1.
smpboot: Booting Node 0 Processor 2 APIC 0x1
Initializing CPU#2
get=get_

Re: [BUG] Kernel splat when taking CPUs offline

2015-07-08 Thread Rafael J. Wysocki
On Wednesday, July 08, 2015 03:24:56 PM Steven Rostedt wrote:
> 
> My tests for ftrace include testing the mmiotracer, which requires
> taking all but one CPU offline in order to run. This test crashed
> every so often, and I was able to bisect down to this commit:
> 
> commit 87549141d516 ("cpufreq: Stop migrating sysfs files on hotplug")

Thanks for the report, adding linux-pm and linux-acpi to the CC.


> Just to make sure this wasn't just the mmiotracer causing the issue, I
> was able to trigger this same bug by simply doing the following:
> 
> 
> (on a 4 cpu machine)
> 
> 
>  # echo 0 > /sys/devices/system/cpu/cpu1/online 
>  # echo 0 > /sys/devices/system/cpu/cpu2/online 
>  # echo 0 > /sys/devices/system/cpu/cpu3/online 
>  # echo 1 > /sys/devices/system/cpu/cpu1/online 
>  # echo 1 > /sys/devices/system/cpu/cpu2/online 
>  # echo 1 > /sys/devices/system/cpu/cpu3/online 
>  # echo 0 > /sys/devices/system/cpu/cpu1/online 
>  # echo 0 > /sys/devices/system/cpu/cpu2/online 
>  # echo 0 > /sys/devices/system/cpu/cpu2/online 
>  # echo 0 > /sys/devices/system/cpu/cpu3/online 
>  # echo 1 > /sys/devices/system/cpu/cpu1/online 
>  # echo 1 > /sys/devices/system/cpu/cpu2/online 
>  # echo 1 > /sys/devices/system/cpu/cpu3/online 
> 
> It usually takes two or three tries (shutting down all but one CPU, and
> starting them again) before it triggers.
> 
> Here's the splat:
> 
> Initializing CPU#1
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 1609 at /home/rostedt/work/git/linux-trace.git/drivers/cpufreq/cpufreq.c:2350 cpufreq_update_policy+0xc8/0x139()

So the cpufreq driver's ->get() callback returns 0 for the given CPU and
that's what triggers the WARN_ON().  And it most likely returns 0, because
its internal data structure for that CPU is not present.

I *guess* that before the above commit policy was NULL in
cpufreq_update_policy() and we didn't get to the point where ->get()
was called.
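
As an illustration of that path, here is a userspace sketch, assuming a
driver whose ->get() returns 0 when it has no per-CPU data. The names
drv_data_model, drv_data, get_freq_model() and update_policy_model() are
hypothetical, the -EIO return value is chosen for the example, and the
fprintf() merely stands in for the WARN_ON().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

#define NR_CPUS 4

/* Hypothetical driver-private data for one CPU. */
struct drv_data_model {
	unsigned int cur_khz;
};

/* Models the driver's per-CPU data pointers; NULL once it is torn down. */
static struct drv_data_model *drv_data[NR_CPUS];

/* Models ->get(): report 0 when the driver has no data for this CPU. */
static unsigned int get_freq_model(unsigned int cpu)
{
	if (cpu >= NR_CPUS || drv_data[cpu] == NULL)
		return 0;
	return drv_data[cpu]->cur_khz;
}

/* Models the update path: a frequency of 0 is warned about and rejected. */
static int update_policy_model(unsigned int cpu, unsigned int *cur_khz)
{
	unsigned int freq = get_freq_model(cpu);

	if (freq == 0) {
		fprintf(stderr, "WARNING: get() returned 0 for CPU %u\n", cpu);
		return -EIO;
	}
	*cur_khz = freq;
	return 0;
}

void update_policy_demo(void)
{
	struct drv_data_model d1 = { .cur_khz = 800000 };
	unsigned int khz = 0;

	/* Driver data present: the update succeeds. */
	drv_data[1] = &d1;
	assert(update_policy_model(1, &khz) == 0 && khz == 800000);

	/* Driver data gone, but the update is still attempted: warning. */
	drv_data[1] = NULL;
	assert(update_policy_model(1, &khz) == -EIO);
}
```

In this model the warning can only fire if the update is attempted for a
CPU whose driver data is missing, which is the situation guessed at above.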

There seem to be a couple of ways to address that, but I'd like Viresh to have
a look at this too.


> Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 
> nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 ppdev parport_pc r8169 parport 
> microcode
> CPU: 0 PID: 1609 Comm: bash Tainted: GW   4.2.0-rc1-test #26
> Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
>    ee47db9c c0cd04e6 c10d4463 ee47dbcc c0440fbe c1010460
>   0649 c10d4463 092e c0a6dd28 c0a6dd28 f13fd600 
>  ee47dda8 ee47dbdc c0440ff7 0009  ee47ddb8 c0a6dd28 efb01bc0
> Call Trace:
>  [] dump_stack+0x41/0x52
>  [] warn_slowpath_common+0x9d/0xb4
>  [] ? cpufreq_update_policy+0xc8/0x139
>  [] ? cpufreq_update_policy+0xc8/0x139
>  [] warn_slowpath_null+0x22/0x24
>  [] cpufreq_update_policy+0xc8/0x139
>  [] ? cpufreq_update_policy+0x139/0x139
>  [] ? cpufreq_update_policy+0x3b/0x139
>  [] ? cpufreq_freq_transition_begin+0x97/0xd9
>  [] ? __wake_up+0x1a/0x47
>  [] acpi_processor_ppc_has_changed+0x54/0x5d
>  [] acpi_cpu_soft_notify+0xb0/0xf1
>  [] ? compute_batch_value+0xd/0x22
>  [] ? percpu_counter_hotcpu_callback+0x11/0x80
>  [] notifier_call_chain+0x68/0x91
>  [] ? sched_debug_header+0x15c/0x58e
>  [] __raw_notifier_call_chain+0x1e/0x23
>  [] __cpu_notify+0x24/0x39
>  [] _cpu_up+0xef/0x105
>  [] cpu_up+0x4e/0x5f
>  [] cpu_subsys_online+0x13/0x15
>  [] device_online+0x45/0x6e
>  [] online_store+0x32/0x4f
>  [] ? device_online+0x6e/0x6e
>  [] dev_attr_store+0x24/0x29
>  [] sysfs_kf_write+0x3a/0x41
>  [] ? sysfs_file_ops+0x48/0x48
>  [] kernfs_fop_write+0xe2/0x11f
>  [] ? kernfs_vma_page_mkwrite+0x6c/0x6c
>  [] __vfs_write+0x24/0x9b
>  [] ? file_start_write+0x27/0x29
>  [] ? rw_verify_area+0xce/0xef
>  [] vfs_write+0x7a/0xc4
>  [] SyS_write+0x54/0x7f
>  [] sysenter_do_call+0x12/0x12
> ---[ end trace e2c32eead4f4e541 ]---
> 
> I'll dig more into it, but wanted to give people a heads up.




[BUG] Kernel splat when taking CPUs offline

2015-07-08 Thread Steven Rostedt

My tests for ftrace include testing the mmiotracer, which requires
taking all but one CPU offline in order to run. This test crashed
every so often, and I was able to bisect down to this commit:

commit 87549141d516 ("cpufreq: Stop migrating sysfs files on hotplug")


Just to make sure this wasn't just the mmiotracer causing the issue, I
was able to trigger this same bug by simply doing the following:


(on a 4 cpu machine)


 # echo 0 > /sys/devices/system/cpu/cpu1/online 
 # echo 0 > /sys/devices/system/cpu/cpu2/online 
 # echo 0 > /sys/devices/system/cpu/cpu3/online 
 # echo 1 > /sys/devices/system/cpu/cpu1/online 
 # echo 1 > /sys/devices/system/cpu/cpu2/online 
 # echo 1 > /sys/devices/system/cpu/cpu3/online 
 # echo 0 > /sys/devices/system/cpu/cpu1/online 
 # echo 0 > /sys/devices/system/cpu/cpu2/online 
 # echo 0 > /sys/devices/system/cpu/cpu2/online 
 # echo 0 > /sys/devices/system/cpu/cpu3/online 
 # echo 1 > /sys/devices/system/cpu/cpu1/online 
 # echo 1 > /sys/devices/system/cpu/cpu2/online 
 # echo 1 > /sys/devices/system/cpu/cpu3/online 

It usually takes two or three tries (shutting down all but one CPU, and
starting them again) before it triggers.

Here's the splat:

Initializing CPU#1
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1609 at /home/rostedt/work/git/linux-trace.git/drivers/cpufreq/cpufreq.c:2350 cpufreq_update_policy+0xc8/0x139()
Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 
ip6table_filter ip6_tables ipv6 ppdev parport_pc r8169 parport microcode
CPU: 0 PID: 1609 Comm: bash Tainted: GW   4.2.0-rc1-test #26
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
   ee47db9c c0cd04e6 c10d4463 ee47dbcc c0440fbe c1010460
  0649 c10d4463 092e c0a6dd28 c0a6dd28 f13fd600 
 ee47dda8 ee47dbdc c0440ff7 0009  ee47ddb8 c0a6dd28 efb01bc0
Call Trace:
 [] dump_stack+0x41/0x52
 [] warn_slowpath_common+0x9d/0xb4
 [] ? cpufreq_update_policy+0xc8/0x139
 [] ? cpufreq_update_policy+0xc8/0x139
 [] warn_slowpath_null+0x22/0x24
 [] cpufreq_update_policy+0xc8/0x139
 [] ? cpufreq_update_policy+0x139/0x139
 [] ? cpufreq_update_policy+0x3b/0x139
 [] ? cpufreq_freq_transition_begin+0x97/0xd9
 [] ? __wake_up+0x1a/0x47
 [] acpi_processor_ppc_has_changed+0x54/0x5d
 [] acpi_cpu_soft_notify+0xb0/0xf1
 [] ? compute_batch_value+0xd/0x22
 [] ? percpu_counter_hotcpu_callback+0x11/0x80
 [] notifier_call_chain+0x68/0x91
 [] ? sched_debug_header+0x15c/0x58e
 [] __raw_notifier_call_chain+0x1e/0x23
 [] __cpu_notify+0x24/0x39
 [] _cpu_up+0xef/0x105
 [] cpu_up+0x4e/0x5f
 [] cpu_subsys_online+0x13/0x15
 [] device_online+0x45/0x6e
 [] online_store+0x32/0x4f
 [] ? device_online+0x6e/0x6e
 [] dev_attr_store+0x24/0x29
 [] sysfs_kf_write+0x3a/0x41
 [] ? sysfs_file_ops+0x48/0x48
 [] kernfs_fop_write+0xe2/0x11f
 [] ? kernfs_vma_page_mkwrite+0x6c/0x6c
 [] __vfs_write+0x24/0x9b
 [] ? file_start_write+0x27/0x29
 [] ? rw_verify_area+0xce/0xef
 [] vfs_write+0x7a/0xc4
 [] SyS_write+0x54/0x7f
 [] sysenter_do_call+0x12/0x12
---[ end trace e2c32eead4f4e541 ]---

I'll dig more into it, but wanted to give people a heads up.

-- Steve
