Re: [BUG] Kernel splat when taking CPUs offline
On 09-07-15, 00:25, Steven Rostedt wrote:
> On Thu, 9 Jul 2015 09:34:45 +0530
> Viresh Kumar wrote:
> 
> > I think it might be related to what I chased down yesterday:
> > 
> > http://marc.info/?l=linux-kernel&m=143633485824975&w=2
> > 
> > @Steven: Can you please give this a try?
> 
> Yes, that seems to fix my issue as well.
> 
> Tested-by: Steven Rostedt

Awesome, so the problem was that cpufreq_set_policy() was failing because
of the latest bug I planted :), and that caused ->exit() to be called
without the policy being freed completely. (I have fixed that as well, in
a separate patch.) And so you were hitting a policy that had already
exited.

Sorry about that :)

-- 
viresh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] Kernel splat when taking CPUs offline
On Thu, 9 Jul 2015 09:34:45 +0530
Viresh Kumar wrote:

> I think it might be related to what I chased down yesterday:
> 
> http://marc.info/?l=linux-kernel&m=143633485824975&w=2
> 
> @Steven: Can you please give this a try?

Yes, that seems to fix my issue as well.

Tested-by: Steven Rostedt

Thanks!

-- Steve
Re: [BUG] Kernel splat when taking CPUs offline
On 09-07-15, 02:13, Rafael J. Wysocki wrote:
> So the cpufreq driver's ->get() callback returns 0 for the given CPU and
> that's what triggers the WARN_ON(). And it most likely returns 0, because
> its internal data structure for that CPU is not present.
> 
> I *guess* that before the above commit, policy was NULL in
> cpufreq_update_policy() and we didn't get to the point where ->get() was
> called.

I am not sure that behavior should have changed at all.

Earlier we were clearing the per-cpu cpufreq_cpu_data for offline CPUs, and
so policy would have been NULL for offline CPUs. Now that per-cpu variable
isn't cleared, but cpufreq_cpu_get() does check whether the CPU is part of
policy->cpus or not, i.e. whether it is offline. And so policy should still
be NULL for offline CPUs.

I think it might be related to what I chased down yesterday:

http://marc.info/?l=linux-kernel&m=143633485824975&w=2

@Steven: Can you please give this a try?

-- 
viresh
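The lookup rule Viresh describes can be sketched as a small userspace model. This is not kernel code: `struct model_policy`, `model_cpu_data[]`, `model_cpu_offline()` and `model_cpufreq_cpu_get()` are hypothetical stand-ins for struct cpufreq_policy, the per-cpu cpufreq_cpu_data pointer and cpufreq_cpu_get(); a 4-CPU machine is assumed, and the real function's locking and reference counting are left out. It only shows why a stale per-cpu pointer should still yield NULL for an offline CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: a 4-CPU machine, as in the report. */
#define MODEL_NR_CPUS 4

/* Hypothetical stand-in for struct cpufreq_policy; only the CPU mask is modelled. */
struct model_policy {
	unsigned long cpus;	/* bit N set: CPU N is online and covered by this policy */
};

/* Hypothetical stand-in for the per-cpu cpufreq_cpu_data pointer. */
static struct model_policy *model_cpu_data[MODEL_NR_CPUS];

/*
 * Offline as described in the thread: the per-cpu pointer is left in
 * place and only the CPU's bit in the policy mask is cleared.
 */
static void model_cpu_offline(unsigned int cpu)
{
	struct model_policy *policy;

	assert(cpu < MODEL_NR_CPUS);
	policy = model_cpu_data[cpu];
	if (policy)
		policy->cpus &= ~(1UL << cpu);
}

/*
 * Lookup with the membership check: a stale pointer for an offline CPU
 * still yields NULL. Locking and reference counting are omitted.
 */
static struct model_policy *model_cpufreq_cpu_get(unsigned int cpu)
{
	struct model_policy *policy;

	assert(cpu < MODEL_NR_CPUS);
	policy = model_cpu_data[cpu];
	if (policy && (policy->cpus & (1UL << cpu)))
		return policy;
	return NULL;
}
```

For example, after `model_cpu_offline(1)` the per-cpu slot still points at CPU 1's policy, but `model_cpufreq_cpu_get(1)` returns NULL. In this model, a policy can only be handed back for an offline CPU if the offline path was skipped or left incomplete and the CPU's bit stayed set.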
Re: [BUG] Kernel splat when taking CPUs offline
On Thu, 09 Jul 2015 02:13:45 +0200
"Rafael J. Wysocki" wrote:

> > Initializing CPU#1
> > [ cut here ]
> > WARNING: CPU: 0 PID: 1609 at
> > /home/rostedt/work/git/linux-trace.git/drivers/cpufreq/cpufreq.c:2350
> > cpufreq_update_policy+0xc8/0x139()
> 
> So the cpufreq driver's ->get() callback returns 0 for the given CPU and
> that's what triggers the WARN_ON(). And it most likely returns 0, because
> its internal data structure for that CPU is not present.
> 
> I *guess* that before the above commit, policy was NULL in
> cpufreq_update_policy() and we didn't get to the point where ->get() was
> called.

Just some more info. That ->get() is get_cur_freq_on_cpu() (I added a
printk to find out what that was).

Also, after adding more printk()s (patch of printks added below), I got
this:

# trace-cmd start -p mmiotrace   # offlines all but one CPU
# trace-cmd start -p nop         # onlines the CPUs
# trace-cmd start -p mmiotrace   # again offlines all but one CPU
# trace-cmd start -p nop         # again onlines the CPUs

produces:

in mmio_trace_init
mmiotrace: Disabling non-boot CPUs...
smpboot: CPU 1 is now offline
exit free f252c180 (1)
mmiotrace: CPU1 is down.
Broke affinity for irq 28
smpboot: CPU 2 is now offline
exit free f252c260 (2)
mmiotrace: CPU2 is down.
Broke affinity for irq 4
Broke affinity for irq 25
Broke affinity for irq 26
Broke affinity for irq 27
Broke affinity for irq 28
smpboot: CPU 3 is now offline
exit free f252c280 (3)
mmiotrace: CPU3 is down.
mmiotrace: enabled.
in mmio_trace_start
in mmio_trace_reset
mmiotrace: Re-enabling CPUs...
x86: Booting SMP configuration:
smpboot: Booting Node 0 Processor 1 APIC 0x2
Initializing CPU#1
INIT data = f05a6b40 (1)
data=f05a6b40 data-acpi_data=f3539634 data-freq_table_data=f2073b00
exit free f05a6b40 (1)
mmiotrace: enabled CPU1.
smpboot: Booting Node 0 Processor 2 APIC 0x1
Initializing CPU#2
INIT data = efe567a0 (2)
data=efe567a0 data-acpi_data=f368b634 data-freq_table_data=ef849100
exit free efe567a0 (2)
mmiotrace: enabled CPU2.
smpboot: Booting Node 0 Processor 3 APIC 0x3
Initializing CPU#3
INIT data = efe56760 (3)
data=efe56760 data-acpi_data=f37dd634 data-freq_table_data=ef840600
exit free efe56760 (3)
mmiotrace: enabled CPU3.
mmiotrace: disabled.
in mmio_trace_init
mmiotrace: Disabling non-boot CPUs...
cpufreq: __cpufreq_remove_dev_prepare: Failed to stop governor
smpboot: CPU 1 is now offline
mmiotrace: CPU1 is down.
cpufreq: __cpufreq_remove_dev_prepare: Failed to stop governor
Broke affinity for irq 28
smpboot: CPU 2 is now offline
mmiotrace: CPU2 is down.
cpufreq: __cpufreq_remove_dev_prepare: Failed to stop governor
Broke affinity for irq 28
smpboot: CPU 3 is now offline
mmiotrace: CPU3 is down.
mmiotrace: enabled.
in mmio_trace_start
in mmio_trace_reset
mmiotrace: Re-enabling CPUs...
x86: Booting SMP configuration:
smpboot: Booting Node 0 Processor 1 APIC 0x2
Initializing CPU#1
get=get_cur_freq_on_cpu+0x0/0xe9 data= (null)
[ cut here ]
WARNING: CPU: 0 PID: 1994 at /home/rostedt/work/git/linux-trace.git/drivers/cpufreq/cpufreq.c:2351 cpufreq_update_policy+0xe8/0x159()
Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 microcode r8169 ppdev parport_pc parport
CPU: 0 PID: 1994 Comm: trace-cmd Not tainted 4.2.0-rc1-test+ #30
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
efa11b54 c0cd0386 c10d4414 efa11b84 c0440fbe c101046c
07ca c10d4414 092f c0a6db4a c0a6db4a f146cc00
efa11d60 efa11b94 c0440ff7 0009 efa11d6c c0a6db4a c10d4e15
Call Trace:
 dump_stack+0x41/0x52
 warn_slowpath_common+0x9d/0xb4
 ? cpufreq_update_policy+0xe8/0x159
 ? cpufreq_update_policy+0xe8/0x159
 warn_slowpath_null+0x22/0x24
 cpufreq_update_policy+0xe8/0x159
 ? extract_freq+0xa1/0xa1
 ? cpufreq_update_policy+0x159/0x159
 ? cpufreq_update_policy+0x3b/0x159
 ? cpufreq_freq_transition_begin+0x97/0xd9
 ? __wake_up+0x1a/0x47
 acpi_processor_ppc_has_changed+0x54/0x5d
 acpi_cpu_soft_notify+0xb0/0xf1
 ? compute_batch_value+0xd/0x22
 ? percpu_counter_hotcpu_callback+0x11/0x80
 notifier_call_chain+0x68/0x91
 __raw_notifier_call_chain+0x1e/0x23
 __cpu_notify+0x24/0x39
 _cpu_up+0xef/0x105
 cpu_up+0x4e/0x5f
 ? find_next_bit+0x1a/0x20
 disable_mmiotrace+0xd4/0x13e
 mmio_trace_reset+0x36/0x5e
 tracing_set_tracer+0xb1/0x155
 ? _copy_from_user+0x42/0x57
 tracing_set_trace_write+0x6a/0x80
 ? handle_mm_fault+0x75b/0xc42
 ? file_start_write+0x27/0x29
 ? tracing_set_tracer+0x155/0x155
 __vfs_write+0x24/0x9b
 ? file_start_write+0x27/0x29
 ? rw_verify_area+0xce/0xef
 ? __do_page_fault+0x2be/0x3be
 vfs_write+0x7a/0xc4
 SyS_write+0x54/0x7f
 sysenter_do_call+0x12/0x12
---[ end trace 47cc28ca9538eb2d ]---
mmiotrace: enabled CPU1.
smpboot: Booting Node 0 Processor 2 APIC 0x1
Initializing CPU#2
get=get_
Re: [BUG] Kernel splat when taking CPUs offline
On Wednesday, July 08, 2015 03:24:56 PM Steven Rostedt wrote:
> 
> My tests for ftrace include testing the mmiotracer, which requires
> taking all but one of the CPUs offline in order to run. This test crashed
> every so often, and I was able to bisect down to this commit:
> 
> commit 87549141d516 ("cpufreq: Stop migrating sysfs files on hotplug")

Thanks for the report, adding linux-pm and linux-acpi to the CC.

> Just to make sure this wasn't just the mmiotracer causing the issue, I
> was able to trigger this same bug by simply doing the following:
> 
> (on a 4 cpu machine)
> 
> # echo 0 > /sys/devices/system/cpu/cpu1/online
> # echo 0 > /sys/devices/system/cpu/cpu2/online
> # echo 0 > /sys/devices/system/cpu/cpu3/online
> # echo 1 > /sys/devices/system/cpu/cpu1/online
> # echo 1 > /sys/devices/system/cpu/cpu2/online
> # echo 1 > /sys/devices/system/cpu/cpu3/online
> # echo 0 > /sys/devices/system/cpu/cpu1/online
> # echo 0 > /sys/devices/system/cpu/cpu2/online
> # echo 0 > /sys/devices/system/cpu/cpu3/online
> # echo 1 > /sys/devices/system/cpu/cpu1/online
> # echo 1 > /sys/devices/system/cpu/cpu2/online
> # echo 1 > /sys/devices/system/cpu/cpu3/online
> 
> It usually takes two or three tries (shutting down all but one CPU, and
> starting them again) before it triggers.
> 
> Here's the splat:
> 
> Initializing CPU#1
> [ cut here ]
> WARNING: CPU: 0 PID: 1609 at
> /home/rostedt/work/git/linux-trace.git/drivers/cpufreq/cpufreq.c:2350
> cpufreq_update_policy+0xc8/0x139()

So the cpufreq driver's ->get() callback returns 0 for the given CPU and
that's what triggers the WARN_ON(). And it most likely returns 0, because
its internal data structure for that CPU is not present.

I *guess* that before the above commit, policy was NULL in
cpufreq_update_policy() and we didn't get to the point where ->get() was
called.

There seem to be a couple of ways to address that, but I'd like Viresh to
have a look at this too.

> Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
> nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 ppdev parport_pc r8169 parport
> microcode
> CPU: 0 PID: 1609 Comm: bash Tainted: GW 4.2.0-rc1-test #26
> Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
> ee47db9c c0cd04e6 c10d4463 ee47dbcc c0440fbe c1010460
> 0649 c10d4463 092e c0a6dd28 c0a6dd28 f13fd600
> ee47dda8 ee47dbdc c0440ff7 0009 ee47ddb8 c0a6dd28 efb01bc0
> Call Trace:
>  dump_stack+0x41/0x52
>  warn_slowpath_common+0x9d/0xb4
>  ? cpufreq_update_policy+0xc8/0x139
>  ? cpufreq_update_policy+0xc8/0x139
>  warn_slowpath_null+0x22/0x24
>  cpufreq_update_policy+0xc8/0x139
>  ? cpufreq_update_policy+0x139/0x139
>  ? cpufreq_update_policy+0x3b/0x139
>  ? cpufreq_freq_transition_begin+0x97/0xd9
>  ? __wake_up+0x1a/0x47
>  acpi_processor_ppc_has_changed+0x54/0x5d
>  acpi_cpu_soft_notify+0xb0/0xf1
>  ? compute_batch_value+0xd/0x22
>  ? percpu_counter_hotcpu_callback+0x11/0x80
>  notifier_call_chain+0x68/0x91
>  ? sched_debug_header+0x15c/0x58e
>  __raw_notifier_call_chain+0x1e/0x23
>  __cpu_notify+0x24/0x39
>  _cpu_up+0xef/0x105
>  cpu_up+0x4e/0x5f
>  cpu_subsys_online+0x13/0x15
>  device_online+0x45/0x6e
>  online_store+0x32/0x4f
>  ? device_online+0x6e/0x6e
>  dev_attr_store+0x24/0x29
>  sysfs_kf_write+0x3a/0x41
>  ? sysfs_file_ops+0x48/0x48
>  kernfs_fop_write+0xe2/0x11f
>  ? kernfs_vma_page_mkwrite+0x6c/0x6c
>  __vfs_write+0x24/0x9b
>  ? file_start_write+0x27/0x29
>  ? rw_verify_area+0xce/0xef
>  vfs_write+0x7a/0xc4
>  SyS_write+0x54/0x7f
>  sysenter_do_call+0x12/0x12
> ---[ end trace e2c32eead4f4e541 ]---
> 
> I'll dig more into it, but wanted to give people a heads up.
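Rafael's reading of the splat (the driver's ->get() returns 0, which trips the WARN_ON() in cpufreq_update_policy(), whereas a NULL policy would have bailed out before ->get() was ever called) can be illustrated with a toy userspace model. All names here (`struct model_drv_data`, `model_get()`, `model_warn_on()`, `model_update_policy()`) are hypothetical; the `policy_found` flag stands in for the policy lookup, and the -ENODEV/-EIO return values are assumptions of the model, not a statement about the exact kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define MODEL_NR_CPUS 4

/* Hypothetical per-CPU driver data; NULL once the driver's ->exit() has freed it. */
struct model_drv_data {
	unsigned int cur_khz;
};

static struct model_drv_data *model_drv_data[MODEL_NR_CPUS];
static unsigned int model_warnings;

/* Stand-in for a driver ->get() callback: 0 means "current frequency unknown". */
static unsigned int model_get(unsigned int cpu)
{
	const struct model_drv_data *data = model_drv_data[cpu];

	return data ? data->cur_khz : 0;
}

/* Stand-in for WARN_ON(): log and count the warning, then return the condition. */
static bool model_warn_on(bool cond, const char *msg)
{
	if (cond) {
		model_warnings++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

/*
 * Rough shape of the update path discussed above: return early when no
 * policy is found, otherwise ask the driver for the current frequency
 * and warn if it reports 0. Error codes are assumptions of the model.
 */
static int model_update_policy(unsigned int cpu, bool policy_found)
{
	unsigned int cur;

	assert(cpu < MODEL_NR_CPUS);
	if (!policy_found)
		return -ENODEV;

	cur = model_get(cpu);
	if (model_warn_on(cur == 0, "->get() returned 0"))
		return -EIO;
	return 0;
}
```

With CPU 1's driver data freed, `model_update_policy(1, false)` returns -ENODEV without calling ->get(), while `model_update_policy(1, true)` calls it, gets 0 and warns, which mirrors the before/after behaviour Rafael contrasts.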
[BUG] Kernel splat when taking CPUs offline
My tests for ftrace include testing the mmiotracer, which requires
taking all but one of the CPUs offline in order to run. This test crashed
every so often, and I was able to bisect down to this commit:

commit 87549141d516 ("cpufreq: Stop migrating sysfs files on hotplug")

Just to make sure this wasn't just the mmiotracer causing the issue, I
was able to trigger this same bug by simply doing the following:

(on a 4 cpu machine)

# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > /sys/devices/system/cpu/cpu2/online
# echo 0 > /sys/devices/system/cpu/cpu3/online
# echo 1 > /sys/devices/system/cpu/cpu1/online
# echo 1 > /sys/devices/system/cpu/cpu2/online
# echo 1 > /sys/devices/system/cpu/cpu3/online
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > /sys/devices/system/cpu/cpu2/online
# echo 0 > /sys/devices/system/cpu/cpu3/online
# echo 1 > /sys/devices/system/cpu/cpu1/online
# echo 1 > /sys/devices/system/cpu/cpu2/online
# echo 1 > /sys/devices/system/cpu/cpu3/online

It usually takes two or three tries (shutting down all but one CPU, and
starting them again) before it triggers.

Here's the splat:

Initializing CPU#1
[ cut here ]
WARNING: CPU: 0 PID: 1609 at /home/rostedt/work/git/linux-trace.git/drivers/cpufreq/cpufreq.c:2350 cpufreq_update_policy+0xc8/0x139()
Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 ppdev parport_pc r8169 parport microcode
CPU: 0 PID: 1609 Comm: bash Tainted: GW 4.2.0-rc1-test #26
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
ee47db9c c0cd04e6 c10d4463 ee47dbcc c0440fbe c1010460
0649 c10d4463 092e c0a6dd28 c0a6dd28 f13fd600
ee47dda8 ee47dbdc c0440ff7 0009 ee47ddb8 c0a6dd28 efb01bc0
Call Trace:
 dump_stack+0x41/0x52
 warn_slowpath_common+0x9d/0xb4
 ? cpufreq_update_policy+0xc8/0x139
 ? cpufreq_update_policy+0xc8/0x139
 warn_slowpath_null+0x22/0x24
 cpufreq_update_policy+0xc8/0x139
 ? cpufreq_update_policy+0x139/0x139
 ? cpufreq_update_policy+0x3b/0x139
 ? cpufreq_freq_transition_begin+0x97/0xd9
 ? __wake_up+0x1a/0x47
 acpi_processor_ppc_has_changed+0x54/0x5d
 acpi_cpu_soft_notify+0xb0/0xf1
 ? compute_batch_value+0xd/0x22
 ? percpu_counter_hotcpu_callback+0x11/0x80
 notifier_call_chain+0x68/0x91
 ? sched_debug_header+0x15c/0x58e
 __raw_notifier_call_chain+0x1e/0x23
 __cpu_notify+0x24/0x39
 _cpu_up+0xef/0x105
 cpu_up+0x4e/0x5f
 cpu_subsys_online+0x13/0x15
 device_online+0x45/0x6e
 online_store+0x32/0x4f
 ? device_online+0x6e/0x6e
 dev_attr_store+0x24/0x29
 sysfs_kf_write+0x3a/0x41
 ? sysfs_file_ops+0x48/0x48
 kernfs_fop_write+0xe2/0x11f
 ? kernfs_vma_page_mkwrite+0x6c/0x6c
 __vfs_write+0x24/0x9b
 ? file_start_write+0x27/0x29
 ? rw_verify_area+0xce/0xef
 vfs_write+0x7a/0xc4
 SyS_write+0x54/0x7f
 sysenter_do_call+0x12/0x12
---[ end trace e2c32eead4f4e541 ]---

I'll dig more into it, but wanted to give people a heads up.

-- Steve
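For anyone scripting the reproducer, the echo sequence above can be sketched in C. This is a hypothetical helper, not part of the report: `write_flag()`, `set_cpu_online()` and `hotplug_round()` are illustrative names, the code assumes the standard /sys/devices/system/cpu/cpuN/online interface, it needs root, and it really does take CPUs offline, so it should only be run on a test machine.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "0" or "1" to a sysfs-style attribute. Returns 0 on success,
 * -1 on error. A failed sysfs write is typically reported when the
 * stream is flushed, so fclose()'s result is checked as well.
 */
static int write_flag(const char *path, int value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(value ? "1" : "0", f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/* Set CPU @cpu online (1) or offline (0) below @root, e.g. "/sys/devices/system/cpu". */
static int set_cpu_online(const char *root, unsigned int cpu, int online)
{
	char path[256];
	int n;

	assert(root != NULL);
	n = snprintf(path, sizeof(path), "%s/cpu%u/online", root, cpu);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* path did not fit in the buffer */
	return write_flag(path, online);
}

/* One round of the sequence: offline CPUs 1..ncpus-1, then bring them back. */
static int hotplug_round(const char *root, unsigned int ncpus)
{
	unsigned int cpu;

	for (cpu = 1; cpu < ncpus; cpu++)
		if (set_cpu_online(root, cpu, 0) != 0)
			return -1;
	for (cpu = 1; cpu < ncpus; cpu++)
		if (set_cpu_online(root, cpu, 1) != 0)
			return -1;
	return 0;
}
```

For the 4-CPU machine in the report this would be `hotplug_round("/sys/devices/system/cpu", 4)`, called two or three times in a row, since the splat usually needs more than one round to show up.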