Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Mon, Feb 13, 2017 at 02:26:20AM +0100, Gabriel C wrote:
> I didn't tested your patch yet but did a boot with mce=off and nomce
> which seems to not really works since is still want to mc_device_add()
> even when off.

mc_device_add() is microcode loader's ->add_dev() subsys pointer and
that's not from mce. From mce you should be seeing only (with the debug
patch applied):

[1.717508] mce: mcheck_init_device: entry
[1.718769] mce: Unable to init device /dev/mcelog (rc: -5)

> See :
>
> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_mce_off.jpg

That looks like core 13 got the NMI from the watchdog at

	if (wait)
		csd_lock_wait(csd);

IINM and from what I could correlate to the asm it generates here, RIP
points to that READ_ONCE there in smp_cond_load_acquire() in
smp_call_function_single() which is called by collect_cpu_info() of the
microcode loader to get the microcode-relevant info from the CPU.

So this is simply a bystander CPU which got interrupted.

> I'll build an .10-rc8 with your patch tomorrow .. is somewhat late now here :)

Ok.

> Another thing is .. there seems to be a real bug in tsc code .
>
> I've build an -rc8 with a lot more debug options on an now I see the
> following :

Right before I went to bed I thought of telling you to enable lockdep :-)

Good. :-)

--
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
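To illustrate why a CPU sitting in csd_lock_wait() is only a bystander, here is a minimal userspace sketch of that kind of wait: the caller spins with acquire ordering until a lock bit is cleared by someone else. The names csd_sketch, SKETCH_FLAG_LOCK, csd_sketch_lock_wait() and csd_sketch_unlock() are hypothetical, not kernel API, and the spin budget is an assumption standing in for the watchdog NMI that eventually catches a CPU stuck in the loop.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit: set while a cross-CPU request is still in flight. */
#define SKETCH_FLAG_LOCK 0x1u

/* Hypothetical stand-in for the per-call data; only the flags word matters here. */
struct csd_sketch {
	atomic_uint flags;
};

/*
 * Spin until the lock bit clears. Returns true once it is clear, false when
 * the spin budget runs out (the analogue of the watchdog firing while the
 * waiter is still polling the flag).
 */
static bool csd_sketch_lock_wait(struct csd_sketch *csd, unsigned long max_spins)
{
	for (unsigned long i = 0; i < max_spins; i++) {
		unsigned int val = atomic_load_explicit(&csd->flags,
							memory_order_acquire);
		if (!(val & SKETCH_FLAG_LOCK))
			return true;
	}
	return false;
}

/* What the remote side does once it has handled the request. */
static void csd_sketch_unlock(struct csd_sketch *csd)
{
	atomic_fetch_and_explicit(&csd->flags, ~SKETCH_FLAG_LOCK,
				  memory_order_release);
}
```

If the remote side never calls csd_sketch_unlock(), the waiter exhausts its budget without having done anything wrong itself, which matches the "bystander" reading of the screenshot.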
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Mon, 13 Feb 2017, Mike Galbraith wrote:
>  kernel/time/tick-broadcast.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -357,6 +357,7 @@ void tick_broadcast_control(enum tick_br
>  	struct clock_event_device *bc, *dev;
>  	struct tick_device *td;
>  	int cpu, bc_stopped;
> +	unsigned long flags;
>
>  	td = this_cpu_ptr(&tick_cpu_device);
>  	dev = td->evtdev;
> @@ -370,7 +371,7 @@ void tick_broadcast_control(enum tick_br
>  	if (!tick_device_is_functional(dev))
>  		return;
>
> -	raw_spin_lock(&tick_broadcast_lock);
> +	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
>  	cpu = smp_processor_id();
>  	bc = tick_broadcast_device.evtdev;
>  	bc_stopped = cpumask_empty(tick_broadcast_mask);
> @@ -420,7 +421,7 @@ void tick_broadcast_control(enum tick_br
>  			tick_broadcast_setup_oneshot(bc);
>  		}
>  	}
> -	raw_spin_unlock(&tick_broadcast_lock);
> +	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);

That cures the lockdep splat, but the comment above
tick_broadcast_control() says:

 * Called with interrupts disabled, so clockevents_lock is not
 * required here because the local clock event device cannot go away
 * under us.

So if we want to relax the calling convention, then we need to take the
lock early. Otherwise it's unsafe to fiddle with the local clock event
device.

The calling convention was broken with the following commit:

  29d7bbada98e intel_idle: Remove superfluous SMP fuction call

So we could fix it at the call site, but making the core more robust is
the better solution. I'll fix it up.

Thanks,

	tglx
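The splat cured here, "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage", is lockdep complaining that one lock class is taken both from hard-interrupt context and with hard interrupts enabled. The following is a minimal userspace sketch of that consistency rule only. The names usage_sketch, lock_class_sketch and record_acquire() are hypothetical, and real lockdep tracks far more state than these two bits.

```c
#include <stdbool.h>

/* Hypothetical usage bits, loosely modelled on the two states in the splat. */
enum usage_sketch {
	USED_IN_HARDIRQ = 1 << 0,	/* lock was taken from hard-interrupt context */
	TAKEN_HARDIRQ_ON = 1 << 1,	/* lock was taken with hard interrupts enabled */
};

/* Hypothetical per-lock-class record. */
struct lock_class_sketch {
	unsigned int usage;
};

/*
 * Record one acquisition. Returns true while the class usage is still
 * consistent, false once both bits are set: an interrupt arriving while the
 * lock is held with interrupts enabled could then try to take it again on
 * the same CPU and spin forever.
 */
static bool record_acquire(struct lock_class_sketch *lc,
			   bool in_hardirq, bool irqs_enabled)
{
	const unsigned int both = USED_IN_HARDIRQ | TAKEN_HARDIRQ_ON;

	if (in_hardirq)
		lc->usage |= USED_IN_HARDIRQ;
	else if (irqs_enabled)
		lc->usage |= TAKEN_HARDIRQ_ON;

	return (lc->usage & both) != both;
}
```

In this model, the timer-interrupt path sets the first bit, and a plain raw_spin_lock() from the hotplug thread with interrupts on sets the second, which trips the check. Taking the lock with interrupts disabled, as raw_spin_lock_irqsave() does, never sets the second bit.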
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Mon, 2017-02-13 at 02:26 +0100, Gabriel C wrote:
> [5.276704]CPU0
> [5.312400]
> [5.347605] lock(tick_broadcast_lock);
> [5.383163]
> [5.418457] lock(tick_broadcast_lock);
> [5.454015]
> *** DEADLOCK ***
>
> [5.557982] no locks held by cpuhp/0/14.

Oh, that looks familiar...

tick/broadcast: Make tick_broadcast_control() use raw_spinlock_irqsave()

Otherwise we end up with the lockdep splat below:

[ 12.703619] =
[ 12.703619] [ INFO: inconsistent lock state ]
[ 12.703621] 4.10.0-rt1-rt #18 Not tainted
[ 12.703622] -
[ 12.703623] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 12.703624] cpuhp/0/23 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 12.703625] (tick_broadcast_lock){?.}, at: [] tick_broadcast_control+0x5a/0x1a0
[ 12.703632] {IN-HARDIRQ-W} state was registered at:
[ 12.703637] [] __lock_acquire+0xa21/0x1550
[ 12.703639] [] lock_acquire+0xbd/0x250
[ 12.703642] [] _raw_spin_lock_irqsave+0x53/0x70
[ 12.703644] [] tick_broadcast_switch_to_oneshot+0x16/0x50
[ 12.703646] [] tick_switch_to_oneshot+0x59/0xd0
[ 12.703647] [] tick_init_highres+0x15/0x20
[ 12.703652] [] hrtimer_run_queues+0x9f/0xe0
[ 12.703654] [] run_local_timers+0x25/0x60
[ 12.703656] [] update_process_times+0x2c/0x60
[ 12.703659] [] tick_periodic+0x2f/0x100
[ 12.703661] [] tick_handle_periodic+0x24/0x70
[ 12.703664] [] local_apic_timer_interrupt+0x33/0x60
[ 12.703669] [] smp_apic_timer_interrupt+0x38/0x50
[ 12.703671] [] apic_timer_interrupt+0x9d/0xb0
[ 12.703672] [] mwait_idle+0x94/0x290
[ 12.703676] [] arch_cpu_idle+0xf/0x20
[ 12.703677] [] default_idle_call+0x31/0x60
[ 12.703681] [] do_idle+0x175/0x290
[ 12.703683] [] cpu_startup_entry+0x48/0x50
[ 12.703687] [] start_secondary+0x133/0x160
[ 12.703689] [] verify_cpu+0x0/0xfc
[ 12.703690] irq event stamp: 71
[ 12.703691] hardirqs last enabled at (71): [] _raw_spin_unlock_irq+0x2c/0x80
[ 12.703696] hardirqs last disabled at (70): [] __schedule+0x9c/0x7e0
[ 12.703699] softirqs last enabled at (0): [] copy_process.part.34+0x5f1/0x22d0
[ 12.703700] softirqs last disabled at (0): [< (null)>] (null)
[ 12.703701]
[ 12.703701] other info that might help us debug this:
[ 12.703701] Possible unsafe locking scenario:
[ 12.703701]
[ 12.703701]CPU0
[ 12.703702]
[ 12.703702] lock(tick_broadcast_lock);
[ 12.703703]
[ 12.703704] lock(tick_broadcast_lock);
[ 12.703705]
[ 12.703705] *** DEADLOCK ***
[ 12.703705]
[ 12.703705] no locks held by cpuhp/0/23.
[ 12.703705]
[ 12.703705] stack backtrace:
[ 12.703707] CPU: 0 PID: 23 Comm: cpuhp/0 Not tainted 4.10.0-rt1-rt #18
[ 12.703708] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
[ 12.703709] Call Trace:
[ 12.703715] dump_stack+0x85/0xc8
[ 12.703717] print_usage_bug+0x1ea/0x1fb
[ 12.703719] ? print_shortest_lock_dependencies+0x1c0/0x1c0
[ 12.703721] mark_lock+0x20d/0x290
[ 12.703723] __lock_acquire+0x8e6/0x1550
[ 12.703724] ? __lock_acquire+0x2ce/0x1550
[ 12.703726] ? load_balance+0x1b4/0xaf0
[ 12.703728] lock_acquire+0xbd/0x250
[ 12.703729] ? tick_broadcast_control+0x5a/0x1a0
[ 12.703735] ? efifb_probe+0x170/0x170
[ 12.703736] _raw_spin_lock+0x3b/0x50
[ 12.703737] ? tick_broadcast_control+0x5a/0x1a0
[ 12.703738] tick_broadcast_control+0x5a/0x1a0
[ 12.703740] ? efifb_probe+0x170/0x170
[ 12.703742] intel_idle_cpu_online+0x22/0x100
[ 12.703744] cpuhp_invoke_callback+0x245/0x9d0
[ 12.703747] ? finish_task_switch+0x78/0x290
[ 12.703750] ? check_preemption_disabled+0x9f/0x130
[ 12.703752] cpuhp_thread_fun+0x52/0x110
[ 12.703754] smpboot_thread_fn+0x276/0x320
[ 12.703757] kthread+0x10c/0x140
[ 12.703759] ? smpboot_update_cpumask_percpu_thread+0x130/0x130
[ 12.703760] ? kthread_park+0x90/0x90
[ 12.703762] ret_from_fork+0x2a/0x40
[ 12.709790] intel_idle: lapic_timer_reliable_states 0x2

Signed-off-by: Mike Galbraith
---
 kernel/time/tick-broadcast.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -357,6 +357,7 @@ void tick_broadcast_control(enum tick_br
 	struct clock_event_device *bc, *dev;
 	struct tick_device *td;
 	int cpu, bc_stopped;
+	unsigned long flags;

 	td = this_cpu_ptr(&tick_cpu_device);
 	dev = td->evtdev;
@@ -370,7 +371,7 @@ void tick_broadcast_control(enum tick_br
 	if (!tick_device_is_functional(dev))
 		return;

-	raw_spin_lock(&tick_broadcast_lock);
+	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
 	cpu = smp_processor_id();
 	bc = tick_broadcast_device.evtdev;
 	bc_stopped = cpumask_empty(tick_broadcast_mask);
@@ -420,7 +421,7 @@ void tick_broadcast_control(enum tick_br
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On 13.02.2017 01:38, Borislav Petkov wrote:
> On Sun, Feb 12, 2017 at 11:21:13PM +0100, Gabriel C wrote:
>> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_initcall_debug.mp4
>> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_intcall_debug_ucode_off.mp4
>
> Thanks and interesting. In both cases, mcheck_init_device() doesn't
> return or we don't see the "initcall returned" message.
>
> Ok, let's try a silly sprinkling of printks in that function and try to
> pinpoint how far we manage to come. Apply, build, boot and shoot video
> again :-)

I didn't tested your patch yet but did a boot with mce=off and nomce
which seems to not really works since is still want to mc_device_add()
even when off.

See :

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_mce_off.jpg

I'll build an .10-rc8 with your patch tomorrow .. is somewhat late now here :)

Another thing is .. there seems to be a real bug in tsc code .

I've build an -rc8 with a lot more debug options on an now I see the
following :

...
[4.321029] =
[4.321909] [ INFO: inconsistent lock state ]
[4.322789] 4.10.0-rc8-debug #1 Tainted: G I
[4.323879] -
[4.324759] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[4.325973] cpuhp/0/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
[4.326993] (tick_broadcast_lock){?.}, at: [] tick_broadcast_control+0x57/0x190
[4.328879] {IN-HARDIRQ-W} state was registered at:
[4.329866] __lock_acquire+0x24f/0x19e0
[4.330675] lock_acquire+0xa5/0xd0
[4.331399] _raw_spin_lock_irqsave+0x54/0x90
[4.332297] tick_broadcast_switch_to_oneshot+0x11/0x50
[4.71] tick_switch_to_oneshot+0x8c/0xd0
[4.334269] tick_init_highres+0x10/0x20
[4.335079] hrtimer_run_queues+0x5a/0xe0
[4.335907] run_local_timers+0x20/0x50
[4.336699] update_process_times+0x22/0x50
[4.337562] tick_periodic+0xa5/0xb0
[4.338302] tick_handle_periodic+0x1f/0x60
[4.378065] smp_trace_apic_timer_interrupt+0x74/0x90
[4.418107] smp_apic_timer_interrupt+0x9/0x10
[4.458095] apic_timer_interrupt+0x93/0xa0
[4.498048] mwait_idle+0x5a/0x90
[4.537618] arch_cpu_idle+0xa/0x10
[4.577098] default_idle_call+0x2c/0x30
[4.616211] do_idle+0x10c/0x1e0
[4.654606] cpu_startup_entry+0x5d/0x60
[4.692388] rest_init+0x12c/0x140
[4.729557] start_kernel+0x45f/0x46c
[4.766325] x86_64_start_reservations+0x2a/0x2c
[4.803075] x86_64_start_kernel+0xeb/0xf8
[4.839178] verify_cpu+0x0/0xfc
[4.874629] irq event stamp: 71
[4.909417] hardirqs last enabled at (71): [] _raw_spin_unlock_irq+0x27/0x50
[4.945642] hardirqs last disabled at (70): [] __schedule+0x13a/0x7c0
[4.981797] softirqs last enabled at (0): [] copy_process+0x7c0/0x1ea0
[5.018580] softirqs last disabled at (0): [< (null)>] (null)
[5.055677] other info that might help us debug this:
[5.072455] tsc: Refined TSC clocksource calibration: 2266.746 MHz
[5.072467] clocksource: tsc: mask: 0x max_cycles: 0x20ac7f6ecc6, max_idle_ns: 440795315461 ns
[5.202828] Possible unsafe locking scenario:

^ This seems to be the place where the other patch breaks hell here..

[5.276704]CPU0
[5.312400]
[5.347605] lock(tick_broadcast_lock);
[5.383163]
[5.418457] lock(tick_broadcast_lock);
[5.454015]
*** DEADLOCK ***

[5.557982] no locks held by cpuhp/0/14.
[5.592295] stack backtrace:
[5.657946] CPU: 0 PID: 14 Comm: cpuhp/0 Tainted: G I 4.10.0-rc8-debug #1
[5.690740] Hardware name: FUJITSU PRIMERGY TX200 S5 /D2709, BIOS 6.00 Rev. 1.14.2709 02/04/2013
[5.758323] Call Trace:
[5.791434] dump_stack+0x86/0xc1
[5.824421] print_usage_bug+0x283/0x2a0
[5.857357] mark_lock+0x39e/0x650
[5.890256] ? check_usage_forwards+0xf0/0xf0
[5.923436] __lock_acquire+0x2ba/0x19e0
[5.956617] ? pick_next_task_fair+0x350/0x700
[5.989903] ? finish_task_switch+0x184/0x220
[6.023171] ? debug_smp_processor_id+0x17/0x20
[6.056667] lock_acquire+0xa5/0xd0
[6.089882] ? tick_broadcast_control+0x57/0x190
[6.123395] ? smpboot_thread_fn+0x28/0x250
[6.156838] _raw_spin_lock+0x3c/0x80
[6.190175] ? tick_broadcast_control+0x57/0x190
[6.223914] tick_broadcast_control+0x57/0x190
[6.257846] ? finish_task_switch+0x184/0x220
[6.291900] ? smpboot_thread_fn+0x28/0x250
[6.325991] intel_idle_cpu_online+0x1d/0x100
[6.360220] cpuhp_invoke_callback+0x62/0x120
[6.394397] ? smpboot_thread_fn+0x28/0x250
[6.428451] cpuhp_thread_fun+0x87/0x110
[6.462611] smpboot_thread_fn+0x227/0x250
[6.496805]
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Sun, Feb 12, 2017 at 11:21:13PM +0100, Gabriel C wrote:
> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_initcall_debug.mp4
> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_intcall_debug_ucode_off.mp4

Thanks and interesting. In both cases, mcheck_init_device() doesn't
return or we don't see the "initcall returned" message.

Ok, let's try a silly sprinkling of printks in that function and try to
pinpoint how far we manage to come. Apply, build, boot and shoot video
again :-)

Thanks.

---
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 8e9725c607ea..70268867cb33 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2565,37 +2565,57 @@ static __init int mcheck_init_device(void)
 	enum cpuhp_state hp_online;
 	int err;

+	pr_err("%s: entry\n", __func__);
+
 	if (!mce_available(&boot_cpu_data)) {
 		err = -EIO;
 		goto err_out;
 	}

+	pr_err("%s: mce_available\n", __func__);
+
 	if (!zalloc_cpumask_var(&mce_device_initialized, GFP_KERNEL)) {
 		err = -ENOMEM;
 		goto err_out;
 	}

+	pr_err("%s: zalloc_cpumask_var\n", __func__);
+
 	mce_init_banks();

+	pr_err("%s: mce_init_banks\n", __func__);
+
 	err = subsys_system_register(&mce_subsys, NULL);
 	if (err)
 		goto err_out_mem;

+	pr_err("%s: subsys_system_register\n", __func__);
+
 	err = cpuhp_setup_state(CPUHP_X86_MCE_DEAD, "x86/mce:dead", NULL,
 				mce_cpu_dead);
 	if (err)
 		goto err_out_mem;

+	pr_err("%s: x86/mce:dead\n", __func__);
+
 	err = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/mce:online",
 				mce_cpu_online, mce_cpu_pre_down);
 	if (err < 0)
 		goto err_out_online;
+
+	pr_err("%s: x86/mce:online\n", __func__);
+
 	hp_online = err;

 	register_syscore_ops(&mce_syscore_ops);

+	pr_err("%s: register_syscore_ops\n", __func__);
+
 	/* register character device /dev/mcelog */
 	err = misc_register(&mce_chrdev_device);
+
+	pr_err("%s: misc_register, err: 0x%x\n", __func__, err);
+
 	if (err)
 		goto err_register;

--
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
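The printk sprinkling above is a general technique: print a marker after each step of an init function so that the last marker seen tells you how far it got. A minimal userspace sketch of the same idea follows, assuming hypothetical names (CHECKPOINT, checkpoints_seen, init_sketch) and using fprintf() in place of pr_err().

```c
#include <errno.h>
#include <stdio.h>

/* Number of markers printed by the most recent init_sketch() run. */
static int checkpoints_seen;

/* Hypothetical analogue of pr_err("%s: ...\n", __func__): print and count. */
#define CHECKPOINT(label)						\
	do {								\
		checkpoints_seen++;					\
		fprintf(stderr, "%s: %s\n", __func__, (label));		\
	} while (0)

/*
 * Hypothetical two-step init. The arguments decide whether each step
 * succeeds, so the marker count shows where the function bailed out.
 */
static int init_sketch(int available, int alloc_ok)
{
	checkpoints_seen = 0;

	CHECKPOINT("entry");

	if (!available)
		return -EIO;

	CHECKPOINT("available");

	if (!alloc_ok)
		return -ENOMEM;

	CHECKPOINT("alloc");

	return 0;
}
```

A run that prints only "entry" and then returns -EIO (which is -5 on Linux) stopped at the first check. A run that prints a marker and then nothing at all hung somewhere after it.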
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On 12.02.2017 22:12, Borislav Petkov wrote:
> On Sun, Feb 12, 2017 at 09:21:53PM +0100, Gabriel C wrote:
>> There is what I get :
>>
>> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash2.mp4
>
> Ok, I'm watching it frame-by-frame. I can see the microcode getting
> updated to revision 0x19 as in your working dmesg. The machine hangs
> here at the
>
> clocksource: tsc: mask: 0x max_cycles: 0x20ac7f6ecc6, max_idle_ns: 440795315461 ns
>
> line too. The exact same numbers even as in the previous run! Strange.
>
>> With dis_ucode_ldr there is some more output , with it will stay like this ,
>> nothing more.
>
> Ok, can you do redo the first video but with "initcall_debug" on the
> kernel command line? And then do video of another run with
> "initcall_debug dis_ucode_ldr" on the kernel command line?
>
> I'd like to see which of the initcalls doesn't return.

There are both videos .. however the output seems kind same this time..

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_initcall_debug.mp4
http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_intcall_debug_ucode_off.mp4

I try to find out more tomorrow but right now I don't even have a clue
where to add some printk's to get some more output :(
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Sun, Feb 12, 2017 at 09:21:53PM +0100, Gabriel C wrote:
> There is what I get :
>
> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash2.mp4

Ok, I'm watching it frame-by-frame. I can see the microcode getting
updated to revision 0x19 as in your working dmesg. The machine hangs
here at the

clocksource: tsc: mask: 0x max_cycles: 0x20ac7f6ecc6, max_idle_ns: 440795315461 ns

line too. The exact same numbers even as in the previous run! Strange.

> With dis_ucode_ldr there is some more output , with it will stay like this ,
> nothing more.

Ok, can you do redo the first video but with "initcall_debug" on the
kernel command line? And then do video of another run with
"initcall_debug dis_ucode_ldr" on the kernel command line?

I'd like to see which of the initcalls doesn't return.

Thanks!

--
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On 11.02.2017 22:32, Borislav Petkov wrote:
> On Sat, Feb 11, 2017 at 09:58:26PM +0100, Gabriel C wrote:
>> Yes , it will hang before tsc message ..
>> Also sometimes I have same trace sometimes it just hangs forever.
>
> It doesn't sound like dis_ucode_ldr changes anything. Or maybe it does, maybe the microcode applies some fix for some erratum or whatnot.

Well the bug is still there but at least something in microcode code seems to trigger too..

Also when it hangs it looks like this :

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash2.jpg

Will stay like this forever.. no trace or something.

> Right, so please disable that splash screen and do a boot video again without the dis_ucode_ldr option.

The problem was vga=.. option , not the splash :)

There is what I get :

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash2.mp4

With dis_ucode_ldr there is some more output , with it will stay like this , nothing more.

Also I disabled all 'VT-d' in BIOS .. it doesn't make any difference.

Regards,

Gabriel C
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Sat, Feb 11, 2017 at 09:58:26PM +0100, Gabriel C wrote:
> Yes , it will hang before tsc message ..
> Also sometimes I have same trace sometimes it just hangs forever.

It doesn't sound like dis_ucode_ldr changes anything. Or maybe it does, maybe the microcode applies some fix for some erratum or whatnot.

Right, so please disable that splash screen and do a boot video again without the dis_ucode_ldr option.

Thanks.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On 11.02.2017 15:21, Borislav Petkov wrote:
> On Sat, Feb 11, 2017 at 02:09:14PM +0100, Gabriel C wrote:
>> Adding ' dis_ucode_ldr ' to commandline makes the kernel hangs right after :
>
> Wait a minute, are you saying that without dis_ucode_ldr you can't even boot so far?

Yes , it will hang before tsc message ..
Also sometimes I have same trace sometimes it just hangs forever.

>> clocksource: tsc: mask: 0x max_cycles: 0x20ac7f6ecc6, max_idle_ns: 440795315461 ns
>> ...
>>
>> and I have the bug triggered really quick..
>>
>> Also I cannot get netconsole to work , I'm sure is some problem here local and I don't have
>> any serial cable around right now. The only way I saw now to give you at least some ifo is to
>> make an video of that crash. You can find it there :
>>
>> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash.mp4
>
> Watchdog fires on all cores showing they're all idle. For some reason, not all cores get to dump the watchdog splat, though. Some seem really stuck.
>
> And you have TAINT_FIRMWARE_WORKAROUND due to intel_prepare_irq_remapping() noticing intr remapping is broken on that box.

Well yes and I'm not so sure is really broken..
I reverted the patch blacklisted my box right after it was added the time and I don't have any issues .. however since I don't have a use of that feature I don't really care is marked broken or not..

> Would be better if you could disable that frugalware splash screen and switch to grub console mode so that we can see the very beginning of the boot.

I do that tomorrow ...

> Btw, your BIOS is from 2013. Is there new one, per chance, on your vendor's site? Might wanna consider updating it...

This BIOS was / is newest one :(
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Sat, Feb 11, 2017 at 02:09:14PM +0100, Gabriel C wrote:
> Adding ' dis_ucode_ldr ' to commandline makes the kernel hangs right after :

Wait a minute, are you saying that without dis_ucode_ldr you can't even boot so far?

> clocksource: tsc: mask: 0x max_cycles: 0x20ac7f6ecc6,
> max_idle_ns: 440795315461 ns
> ...
>
> and I have the bug triggered really quick..
>
> Also I cannot get netconsole to work , I'm sure is some problem here local
> and I don't have
> any serial cable around right now. The only way I saw now to give you at
> least some ifo is to
> make an video of that crash. You can find it there :
>
> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash.mp4

Watchdog fires on all cores showing they're all idle. For some reason, not all cores get to dump the watchdog splat, though. Some seem really stuck.

And you have TAINT_FIRMWARE_WORKAROUND due to intel_prepare_irq_remapping() noticing intr remapping is broken on that box.

Would be better if you could disable that frugalware splash screen and switch to grub console mode so that we can see the very beginning of the boot.

Btw, your BIOS is from 2013. Is there new one, per chance, on your vendor's site? Might wanna consider updating it...

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On 11.02.2017 09:26, Thomas Gleixner wrote:
> You might try with 'earlyprintk' on the command line. That should tell more.

With that I have some more output.. and after lots more boots I found out there are really at least 2 bugs triggered by this in 4.10.

When just booting with earlyprintk=vga debug ignore_loglevel the kernel hangs right after :

Key type dns_resolver registered
..

The cursor blinks and one have to wait a while this bug to trigger if at all. Sometimes it just hangs there and that is.

Adding ' dis_ucode_ldr ' to commandline makes the kernel hangs right after :

clocksource: tsc: mask: 0x max_cycles: 0x20ac7f6ecc6, max_idle_ns: 440795315461 ns
...

and I have the bug triggered really quick..

Also I cannot get netconsole to work , I'm sure is some problem here local and I don't have any serial cable around right now. The only way I saw now to give you at least some info is to make an video of that crash. You can find it there :

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash.mp4

Also this is after waiting a while :

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/20170211_130319.jpg

I know videos and pictures are not the best solution but right now I don't have any other way to capture some logs :|

The kernel is Linus git tree + .d966564fcdc19e13eb6ba1fbe6b8101070339c3d reverted and the config is :

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/config

I hope the video helps at least somewhat to have a clue what could be wrong.

Regards,

Gabriel C
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Sat, 11 Feb 2017, Gabriel C wrote:
> On 07.02.2017 22:25, Thomas Gleixner wrote:
> Hi Thomas ,
>
> Sorry I was travelling..

Nothing to be sorry about.

> > Btw, how far in the boot process is the machine when this happens?
>
> Right after :
>
> Uncompressing Linux.
> Booting the kernel..
>
> So early..

You might try with 'earlyprintk' on the command line. That should tell more.

> One thing is strange .. on all one socket boxes I have the kernel seems
> to be fine with that patch while breaks on both dual socket boxes ( well
> both have near same HW )

That's really weird.

Thanks,

	tglx
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On 11.02.2017 00:17, Gabriel C wrote:
>> Btw, how far in the boot process is the machine when this happens?
>
> Right after :
>
> Uncompressing Linux.
> Booting the kernel..
>
> So early..

After lots more boots .. I found out sometimes it gets to :

..
[4.656826] Key type dns_resolver registered
..

next line(s) in all my logs would be :

..
[4.657507] microcode: sig=0x106a5, pf=0x1, revision=0x19
[4.658678] microcode: Microcode Update Driver: v2.01, Peter Oruba
..

so maybe some sort race in microcode code ? but this would be strange ?
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On 07.02.2017 22:25, Thomas Gleixner wrote:
> On Tue, 7 Feb 2017, Thomas Gleixner wrote:

Hi Thomas ,

Sorry I was travelling..

>> Gabriel, can you please send me the bootlog from a working kernel?

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/dmesg

( If you wish I can send you one from .10-rc with that patch reverted )

> Plus content of /proc/interrupts.

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/interrupts

> Btw, how far in the boot process is the machine when this happens?

Right after :

Uncompressing Linux.
Booting the kernel..

So early..

One thing is strange .. on all one socket boxes I have the kernel seems to be fine with that patch while breaks on both dual socket boxes ( well both have near same HW )

Also I'm going to test your patch from your other email

Regards,

Gabriel C
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Mon, 6 Feb 2017, Linus Torvalds wrote:
> That said, it also strikes me that the implicated
> irq_chip_retrigger_hierarchy() function looks really very suspicious
> indeed.
>
> Most of the other users don't seem to traverse the parent all the way
> until they find something. They just do the operation in the parent,
> and if the parent needs it, it might then do it in _its_ parent and so
> on.

The whole point of the hierarchy is that we have decoupled the stacked chips so the ioapic does not know whether it is connected to the irq remapping unit or to the vector domain directly.

> So I'm wondering if that for-loop triggers a stack overflow on your
> setup somehow, just because that irq_retrigger() call is now truly
> recursive, and hasn't been turned into tail-calls.

It would only be recursive if some level down the hierarchy would use the same callback. The ioapic is always on top of its hierarchy and it's either connected to the vector domain directly, which is the last level in the hierarchy and implements the real retrigger callback, or to the irq remapping unit, which does not have a retrigger callback.

So it's not a recursion problem AFAICT, but let's try and just use the apic callback directly as we did before the whole hierarchy rework. That's wrong for other reasons, but that does not matter in that particular case. Patch below.

We have the same situation with the MSI interrupt domains which all use the irq_chip_retrigger_hierarchy() function as their retrigger callback, which does not seem to have the same effect on Gabriel's machine.

I have the feeling that this commit unearths some other subtle wreckage in the interrupt machinery which does not get triggered otherwise.

Thanks,

	tglx

8<---
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 52f352b063fd..3b6e5f3f099d 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1867,6 +1867,8 @@ static int ioapic_set_affinity(struct irq_data *irq_data,
 	return ret;
 }
 
+extern int apic_retrigger_irq(struct irq_data *irq_data);
+
 static struct irq_chip ioapic_chip __read_mostly = {
 	.name			= "IO-APIC",
 	.irq_startup		= startup_ioapic_irq,
@@ -1875,7 +1877,7 @@ static struct irq_chip ioapic_chip __read_mostly = {
 	.irq_ack		= irq_chip_ack_parent,
 	.irq_eoi		= ioapic_ack_level,
 	.irq_set_affinity	= ioapic_set_affinity,
-	.irq_retrigger		= irq_chip_retrigger_hierarchy,
+	.irq_retrigger		= apic_retrigger_irq,
 	.flags			= IRQCHIP_SKIP_SET_WAKE,
 };
 
@@ -1887,7 +1889,7 @@ static struct irq_chip ioapic_ir_chip __read_mostly = {
 	.irq_ack		= irq_chip_ack_parent,
 	.irq_eoi		= ioapic_ir_ack_level,
 	.irq_set_affinity	= ioapic_set_affinity,
-	.irq_retrigger		= irq_chip_retrigger_hierarchy,
+	.irq_retrigger		= apic_retrigger_irq,
 	.flags			= IRQCHIP_SKIP_SET_WAKE,
 };
 
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 5d30c5e42bb1..ce9b93e19266 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -496,7 +496,7 @@ void setup_vector_irq(int cpu)
 	__setup_vector_irq(cpu);
 }
 
-static int apic_retrigger_irq(struct irq_data *irq_data)
+int apic_retrigger_irq(struct irq_data *irq_data)
 {
 	struct apic_chip_data *data = apic_chip_data(irq_data);
 	unsigned long flags;
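To make the argument above easier to follow, here is a minimal userspace model of the two topologies described (ioapic on top of the vector domain, and ioapic on top of the remapping unit on top of the vector domain). All model_* names are hypothetical and the structs are heavily simplified stand-ins, not the kernel's struct irq_data or struct irq_chip. The walk is written as a loop over the parents, which is how the hierarchy helper is described in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace model; these are not the kernel's real definitions. */
struct model_irq_data;

struct model_irq_chip {
	const char *name;
	int (*irq_retrigger)(struct model_irq_data *d);	/* may be NULL */
};

struct model_irq_data {
	struct model_irq_chip *chip;
	struct model_irq_data *parent_data;	/* NULL at the last level */
};

/* Counts how often the bottom-level callback actually ran. */
int model_vector_hits;

/* Stand-in for the vector domain's real retrigger callback. */
int model_vector_retrigger(struct model_irq_data *d)
{
	(void)d;
	model_vector_hits++;
	return 1;
}

/* Walk down the parents and call the first retrigger callback found. */
int model_retrigger_hierarchy(struct model_irq_data *d)
{
	for (d = d->parent_data; d; d = d->parent_data)
		if (d->chip && d->chip->irq_retrigger)
			return d->chip->irq_retrigger(d);
	return 0;
}

/* Bottom level: has the real callback. */
struct model_irq_chip model_vector_chip = { "vector", model_vector_retrigger };
/* Middle level: remapping unit, no retrigger callback at all. */
struct model_irq_chip model_remap_chip = { "remap", NULL };
/* Top level: only ever forwards down the hierarchy. */
struct model_irq_chip model_ioapic_chip = { "ioapic", model_retrigger_hierarchy };
```

In this model the walk ends in model_vector_retrigger() after at most two steps in either topology, and model_retrigger_hierarchy() is never entered a second time because no level below the top uses it as its callback.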
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Tue, 7 Feb 2017, Thomas Gleixner wrote:
> Gabriel, can you please send me the bootlog from a working kernel?

Plus content of /proc/interrupts.

Btw, how far in the boot process is the machine when this happens?

Thanks,

	tglx
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Mon, 6 Feb 2017, Linus Torvalds wrote:
> But for now, I'd be inclined to just revert it unless somebody has a
> "Duh!" moment and can tell me what's wrong with that commit with an
> obvious fix.

I have no "Duh!" moment even after staring at the code for quite a while.

Gabriel, can you please send me the bootlog from a working kernel?

Thanks,

	tglx
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Mon, Feb 6, 2017 at 9:30 AM, Gabriel C wrote:
>
> Somewhat late , however I didn't tested 4.9.6 but jumped from 4.9.5 to 4.9.7
> and found out by box won't boot anymore.
>
> It hangs early and freeze with a lot RCU warnings.
> Since I cannot setup a netconsole right now I cannot post the errors ,
> really sorry.
>
> ( but I could make a picture if needed )
>
> I bisected it down to :
>
>> Ruslan Ruslichenko (1):
>> x86/ioapic: Restore IO-APIC irq_chip retrigger callback

Ok, it's 020eb3daaba2 ("x86/ioapic: Restore IO-APIC irq_chip retrigger callback") in mainline.

> Reverting this one fixes the problem for me..

Since that came in rather late, I suspect we'll have to revert for now. The thing it fixes has been around for almost two years, so it can't be as serious a problem as the fix itself ended up being.

Thomas?

That said, it also strikes me that the implicated irq_chip_retrigger_hierarchy() function looks really very suspicious indeed.

Most of the other users don't seem to traverse the parent all the way until they find something. They just do the operation in the parent, and if the parent needs it, it might then do it in _its_ parent and so on. And the compiler is able to turn the parent call into a tail call so it doesn't cause a stack use explosion even if the parenthood chains end up being pretty deep.

So I'm wondering if that for-loop triggers a stack overflow on your setup somehow, just because that irq_retrigger() call is now truly recursive, and hasn't been turned into tail-calls.

But for now, I'd be inclined to just revert it unless somebody has a "Duh!" moment and can tell me what's wrong with that commit with an obvious fix.

Comments?

Linus
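For readers unfamiliar with the pattern referred to above as "just do the operation in the parent", a rough userspace sketch follows. The lvl_* names are hypothetical and only model the idea; they are not kernel code. Each level hands the operation to its direct parent and nothing else, with the call placed last in the function so that a compiler is free to emit it as a tail call (whether it does depends on the compiler and optimization level).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of stacked chips; not the kernel's real structures. */
struct lvl;

struct lvl_ops {
	int (*op)(struct lvl *l);
};

struct lvl {
	const struct lvl_ops *ops;
	struct lvl *parent;	/* NULL at the last level */
	int handled;		/* set by the level that really does the work */
};

/* Last level: does the real work. */
int lvl_do_op(struct lvl *l)
{
	l->handled = 1;
	return 1;
}

/* Any other level: hand the operation to the direct parent only. */
int lvl_op_parent(struct lvl *l)
{
	l = l->parent;
	return l->ops->op(l);	/* last statement, so eligible for a tail call */
}

const struct lvl_ops lvl_bottom_ops = { lvl_do_op };
const struct lvl_ops lvl_forward_ops = { lvl_op_parent };
```

The difference from a helper that loops over all parents is only in who decides how far to go: here every level forwards exactly one step, so the chain is as deep as the number of stacked levels and no deeper.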
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On 06.02.2017 20:05, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> On 02/06/2017 07:41 PM, Greg KH wrote:
>> On Mon, Feb 06, 2017 at 06:30:15PM +0100, Gabriel C wrote:
>>> On 26.01.2017 08:48, Greg KH wrote:
>>>
>>> Hi Greg,
>>>
>>>> I'm announcing the release of the 4.9.6 kernel.
>>>
>>> Somewhat late , however I didn't tested 4.9.6 but jumped from 4.9.5 to 4.9.7 and found out by box won't boot anymore.
>>>
>>> It hangs early and freeze with a lot RCU warnings.
>>> Since I cannot setup a netconsole right now I cannot post the errors , really sorry.
>>>
>>> ( but I could make a picture if needed )
>>>
>>> I bisected it down to :
>>>
>>> Ruslan Ruslichenko (1):
>>> x86/ioapic: Restore IO-APIC irq_chip retrigger callback
>>>
>>> Reverting this one fixes the problem for me..
>>>
>>> Also this problem exists in Linus tree , I tested on: 4.10.0-rc6-00167-ga0a28644c1cf
>>
>> Ok, at least we are consistent :)
>>
>>> The box is a PRIMERGY TX200 S5 , 2 socket , 2 x E5520 CPU(s) installed.
>>>
>>> Config: https://raw.githubusercontent.com/frugalware/frugalware-current/master/source/base/kernel/config.x86_64
>>
>> Ruslan, any thoughts about what to do here?
>
> This looks strange. What this patch does is just revert previous behavior, broken by d32932d02e18.
> So we can try to test with last v4.1 stable, where retrigger callback were still present.

I can test that but first on weekend if you wish.

> Also on v4.10 maybe check with software emulation of this feature and reverted patch, e.g.:
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index e487493..49c3c71 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -170,6 +170,7 @@ config X86
>  	select USER_STACKTRACE_SUPPORT
>  	select VIRT_TO_BUS
>  	select X86_FEATURE_NAMES		if PROC_FS
> +	select HARDIRQS_SW_RESEND
>  
>  config INSTRUCTION_DECODER
>  	def_bool y

With patch reverted + this one I get a early kernel panic.. on 4.10.0-rc7

With just the patch reverted all is fine , the box boots and all seems fine.

> I think for further debugging logs will be needed.

Yes sure , I just need to find a way to set something up like netconsole here.
Right now I have no way doing that. I'll try to do that on weekend too also.

Regards,

Gabriel C
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On 02/06/2017 07:41 PM, Greg KH wrote:
> On Mon, Feb 06, 2017 at 06:30:15PM +0100, Gabriel C wrote:
> > On 26.01.2017 08:48, Greg KH wrote:
> >
> > Hi Greg,
> >
> > > I'm announcing the release of the 4.9.6 kernel.
> >
> > Somewhat late, however: I didn't test 4.9.6 but jumped from 4.9.5 to 4.9.7
> > and found out my box won't boot anymore.
> >
> > It hangs early and freezes with a lot of RCU warnings.
> > Since I cannot set up a netconsole right now, I cannot post the errors,
> > really sorry.
> >
> > (but I could take a picture if needed)
> >
> > I bisected it down to:
> >
> > > Ruslan Ruslichenko (1):
> > >       x86/ioapic: Restore IO-APIC irq_chip retrigger callback
> >
> > Reverting this one fixes the problem for me.
> >
> > This problem also exists in Linus' tree; I tested on:
> > 4.10.0-rc6-00167-ga0a28644c1cf
>
> Ok, at least we are consistent :)
>
> > The box is a PRIMERGY TX200 S5, 2 sockets, 2 x E5520 CPUs installed.
> >
> > Config:
> > https://raw.githubusercontent.com/frugalware/frugalware-current/master/source/base/kernel/config.x86_64
>
> Ruslan, any thoughts about what to do here?

This looks strange. What this patch does is just restore the previous
behavior, which was broken by d32932d02e18.
So we can try to test with the last v4.1 stable, where the retrigger
callback was still present.

Also on v4.10 maybe check with software emulation of this feature and
the patch reverted, e.g.:

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e487493..49c3c71 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -170,6 +170,7 @@ config X86
         select USER_STACKTRACE_SUPPORT
         select VIRT_TO_BUS
         select X86_FEATURE_NAMES if PROC_FS
+        select HARDIRQS_SW_RESEND
 
 config INSTRUCTION_DECODER
         def_bool y

I think for further debugging logs will be needed.

> thanks,
>
> greg k-h
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Mon, Feb 06, 2017 at 06:30:15PM +0100, Gabriel C wrote:
> On 26.01.2017 08:48, Greg KH wrote:
>
> Hi Greg,
>
> > I'm announcing the release of the 4.9.6 kernel.
>
> Somewhat late, however: I didn't test 4.9.6 but jumped from 4.9.5 to 4.9.7
> and found out my box won't boot anymore.
>
> It hangs early and freezes with a lot of RCU warnings.
> Since I cannot set up a netconsole right now, I cannot post the errors,
> really sorry.
>
> (but I could take a picture if needed)
>
> I bisected it down to:
>
> > Ruslan Ruslichenko (1):
> >       x86/ioapic: Restore IO-APIC irq_chip retrigger callback
>
> Reverting this one fixes the problem for me.
>
> This problem also exists in Linus' tree; I tested on:
> 4.10.0-rc6-00167-ga0a28644c1cf

Ok, at least we are consistent :)

> The box is a PRIMERGY TX200 S5, 2 sockets, 2 x E5520 CPUs installed.
>
> Config:
> https://raw.githubusercontent.com/frugalware/frugalware-current/master/source/base/kernel/config.x86_64

Ruslan, any thoughts about what to do here?

thanks,

greg k-h
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On 26.01.2017 08:48, Greg KH wrote:

Hi Greg,

> I'm announcing the release of the 4.9.6 kernel.

Somewhat late, however: I didn't test 4.9.6 but jumped from 4.9.5 to 4.9.7
and found out my box won't boot anymore.

It hangs early and freezes with a lot of RCU warnings.
Since I cannot set up a netconsole right now, I cannot post the errors,
really sorry.

(but I could take a picture if needed)

I bisected it down to:

> Ruslan Ruslichenko (1):
>       x86/ioapic: Restore IO-APIC irq_chip retrigger callback

Reverting this one fixes the problem for me.

This problem also exists in Linus' tree; I tested on:
4.10.0-rc6-00167-ga0a28644c1cf

The box is a PRIMERGY TX200 S5, 2 sockets, 2 x E5520 CPUs installed.

Config:
https://raw.githubusercontent.com/frugalware/frugalware-current/master/source/base/kernel/config.x86_64

Regards,

Gabriel C.