Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-13 Thread Borislav Petkov
On Mon, Feb 13, 2017 at 02:26:20AM +0100, Gabriel C wrote:
> I haven't tested your patch yet, but I did a boot with mce=off and nomce,
> which doesn't really seem to work since it still wants to call
> mc_device_add() even when off.

mc_device_add() is microcode loader's ->add_dev() subsys pointer and
that's not from mce. From mce you should be seeing only (with the debug
patch applied):

[1.717508] mce: mcheck_init_device: entry
[1.718769] mce: Unable to init device /dev/mcelog (rc: -5)

> See :
> 
> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_mce_off.jpg

That looks like core 13 got the NMI from the watchdog at

	if (wait)
		csd_lock_wait(csd);

IINM and from what I could correlate to the asm it generates here,
RIP points to that READ_ONCE there in smp_cond_load_acquire() in
smp_call_function_single() which is called by collect_cpu_info() of the
microcode loader to get the microcode-relevant info from the CPU.

So this is simply a bystander CPU which got interrupted.

> I'll build a .10-rc8 with your patch tomorrow .. it's somewhat late here now :)

Ok.

> Another thing is .. there seems to be a real bug in the tsc code.
> 
> I've built an -rc8 with a lot more debug options on and now I see the
> following:

Right before I went to bed I thought of telling you to enable lockdep :-)

Good. :-)

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
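
For readers following along, the wait described above can be modelled in userspace. This is a minimal sketch under stated assumptions, not the kernel's implementation: `csd_model`, `remote_cpu_step()`, `csd_lock_wait_model()`, `demo_wait()` and the spin budget are hypothetical names invented for illustration, and the budget only stands in for the NMI watchdog. It shows a waiter polling a lock flag with an acquire load until the other side clears it, which is where a CPU would be found spinning if the target never answers.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define CSD_FLAG_LOCK 0x01u  /* illustrative value, not taken from the kernel headers */

/* Hypothetical model of one cross-CPU call request. */
struct csd_model {
	atomic_uint flags;       /* LOCK bit stays set while the request is outstanding */
	int steps_until_done;    /* modelled remote work left; negative means "never answers" */
};

/* Stand-in for the target CPU making progress on the request. */
static void remote_cpu_step(struct csd_model *csd)
{
	if (csd->steps_until_done < 0)
		return;              /* target is stuck, the flag is never cleared */
	if (csd->steps_until_done == 0)
		/* release pairs with the acquire load in the waiter */
		atomic_fetch_and_explicit(&csd->flags, ~CSD_FLAG_LOCK,
					  memory_order_release);
	else
		csd->steps_until_done--;
}

/* Spin until the LOCK bit clears; false means the spin budget ran out. */
static bool csd_lock_wait_model(struct csd_model *csd, long budget)
{
	while (atomic_load_explicit(&csd->flags, memory_order_acquire) &
	       CSD_FLAG_LOCK) {
		if (budget-- <= 0)
			return false;    /* roughly where a watchdog NMI would find the waiter */
		remote_cpu_step(csd);
	}
	return true;
}

/* Issue one modelled request and wait for it. */
bool demo_wait(int remote_steps, long budget)
{
	struct csd_model csd;

	atomic_init(&csd.flags, CSD_FLAG_LOCK);
	csd.steps_until_done = remote_steps;
	return csd_lock_wait_model(&csd, budget);
}
```

In this model `demo_wait(3, 100)` completes because the modelled target clears the flag after a few steps, while `demo_wait(-1, 100)` exhausts its budget. The second case corresponds to a waiting CPU being caught by the watchdog even though the real problem is on the CPU that never answers.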


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-13 Thread Thomas Gleixner
On Mon, 13 Feb 2017, Mike Galbraith wrote:
>  kernel/time/tick-broadcast.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -357,6 +357,7 @@ void tick_broadcast_control(enum tick_br
>  	struct clock_event_device *bc, *dev;
>  	struct tick_device *td;
>  	int cpu, bc_stopped;
> +	unsigned long flags;
>  
>  	td = this_cpu_ptr(&tick_cpu_device);
>  	dev = td->evtdev;
> @@ -370,7 +371,7 @@ void tick_broadcast_control(enum tick_br
>  	if (!tick_device_is_functional(dev))
>  		return;
>  
> -	raw_spin_lock(&tick_broadcast_lock);
> +	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
>  	cpu = smp_processor_id();
>  	bc = tick_broadcast_device.evtdev;
>  	bc_stopped = cpumask_empty(tick_broadcast_mask);
> @@ -420,7 +421,7 @@ void tick_broadcast_control(enum tick_br
>  				tick_broadcast_setup_oneshot(bc);
>  		}
>  	}
> -	raw_spin_unlock(&tick_broadcast_lock);
> +	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);

That cures the lockdep splat, but the comment above
tick_broadcast_control() says:

* Called with interrupts disabled, so clockevents_lock is not
* required here because the local clock event device cannot go away
* under us.

So if we want to relax the calling convention, then we need to take the
lock early.  Otherwise it's unsafe to fiddle with the local clock event
device.

The calling convention was broken with the following commit:

29d7bbada98e intel_idle: Remove superfluous SMP fuction call

So we could fix it at the call site, but making the core more robust is the
better solution.

I'll fix it up.

Thanks,

tglx
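
The reason the irqsave variant silences the report can be illustrated with a toy userspace model. This is only a sketch under loud assumptions: every type and function below (`lock_class_model`, `cpu_model`, the `model_*` helpers, `replay_splat()`) is hypothetical and invented for illustration, and real lockdep tracks far more state than two booleans. The model records whether a lock class was ever taken from hardirq context (the IN-HARDIRQ-W usage in the splat) and whether it was ever taken with hardirqs enabled (HARDIRQ-ON-W), and flags the combination of the two.

```c
#include <stdbool.h>

/* Toy model of hardirq usage tracking for ONE lock class. */
struct lock_class_model {
	bool used_in_hardirq;    /* ever acquired from hardirq context  */
	bool used_hardirqs_on;   /* ever acquired with hardirqs enabled */
};

/* Toy model of the context a CPU is running in. */
struct cpu_model {
	bool in_hardirq;         /* currently inside an interrupt handler */
	bool irqs_enabled;       /* local interrupt flag */
};

/* Record one acquisition; false means the two usages are now mixed. */
static bool model_acquire(struct lock_class_model *lc,
			  const struct cpu_model *cpu)
{
	if (cpu->in_hardirq)
		lc->used_in_hardirq = true;
	else if (cpu->irqs_enabled)
		lc->used_hardirqs_on = true;
	return !(lc->used_in_hardirq && lc->used_hardirqs_on);
}

/* Plain lock: the interrupt flag stays as the caller left it. */
static bool model_raw_spin_lock(struct lock_class_model *lc,
				struct cpu_model *cpu)
{
	return model_acquire(lc, cpu);
}

/* irqsave variant: remember the flag, disable interrupts, then acquire. */
static bool model_raw_spin_lock_irqsave(struct lock_class_model *lc,
					struct cpu_model *cpu, bool *flags)
{
	*flags = cpu->irqs_enabled;
	cpu->irqs_enabled = false;
	return model_acquire(lc, cpu);
}

/* Restore whatever interrupt state the caller had before locking. */
static void model_raw_spin_unlock_irqrestore(struct cpu_model *cpu, bool flags)
{
	cpu->irqs_enabled = flags;
}

/* Replays the two acquisitions from the report; true means consistent. */
bool replay_splat(bool use_irqsave)
{
	struct lock_class_model lc = { false, false };
	struct cpu_model timer_irq = { .in_hardirq = true,  .irqs_enabled = false };
	struct cpu_model cpuhp     = { .in_hardirq = false, .irqs_enabled = true  };
	bool flags, ok;

	/* First user: taken from the timer interrupt path. */
	ok = model_raw_spin_lock_irqsave(&lc, &timer_irq, &flags);
	model_raw_spin_unlock_irqrestore(&timer_irq, flags);

	/* Second user: the cpuhp thread, entered with interrupts enabled. */
	if (use_irqsave) {
		ok = ok && model_raw_spin_lock_irqsave(&lc, &cpuhp, &flags);
		model_raw_spin_unlock_irqrestore(&cpuhp, flags);
	} else {
		ok = ok && model_raw_spin_lock(&lc, &cpuhp);
	}
	return ok;
}
```

In the model, `replay_splat(false)` reports the inconsistency and `replay_splat(true)` does not. Note that the model only covers the lock itself; it says nothing about the separate concern raised above, namely that code touching the local clock event device before the lock is taken also relied on interrupts being disabled.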




Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-12 Thread Mike Galbraith
On Mon, 2017-02-13 at 02:26 +0100, Gabriel C wrote:

> [5.276704]        CPU0
> [5.312400]        ----
> [5.347605]   lock(tick_broadcast_lock);
> [5.383163]   <Interrupt>
> [5.418457]     lock(tick_broadcast_lock);
> [5.454015]
>  *** DEADLOCK ***
> 
> [5.557982] no locks held by cpuhp/0/14.

Oh, that looks familiar...

tick/broadcast: Make tick_broadcast_control() use raw_spinlock_irqsave()

Otherwise we end up with the lockdep splat below:

[   12.703619] =================================
[   12.703619] [ INFO: inconsistent lock state ]
[   12.703621] 4.10.0-rt1-rt #18 Not tainted
[   12.703622] ---------------------------------
[   12.703623] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[   12.703624] cpuhp/0/23 [HC0[0]:SC0[0]:HE1:SE1] takes:
[   12.703625]  (tick_broadcast_lock){?.}, at: [] 
tick_broadcast_control+0x5a/0x1a0
[   12.703632] {IN-HARDIRQ-W} state was registered at:
[   12.703637] [] __lock_acquire+0xa21/0x1550
[   12.703639] [] lock_acquire+0xbd/0x250
[   12.703642] [] _raw_spin_lock_irqsave+0x53/0x70
[   12.703644] [] tick_broadcast_switch_to_oneshot+0x16/0x50
[   12.703646] [] tick_switch_to_oneshot+0x59/0xd0
[   12.703647] [] tick_init_highres+0x15/0x20
[   12.703652] [] hrtimer_run_queues+0x9f/0xe0
[   12.703654] [] run_local_timers+0x25/0x60
[   12.703656] [] update_process_times+0x2c/0x60
[   12.703659] [] tick_periodic+0x2f/0x100
[   12.703661] [] tick_handle_periodic+0x24/0x70
[   12.703664] [] local_apic_timer_interrupt+0x33/0x60
[   12.703669] [] smp_apic_timer_interrupt+0x38/0x50
[   12.703671] [] apic_timer_interrupt+0x9d/0xb0
[   12.703672] [] mwait_idle+0x94/0x290
[   12.703676] [] arch_cpu_idle+0xf/0x20
[   12.703677] [] default_idle_call+0x31/0x60
[   12.703681] [] do_idle+0x175/0x290
[   12.703683] [] cpu_startup_entry+0x48/0x50
[   12.703687] [] start_secondary+0x133/0x160
[   12.703689] [] verify_cpu+0x0/0xfc
[   12.703690] irq event stamp: 71
[   12.703691] hardirqs last  enabled at (71): [] 
_raw_spin_unlock_irq+0x2c/0x80
[   12.703696] hardirqs last disabled at (70): [] 
__schedule+0x9c/0x7e0
[   12.703699] softirqs last  enabled at (0): [] 
copy_process.part.34+0x5f1/0x22d0
[   12.703700] softirqs last disabled at (0): [<  (null)>]   
(null)
[   12.703701] 
[   12.703701] other info that might help us debug this:
[   12.703701]  Possible unsafe locking scenario:
[   12.703701] 
[   12.703701]        CPU0
[   12.703702]        ----
[   12.703702]   lock(tick_broadcast_lock);
[   12.703703]   <Interrupt>
[   12.703704]     lock(tick_broadcast_lock);
[   12.703705] 
[   12.703705]  *** DEADLOCK ***
[   12.703705] 
[   12.703705] no locks held by cpuhp/0/23.
[   12.703705] 
[   12.703705] stack backtrace:
[   12.703707] CPU: 0 PID: 23 Comm: cpuhp/0 Not tainted 4.10.0-rt1-rt #18
[   12.703708] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 
07/07/2010
[   12.703709] Call Trace:
[   12.703715]  dump_stack+0x85/0xc8
[   12.703717]  print_usage_bug+0x1ea/0x1fb
[   12.703719]  ? print_shortest_lock_dependencies+0x1c0/0x1c0
[   12.703721]  mark_lock+0x20d/0x290
[   12.703723]  __lock_acquire+0x8e6/0x1550
[   12.703724]  ? __lock_acquire+0x2ce/0x1550
[   12.703726]  ? load_balance+0x1b4/0xaf0
[   12.703728]  lock_acquire+0xbd/0x250
[   12.703729]  ? tick_broadcast_control+0x5a/0x1a0
[   12.703735]  ? efifb_probe+0x170/0x170
[   12.703736]  _raw_spin_lock+0x3b/0x50
[   12.703737]  ? tick_broadcast_control+0x5a/0x1a0
[   12.703738]  tick_broadcast_control+0x5a/0x1a0
[   12.703740]  ? efifb_probe+0x170/0x170
[   12.703742]  intel_idle_cpu_online+0x22/0x100
[   12.703744]  cpuhp_invoke_callback+0x245/0x9d0
[   12.703747]  ? finish_task_switch+0x78/0x290
[   12.703750]  ? check_preemption_disabled+0x9f/0x130
[   12.703752]  cpuhp_thread_fun+0x52/0x110
[   12.703754]  smpboot_thread_fn+0x276/0x320
[   12.703757]  kthread+0x10c/0x140
[   12.703759]  ? smpboot_update_cpumask_percpu_thread+0x130/0x130
[   12.703760]  ? kthread_park+0x90/0x90
[   12.703762]  ret_from_fork+0x2a/0x40
[   12.709790] intel_idle: lapic_timer_reliable_states 0x2

Signed-off-by: Mike Galbraith 
---
 kernel/time/tick-broadcast.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -357,6 +357,7 @@ void tick_broadcast_control(enum tick_br
 	struct clock_event_device *bc, *dev;
 	struct tick_device *td;
 	int cpu, bc_stopped;
+	unsigned long flags;
 
 	td = this_cpu_ptr(&tick_cpu_device);
 	dev = td->evtdev;
@@ -370,7 +371,7 @@ void tick_broadcast_control(enum tick_br
 	if (!tick_device_is_functional(dev))
 		return;
 
-	raw_spin_lock(&tick_broadcast_lock);
+	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
 	cpu = smp_processor_id();
 	bc = tick_broadcast_device.evtdev;
 	bc_stopped = cpumask_empty(tick_broadcast_mask);
@@ -420,7 +421,7 @@ void tick_broadcast_control(enum tick_br
  

Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-12 Thread Gabriel C



On 13.02.2017 01:38, Borislav Petkov wrote:
> On Sun, Feb 12, 2017 at 11:21:13PM +0100, Gabriel C wrote:
>> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_initcall_debug.mp4
>> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_intcall_debug_ucode_off.mp4
>
> Thanks and interesting. In both cases, mcheck_init_device() doesn't
> return or we don't see the "initcall returned" message.
>
> Ok, let's try a silly sprinkling of printks in that function and try to
> pinpoint how far we manage to come.
>
> Apply, build, boot and shoot video again :-)



I haven't tested your patch yet, but I did a boot with mce=off and nomce,
which doesn't really seem to work since it still wants to call
mc_device_add() even when off.

See :


http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_mce_off.jpg

I'll build a .10-rc8 with your patch tomorrow .. it's somewhat late here now :)


Another thing is .. there seems to be a real bug in the tsc code.

I've built an -rc8 with a lot more debug options on and now I see the following:

...

[4.321029] =================================
[4.321909] [ INFO: inconsistent lock state ]
[4.322789] 4.10.0-rc8-debug #1 Tainted: G  I
[4.323879] ---------------------------------
[4.324759] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[4.325973] cpuhp/0/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
[4.326993]  (tick_broadcast_lock){?.}, at: [] 
tick_broadcast_control+0x57/0x190
[4.328879] {IN-HARDIRQ-W} state was registered at:
[4.329866]   __lock_acquire+0x24f/0x19e0
[4.330675]   lock_acquire+0xa5/0xd0
[4.331399]   _raw_spin_lock_irqsave+0x54/0x90
[4.332297]   tick_broadcast_switch_to_oneshot+0x11/0x50
[4.71]   tick_switch_to_oneshot+0x8c/0xd0
[4.334269]   tick_init_highres+0x10/0x20
[4.335079]   hrtimer_run_queues+0x5a/0xe0
[4.335907]   run_local_timers+0x20/0x50
[4.336699]   update_process_times+0x22/0x50
[4.337562]   tick_periodic+0xa5/0xb0
[4.338302]   tick_handle_periodic+0x1f/0x60
[4.378065]   smp_trace_apic_timer_interrupt+0x74/0x90
[4.418107]   smp_apic_timer_interrupt+0x9/0x10
[4.458095]   apic_timer_interrupt+0x93/0xa0
[4.498048]   mwait_idle+0x5a/0x90
[4.537618]   arch_cpu_idle+0xa/0x10
[4.577098]   default_idle_call+0x2c/0x30
[4.616211]   do_idle+0x10c/0x1e0
[4.654606]   cpu_startup_entry+0x5d/0x60
[4.692388]   rest_init+0x12c/0x140
[4.729557]   start_kernel+0x45f/0x46c
[4.766325]   x86_64_start_reservations+0x2a/0x2c
[4.803075]   x86_64_start_kernel+0xeb/0xf8
[4.839178]   verify_cpu+0x0/0xfc
[4.874629] irq event stamp: 71
[4.909417] hardirqs last  enabled at (71): [] 
_raw_spin_unlock_irq+0x27/0x50
[4.945642] hardirqs last disabled at (70): [] 
__schedule+0x13a/0x7c0
[4.981797] softirqs last  enabled at (0): [] 
copy_process+0x7c0/0x1ea0
[5.018580] softirqs last disabled at (0): [<  (null)>]   
(null)
[5.055677]
   other info that might help us debug this:
[5.072455] tsc: Refined TSC clocksource calibration: 2266.746 MHz
[5.072467] clocksource: tsc: mask: 0x max_cycles: 
0x20ac7f6ecc6, max_idle_ns: 440795315461 ns
[5.202828]  Possible unsafe locking scenario:


^

This seems to be the place where the other patch breaks hell here..


[5.276704]        CPU0
[5.312400]        ----
[5.347605]   lock(tick_broadcast_lock);
[5.383163]   <Interrupt>
[5.418457]     lock(tick_broadcast_lock);
[5.454015]
*** DEADLOCK ***

[5.557982] no locks held by cpuhp/0/14.
[5.592295]
   stack backtrace:
[5.657946] CPU: 0 PID: 14 Comm: cpuhp/0 Tainted: G  I 
4.10.0-rc8-debug #1
[5.690740] Hardware name: FUJITSU  PRIMERGY TX200 
S5 /D2709, BIOS 6.00 Rev. 1.14.2709  02/04/2013
[5.758323] Call Trace:
[5.791434]  dump_stack+0x86/0xc1
[5.824421]  print_usage_bug+0x283/0x2a0
[5.857357]  mark_lock+0x39e/0x650
[5.890256]  ? check_usage_forwards+0xf0/0xf0
[5.923436]  __lock_acquire+0x2ba/0x19e0
[5.956617]  ? pick_next_task_fair+0x350/0x700
[5.989903]  ? finish_task_switch+0x184/0x220
[6.023171]  ? debug_smp_processor_id+0x17/0x20
[6.056667]  lock_acquire+0xa5/0xd0
[6.089882]  ? tick_broadcast_control+0x57/0x190
[6.123395]  ? smpboot_thread_fn+0x28/0x250
[6.156838]  _raw_spin_lock+0x3c/0x80
[6.190175]  ? tick_broadcast_control+0x57/0x190
[6.223914]  tick_broadcast_control+0x57/0x190
[6.257846]  ? finish_task_switch+0x184/0x220
[6.291900]  ? smpboot_thread_fn+0x28/0x250
[6.325991]  intel_idle_cpu_online+0x1d/0x100
[6.360220]  cpuhp_invoke_callback+0x62/0x120
[6.394397]  ? smpboot_thread_fn+0x28/0x250
[6.428451]  cpuhp_thread_fun+0x87/0x110
[6.462611]  smpboot_thread_fn+0x227/0x250
[6.496805]  

Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-12 Thread Borislav Petkov
On Sun, Feb 12, 2017 at 11:21:13PM +0100, Gabriel C wrote:
> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_initcall_debug.mp4
> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_intcall_debug_ucode_off.mp4

Thanks and interesting. In both cases, mcheck_init_device() doesn't
return or we don't see the "initcall returned" message.

Ok, let's try a silly sprinkling of printks in that function and try to
pinpoint how far we manage to come.

Apply, build, boot and shoot video again :-)

Thanks.

---
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 8e9725c607ea..70268867cb33 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2565,37 +2565,57 @@ static __init int mcheck_init_device(void)
 	enum cpuhp_state hp_online;
 	int err;
 
+	pr_err("%s: entry\n", __func__);
+
 	if (!mce_available(&boot_cpu_data)) {
 		err = -EIO;
 		goto err_out;
 	}
 
+	pr_err("%s: mce_available\n", __func__);
+
 	if (!zalloc_cpumask_var(&mce_device_initialized, GFP_KERNEL)) {
 		err = -ENOMEM;
 		goto err_out;
 	}
 
+	pr_err("%s: zalloc_cpumask_var\n", __func__);
+
 	mce_init_banks();
 
+	pr_err("%s: mce_init_banks\n", __func__);
+
 	err = subsys_system_register(&mce_subsys, NULL);
 	if (err)
 		goto err_out_mem;
 
+	pr_err("%s: subsys_system_register\n", __func__);
+
 	err = cpuhp_setup_state(CPUHP_X86_MCE_DEAD, "x86/mce:dead", NULL,
 				mce_cpu_dead);
 	if (err)
 		goto err_out_mem;
 
+	pr_err("%s: x86/mce:dead\n", __func__);
+
 	err = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/mce:online",
 				mce_cpu_online, mce_cpu_pre_down);
 	if (err < 0)
 		goto err_out_online;
+
+	pr_err("%s: x86/mce:online\n", __func__);
+
 	hp_online = err;
 
 	register_syscore_ops(&mce_syscore_ops);
 
+	pr_err("%s: register_syscore_ops\n", __func__);
+
 	/* register character device /dev/mcelog */
 	err = misc_register(&mce_chrdev_device);
+
+	pr_err("%s: misc_register, err: 0x%x\n", __func__, err);
+
 	if (err)
 		goto err_register;
 

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-12 Thread Gabriel C



On 12.02.2017 22:12, Borislav Petkov wrote:
> On Sun, Feb 12, 2017 at 09:21:53PM +0100, Gabriel C wrote:
>> There is what I get :
>>
>> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash2.mp4
>
> Ok, I'm watching it frame-by-frame. I can see the microcode getting
> updated to revision 0x19 as in your working dmesg.
>
> The machine hangs here at the
>
> clocksource: tsc: mask: 0x max_cycles: 0x20ac7f6ecc6, 
> max_idle_ns: 440795315461 ns
>
> line too. The exact same numbers even as in the previous run! Strange.
>
>> With dis_ucode_ldr there is some more output , with it will stay like this , 
>> nothing more.
>
> Ok, can you redo the first video but with "initcall_debug" on the
> kernel command line?
>
> And then do video of another run with "initcall_debug dis_ucode_ldr" on
> the kernel command line?
>
> I'd like to see which of the initcalls doesn't return.


Here are both videos .. however the output seems kind of the same this time..

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_initcall_debug.mp4
http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_intcall_debug_ucode_off.mp4

I'll try to find out more tomorrow, but right now I don't even have a clue
where to add some printks to get some more output :(


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-12 Thread Borislav Petkov
On Sun, Feb 12, 2017 at 09:21:53PM +0100, Gabriel C wrote:
> There is what I get :
> 
> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash2.mp4

Ok, I'm watching it frame-by-frame. I can see the microcode getting
updated to revision 0x19 as in your working dmesg.

The machine hangs here at the

clocksource: tsc: mask: 0x max_cycles: 0x20ac7f6ecc6, max_idle_ns: 440795315461 ns

line too. The exact same numbers even as in the previous run! Strange.

> With dis_ucode_ldr there is some more output , with it will stay like this , 
> nothing more.

Ok, can you redo the first video but with "initcall_debug" on the
kernel command line?

And then do video of another run with "initcall_debug dis_ucode_ldr" on
the kernel command line?

I'd like to see which of the initcalls doesn't return.
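
For reference, a rough sketch of how that check could be automated once
the console output has been transcribed to text. It assumes that
initcall_debug prints a "calling  <fn> @ <pid>" line before each initcall
and an "initcall <fn> returned <rc> after <n> usecs" line after it. The
helper name last_unreturned_initcall() is made up for this illustration,
and it tracks only one pending initcall at a time, so asynchronously
probed initcalls would confuse it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan initcall_debug output and copy into 'out' the
 * name of the last initcall that was announced with "calling" but has no
 * matching "initcall <name> returned" line after it.
 * Returns 1 if such an initcall was found, 0 otherwise.
 */
static int last_unreturned_initcall(const char *log, char *out, size_t outlen)
{
	char pending[128] = "";	/* initcall currently in flight, if any */
	char name[128];
	const char *line = log;

	assert(log && out && outlen > 0);

	while (*line) {
		const char *end = strchr(line, '\n');
		size_t len = end ? (size_t)(end - line) : strlen(line);
		char buf[256];
		const char *p;

		/* Work on a NUL-terminated, length-capped copy of the line. */
		if (len >= sizeof(buf))
			len = sizeof(buf) - 1;
		memcpy(buf, line, len);
		buf[len] = '\0';

		if ((p = strstr(buf, "calling  ")) &&
		    sscanf(p, "calling %127s", name) == 1) {
			/* A new initcall started: remember it. */
			strcpy(pending, name);
		} else if ((p = strstr(buf, "initcall ")) &&
			   strstr(p, " returned ") &&
			   sscanf(p, "initcall %127s", name) == 1 &&
			   strcmp(name, pending) == 0) {
			/* The pending initcall came back: forget it. */
			pending[0] = '\0';
		}

		line = end ? end + 1 : line + strlen(line);
	}

	if (!pending[0])
		return 0;
	snprintf(out, outlen, "%s", pending);
	return 1;
}
```

Feeding it the text of a hung boot should leave the name of the stuck
initcall in the output buffer; on a boot that completes, it returns 0.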

Thanks!

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-12 Thread Gabriel C



On 11.02.2017 22:32, Borislav Petkov wrote:
> On Sat, Feb 11, 2017 at 09:58:26PM +0100, Gabriel C wrote:
> > Yes , it will hang before tsc message ..
> > Also sometimes I have same trace sometimes it just hangs forever.
>
> It doesn't sound like dis_ucode_ldr changes anything. Or maybe it does,
> maybe the microcode applies some fix for some erratum or whatnot.

Well the bug is still there but at least something in the microcode code seems to
trigger too..

Also when it hangs it looks like this :

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash2.jpg

Will stay like this forever.. no trace or something.

> Right, so please disable that splash screen and do a boot video again
> without the dis_ucode_ldr option.

The problem was vga=.. option ,  not the splash :)

Here is what I get :

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash2.mp4

With dis_ucode_ldr there is some more output , with it will stay like this ,
nothing more.

Also I disabled all 'VT-d' in BIOS .. it doesn't make any difference.

Regards,

Gabriel C




Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-11 Thread Borislav Petkov
On Sat, Feb 11, 2017 at 09:58:26PM +0100, Gabriel C wrote:
> Yes , it will hang before tsc message ..
> Also sometimes I have same trace sometimes it just hangs forever.

It doesn't sound like dis_ucode_ldr changes anything. Or maybe it does,
maybe the microcode applies some fix for some erratum or whatnot.

Right, so please disable that splash screen and do a boot video again
without the dis_ucode_ldr option.

Thanks.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-11 Thread Gabriel C



On 11.02.2017 15:21, Borislav Petkov wrote:
> On Sat, Feb 11, 2017 at 02:09:14PM +0100, Gabriel C wrote:
> > Adding ' dis_ucode_ldr ' to commandline makes the kernel hangs right after :
>
> Wait a minute, are you saying that without dis_ucode_ldr you can't even
> boot so far?

Yes , it will hang before tsc message ..
Also sometimes I have same trace sometimes it just hangs forever.

> > clocksource: tsc: mask: 0x max_cycles: 0x20ac7f6ecc6, max_idle_ns: 440795315461 ns
> > ...
> >
> > and I have the bug triggered really quick..
> >
> > Also I cannot get netconsole to work , I'm sure is some problem here local
> > and I don't have any serial cable around right now. The only way I saw now
> > to give you at least some info is to make an video of that crash. You can
> > find it there :
> >
> > http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash.mp4
>
> Watchdog fires on all cores showing they're all idle. For some reason,
> not all cores get to dump the watchdog splat, though. Some seem really
> stuck.
>
> And you have TAINT_FIRMWARE_WORKAROUND due to
> intel_prepare_irq_remapping() noticing intr remapping is broken on that
> box.

Well yes, and I'm not so sure it is really broken .. I reverted the patch that
blacklisted my box right after it was added at the time and I don't have any
issues .. however, since I have no use for that feature, I don't really care
whether it is marked broken or not..

> Would be better if you could disable that frugalware splash screen and
> switch to grub console mode so that we can see the very beginning of the
> boot.

I'll do that tomorrow ...

> Btw, your BIOS is from 2013. Is there new one, per chance, on your
> vendor's site? Might wanna consider updating it...

This BIOS was / is newest one :(



Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-11 Thread Borislav Petkov
On Sat, Feb 11, 2017 at 02:09:14PM +0100, Gabriel C wrote:
> Adding ' dis_ucode_ldr ' to commandline makes the kernel hangs right after :

Wait a minute, are you saying that without dis_ucode_ldr you can't even
boot so far?

> clocksource: tsc: mask: 0x max_cycles: 0x20ac7f6ecc6, max_idle_ns: 440795315461 ns
> ...
> 
> and I have the bug triggered really quick..
> 
> Also I cannot get netconsole to work , I'm sure is some problem here local
> and I don't have any serial cable around right now. The only way I saw now
> to give you at least some info is to make an video of that crash. You can
> find it there :
> 
> http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash.mp4

Watchdog fires on all cores showing they're all idle. For some reason,
not all cores get to dump the watchdog splat, though. Some seem really
stuck.

And you have TAINT_FIRMWARE_WORKAROUND due to
intel_prepare_irq_remapping() noticing intr remapping is broken on that
box.

Would be better if you could disable that frugalware splash screen and
switch to grub console mode so that we can see the very beginning of the
boot.

Btw, your BIOS is from 2013. Is there new one, per chance, on your
vendor's site? Might wanna consider updating it...

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-11 Thread Gabriel C



On 11.02.2017 09:26, Thomas Gleixner wrote:
> You might try with 'earlyprintk' on the command line. That should tell more.

With that I have some more output.. and after lots more boots I found out there
are really at least 2 bugs triggered by this in 4.10.

When just booting with earlyprintk=vga debug ignore_loglevel the kernel hangs
right after :

Key type dns_resolver registered
..

The cursor blinks and one has to wait a while for this bug to trigger, if at all.
Sometimes it just hangs there and that's it.

Adding ' dis_ucode_ldr ' to commandline makes the kernel hang right after :

clocksource: tsc: mask: 0x max_cycles: 0x20ac7f6ecc6, max_idle_ns: 440795315461 ns
...

and I have the bug triggered really quick..

Also I cannot get netconsole to work , I'm sure it is some local problem here,
and I don't have any serial cable around right now. The only way I see right now
to give you at least some info is to make a video of that crash. You can find it
here :

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash.mp4

Also this is after waiting a while :

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/20170211_130319.jpg

I know videos and pictures are not the best solution but right now I don't have
any other way to capture some logs :|

The kernel is Linus git tree + .d966564fcdc19e13eb6ba1fbe6b8101070339c3d
reverted and the config is :

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/config

I hope the video helps at least somewhat to have a clue what could be wrong.


Regards,

Gabriel C



Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-11 Thread Thomas Gleixner
On Sat, 11 Feb 2017, Gabriel C wrote:
> On 07.02.2017 22:25, Thomas Gleixner wrote:
> Hi Thomas ,
> 
> Sorry I was travelling..

Nothing to be sorry about.

> > Btw, how far in the boot process is the machine when this happens?
> 
> Right after :
> 
> Uncompressing Linux.
> Booting the kernel..
> 
> So early..

You might try with 'earlyprintk' on the command line. That should tell more.

> One thing is strange .. on all one socket boxes I have the kernel seems
> to be fine with that patch while breaks on both dual socket boxes ( well
> both have near same HW )

That's really weird.

Thanks,

tglx


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-10 Thread Gabriel C



On 11.02.2017 00:17, Gabriel C wrote:
> > Btw, how far in the boot process is the machine when this happens?
>
> Right after :
>
> Uncompressing Linux.
> Booting the kernel..
>
> So early..

After lots more boots .. I found out sometimes it gets to :

..

[4.656826] Key type dns_resolver registered

..

The next line(s) in all my logs would be :

..

[4.657507] microcode: sig=0x106a5, pf=0x1, revision=0x19
[4.658678] microcode: Microcode Update Driver: v2.01 , Peter Oruba

..

so maybe some sort of race in the microcode code ? but this would be strange ?


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-10 Thread Gabriel C



On 07.02.2017 22:25, Thomas Gleixner wrote:
> On Tue, 7 Feb 2017, Thomas Gleixner wrote:

Hi Thomas ,

Sorry I was travelling..

> > Gabriel, can you please send me the bootlog from a working kernel?

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/dmesg

( If you wish I can send you one from .10-rc with that patch reverted )

> Plus content of /proc/interrupts.

http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/interrupts

> Btw, how far in the boot process is the machine when this happens?

Right after :

Uncompressing Linux.
Booting the kernel..

So early..

One thing is strange .. on all one socket boxes I have the kernel seems to be
fine with that patch while it breaks on both dual socket boxes ( well both have
near same HW )

Also I'm going to test your patch from your other email

Regards,

Gabriel C


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-07 Thread Thomas Gleixner
On Mon, 6 Feb 2017, Linus Torvalds wrote:
> That said, it also strikes me that the implicated
> irq_chip_retrigger_hierarchy() function looks really very suspicious
> indeed.
> 
> Most of the other users don't seem to traverse the parent all the way
> until they find something. They just do the operation in the parent,
> and if the parent needs it, it might then do it in _its_ parent and so
> on.

The whole point of the hierarchy is that we have decoupled the stacked
chips so the ioapic does not know whether it is connected to the irq
remapping unit or to the vector domain directly.

> So I'm wondering if that for-loop triggers a stack overflow on your
> setup somehow, just because that irq_retrigger() call is now truly
> recursive, and hasn't been turned into tail-calls.

It would only be recursive if some level down the hierarchy would use the
same callback.

The ioapic is always on top of its hierarchy and it's either connected to
the vector domain directly, which is the last level in the hierarchy and
implements the real retrigger callback, or to the irq remapping unit which
does not have a retrigger callback. So it's not a recursion problem AFAICT,
but let's try and just use the apic callback directly as we did before the
whole hierarchy rework. That's wrong for other reasons, but that does not
matter in that particular case. Patch below.
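
The walk being discussed can be modelled in a few lines of userspace C.
This is only a sketch: the model_* types and functions are simplified
stand-ins invented for illustration, not the kernel's struct irq_chip or
struct irq_data, and the loop is written from the description in this
thread (skip levels without a retrigger callback, call the first one
found) rather than copied from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

struct model_irq_data;

/* Simplified stand-in for an irq chip: only the retrigger callback. */
struct model_irq_chip {
	const char *name;
	/* NULL when this level has no retrigger callback of its own. */
	int (*irq_retrigger)(struct model_irq_data *data);
};

/* Simplified stand-in for one level of per-interrupt hierarchy data. */
struct model_irq_data {
	const struct model_irq_chip *chip;
	struct model_irq_data *parent_data;	/* NULL at the last level */
	int retriggered;			/* bumped by the real callback */
};

/* Last level: stands in for the vector domain's real retrigger. */
static int model_vector_retrigger(struct model_irq_data *data)
{
	data->retriggered++;
	return 1;
}

/*
 * Walk towards the last level, starting at the caller's parent, and invoke
 * the first retrigger callback found. Levels without one are skipped.
 */
static int model_retrigger_hierarchy(struct model_irq_data *data)
{
	assert(data != NULL);

	for (data = data->parent_data; data; data = data->parent_data)
		if (data->chip && data->chip->irq_retrigger)
			return data->chip->irq_retrigger(data);

	return 0;
}

/* Three chips matching the two stackings described above. */
static const struct model_irq_chip model_vector_chip = {
	.name = "VECTOR", .irq_retrigger = model_vector_retrigger,
};
static const struct model_irq_chip model_remap_chip = {
	.name = "IR", .irq_retrigger = NULL,
};
static const struct model_irq_chip model_ioapic_chip = {
	.name = "IO-APIC", .irq_retrigger = model_retrigger_hierarchy,
};
```

In this model the walk only recurses if a level below the ioapic also
points its callback at model_retrigger_hierarchy(). With the two stackings
above (ioapic on the vector level, or ioapic on the remapping level on the
vector level) it ends in model_vector_retrigger() after at most two steps.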

We have the same situation with the MSI interrupt domains which all use
the irq_chip_retrigger_hierarchy() function as their retrigger callback, which
does not seem to have the same effect on Gabriel's machine.

I have the feeling that this commit unearths some other subtle wreckage in
the interrupt machinery which does not get triggered otherwise.

Thanks,

tglx

8<---
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 52f352b063fd..3b6e5f3f099d 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1867,6 +1867,8 @@ static int ioapic_set_affinity(struct irq_data *irq_data,
 	return ret;
 }
 
+extern int apic_retrigger_irq(struct irq_data *irq_data);
+
 static struct irq_chip ioapic_chip __read_mostly = {
 	.name			= "IO-APIC",
 	.irq_startup		= startup_ioapic_irq,
@@ -1875,7 +1877,7 @@ static struct irq_chip ioapic_chip __read_mostly = {
 	.irq_ack		= irq_chip_ack_parent,
 	.irq_eoi		= ioapic_ack_level,
 	.irq_set_affinity	= ioapic_set_affinity,
-	.irq_retrigger		= irq_chip_retrigger_hierarchy,
+	.irq_retrigger		= apic_retrigger_irq,
 	.flags			= IRQCHIP_SKIP_SET_WAKE,
 };
 
@@ -1887,7 +1889,7 @@ static struct irq_chip ioapic_ir_chip __read_mostly = {
 	.irq_ack		= irq_chip_ack_parent,
 	.irq_eoi		= ioapic_ir_ack_level,
 	.irq_set_affinity	= ioapic_set_affinity,
-	.irq_retrigger		= irq_chip_retrigger_hierarchy,
+	.irq_retrigger		= apic_retrigger_irq,
 	.flags			= IRQCHIP_SKIP_SET_WAKE,
 };
 
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 5d30c5e42bb1..ce9b93e19266 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -496,7 +496,7 @@ void setup_vector_irq(int cpu)
 	__setup_vector_irq(cpu);
 }
 
-static int apic_retrigger_irq(struct irq_data *irq_data)
+int apic_retrigger_irq(struct irq_data *irq_data)
 {
 	struct apic_chip_data *data = apic_chip_data(irq_data);
 	unsigned long flags;





Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-07 Thread Thomas Gleixner
On Tue, 7 Feb 2017, Thomas Gleixner wrote:
> Gabriel, can you please send me the bootlog from a working kernel?

Plus content of /proc/interrupts.

Btw, how far in the boot process is the machine when this happens?

Thanks,

tglx


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-07 Thread Thomas Gleixner
On Mon, 6 Feb 2017, Linus Torvalds wrote:
> But for now, I'd be inclined to just revert it unless somebody has a
> "Duh!" moment and can tell me what's wrong with that commit with an
> obvious fix.

I have no "Duh!" moment even after staring at the code for quite a while.

Gabriel, can you please send me the bootlog from a working kernel?

Thanks,

tglx



Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-06 Thread Linus Torvalds
On Mon, Feb 6, 2017 at 9:30 AM, Gabriel C wrote:
>
> Somewhat late , however I didn't tested 4.9.6 but jumped from 4.9.5 to 4.9.7
> and found out by box won't boot anymore.
>
> It hangs early and freeze with a lot RCU warnings.
> Since I cannot setup a netconsole right now I cannot post the errors ,
> really sorry.
>
> ( but I could make a picture if needed )
>
> I bisected it down to :
>
>> Ruslan Ruslichenko (1):
>>   x86/ioapic: Restore IO-APIC irq_chip retrigger callback

Ok, it's

020eb3daaba2 ("x86/ioapic: Restore IO-APIC irq_chip retrigger callback")

in mainline.

> Reverting this one fixes the problem for me..

Since that came in rather late, I suspect we'll have to revert for
now.  The thing it fixes has been around for almost two years, so it
can't be as serious a problem as the fix itself ended up being.

Thomas?

That said, it also strikes me that the implicated
irq_chip_retrigger_hierarchy() function looks really very suspicious
indeed.

Most of the other users don't seem to traverse the parent all the way
until they find something. They just do the operation in the parent,
and if the parent needs it, it might then do it in _its_ parent and so
on.

And the compiler is able to turn the parent call into a tail call so
it doesn't cause a stack use explosion even if the parenthood chains
end up being pretty deep.
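
A minimal sketch of that "do the operation in the parent" pattern,
modelled loosely on what a helper like irq_chip_ack_parent() is described
as doing here. The model_* names and struct layouts are assumptions made
up for the example, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

struct model_irq_data;

/* Simplified stand-in for an irq chip: only the ack callback. */
struct model_irq_chip {
	void (*irq_ack)(struct model_irq_data *data);
};

/* Simplified stand-in for one level of per-interrupt hierarchy data. */
struct model_irq_data {
	const struct model_irq_chip *chip;
	struct model_irq_data *parent_data;	/* NULL at the last level */
	int acked;				/* bumped by the real ack */
};

/* Last level: really acknowledges the interrupt. */
static void model_apic_ack(struct model_irq_data *data)
{
	data->acked++;
}

/*
 * One-level delegation: hand the operation to the immediate parent and let
 * the parent decide whether to pass it on. The call is the last thing this
 * function does, so a compiler may emit it as a tail call (a jump) instead
 * of growing the stack.
 */
static void model_ack_parent(struct model_irq_data *data)
{
	assert(data && data->parent_data);

	data = data->parent_data;
	data->chip->irq_ack(data);
}

static const struct model_irq_chip model_apic_chip = {
	.irq_ack = model_apic_ack,
};
static const struct model_irq_chip model_forwarding_chip = {
	.irq_ack = model_ack_parent,
};
```

Each forwarding level adds one call at most, and none if the compiler
turns it into a jump, so the depth of the chain does not by itself cost
stack space.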

So I'm wondering if that for-loop triggers a stack overflow on your
setup somehow, just because that irq_retrigger() call is now truly
recursive, and hasn't been turned into tail-calls.

But for now, I'd be inclined to just revert it unless somebody has a
"Duh!" moment and can tell me what's wrong with that commit with an
obvious fix.

Comments?

  Linus


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-06 Thread Gabriel C



On 06.02.2017 20:05, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> On 02/06/2017 07:41 PM, Greg KH wrote:
> > On Mon, Feb 06, 2017 at 06:30:15PM +0100, Gabriel C wrote:
> > >
> > > On 26.01.2017 08:48, Greg KH wrote:
> > >
> > > Hi Greg,
> > >
> > > > I'm announcing the release of the 4.9.6 kernel.
> > >
> > > Somewhat late, but I didn't test 4.9.6; I jumped from 4.9.5 to 4.9.7
> > > and found out my box won't boot anymore.
> > >
> > > It hangs early and freezes with a lot of RCU warnings.
> > > Since I cannot set up a netconsole right now, I cannot post the
> > > errors, really sorry.
> > >
> > > (but I could take a picture if needed)
> > >
> > > I bisected it down to:
> > >
> > > > Ruslan Ruslichenko (1):
> > > >   x86/ioapic: Restore IO-APIC irq_chip retrigger callback
> > >
> > > Reverting this one fixes the problem for me.
> > >
> > > This problem also exists in Linus' tree; I tested on:
> > > 4.10.0-rc6-00167-ga0a28644c1cf
> >
> > Ok, at least we are consistent :)
> >
> > > The box is a PRIMERGY TX200 S5, 2 sockets, 2 x E5520 CPUs installed.
> > >
> > > Config:
> > > https://raw.githubusercontent.com/frugalware/frugalware-current/master/source/base/kernel/config.x86_64
> >
> > Ruslan, any thoughts about what to do here?
>
> This looks strange. All this patch does is restore the previous
> behavior, which was broken by d32932d02e18.
> So we can try testing with the last v4.1 stable, where the retrigger
> callback was still present.

I can test that if you wish, but not before the weekend.

> Also, on v4.10, maybe check with software emulation of this feature
> and the patch reverted, e.g.:
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index e487493..49c3c71 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -170,6 +170,7 @@ config X86
>  	select USER_STACKTRACE_SUPPORT
>  	select VIRT_TO_BUS
>  	select X86_FEATURE_NAMES	if PROC_FS
> +	select HARDIRQS_SW_RESEND
>
>  config INSTRUCTION_DECODER
>  	def_bool y

With the patch reverted plus this one, I get an early kernel panic on
4.10.0-rc7.

With just the patch reverted, the box boots and all seems fine.

> I think logs will be needed for further debugging.

Yes, sure. I just need to find a way to set up something like netconsole here.
Right now I have no way of doing that. I'll try to do that on the weekend too.

Regards,

Gabriel C


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-06 Thread Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)

On 02/06/2017 07:41 PM, Greg KH wrote:
> On Mon, Feb 06, 2017 at 06:30:15PM +0100, Gabriel C wrote:
> >
> > On 26.01.2017 08:48, Greg KH wrote:
> >
> > Hi Greg,
> >
> > > I'm announcing the release of the 4.9.6 kernel.
> >
> > Somewhat late, but I didn't test 4.9.6; I jumped from 4.9.5 to 4.9.7
> > and found out my box won't boot anymore.
> >
> > It hangs early and freezes with a lot of RCU warnings.
> > Since I cannot set up a netconsole right now, I cannot post the
> > errors, really sorry.
> >
> > (but I could take a picture if needed)
> >
> > I bisected it down to:
> >
> > > Ruslan Ruslichenko (1):
> > >   x86/ioapic: Restore IO-APIC irq_chip retrigger callback
> >
> > Reverting this one fixes the problem for me.
> >
> > This problem also exists in Linus' tree; I tested on:
> > 4.10.0-rc6-00167-ga0a28644c1cf
>
> Ok, at least we are consistent :)
>
> > The box is a PRIMERGY TX200 S5, 2 sockets, 2 x E5520 CPUs installed.
> >
> > Config:
> > https://raw.githubusercontent.com/frugalware/frugalware-current/master/source/base/kernel/config.x86_64
>
> Ruslan, any thoughts about what to do here?

This looks strange. All this patch does is restore the previous
behavior, which was broken by d32932d02e18.
So we can try testing with the last v4.1 stable, where the retrigger
callback was still present.

Also, on v4.10, maybe check with software emulation of this feature
and the patch reverted, e.g.:

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e487493..49c3c71 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -170,6 +170,7 @@ config X86
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_FEATURE_NAMES	if PROC_FS
+	select HARDIRQS_SW_RESEND

 config INSTRUCTION_DECODER
 	def_bool y

I think logs will be needed for further debugging.

> thanks,
>
> greg k-h




Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-06 Thread Greg KH
On Mon, Feb 06, 2017 at 06:30:15PM +0100, Gabriel C wrote:
> 
> On 26.01.2017 08:48, Greg KH wrote:
> 
> Hi Greg,
> 
> > I'm announcing the release of the 4.9.6 kernel.
> 
> 
> Somewhat late, but I didn't test 4.9.6; I jumped from 4.9.5 to 4.9.7
> and found out my box won't boot anymore.
> 
> It hangs early and freezes with a lot of RCU warnings.
> Since I cannot set up a netconsole right now, I cannot post the
> errors, really sorry.
> 
> (but I could take a picture if needed)
> 
> I bisected it down to:
> 
> > Ruslan Ruslichenko (1):
> >   x86/ioapic: Restore IO-APIC irq_chip retrigger callback
> 
> Reverting this one fixes the problem for me.
> 
> This problem also exists in Linus' tree; I tested on:
> 4.10.0-rc6-00167-ga0a28644c1cf

Ok, at least we are consistent :)

> The box is a PRIMERGY TX200 S5, 2 sockets, 2 x E5520 CPUs installed.
> 
> Config:
> https://raw.githubusercontent.com/frugalware/frugalware-current/master/source/base/kernel/config.x86_64

Ruslan, any thoughts about what to do here?

thanks,

greg k-h


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-06 Thread Gabriel C


On 26.01.2017 08:48, Greg KH wrote:

Hi Greg,

> I'm announcing the release of the 4.9.6 kernel.

Somewhat late, but I didn't test 4.9.6; I jumped from 4.9.5 to 4.9.7
and found out my box won't boot anymore.

It hangs early and freezes with a lot of RCU warnings.
Since I cannot set up a netconsole right now, I cannot post the
errors, really sorry.

(but I could take a picture if needed)

I bisected it down to:

> Ruslan Ruslichenko (1):
>   x86/ioapic: Restore IO-APIC irq_chip retrigger callback

Reverting this one fixes the problem for me.

This problem also exists in Linus' tree; I tested on:
4.10.0-rc6-00167-ga0a28644c1cf

The box is a PRIMERGY TX200 S5, 2 sockets, 2 x E5520 CPUs installed.

Config:
https://raw.githubusercontent.com/frugalware/frugalware-current/master/source/base/kernel/config.x86_64

Regards,

Gabriel C.

