Re: 4.14.9 doesn't boot (regression)

2017-12-30 Thread Josh Poimboeuf
On Sun, Dec 31, 2017 at 01:03:25AM +0300, Alexander Tsoy wrote:
> > Turns out my previous code to print iret frames was a bit ...
> > misguided, to put it nicely.  Not sure what I was smoking.
> > 
> > Hopefully the below patch should fix it (in place of the previous
> > patch).  Would you mind testing again?
> > 
> 
> With that patch I get:
> 
> [2.160017] NMI backtrace for cpu 0
> [2.160017] CPU: 0 PID: 1 Comm: init Not tainted 4.15.0-rc5 #1
> [2.160017] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-1.fc27 04/01/2014
> [2.160017] RIP: 0010:double_fault+0x0/0x30
> [2.160017] RSP: :fe807fd0 EFLAGS: 00010086
> [2.160017] RAX: ffc0 RBX: 0001 RCX: 
> c101
> [2.160017] RDX: 8edc RSI:  RDI: 
> fe807f58
> [2.160017] RBP:  R08:  R09: 
> 
> [2.160017] R10:  R11:  R12: 
> a3c01426
> [2.160017] R13:  R14:  R15: 
> 
> [2.160017] FS:  () GS:8edcffc0() 
> knlGS:
> [2.160017] CS:  0010 DS:  ES:  CR0: 80050033
> [2.160017] CR2: fe806f08 CR3: 7c153000 CR4: 
> 06b0
> [2.160017] Call Trace:
> [2.160017]  <#DF>
> [2.160017] RIP: 0010:do_double_fault+0xb/0x140
> [2.160017] RSP: :fe806f18 EFLAGS: 00010086
> [2.160017]  

Yes, that's more like it.  I'll clean up the patches and submit them
soon.  These nasty bugs are always a good testcase for the stack dump
code.

Thanks for testing!

-- 
Josh

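The addresses in the backtrace above can be checked with a little arithmetic. What follows is a minimal sketch, not kernel code. It assumes 4 KiB pages and a #DF exception stack that occupies a single page, and it takes the register values exactly as they appear in the archived log, where the upper address bits seem to have been lost. The helper names page_base(), below_stack_page() and undershoot() are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, and the #DF stack is exactly one page. */
#define PAGE_SIZE_BYTES 4096ULL

/* Base address of the page that contains addr. */
uint64_t page_base(uint64_t addr)
{
    return addr & ~(PAGE_SIZE_BYTES - 1);
}

/* True if fault_addr lies below the page holding entry_sp, i.e. the
 * access ran off the bottom of a single-page stack. */
bool below_stack_page(uint64_t entry_sp, uint64_t fault_addr)
{
    return fault_addr < page_base(entry_sp);
}

/* Bytes by which fault_addr undershoots that page; 0 if it does not. */
uint64_t undershoot(uint64_t entry_sp, uint64_t fault_addr)
{
    uint64_t base = page_base(entry_sp);

    return fault_addr < base ? base - fault_addr : 0;
}
```

With the entry RSP (fe807fd0) and CR2 (fe806f08) from the log, below_stack_page() is true and undershoot() gives 0xf8. Under the assumptions above, that would be consistent with an access just below the bottom of the #DF stack page, while RDI (fe807f58) still lies inside it.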

Re: 4.14.9 doesn't boot (regression)

2017-12-30 Thread Alexander Tsoy
On Sat, 30 Dec 2017 11:57:46 -0600,
Josh Poimboeuf wrote:

> On Sat, Dec 30, 2017 at 11:09:46AM -0600, Josh Poimboeuf wrote:
> > On Sat, Dec 30, 2017 at 11:45:13AM +0300, Alexander Tsoy wrote:
> > > On Fri, 29/12/2017 at 21:49 -0600, Josh Poimboeuf wrote:
> > > > On Fri, Dec 29, 2017 at 05:10:35PM -0700, Andy Lutomirski
> > > > wrote:
> > > > > (Also, Josh, the oops code should have printed the contents
> > > > > of the struct pt_regs at the top of the DF stack.  Any idea
> > > > > why it didn't?)  
> > > > 
> > > > Looking at one of the dumps:
> > > > 
> > > >   [  392.774879] NMI backtrace for cpu 0
> > > >   [  392.774881] CPU: 0 PID: 1 Comm: init Not tainted 4.14.9-gentoo #1
> > > >   [  392.774881] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> > > >   [  392.774882] task: 8802368b8000 task.stack: c900c000
> > > >   [  392.774885] RIP: 0010:double_fault+0x0/0x30
> > > >   [  392.774886] RSP: :ff527fd0 EFLAGS: 0086
> > > >   [  392.774887] RAX: 3fc0 RBX: 0001 RCX: c101
> > > >   [  392.774887] RDX: 8802 RSI:  RDI: ff527f58
> > > >   [  392.774887] RBP:  R08:  R09:
> > > >   [  392.774888] R10:  R11:  R12: 816ae726
> > > >   [  392.774888] R13:  R14:  R15:
> > > >   [  392.774889] FS:  () GS:88023fc0() knlGS:
> > > >   [  392.774889] CS:  0010 DS:  ES:  CR0: 80050033
> > > >   [  392.774890] CR2: ff526f08 CR3: 000235b48002 CR4: 001606f0
> > > >   [  392.774892] Call Trace:
> > > >   [  392.774894]  <#DF>
> > > >   [  392.774897]  do_double_fault+0xb/0x140
> > > >   [  392.774898]  
> > > > 
> > > > It should have at least printed the #DF iret frame registers,
> > > > which I recently added support for in "x86/unwinder: Handle
> > > > stack overflows more
> > > > gracefully", which is in both 4.14.9 and 4.15-rc5.
> > > > 
> > > > I think the missing iret regs are due to a bug in
> > > > show_trace_log_lvl(),
> > > > where if the unwind starts with two regs frames in a row, the
> > > > second regs don't get printed.
> > > > 
> > > > Alexander, would you mind reproducing again with the below
> > > > patch?  It should still fail, but this time it should hopefully
> > > > show another RIP/RSP/EFLAGS instead of the
> > > > "do_double_fault+0xb/0x140" line. 
> > > 
> > > Yes, it works:
> > > 
> > > [   23.058064] NMI backtrace for cpu 2
> > > [   23.058068] CPU: 2 PID: 1 Comm: init Not tainted 4.15.0-rc5+ #1
> > > [   23.058069] Hardware name: QEMU Standard PC (i440FX + PIIX,
> > > 1996), BIOS 1.10.2-1.fc27 04/01/2014
> > > [   23.058074] RIP: 0010:double_fault+0x0/0x30
> > > [   23.058075] RSP: :fe85ffd0 EFLAGS: 0086
> > > [   23.058077] RAX: 3fd0 RBX: 0001 RCX:
> > > c101
> > > [   23.058077] RDX: 9681 RSI:  RDI:
> > > fe85ff58
> > > [   23.058078] RBP:  R08:  R09:
> > > 
> > > [   23.058079] R10:  R11:  R12:
> > > 92001426
> > > [   23.058080] R13:  R14:  R15:
> > > 
> > > [   23.058083] FS:  ()
> > > GS:96813fd0() knlGS:
> > > [   23.058084] CS:  0010 DS:  ES:  CR0: 80050033
> > > [   23.058085] CR2: fe85ef08 CR3: 000137a09000 CR4:
> > > 000406a0
> > > [   23.058089] Call Trace:
> > > [   23.058101]  <#DF>
> > > [   23.058104] RIP: 0010:do_double_fault+0xb/0x140
> > > [   23.058105] RSP: :fe85ef18 EFLAGS: 00010086
> > > ORIG_RAX: 
> > > [   23.058106] RAX: 3fd0 RBX: 0001 RCX:
> > > c101
> > > [   23.058107] RDX: 9681 RSI:  RDI:
> > > fe85ff58
> > > [   23.058107] RBP:  R08:  R09:
> > > 
> > > [   23.058108] R10:  R11:  R12:
> > > 92001426
> > > [   23.058108] R13:  R14:  R15:
> > > 
> > > [   23.058111]  
> > > [   23.058111] Code: 05 00 00 48 89 e7 31 f6 e8 2e 8c 61 ff e9 69
> > > 06 00 00 e8 94 05 00 00 48 89 e7 31 f6 e8 1a 8c 61 ff e9 55 06 00
> > > 00 0f 1f 44 00 00 <0f> 1f 00 48 83 c4 88 e8 e4 04 00 00 48 89 e7
> > > 48 8b 74 24 78 48  
> > 
> > That's better indeed, though still not quite right.  It should have
> > only shown a subset of those registers.  One more bug to fix
> > there...  
> 
> Turns out my previous code to print iret frames was a bit ...
> misguided, to put it nicely.  Not sure what I was smoking.
> 
> Hopefully the below patch should fix it (in place of the previous
> 

Re: 4.14.9 doesn't boot (regression)

2017-12-30 Thread Josh Poimboeuf
On Sat, Dec 30, 2017 at 11:09:46AM -0600, Josh Poimboeuf wrote:
> On Sat, Dec 30, 2017 at 11:45:13AM +0300, Alexander Tsoy wrote:
> > > On Fri, 29/12/2017 at 21:49 -0600, Josh Poimboeuf wrote:
> > > On Fri, Dec 29, 2017 at 05:10:35PM -0700, Andy Lutomirski wrote:
> > > > (Also, Josh, the oops code should have printed the contents of the
> > > > struct pt_regs at the top of the DF stack.  Any idea why it
> > > > didn't?)
> > > 
> > > Looking at one of the dumps:
> > > 
> > >   [  392.774879] NMI backtrace for cpu 0
> > >   [  392.774881] CPU: 0 PID: 1 Comm: init Not tainted 4.14.9-gentoo
> > > #1
> > >   [  392.774881] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> > >   [  392.774882] task: 8802368b8000 task.stack: c900c000
> > >   [  392.774885] RIP: 0010:double_fault+0x0/0x30
> > >   [  392.774886] RSP: :ff527fd0 EFLAGS: 0086
> > >   [  392.774887] RAX: 3fc0 RBX: 0001 RCX:
> > > c101
> > >   [  392.774887] RDX: 8802 RSI:  RDI:
> > > ff527f58
> > >   [  392.774887] RBP:  R08:  R09:
> > > 
> > >   [  392.774888] R10:  R11:  R12:
> > > 816ae726
> > >   [  392.774888] R13:  R14:  R15:
> > > 
> > >   [  392.774889] FS:  ()
> > > GS:88023fc0() knlGS:
> > >   [  392.774889] CS:  0010 DS:  ES:  CR0: 80050033
> > >   [  392.774890] CR2: ff526f08 CR3: 000235b48002 CR4:
> > > 001606f0
> > >   [  392.774892] Call Trace:
> > >   [  392.774894]  <#DF>
> > >   [  392.774897]  do_double_fault+0xb/0x140
> > >   [  392.774898]  
> > > 
> > > It should have at least printed the #DF iret frame registers, which I
> > > recently added support for in "x86/unwinder: Handle stack overflows
> > > more
> > > gracefully", which is in both 4.14.9 and 4.15-rc5.
> > > 
> > > I think the missing iret regs are due to a bug in
> > > show_trace_log_lvl(),
> > > where if the unwind starts with two regs frames in a row, the second
> > > regs don't get printed.
> > > 
> > > Alexander, would you mind reproducing again with the below patch?  It
> > > should still fail, but this time it should hopefully show another
> > > RIP/RSP/EFLAGS instead of the "do_double_fault+0xb/0x140" line.
> > > 
> > 
> > Yes, it works:
> > 
> > [   23.058064] NMI backtrace for cpu 2
> > [   23.058068] CPU: 2 PID: 1 Comm: init Not tainted 4.15.0-rc5+ #1
> > [   23.058069] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS 1.10.2-1.fc27 04/01/2014
> > [   23.058074] RIP: 0010:double_fault+0x0/0x30
> > [   23.058075] RSP: :fe85ffd0 EFLAGS: 0086
> > [   23.058077] RAX: 3fd0 RBX: 0001 RCX:
> > c101
> > [   23.058077] RDX: 9681 RSI:  RDI:
> > fe85ff58
> > [   23.058078] RBP:  R08:  R09:
> > 
> > [   23.058079] R10:  R11:  R12:
> > 92001426
> > [   23.058080] R13:  R14:  R15:
> > 
> > [   23.058083] FS:  () GS:96813fd0()
> > knlGS:
> > [   23.058084] CS:  0010 DS:  ES:  CR0: 80050033
> > [   23.058085] CR2: fe85ef08 CR3: 000137a09000 CR4:
> > 000406a0
> > [   23.058089] Call Trace:
> > [   23.058101]  <#DF>
> > [   23.058104] RIP: 0010:do_double_fault+0xb/0x140
> > [   23.058105] RSP: :fe85ef18 EFLAGS: 00010086 ORIG_RAX:
> > 
> > [   23.058106] RAX: 3fd0 RBX: 0001 RCX:
> > c101
> > [   23.058107] RDX: 9681 RSI:  RDI:
> > fe85ff58
> > [   23.058107] RBP:  R08:  R09:
> > 
> > [   23.058108] R10:  R11:  R12:
> > 92001426
> > [   23.058108] R13:  R14:  R15:
> > 
> > [   23.058111]  
> > [   23.058111] Code: 05 00 00 48 89 e7 31 f6 e8 2e 8c 61 ff e9 69 06 00
> > 00 e8 94 05 00 00 48 89 e7 31 f6 e8 1a 8c 61 ff e9 55 06 00 00 0f 1f 44
> > 00 00 <0f> 1f 00 48 83 c4 88 e8 e4 04 00 00 48 89 e7 48 8b 74 24 78 48
> 
> That's better indeed, though still not quite right.  It should have only
> shown a subset of those registers.  One more bug to fix there...

Turns out my previous code to print iret frames was a bit ... misguided,
to put it nicely.  Not sure what I was smoking.

Hopefully the below patch should fix it (in place of the previous
patch).  Would you mind testing again?

diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index c1688c2d0a12..1f86e1b0a5cd 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -56,18 +56,27 @@ void unwind_start(struct unwind_state 

Re: 4.14.9 doesn't boot (regression)

2017-12-30 Thread Josh Poimboeuf
On Sat, Dec 30, 2017 at 11:45:13AM +0300, Alexander Tsoy wrote:
> On Fri, 29/12/2017 at 21:49 -0600, Josh Poimboeuf wrote:
> > On Fri, Dec 29, 2017 at 05:10:35PM -0700, Andy Lutomirski wrote:
> > > (Also, Josh, the oops code should have printed the contents of the
> > > struct pt_regs at the top of the DF stack.  Any idea why it
> > > didn't?)
> > 
> > Looking at one of the dumps:
> > 
> >   [  392.774879] NMI backtrace for cpu 0
> >   [  392.774881] CPU: 0 PID: 1 Comm: init Not tainted 4.14.9-gentoo
> > #1
> >   [  392.774881] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> >   [  392.774882] task: 8802368b8000 task.stack: c900c000
> >   [  392.774885] RIP: 0010:double_fault+0x0/0x30
> >   [  392.774886] RSP: :ff527fd0 EFLAGS: 0086
> >   [  392.774887] RAX: 3fc0 RBX: 0001 RCX:
> > c101
> >   [  392.774887] RDX: 8802 RSI:  RDI:
> > ff527f58
> >   [  392.774887] RBP:  R08:  R09:
> > 
> >   [  392.774888] R10:  R11:  R12:
> > 816ae726
> >   [  392.774888] R13:  R14:  R15:
> > 
> >   [  392.774889] FS:  ()
> > GS:88023fc0() knlGS:
> >   [  392.774889] CS:  0010 DS:  ES:  CR0: 80050033
> >   [  392.774890] CR2: ff526f08 CR3: 000235b48002 CR4:
> > 001606f0
> >   [  392.774892] Call Trace:
> >   [  392.774894]  <#DF>
> >   [  392.774897]  do_double_fault+0xb/0x140
> >   [  392.774898]  
> > 
> > It should have at least printed the #DF iret frame registers, which I
> > recently added support for in "x86/unwinder: Handle stack overflows
> > more
> > gracefully", which is in both 4.14.9 and 4.15-rc5.
> > 
> > I think the missing iret regs are due to a bug in
> > show_trace_log_lvl(),
> > where if the unwind starts with two regs frames in a row, the second
> > regs don't get printed.
> > 
> > Alexander, would you mind reproducing again with the below patch?  It
> > should still fail, but this time it should hopefully show another
> > RIP/RSP/EFLAGS instead of the "do_double_fault+0xb/0x140" line.
> > 
> 
> Yes, it works:
> 
> [   23.058064] NMI backtrace for cpu 2
> [   23.058068] CPU: 2 PID: 1 Comm: init Not tainted 4.15.0-rc5+ #1
> [   23.058069] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.10.2-1.fc27 04/01/2014
> [   23.058074] RIP: 0010:double_fault+0x0/0x30
> [   23.058075] RSP: :fe85ffd0 EFLAGS: 0086
> [   23.058077] RAX: 3fd0 RBX: 0001 RCX:
> c101
> [   23.058077] RDX: 9681 RSI:  RDI:
> fe85ff58
> [   23.058078] RBP:  R08:  R09:
> 
> [   23.058079] R10:  R11:  R12:
> 92001426
> [   23.058080] R13:  R14:  R15:
> 
> [   23.058083] FS:  () GS:96813fd0()
> knlGS:
> [   23.058084] CS:  0010 DS:  ES:  CR0: 80050033
> [   23.058085] CR2: fe85ef08 CR3: 000137a09000 CR4:
> 000406a0
> [   23.058089] Call Trace:
> [   23.058101]  <#DF>
> [   23.058104] RIP: 0010:do_double_fault+0xb/0x140
> [   23.058105] RSP: :fe85ef18 EFLAGS: 00010086 ORIG_RAX:
> 
> [   23.058106] RAX: 3fd0 RBX: 0001 RCX:
> c101
> [   23.058107] RDX: 9681 RSI:  RDI:
> fe85ff58
> [   23.058107] RBP:  R08:  R09:
> 
> [   23.058108] R10:  R11:  R12:
> 92001426
> [   23.058108] R13:  R14:  R15:
> 
> [   23.058111]  
> [   23.058111] Code: 05 00 00 48 89 e7 31 f6 e8 2e 8c 61 ff e9 69 06 00
> 00 e8 94 05 00 00 48 89 e7 31 f6 e8 1a 8c 61 ff e9 55 06 00 00 0f 1f 44
> 00 00 <0f> 1f 00 48 83 c4 88 e8 e4 04 00 00 48 89 e7 48 8b 74 24 78 48

That's better indeed, though still not quite right.  It should have only
shown a subset of those registers.  One more bug to fix there...

-- 
Josh

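The "subset" mentioned above presumably follows from what the hardware saves: on x86-64 an exception pushes only SS, RSP, RFLAGS, CS and RIP (plus an error code for some vectors, #DF included), so the general-purpose registers are not part of an iret frame. Here is a minimal sketch of printing only those fields. struct iret_frame and format_iret_frame() are hypothetical names used for illustration, not the kernel's implementation, and the output layout only loosely imitates the log lines above.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* The five words the CPU pushes for an exception in 64-bit mode,
 * listed from lowest to highest address. The error code that some
 * vectors push below RIP is left out of this sketch. */
struct iret_frame {
    uint64_t ip;
    uint64_t cs;
    uint64_t flags;
    uint64_t sp;
    uint64_t ss;
};

/* Format only the fields an iret frame really holds. A full register
 * dump would also print RAX..R15, which the frame does not contain.
 * Returns what snprintf() returns. */
int format_iret_frame(char *buf, size_t len, const struct iret_frame *f)
{
    return snprintf(buf, len,
                    "RIP: %04" PRIx64 ":%016" PRIx64 "\n"
                    "RSP: %04" PRIx64 ":%016" PRIx64
                    " EFLAGS: %08" PRIx64 "\n",
                    f->cs, f->ip, f->ss, f->sp, f->flags);
}
```

Printing from a structure that only has these five fields makes it impossible to show general-purpose registers that were never saved, which is the kind of over-reporting the reply above describes.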

Re: 4.14.9 doesn't boot (regression)

2017-12-30 Thread Jiri Kosina
On Sat, 30 Dec 2017, Toralf Förster wrote:

> This made the issue go away:
> 
> diff --git a/Makefile b/Makefile
> index ac8c441866b7..11a12947c550 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -414,7 +414,7 @@ LINUXINCLUDE:= \
>  
>  KBUILD_AFLAGS   := -D__ASSEMBLY__
>  KBUILD_CFLAGS   := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
> -  -fno-strict-aliasing -fno-common -fshort-wchar \
> +  -fno-strict-aliasing -fno-common -fshort-wchar -fstack-check=no \
>    -Werror-implicit-function-declaration \
>    -Wno-format-security \
>    -std=gnu89
> 
> But this doesn't solve the root cause, right? So if the root cause is 
> "Gentoo hardened GCC is broken", please just let me know. FWIW, I'm 
> in #gentoo-dev on freenode.

-fstack-check for the kernel is never going to work properly.

That option is purely for userspace: it assumes that all the logic around 
the 'stack guard gap' and the auto-growing semantics is in place, which is 
true for the user stack VMA, but definitely not for the kernel stack.

It's probably the "hardened" flavor of your distro trying to push 
'-fstack-check' to everything it compiles; so I actually think the 
Makefile patch, sanitizing CFLAGS by force-disabling -fstack-check, is 
exactly what we should be doing.

Thanks,

-- 
Jiri Kosina
SUSE Labs

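The auto-growing behaviour of the user stack referred to above can be seen from ordinary userspace code. The sketch below is only an illustration under stated assumptions: a Linux process, 4 KiB pages, and a default stack limit well above the roughly 1 MiB that 256 levels need. touch_pages() is a made-up name. Each level places a page-sized buffer on the stack and touches it, so the stack pointer keeps moving into lower pages, and on the main thread those pages are supplied on demand. A kernel stack has a fixed size, so nothing comparable happens there.

```c
#include <stddef.h>

#define FRAME_BYTES 4096 /* assumed page size */

/* Recurse `depth` levels; each level touches both ends of a
 * page-sized stack buffer. The volatile qualifier keeps the buffer
 * accesses from being optimised away. Returns the number of levels
 * below the base case, i.e. `depth`. */
size_t touch_pages(size_t depth)
{
    volatile unsigned char buf[FRAME_BYTES];

    buf[0] = (unsigned char)depth;
    buf[FRAME_BYTES - 1] = (unsigned char)depth;

    if (depth == 0)
        return 0;

    /* Read the buffer again after the call so this frame stays live. */
    return touch_pages(depth - 1) + (buf[0] == (unsigned char)depth);
}
```

Calling touch_pages(256) returns 256 after pushing the stack about 1 MiB deep. Nothing had to size that stack in advance, and that on-demand growth is what the kernel stack lacks.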

Re: 4.14.9 doesn't boot (regression)

2017-12-30 Thread Toralf Förster
On 12/30/2017 02:13 AM, Alexander Tsoy wrote:
> You are right, it's due to -fstack-check being enabled in Gentoo's GCC specs.
> "-fstack-check=no" in KBUILD_CFLAGS fixed this problem for me. =/

This made the issue go away:

diff --git a/Makefile b/Makefile
index ac8c441866b7..11a12947c550 100644
--- a/Makefile
+++ b/Makefile
@@ -414,7 +414,7 @@ LINUXINCLUDE:= \
 
 KBUILD_AFLAGS   := -D__ASSEMBLY__
 KBUILD_CFLAGS   := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
-  -fno-strict-aliasing -fno-common -fshort-wchar \
+  -fno-strict-aliasing -fno-common -fshort-wchar -fstack-check=no \
   -Werror-implicit-function-declaration \
   -Wno-format-security \
   -std=gnu89

But this doesn't solve the root cause, right? So if the root cause is "Gentoo 
hardened GCC is broken", please just let me know. FWIW, I'm in #gentoo-dev 
on freenode.

-- 
Toralf
PGP C4EACDDE 0076E94E


Re: 4.14.9 doesn't boot (regression)

2017-12-30 Thread Jiri Kosina
On Fri, 29 Dec 2017, Linus Torvalds wrote:

> Ok, so what does seem to be consistent for everybody is that 
> double-fault in the NMI backtrace.
> 
> So the fact that the NMI always hits on a double-fault does make me
> suspect that it's an infinite stream of double-faults, and that is
> presumably also what causes the RCU timeout.

As I've been fighting with recursive double-faults lately (backporting PTI 
to ancient kernels), I can tell you that this is not the symptom you'd be 
seeing in such a case; a recursive double fault pretty quickly overflows 
the interrupt stack and triple-faults.

-- 
Jiri Kosina
SUSE Labs


Re: 4.14.9 doesn't boot (regression)

2017-12-30 Thread Toralf Förster
On 12/30/2017 04:49 AM, Josh Poimboeuf wrote:
> Alexander, would you mind reproducing again with the below patch?  It
> should still fail, but this time it should hopefully show another
> RIP/RSP/EFLAGS instead of the "do_double_fault+0xb/0x140" line.

I applied that too on top of v4.15-rc5-114-g2758b3e3e630 (no other patches or
changes to cflags or so), ran make clean, then built and booted the kernel. It
still hangs; the result is in [1].


[1] https://zwiebeltoralf.de/pub/IMG_20171230_102325.jpg

-- 
Toralf
PGP C4EACDDE 0076E94E


Re: 4.14.9 doesn't boot (regression)

2017-12-30 Thread Toralf Förster
On 12/30/2017 10:14 AM, Alexander Tsoy wrote:
> Yes, and only in hardened profile, so most users don't have
> -fstack-check by default. :)
Indeed, I only run hardened Gentoo.

-- 
Toralf
PGP C4EACDDE 0076E94E


Re: 4.14.9 doesn't boot (regression)

2017-12-30 Thread Alexander Tsoy
On Fri, 29/12/2017 at 17:34 -0800, Linus Torvalds wrote:
> On Fri, Dec 29, 2017 at 5:00 PM, Linus Torvalds wrote:
> > 
> > Good. I was not feeling so happy about this bug report, but now I can
> > firmly just blame the gentoo compiler for having some shit-for-brains
> > "feature".
> 
> Looks like I can generate similar bad code with the F26 version of
> gcc, it's just not enabled by default.
> 
> So all gentoo did was change the default options.

Yes, and only in hardened profile, so most users don't have
-fstack-check by default. :)

> 
> I suspect we should just add a
> 
> KBUILD_CFLAGS  += $(call cc-option,-fno-stack-check,)
> 
> somewhere to the main Makefile, just to make sure.
> 
> Maybe like the appended?
> 
> Toralf, Alexander, does this make things JustWork(tm) for you?

I can confirm that with your patch my gcc produces a working kernel.


Re: 4.14.9 doesn't boot (regression)

2017-12-30 Thread Alexander Tsoy
On Fri, 29/12/2017 at 21:49 -0600, Josh Poimboeuf wrote:
> On Fri, Dec 29, 2017 at 05:10:35PM -0700, Andy Lutomirski wrote:
> > (Also, Josh, the oops code should have printed the contents of the
> > struct pt_regs at the top of the DF stack.  Any idea why it
> > didn't?)
> 
> Looking at one of the dumps:
> 
>   [  392.774879] NMI backtrace for cpu 0
>   [  392.774881] CPU: 0 PID: 1 Comm: init Not tainted 4.14.9-gentoo #1
>   [  392.774881] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>   [  392.774882] task: 8802368b8000 task.stack: c900c000
>   [  392.774885] RIP: 0010:double_fault+0x0/0x30
>   [  392.774886] RSP: :ff527fd0 EFLAGS: 0086
>   [  392.774887] RAX: 3fc0 RBX: 0001 RCX: c101
>   [  392.774887] RDX: 8802 RSI:  RDI: ff527f58
>   [  392.774887] RBP:  R08:  R09:
>   [  392.774888] R10:  R11:  R12: 816ae726
>   [  392.774888] R13:  R14:  R15:
>   [  392.774889] FS:  () GS:88023fc0() knlGS:
>   [  392.774889] CS:  0010 DS:  ES:  CR0: 80050033
>   [  392.774890] CR2: ff526f08 CR3: 000235b48002 CR4: 001606f0
>   [  392.774892] Call Trace:
>   [  392.774894]  <#DF>
>   [  392.774897]  do_double_fault+0xb/0x140
>   [  392.774898]  
> 
> It should have at least printed the #DF iret frame registers, which I
> recently added support for in "x86/unwinder: Handle stack overflows
> more gracefully", which is in both 4.14.9 and 4.15-rc5.
> 
> I think the missing iret regs are due to a bug in show_trace_log_lvl(),
> where if the unwind starts with two regs frames in a row, the second
> regs don't get printed.
> 
> Alexander, would you mind reproducing again with the below patch?  It
> should still fail, but this time it should hopefully show another
> RIP/RSP/EFLAGS instead of the "do_double_fault+0xb/0x140" line.
> 

Yes, it works:

[   23.058064] NMI backtrace for cpu 2
[   23.058068] CPU: 2 PID: 1 Comm: init Not tainted 4.15.0-rc5+ #1
[   23.058069] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc27 04/01/2014
[   23.058074] RIP: 0010:double_fault+0x0/0x30
[   23.058075] RSP: :fe85ffd0 EFLAGS: 0086
[   23.058077] RAX: 3fd0 RBX: 0001 RCX: c101
[   23.058077] RDX: 9681 RSI:  RDI: fe85ff58
[   23.058078] RBP:  R08:  R09:
[   23.058079] R10:  R11:  R12: 92001426
[   23.058080] R13:  R14:  R15:
[   23.058083] FS:  () GS:96813fd0() knlGS:
[   23.058084] CS:  0010 DS:  ES:  CR0: 80050033
[   23.058085] CR2: fe85ef08 CR3: 000137a09000 CR4: 000406a0
[   23.058089] Call Trace:
[   23.058101]  <#DF>
[   23.058104] RIP: 0010:do_double_fault+0xb/0x140
[   23.058105] RSP: :fe85ef18 EFLAGS: 00010086 ORIG_RAX:
[   23.058106] RAX: 3fd0 RBX: 0001 RCX: c101
[   23.058107] RDX: 9681 RSI:  RDI: fe85ff58
[   23.058107] RBP:  R08:  R09:
[   23.058108] R10:  R11:  R12: 92001426
[   23.058108] R13:  R14:  R15:
[   23.058111]  
[   23.058111] Code: 05 00 00 48 89 e7 31 f6 e8 2e 8c 61 ff e9 69 06 00 00 e8 94 05 00 00 48 89 e7 31 f6 e8 1a 8c 61 ff e9 55 06 00 00 0f 1f 44 00 00 <0f> 1f 00 48 83 c4 88 e8 e4 04 00 00 48 89 e7 48 8b 74 24 78 48


Re: 4.14.9 doesn't boot (regression)

2017-12-30 Thread Toralf Förster
On 12/30/2017 01:10 AM, Andy Lutomirski wrote:
> Toralf, can you send the complete output of:
> 
> objdump -dr arch/x86/kernel/traps.o
> 
> From the build tree of a nonworking kernel?

I attached it.

FWIW:

tfoerste@t44 ~/devel/linux $ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/6.4.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /var/tmp/portage/sys-devel/gcc-6.4.0/work/gcc-6.4.0/configure 
--host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr 
--bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/6.4.0 
--includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/6.4.0/include 
--datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/6.4.0 
--mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/6.4.0/man 
--infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/6.4.0/info 
--with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/6.4.0/include/g++-v6 
--with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/6.4.0/python 
--enable-languages=c,c++ --enable-obsolete --enable-secureplt --disable-werror 
--with-system-zlib --enable-nls --without-included-gettext 
--enable-checking=release --with-bugurl=https://bugs.gentoo.org/ 
--with-pkgversion='Gentoo Hardened 6.4.0 p1.1' --enable-esp 
--enable-libstdcxx-time --disable-libstdcxx-pch --enable-shared 
--enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu 
--enable-multilib --with-multilib-list=m32,m64 --disable-altivec 
--disable-fixed-point --enable-targets=all --disable-libgcj --enable-libgomp 
--disable-libmudflap --disable-libssp --disable-libcilkrts --disable-libmpx 
--enable-vtable-verify --enable-libvtv --disable-libquadmath --enable-lto 
--without-isl --disable-libsanitizer --enable-default-pie --enable-default-ssp
Thread model: posix
gcc version 6.4.0 (Gentoo Hardened 6.4.0 p1.1)

-- 
Toralf
PGP C4EACDDE 0076E94E

arch/x86/kernel/traps.o: file format elf64-x86-64


Disassembly of section .text:

 :
   0:   41 57                   push   %r15
   2:   41 56                   push   %r14
   4:   41 55                   push   %r13
   6:   41 54                   push   %r12
   8:   55                      push   %rbp
   9:   53                      push   %rbx
   a:   48 81 ec 28 10 00 00    sub    $0x1028,%rsp
  11:   48 83 0c 24 00          orq    $0x0,(%rsp)
  16:   48 81 c4 20 10 00 00    add    $0x1020,%rsp
  1d:   65 48 8b 2c 25 00 00    mov    %gs:0x0,%rbp
  24:   00 00
                        22: R_X86_64_32S        current_task
  26:   f6 81 88 00 00 00 03    testb  $0x3,0x88(%rcx)
  2d:   4c 63 ef                movslq %edi,%r13
  30:   41 89 f6                mov    %esi,%r14d
  33:   48 89 14 24             mov    %rdx,(%rsp)
  37:   49 89 cc                mov    %rcx,%r12
  3a:   4d 89 c7                mov    %r8,%r15
  3d:   4c 89 cb                mov    %r9,%rbx
  40:   75 3b                   jne    7d
  42:   44 89 ee                mov    %r13d,%esi
  45:   48 89 cf                mov    %rcx,%rdi
  48:   e8 00 00 00 00          callq  4d
                        49: R_X86_64_PC32       fixup_exception-0x4
  4d:   85 c0                   test   %eax,%eax
  4f:   74 0f                   je     60
  51:   48 83 c4 08             add    $0x8,%rsp
  55:   5b                      pop    %rbx
  56:   5d                      pop    %rbp
  57:   41 5c                   pop    %r12
  59:   41 5d                   pop    %r13
  5b:   41 5e                   pop    %r14
  5d:   41 5f                   pop    %r15
  5f:   c3                      retq
  60:   48 8b 3c 24             mov    (%rsp),%rdi
  64:   4c 89 bd c0 09 00 00    mov    %r15,0x9c0(%rbp)
  6b:   4c 89 fa                mov    %r15,%rdx
  6e:   4c 89 e6                mov    %r12,%rsi
  71:   4c 89 ad b8 09 00 00    mov    %r13,0x9b8(%rbp)
  78:   e8 00 00 00 00          callq  7d
                        79: R_X86_64_PC32       die-0x4
  7d:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 83
                        7f: R_X86_64_PC32       show_unhandled_signals-0x4
  83:   4c 89 bd c0 09 00 00    mov    %r15,0x9c0(%rbp)
  8a:   4c 89 ad b8 09 00 00    mov    %r13,0x9b8(%rbp)
  91:   85 c0                   test   %eax,%eax
  93:   75 28                   jne    bd
  95:   48 85 db                test   %rbx,%rbx
  98:   b8 01 00 00 00          mov    $0x1,%eax
  9d:   48 89 ea                mov    %rbp,%rdx
  a0:   48 0f 44 d8             cmove  %rax,%rbx
  a4:   48 83 c4 08             add    $0x8,%rsp
  a8:   44 89 f7                mov    %r14d,%edi
  ab:   48 89 de                mov    %rbx,%rsi
  ae:   5b                      pop    %rbx
  af:   5d                      pop    %rbp
  b0:   41 5c                   pop    %r12
  b2:   41 5d                   pop    %r13
  b4:   41 5e                   pop    %r14
  b6:   41 5f                   pop    %r15
  b8:   e9 00 00 00 00          jmpq   bd
                        b9: R_X86_64_PC32       force_sig_info-0x4
  bd:   44 89 f6

Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Josh Poimboeuf
On Fri, Dec 29, 2017 at 05:10:35PM -0700, Andy Lutomirski wrote:
> (Also, Josh, the oops code should have printed the contents of the
> struct pt_regs at the top of the DF stack.  Any idea why it didn't?)

Looking at one of the dumps:

  [  392.774879] NMI backtrace for cpu 0
  [  392.774881] CPU: 0 PID: 1 Comm: init Not tainted 4.14.9-gentoo #1
  [  392.774881] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
  [  392.774882] task: 8802368b8000 task.stack: c900c000
  [  392.774885] RIP: 0010:double_fault+0x0/0x30
  [  392.774886] RSP: :ff527fd0 EFLAGS: 0086
  [  392.774887] RAX: 3fc0 RBX: 0001 RCX: c101
  [  392.774887] RDX: 8802 RSI:  RDI: ff527f58
  [  392.774887] RBP:  R08:  R09:
  [  392.774888] R10:  R11:  R12: 816ae726
  [  392.774888] R13:  R14:  R15:
  [  392.774889] FS:  () GS:88023fc0() knlGS:
  [  392.774889] CS:  0010 DS:  ES:  CR0: 80050033
  [  392.774890] CR2: ff526f08 CR3: 000235b48002 CR4: 001606f0
  [  392.774892] Call Trace:
  [  392.774894]  <#DF>
  [  392.774897]  do_double_fault+0xb/0x140
  [  392.774898]  

It should have at least printed the #DF iret frame registers, which I
recently added support for in "x86/unwinder: Handle stack overflows more
gracefully", which is in both 4.14.9 and 4.15-rc5.

I think the missing iret regs are due to a bug in show_trace_log_lvl(),
where if the unwind starts with two regs frames in a row, the second
regs don't get printed.

Alexander, would you mind reproducing again with the below patch?  It
should still fail, but this time it should hopefully show another
RIP/RSP/EFLAGS instead of the "do_double_fault+0xb/0x140" line.


diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 36b17e0febe8..39a320d077aa 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -103,6 +103,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 
 	unwind_start(&state, task, regs, stack);
 	stack = stack ? : get_stack_pointer(task, regs);
+	regs = unwind_get_entry_regs(&state);
 
 	/*
 	 * Iterate through the stacks, starting with the current stack pointer.
@@ -120,7 +121,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * - hardirq stack
 	 * - entry stack
 	 */
-	for (regs = NULL; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
+	for ( ; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
 		const char *stack_name;
 
 		if (get_stack_info(stack, task, &stack_info, &visit_mask)) {


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Linus Torvalds
On Fri, Dec 29, 2017 at 5:00 PM, Linus Torvalds wrote:
>
> Good. I was not feeling so happy about this bug report, but now I can
> firmly just blame the gentoo compiler for having some shit-for-brains
> "feature".

Looks like I can generate similar bad code with the F26 version of
gcc, it's just not enabled by default.

So all gentoo did was change the default options.

I suspect we should just add a

KBUILD_CFLAGS  += $(call cc-option,-fno-stack-check,)

somewhere to the main Makefile, just to make sure.

Maybe like the appended?

Toralf, Alexander, does this make things JustWork(tm) for you?

Linus
 Makefile | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Makefile b/Makefile
index ac8c441866b7..92b74bcd3c2a 100644
--- a/Makefile
+++ b/Makefile
@@ -789,6 +789,9 @@ KBUILD_CFLAGS += $(call cc-disable-warning, pointer-sign)
 # disable invalid "can't wrap" optimizations for signed / pointers
 KBUILD_CFLAGS  += $(call cc-option,-fno-strict-overflow)
 
+# Make sure -fstack-check isn't enabled (like gentoo apparently did)
+KBUILD_CFLAGS  += $(call cc-option,-fno-stack-check,)
+
 # conserve stack if available
 KBUILD_CFLAGS   += $(call cc-option,-fconserve-stack)
 


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Alexander Tsoy
On Fri, 29/12/2017 at 17:10 -0700, Andy Lutomirski wrote:
> 
> Also, you wouldn't happen to be using Gentoo perchance?  I already
> have two reports of a Gentoo system miscompiling the vDSO due to
> Gentoo enabling -fstack-check and GCC generating stack check code
> that is highly suboptimal, actively incorrect, and doesn't even
> manage to check the stack in a particularly helpful way.
> 
> If this is indeed what's going on, I'm going to try to come up with a
> patch to outright fail the build on these buggy systems.  We could
> probably fudge the build options to avoid the problem, but Gentoo
> really just needs to fix its toolchain.

You are right, it's due to -fstack-check being enabled in Gentoo's gcc spec.
"-fstack-check=no" in KBUILD_CFLAGS fixed this problem for me. =/


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Linus Torvalds
On Fri, Dec 29, 2017 at 4:10 PM, Andy Lutomirski wrote:
>
> Double faults use IST, so a double fault that double faults will effectively 
> just start over rather than eventually running out of stack and triple 
> faulting.
>
> But check out the registers. We have RSP = ...28fd8 and CR2 = ...27f08.
> IOW the double fault stack is ...28000 - ...28fff and we're somehow getting
> a failed page fault a couple hundred bytes below the bottom of the IST stack.
> IOW, I think we're just stuck in a neverending loop of stack overflows.

Ahh, good catch. This feels like it might finally be explaining things.

> (Also, Josh, the oops code should have printed the contents of the struct 
> pt_regs at the top of the DF stack.  Any idea why it didn't?)
>
> Toralf, can you send the complete output of:
>
> objdump -dr arch/x86/kernel/traps.o
>
> From the build tree of a nonworking kernel?

Alexander made one of his failing kernels available earlier:

https://www.dropbox.com/s/yesupqgig3uxf73/linux-4.15-rc5%2B.tar.xz?dl=0

and yes, there's something seriously wrong there. Doing a disassembly
on "do_double_fault()" shows:

8101bda0 <do_double_fault>:
8101bda0:   41 54                   push   %r12
8101bda2:   55                      push   %rbp
8101bda3:   53                      push   %rbx
8101bda4:   48 81 ec 20 10 00 00    sub    $0x1020,%rsp
8101bdab:   48 83 0c 24 00          orq    $0x0,(%rsp)
8101bdb0:   48 81 c4 20 10 00 00    add    $0x1020,%rsp

WTF? That's bogus crap, and not ok in the kernel.  Doing a stack probe
below the stack by subtracting 4128 from the stack pointer and then
oring it, and then resetting the stack pointer again is just crazy.
And it's definitely not ever going to work for the kernel that has a
limited stack.

So yes, it's a terminally broken compiler from hell. I assume gentoo
has applied some completely broken security patch to their compiler,
turning said compiler into complete garbage.

Doing some trivial grepping on the disassembly in that vmlinux file,
there's tons of those "let's probe more than a page below the stack"
issues. The biggest offset I found was 0x1400.

That one happened to be in do_sys_poll().

> Also, you wouldn't happen to be using Gentoo perchance?

Yes, several people involved are using gentoo. Maybe everybody.

> I already have two reports of a Gentoo system miscompiling the vDSO
> due to Gentoo enabling -fstack-check and GCC generating stack check
> code that is highly suboptimal, actively incorrect, and doesn't even
> manage to check the stack in a particularly helpful way.

Yes. Good. I think you root-caused it.

Good. I was not feeling so happy about this bug report, but now I can
firmly just blame the gentoo compiler for having some shit-for-brains
"feature".

   Linus


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Andy Lutomirski


> On Dec 29, 2017, at 3:53 PM, Linus Torvalds wrote:
> 
>> On Fri, Dec 29, 2017 at 2:30 PM, Toralf Förster wrote:
>> 
>> The bad news - the issue is not solved by the changed cflags.
>> The good news - I could eventually compile a working config for my desktop
>> (works fine with 4.14.10 with generic CPU) having a higher screen resolution
>> during boot.
>> 
>> So I ran "make distclean", followed by "sudo zcat /proc/config.gz >
>> .config", changed the .config to use MCORE2 instead of GENERIC and defined
>> the string "-local" to ensure that the modules directory is really unique.
>> Then I ran "time make -j4 && sudo make modules_install && sudo cp
>> arch/x86_64/boot/bzImage /boot/vmlinuz-0 && sudo grub-mkconfig -o
>> /boot/grub/grub.cfg", booted and took 3 photos, which were uploaded to [1];
>> look for IMG_*
> 
> Ok, so what does seem to be consistent for everybody is that
> double-fault in the NMI backtrace.
> 
> So the fact that the NMI always hits on a double-fault does make me
> suspect that it's an infinite stream of double-faults, and that is
> presumably also what causes the RCU timeout.
> 
> And as I pointed out elsewhere (damn two threads), I think that it
> would help to simply catch the *first* double-fault.
> 
> And I *think* that the only thing that can make a double-fault
> silently be re-tried is the CONFIG_X86_ESPFIX64 case, so if you can
> build a failing kernel with the CONFIG_X86_ESPFIX64 case disabled in
> arch/x86/kernel/traps.c do_double_fault(), that would be interesting.

Double faults use IST, so a double fault that double faults will effectively 
just start over rather than eventually running out of stack and triple faulting.

But check out the registers. We have RSP = ...28fd8 and CR2 = ...27f08. IOW the 
double fault stack is ...28000 - ...28fff and we're somehow getting a failed 
page fault a couple hundred bytes below the bottom of the IST stack.  IOW, I 
think we're just stuck in a neverending loop of stack overflows.
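
To make the arithmetic above concrete, here is a small sketch (an illustration added to this archive, not code from the thread). It uses only the low bits of the addresses as quoted, since the upper bits are elided in the report, and the helper name `bytes_below_stack` is hypothetical.

```python
# Low bits of the addresses quoted above; the upper bits are elided in the
# report ("...28fd8"), so only differences between them are meaningful.
PAGE_SIZE = 0x1000

ist_stack_bottom = 0x28000  # lowest address of the double-fault IST stack
ist_stack_top = 0x28FFF     # highest address of that stack
rsp = 0x28FD8               # RSP from the register dump
cr2 = 0x27F08               # faulting address reported in CR2


def bytes_below_stack(fault_addr: int, stack_bottom: int) -> int:
    """Return how far fault_addr lies below stack_bottom (0 if not below)."""
    return max(0, stack_bottom - fault_addr)


overshoot = bytes_below_stack(cr2, ist_stack_bottom)
print(hex(overshoot), overshoot)               # 0xf8 248
print(ist_stack_bottom <= rsp <= ist_stack_top)  # True: RSP itself is on the stack
print(hex(rsp - cr2), rsp - cr2 > PAGE_SIZE)   # 0x10d0 True
```

So the faulting access is 248 bytes below the IST stack, and more than one page (0x10d0 bytes) below the reported RSP, which is the pattern one would expect from a stack probe rather than from ordinary stack use.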

(Also, Josh, the oops code should have printed the contents of the struct 
pt_regs at the top of the DF stack.  Any idea why it didn't?)

Toralf, can you send the complete output of:

objdump -dr arch/x86/kernel/traps.o

From the build tree of a nonworking kernel?

Also, you wouldn't happen to be using Gentoo perchance?  I already have two 
reports of a Gentoo system miscompiling the vDSO due to Gentoo enabling 
-fstack-check and GCC generating stack check code that is highly suboptimal, 
actively incorrect, and doesn't even manage to check the stack in a 
particularly helpful way.

If this is indeed what's going on, I'm going to try to come up with a patch to
outright fail the build on these buggy systems.  We could probably fudge the
build options to avoid the problem, but Gentoo really just needs to fix its
toolchain.
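
As a rough illustration of how the requested objdump output could be checked (this is not the build-failing patch mentioned above, just a sketch added to this archive), the following looks for the stack-probe sequence quoted elsewhere in this thread: a `sub $N,%rsp` immediately followed by `orq $0x0,(%rsp)`. The function name and regular expressions are assumptions, and the sample lines reuse the disassembly from the thread with addresses as the archive prints them.

```python
import re
from typing import List

# AT&T-syntax patterns as printed by `objdump -d`, for example
#   "sub    $0x1020,%rsp"  and  "orq    $0x0,(%rsp)"
SUB_RSP = re.compile(r"\bsub\s+\$0x([0-9a-f]+),%rsp")
PROBE = re.compile(r"\borq?\s+\$0x0,\(%rsp\)")


def find_stack_probes(disassembly: str) -> List[int]:
    """Return N for every 'sub $N,%rsp' directly followed by a probe of (%rsp)."""
    offsets = []
    pending = None  # N from a 'sub' seen on the previous line, if any
    for line in disassembly.splitlines():
        sub = SUB_RSP.search(line)
        if sub:
            pending = int(sub.group(1), 16)
            continue
        if pending is not None and PROBE.search(line):
            offsets.append(pending)
        pending = None
    return offsets


SAMPLE = """\
8101bda0:   41 54                   push   %r12
8101bda2:   55                      push   %rbp
8101bda3:   53                      push   %rbx
8101bda4:   48 81 ec 20 10 00 00    sub    $0x1020,%rsp
8101bdab:   48 83 0c 24 00          orq    $0x0,(%rsp)
8101bdb0:   48 81 c4 20 10 00 00    add    $0x1020,%rsp
"""

print([hex(n) for n in find_stack_probes(SAMPLE)])  # ['0x1020']
```

Any reported offset larger than a page (0x1000) would be a candidate for the overflow described above.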


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Toralf Förster
On 12/29/2017 11:53 PM, Linus Torvalds wrote:
> So just change the
> 
>   #ifdef CONFIG_X86_ESPFIX64
> 
> into a
> 
>   #if 0
> 
> and see if instead of the RCU stall after 20 seconds, you get an
> immediate double fault error report instead?

Well, the 3 IMG_20171230_0008* images should show the results: https://zwiebeltoralf.de/pub/



-- 
Toralf
PGP C4EACDDE 0076E94E


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Linus Torvalds
On Fri, Dec 29, 2017 at 2:30 PM, Toralf Förster  wrote:
>
> The bad news - the issue is not solved by the changed cflags.
> The good news - I could eventually compile a working config for my desktop
> (works fine with 4.14.10 with generic CPU) having a higher screen resolution
> during boot.
>
> So I ran "make distclean", followed by "sudo zcat /proc/config.gz >
> .config", changed the .config to use MCORE2 instead of GENERIC and defined
> the string "-local" to ensure that the modules directory is really unique.
> Then I ran "time make -j4 && sudo make modules_install && sudo cp
> arch/x86_64/boot/bzImage /boot/vmlinuz-0 && sudo grub-mkconfig -o
> /boot/grub/grub.cfg", booted and took 3 photos, which were uploaded to [1];
> look for IMG_*

Ok, so what does seem to be consistent for everybody is that
double-fault in the NMI backtrace.

So the fact that the NMI always hits on a double-fault does make me
suspect that it's an infinite stream of double-faults, and that is
presumably also what causes the RCU timeout.

And as I pointed out elsewhere (damn two threads), I think that it
would help to simply catch the *first* double-fault.

And I *think* that the only thing that can make a double-fault
silently be re-tried is the CONFIG_X86_ESPFIX64 case, so if you can
build a failing kernel with the CONFIG_X86_ESPFIX64 case disabled in
arch/x86/kernel/traps.c do_double_fault(), that would be interesting.

So just change the

  #ifdef CONFIG_X86_ESPFIX64

into a

  #if 0

and see if instead of the RCU stall after 20 seconds, you get an
immediate double fault error report instead?

I'm still entirely confused about why that MCORE2 would make _any_
difference what-so-ever, so this is all fishing for random clues in
the dark.

  Linus


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Toralf Förster
On 12/29/2017 10:17 PM, Linus Torvalds wrote:
> On Fri, Dec 29, 2017 at 1:02 PM, Toralf Förster  
> wrote:
>> On 12/29/2017 09:12 PM, Linus Torvalds wrote:
>>> instead, and see if that makes a difference, that would narrow down
>>> the possible root cause of this problem.
>>
>> not at this ThinkPad T440s (didn't test at the server with an i7-3930).
>>
>> Boot stops just at:
>>
>> tsc: Refined TSC clocksource calibration: 2494.225 MHz
>> clocksource: tsc: mask: 0x max_cycles: 
>> 0x23f3ea95b09, max_idle_ns: 440795287034 ns
> 
> Uhhuh. So for Alexander Tsoy, just getting rid of the -march=core2
> fixed the boot.
> 
> But not for you.
> 
> Strange. It really looked like the exact same thing.
> 
>> This is a "Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz" with gcc-6.4
> 
> Yeah, other reporters of this have used gcc-6.4.0 too.
> 
> But there's been some muddying of the waters there too - changing
> compilers have fixed it for some cases, but there's at least one
> report that a kernel build with gcc-7.2.0 still had the issue (and
> another that said it didn't).
> 
> But the MCORE2 was consistent for several people - including you.
> Until this point.
> 
> Strange.
> 
> The only other thing (apart from the compiler flag) that MCORE2
> results in is to enable
> 
>  CONFIG_X86_INTEL_USERCOPY
>  CONFIG_X86_USE_PPRO_CHECKSUM
>  CONFIG_X86_P6_NOP
> 
> and the two first of those shouldn't even matter on x86-64, and I
> don't see that last one making any difference either.
> 
> So because it looks so impossible that the "-march=core2" didn't make
> a difference for you, I'll ask you to please double-check that you
> actually booted into the right kernel.
> 
> Sorry for doubting you, but your report just broke the _one_
> consistent thing we've seen about this bug.
> 
>   Linus
> 


I double-checked it.

The bad news - the issue is not solved by the changed cflags.
The good news - I could eventually compile a working config for my desktop
(works fine with 4.14.10 with generic CPU) having a higher screen resolution
during boot.

So I ran "make distclean", followed by "sudo zcat /proc/config.gz >
.config", changed the .config to use MCORE2 instead of GENERIC and defined the
string "-local" to ensure that the modules directory is really unique.
Then I ran "time make -j4 && sudo make modules_install && sudo cp
arch/x86_64/boot/bzImage /boot/vmlinuz-0 && sudo grub-mkconfig -o
/boot/grub/grub.cfg", booted and took 3 photos, which were uploaded to [1]; look
for IMG_*

[1] https://zwiebeltoralf.de/pub/


-- 
Toralf
PGP C4EACDDE 0076E94E


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Alexander Tsoy
On Fri, 29/12/2017 at 13:39 -0800, Linus Torvalds wrote:
> On Fri, Dec 29, 2017 at 1:17 PM, Linus Torvalds
>  wrote:
> > 
> > Yeah, other reporters of this have used gcc-6.4.0 too.
> > 
> > But there's been some muddying of the waters there too - changing
> > compilers have fixed it for some cases, but there's at least one
> > report that a kernel build with gcc-7.2.0 still had the issue (and
> > another that said it didn't).
> 
> Side note: I'm not convinced that we will reliably catch a compiler
> version change in our dependency analysis, so it's probably best to
> "make clean" between switching compilers to make sure that you don't
> have old object files with the old compiler.

I did "make clean" after changing compiler flags.


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Linus Torvalds
On Fri, Dec 29, 2017 at 1:17 PM, Linus Torvalds
 wrote:
>
> Yeah, other reporters of this have used gcc-6.4.0 too.
>
> But there's been some muddying of the waters there too - changing
> compilers have fixed it for some cases, but there's at least one
> report that a kernel build with gcc-7.2.0 still had the issue (and
> another that said it didn't).

Side note: I'm not convinced that we will reliably catch a compiler
version change in our dependency analysis, so it's probably best to
"make clean" between switching compilers to make sure that you don't
have old object files with the old compiler.

> But the MCORE2 was consistent for several people - including you.
> Until this point.

.. and our build infrastructure definitely _should_ catch compiler
switch changes automatically and force a re-build.

  Linus
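
One way to picture the "catch compiler switches and force a rebuild" idea is a stamp file that records which compiler produced the current object files. The sketch below is an illustration added to this archive and is not how Kbuild works; the stamp file, `compiler_fingerprint` and `needs_clean` are hypothetical names, and the only external command assumed is `gcc --version`.

```python
import subprocess
import tempfile
from pathlib import Path


def compiler_fingerprint(cc: str = "gcc") -> str:
    """Return the first line of `cc --version` as a crude compiler identity."""
    result = subprocess.run([cc, "--version"], capture_output=True,
                            text=True, check=True)
    return result.stdout.splitlines()[0]


def needs_clean(stamp: Path, current: str) -> bool:
    """Compare the recorded fingerprint with the current one and update it.

    Returns True when the stamp is missing or names a different compiler,
    i.e. when old object files cannot be trusted.
    """
    previous = stamp.read_text() if stamp.exists() else None
    stamp.write_text(current)
    return previous != current


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stamp = Path(tmp) / "cc.stamp"
        print(needs_clean(stamp, "gcc 6.4.0"))  # True: nothing recorded yet
        print(needs_clean(stamp, "gcc 6.4.0"))  # False: same compiler
        print(needs_clean(stamp, "gcc 7.2.0"))  # True: compiler changed
```

A version string alone would not catch a distribution changing the compiler's default options, so this only covers the simple "switched gcc versions" case discussed here.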


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Linus Torvalds
On Fri, Dec 29, 2017 at 1:02 PM, Toralf Förster  wrote:
> On 12/29/2017 09:12 PM, Linus Torvalds wrote:
>> instead, and see if that makes a difference, that would narrow down
>> the possible root cause of this problem.
>
> not at this ThinkPad T440s (didn't test at the server with an i7-3930).
>
> Boot stops just at:
>
> tsc: Refined TSC clocksource calibration: 2494.225 MHz
> clocksource: tsc: mask: 0x max_cycles: 0x23f3ea95b09, 
> max_idle_ns: 440795287034 ns

Uhhuh. So for Alexander Tsoy, just getting rid of the -march=core2
fixed the boot.

But not for you.

Strange. It really looked like the exact same thing.

> This is a "Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz" with gcc-6.4

Yeah, other reporters of this have used gcc-6.4.0 too.

But there's been some muddying of the waters there too - changing
compilers have fixed it for some cases, but there's at least one
report that a kernel build with gcc-7.2.0 still had the issue (and
another that said it didn't).

But the MCORE2 was consistent for several people - including you.
Until this point.

Strange.

The only other thing (apart from the compiler flag) that MCORE2
results in is to enable

 CONFIG_X86_INTEL_USERCOPY
 CONFIG_X86_USE_PPRO_CHECKSUM
 CONFIG_X86_P6_NOP

and the two first of those shouldn't even matter on x86-64, and I
don't see that last one making any difference either.

So because it looks so impossible that the "-march=core2" didn't make
a difference for you, I'll ask you to please double-check that you
actually booted into the right kernel.

Sorry for doubting you, but your report just broke the _one_
consistent thing we've seen about this bug.

  Linus


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Toralf Förster
On 12/29/2017 09:12 PM, Linus Torvalds wrote:
> instead, and see if that makes a difference, that would narrow down
> the possible root cause of this problem.

not at this ThinkPad T440s (didn't test at the server with an i7-3930).

Boot stops just at:

tsc: Refined TSC clocksource calibration: 2494.225 MHz
clocksource: tsc: mask: 0x max_cycles: 0x23f3ea95b09, 
max_idle_ns: 440795287034 ns

I changed the Makefile according to your suggestion:

~/devel/linux $ git diff
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 3e73bc255e4e..fb695558821b 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -128,7 +128,7 @@ else
 cflags-$(CONFIG_MPSC) += $(call cc-option,-march=nocona)
 
 cflags-$(CONFIG_MCORE2) += \
-        $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
+        $(call cc-option,-mtune=generic)
 cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom) \
         $(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic))
 cflags-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=generic)

~/devel/linux $ git describe
v4.15-rc5-114-g2758b3e3e630
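
For readers unfamiliar with Kbuild, `$(call cc-option,A,B)` expands to flag A if the compiler accepts it and to B otherwise. The sketch below (an illustration added to this archive) models that fallback with a plain set of "supported" flags. The real macro test-compiles with the flag instead, and the `supported` set here is a hypothetical compiler.

```python
from typing import Iterable, Optional


def cc_option(flag: str, fallback: Optional[str],
              supported: Iterable[str]) -> str:
    """Model of $(call cc-option,flag,fallback): flag if supported, else fallback."""
    if flag in set(supported):
        return flag
    return fallback or ""


supported = {"-march=core2", "-mtune=generic"}  # hypothetical compiler

# Original Makefile line: -march=core2, falling back to -mtune=generic.
before = cc_option("-march=core2",
                   cc_option("-mtune=generic", None, supported),
                   supported)
# Line after the diff above: only -mtune=generic is requested.
after = cc_option("-mtune=generic", None, supported)

print(before)  # -march=core2
print(after)   # -mtune=generic
```

With the change applied, a CONFIG_MCORE2 build gets the same tuning flag as CONFIG_GENERIC_CPU, which is what makes it a useful test of whether `-march=core2` itself matters.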

This is a "Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz" with gcc-6.4

.config attached

-- 
Toralf
PGP C4EACDDE 0076E94E
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 4.15.0-rc5 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION="-0"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_CPU_ISOLATION is not set

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
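
When comparing a working and a failing .config like the one above, it can help to load both into dictionaries and diff them. This is a minimal sketch added to this archive, assuming only the two line forms visible in the attachment (`CONFIG_FOO=value` and `# CONFIG_FOO is not set`); `parse_kconfig` is a hypothetical helper, not a kernel tool.

```python
from typing import Dict

NOT_SET = " is not set"


def parse_kconfig(text: str) -> Dict[str, str]:
    """Map option names to values; '# CONFIG_FOO is not set' becomes 'n'."""
    options: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(NOT_SET):
            # Drop the leading "# " and the trailing " is not set".
            options[line[2:-len(NOT_SET)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


# A few lines taken from the attachment above.
SAMPLE = """\
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
CONFIG_LOCALVERSION="-0"
# CONFIG_RCU_EXPERT is not set
"""

cfg = parse_kconfig(SAMPLE)
print(cfg["CONFIG_KERNEL_GZIP"])    # y
print(cfg["CONFIG_KERNEL_BZIP2"])   # n
print(cfg.get("CONFIG_MCORE2", "absent from this excerpt"))
```

Diffing two such dictionaries (for example the GENERIC and MCORE2 builds) would list exactly which options changed between them.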

Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Ingo Molnar

* Linus Torvalds wrote:

> On Fri, Dec 29, 2017 at 3:14 AM, Toralf Förster wrote:
> >
> > For the server the attached .config works fine but switching from
> > CONFIG_GENERIC_CPU to CONFIG_MCORE2 makes them hang at boot without any
> > messages. Similar picture at the desktop.
> 
> Ok, so there's another thread ("4.14.9 with CONFIG_MCORE2 fails to
> boot") about this same thing, but one thing to try is to see if it's
> just the
> 
>  cflags-$(CONFIG_MCORE2) += \
>  $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
> 
> in arch/x86/Makefile that causes this.
> 
> The MCORE2 option does potentially have a few other effects (see
> arch/x86/Kconfig.cpu), but the first one to check might be just that
> compiler command line effect.
> 
> So if you can edit arch/x86/Makefile, and just make that say
> 
> cflags-$(CONFIG_MCORE2) += $(call cc-option,-mtune=generic)
> 
> instead, and see if that makes a difference, that would narrow down
> the possible root cause of this problem.

Or, if it's more convenient, you can try Linus's suggestion by applying the
patch below.

Thanks,

Ingo

===>

 arch/x86/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 3e73bc255e4e..1835752fffc9 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -127,8 +127,8 @@ else
 cflags-$(CONFIG_MK8) += $(call cc-option,-march=k8)
 cflags-$(CONFIG_MPSC) += $(call cc-option,-march=nocona)
 
-cflags-$(CONFIG_MCORE2) += \
-$(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
+   cflags-$(CONFIG_MCORE2) += $(call cc-option,-mtune=generic)
+
 cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom) \
 $(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic))
 cflags-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=generic)


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Linus Torvalds
On Fri, Dec 29, 2017 at 3:14 AM, Toralf Förster wrote:
>
> For the server the attached .config works fine but switching from
> CONFIG_GENERIC_CPU to CONFIG_MCORE2 makes them hang at boot without any
> messages. Similar picture at the desktop.

Ok, so there's another thread ("4.14.9 with CONFIG_MCORE2 fails to
boot") about this same thing, but one thing to try is to see if it's
just the

 cflags-$(CONFIG_MCORE2) += \
 $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))

in arch/x86/Makefile that causes this.

The MCORE2 option does potentially have a few other effects (see
arch/x86/Kconfig.cpu), but the first one to check might be just that
compiler command line effect.

So if you can edit arch/x86/Makefile, and just make that say

cflags-$(CONFIG_MCORE2) += $(call cc-option,-mtune=generic)

instead, and see if that makes a difference, that would narrow down
the possible root cause of this problem.

Linus
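
For readers unfamiliar with the kbuild cc-option helper discussed above:
$(call cc-option,A,B) expands to A if the compiler accepts flag A, and to B
(empty if omitted) otherwise. The following is only a rough Python sketch of
that fallback logic. The accepts callback and the function names are
hypothetical stand-ins for the real compiler probe; this is not kernel code.

```python
from typing import Callable


def cc_option(accepts: Callable[[str], bool], flag: str, fallback: str = "") -> str:
    """Return `flag` if the (hypothetical) compiler probe accepts it, else `fallback`."""
    return flag if accepts(flag) else fallback


def mcore2_cflags_stock(accepts: Callable[[str], bool]) -> str:
    # Mirrors: $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
    return cc_option(accepts, "-march=core2", cc_option(accepts, "-mtune=generic"))


def mcore2_cflags_suggested(accepts: Callable[[str], bool]) -> str:
    # Mirrors the suggested test edit: $(call cc-option,-mtune=generic)
    return cc_option(accepts, "-mtune=generic")


if __name__ == "__main__":
    # Assumption: a compiler that accepts every flag it is asked about.
    accepts_all = lambda flag: True
    print("stock:    ", mcore2_cflags_stock(accepts_all))
    print("suggested:", mcore2_cflags_suggested(accepts_all))
```

With a compiler that accepts both flags, the stock line selects -march=core2,
which allows Core 2 instructions in the generated code, while the suggested
line selects only -mtune=generic, which affects tuning but not the instruction
set. A boot difference between the two builds would therefore point at the
generated code rather than at the other effects of CONFIG_MCORE2.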


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Toralf Förster
On 12/29/2017 04:48 PM, Alexander Tsoy wrote:
> On Fri, 29/12/2017 at 12:14 +0100, Toralf Förster wrote:
>> I can confirm now that that kernel breaks both a desktop (a
>> ThinkPad T440s i5) and a headless server (i3930) setup. For the
>> server the attached .config works fine but switching from
>> CONFIG_GENERIC_CPU to CONFIG_MCORE2 makes them hang at boot without any
>> messages. Similar picture at the desktop.
> 
> You most likely have the same problem as me:
> https://lkml.org/lkml/2017/12/29/279
> 

Indeed, I got a similar message on my ThinkPad too when I tried to bisect it:

>[   21.776011] INFO: rcu_preempt detected stalls on CPUs/tasks:
>[   21.777008]  0-...!: (0 ticks this GP) idle=c56/140/0 softirq=73/73 fqs=0
>[   21.777008]  (detected by 1, t=21002 jiffies, g=-255, c=-256, q=4)


-- 
Toralf
PGP C4EACDDE 0076E94E


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Alexander Tsoy
On Fri, 29/12/2017 at 12:14 +0100, Toralf Förster wrote:
> I can confirm now that that kernel breaks both a desktop (a
> ThinkPad T440s i5) and a headless server (i3930) setup. For the
> server the attached .config works fine but switching from
> CONFIG_GENERIC_CPU to CONFIG_MCORE2 makes them hang at boot without any
> messages. Similar picture at the desktop.

You most likely have the same problem as me:
https://lkml.org/lkml/2017/12/29/279


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Andy Shevchenko
On Fri, Dec 29, 2017 at 3:38 PM, Toralf Förster wrote:
> On 12/29/2017 02:33 PM, Sebastian Gottschall wrote:
>> bootlog?
>>
> nothing in any logs, hang happens very early in the boot process

Does it have serial?

Does it use EFI?

You may try earlyprintk, either for the EFI case or over a legacy UART.
PCI UARTs were supported too, though I never really used that.


-- 
With Best Regards,
Andy Shevchenko


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Toralf Förster
On 12/29/2017 02:33 PM, Sebastian Gottschall wrote:
> bootlog?
> 
nothing in any logs, hang happens very early in the boot process


-- 
Toralf
PGP C4EACDDE 0076E94E


Re: 4.14.9 doesn't boot (regression)

2017-12-29 Thread Sebastian Gottschall

bootlog?

On 29.12.2017 at 12:14, Toralf Förster wrote:

> I can confirm now that that kernel breaks both a desktop (a ThinkPad T440s
> i5) and a headless server (i3930) setup. For the server the attached .config
> works fine but switching from CONFIG_GENERIC_CPU to CONFIG_MCORE2 makes them
> hang at boot without any messages. Similar picture at the desktop.
> Both are stable Gentoo Linux hardened systems.
>
> This issue seems to exist in mainline too, probably visible with d120cd749
> (stable) and 9aaefe7b59 (upstream).



--
Kind regards

Sebastian Gottschall / CTO

NewMedia-NET GmbH - DD-WRT
Registered office: Stubenwaldallee 21a, 64625 Bensheim
Court of registration: Amtsgericht Darmstadt, HRB 25473
Managing directors: Peter Steinhäuser, Christian Scheele
http://www.dd-wrt.com
email: s.gottsch...@dd-wrt.com
Tel.: +496251-582650 / Fax: +496251-5826565


