Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-29 Thread Sergey Senozhatsky
On (12/28/18 16:03), Daniel Wang wrote:
> Thanks. I was able to confirm that commit c7c3f05e341a9a2bd alone
> fixed the problem for me. As expected, all 16 CPUs' stacktrace was
> printed, before a final panic stack dump and a successful reboot.

Cool, thanks!

-ss


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-28 Thread Daniel Wang
Thanks. I was able to confirm that commit c7c3f05e341a9a2bd alone
fixed the problem for me. As expected, all 16 CPUs' stacktrace was
printed, before a final panic stack dump and a successful reboot.

[   24.035044] Hogging a CPU now
[   48.200258] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [lockme:1102]
[   48.207371] Modules linked in: lockme(O) ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 xt_addrtype nf_nat
br_netfilter ip6table_filter ip6_tables aesni_intel aes_x86_64
crypto_simd cryptd glue_helper
[   48.226613] CPU: 3 PID: 1102 Comm: lockme Tainted: G   O
4.14.79 #33
[   48.234057] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[   48.243388] task: a3da1bd7 task.stack: c04e077e
[   48.249425] RIP: 0010:hog_thread+0x13/0x1000 [lockme]
[   48.255197] RSP: 0018:c04e077e3f10 EFLAGS: 0282 ORIG_RAX:
ff10
[   48.262879] RAX: 0011 RBX: a3da362ffa80 RCX: 
[   48.270131] RDX: a3da432dd740 RSI: a3da432d54f8 RDI: a3da432d54f8
[   48.277382] RBP: c04e077e3f48 R08: 0030 R09: 
[   48.284629] R10: 0358 R11:  R12: a3da33f7c940
[   48.291881] R13: c04e079b7c58 R14:  R15: a3da362ffac8
[   48.299134] FS:  () GS:a3da432c()
knlGS:
[   48.307338] CS:  0010 DS:  ES:  CR0: 80050033
[   48.313200] CR2: 7f0142c77e5d CR3: 000b10e12002 CR4: 003606a0
[   48.320455] DR0:  DR1:  DR2: 
[   48.327705] DR3:  DR6: fffe0ff0 DR7: 0400
[   48.334955] Call Trace:
[   48.337534]  kthread+0x127/0x160
[   48.340878]  ? 0xc04bc000
[   48.344315]  ? kthread_create_on_node+0x40/0x40
[   48.348962]  ret_from_fork+0x35/0x40
[   48.352655] Code:  fe 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00
[   48.360712] Sending NMI from CPU 3 to CPUs 0-2,4-15:
[   48.365891] NMI backtrace for cpu 5
[   48.365892] CPU: 5 PID: 963 Comm: dd Tainted: G   O4.14.79 #33
[   48.365892] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[   48.365893] task: a3da2e769c80 task.stack: c04e072dc000
[   48.365894] RIP: 0010:chacha20_block+0x203/0x350
[   48.365894] RSP: 0018:c04e072dfd08 EFLAGS: 0086
[   48.365895] RAX: 430aa37c RBX: 8849a559 RCX: 1e380d02
[   48.365896] RDX: f37255aa RSI: 430aa37c RDI: d39ce109
[   48.365896] RBP: 242dad92 R08: 942a2b36 R09: 6df44375
[   48.365897] R10: 7f47d158 R11: 80fde9af R12: 92e47c5e
[   48.365897] R13: ed09aada R14: c6fd956d R15: 1bb4deeb
[   48.365898] FS:  7f074a4a6700() GS:a3da4334()
knlGS:
[   48.365899] CS:  0010 DS:  ES:  CR0: 80050033
[   48.365899] CR2: 55a35c5b0520 CR3: 000edc93 CR4: 003606a0
[   48.365900] DR0:  DR1:  DR2: 
[   48.365900] DR3:  DR6: fffe0ff0 DR7: 0400
[   48.365900] Call Trace:
[   48.365901]  _extract_crng+0xdb/0x130
[   48.365901]  crng_backtrack_protect+0xb3/0xc0
[   48.365902]  urandom_read+0x13b/0x2c0
[   48.365902]  vfs_read+0xad/0x170
[   48.365903]  SyS_read+0x4b/0xa0
[   48.365903]  ? __audit_syscall_exit+0x21e/0x2c0
[   48.365904]  do_syscall_64+0x70/0x200
[   48.365904]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[   48.365904] RIP: 0033:0x7f0749e7c410
[   48.365905] RSP: 002b:7ffd69532b18 EFLAGS: 0246 ORIG_RAX:

[   48.365906] RAX: ffda RBX: 00024ab3 RCX: 7f0749e7c410
[   48.365906] RDX: 0400 RSI: 55e440267000 RDI: 
[   48.365907] RBP: 7ffd69532b40 R08:  R09: 000d
[   48.365907] R10: f000 R11: 0246 R12: 
[   48.365908] R13: 7f074a4a6690 R14: 0400 R15: 55e440267000
[   48.365908] Code: c0 10 31 f0 01 d3 89 74 24 08 41 89 c7 8b 44 24
0c 41 31 dc 41 c1 c4 0c 46 8d 0c 1f 45 89 eb 41 c1 c7 0c 44 01 c0 45
31 cb 89 c6 <89> e8 41 c1 c3 08 89 74 24 0c 31 f0 41 8d 34 0c 8b 4c 24
10 c1
[   48.365921] NMI backtrace for cpu 13
[   48.365923] CPU: 13 PID: 967 Comm: dd Tainted: G   O4.14.79 #33
[   48.365924] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[   48.365924] task: a3da1c8b task.stack: c04e07798000
[   48.365925] RIP: 0010:native_queued_spin_lock_slowpath+0xce/0x1b0
[   48.365925] RSP: 0018:c04e0779bda8 EFLAGS: 0002
[   48.365926] RAX: 0001 RBX: adecd3c8 RCX: 
[   48.365926] RDX: 0001 RSI: 0001 RDI: adecd3c8
[   48.365927] RBP: 

Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-28 Thread Greg KH
On Fri, Dec 28, 2018 at 09:16:51AM +0900, Sergey Senozhatsky wrote:
> On (12/12/18 17:10), Sergey Senozhatsky wrote:
> > And there will be another -stable backport request in a week or so.
> 
> The remaining one:
> 
> commit c7c3f05e341a9a2bd

Now queued up, thanks.

greg k-h


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-27 Thread Sergey Senozhatsky
On (12/12/18 17:10), Sergey Senozhatsky wrote:
> And there will be another -stable backport request in a week or so.

The remaining one:

commit c7c3f05e341a9a2bd

-ss


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-13 Thread Sasha Levin

On Thu, Dec 13, 2018 at 10:59:31AM +0100, Petr Mladek wrote:

On Wed 2018-12-12 18:39:42, Daniel Wang wrote:

> Additionally, for dbdda842fe96f to work as expected we really
need fd5f7cde1b85d4c. Otherwise printk() can schedule under
console_sem and console_owner, which will deactivate the "load
balance" logic.

It looks like fd5f7cde1b85d4c got into 4.14.82 that was released last month.

On Wed, Dec 12, 2018 at 6:27 PM Sergey Senozhatsky
 wrote:
>
> On (12/12/18 16:40), Daniel Wang wrote:
> > In case this was buried in previous messages, the commit I'd like to
> > get backported to 4.14 is dbdda842fe96f: printk: Add console owner and
> > waiter logic to load balance console writes. But another followup
> > patch that fixes a bug in that patch is also required. That is
> > c14376de3a1b: printk: Wake klogd when passing console_lock owner.
>
> Additionally, for dbdda842fe96f to work as expected we really
> need fd5f7cde1b85d4c. Otherwise printk() can schedule under
> console_sem and console_owner, which will deactivate the "load
> balance" logic.


To make it clear. Please, make sure that the following commits are
backported together:

+ dbdda842fe96f8932ba ("printk: Add console owner and waiter
logic to load balance console writes")
+ c162d5b4338d72deed6 ("printk: Hide console waiter logic into
helpers")
+ fd5f7cde1b85d4c8e09 ("printk: Never set console_may_schedule
in console_trylock()")
+ c14376de3a1befa70d9 ("printk: Wake klogd when passing
console_lock owner")


I generated this list from git log using "Fixes:" tag. It seems
to mention all commits discussed above.


All 4 queued for 4.14, thank you.

--
Thanks,
Sasha


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-13 Thread Petr Mladek
On Wed 2018-12-12 18:39:42, Daniel Wang wrote:
> > Additionally, for dbdda842fe96f to work as expected we really
> need fd5f7cde1b85d4c. Otherwise printk() can schedule under
> console_sem and console_owner, which will deactivate the "load
> balance" logic.
> 
> It looks like fd5f7cde1b85d4c got into 4.14.82 that was released last month.
> 
> On Wed, Dec 12, 2018 at 6:27 PM Sergey Senozhatsky
>  wrote:
> >
> > On (12/12/18 16:40), Daniel Wang wrote:
> > > In case this was buried in previous messages, the commit I'd like to
> > > get backported to 4.14 is dbdda842fe96f: printk: Add console owner and
> > > waiter logic to load balance console writes. But another followup
> > > patch that fixes a bug in that patch is also required. That is
> > > c14376de3a1b: printk: Wake klogd when passing console_lock owner.
> >
> > Additionally, for dbdda842fe96f to work as expected we really
> > need fd5f7cde1b85d4c. Otherwise printk() can schedule under
> > console_sem and console_owner, which will deactivate the "load
> > balance" logic.

To make it clear. Please, make sure that the following commits are
backported together:

+ dbdda842fe96f8932ba ("printk: Add console owner and waiter
logic to load balance console writes")
+ c162d5b4338d72deed6 ("printk: Hide console waiter logic into
helpers")
+ fd5f7cde1b85d4c8e09 ("printk: Never set console_may_schedule
in console_trylock()")
+ c14376de3a1befa70d9 ("printk: Wake klogd when passing
console_lock owner")


I generated this list from git log using "Fixes:" tag. It seems
to mention all commits discussed above.
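
For reference, a git query along these lines lists such follow-up commits
(the revision range below is only illustrative):

    git log --oneline --grep='Fixes: dbdda842fe96f' v4.15..v4.17 -- kernel/printk/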

Best Regards,
Petr


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-12 Thread Daniel Wang
> Additionally, for dbdda842fe96f to work as expected we really
need fd5f7cde1b85d4c. Otherwise printk() can schedule under
console_sem and console_owner, which will deactivate the "load
balance" logic.

It looks like fd5f7cde1b85d4c got into 4.14.82 that was released last month.

On Wed, Dec 12, 2018 at 6:27 PM Sergey Senozhatsky
 wrote:
>
> On (12/12/18 16:40), Daniel Wang wrote:
> > In case this was buried in previous messages, the commit I'd like to
> > get backported to 4.14 is dbdda842fe96f: printk: Add console owner and
> > waiter logic to load balance console writes. But another followup
> > patch that fixes a bug in that patch is also required. That is
> > c14376de3a1b: printk: Wake klogd when passing console_lock owner.
>
> Additionally, for dbdda842fe96f to work as expected we really
> need fd5f7cde1b85d4c. Otherwise printk() can schedule under
> console_sem and console_owner, which will deactivate the "load
> balance" logic.
>
> -ss



-- 
Best,
Daniel




Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-12 Thread Sergey Senozhatsky
On (12/12/18 16:40), Daniel Wang wrote:
> In case this was buried in previous messages, the commit I'd like to
> get backported to 4.14 is dbdda842fe96f: printk: Add console owner and
> waiter logic to load balance console writes. But another followup
> patch that fixes a bug in that patch is also required. That is
> c14376de3a1b: printk: Wake klogd when passing console_lock owner.

Additionally, for dbdda842fe96f to work as expected we really
need fd5f7cde1b85d4c. Otherwise printk() can schedule under
console_sem and console_owner, which will deactivate the "load
balance" logic.

-ss


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-12 Thread Sergey Senozhatsky
On (12/12/18 16:52), Sasha Levin wrote:
> On Wed, Dec 12, 2018 at 01:49:25PM -0800, Daniel Wang wrote:
> > Thanks for the clarification. So I guess I don't need to start another
> > thread for it? What are the next steps?
> 
> Nothing here, I'll queue it once Sergey or Petr clarify if they wanted
> additional information in the -stable commit message.

Hi Sasha,

No, no commit message change requests from my side.

-ss


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-12 Thread Daniel Wang
In case this was buried in previous messages, the commit I'd like to
get backported to 4.14 is dbdda842fe96f: printk: Add console owner and
waiter logic to load balance console writes. But another followup
patch that fixes a bug in that patch is also required. That is
c14376de3a1b: printk: Wake klogd when passing console_lock owner.
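
For anyone skimming the thread, the hand-off idea behind dbdda842fe96f can be
sketched in user space roughly like this (a simplified stand-alone illustration
with made-up names - owner, waiter, pending - not the kernel code):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static int  owner;          /* id of the thread currently printing, 0 = none       */
static bool waiter;         /* another thread is spinning, waiting for a hand-off  */
static int  pending = 12;   /* fake backlog of buffered log records                */

static void flush_records(int id)
{
        for (;;) {
                pthread_mutex_lock(&state_lock);
                if (pending == 0) {
                        owner = 0;      /* backlog drained, drop ownership */
                        pthread_mutex_unlock(&state_lock);
                        return;
                }
                pending--;
                pthread_mutex_unlock(&state_lock);

                printf("thread %d prints one record\n", id);  /* "call console drivers" */

                pthread_mutex_lock(&state_lock);
                if (waiter) {           /* a spinning caller showed up: */
                        waiter = false; /* hand the printing duty over  */
                        owner = 0;
                        pthread_mutex_unlock(&state_lock);
                        return;
                }
                pthread_mutex_unlock(&state_lock);
        }
}

static void *printer(void *arg)
{
        int id = (int)(intptr_t)arg;

        pthread_mutex_lock(&state_lock);
        if (owner == 0) {               /* nobody printing: become the owner */
                owner = id;
                pthread_mutex_unlock(&state_lock);
                flush_records(id);
                return NULL;
        }
        waiter = true;                  /* register as the waiter...         */
        pthread_mutex_unlock(&state_lock);

        for (;;) {                      /* ...and busy-spin for the hand-off */
                pthread_mutex_lock(&state_lock);
                if (owner == 0) {
                        owner = id;
                        pthread_mutex_unlock(&state_lock);
                        flush_records(id);  /* keep flushing where the old owner stopped */
                        return NULL;
                }
                pthread_mutex_unlock(&state_lock);
        }
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, printer, (void *)(intptr_t)1);
        pthread_create(&b, NULL, printer, (void *)(intptr_t)2);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}

The real patch does this with console_owner/console_waiter state and a spinning
printk() caller taking over console_lock, so no single CPU is stuck flushing
everybody's messages forever.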

On Wed, Dec 12, 2018 at 1:56 PM Daniel Wang  wrote:
>
> Thank you!
>
> On Wed, Dec 12, 2018 at 1:52 PM Sasha Levin  wrote:
> >
> > On Wed, Dec 12, 2018 at 01:49:25PM -0800, Daniel Wang wrote:
> > >Thanks for the clarification. So I guess I don't need to start another
> > >thread for it? What are the next steps?
> >
> > Nothing here, I'll queue it once Sergey or Petr clarify if they wanted
> > additional information in the -stable commit message.
> >
> > --
> > Thanks,
> > Sasha
> >
>
>
> --
> Best,
> Daniel



-- 
Best,
Daniel




Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-12 Thread Daniel Wang
Thank you!

On Wed, Dec 12, 2018 at 1:52 PM Sasha Levin  wrote:
>
> On Wed, Dec 12, 2018 at 01:49:25PM -0800, Daniel Wang wrote:
> >Thanks for the clarification. So I guess I don't need to start another
> >thread for it? What are the next steps?
>
> Nothing here, I'll queue it once Sergey or Petr clarify if they wanted
> additional information in the -stable commit message.
>
> --
> Thanks,
> Sasha
>


-- 
Best,
Daniel




Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-12 Thread Sasha Levin

On Wed, Dec 12, 2018 at 01:49:25PM -0800, Daniel Wang wrote:

Thanks for the clarification. So I guess I don't need to start another
thread for it? What are the next steps?


Nothing here, I'll queue it once Sergey or Petr clarify if they wanted
additional information in the -stable commit message.

--
Thanks,
Sasha



Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-12 Thread Daniel Wang
Thanks for the clarification. So I guess I don't need to start another
thread for it? What are the next steps?

On Wed, Dec 12, 2018 at 1:43 PM Sasha Levin  wrote:
>
> On Wed, Dec 12, 2018 at 12:11:29PM -0800, Daniel Wang wrote:
> >On Wed, Dec 12, 2018 at 9:43 AM Sasha Levin  wrote:
> >>
> >> On Wed, Dec 12, 2018 at 10:59:39PM +0900, Sergey Senozhatsky wrote:
> >> >On (12/12/18 14:36), Petr Mladek wrote:
> >> >> > OK, really didn't know that! I wasn't Cc-ed on that AUTOSEL email,
> >> >> > and I wasn't Cc-ed on this whole discussion and found it purely
> >> >> > accidentally while browsing linux-mm list.
> >> >>
> >> >> I am sorry that I did not CC you. There were so many people in CC.
> >> >> I expected that all people mentioned in the related commit message
> >> >> were included by default.
> >> >
> >> >No worries! I'm not blaming anyone.
> >> >
> >> >> > So if you are willing to backport this set to -stable, then I wouldn't
> >> >> > mind, probably would be more correct if we don't advertise this as a
> >> >> > "panic() deadlock fix"
> >> >>
> >> >> This should not be a problem. I guess that stable does not modify
> >> >> the original commit messages unless there is a change.
> >> >
> >> >Agreed.
> >>
> >> I'll be happy to add anything you want to the commit message. Do you
> >> have a blurb you want to use?
> >
> >If we still get to amend the commit message, I'd like to add "Cc:
> >sta...@vger.kernel.org" in the sign-off area. According to
> >https://www.kernel.org/doc/html/v4.12/process/stable-kernel-rules.html#option-1
> >patches with that tag will be automatically applied to -stable trees.
> >It's unclear though if it'll get applied to ALL -stable trees. For my
> >request, I care at least about 4.19 and 4.14. So maybe we can add two
> >lines, "Cc:  # 4.14.x" and "Cc:
> > # 4.19.x".
>
> We can't change the original commit message (but that's fine, the
> purpose of that tag is to let us know that this commit should go in
> stable - and now we do :) ).
>
> I was under the impression that Sergey or Petr wanted to add more
> information about the purpose of this patch and the issue it solves.
>
> --
> Thanks,
> Sasha



-- 
Best,
Daniel




Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-12 Thread Sasha Levin

On Wed, Dec 12, 2018 at 12:11:29PM -0800, Daniel Wang wrote:

On Wed, Dec 12, 2018 at 9:43 AM Sasha Levin  wrote:


On Wed, Dec 12, 2018 at 10:59:39PM +0900, Sergey Senozhatsky wrote:
>On (12/12/18 14:36), Petr Mladek wrote:
>> > OK, really didn't know that! I wasn't Cc-ed on that AUTOSEL email,
>> > and I wasn't Cc-ed on this whole discussion and found it purely
>> > accidentally while browsing linux-mm list.
>>
>> I am sorry that I did not CC you. There were so many people in CC.
>> I expected that all people mentioned in the related commit message
>> were included by default.
>
>No worries! I'm not blaming anyone.
>
>> > So if you are willing to backport this set to -stable, then I wouldn't
>> > mind, probably would be more correct if we don't advertise this as a
>> > "panic() deadlock fix"
>>
>> This should not be a problem. I guess that stable does not modify
>> the original commit messages unless there is a change.
>
>Agreed.

I'll be happy to add anything you want to the commit message. Do you
have a blurb you want to use?


If we still get to amend the commit message, I'd like to add "Cc:
sta...@vger.kernel.org" in the sign-off area. According to
https://www.kernel.org/doc/html/v4.12/process/stable-kernel-rules.html#option-1
patches with that tag will be automatically applied to -stable trees.
It's unclear though if it'll get applied to ALL -stable trees. For my
request, I care at least about 4.19 and 4.14. So maybe we can add two
lines, "Cc:  # 4.14.x" and "Cc:
 # 4.19.x".


We can't change the original commit message (but that's fine, the
purpose of that tag is to let us know that this commit should go in
stable - and now we do :) ).

I was under the impression that Sergey or Petr wanted to add more
information about the purpose of this patch and the issue it solves.

--
Thanks,
Sasha


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-12 Thread Daniel Wang
On Wed, Dec 12, 2018 at 9:43 AM Sasha Levin  wrote:
>
> On Wed, Dec 12, 2018 at 10:59:39PM +0900, Sergey Senozhatsky wrote:
> >On (12/12/18 14:36), Petr Mladek wrote:
> >> > OK, really didn't know that! I wasn't Cc-ed on that AUTOSEL email,
> >> > and I wasn't Cc-ed on this whole discussion and found it purely
> >> > accidentally while browsing linux-mm list.
> >>
> >> I am sorry that I did not CC you. There were so many people in CC.
> >> I expected that all people mentioned in the related commit message
> >> were included by default.
> >
> >No worries! I'm not blaming anyone.
> >
> >> > So if you are willing to backport this set to -stable, then I wouldn't
> >> > mind, probably would be more correct if we don't advertise this as a
> >> > "panic() deadlock fix"
> >>
> >> This should not be a problem. I guess that stable does not modify
> >> the original commit messages unless there is a change.
> >
> >Agreed.
>
> I'll be happy to add anything you want to the commit message. Do you
> have a blurb you want to use?

If we still get to amend the commit message, I'd like to add "Cc:
sta...@vger.kernel.org" in the sign-off area. According to
https://www.kernel.org/doc/html/v4.12/process/stable-kernel-rules.html#option-1
patches with that tag will be automatically applied to -stable trees.
It's unclear though if it'll get applied to ALL -stable trees. For my
request, I care at least about 4.19 and 4.14. So maybe we can add two
lines, "Cc:  # 4.14.x" and "Cc:
 # 4.19.x".
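
For reference, the footer format that document describes looks like the
following (the developer name and address here are made up for illustration):

    Signed-off-by: Some Developer <developer@example.com>
    Cc: stable@vger.kernel.org # 4.14.x
    Cc: stable@vger.kernel.org # 4.19.x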

>
> --
> Thanks,
> Sasha



-- 
Best,
Daniel




Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-12 Thread Sasha Levin

On Wed, Dec 12, 2018 at 10:59:39PM +0900, Sergey Senozhatsky wrote:

On (12/12/18 14:36), Petr Mladek wrote:

> OK, really didn't know that! I wasn't Cc-ed on that AUTOSEL email,
> and I wasn't Cc-ed on this whole discussion and found it purely
> accidentally while browsing linux-mm list.

I am sorry that I did not CC you. There were so many people in CC.
I expected that all people mentioned in the related commit message
were included by default.


No worries! I'm not blaming anyone.


> So if you are willing to backport this set to -stable, then I wouldn't
> mind, probably would be more correct if we don't advertise this as a
> "panic() deadlock fix"

This should not be a problem. I guess that stable does not modify
the original commit messages unless there is a change.


Agreed.


I'll be happy to add anything you want to the commit message. Do you
have a blurb you want to use?

--
Thanks,
Sasha


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-12 Thread Sergey Senozhatsky
On (12/12/18 14:36), Petr Mladek wrote:
> > OK, really didn't know that! I wasn't Cc-ed on that AUTOSEL email,
> > and I wasn't Cc-ed on this whole discussion and found it purely
> > accidentally while browsing linux-mm list.
> 
> I am sorry that I did not CC you. There were so many people in CC.
> I expected that all people mentioned in the related commit message
> were included by default.

No worries! I'm not blaming anyone.

> > So if you are willing to backport this set to -stable, then I wouldn't
> > mind, probably would be more correct if we don't advertise this as a
> > "panic() deadlock fix"
> 
> This should not be a problem. I guess that stable does not modify
> the original commit messages unless there is a change.

Agreed.

> > In the meantime, I can add my Acked-by to this backport if it helps.
> 
> I am fine with back-porting the patches now. They have got much more
> testing in the meantime and nobody reported any regression. They
> seem to help in more situations than we expected. Finally, there is
> someone requesting the backport who spent non-trivial time
> on tracking the problem and testing.

Great!


Sasha, here is
Acked-by: Sergey Senozhatsky 
from me.

And expect another backport request in 1 or 2 weeks - the patch
which eliminates the existing "panic CPU != uart_port lock owner CPU"
limitation.

-ss


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-12 Thread Petr Mladek
On Wed 2018-12-12 17:10:34, Sergey Senozhatsky wrote:
> On (12/12/18 01:48), Sasha Levin wrote:
> > > > > I guess we still don't have a really clear understanding of what 
> > > > > exactly
> > > > is going in your system
> > > > 
> > > > I would also like to get to the bottom of it. Unfortunately I haven't
> > > > got the expertise in this area nor the time to do it yet. Hence the
> > > > intent to take a step back and backport Steven's patch to fix the
> > > > issue that has resurfaced in our production recently.
> > > 
> > > No problem.
> > > I just meant that -stable people can be a bit "unconvinced".
> > 
> > The -stable people tried adding this patch back in April, but ended up
> > getting complaints up the wazoo (https://lkml.org/lkml/2018/4/9/154)
> > about how this is not -stable material.
> 
> OK, really didn't know that! I wasn't Cc-ed on that AUTOSEL email,
> and I wasn't Cc-ed on this whole discussion and found it purely
> accidentally while browsing linux-mm list.

I am sorry that I did not CC you. There were so many people in CC.
I expected that all people mentioned in the related commit message
were included by default.


> So if you are willing to backport this set to -stable, then I wouldn't
> mind, probably would be more correct if we don't advertise this as a
> "panic() deadlock fix"

This should not be a problem. I guess that stable does not modify
the original commit messages unless there is a change.


> In the meantime, I can add my Acked-by to this backport if it helps.

I am fine with back-porting the patches now. They have got much more
testing in the meantime and nobody reported any regression. They
seem to help in more situations than we expected. Finally, there is
someone requesting the backport who spent non-trivial time
on tracking the problem and testing.

Best Regards,
Petr


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-12 Thread Sergey Senozhatsky
On (12/12/18 01:48), Sasha Levin wrote:
> > > > I guess we still don't have a really clear understanding of what exactly
> > > is going on in your system
> > > 
> > > I would also like to get to the bottom of it. Unfortunately I haven't
> > > got the expertise in this area nor the time to do it yet. Hence the
> > > intent to take a step back and backport Steven's patch to fix the
> > > issue that has resurfaced in our production recently.
> > 
> > No problem.
> > I just meant that -stable people can be a bit "unconvinced".
> 
> The -stable people tried adding this patch back in April, but ended up
> getting complaints up the wazoo (https://lkml.org/lkml/2018/4/9/154)
> about how this is not -stable material.

OK, really didn't know that! I wasn't Cc-ed on that AUTOSEL email,
and I wasn't Cc-ed on this whole discussion and found it purely
accidentally while browsing linux-mm list.

I understand what Petr meant by his email. Not arguing; below are just
my 5 cents.

> So yes, testing/acks welcome :)

OK. The way I see it (and I can be utterly wrong here):

The patch set in question, most likely and probably (*and those are
theories*), makes panic() deadlock less likely because panic_cpu waits
for console_sem owner to release uart_port/console_owner locks before
panic_cpu pr_emerg("Kernel panic - not syncing"), dump_stack()-s and
brings other CPUs down via stop IPI or NMI.
So a precondition is
panic CPU != uart_port->lock owner CPU

If the panic happens on the same CPU which holds the uart_port spin_lock,
then the deadlock is still there, just like before; we have another patch
which attempts to fix this (it makes console drivers re-entrant from
panic()).
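
Spelled out in the same style as the call chains elsewhere in the thread, the
remaining case is (an illustration, not a captured trace):

CPU0 (the CPU doing the printing)
serial8250_console_write()
 spin_lock_irqsave(&port->lock, flags)
  <NMI>
   panic()
    console_flush_on_panic()
     console_unlock()
      serial8250_console_write()
       spin_lock_irqsave(&port->lock, flags)   // deadlock on the same CPU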

So if you are willing to backport this set to -stable, then I wouldn't
mind, probably would be more correct if we don't advertise this as a
"panic() deadlock fix" tho; we know that deadlock is still possible.
And there will be another -stable backport request in a week or so.


In the meantime, I can add my Acked-by to this backport if it helps.

/* Assuming that my theories explain what's happening with
   Daniel's systems. */

-ss


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-11 Thread Sasha Levin

On Wed, Dec 12, 2018 at 03:28:41PM +0900, Sergey Senozhatsky wrote:

On (12/11/18 22:08), Daniel Wang wrote:


I've been meaning to try it but kept getting distracted by other
things. I'll try to find some time for it this week or next. Right now
my intent is to get Steven's patch into 4.14 stable as it evidently
fixed the particular issue I was seeing, and as Steven said it has
been in upstream since 4.16 so it's not like backporting it will raise
any red flags. I will start another thread on -stable for it.


OK.


> I guess we still don't have a really clear understanding of what exactly
is going on in your system

I would also like to get to the bottom of it. Unfortunately I haven't
got the expertise in this area nor the time to do it yet. Hence the
intent to take a step back and backport Steven's patch to fix the
issue that has resurfaced in our production recently.


No problem.
I just meant that -stable people can be a bit "unconvinced".


The -stable people tried adding this patch back in April, but ended up
getting complaints up the wazoo (https://lkml.org/lkml/2018/4/9/154)
about how this is not -stable material.

So yes, testing/acks welcome :)

--
Thanks,
Sasha


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-11 Thread Sergey Senozhatsky
On (12/11/18 22:08), Daniel Wang wrote:
>
> I've been meaning to try it but kept getting distracted by other
> things. I'll try to find some time for it this week or next. Right now
> my intent is to get Steven's patch into 4.14 stable as it evidently
> fixed the particular issue I was seeing, and as Steven said it has
> been in upstream since 4.16 so it's not like backporting it will raise
> any red flags. I will start another thread on -stable for it.

OK.

> > I guess we still don't have a really clear understanding of what exactly
> is going on in your system
> 
> I would also like to get to the bottom of it. Unfortunately I haven't
> got the expertise in this area nor the time to do it yet. Hence the
> intent to take a step back and backport Steven's patch to fix the
> issue that has resurfaced in our production recently.

No problem.
I just meant that -stable people can be a bit "unconvinced".

> Which two sets are you referring to specifically?

I guess I used the wrong word:

The first set (actually just one patch) is the one that makes consoles
re-entrant from panic().
The other set - those 4 patches (Steven's patch, + Petr's patch + a
patch that makes printk() atomic again).

-ss


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-11 Thread Daniel Wang
> So... did my patch address the deadlock you are seeing or it didn't?

I've been meaning to try it but kept getting distracted by other
things. I'll try to find some time for it this week or next. Right now
my intent is to get Steven's patch into 4.14 stable as it evidently
fixed the particular issue I was seeing, and as Steven said it has
been in upstream since 4.16 so it's not like backporting it will raise
any red flags. I will start another thread on -stable for it.

> I guess we still don't have a really clear understanding of what exactly
is going on in your system

I would also like to get to the bottom of it. Unfortunately I haven't
got the expertise in this area nor the time to do it yet. Hence the
intent to take a step back and backport Steven's patch to fix the
issue that has resurfaced in our production recently.

> If it's uart_port->lock and there will be 2 patch sets to choose from
for -stable, then -stable guys can pick up the one that requires less
effort: 1 two-liner patch vs. 3 or 4 bigger patches.

Which two sets are you referring to specifically?

On Tue, Dec 11, 2018 at 9:21 PM Sergey Senozhatsky
 wrote:
>
> On (12/11/18 17:16), Daniel Wang wrote:
> > > Let's first figure out if it works.
> >
> > I would still like to try applying your patches that went into
> > printk.git, but for now I wonder if we can get Steven's patch into
> > 4.14 first, for at least we know it mitigated the issue if not
> > fundamentally addressed it, and we've agreed it's an innocuous change
> > that doesn't risk breaking stable.
>
> So... did my patch address the deadlock you are seeing or it didn't?
>
> > I haven't done this before so I'll need your help. What's the next
> > step to actually get Steven's patch *in* linux-4.14.y? According to
> > https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > I am supposed to send an email with the patch ID and subject, which
> > are both mentioned in this email. Should I send another one? What's
> > the process like? Thanks!
>
> I'm not doing any -stable releases, so can't really answer, sorry.
> Probably would be better to re-address this question to 4.14 -stable
> maintainers.
>
>
> ---
> I guess we still don't have a really clear understanding of what exactly
> is going on in your system. We don't even know for sure which one of the locks
> is deadlocking the system. And why exactly Steven's patch helps. If it
> is uart_port->lock, then it's one thing; if it's console_sem->lock then
> it's another thing. But those two are just theories, not supported by any
> logs/backtraces from your systems.
>
> If it's uart_port->lock and there will be 2 patch sets to choose from
> for -stable, then -stable guys can pick up the one that requires less
> effort: 1 two-liner patch vs. 3 or 4 bigger patches.
>
> -ss



-- 
Best,
Daniel




Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-11 Thread Sergey Senozhatsky
On (12/11/18 17:16), Daniel Wang wrote:
> > Let's first figure out if it works.
> 
> I would still like to try applying your patches that went into
> printk.git, but for now I wonder if we can get Steven's patch into
> 4.14 first, for at least we know it mitigated the issue if not
> fundamentally addressed it, and we've agreed it's an innocuous change
> that doesn't risk breaking stable.

So... did my patch address the deadlock you are seeing or it didn't?

> I haven't done this before so I'll need your help. What's the next
> step to actually get Steven's patch *in* linux-4.14.y? According to
> https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> I am supposed to send an email with the patch ID and subject, which
> are both mentioned in this email. Should I send another one? What's
> the process like? Thanks!

I'm not doing any -stable releases, so can't really answer, sorry.
Probably would be better to re-address this question to 4.14 -stable
maintainers.


---
I guess we still don't have a really clear understanding of what exactly
is going on in your system. We don't even know for sure which one of the locks
is deadlocking the system. And why exactly Steven's patch helps. If it
is uart_port->lock, then it's one thing; if it's console_sem->lock then
it's another thing. But those two are just theories, not supported by any
logs/backtraces from your systems.

If it's uart_port->lock and there will be 2 patch sets to choose from
for -stable, then -stable guys can pick up the one that requires less
effort: 1 two-liner patch vs. 3 or 4 bigger patches.

-ss


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-12-11 Thread Daniel Wang
> Let's first figure out if it works.

I would still like to try applying your patches that went into
printk.git, but for now I wonder if we can get Steven's patch into
4.14 first, for at least we know it mitigated the issue if not
fundamentally addressed it, and we've agreed it's an innocuous change
that doesn't risk breaking stable.

I haven't done this before so I'll need your help. What's the next
step to actually get Steven's patch *in* linux-4.14.y? According to
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
I am supposed to send an email with the patch ID and subject, which
are both mentioned in this email. Should I send another one? What's
the process like? Thanks!

On Thu, Nov 8, 2018 at 10:47 PM Sergey Senozhatsky
 wrote:
>
> On (11/01/18 09:05), Daniel Wang wrote:
> > > Another deadlock scenario could be the following one:
> > >
> > > printk()
> > >  console_trylock()
> > >   down_trylock()
> > >    raw_spin_lock_irqsave(&sem->lock, flags)
> > >     <NMI>
> > >      panic()
> > >       console_flush_on_panic()
> > >        console_trylock()
> > >         raw_spin_lock_irqsave(&sem->lock, flags)    // deadlock
> > >
> > > There are no patches addressing this one at the moment. And it's
> > > unclear if you are hitting this scenario.
> >
> > I am not sure, but Steven's patches did make the deadlock I saw go away...
>
> You certainly can find cases when "busy spin on console_sem owner" logic
> can reduce some possibilities.
>
> But spin_lock(); NMI; spin_lock(); code is still in the kernel.
>
> > A little swamped by other things lately but I'll run a test with it.
> > If it works, would you recommend taking your patch alone
>
> Let's first figure out if it works.
>
> -ss



-- 
Best,
Daniel




Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-11-08 Thread Sergey Senozhatsky
On (11/01/18 09:05), Daniel Wang wrote:
> > Another deadlock scenario could be the following one:
> >
> > printk()
> >  console_trylock()
> >   down_trylock()
> >    raw_spin_lock_irqsave(&sem->lock, flags)
> >     <NMI>
> >      panic()
> >       console_flush_on_panic()
> >        console_trylock()
> >         raw_spin_lock_irqsave(&sem->lock, flags)    // deadlock
> >
> > There are no patches addressing this one at the moment. And it's
> > unclear if you are hitting this scenario.
> 
> I am not sure, but Steven's patches did make the deadlock I saw go away...

You certainly can find cases when "busy spin on console_sem owner" logic
can reduce some possibilities.

But spin_lock(); NMI; spin_lock(); code is still in the kernel.

> A little swamped by other things lately but I'll run a test with it.
> If it works, would you recommend taking your patch alone

Let's first figure out if it works.

-ss


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-11-01 Thread Daniel Wang
On Mon, Oct 22, 2018 at 3:10 AM Sergey Senozhatsky
 wrote:

> Another deadlock scenario could be the following one:
>
> printk()
>  console_trylock()
>   down_trylock()
>    raw_spin_lock_irqsave(&sem->lock, flags)
>     <NMI>
>      panic()
>       console_flush_on_panic()
>        console_trylock()
>         raw_spin_lock_irqsave(&sem->lock, flags)    // deadlock
>
> There are no patches addressing this one at the moment. And it's
> unclear if you are hitting this scenario.

I am not sure, but Steven's patches did make the deadlock I saw go away...

>
>
> > I see that Sergey had sent an RFC series for similar things. Are those
> > trying to solve the deadlock problem in a different way?
>
> Umm, I wouldn't call it "another way". It turns non-reentrant serial
> consoles into re-entrant ones. Did you test patch [1] from that series
> on your environment, by the way?

A little swamped by other things lately but I'll run a test with it.
If it works, would you recommend taking your patch alone and not
bother taking Steven's (since as you mentioned above even with his
patches there's still possibility for other deadlocks) ?

>
> [1] lkml.kernel.org/r/20181016050428.17966-2-sergey.senozhat...@gmail.com
>
> -ss



-- 
Best,
Daniel


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-22 Thread Sergey Senozhatsky
On (10/21/18 11:09), Daniel Wang wrote:
> 
> Just got back from vacation. Thanks for the continued discussion. Just so
> I understand the current state. Looks like we've got a pretty good explanation
> of what's going on (though not completely sure), and backporting Steven's
> patches is still the way to go?

Up to -stable maintainers.

Note, with or without Steven's patch, the non-reentrant consoles are
still non-reentrant, so the deadlock is still there:

spin_lock_irqsave(&port->lock, flags)
 <NMI>
  panic()
   console_flush_on_panic()
    spin_lock_irqsave(&port->lock, flags)   // deadlock


// And I wouldn't mind to have more reviews/testing on [1].


Another deadlock scenario could be the following one:

printk()
 console_trylock()
  down_trylock()
   raw_spin_lock_irqsave(&sem->lock, flags)
    <NMI>
     panic()
      console_flush_on_panic()
       console_trylock()
        raw_spin_lock_irqsave(&sem->lock, flags)    // deadlock

There are no patches addressing this one at the moment. And it's
unclear if you are hitting this scenario.


> I see that Sergey had sent an RFC series for similar things. Are those
> trying to solve the deadlock problem in a different way?

Umm, I wouldn't call it "another way". It turns non-reentrant serial
consoles into re-entrant ones. Did you test patch [1] from that series
on your environment, by the way?

[1] lkml.kernel.org/r/20181016050428.17966-2-sergey.senozhat...@gmail.com

-ss


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-22 Thread Petr Mladek
On Sun 2018-10-21 11:09:22, Daniel Wang wrote:
> Just got back from vacation. Thanks for the continued discussion. Just so
> I understand the current state. Looks like we've got a pretty good explanation
> of what's going on (though not completely sure), and backporting Steven's
> patches is still the way to go? I see that Sergey had sent an RFC series
> for similar things. Are those trying to solve the deadlock problem in a
> different way?

I suggest to go with backporting Steven's patchset. We do not have
anything better at the moment.

Best Regards,
Petr


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-21 Thread Daniel Wang
Just got back from vacation. Thanks for the continued discussion. Just so
I understand the current state. Looks like we've got a pretty good explanation
of what's going on (though not completely sure), and backporting Steven's
patches is still the way to go? I see that Sergey had sent an RFC series
for similar things. Are those trying to solve the deadlock problem in a
different way?

On Thu, Oct 4, 2018 at 1:55 AM Sergey Senozhatsky
 wrote:
>
> On (10/04/18 10:36), Petr Mladek wrote:
> >
> > This looks like a reasonable explanation of what is happening here.
> > It also explains why the console owner logic helped.
>
> Well, I'm still a bit puzzled, frankly speaking. I've two theories.
>
> Theory #1 [most likely]
>
>   Steven is a wizard and his code cures whatever problem we throw it at.
>
> Theory #2
>
>   console_sem hand over actually spreads the printout, so we don't have one CPU
> doing all the printing. Instead every CPU prints its backtrace, while the
> CPU which issued all_cpus_backtrace() waits for them. So all_cpus_backtrace()
> still has to wait for NR_CPUS * strlen(backtrace), which still probably
> triggers NMI panic on it at some point. The panic CPU sends out the stop IPI, then
> it waits for foreign CPUs to ACK the stop IPI request - for 10 seconds. So each
> CPU prints its backtrace, then ACKs the stop IPI. So when the panic CPU proceeds with
> flush_on_panic() and emergency_reboot(), uart_port->lock is unlocked. Without
> the patch we probably declare NMI panic on the CPU which does all the printing
> work, and panic sometimes jumps in when that CPU is busy in
> serial8250_console_write(), holding the uart_port->lock. So we can't re-enter
> the 8250 driver from panic CPU and we can't reboot the system. In other
> words... Steven is a wizard.
>
> > > serial8250_console_write()
> > > {
> > >         if (port->sysrq)
> > >                 locked = 0;
> > >         else if (oops_in_progress)
> > >                 locked = spin_trylock_irqsave(&port->lock, flags);
> > >         else
> > >                 spin_lock_irqsave(&port->lock, flags);
> > >
> > >         ...
> > >         uart_console_write(port, s, count, serial8250_console_putchar);
> > >         ...
> > >
> > >         if (locked)
> > >                 spin_unlock_irqrestore(&port->lock, flags);
> > > }
> > >
> > > Now... the problem. A theory, in fact.
> > > panic() sets oops_in_progress back to zero - bust_spinlocks(0) -  too 
> > > soon.
> >
> > I see your point. I am just a bit scared of this way. Ignoring locks
> > is a dangerous and painful approach in general.
>
> Well, I agree. But 8250 is not the only console which does ignore
> uart_port lock state sometimes. Otherwise sysrq would be totally unreliable,
> including emergency reboot. So it's sort of how it has been for quite some
> time, I guess. We are in panic(), it's over, so we probably can ignore
> uart_port->lock at this point.
>
> -ss



--
Best,
Daniel


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-04 Thread Sergey Senozhatsky
On (10/04/18 10:36), Petr Mladek wrote:
> 
> This looks like a reasonable explanation of what is happening here.
> It also explains why the console owner logic helped.

Well, I'm still a bit puzzled, frankly speaking. I've two theories.

Theory #1 [most likely]

  Steven is a wizard and his code cures whatever problem we throw it at.

Theory #2

  console_sem hand over actually spreads the printout, so we don't have one CPU
doing all the printing. Instead every CPU prints its backtrace, while the
CPU which issued all_cpus_backtrace() waits for them. So all_cpus_backtrace()
still has to wait for NR_CPUS * strlen(backtrace), which still probably
triggers NMI panic on it at some point. The panic CPU sends out the stop IPI, then
it waits for foreign CPUs to ACK the stop IPI request - for 10 seconds. So each
CPU prints its backtrace, then ACKs the stop IPI. So when the panic CPU proceeds with
flush_on_panic() and emergency_reboot(), uart_port->lock is unlocked. Without
the patch we probably declare NMI panic on the CPU which does all the printing
work, and panic sometimes jumps in when that CPU is busy in
serial8250_console_write(), holding the uart_port->lock. So we can't re-enter
the 8250 driver from panic CPU and we can't reboot the system. In other
words... Steven is a wizard.

> > serial8250_console_write()
> > {
> >         if (port->sysrq)
> >                 locked = 0;
> >         else if (oops_in_progress)
> >                 locked = spin_trylock_irqsave(&port->lock, flags);
> >         else
> >                 spin_lock_irqsave(&port->lock, flags);
> >
> >         ...
> >         uart_console_write(port, s, count, serial8250_console_putchar);
> >         ...
> >
> >         if (locked)
> >                 spin_unlock_irqrestore(&port->lock, flags);
> > }
> > 
> > Now... the problem. A theory, in fact.
> > panic() sets oops_in_progress back to zero - bust_spinlocks(0) -  too soon.
> 
> I see your point. I am just a bit scared of this way. Ignoring locks
> is a dangerous and painful approach in general.

Well, I agree. But 8250 is not the only console which does ignore
uart_port lock state sometimes. Otherwise sysrq would be totally unreliable,
including emergency reboot. So it's sort of how it has been for quite some
time, I guess. We are in panic(), it's over, so we probably can ignore
uart_port->lock at this point.
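
To spell the "too soon" theory out, the ordering in question is roughly the
following (a simplified sketch of panic()'s tail, not verbatim kernel source):

bust_spinlocks(1);              /* oops_in_progress++                              */
pr_emerg("Kernel panic - not syncing: ...");
dump_stack();
smp_send_stop();                /* may stop a CPU while it holds uart_port->lock   */
bust_spinlocks(0);              /* oops_in_progress back to zero - "too soon"      */
console_flush_on_panic();       /* serial8250_console_write() is back on the
                                   blocking spin_lock_irqsave() path and can
                                   deadlock on uart_port->lock                     */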

-ss


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-04 Thread Petr Mladek
On Thu 2018-10-04 16:44:42, Sergey Senozhatsky wrote:
> On (10/03/18 11:37), Daniel Wang wrote:
> > When `softlockup_panic` is set (which is what my original repro had and
> > what we use in production), without the backport patch, the expected panic
> > would hit a seemingly deadlock. So even when the machine is configured
> > to reboot immediately after the panic (kernel.panic=-1), it just hangs there
> > with an incomplete backtrace. With your patch, the deadlock doesn't happen
> > and the machine reboots successfully.
> > 
> > This was and still is the issue this thread is trying to fix. The last
> > log snippet
> > was from an "experiment" that I did in order to understand what's really
> > happening. So far the speculation has been that the panic path was trying
> > to get a lock held by a backtrace dumping thread, but there is not enough
> > evidence which thread is holding the lock and how it uses it. So I set
> > `softlockup_panic` to 0, to get panic out of the equation. Then I saw that 
> > one
> > CPU was indeed holding the console lock, trying to write something out. If
> > the panic was to hit while it's doing that, we might get a deadlock.
> 
> Hmm, console_sem state is ignored when we flush logbuf, so it's OK to
> have it locked when we declare panic():
> 
> void console_flush_on_panic(void)
> {
> 	/*
> 	 * If someone else is holding the console lock, trylock will fail
> 	 * and may_schedule may be set.  Ignore and proceed to unlock so
> 	 * that messages are flushed out.  As this can be called from any
> 	 * context and we don't want to get preempted while flushing,
> 	 * ensure may_schedule is cleared.
> 	 */
> 	console_trylock();
> 	console_may_schedule = 0;
> 	console_unlock();
> }
> 
> Things are not so simple with the uart_port lock. Generally speaking we
> should deadlock when NMI panic() kills the system while one of the
> CPUs holds the uart_port lock.

This looks like a reasonable explanation of what is happening here.
It also explains why the console owner logic helped.


> 8250 has sort of a workaround for this scenario:
> 
> serial8250_console_write()
> {
> 	if (port->sysrq)
> 		locked = 0;
> 	else if (oops_in_progress)
> 		locked = spin_trylock_irqsave(&port->lock, flags);
> 	else
> 		spin_lock_irqsave(&port->lock, flags);
>
> 	...
> 	uart_console_write(port, s, count, serial8250_console_putchar);
> 	...
>
> 	if (locked)
> 		spin_unlock_irqrestore(&port->lock, flags);
> }
> 
> Now... the problem. A theory, in fact.
> panic() sets oops_in_progress back to zero - bust_spinlocks(0) -  too soon.

I see your point. I am just a bit scared of this way. Ignoring locks
is a dangerous and painful approach in general.

Best Regards,
Petr


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-04 Thread Sergey Senozhatsky
On (10/04/18 16:44), Sergey Senozhatsky wrote:
> So... Just an idea. Can you try a very dirty hack? Forcibly increase
> oops_in_progress in panic() before console_flush_on_panic(), so 8250
> serial8250_console_write() will use spin_trylock_irqsave() and maybe
> avoid deadlock.

E.g. something like below?
[this is not a patch; just a theory]:

---

diff --git a/kernel/panic.c b/kernel/panic.c
index 8b2e002d52eb..188338a55d1c 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -233,7 +233,13 @@ void panic(const char *fmt, ...)
 	if (_crash_kexec_post_notifiers)
 		__crash_kexec(NULL);
 
+	/*
+	 * Decrement oops_in_progress and let bust_spinlocks() do the
+	 * unblank_screen(), console_unblank() and wake_up_klogd() work
+	 */
 	bust_spinlocks(0);
+	/* Set oops_in_progress again, so we can reenter the serial console driver */
+	bust_spinlocks(1);
 
 	/*
 	 * We may have ended up stopping the CPU holding the lock (in
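
For context, bust_spinlocks() itself is tiny; it is roughly this (a sketch based
on lib/bust_spinlocks.c, modulo CONFIG_VT ifdefs, and details vary across versions):

void bust_spinlocks(int yes)
{
	if (yes) {
		++oops_in_progress;
	} else {
		unblank_screen();
		console_unblank();
		if (--oops_in_progress == 0)
			wake_up_klogd();
	}
}

So the bust_spinlocks(0)/bust_spinlocks(1) pair above still gets the unblank and
klogd side effects of dropping to zero, but leaves oops_in_progress elevated, which
keeps serial8250_console_write() on the spin_trylock_irqsave() path.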


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-04 Thread Petr Mladek
On Wed 2018-10-03 13:37:04, Steven Rostedt wrote:
> On Wed, 3 Oct 2018 10:16:08 -0700
> Daniel Wang  wrote:
> 
> > On Wed, Oct 3, 2018 at 2:14 AM Petr Mladek  wrote:
> > >
> > > On Tue 2018-10-02 21:23:27, Steven Rostedt wrote:  
> > > > I don't see the big deal of backporting this. The biggest complaints
> > > > about backports are from fixes that were added to late -rc releases
> > > > where the fixes didn't get much testing. This commit was added in 4.16,
> > > > and hasn't had any issues due to the design. Although a fix has been
> > > > added:
> > > >
> > > > c14376de3a1 ("printk: Wake klogd when passing console_lock owner")  
> > >
> > > As I said, I am fine with backporting the console_lock owner stuff
> > > into the stable release.
> > >
> > > I just wonder (like Sergey) what the real problem is. The console_lock
> > > owner handshake is not fully reliable. It might be good enough
> 
> I'm not sure what you mean by 'not fully reliable'

I mean that it is not guaranteed that the very first printk() takes over
the console. It will happen only when the other printk() calls
console_trylock_spinning() while the current console owner does
the code between:

   console_lock_spinning_enable();
   console_lock_spinning_disable_and_check();
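
For illustration, that window is roughly this part of the console_unlock() printing
loop after dbdda842fe96 (a simplified sketch; the exact code differs between versions):

	for (;;) {
		/* ... pick the next record and format it into text ... */

		/* open the window: a waiter may now take over console_lock */
		console_lock_spinning_enable();

		stop_critical_timings();	/* don't trace print latency */
		call_console_drivers(ext_text, ext_len, text, len);
		start_critical_timings();

		/* close the window: if someone spun on us, hand over and return */
		if (console_lock_spinning_disable_and_check()) {
			printk_safe_exit_irqrestore(flags);
			return;
		}

		printk_safe_exit_irqrestore(flags);
	}

A printk() that lands outside that window does not become a waiter; it just stores
its message and fails console_trylock(), so the current owner keeps printing.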


> > > Just to be sure. Daniel, could you please send a log with
> > > the console_lock owner stuff backported? There we would see
> > > who called the panic() and why it rebooted early.  
> > 
> > Sure. Here is one. It's a bit long but complete. I attached another log
> > snippet below it which is what I got when `softlockup_panic` was turned
> > off. The log was from the IRQ task that was flushing the printk buffer. I
> > will be taking a closer look at it too but in case you'll find it helpful.
> 
> Just so I understand correctly. Does the panic hit with and without the
> suggested backport patch? The only difference is that you get the full
> output with the patch and limited output without it?

Sigh, the other mail suggests that there was a real deadlock. It means
that the console owner logic might help but it would not prevent
the deadlock completely.

Best Regards,
Petr


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-04 Thread Sergey Senozhatsky
On (10/03/18 11:37), Daniel Wang wrote:
> When `softlockup_panic` is set (which is what my original repro had and
> what we use in production), without the backport patch, the expected panic
> would hit a seemingly deadlock. So even when the machine is configured
> to reboot immediately after the panic (kernel.panic=-1), it just hangs there
> with an incomplete backtrace. With your patch, the deadlock doesn't happen
> and the machine reboots successfully.
> 
> This was and still is the issue this thread is trying to fix. The last
> log snippet
> was from an "experiment" that I did in order to understand what's really
> happening. So far the speculation has been that the panic path was trying
> to get a lock held by a backtrace dumping thread, but there is not enough
> evidence which thread is holding the lock and how it uses it. So I set
> `softlockup_panic` to 0, to get panic out of the equation. Then I saw that one
> CPU was indeed holding the console lock, trying to write something out. If
> the panic was to hit while it's doing that, we might get a deadlock.

Hmm, console_sem state is ignored when we flush logbuf, so it's OK to
have it locked when we declare panic():

void console_flush_on_panic(void)
{
	/*
	 * If someone else is holding the console lock, trylock will fail
	 * and may_schedule may be set.  Ignore and proceed to unlock so
	 * that messages are flushed out.  As this can be called from any
	 * context and we don't want to get preempted while flushing,
	 * ensure may_schedule is cleared.
	 */
	console_trylock();
	console_may_schedule = 0;
	console_unlock();
}

Things are not so simple with the uart_port lock. Generally speaking we
should deadlock when NMI panic() kills the system while one of the
CPUs holds the uart_port lock.

8250 has sort of a workaround for this scenario:

serial8250_console_write()
{
	if (port->sysrq)
		locked = 0;
	else if (oops_in_progress)
		locked = spin_trylock_irqsave(&port->lock, flags);
	else
		spin_lock_irqsave(&port->lock, flags);

	...
	uart_console_write(port, s, count, serial8250_console_putchar);
	...

	if (locked)
		spin_unlock_irqrestore(&port->lock, flags);
}

Note the spin_trylock_irqsave() path.
So, as long as we are in sysrq or oops_in_progress, the uart_port lock state
is sort of ignored.

Looking at your backtraces:

---
[  348.058207] NMI backtrace for cpu 8
[  348.058207] CPU: 8 PID: 1700 Comm: dd Tainted: G   O L  4.14.73 #18
[  348.058214]  
[  348.058214]  wait_for_xmitr+0x2c/0xb0
[  348.058215]  serial8250_console_putchar+0x1c/0x40
[  348.058215]  ? wait_for_xmitr+0xb0/0xb0
[  348.058215]  uart_console_write+0x33/0x70
[  348.058216]  serial8250_console_write+0xe2/0x2b0
[  348.058216]  ? msg_print_text+0xa6/0x110
[  348.058216]  console_unlock+0x306/0x4a0
[  348.058217]  wake_up_klogd_work_func+0x55/0x60
[  348.058217]  irq_work_run_list+0x50/0x80
[  348.058217]  smp_irq_work_interrupt+0x3f/0xe0
[  348.058218]  irq_work_interrupt+0x7d/0x90
---
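
Reading that trace bottom-up, the path that ends up holding the uart_port lock is
roughly:

  irq_work_interrupt
    wake_up_klogd_work_func()
      console_unlock()                   <- took console_sem via console_trylock()
        serial8250_console_write()       <- takes port->lock
          uart_console_write()
            serial8250_console_putchar()
              wait_for_xmitr()           <- polls the (slow) UART

i.e. CPU 8 holds both console_sem and port->lock while it drains the log buffer.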


Now... the problem. A theory, in fact.
panic() sets oops_in_progress back to zero - bust_spinlocks(0) -  too soon.

When we do console_flush_on_panic() we ignore the console_sem state and go
to the 8250 driver - serial8250_console_write(). But at this point
oops_in_progress is zero, so we end up in spin_lock_irqsave(&port->lock, flags).

If the port->lock was already locked, then this is your deadlock. We
can't emergency_restart() because the panic() CPU is stuck spinning on
port->lock in serial8250_console_write(), so it never returns from
console_flush_on_panic() and there is no progress.

---

void panic(const char *fmt, ...)
{

	bust_spinlocks(0);

	/*
	 * We may have ended up stopping the CPU holding the lock (in
	 * smp_send_stop()) while still having some valuable data in the console
	 * buffer.  Try to acquire the lock then release it regardless of the
	 * result.  The release will also print the buffers out.  Locks debug
	 * should be disabled to avoid reporting bad unlock balance when
	 * panic() is not being callled from OOPS.
	 */
	debug_locks_off();
	console_flush_on_panic();

	if (!panic_blink)
		panic_blink = no_blink;

	if (panic_timeout > 0) {
		/*
		 * Delay timeout seconds before rebooting the machine.
		 * We can't use the "normal" timers since we just panicked.
		 */
		pr_emerg("Rebooting in %d seconds..\n", panic_timeout);

		for (i = 0; i < panic_timeout * 1000; i += PANIC_TIMER_STEP) {
			touch_nmi_watchdog();
			if (i >= i_next) {
				i += panic_blink(state ^= 1);
				i_next = i + 3600 / PANIC_BLINK_SPD;
			}

Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-03 Thread Daniel Wang
I wanted to let you know that I am leaving for a two-week vacation. So
if you don't hear from me during that period assume bad network
connectivity and not lack of enthusiasm. :) Feel free to go with the
backports if we reach an agreement here. Otherwise I'll do it when I get
back. Thank you all!


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-03 Thread Daniel Wang
On Wed, Oct 3, 2018 at 10:37 AM Steven Rostedt  wrote:
> Just so I understand correctly. Does the panic hit with and without the
> suggested backport patch? The only difference is that you get the full
> output with the patch and limited output without it?

When `softlockup_panic` is set (which is what my original repro had and
what we use in production), without the backport patch, the expected panic
would hit a seemingly deadlock. So even when the machine is configured
to reboot immediately after the panic (kernel.panic=-1), it just hangs there
with an incomplete backtrace. With your patch, the deadlock doesn't happen
and the machine reboots successfully.

This was and still is the issue this thread is trying to fix. The last
log snippet
was from an "experiment" that I did in order to understand what's really
happening. So far the speculation has been that the panic path was trying
to get a lock held by a backtrace dumping thread, but there is not enough
evidence which thread is holding the lock and how it uses it. So I set
`softlockup_panic` to 0, to get panic out of the equation. Then I saw that one
CPU was indeed holding the console lock, trying to write something out. If
the panic was to hit while it's doing that, we might get a deadlock.

>
> -- Steve
>


-- 
Best,
Daniel


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-03 Thread Steven Rostedt
On Wed, 3 Oct 2018 10:16:08 -0700
Daniel Wang  wrote:

> On Wed, Oct 3, 2018 at 2:14 AM Petr Mladek  wrote:
> >
> > On Tue 2018-10-02 21:23:27, Steven Rostedt wrote:  
> > > I don't see the big deal of backporting this. The biggest complaints
> > > about backports are from fixes that were added to late -rc releases
> > > where the fixes didn't get much testing. This commit was added in 4.16,
> > > and hasn't had any issues due to the design. Although a fix has been
> > > added:
> > >
> > > c14376de3a1 ("printk: Wake klogd when passing console_lock owner")  
> >
> > As I said, I am fine with backporting the console_lock owner stuff
> > into the stable release.
> >
> > I just wonder (like Sergey) what the real problem is. The console_lock
> > owner handshake is not fully reliable. It might be good enough

I'm not sure what you mean by 'not fully reliable'

> > to prevent softlockup. But we should not rely on it to prevent
> > a deadlock.  
> 
> Yes. I myself was curious too. :)
> 
> >
> > My new theory ;-)
> >
> > printk_safe_flush() is called in nmi_trigger_cpumask_backtrace().  
> > => watchdog_timer_fn() is blocked until all backtraces are printed.  
> >
> > Now, the original report complained that the system rebooted before
> > all backtraces were printed. It means that panic() was called
> > on another CPU. My guess is that it is from the hardlockup detector.
> > And the panic() was not able to flush the console because it was
> > not able to take console_lock.
> >
> > IMHO, there was not a real deadlock. The console_lock owner
> > handshake just helped to get console_lock in panic() and
> > flush all messages before reboot => it is reasonable
> > and acceptable fix.  

Agreed.


> 
> I had the same speculation. Tried to capture a lockdep snippet with
> CONFIG_PROVE_LOCKING turned on but didn't get anything. But
> maybe I was doing it wrong.
> 
> >
> > Just to be sure. Daniel, could you please send a log with
> > the console_lock owner stuff backported? There we would see
> > who called the panic() and why it rebooted early.  
> 
> Sure. Here is one. It's a bit long but complete. I attached another log
> snippet below it which is what I got when `softlockup_panic` was turned
> off. The log was from the IRQ task that was flushing the printk buffer. I
> will be taking a closer look at it too but in case you'll find it helpful.

Just so I understand correctly. Does the panic hit with and without the
suggested backport patch? The only difference is that you get the full
output with the patch and limited output without it?

-- Steve



Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-03 Thread Daniel Wang
On Wed, Oct 3, 2018 at 2:14 AM Petr Mladek  wrote:
>
> On Tue 2018-10-02 21:23:27, Steven Rostedt wrote:
> > I don't see the big deal of backporting this. The biggest complaints
> > about backports are from fixes that were added to late -rc releases
> > where the fixes didn't get much testing. This commit was added in 4.16,
> > and hasn't had any issues due to the design. Although a fix has been
> > added:
> >
> > c14376de3a1 ("printk: Wake klogd when passing console_lock owner")
>
> As I said, I am fine with backporting the console_lock owner stuff
> into the stable release.
>
> I just wonder (like Sergey) what the real problem is. The console_lock
> owner handshake is not fully reliable. It might be good enough
> to prevent softlockup. But we should not rely on it to prevent
> a deadlock.

Yes. I myself was curious too. :)

>
> My new theory ;-)
>
> printk_safe_flush() is called in nmi_trigger_cpumask_backtrace().
> => watchdog_timer_fn() is blocked until all backtraces are printed.
>
> Now, the original report complained that the system rebooted before
> all backtraces were printed. It means that panic() was called
> on another CPU. My guess is that it is from the hardlockup detector.
> And the panic() was not able to flush the console because it was
> not able to take console_lock.
>
> IMHO, there was not a real deadlock. The console_lock owner
> handshake just helped to get console_lock in panic() and
> flush all messages before reboot => it is reasonable
> and acceptable fix.

I had the same speculation. Tried to capture a lockdep snippet with
CONFIG_PROVE_LOCKING turned on but didn't get anything. But
maybe I was doing it wrong.

>
> Just to be sure. Daniel, could you please send a log with
> the console_lock owner stuff backported? There we would see
> who called the panic() and why it rebooted early.

Sure. Here is one. It's a bit long but complete. I attached another log
snippet below it which is what I got when `softlockup_panic` was turned
off. The log was from the IRQ task that was flushing the printk buffer. I
will be taking a closer look at it too but in case you'll find it helpful.

lockup-test-16-2 login: [   89.277372] LoadPin: kernel-module
pinning-ignored obj="/tmp/release/hog.ko" pid=1992 cmdline="insmod
hog.ko"
[   89.280029] hog: loading out-of-tree module taints kernel.
[   89.294559] Hogging a CPU now
[   92.619688] watchdog: BUG: soft lockup - CPU#6 stuck for 3s! [hog:1993]
[   92.626490] Modules linked in: hog(O) ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 xt_addrtype nf_nat
br_netfilter ip6table_filter ip6_tables aesni_intel aes_x86_64
crypto_simd cryptd glue_helper
[   92.645567] CPU: 6 PID: 1993 Comm: hog Tainted: G   O 4.15.0+ #12
[   92.652899] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[   92.662245] RIP: 0010:hog_thread+0x13/0x1000 [hog]
[   92.667164] RSP: 0018:b489c741ff10 EFLAGS: 0282 ORIG_RAX:
ff11
[   92.675139] RAX: 0011 RBX: 9f5c75a88900 RCX: 
[   92.682474] RDX: 9f5c8339d840 RSI: 9f5c833954b8 RDI: 9f5c833954b8
[   92.689727] RBP: b489c741ff48 R08: 0030 R09: 
[   92.696985] R10: 03a8 R11:  R12: 9f5c7959e080
[   92.704251] R13: b489c7f2bc70 R14:  R15: 9f5c75a88948
[   92.711498] FS:  () GS:9f5c8338()
knlGS:
[   92.719699] CS:  0010 DS:  ES:  CR0: 80050033
[   92.725556] CR2: 558184c9b89c CR3: 000499e12006 CR4: 003606a0
[   92.732976] DR0:  DR1:  DR2: 
[   92.740231] DR3:  DR6: fffe0ff0 DR7: 0400
[   92.747487] Call Trace:
[   92.750054]  kthread+0x120/0x160
[   92.753419]  ? 0xc030d000
[   92.756859]  ? kthread_stop+0x120/0x120
[   92.760819]  ret_from_fork+0x35/0x40
[   92.764594] Code:  fe 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00
[   92.772656] Sending NMI from CPU 6 to CPUs 0-5,7-15:
[   92.43] NMI backtrace for cpu 0
[   92.46] CPU: 0 PID: 1844 Comm: dd Tainted: G   O 4.15.0+ #12
[   92.47] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[   92.55] RIP: 0010:native_queued_spin_lock_slowpath+0x18/0x1b0
[   92.56] RSP: 0018:b489c7dcbdb0 EFLAGS: 0002
[   92.57] RAX: 0001 RBX: 9f5c82ca5a68 RCX: 
[   92.58] RDX: 0001 RSI: 0001 RDI: 9f5c82ca5a68
[   92.59] RBP: b489c7dcbde0 R08: f9f8f56c R09: 4c55ba96
[   92.60] R10: 84f6cd57 R11: 41f66b45 R12: 9f5c82ca5a68
[   92.61] R13: b489c7dcbe38 R14: b489c7dcbe38 R15: 0040
[   92.62] FS:  7f1e21116700() GS:9f5c8320()
knlGS:
[   92.63] CS:  0010 DS: 

Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-03 Thread Petr Mladek
On Tue 2018-10-02 21:23:27, Steven Rostedt wrote:
> On Tue, 2 Oct 2018 17:15:17 -0700
> Daniel Wang  wrote:
> 
> > On Tue, Oct 2, 2018 at 1:42 AM Petr Mladek  wrote:
> > >
> > > Well, I still wonder why it helped and why you do not see it with 4.4.
> > > I have a feeling that the console owner switch helped only by chance.
> > > In fact, you might be affected by a race in
> > > printk_safe_flush_on_panic() that was fixed by the commit:
> > >
> > > 554755be08fba31c7 printk: drop in_nmi check from 
> > > printk_safe_flush_on_panic()
> > >
> > > The above one commit might be enough. Well, there was one more
> > > NMI-related race that was fixed by:
> > >
> > > ba552399954dde1b printk: Split the code for storing a message into the 
> > > log buffer
> > > a338f84dc196f44b printk: Create helper function to queue deferred console 
> > > handling
> > > 03fc7f9c99c1e7ae printk/nmi: Prevent deadlock when accessing the main log 
> > > buffer in NMI  
> > 
> > All of these commits already exist in 4.14 stable, since 4.14.68. The 
> > deadlock
> > still exists even when built from 4.14.73 (latest tag) though. And 
> > cherrypicking
> > dbdda842fe96 fixes it.
> > 
> 
> I don't see the big deal of backporting this. The biggest complaints
> about backports are from fixes that were added to late -rc releases
> where the fixes didn't get much testing. This commit was added in 4.16,
> and hasn't had any issues due to the design. Although a fix has been
> added:
> 
> c14376de3a1 ("printk: Wake klogd when passing console_lock owner")

As I said, I am fine with backporting the console_lock owner stuff
into the stable release.

I just wonder (like Sergey) what the real problem is. The console_lock
owner handshake is not fully reliable. It might be good enough
to prevent softlockup. But we should not rely on it to prevent
a deadlock.

My new theory ;-)

printk_safe_flush() is called in nmi_trigger_cpumask_backtrace().
=> watchdog_timer_fn() is blocked until all backtraces are printed.
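
For reference, nmi_trigger_cpumask_backtrace() is roughly this (a simplified sketch
of lib/nmi_backtrace.c; details vary between versions):

void nmi_trigger_cpumask_backtrace(const cpumask_t *mask, bool exclude_self,
				   void (*raise)(cpumask_t *mask))
{
	int i, this_cpu = get_cpu();

	/* ... copy *mask into backtrace_mask, optionally dropping this_cpu ... */

	if (!cpumask_empty(to_cpumask(backtrace_mask))) {
		pr_info("Sending NMI from CPU %d to CPUs %*pbl:\n",
			this_cpu, cpumask_pr_args(to_cpumask(backtrace_mask)));
		raise(to_cpumask(backtrace_mask));
	}

	/* Wait for up to 10 seconds for all CPUs to do the backtrace */
	for (i = 0; i < 10 * 1000; i++) {
		if (cpumask_empty(to_cpumask(backtrace_mask)))
			break;
		mdelay(1);
		touch_softlockup_watchdog();
	}

	/*
	 * Force flush any remote buffers that might be stuck in IRQ
	 * context and therefore could not run their irq_work.
	 */
	printk_safe_flush();

	/* ... clear backtrace_flag, put_cpu() ... */
}

So the hrtimer callback that requested the backtraces sits in that mdelay() loop
and then in printk_safe_flush() until the per-CPU NMI buffers are pushed out, which
is consistent with the watchdog being blocked until everything is printed.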

Now, the original report complained that the system rebooted before
all backtraces were printed. It means that panic() was called
on another CPU. My guess is that it is from the hardlockup detector.
And the panic() was not able to flush the console because it was
not able to take console_lock.

IMHO, there was not a real deadlock. The console_lock owner
handshake just helped to get console_lock in panic() and
flush all messages before reboot => it is reasonable
and acceptable fix.

Just to be sure. Daniel, could you please send a log with
the console_lock owner stuff backported? There we would see
who called the panic() and why it rebooted early.

Best Regards,
Petr


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-02 Thread Steven Rostedt
On Tue, 2 Oct 2018 17:15:17 -0700
Daniel Wang  wrote:

> On Tue, Oct 2, 2018 at 1:42 AM Petr Mladek  wrote:
> >
> > Well, I still wonder why it helped and why you do not see it with 4.4.
> > I have a feeling that the console owner switch helped only by chance.
> > In fact, you might be affected by a race in
> > printk_safe_flush_on_panic() that was fixed by the commit:
> >
> > 554755be08fba31c7 printk: drop in_nmi check from 
> > printk_safe_flush_on_panic()
> >
> > The above one commit might be enough. Well, there was one more
> > NMI-related race that was fixed by:
> >
> > ba552399954dde1b printk: Split the code for storing a message into the log 
> > buffer
> > a338f84dc196f44b printk: Create helper function to queue deferred console 
> > handling
> > 03fc7f9c99c1e7ae printk/nmi: Prevent deadlock when accessing the main log 
> > buffer in NMI  
> 
> All of these commits already exist in 4.14 stable, since 4.14.68. The deadlock
> still exists even when built from 4.14.73 (latest tag) though. And 
> cherrypicking
> dbdda842fe96 fixes it.
> 

I don't see the big deal of backporting this. The biggest complaints
about backports are from fixes that were added to late -rc releases
where the fixes didn't get much testing. This commit was added in 4.16,
and hasn't had any issues due to the design. Although a fix has been
added:

c14376de3a1 ("printk: Wake klogd when passing console_lock owner")

Also from 4.16, but nothing else according to searching for "Fixes"
tags.

-- Steve


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-02 Thread Daniel Wang
On Tue, Oct 2, 2018 at 1:42 AM Petr Mladek  wrote:
>
> Well, I still wonder why it helped and why you do not see it with 4.4.
> I have a feeling that the console owner switch helped only by chance.
> In fact, you might be affected by a race in
> printk_safe_flush_on_panic() that was fixed by the commit:
>
> 554755be08fba31c7 printk: drop in_nmi check from printk_safe_flush_on_panic()
>
> The above one commit might be enough. Well, there was one more
> NMI-related race that was fixed by:
>
> ba552399954dde1b printk: Split the code for storing a message into the log 
> buffer
> a338f84dc196f44b printk: Create helper function to queue deferred console 
> handling
> 03fc7f9c99c1e7ae printk/nmi: Prevent deadlock when accessing the main log 
> buffer in NMI

All of these commits already exist in 4.14 stable, since 4.14.68. The deadlock
still exists even when built from 4.14.73 (latest tag) though. And cherrypicking
dbdda842fe96 fixes it.

>
> Best Regards,
> Petr



-- 
Best,
Daniel


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-02 Thread Daniel Wang
On Tue, Oct 2, 2018 at 1:42 AM Petr Mladek  wrote:

> Well, I still wonder why it helped and why you do not see it with 4.4.
> I have a feeling that the console owner switch helped only by chance.

So do I. I don't think Steven had the deadlock in mind when working on
that patch, but with that patch and that patch alone, the deadlock
disappeared.

> In fact, you might be affected by a race in
> printk_safe_flush_on_panic() that was fixed by the commit:
>
> 554755be08fba31c7 printk: drop in_nmi check from printk_safe_flush_on_panic()
>
> The above one commit might be enough. Well, there was one more

Thanks for the pointer. Let me test this out.

> NMI-related race that was fixed by:
>
> ba552399954dde1b printk: Split the code for storing a message into the log 
> buffer
> a338f84dc196f44b printk: Create helper function to queue deferred console 
> handling
> 03fc7f9c99c1e7ae printk/nmi: Prevent deadlock when accessing the main log 
> buffer in NMI
>
> Best Regards,
> Petr


On Tue, Oct 2, 2018 at 1:17 AM Sergey Senozhatsky
 wrote:

> Sorry, I'm missing something here. Steven's patch deals with lockups and
> I understand why you want to backport the patch set; but console output
> deadlock on panic() is another thing.

Understood. But it did fix the deadlock for me and without it I hit the deadlock
pretty consistently. Let me test the patch Petr pointed out and will update
here.

-- 
Best,
Daniel


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-02 Thread Petr Mladek
On Mon 2018-10-01 13:37:30, Daniel Wang wrote:
> On Mon, Oct 1, 2018 at 12:23 PM Steven Rostedt  wrote:
> >
> > > Serial console logs leading up to the deadlock. As can be seen the stack 
> > > trace
> > > was incomplete because the printing path hit a timeout.
> >
> > I'm fine with having this backported.
> 
> Thanks. I can send the cherrypicks your way. Do you recommend that I
> include the three follow-up fixes though?
> 
> c14376de3a1b printk: Wake klogd when passing console_lock owner
> fd5f7cde1b85 printk: Never set console_may_schedule in console_trylock()
> c162d5b4338d printk: Hide console waiter logic into helpers
> dbdda842fe96 printk: Add console owner and waiter logic to load
> balance console writes

This list looks complete and I am fine with backporting it to 4.14.

Well, I still wonder why it helped and why you do not see it with 4.4.
I have a feeling that the console owner switch helped only by chance.
In fact, you might be affected by a race in
printk_safe_flush_on_panic() that was fixed by the commit:

554755be08fba31c7 printk: drop in_nmi check from printk_safe_flush_on_panic()

The above one commit might be enough. Well, there was one more
NMI-related race that was fixed by:

ba552399954dde1b printk: Split the code for storing a message into the log 
buffer
a338f84dc196f44b printk: Create helper function to queue deferred console 
handling
03fc7f9c99c1e7ae printk/nmi: Prevent deadlock when accessing the main log 
buffer in NMI

Best Regards,
Petr


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-02 Thread Sergey Senozhatsky
On (09/27/18 12:46), Daniel Wang wrote:
> Prior to this change, the combination of `softlockup_panic=1` and
> `softlockup_all_cpu_stacktrace=1` may result in a deadlock when the reboot path
> is trying to grab the console lock that is held by the stack trace printing
> path. What seems to be happening is that while there are multiple CPUs, only one
> of them is tasked to print the back trace of all CPUs. On a machine with many
> CPUs and a slow serial console (on Google Compute Engine for example), the stack
> trace printing routine hits a timeout and the reboot path kicks in. The latter
> then tries to print something else, but can't get the lock because it's still
> held by earlier printing path.

Sorry, I'm missing something here. Steven's patch deals with lockups and
I understand why you want to backport the patch set; but console output
deadlock on panic() is another thing.

You said
"then tries to print something else, but can't get the lock
 because it's still held by earlier printing path"

Can't get which of the locks?

-ss
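
For intuition only, the hang being described has roughly the shape sketched below: one thread holds a slow "console lock" while printing every CPU's backtrace, and the panic/reboot path blocks behind it when it also wants to print. This is a plain userspace analogue with pthreads, written from the description above; the names are illustrative and it is not the kernel's actual console_sem/printk locking.

/* Userspace analogue of the reported hang; names are illustrative. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t console_lock = PTHREAD_MUTEX_INITIALIZER;

/* One CPU is tasked to print every CPU's backtrace over a slow console. */
static void *backtrace_printer(void *arg)
{
	pthread_mutex_lock(&console_lock);
	for (int cpu = 0; ; cpu = (cpu + 1) % 16) {
		printf("backtrace for cpu %d ...\n", cpu);
		sleep(1);	/* stand-in for the slow serial console */
	}
	/* the lock is effectively never released in time */
	return NULL;
}

/* The reboot/panic path kicks in after a timeout and wants to print too. */
static void *panic_path(void *arg)
{
	sleep(3);
	printf("panic path: waiting for the console lock\n");
	pthread_mutex_lock(&console_lock);	/* blocks here indefinitely */
	printf("panic path: never reached\n");
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, backtrace_printer, NULL);
	pthread_create(&b, NULL, panic_path, NULL);
	pthread_join(b, NULL);	/* never returns: the analogue of the hang */
	return 0;
}

Built with -pthread, the second thread never gets past the lock, which is the behaviour seen on the serial console in the report.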


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-01 Thread Sasha Levin
On Mon, Oct 01, 2018 at 01:37:30PM -0700, Daniel Wang wrote:
>On Mon, Oct 1, 2018 at 12:23 PM Steven Rostedt  wrote:
>>
>> > Serial console logs leading up to the deadlock. As can be seen the stack trace
>> > was incomplete because the printing path hit a timeout.
>>
>> I'm fine with having this backported.
>
>Thanks. I can send the cherrypicks your way. Do you recommend that I
>include the three follow-up fixes though?
>
>c14376de3a1b printk: Wake klogd when passing console_lock owner
>fd5f7cde1b85 printk: Never set console_may_schedule in console_trylock()
>c162d5b4338d printk: Hide console waiter logic into helpers
>dbdda842fe96 printk: Add console owner and waiter logic to load balance console writes

Maybe it'll also make sense to make the reproducer into a test that can
go under tools/testing/ and we can backport that as well? It'll be
helpful to have a way to make sure things are sane.

--
Thanks,
Sasha
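
If the reproducer does get turned into a self-test, the core of it would presumably be the CPU-hogging piece described earlier in the thread. Below is a minimal sketch of what such a module might look like, assuming the repro works roughly as described (a kernel thread spinning long enough to trip the soft-lockup watchdog on a box booted with softlockup_panic=1 and softlockup_all_cpu_stacktrace=1); all names here are illustrative and not taken from the actual repro repository.

/*
 * Illustrative sketch only: a kernel thread that busy-spins with
 * preemption disabled, so the soft-lockup watchdog fires on that CPU.
 */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/preempt.h>
#include <linux/smp.h>
#include <linux/err.h>

static struct task_struct *hog_task;

static int hog_fn(void *data)
{
	preempt_disable();
	pr_info("cpuhog: spinning on CPU %d\n", smp_processor_id());
	/* Spin until unloaded; with softlockup_panic=1 the box panics first. */
	while (!kthread_should_stop())
		cpu_relax();
	preempt_enable();
	return 0;
}

static int __init cpuhog_init(void)
{
	hog_task = kthread_run(hog_fn, NULL, "cpuhog");
	return PTR_ERR_OR_ZERO(hog_task);
}

static void __exit cpuhog_exit(void)
{
	kthread_stop(hog_task);
}

module_init(cpuhog_init);
module_exit(cpuhog_exit);
MODULE_LICENSE("GPL");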


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-01 Thread Daniel Wang
On Mon, Oct 1, 2018 at 1:23 PM Vlastimil Babka  wrote:
>
> On 10/1/18 10:13 PM, Pavel Machek wrote:
> >
> > Dunno. Is the patch perhaps a bit too complex? This is not exactly a
> > trivial bugfix.
> >
> > pavel@duo:/data/l/clean-cg$ git show dbdda842fe96f | diffstat
> >  printk.c |  108
> >  ++-
> >
> > I see that it is pretty critical to Daniel, but maybe a kernel with
> > console locking redone should no longer be called 4.4?
>
> In that case it probably should no longer be called 4.4 since at least
> the Meltdown/Spectre fixes :)

To clarify, I am requesting a backport to 4.14. This bug doesn't repro on 4.4.

>


-- 
Best,
Daniel


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-01 Thread Daniel Wang
On Mon, Oct 1, 2018 at 12:23 PM Steven Rostedt  wrote:
>
> > Serial console logs leading up to the deadlock. As can be seen the stack trace
> > was incomplete because the printing path hit a timeout.
>
> I'm fine with having this backported.

Thanks. I can send the cherrypicks your way. Do you recommend that I
include the three follow-up fixes though?

c14376de3a1b printk: Wake klogd when passing console_lock owner
fd5f7cde1b85 printk: Never set console_may_schedule in console_trylock()
c162d5b4338d printk: Hide console waiter logic into helpers
dbdda842fe96 printk: Add console owner and waiter logic to load balance console writes

>
> -- Steve



-- 
Best,
Daniel
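
The core idea of dbdda842fe96 ("console owner and waiter logic") can be pictured with the simplified sketch below: whoever is currently writing to the console checks, after each record, whether another context is spinning as a waiter, and if so hands the lock (and the remaining work) over, so no single CPU keeps printing forever. This is only an illustration of the handoff concept with made-up names; it is not the kernel implementation.

/* Simplified userspace illustration of the owner/waiter handoff idea. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t console_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int waiter_present;	/* someone is spinning, ready to take over */
static atomic_int pending;		/* records sitting in a shared "log buffer" */

static void emit_record(const char *who)
{
	printf("%s: flushed one record\n", who);
	usleep(10000);			/* stand-in for the slow serial console */
}

static void console_flush(const char *who)
{
	if (pthread_mutex_trylock(&console_lock)) {
		/* Busy: announce ourselves as the waiter and spin for the lock. */
		atomic_store(&waiter_present, 1);
		pthread_mutex_lock(&console_lock);
		atomic_store(&waiter_present, 0);
	}
	while (atomic_load(&pending) > 0) {
		atomic_fetch_sub(&pending, 1);
		emit_record(who);
		/* The handoff: if a waiter showed up, give it the lock and the work. */
		if (atomic_load(&waiter_present)) {
			pthread_mutex_unlock(&console_lock);
			return;
		}
	}
	pthread_mutex_unlock(&console_lock);
}

static void *printer(void *arg)
{
	atomic_fetch_add(&pending, 50);		/* "log" a burst of messages */
	console_flush((const char *)arg);	/* then try to flush them */
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, printer, "cpu0");
	usleep(50000);			/* let cpu0 become the console owner */
	pthread_create(&b, NULL, printer, "cpu1");
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

Run as above, cpu0 stops flushing as soon as cpu1 starts spinning and cpu1 finishes the backlog; without the waiter_present check, cpu0 would stay stuck flushing both bursts on its own.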


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-01 Thread Steven Rostedt
On Mon, 1 Oct 2018 22:13:10 +0200
Pavel Machek  wrote:

> > > [1] 
> > > https://lore.kernel.org/lkml/20180409081535.dq7p5bfnpvd3x...@pathway.suse.cz/T/#u
> > > 
> > > Serial console logs leading up to the deadlock. As can be seen the stack trace
> > > was incomplete because the printing path hit a timeout.
> > 
> > I'm fine with having this backported.  
> 
> Dunno. Is the patch perhaps a bit too complex? This is not exactly a
> trivial bugfix.
> 
> pavel@duo:/data/l/clean-cg$ git show dbdda842fe96f | diffstat
>  printk.c |  108
>  ++-
> 
> I see that it is pretty critical to Daniel, but maybe a kernel with
> console locking redone should no longer be called 4.4?

But it prevents a deadlock.

I usually weigh backporting as benefit vs risk. And I believe the
benefit outweighs the risk in this case.

-- Steve


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-01 Thread Vlastimil Babka
On 10/1/18 10:13 PM, Pavel Machek wrote:
> 
> Dunno. Is the patch perhaps a bit too complex? This is not exactly a
> trivial bugfix.
> 
> pavel@duo:/data/l/clean-cg$ git show dbdda842fe96f | diffstat
>  printk.c |  108
>  ++-
> 
> I see that it is pretty critical to Daniel, but maybe a kernel with
> console locking redone should no longer be called 4.4?

In that case it probably should no longer be called 4.4 since at least
the Meltdown/Spectre fixes :)



Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-01 Thread Pavel Machek
On Mon 2018-10-01 15:23:24, Steven Rostedt wrote:
> On Thu, 27 Sep 2018 12:46:01 -0700
> Daniel Wang  wrote:
> 
> > Prior to this change, the combination of `softlockup_panic=1` and
> > `softlockup_all_cpu_stacktrace=1` may result in a deadlock when the reboot path
> > is trying to grab the console lock that is held by the stack trace printing
> > path. What seems to be happening is that while there are multiple CPUs, only one
> > of them is tasked to print the back trace of all CPUs. On a machine with many
> > CPUs and a slow serial console (on Google Compute Engine for example), the stack
> > trace printing routine hits a timeout and the reboot path kicks in. The latter
> > then tries to print something else, but can't get the lock because it's still
> > held by earlier printing path. This is easily reproducible on a VM with 16+
> > vCPUs on Google Compute Engine - which is a very common scenario.
> > 
> > A quick repro is available at
> > https://github.com/wonderfly/printk-deadlock-repro. The system hangs 3 seconds
> > into executing repro.sh. Both deadlock analysis and repro are credited to Peter
> > Feiner.
> > 
> > Note that I have read previous discussions on backporting this to stable [1].
> > The argument for objecting to the backport was that this is a non-trivial fix and
> > is supposed to prevent hypothetical soft lockups. What we are hitting is a real
> > deadlock, in production, however. Hence this request.
> > 
> > [1] 
> > https://lore.kernel.org/lkml/20180409081535.dq7p5bfnpvd3x...@pathway.suse.cz/T/#u
> > 
> > Serial console logs leading up to the deadlock. As can be seen the stack trace
> > was incomplete because the printing path hit a timeout.
> 
> I'm fine with having this backported.

Dunno. Is the patch perhaps a bit too complex? This is not exactly a
trivial bugfix.

pavel@duo:/data/l/clean-cg$ git show dbdda842fe96f | diffstat
 printk.c |  108
 ++-

I see that it is pretty critical to Daniel, but maybe a kernel with
console locking redone should no longer be called 4.4?
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(Czech, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

2018-10-01 Thread Steven Rostedt
On Thu, 27 Sep 2018 12:46:01 -0700
Daniel Wang  wrote:

> Prior to this change, the combination of `softlockup_panic=1` and
> `softlockup_all_cpu_stacktrace=1` may result in a deadlock when the reboot path
> is trying to grab the console lock that is held by the stack trace printing
> path. What seems to be happening is that while there are multiple CPUs, only one
> of them is tasked to print the back trace of all CPUs. On a machine with many
> CPUs and a slow serial console (on Google Compute Engine for example), the stack
> trace printing routine hits a timeout and the reboot path kicks in. The latter
> then tries to print something else, but can't get the lock because it's still
> held by earlier printing path. This is easily reproducible on a VM with 16+
> vCPUs on Google Compute Engine - which is a very common scenario.
> 
> A quick repro is available at
> https://github.com/wonderfly/printk-deadlock-repro. The system hangs 3 seconds
> into executing repro.sh. Both deadlock analysis and repro are credited to Peter
> Feiner.
> 
> Note that I have read previous discussions on backporting this to stable [1].
> The argument for objecting to the backport was that this is a non-trivial fix and
> is supposed to prevent hypothetical soft lockups. What we are hitting is a real
> deadlock, in production, however. Hence this request.
> 
> [1] 
> https://lore.kernel.org/lkml/20180409081535.dq7p5bfnpvd3x...@pathway.suse.cz/T/#u
> 
> Serial console logs leading up to the deadlock. As can be seen the stack trace
> was incomplete because the printing path hit a timeout.

I'm fine with having this backported.

-- Steve

