Re: [PATCH 24/30] panic: Refactor the panic path

2022-06-15 Thread Guilherme G. Piccoli
Perfect Petr, thanks for your feedback!

I'll be out for some weeks, but after that what I'm doing is to split
the series in 2 parts:

(a) The general fixes, which should be reviewed by subsystem maintainers
and even merged individually by them.

(b) The proper panic refactor, which includes the notifiers list split,
etc. I'll think about what I consider the best solution for the
crash_dump required ones, and will try to split in very simple patches
to make it easier to review.

Cheers,


Guilherme



Re: [PATCH 24/30] panic: Refactor the panic path

2022-06-14 Thread Petr Mladek
On Thu 2022-05-26 13:25:57, Guilherme G. Piccoli wrote:
> OK, so it seems we have some points in which agreement exists, and some
> points that there is no agreement and instead, we have antagonistic /
> opposite views and needs. Let's start with the easier part heh
>
> It seems everybody agrees that *we shouldn't over-engineer things*, and
> as per Eric good words: making the panic path more feature-full or
> increasing flexibility isn't a good idea. So, as a "corollary": the
> panic level approach I'm proposing is not a good fit, I'll drop it and
> let's go with something simpler.

Makes sense.

> Another point of agreement seems to be that _notifier lists in the panic
> path are dangerous_, for *2 different reasons*:
> 
> (a) We cannot guarantee that people won't add crazy callbacks there, we
> can plan and document things the best as possible - it'll never be
> enough, somebody eventually would slip a nonsense callback that would
> break things and defeat the planned purpose of such a list;

It is true that notifier lists might allow to add crazy stuff
without proper review more easily. Things added into the core
code would most likely get better review.

But nothing is error-proof. And bugs will happen with any approach.


> (b) As per Eric point, in a panic/crash situation we might have memory
> corruption exactly in the list code / pointers, etc, so the notifier
> lists are, by nature, a bit fragile. But I think we shouldn't consider
> it completely "bollocks", since this approach has been used for a while
> with a good success rate. So, lists aren't perfect at all, but at the
> same time, they aren't completely useless.

I am not able to judge this. Of course, any extra step increases
the risk. I am just not sure how much more complicated it would
be to hardcode the calls. Most of them are architecture
and/or feature specific. And such code is often hard to
review and maintain.

> To avoid using a 4th list,

4th or 5th? We already have "hypervisor", "info", "pre-reboot", and "pre-loop".
The 5th might be pre-crash-exec.

> especially given the list nature is a bit
> fragile, I'd suggest one of the 3 following approaches - I *really
> appreciate feedbacks* on that so I can implement the best solution and
> avoid wasting time in some poor/disliked solution:

Honestly, I am not able to decide what might be better without seeing
the code.

Most things fits pretty well into the 4 proposed lists:
"hypervisor", "info", "pre-reboot", and "pre-loop". IMHO, the
only question is the code that needs to be always called
even before crash_dump.

I suggest that you solve the crash_dump callbacks the way that
looks best to you. Ideally do it in a separate patch so it can be
reviewed and reworked more easily.

I believe that a fresh code with an updated split and simplified
logic would help us to move forward.

Best Regards,
Petr



Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-26 Thread Guilherme G. Piccoli
Hey folks, first of all thanks a lot for the reviews / opinions about
this. I imagined that such change would be polemic, and I see I was
right heh


I'll try to "mix" all the relevant opinions in a single email, since
they happened in different responses and even different mail threads.

I've looped here the most interested parties based on the feedback
received, such as Baoquan (kdump), Hatayama (kdump), Eric (kexec), Mark
(arm64), Michael (Hyper-V), Petr (console/printk) and Vitaly (hyper-v /
kvm). I hope we can discuss and try to reach some consensus - my
apologies in advance for this long message!

So, here goes some feedback we received about this change and correlated
feedback from arm64 community - my apologies if I missed something
important, I've tried to collect the most relevant portions, while
keeping the summary "as short" as possible. I'll respond to such
feedback below, after the quotes.


On 24/05/2022 05:32, Baoquan He wrote:
>> [...] 
>> Firstly, kdump is not always the first thing. In any use case, if kdump
>> kernel is not loaded, it's not the first thing at all. Not to mention
>> if crash_kexec_post_notifiers is specified.
>> [...]
>> Changing this will cause regression. During these years, nobody ever doubt
>> kdump should execute firstly if crashkernel is reserved and kdump kernel is
>> loaded. That's not saying we can't change
>> this, but need a convincing justification.
>> [...] 
>> Secondly, even with the notifiers' split, we can't guarantee people will
>> absolutely add notifiers into right list in the future. Letting kdump
>> execute behind lists by default will put kdump into risk.
>> [...] 
>> As for Hyper-V, if it enforces to terminate VMbus connection, no matter
>> it's kdump or not, why not taking it out of panic notifiers list and
>> execute it before kdump unconditionally.


On 24/05/2022 05:01, Petr Mladek wrote:
>> [...]
>> Anyway, I see four possible solutions:
>> 
>>   1. The most conservative approach is to keep the current behavior
>>  and call kdump first by default.
>> 
>>   2. A medium conservative approach to change the default default
>>  behavior and call hypervisor and eventually the info notifiers
>>  before kdump. There still would be the possibility to call kdump
>>  first by the command line parameter.
>> 
>>   3. Remove the possibility to call kdump first completely. It would
>>  assume that all the notifiers in the info list are super safe
>>  or that they make kdump actually more safe.
>> 
>>   4. Create one more notifier list for operations that always should
>>  be called before crash_dump.
>> 
>> Regarding the extra notifier list (4th solution). It is not clear to
>> me whether it would be always called even before hypervisor list or
>> when kdump is not enabled. We must not over-engineer it.
>> 
>> 2nd proposal looks like a good compromise. But maybe we could do
>> this change few releases later. The notifiers split is a big
>> change on its own.


On 24/05/2022 07:18, Baoquan He wrote:
>>[...]
>> I would vote for 1 or 4 without any hesitation, and prefer 4. I ever
>> suggest the variant of solution 4 in v1 reviewing. That's taking those
>> notifiers out of list and enforcing to execute them before kdump. E.g
>> the one on HyperV to terminate VMbus connection. Maybe solution 4 is
>> better to provide a determinate way for people to add necessary code
>> at the earliest part.
>> [...] 
>>>
>>> Regarding the extra notifier list (4th solution). It is not clear to
>>> me whether it would be always called even before hypervisor list or
>>> when kdump is not enabled. We must not over-engineer it.
>> 
>> One thing I would like to notice is, no matter how perfect we split the
>> lists this time, we can't gurantee people will add notifiers reasonablly
>> in the future. And people from different sub-component may not do
>> sufficient investigation and add them to fulfil their local purpose.
>> 
>> The current panic notifers list is the best example. Hyper-V actually
>> wants to run some necessary code before kdump, but not all of them, they
>> just add it, ignoring the original purpose of
>> crash_kexec_post_notifiers. I guess they do like this just because it's
>> easy to do, no need to bother changing code in generic place.
>> 
>> Solution 4 can make this no doubt, that's why I like it better.
>> [...] 
>> As I replied to Guilherme, solution 2 will cause regression if not
>> calling kdump firstly. Solution 3 leaves people space to make mistake,
>> they could add nontifier into wrong list.
>> 
>> I would like to note again that the panic notifiers are optional to run,
>> while kdump is expectd once loaded, from the original purpose. I guess
>> people I know will still have this thought, e.g Hatayama, Masa, they are
>> truly often use panic notifiers like this on their company's system.


On 24/05/2022 11:44, Eric W. Biederman wrote:
> [...]
> Unfortunately I am also very grouchy.
> 
> Notifiers before kexec on panic are 

Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-24 Thread Eric W. Biederman
"Guilherme G. Piccoli"  writes:

> The panic() function is somewhat convoluted - a lot of changes were
> made over the years, adding comments that might be misleading/outdated
> now, it has a code structure that is a bit complex to follow, with
> lots of conditionals, for example. The panic notifier list is something
> else - a single list, with multiple callbacks of different purposes,
> that run in a non-deterministic order and may affect hardly kdump
> reliability - see the "crash_kexec_post_notifiers" workaround-ish flag.
>
> This patch proposes a major refactor on the panic path based on Petr's
> idea [0] - basically we split the notifiers list in three, having a set
> of different call points in the panic path. Below a list of changes
> proposed in this patch, culminating in the panic notifiers level
> concept:
>
> (a) First of all, we improved comments all over the function
> and removed useless variables / includes. Also, as part of this
> clean-up we concentrate the console flushing functions in a helper.
>
> (b) As mentioned before, there is a split of the panic notifier list
> in three, based on the purpose of the callback. The code contains
> good documentation in form of comments, but a summary of the three
> lists follows:
>
> - the hypervisor list aims low-risk procedures to inform hypervisors
> or firmware about the panic event, also includes LED-related functions;
>
> - the informational list contains callbacks that provide more details,
> like kernel offset or trace dump (if enabled) and also includes the
> callbacks aimed at reducing log pollution or warns, like the RCU and
> hung task disable callbacks;
>
> - finally, the pre_reboot list is the old notifier list renamed,
> containing the more risky callbacks that didn't fit the previous
> lists. There is also a 4th list (the post_reboot one), but it's not
> related with the original list - it contains late time architecture
> callbacks aimed at stopping the machine, for example.
>
> The 3 notifiers lists execute in different moments, hypervisor being
> the first, followed by informational and finally the pre_reboot list.
>
> (c) But then, there is the ordering problem of the notifiers against
> the crash_kernel() call - kdump must be as reliable as possible.
> For that, a simple binary "switch" as "crash_kexec_post_notifiers"
> is not enough, hence we introduce here concept of panic notifier
> levels: there are 5 levels, from 0 (no notifier executes before
> kdump) until 4 (all notifiers run before kdump); the default level
> is 2, in which the hypervisor and (iff we have any kmsg dumper)
> the informational notifiers execute before kdump.
>
> The detailed documentation of the levels is present in code comments
> and in the kernel-parameters.txt file; as an analogy with the previous
> panic() implementation, the level 0 is exactly the same as the old
> behavior of notifiers, running all after kdump, and the level 4 is
> the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as
> a deprecated one).
>
> (d) Finally, an important change made here: we now use only the
> function "crash_smp_send_stop()" to shut all the secondary CPUs
> in the panic path. Before, there was a case of using the regular
> "smp_send_stop()", but the better approach is to simplify the
> code and try to use the function which was created exclusively
> for the panic path. Experiments showed that it works fine, and
> code was very simplified with that.
>
> Functional change is expected from this refactor, since now we
> call some notifiers by default before kdump, but the goal here
> besides code clean-up is to have a better panic path, more
> reliable and deterministic, but also very customizable.
>
> [0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/

I am late to this discussion.  My apologies.

Unfortunately I am also very grouchy.

Notifiers before kexec on panic are fundamentally broken.  So please
just remove crash_kexec_post notifiers and be done with it.  Part of the
deep issue is that firmware always has a common and broken
implementation for anything that is not mission critical to
motherboards.

Notifiers in any sense on these paths are just bollocks.  Any kind of
notifier list is fundamentally fragile in the face of memory corruption
and very very difficult to review.

So I am going to refresh my ancient NACK on this.

I can certainly appreciate that there are pieces of the reboot paths
that can be improved.  I don't think making anything more feature full
or flexible is any kind of real improvement.

Eric



>
> Suggested-by: Petr Mladek 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>
> Special thanks to Petr and Baoquan for the suggestion and feedback in a 
> previous
> email thread. There's some important design decisions that worth mentioning 
> and
> discussing:
>
> * The default panic notifiers level is 2, based on Petr Mladek's suggestion,
> which makes a lot of sense. Of course, this is customizable through the
> parameter, but 

Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-24 Thread Baoquan He
On 05/24/22 at 10:01am, Petr Mladek wrote:
> On Fri 2022-05-20 08:23:33, Guilherme G. Piccoli wrote:
> > On 19/05/2022 20:45, Baoquan He wrote:
> > > [...]
> > >> I really appreciate the summary skill you have, to convert complex
> > >> problems in very clear and concise ideas. Thanks for that, very useful!
> > >> I agree with what was summarized above.
> > > 
> > > I want to say the similar words to Petr's reviewing comment when I went
> > > through the patches and traced each reviewing sub-thread to try to
> > > catch up. Petr has reivewed this series so carefully and given many
> > > comments I want to ack immediately.
> > > 
> > > I agree with most of the suggestions from Petr to this patch, except of
> > > one tiny concern, please see below inline comment.
> > 
> > Hi Baoquan, thanks! I'm glad you're also reviewing that =)
> > 
> > 
> > > [...]
> > > 
> > > I like the proposed skeleton of panic() and code style suggested by
> > > Petr very much. About panic_prefer_crash_dump which might need be added,
> > > I hope it has a default value true. This makes crash_dump execute at
> > > first by default just as before, unless people specify
> > > panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add
> > > panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump,
> > > this is inconsistent with the old behaviour.
> > 
> > I'd like to understand better why the crash_kexec() must always be the
> > first thing in your use case. If we keep that behavior, we'll see all
> > sorts of workarounds - see the last patches of this series, Hyper-V and
> > PowerPC folks hardcoded "crash_kexec_post_notifiers" in order to force
> > execution of their relevant notifiers (like the vmbus disconnect,
> > specially in arm64 that has no custom machine_crash_shutdown, or the
> > fadump case in ppc). This led to more risk in kdump.
> > 
> > The thing is: with the notifiers' split, we tried to keep only the most
> > relevant/necessary stuff in this first list, things that ultimately
> > should improve kdump reliability or if not, at least not break it. My
> > feeling is that, with this series, we should change the idea/concept
> > that kdump must run first nevertheless, not matter what. We're here
> > trying to accommodate the antagonistic goals of hypervisors that need
> > some clean-up (even for kdump to work) VS. kdump users, that wish a
> > "pristine" system reboot ASAP after the crash.
> 
> Good question. I wonder if Baoquan knows about problems caused by the
> particular notifiers that will end up in the hypervisor list. Note
> that there will be some shuffles and the list will be slightly
> different in V2.

Yes, I knew some of them. Please check my response to Guilherme.

We have bug to track the issue on Hyper-V in which failure happened
during panic notifiers running, haven't come to kdump. Seems both of
us sent mail replying to Guilherme at the same time. 

> 
> Anyway, I see four possible solutions:
> 
>   1. The most conservative approach is to keep the current behavior
>  and call kdump first by default.
> 
>   2. A medium conservative approach to change the default default
>  behavior and call hypervisor and eventually the info notifiers
>  before kdump. There still would be the possibility to call kdump
>  first by the command line parameter.
> 
>   3. Remove the possibility to call kdump first completely. It would
>  assume that all the notifiers in the info list are super safe
>  or that they make kdump actually more safe.
> 
>   4. Create one more notifier list for operations that always should
>  be called before crash_dump.

I would vote for 1 or 4 without any hesitation, and prefer 4. I ever
suggest the variant of solution 4 in v1 reviewing. That's taking those
notifiers out of list and enforcing to execute them before kdump. E.g
the one on HyperV to terminate VMbus connection. Maybe solution 4 is
better to provide a determinate way for people to add necessary code
at the earliest part.

> 
> Regarding the extra notifier list (4th solution). It is not clear to
> me whether it would be always called even before hypervisor list or
> when kdump is not enabled. We must not over-engineer it.

One thing I would like to notice is, no matter how perfect we split the
lists this time, we can't gurantee people will add notifiers reasonablly
in the future. And people from different sub-component may not do
sufficient investigation and add them to fulfil their local purpose.

The current panic notifers list is the best example. Hyper-V actually
wants to run some necessary code before kdump, but not all of them, they
just add it, ignoring the original purpose of
crash_kexec_post_notifiers. I guess they do like this just because it's
easy to do, no need to bother changing code in generic place.

Solution 4 can make this no doubt, that's why I like it better.

> 
> 2nd proposal looks like a good compromise. But maybe we could do
> this change few releases later. The 

Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-24 Thread Baoquan He
On 05/20/22 at 08:23am, Guilherme G. Piccoli wrote:
> On 19/05/2022 20:45, Baoquan He wrote:
> > [...]
> >> I really appreciate the summary skill you have, to convert complex
> >> problems in very clear and concise ideas. Thanks for that, very useful!
> >> I agree with what was summarized above.
> > 
> > I want to say the similar words to Petr's reviewing comment when I went
> > through the patches and traced each reviewing sub-thread to try to
> > catch up. Petr has reivewed this series so carefully and given many
> > comments I want to ack immediately.
> > 
> > I agree with most of the suggestions from Petr to this patch, except of
> > one tiny concern, please see below inline comment.
> 
> Hi Baoquan, thanks! I'm glad you're also reviewing that =)
> 
> 
> > [...]
> > 
> > I like the proposed skeleton of panic() and code style suggested by
> > Petr very much. About panic_prefer_crash_dump which might need be added,
> > I hope it has a default value true. This makes crash_dump execute at
> > first by default just as before, unless people specify
> > panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add
> > panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump,
> > this is inconsistent with the old behaviour.
> 
> I'd like to understand better why the crash_kexec() must always be the
> first thing in your use case. If we keep that behavior, we'll see all
> sorts of workarounds - see the last patches of this series, Hyper-V and
> PowerPC folks hardcoded "crash_kexec_post_notifiers" in order to force
> execution of their relevant notifiers (like the vmbus disconnect,
> specially in arm64 that has no custom machine_crash_shutdown, or the
> fadump case in ppc). This led to more risk in kdump.

Firstly, kdump is not always the first thing. In any use case, if kdump
kernel is not loaded, it's not the first thing at all. Not to mention
if crash_kexec_post_notifiers is specified.

if kdump kernel is loaded, kdump has been executing firslty, since it
was added into kenrel/panic(); Until 2014, Masa added crash_kexec_post_notifiers
kernel parameter to make panic notifiers be able to execute before kdump
if specified.

commit dc009d92435f99498cbc579ce76bf28e837e2c14
Author: Eric W. Biederman 
Date:   Sat Jun 25 14:57:52 2005 -0700

[PATCH] kexec: add kexec syscalls
 
commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45
Author: Masami Hiramatsu 
Date:   Fri Jun 6 14:37:07 2014 -0700

kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump 
after panic_notifers

Changing this will cause regression. During these years, nobody ever doubt
kdump should execute firstly if crashkernel is reserved and kdump kernel is
loaded. That's not saying we can't change
this, but need a convincing justification.

Secondly, even with the notifiers' split, we can't guarantee people will
absolutely add notifiers into right list in the future. Letting kdump
execute behind lists by default will put kdump into risk.

For example, you replied to Hatamata saying you have been working with
kdump in the last 3, 4 years, and you have have been working on these
panic notifiers refactoring issue in the recent months. However, in your
refactoring patches of introducing hypervisor/info/pre-reboot, I noticed
you acked the suggestion from Petr that several notifiers need be moved to
correct position. So even you can't make sure these, how can other people
be able to recognize which list should be 100% appropriate when they try
to register one notifier for their sub-component?

At last, I am wondering why fadump matters. I don't know in which case
people wants to load kdump kernel, but expect to trigger crash fadump.
Power people need consider this carefully and makes some change. Fadump
just borrows the crashkernel reservation mechanism. If fadump would rather
take risk to run all panic notifiers, whether fadump really needs them
or not, then execute crash_fadump(), that's powerpc's business.

As for Hyper-V, if it enforces to terminate VMbus connection, no matter
it's kdump or not, why not taking it out of panic notifiers list and
execute it before kdump unconditionally. Below is abstracted from
Michael's words.

https://lore.kernel.org/all/mwhpr21mb15933573f5c81c5250bf6a1cd7...@mwhpr21mb1593.namprd21.prod.outlook.com/T/#u
===
I looked at the code again, and should revise my previous comments
somewhat.   The Hyper-V resets that I described indeed must be done
prior to kexec'ing the kdump kernel.   Most such resets are actually
done via __crash_kexec() -> machine_crash_shutdown(), not via the
panic notifier. However, the Hyper-V panic notifier must terminate the
VMbus connection, because that must be done even if kdump is not
being invoked.  See commit 74347a99e73.
===

> 
> The thing is: with the notifiers' split, we tried to keep only the most
> relevant/necessary stuff in this first list, things that ultimately
> should 

Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-24 Thread Petr Mladek
On Fri 2022-05-20 08:23:33, Guilherme G. Piccoli wrote:
> On 19/05/2022 20:45, Baoquan He wrote:
> > [...]
> >> I really appreciate the summary skill you have, to convert complex
> >> problems in very clear and concise ideas. Thanks for that, very useful!
> >> I agree with what was summarized above.
> > 
> > I want to say the similar words to Petr's reviewing comment when I went
> > through the patches and traced each reviewing sub-thread to try to
> > catch up. Petr has reivewed this series so carefully and given many
> > comments I want to ack immediately.
> > 
> > I agree with most of the suggestions from Petr to this patch, except of
> > one tiny concern, please see below inline comment.
> 
> Hi Baoquan, thanks! I'm glad you're also reviewing that =)
> 
> 
> > [...]
> > 
> > I like the proposed skeleton of panic() and code style suggested by
> > Petr very much. About panic_prefer_crash_dump which might need be added,
> > I hope it has a default value true. This makes crash_dump execute at
> > first by default just as before, unless people specify
> > panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add
> > panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump,
> > this is inconsistent with the old behaviour.
> 
> I'd like to understand better why the crash_kexec() must always be the
> first thing in your use case. If we keep that behavior, we'll see all
> sorts of workarounds - see the last patches of this series, Hyper-V and
> PowerPC folks hardcoded "crash_kexec_post_notifiers" in order to force
> execution of their relevant notifiers (like the vmbus disconnect,
> specially in arm64 that has no custom machine_crash_shutdown, or the
> fadump case in ppc). This led to more risk in kdump.
> 
> The thing is: with the notifiers' split, we tried to keep only the most
> relevant/necessary stuff in this first list, things that ultimately
> should improve kdump reliability or if not, at least not break it. My
> feeling is that, with this series, we should change the idea/concept
> that kdump must run first nevertheless, not matter what. We're here
> trying to accommodate the antagonistic goals of hypervisors that need
> some clean-up (even for kdump to work) VS. kdump users, that wish a
> "pristine" system reboot ASAP after the crash.

Good question. I wonder if Baoquan knows about problems caused by the
particular notifiers that will end up in the hypervisor list. Note
that there will be some shuffles and the list will be slightly
different in V2.

Anyway, I see four possible solutions:

  1. The most conservative approach is to keep the current behavior
 and call kdump first by default.

  2. A medium conservative approach to change the default default
 behavior and call hypervisor and eventually the info notifiers
 before kdump. There still would be the possibility to call kdump
 first by the command line parameter.

  3. Remove the possibility to call kdump first completely. It would
 assume that all the notifiers in the info list are super safe
 or that they make kdump actually more safe.

  4. Create one more notifier list for operations that always should
 be called before crash_dump.

Regarding the extra notifier list (4th solution). It is not clear to
me whether it would be always called even before hypervisor list or
when kdump is not enabled. We must not over-engineer it.

2nd proposal looks like a good compromise. But maybe we could do
this change few releases later. The notifiers split is a big
change on its own.

Best Regards,
Petr



Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-20 Thread Guilherme G. Piccoli
On 19/05/2022 20:45, Baoquan He wrote:
> [...]
>> I really appreciate the summary skill you have, to convert complex
>> problems in very clear and concise ideas. Thanks for that, very useful!
>> I agree with what was summarized above.
> 
> I want to say the similar words to Petr's reviewing comment when I went
> through the patches and traced each reviewing sub-thread to try to
> catch up. Petr has reivewed this series so carefully and given many
> comments I want to ack immediately.
> 
> I agree with most of the suggestions from Petr to this patch, except of
> one tiny concern, please see below inline comment.

Hi Baoquan, thanks! I'm glad you're also reviewing that =)


> [...]
> 
> I like the proposed skeleton of panic() and code style suggested by
> Petr very much. About panic_prefer_crash_dump which might need be added,
> I hope it has a default value true. This makes crash_dump execute at
> first by default just as before, unless people specify
> panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add
> panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump,
> this is inconsistent with the old behaviour.

I'd like to understand better why the crash_kexec() must always be the
first thing in your use case. If we keep that behavior, we'll see all
sorts of workarounds - see the last patches of this series, Hyper-V and
PowerPC folks hardcoded "crash_kexec_post_notifiers" in order to force
execution of their relevant notifiers (like the vmbus disconnect,
specially in arm64 that has no custom machine_crash_shutdown, or the
fadump case in ppc). This led to more risk in kdump.

The thing is: with the notifiers' split, we tried to keep only the most
relevant/necessary stuff in this first list, things that ultimately
should improve kdump reliability or if not, at least not break it. My
feeling is that, with this series, we should change the idea/concept
that kdump must run first nevertheless, not matter what. We're here
trying to accommodate the antagonistic goals of hypervisors that need
some clean-up (even for kdump to work) VS. kdump users, that wish a
"pristine" system reboot ASAP after the crash.

Cheers,


Guilherme



Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-19 Thread Baoquan He
On 05/15/22 at 07:47pm, Guilherme G. Piccoli wrote:
> On 12/05/2022 11:03, Petr Mladek wrote:
.. 
> > OK, the question is how to make it better. Let's start with
> > a clear picture of the problem:
> > 
> > 1. panic() has basically two funtions:
> > 
> >   + show/store debug information (optional ways and amount)
> >   + do something with the system (reboot, stay hanged)
> > 
> > 
> > 2. There are 4 ways how to show/store the information:
> > 
> >   + tell hypervisor to store what it is interested about
> >   + crash_dump
> >   + kmsg_dump()
> >   + consoles
> > 
> >   , where crash_dump and consoles are special:
> > 
> >  + crash_dump does not return. Instead it ends up with reboot.
> > 
> >  + Consoles work transparently. They just need an extra flush
> >before reboot or staying hanged.
> > 
> > 
> > 3. The various notifiers do things like:
> > 
> >  + tell hypervisor about the crash
> >  + print more information (also stop watchdogs)
> >  + prepare system for reboot (touch some interfaces)
> >  + prepare system for staying hanged (blinking)
> > 
> >Note that it pretty nicely matches the 4 notifier lists.
> > 
> 
> I really appreciate the summary skill you have, to convert complex
> problems in very clear and concise ideas. Thanks for that, very useful!
> I agree with what was summarized above.

I want to say the similar words to Petr's reviewing comment when I went
through the patches and traced each reviewing sub-thread to try to
catch up. Petr has reivewed this series so carefully and given many
comments I want to ack immediately.

I agree with most of the suggestions from Petr to this patch, except of
one tiny concern, please see below inline comment.

> 
> 
> > Now, we need to decide about the ordering. The main area is how
> > to store the debug information. Consoles are transparent so
> > the quesition is about:
> > 
> >  + hypervisor
> >  + crash_dump
> >  + kmsg_dump
> > 
> > Some people need none and some people want all. There is a
> > risk that system might hung at any stage. This why people want to
> > make the order configurable.
> > 
> > But crash_dump() does not return when it succeeds. And kmsg_dump()
> > users havn't complained about hypervisor problems yet. So, that
> > two variants might be enough:
> > 
> > + crash_dump (hypervisor, kmsg_dump as fallback)
> > + hypervisor, kmsg_dump, crash_dump
> > 
> > One option "panic_prefer_crash_dump" should be enough.
> > And the code might look like:
> > 
> > void panic()
> > {
> > [...]
> > dump_stack();
> > kgdb_panic(buf);
> > 
> > < ---  here starts the reworked code --- >
> > 
> > /* crash dump is enough when enabled and preferred. */
> > if (panic_prefer_crash_dump)
> > __crash_kexec(NULL);

I like the proposed skeleton of panic() and code style suggested by
Petr very much. About panic_prefer_crash_dump which might need be added,
I hope it has a default value true. This makes crash_dump execute at
first by default just as before, unless people specify
panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add
panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump,
this is inconsistent with the old behaviour.

> > 
> > /* Stop other CPUs and focus on handling the panic state. */
> > if (has_kexec_crash_image)
> > crash_smp_send_stop();
> > else
> > smp_send_stop()
> > 
> 
> Here we have a very important point. Why do we need 2 variants of SMP
> CPU stopping functions? I disagree with that - my understanding of this
> after some study in architectures is that the crash_() variant is
> "stronger", should work in all cases and if not, we should fix that -
> that'd be a bug.
> 
> Such variant either maps to smp_send_stop() (in various architectures,
> including XEN/x86) or overrides the basic function with more proper
> handling for panic() case...I don't see why we still need such
> distinction, if you / others have some insight about that, I'd like to
> hear =)
> 
> 
> > /* Notify hypervisor about the system panic. */
> > atomic_notifier_call_chain(_hypervisor_list, 0, NULL);
> > 
> > /*
> >  * No need to risk extra info when there is no kmsg dumper
> >  * registered.
> >  */
> > if (!has_kmsg_dumper())
> > __crash_kexec(NULL);
> > 
> > /* Add extra info from different subsystems. */
> > atomic_notifier_call_chain(_info_list, 0, NULL);
> > 
> > kmsg_dump(KMSG_DUMP_PANIC);
> > __crash_kexec(NULL);
> > 
> > /* Flush console */
> > unblank_screen();
> > console_unblank();
> > debug_locks_off();
> > console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> > 
> > if (panic_timeout > 0) {
> > delay()
> > }
> > 
> > /*
> >  * Prepare system for eventual reboot and allow custom
> >  * reboot handling.
> >  */
> > atomic_notifier_call_chain(_reboot_list, 0, NULL);
> 
> 

Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-16 Thread Guilherme G. Piccoli
On 16/05/2022 07:21, Petr Mladek wrote:
> [...]
> Ah, it should have been:
> 
>  + notifiers vs. kmsg_dump
>  + notifiers vs. crash_dump
>  + crash_dump vs. kmsg_dump
> 
> I am sorry for the confusion. Even "crash_dump" is slightly
> misleading because there is no function with this name.
> But it seems to be easier to understand than __crash_kexec().

Cool, thanks! Now it's totally clear for me =)
I feel crash dump is the proper term, but I personally prefer kdump to
avoid mess-up with user space "core dump" concept heheh
Also, KDUMP is an entry on MAINTAINERS file.


> [...]
>> Heheh OK, I appreciate your opinion, but I guess we'll need to agree in
>> disagree here - I'm much more fond to this kind of code than a bunch of
>> if/else blocks that almost give headaches. Encoding such "level" logic
>> in the if/else scheme is very convoluted, generates a very big code. And
>> the functions aren't so black magic - they map a level in bits, and the
>> functions _once() are called...once! Although we switch the position in
>> the code, so there are 2 calls, one of them is called and the other not.
> 
> I see. Well, I would consider this as a warning that the approach is
> too complex. If the code, using if/then/else, would cause headaches
> then also understanding of the behavior would cause headaches for
> both users and programmers.
> 
> 
>> But that's totally fine to change - especially if we're moving away from
>> the "level" logic. I see below you propose a much simpler approach - if
>> we follow that, definitely we won't need the "black magic" approach heheh
> 
> I do not say that my proposal is fully correct. But we really need
> this kind of simpler approach.

It's cool, I agree that your idea is much simpler and makes sense - mine
seems to be an over-engineering effort. Let's see the opinions of the
interested parties, I'm curious to see if everybody agrees here, that'd
would be ideal (and kind of "wishful thinking" I guess heheh - panic
path is polemic).


> [...] 
>> Here we have a very important point. Why do we need 2 variants of SMP
>> CPU stopping functions? I disagree with that - my understanding of this
>> after some study in architectures is that the crash_() variant is
>> "stronger", should work in all cases and if not, we should fix that -
>> that'd be a bug.
>>
>> Such variant either maps to smp_send_stop() (in various architectures,
>> including XEN/x86) or overrides the basic function with more proper
>> handling for panic() case...I don't see why we still need such
>> distinction, if you / others have some insight about that, I'd like to
>> hear =)
> 
> The two variants were introduced by the commit 0ee59413c967c35a6dd
> ("x86/panic: replace smp_send_stop() with kdump friendly version in
> panic path")
> 
> It points to https://lkml.org/lkml/2015/6/24/44 that talks about
> still running watchdogs.
> 
> It is possible that the problem could be fixed another way. It is
> even possible that it has already been fixed by the notifiers
> that disable the watchdogs.
> 
> Anyway, any change of the smp_send_stop() behavior should be done
> in a separate patch. It will help with bisection of possible
> regression. Also it would require a good explanation in
> the commit message. I would personally do it in a separate
> patch(set).

Thanks for the archeology and interesting findings. I agree that is
better to split in smaller patches. I'm planning a split in 3 patches
for V2: clean-up (comment, console flushing idea, useless header), the
refactor itself and finally, this SMP change.


> [...] 
>> You had the order of panic_reboot_list VS. consoles flushing inverted.
>> It might make sense, although I didn't do that in V1...
> 
> IMHO, it makes sense:
> 
>   1. panic_reboot_list contains notifiers that do the reboot
>  immediately, for example, xen_panic_event, alpha_panic_event.
>  The consoles have to be flushed earlier.
> 
>   2. console_flush_on_panic() ignores the result of console_trylock()
>  and always calls console_unlock(). As a result the lock should
>  be unlocked at the end. And any further printk() should be able
>  to printk the messages to the console immediately. It means
>  that any messages printed by the reboot notifiers should appear
>  on the console as well.
> [...] 
>> OK, I agree with you! It's indeed simpler and if others agree, I can
>> happily change the logic to what you proposed. Although...currently the
>> "crash_kexec_post_notifiers" allows to call _all_ panic_reboot_list
>> callbacks _before kdump_.
>>
>> We need to mention this change in the commit messages, but I really
>> would like to hear the opinions of heavy users of notifiers (as
>> Michael/Hyper-V) and the kdump interested parties (like Baoquan / Dave
>> Young / Hayatama). If we all agree on such approach, will change that
>> for V2 =)
> 
> Sure, we need to make sure that we call everything that is needed.
> And it should be documented.
> 
> I believe that this is the 

Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-16 Thread Petr Mladek
On Sun 2022-05-15 19:47:39, Guilherme G. Piccoli wrote:
> On 12/05/2022 11:03, Petr Mladek wrote:
> > This talks only about kdump. The reality is much more complicated.
> > The level affect the order of:
> > 
> > + notifiers vs. kdump
> > + notifiers vs. crash_dump
> > + crash_dump vs. kdump
> 
> First of all, I'd like to ask you please to clarify to me *exactly* what
> are the differences between "crash_dump" and "kdump". I'm sorry if
> that's a silly question, I need to be 100% sure I understand the
> concepts the same way you do.

Ah, it should have been:

 + notifiers vs. kmsg_dump
 + notifiers vs. crash_dump
 + crash_dump vs. kmsg_dump

I am sorry for the confusion. Even "crash_dump" is slightly
misleading because there is no function with this name.
But it seems to be easier to understand than __crash_kexec().


> > There might theoretically many variants of the ordering of kdump,
> > crash_dump, and the 4 notifier list. Some variants do not make
> > much sense. You choose 5 variants and tried to select them by
> > a level number.
> > 
> > The question is if we really could easily describe the meaning this
> > way. It is not only about a "level" of notifiers before kdump. It is
> > also about the ordering of crash_dump vs. kdump. IMHO, "level"
> > semantic does not fit there.
> > 
> > Maybe more parameters might be easier to understand the effect.
> > Anyway, we first need to agree on the chosen variants.
> > I am going to discuss it more in the code, see below.
> > 
> > 
> > [...] 
> > Here is the code using the above functions. It helps to discuss
> > the design and logic.
> > 
> > I have to say that the logic is very unclear. Almost all
> > functions are called twice:
> > 
> > The really used code path is defined by order_panic_notifiers_and_kdump()
> > that encodes "level" into "bits". The bits are then flipped in
> > panic_notifier_*_once() calls that either do something or not.
> > kmsg_dump() is called according to the bit flip.
> > 
> > Also I guess that it is good proof that "level" abstraction does
> > not fit here. Normal levels would not need this kind of magic.
> 
> Heheh OK, I appreciate your opinion, but I guess we'll need to agree in
> disagree here - I'm much more fond to this kind of code than a bunch of
> if/else blocks that almost give headaches. Encoding such "level" logic
> in the if/else scheme is very convoluted, generates a very big code. And
> the functions aren't so black magic - they map a level in bits, and the
> functions _once() are called...once! Although we switch the position in
> the code, so there are 2 calls, one of them is called and the other not.

I see. Well, I would consider this as a warning that the approach is
too complex. If the code, using if/then/else, would cause headaches
then also understanding of the behavior would cause headaches for
both users and programmers.


> But that's totally fine to change - especially if we're moving away from
> the "level" logic. I see below you propose a much simpler approach - if
> we follow that, definitely we won't need the "black magic" approach heheh

I do not say that my proposal is fully correct. But we really need
this kind of simpler approach.


> > OK, the question is how to make it better.

> > One option "panic_prefer_crash_dump" should be enough.
> > And the code might look like:
> > 
> > void panic()
> > {
> > [...]
> > dump_stack();
> > kgdb_panic(buf);
> > 
> > < ---  here starts the reworked code --- >
> > 
> > /* crash dump is enough when enabled and preferred. */
> > if (panic_prefer_crash_dump)
> > __crash_kexec(NULL);
> > 
> > /* Stop other CPUs and focus on handling the panic state. */
> > if (has_kexec_crash_image)
> > crash_smp_send_stop();
> > else
> > smp_send_stop()
> > 
> 
> Here we have a very important point. Why do we need 2 variants of SMP
> CPU stopping functions? I disagree with that - my understanding of this
> after some study in architectures is that the crash_() variant is
> "stronger", should work in all cases and if not, we should fix that -
> that'd be a bug.
> 
> Such variant either maps to smp_send_stop() (in various architectures,
> including XEN/x86) or overrides the basic function with more proper
> handling for panic() case...I don't see why we still need such
> distinction, if you / others have some insight about that, I'd like to
> hear =)

The two variants were introduced by the commit 0ee59413c967c35a6dd
("x86/panic: replace smp_send_stop() with kdump friendly version in
panic path")

It points to https://lkml.org/lkml/2015/6/24/44 that talks about
still running watchdogs.

It is possible that the problem could be fixed another way. It is
even possible that it has already been fixed by the notifiers
that disable the watchdogs.

Anyway, any change of the smp_send_stop() behavior should be done
in a separate patch. It will help with bisection of possible
regression. Also it would 

Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-15 Thread Guilherme G. Piccoli
On 12/05/2022 11:03, Petr Mladek wrote:
> Hello,
> 
> first, I am sorry for stepping into the discussion so late.
> I was busy with some other stuff and this patchset is far
> from trivial.
> 
> Second, thanks a lot for putting so much effort into it.
> Most of the changes look pretty good, especially all
> the fixes of particular notifiers and split into
> four lists.
> 
> Though this patch will need some more love. See below
> for more details.

Thanks a lot for your review Petr, it is much appreciated! No need for
apologies, there is no urgency here =)


> [...] 
> This talks only about kdump. The reality is much more complicated.
> The level affect the order of:
> 
> + notifiers vs. kdump
> + notifiers vs. crash_dump
> + crash_dump vs. kdump

First of all, I'd like to ask you please to clarify to me *exactly* what
are the differences between "crash_dump" and "kdump". I'm sorry if
that's a silly question, I need to be 100% sure I understand the
concepts the same way you do.


> There might theoretically many variants of the ordering of kdump,
> crash_dump, and the 4 notifier list. Some variants do not make
> much sense. You choose 5 variants and tried to select them by
> a level number.
> 
> The question is if we really could easily describe the meaning this
> way. It is not only about a "level" of notifiers before kdump. It is
> also about the ordering of crash_dump vs. kdump. IMHO, "level"
> semantic does not fit there.
> 
> Maybe more parameters might be easier to understand the effect.
> Anyway, we first need to agree on the chosen variants.
> I am going to discuss it more in the code, see below.
> 
> 
> [...] 
> Here is the code using the above functions. It helps to discuss
> the design and logic.
> 
> 
>   order_panic_notifiers_and_kdump();
> 
>   /* If no level, we should kdump ASAP. */
>   if (!panic_notifiers_level)
>   __crash_kexec(NULL);
> 
>   crash_smp_send_stop();
>   panic_notifier_hypervisor_once(buf);
> 
>   if (panic_notifier_info_once(buf))
>   kmsg_dump(KMSG_DUMP_PANIC);
> 
>   panic_notifier_pre_reboot_once(buf);
> 
>   __crash_kexec(NULL);
> 
>   panic_notifier_hypervisor_once(buf);
> 
>   if (panic_notifier_info_once(buf))
>   kmsg_dump(KMSG_DUMP_PANIC);
> 
>   panic_notifier_pre_reboot_once(buf);
> 
> 
> I have to say that the logic is very unclear. Almost all
> functions are called twice:
> 
>+ __crash_kexec()
>+ kmsg_dump()
>+ panic_notifier_hypervisor_once()
>+ panic_notifier_pre_reboot_once()
>+ panic_notifier_info_once()
> 
> It is pretty hard to find what functions are always called in the same
> order and where the order can be inverted.
> 
> The really used code path is defined by order_panic_notifiers_and_kdump()
> that encodes "level" into "bits". The bits are then flipped in
> panic_notifier_*_once() calls that either do something or not.
> kmsg_dump() is called according to the bit flip.
> 
> It is an interesting approach. I guess that you wanted to avoid too
> many if/then/else levels in panic(). But honestly, it looks like
> a black magic to me.
> 
> IMHO, it is always easier to follow if/then/else logic than using
> a translation table that requires additional bit flips when
> a value is used more times.
> 
> Also I guess that it is good proof that "level" abstraction does
> not fit here. Normal levels would not need this kind of magic.

Heheh OK, I appreciate your opinion, but I guess we'll need to agree in
disagree here - I'm much more fond to this kind of code than a bunch of
if/else blocks that almost give headaches. Encoding such "level" logic
in the if/else scheme is very convoluted, generates a very big code. And
the functions aren't so black magic - they map a level in bits, and the
functions _once() are called...once! Although we switch the position in
the code, so there are 2 calls, one of them is called and the other not.

But that's totally fine to change - especially if we're moving away from
the "level" logic. I see below you propose a much simpler approach - if
we follow that, definitely we won't need the "black magic" approach heheh


> 
> OK, the question is how to make it better. Let's start with
> a clear picture of the problem:
> 
> 1. panic() has basically two funtions:
> 
>   + show/store debug information (optional ways and amount)
>   + do something with the system (reboot, stay hanged)
> 
> 
> 2. There are 4 ways how to show/store the information:
> 
>   + tell hypervisor to store what it is interested about
>   + crash_dump
>   + kmsg_dump()
>   + consoles
> 
>   , where crash_dump and consoles are special:
> 
>  + crash_dump does not return. Instead it ends up with reboot.
> 
>  + Consoles work transparently. They just need an extra flush
>before reboot or staying hanged.
> 
> 
> 3. The various notifiers do things like:
> 
>  + tell hypervisor about the crash
>  + print more 

Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-12 Thread Petr Mladek
Hello,

first, I am sorry for stepping into the discussion so late.
I was busy with some other stuff and this patchset is far
from trivial.

Second, thanks a lot for putting so much effort into it.
Most of the changes look pretty good, especially all
the fixes of particular notifiers and split into
four lists.

Though this patch will need some more love. See below
for more details.


On Wed 2022-04-27 19:49:18, Guilherme G. Piccoli wrote:
> The panic() function is somewhat convoluted - a lot of changes were
> made over the years, adding comments that might be misleading/outdated
> now, it has a code structure that is a bit complex to follow, with
> lots of conditionals, for example. The panic notifier list is something
> else - a single list, with multiple callbacks of different purposes,
> that run in a non-deterministic order and may affect hardly kdump
> reliability - see the "crash_kexec_post_notifiers" workaround-ish flag.
> 
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3784,6 +3791,33 @@
>   timeout < 0: reboot immediately
>   Format: 
>  
> + panic_notifiers_level=
> + [KNL] Set the panic notifiers execution order.
> + Format: 
> + We currently have 4 lists of panic notifiers; based
> + on the functionality and risk (for panic success) the
> + callbacks are added in a given list. The lists are:
> + - hypervisor/FW notification list (low risk);
> + - informational list (low/medium risk);
> + - pre_reboot list (higher risk);
> + - post_reboot list (only run late in panic and after
> + kdump, not configurable for now).
> + This parameter defines the ordering of the first 3
> + lists with regards to kdump; the levels determine
> + which set of notifiers execute before kdump. The
> + accepted levels are:

This talks only about kdump. The reality is much more complicated.
The level affect the order of:

+ notifiers vs. kdump
+ notifiers vs. crash_dump
+ crash_dump vs. kdump

There might theoretically many variants of the ordering of kdump,
crash_dump, and the 4 notifier list. Some variants do not make
much sense. You choose 5 variants and tried to select them by
a level number.

The question is if we really could easily describe the meaning this
way. It is not only about a "level" of notifiers before kdump. It is
also about the ordering of crash_dump vs. kdump. IMHO, "level"
semantic does not fit there.

Maybe more parameters might be easier to understand the effect.
Anyway, we first need to agree on the chosen variants.
I am going to discuss it more in the code, see below.



> + 0: kdump is the first thing to run, NO list is
> + executed before kdump.
> + 1: only the hypervisor list is executed before kdump.
> + 2 (default level): the hypervisor list and (*if*
> + there's any kmsg_dumper defined) the informational
> + list are executed before kdump.
> + 3: both the hypervisor and the informational lists
> + (always) execute before kdump.
> + 4: the 3 lists (hypervisor, info and pre_reboot)
> + execute before kdump - this behavior is analog to the
> + deprecated parameter "crash_kexec_post_notifiers".
> +
>   panic_print=Bitmask for printing system info when panic happens.
>   User can chose combination of the following bits:
>   bit 0: print all tasks info
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -183,6 +195,112 @@ static void panic_print_sys_info(bool console_flush)
>   ftrace_dump(DUMP_ALL);
>  }
>  
> +/*
> + * Helper that accumulates all console flushing routines executed on panic.
> + */
> +static void console_flushing(void)
> +{
> +#ifdef CONFIG_VT
> + unblank_screen();
> +#endif
> + console_unblank();
> +
> + /*
> +  * In this point, we may have disabled other CPUs, hence stopping the
> +  * CPU holding the lock while still having some valuable data in the
> +  * console buffer.
> +  *
> +  * Try to acquire the lock then release it regardless of the result.
> +  * The release will also print the buffers out. Locks debug should
> +  * be disabled to avoid reporting bad unlock balance when panic()
> +  * is not being called from OOPS.
> +  */
> + debug_locks_off();
> + console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> +
> + panic_print_sys_info(true);
> +}
> +
> +#define PN_HYPERVISOR_BIT0
> +#define PN_INFO_BIT  1
> +#define 

Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-09 Thread Guilherme G. Piccoli
Hey Hatayma, thanks for your great analysis and no need for apologies!

I'll comment/respond properly inline below, just noticing here that I've
CCed Mark and Marc (from the ARM64 perspective), Michael (Hyper-V
perspective) and Hari (PowerPC perspective), besides the usual suspects
as Petr, Baoquan, etc.

On 09/05/2022 12:16, d.hatay...@fujitsu.com wrote:
> Sorry for the delayed response. Unfortunately, I had 10 days holidays
> until yesterday...
> [...] 
>> +   We currently have 4 lists of panic notifiers; based
>> +   on the functionality and risk (for panic success) the
>> +   callbacks are added in a given list. The lists are:
>> +   - hypervisor/FW notification list (low risk);
>> +   - informational list (low/medium risk);
>> +   - pre_reboot list (higher risk);
>> +   - post_reboot list (only run late in panic and after
>> +   kdump, not configurable for now).
>> +   This parameter defines the ordering of the first 3
>> +   lists with regards to kdump; the levels determine
>> +   which set of notifiers execute before kdump. The
>> +   accepted levels are:
>> +   0: kdump is the first thing to run, NO list is
>> +   executed before kdump.
>> +   1: only the hypervisor list is executed before kdump.
>> +   2 (default level): the hypervisor list and (*if*
> 
> Hmmm, why are you trying to change default setting?
> 
> Based on the current design of kdump, it's natural to put what the
> handlers for these level 1 and level 2 handlers do in
> machine_crash_shutdown(), as these are necessary by default, right?
> 
> Or have you already tried that and figured out it's difficult in some
> reason and reached the current design? If so, why is that difficult?
> Could you point to if there is already such discussion online?
> 
> kdump is designed to perform as little things as possible before
> transferring the execution to the 2nd kernel in order to increase
> reliability. Just detour to panic() increases risks of kdump failure
> in the sense of increasing the executed codes in the abnormal
> situation, which is very the note in the explanation of
> crash_kexec_post_notifiers.
> 
> Also, the current implementation of crash_kexec_post_notifiers uses
> the panic notifier, but this is not from the technical
> reason. Ideally, it should have been implemented in the context of
> crash_kexec() independently of panic().
> 
> That is, it looks to me that, in addition to changing design of panic
> notifier, you are trying to integrate shutdown code of the crash kexec
> and the panic paths. If so, this is a big design change for kdump.
> I'm concerned about increase of reliability. I'd like you to discuss
> them carefully.

>From my understanding (specially based on both these threads [0] and
[1]), 3 facts are clear and quite complex in nature:

(a) Currently, the panic notifier list is a "no man's land" - it's a
mess, all sort of callbacks are added there, some of them are extremely
risk for breaking kdump, others are quite safe (like setting a
variable). Petr's details in thread [0] are really clear and express in
great way how confusing and conflicting the panic notifiers goals are.

(b) In order to "address" problems in the antagonistic goals of
notifiers (see point (a) above and thread [0]), we have this
quirk/workaround called "crash_kexec_post_notifiers". This is useful,
but (almost as for attesting how this is working as band-aid over
complex and fundamental issues) see the following commits:

a11589563e96 ("x86/Hyper-V: Report crash register data or kmsg before
running crash kernel")

06e629c25daa ("powerpc/fadump: Fix inaccurate CPU state info in vmcore
generated with panic")

They hardcode such workaround, because they *need* some notifiers'
callbacks. But notice they *don't need all of them*, only some important
ones (that usually are good considering the risk, it's a good
cost/benefit). Since we currently have an all-or-nothing behavior for
the panic notifiers, both PowerPC and Hyper-V end-up bringing all of
them to run before kdump due to the lack of flexibility, increasing a
lot the risk of failure for kdump.

(c) To add on top of all such complexity, we don't have a custom
machine_crash_shutdown() handler for some architectures like ARM64, and
the feeling is that's not right to add a bunch of callbacks / complexity
in such architecture code, specially since we have the notifiers
infrastructure in the kernel. I've recently started a discussion about
that with ARM64 community, please take a look in [1].

With that said, we can use (a) + (b) + (c) to justify our argument here:
the panic notifiers should be refactored! We need to try to encompass
the antagonistic goals of kdump (wants to be the 

Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-09 Thread d.hatay...@fujitsu.com
Sorry for the delayed response. Unfortunately, I had 10 days holidays
until yesterday...

>  .../admin-guide/kernel-parameters.txt |  42 ++-
>  include/linux/panic_notifier.h|   1 +
>  kernel/kexec_core.c   |   8 +-
>  kernel/panic.c| 292 +-
>  .../selftests/pstore/pstore_crash_test|   5 +-
>  5 files changed, 252 insertions(+), 96 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 3f1cc5e317ed..8d3524060ce3 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
...snip...
> @@ -3784,6 +3791,33 @@
> timeout < 0: reboot immediately
> Format: 
> 
> +   panic_notifiers_level=
> +   [KNL] Set the panic notifiers execution order.
> +   Format: 
> +   We currently have 4 lists of panic notifiers; based
> +   on the functionality and risk (for panic success) the
> +   callbacks are added in a given list. The lists are:
> +   - hypervisor/FW notification list (low risk);
> +   - informational list (low/medium risk);
> +   - pre_reboot list (higher risk);
> +   - post_reboot list (only run late in panic and after
> +   kdump, not configurable for now).
> +   This parameter defines the ordering of the first 3
> +   lists with regards to kdump; the levels determine
> +   which set of notifiers execute before kdump. The
> +   accepted levels are:
> +   0: kdump is the first thing to run, NO list is
> +   executed before kdump.
> +   1: only the hypervisor list is executed before kdump.
> +   2 (default level): the hypervisor list and (*if*

Hmmm, why are you trying to change default setting?

Based on the current design of kdump, it's natural to put what the
handlers for these level 1 and level 2 handlers do in
machine_crash_shutdown(), as these are necessary by default, right?

Or have you already tried that and figured out it's difficult in some
reason and reached the current design? If so, why is that difficult?
Could you point to if there is already such discussion online?

kdump is designed to perform as little things as possible before
transferring the execution to the 2nd kernel in order to increase
reliability. Just detour to panic() increases risks of kdump failure
in the sense of increasing the executed codes in the abnormal
situation, which is very the note in the explanation of
crash_kexec_post_notifiers.

Also, the current implementation of crash_kexec_post_notifiers uses
the panic notifier, but this is not from the technical
reason. Ideally, it should have been implemented in the context of
crash_kexec() independently of panic().

That is, it looks to me that, in addition to changing design of panic
notifier, you are trying to integrate shutdown code of the crash kexec
and the panic paths. If so, this is a big design change for kdump.
I'm concerned about increase of reliability. I'd like you to discuss
them carefully.

Thanks.
HATAYAMA, Daisuke




Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-09 Thread Guilherme G. Piccoli
On 29/04/2022 13:04, Guilherme G. Piccoli wrote:
> On 27/04/2022 21:28, Randy Dunlap wrote:
>>
>>
>> On 4/27/22 15:49, Guilherme G. Piccoli wrote:
>>> +   crash_kexec_post_notifiers
>>> +   This was DEPRECATED - users should always prefer the
>>
>>  This is DEPRECATED - users should always prefer the
>>
>>> +   parameter "panic_notifiers_level" - check its entry
>>> +   in this documentation for details on how it works.
>>> +   Setting this parameter is exactly the same as setting
>>> +   "panic_notifiers_level=4".
>>
> 
> Thanks Randy, for your suggestion - but I confess I couldn't understand
> it properly. It's related to spaces/tabs, right? What you suggest me to
> change in this formatting? Just by looking the email I can't parse.
> 
> Cheers,
> 
> 
> Guilherme

Complete lack of attention from me, apologies!
The suggestions was s/was/is - already fixed for V2, thanks Randy.



Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-03 Thread Guilherme G. Piccoli
On 03/05/2022 14:31, Michael Kelley (LINUX) wrote:
> [...]
> 
> To me, it's a weak correlation between having a kmsg dumper, and
> wanting or not wanting the info level output to come before kdump.
> Hyper-V is one of only a few places that register a kmsg dumper, so most
> Linux instances outside of Hyper-V guest (and PowerPC systems?) will have
> the info level output after kdump.  It seems like anyone who cared strongly
> about the info level output would set the panic_notifier_level to 1 or to 3
> so that the result is more deterministic.  But that's just my opinion, and
> it's probably an opinion that is not as well informed on the topic as some
> others in the discussion. So keeping things as in your patch set is not a
> show-stopper for me.
> 
> However, I would request a clarification in the documentation.   The
> panic_notifier_level affects not only the hypervisor, informational,
> and pre_reboot lists, but it also affects panic_print_sys_info() and
> kmsg_dump().  Specifically, at level 1, panic_print_sys_info() and
> kmsg_dump() will not be run before kdump.  At level 3, they will
> always be run before kdump.  Your documentation above mentions
> "informational lists" (plural), which I take to vaguely include
> kmsg_dump() and panic_print_sys_info(), but being explicit about
> the effect would be better.
> 
> Michael

Thanks again Michael, to express your points and concerns - great idea
of documentation improvement here, I'll do that for V2, for sure.

The idea of "defaulting" to skip the info list on kdump (if no
kmsg_dump() is set) is again a mechanism that aims at accommodating all
users and concerns of antagonistic goals, kdump vs notifier lists.

Before this patch set, by default no notifier executed before kdump. So,
the "pendulum"  was strongly on kdump side, and clearly this was a
sub-optimal decision - proof of that is that both Hyper-V / PowerPC code
forcibly set the "crash_kexec_post_notifiers". The goal here is to have
a more lightweight list that by default runs before kdump, a secondary
list that only runs before kdump if there's usage for that (either user
sets that or kmsg_dumper set is considered a valid user), and the
remaining notifiers run by default only after kdump, all of that very
customizable through the levels idea.

Now, one thing we could do to improve consistency for the hyper-v case:
having a kmsg_dump_once() helper, and *for Hyper-V only*, call it on the
hypervisor list, within the info notifier (that would be moved to
hypervisor list, ofc).
Let's wait for more feedback on that, just throwing some ideas in order
we can have everyone happy with the end-result!

Cheers,


Guilherme



RE: [PATCH 24/30] panic: Refactor the panic path

2022-05-03 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Friday, April 29, 2022 
1:38 PM
> 
> On 29/04/2022 14:53, Michael Kelley (LINUX) wrote:
> > From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
> > 2022
> 3:49 PM
> >> [...]
> >> +  panic_notifiers_level=
> >> +  [KNL] Set the panic notifiers execution order.
> >> +  Format: 
> >> +  We currently have 4 lists of panic notifiers; based
> >> +  on the functionality and risk (for panic success) the
> >> +  callbacks are added in a given list. The lists are:
> >> +  - hypervisor/FW notification list (low risk);
> >> +  - informational list (low/medium risk);
> >> +  - pre_reboot list (higher risk);
> >> +  - post_reboot list (only run late in panic and after
> >> +  kdump, not configurable for now).
> >> +  This parameter defines the ordering of the first 3
> >> +  lists with regards to kdump; the levels determine
> >> +  which set of notifiers execute before kdump. The
> >> +  accepted levels are:
> >> +  0: kdump is the first thing to run, NO list is
> >> +  executed before kdump.
> >> +  1: only the hypervisor list is executed before kdump.
> >> +  2 (default level): the hypervisor list and (*if*
> >> +  there's any kmsg_dumper defined) the informational
> >> +  list are executed before kdump.
> >> +  3: both the hypervisor and the informational lists
> >> +  (always) execute before kdump.
> >
> > I'm not clear on why level 2 exists.  What is the scenario where
> > execution of the info list before kdump should be conditional on the
> > existence of a kmsg_dumper?   Maybe the scenario is described
> > somewhere in the patch set and I just missed it.
> >
> 
> Hi Michael, thanks for your review/consideration. So, this idea started
> kind of some time ago. It all started with a need of exposing more
> information on kernel log *before* kdump and *before* pstore -
> specifically, we're talking about panic_print. But this cause some
> reactions, Baoquan was very concerned with that [0]. Soon after, I've
> proposed a panic notifiers filter (orthogonal) approach, to which Petr
> suggested instead doing a major refactor [1] - it finally is alive in
> the form of this series.
> 
> The theory behind the level 2 is to allow a scenario of kdump with the
> minimum amount of notifiers - what is the point in printing more
> information if the user doesn't care, since it's going to kdump? Now, if
> there is a kmsg dumper, it means that there is likely some interest in
> collecting information, and that might as well be required before the
> potential kdump (which is my case, hence the proposal on [0]).
> 
> Instead of forcing one of the two behaviors (level 1 or level 3), we
> have a middle-term/compromise: if there's interest in collecting such
> data (in the form of a kmsg dumper), we then execute the informational
> notifiers before kdump. If not, why to increase (even slightly) the risk
> for kdump?
> 
> I'm OK in removing the level 2 if people prefer, but I don't feel it's a
> burden, quite opposite - seems a good way to accommodate the somewhat
> antagonistic ideas (jump to kdump ASAP vs collecting more info in the
> panicked kernel log).
> 
> [0] https://lore.kernel.org/lkml/20220126052246.GC2086@MiWiFi-R3L-srv/
> 
> [1] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/
> 

To me, it's a weak correlation between having a kmsg dumper, and
wanting or not wanting the info level output to come before kdump.
Hyper-V is one of only a few places that register a kmsg dumper, so most
Linux instances outside of Hyper-V guest (and PowerPC systems?) will have
the info level output after kdump.  It seems like anyone who cared strongly
about the info level output would set the panic_notifier_level to 1 or to 3
so that the result is more deterministic.  But that's just my opinion, and
it's probably an opinion that is not as well informed on the topic as some
others in the discussion. So keeping things as in your patch set is not a
show-stopper for me.

However, I would request a clarification in the documentation.   The
panic_notifier_level affects not only the hypervisor, informational,
and pre_reboot lists, but it also affects panic_print_sys_info() and
kmsg_dump().  Specifically, at level 1, panic_print_sys_info() and
kmsg_dump() will not be run before kdump.  At level 3, they will
always be run before kdump.  Your documentation above mentions
"informational lists" (plural), which I take to vaguely include
kmsg_dump() and panic_print_sys_info(), but being explicit about
the effect would be better.

Michael

> 
> >[...]
> >> +   * Based on the level configured (smaller than 4), we clear the
> >> +   * proper bits in "panic_notifiers_bits". 

Re: [PATCH 24/30] panic: Refactor the panic path

2022-04-29 Thread Guilherme G. Piccoli
On 29/04/2022 14:53, Michael Kelley (LINUX) wrote:
> From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
> 2022 3:49 PM
>> [...]
>> +panic_notifiers_level=
>> +[KNL] Set the panic notifiers execution order.
>> +Format: 
>> +We currently have 4 lists of panic notifiers; based
>> +on the functionality and risk (for panic success) the
>> +callbacks are added in a given list. The lists are:
>> +- hypervisor/FW notification list (low risk);
>> +- informational list (low/medium risk);
>> +- pre_reboot list (higher risk);
>> +- post_reboot list (only run late in panic and after
>> +kdump, not configurable for now).
>> +This parameter defines the ordering of the first 3
>> +lists with regards to kdump; the levels determine
>> +which set of notifiers execute before kdump. The
>> +accepted levels are:
>> +0: kdump is the first thing to run, NO list is
>> +executed before kdump.
>> +1: only the hypervisor list is executed before kdump.
>> +2 (default level): the hypervisor list and (*if*
>> +there's any kmsg_dumper defined) the informational
>> +list are executed before kdump.
>> +3: both the hypervisor and the informational lists
>> +(always) execute before kdump.
> 
> I'm not clear on why level 2 exists.  What is the scenario where
> execution of the info list before kdump should be conditional on the
> existence of a kmsg_dumper?   Maybe the scenario is described
> somewhere in the patch set and I just missed it.
> 

Hi Michael, thanks for your review/consideration. So, this idea started
kind of some time ago. It all started with a need of exposing more
information on kernel log *before* kdump and *before* pstore -
specifically, we're talking about panic_print. But this cause some
reactions, Baoquan was very concerned with that [0]. Soon after, I've
proposed a panic notifiers filter (orthogonal) approach, to which Petr
suggested instead doing a major refactor [1] - it finally is alive in
the form of this series.

The theory behind the level 2 is to allow a scenario of kdump with the
minimum amount of notifiers - what is the point in printing more
information if the user doesn't care, since it's going to kdump? Now, if
there is a kmsg dumper, it means that there is likely some interest in
collecting information, and that might as well be required before the
potential kdump (which is my case, hence the proposal on [0]).

Instead of forcing one of the two behaviors (level 1 or level 3), we
have a middle-term/compromise: if there's interest in collecting such
data (in the form of a kmsg dumper), we then execute the informational
notifiers before kdump. If not, why to increase (even slightly) the risk
for kdump?

I'm OK in removing the level 2 if people prefer, but I don't feel it's a
burden, quite opposite - seems a good way to accommodate the somewhat
antagonistic ideas (jump to kdump ASAP vs collecting more info in the
panicked kernel log).

[0] https://lore.kernel.org/lkml/20220126052246.GC2086@MiWiFi-R3L-srv/

[1] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/


>[...]
>> + * Based on the level configured (smaller than 4), we clear the
>> + * proper bits in "panic_notifiers_bits". Notice that this bitfield
>> + * is initialized with all notifiers set.
>> + */
>> +switch (panic_notifiers_level) {
>> +case 3:
>> +clear_bit(PN_PRE_REBOOT_BIT, _notifiers_bits);
>> +break;
>> +case 2:
>> +clear_bit(PN_PRE_REBOOT_BIT, _notifiers_bits);
>> +
>> +if (!kmsg_has_dumpers())
>> +clear_bit(PN_INFO_BIT, _notifiers_bits);
>> +break;
>> +case 1:
>> +clear_bit(PN_PRE_REBOOT_BIT, _notifiers_bits);
>> +clear_bit(PN_INFO_BIT, _notifiers_bits);
>> +break;
>> +case 0:
>> +clear_bit(PN_PRE_REBOOT_BIT, _notifiers_bits);
>> +clear_bit(PN_INFO_BIT, _notifiers_bits);
>> +clear_bit(PN_HYPERVISOR_BIT, _notifiers_bits);
>> +break;
>> +}
> 
> I think the above switch statement could be done as follows:
> 
> if (panic_notifiers_level <= 3)
>   clear_bit(PN_PRE_REBOOT_BIT, _notifiers_bits);
> if (panic_notifiers_level <= 2)
>   if (!kmsg_has_dumpers())
>   clear_bit(PN_INFO_BIT, _notifiers_bits);
> if (panic_notifiers_level <=1)
>   clear_bit(PN_INFO_BIT, _notifiers_bits);
> if (panic_notifiers_level == 0)
>   clear_bit(PN_HYPERVISOR_BIT, _notifiers_bits);
> 
> That's about half the lines of code.  It's somewhat a matter of style,
> so treat this as just a suggestion to 

RE: [PATCH 24/30] panic: Refactor the panic path

2022-04-29 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
2022 3:49 PM
> 
> The panic() function is somewhat convoluted - a lot of changes were
> made over the years, adding comments that might be misleading/outdated
> now, it has a code structure that is a bit complex to follow, with
> lots of conditionals, for example. The panic notifier list is something
> else - a single list, with multiple callbacks of different purposes,
> that run in a non-deterministic order and may affect hardly kdump
> reliability - see the "crash_kexec_post_notifiers" workaround-ish flag.
> 
> This patch proposes a major refactor on the panic path based on Petr's
> idea [0] - basically we split the notifiers list in three, having a set
> of different call points in the panic path. Below a list of changes
> proposed in this patch, culminating in the panic notifiers level
> concept:
> 
> (a) First of all, we improved comments all over the function
> and removed useless variables / includes. Also, as part of this
> clean-up we concentrate the console flushing functions in a helper.
> 
> (b) As mentioned before, there is a split of the panic notifier list
> in three, based on the purpose of the callback. The code contains
> good documentation in form of comments, but a summary of the three
> lists follows:
> 
> - the hypervisor list aims low-risk procedures to inform hypervisors
> or firmware about the panic event, also includes LED-related functions;
> 
> - the informational list contains callbacks that provide more details,
> like kernel offset or trace dump (if enabled) and also includes the
> callbacks aimed at reducing log pollution or warns, like the RCU and
> hung task disable callbacks;
> 
> - finally, the pre_reboot list is the old notifier list renamed,
> containing the more risky callbacks that didn't fit the previous
> lists. There is also a 4th list (the post_reboot one), but it's not
> related with the original list - it contains late time architecture
> callbacks aimed at stopping the machine, for example.
> 
> The 3 notifiers lists execute in different moments, hypervisor being
> the first, followed by informational and finally the pre_reboot list.
> 
> (c) But then, there is the ordering problem of the notifiers against
> the crash_kernel() call - kdump must be as reliable as possible.
> For that, a simple binary "switch" as "crash_kexec_post_notifiers"
> is not enough, hence we introduce here concept of panic notifier
> levels: there are 5 levels, from 0 (no notifier executes before
> kdump) until 4 (all notifiers run before kdump); the default level
> is 2, in which the hypervisor and (iff we have any kmsg dumper)
> the informational notifiers execute before kdump.
> 
> The detailed documentation of the levels is present in code comments
> and in the kernel-parameters.txt file; as an analogy with the previous
> panic() implementation, the level 0 is exactly the same as the old
> behavior of notifiers, running all after kdump, and the level 4 is
> the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as
> a deprecated one).
> 
> (d) Finally, an important change made here: we now use only the
> function "crash_smp_send_stop()" to shut all the secondary CPUs
> in the panic path. Before, there was a case of using the regular
> "smp_send_stop()", but the better approach is to simplify the
> code and try to use the function which was created exclusively
> for the panic path. Experiments showed that it works fine, and
> code was very simplified with that.
> 
> Functional change is expected from this refactor, since now we
> call some notifiers by default before kdump, but the goal here
> besides code clean-up is to have a better panic path, more
> reliable and deterministic, but also very customizable.
> 
> [0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/
> 
> Suggested-by: Petr Mladek 
> Signed-off-by: Guilherme G. Piccoli 
> ---
> 
> Special thanks to Petr and Baoquan for the suggestion and feedback in a 
> previous
> email thread. There's some important design decisions that worth mentioning 
> and
> discussing:
> 
> * The default panic notifiers level is 2, based on Petr Mladek's suggestion,
> which makes a lot of sense. Of course, this is customizable through the
> parameter, but would be something worthwhile to have a KConfig option to set
> the default level? It would help distros that want the old behavior
> (no notifiers before kdump) as default.
> 
> * The implementation choice was to _avoid_ intricate if conditionals in the
> panic path, which would _definitely_ be present with the panic notifiers 
> levels
> idea; so, instead of lots of if conditionals, the set/clear bits approach with
> functions called in 2 points (but executing only in one of them) is much 
> easier
> to follow an was used here; the ordering helper function and the comments also
> help a lot to avoid confusion (hopefully).
> 
> * Choice was to *always* use crash_smp_send_stop() instead of sometimes making
> use of the 

Re: [PATCH 24/30] panic: Refactor the panic path

2022-04-29 Thread Guilherme G. Piccoli
On 27/04/2022 21:28, Randy Dunlap wrote:
> 
> 
> On 4/27/22 15:49, Guilherme G. Piccoli wrote:
>> +crash_kexec_post_notifiers
>> +This was DEPRECATED - users should always prefer the
> 
>   This is DEPRECATED - users should always prefer the
> 
>> +parameter "panic_notifiers_level" - check its entry
>> +in this documentation for details on how it works.
>> +Setting this parameter is exactly the same as setting
>> +"panic_notifiers_level=4".
> 

Thanks Randy, for your suggestion - but I confess I couldn't understand
it properly. It's related to spaces/tabs, right? What you suggest me to
change in this formatting? Just by looking the email I can't parse.

Cheers,


Guilherme



Re: [PATCH 24/30] panic: Refactor the panic path

2022-04-27 Thread Randy Dunlap



On 4/27/22 15:49, Guilherme G. Piccoli wrote:
> + crash_kexec_post_notifiers
> + This was DEPRECATED - users should always prefer the

This is DEPRECATED - users should always prefer the

> + parameter "panic_notifiers_level" - check its entry
> + in this documentation for details on how it works.
> + Setting this parameter is exactly the same as setting
> + "panic_notifiers_level=4".

-- 
~Randy