RE: [PATCH 21/30] panic: Introduce the panic pre-reboot notifier list

2022-05-17 Thread Luck, Tony
> What I'm planning to do in the altera_edac notifier is:
>
> if (kdump_is_set)
>   return;

Yes. That's what I think should happen.

-Tony
___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


RE: [PATCH 21/30] panic: Introduce the panic pre-reboot notifier list

2022-05-17 Thread Luck, Tony
> Tony / Dinh - can I just *skip* this notifier *if kdump* is set or else
> we run the code as-is? Does that make sense to you?

The "skip" option sounds like it needs some special flag associated with
an entry on the notifier chain. But there are other notifier chains ... so that
sounds messy to me.

Just run all the notifiers in priority order. If any want to take different actions
based on kdump status, change the code. That seems more flexible than
an "all or nothing" approach by skipping.

-Tony


RE: [PATCH 21/30] panic: Introduce the panic pre-reboot notifier list

2022-05-17 Thread Luck, Tony
> So, my reasoning here is: this notifier should fit the info list,
> definitely! But...it's very high risk for kdump. It deep dives into the
> regmap API (there are locks in such code) plus there is an (MM)IO write
> to the device and an ARM firmware call. So, despite the nature of this
> notifier _fits the informational list_, the _code is risky_ so we should
> avoid running it before a kdump.
>
> Now, we indeed have a chicken/egg problem: want to avoid it before
> kdump, BUT in case kdump is not set, kmsg_dump() (and console flushing,
> after your suggestion Petr) will run before it and not save collected
> information from EDAC PoV.

Would it be possible to have some global "kdump is configured + enabled" flag?

Then notifiers could make an informed choice on whether to deep dive to
get all the possible details (when there is no kdump) or just skim the high
level stuff (to maximize chance of getting a successful kdump).

-Tony
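
A minimal sketch of the idea above, in user-space C rather than kernel
code. The flag `kdump_armed`, the enum, and the function name are all
hypothetical names chosen for illustration; nothing here is taken from the
patch series under discussion.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a global "kdump is configured + enabled" flag. */
static bool kdump_armed;

enum detail_level { DETAIL_SKIM, DETAIL_DEEP };

/* Hypothetical notifier body: decide how much work to do at panic time. */
static enum detail_level edac_panic_detail(void)
{
	/* kdump will follow: avoid locks, MMIO writes and firmware calls. */
	if (kdump_armed)
		return DETAIL_SKIM;

	/* No kdump configured: collect every detail we can reach. */
	return DETAIL_DEEP;
}
```

Each notifier keeps its own policy this way, so no per-entry "skip" flag
is needed on the notifier chain itself.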


Re: [PATCH v4] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-03-06 Thread Luck, Tony
On Mon, Mar 06, 2017 at 12:16:54PM +0100, Borislav Petkov wrote:
> On Thu, Feb 23, 2017 at 09:36:52PM +0800, Xunlei Pang wrote:
> > We met an issue for kdump: after kdump kernel boots up,
> > and there comes a broadcasted mce in first kernel, the
> > other cpus remaining in first kernel will enter the old
> > mce handler of first kernel, then timeout and panic due
> > to MCE synchronization, finally reset the kdump cpus.
> > 
> > This patch lets cpus stay quiet after nmi_shootdown_cpus(),
> > so after kdump boots, cpus remaining in 1st kernel should
> > not do anything except clearing MCG_STATUS. This is useful
> > for kdump to let vmcore dumping perform as hard as it can.
> 
> Ok, I went and rewrote the text to make it more succinct, to the point
> and correct spelling and formatting.
> 
> Tony, ACK?

Yes. Looks good now.

Acked-by: Tony Luck 

-Tony



Re: [PATCH v3] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-22 Thread Luck, Tony
On Wed, Feb 22, 2017 at 12:11:14PM +0800, Xunlei Pang wrote:
> + /*
> +  * Cases to bail out to avoid rendezvous process timeout:
> +  * 1)If this CPU is offline.
> +  * 2)If crashing_cpu was set, e.g. entering kdump,
> +  *   we need to skip cpus remaining in 1st kernel.
> +  */
> + if (cpu_is_offline(cpu) ||
> + (crashing_cpu != -1 && crashing_cpu != cpu)) {
>   u64 mcgstatus;
>  
>   mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);


I think we should document the remaining race conditions. I don't
think there is any good way to eliminate them, and they are already
pretty small windows.

I think the sequence of events looks like:

 1  Panic occurs
 2  nmi_shootdown_cpus() sets crashing_cpu
 3  send NMI to everyone else
 4  wait up to a second for other CPUs to take NMI
 5  go to kexec code
 6  start new kernel
 7  new kernel establishes #MC handler

If one of the other cpus triggers a machine check while
getting to, or in, the NMI handler ... then that cpu will
skip processing (if RIPV is set).

Between '2' and '5' if crashing_cpu gets a machine check it
will execute in the old kernel handler, and do the right thing.

There's a fuzzy area between '6' and '7' where a machine check
might not end up in the right code.

From '7' onwards the kexec kernel will handle any machine
checks caused by kdump.

-Tony
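
The bail-out condition from the quoted hunk can be modelled as a pure
function, which makes the cases in the sequence above easy to check. This
is a sketch under stated assumptions: `mce_should_bail()` and its
parameters are hypothetical, and the real handler reads RIPV from
MSR_IA32_MCG_STATUS rather than taking it as an argument.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the quoted check. Returns true when this CPU
 * should clear MCG_STATUS and return instead of joining the rendezvous.
 * crashing_cpu is -1 until nmi_shootdown_cpus() has set it (step 2).
 */
static bool mce_should_bail(int cpu, bool offline, int crashing_cpu, bool ripv)
{
	/* A CPU left behind in the first kernel after the shootdown. */
	bool left_behind = crashing_cpu != -1 && crashing_cpu != cpu;

	if (!offline && !left_behind)
		return false;

	/* Without a valid return IP the handler cannot simply return. */
	return ripv;
}
```

The crashing CPU itself never bails, which matches the note that between
'2' and '5' it still runs the old kernel's handler.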



RE: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-21 Thread Luck, Tony
> It's from my understanding, I didn't get the explicit description from the 
> intel SDM on this point.
> If a broadcast SRAO comes on real hardware, will MSR_IA32_MCG_STATUS of each 
> cpu have MCG_STATUS_RIPV bit set?

MCG_STATUS is a per-thread MSR and will contain the status appropriate
for that thread when #MC is delivered.
So the RIPV bit will be set if, and only if, the thread saved a valid
return address for this exception. The net result is that it is almost
always set for "innocent bystander" CPUs that were dragged into the
exception handler because of a broadcast #MC. We make the test because
if it isn't set, then the do_machine_check() had better not return
because we have no idea where it will return to - since there is not a
valid return IP.

-Tony
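
As a small illustration of the test described above, here is a sketch of
the low MCG_STATUS bits and the RIPV check. The bit positions (RIPV bit 0,
EIPV bit 1, MCIP bit 2) follow the architectural layout; the helper name
`mce_can_return()` is made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCG_STATUS_RIPV (1ULL << 0) /* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1) /* error IP valid */
#define MCG_STATUS_MCIP (1ULL << 2) /* machine check in progress */

/* Illustrative helper: may the handler return to the interrupted code? */
static bool mce_can_return(uint64_t mcgstatus)
{
	return (mcgstatus & MCG_STATUS_RIPV) != 0;
}
```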



Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Luck, Tony
On Mon, Jan 23, 2017 at 06:51:30PM +0100, Borislav Petkov wrote:
> Hey Tony,
> 
> a "welcome back" is in order? :-)

Yes - first day back today. Lots of catching up to do.

> And apparently crash knows about poisoned pages and handles them:
> 
> static int __init crash_save_vmcoreinfo_init(void)
> {
>   ...
> #ifdef CONFIG_MEMORY_FAILURE
> VMCOREINFO_NUMBER(PG_hwpoison);
> #endif
> 
> so if that works, the kexeced kernel should know about that list.

Oh good ... it is smarter than I thought.

> Doesn't matter, right? The new copy is as clueless as the old one about
> those MCEs.

If things are well enough initialized that we don't reset, and
get to do_machine_check(), then this code from Ashok:

	/* If this CPU is offline, just bail out. */
	if (cpu_is_offline(smp_processor_id())) {
		u64 mcgstatus;

		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
		if (mcgstatus & MCG_STATUS_RIPV) {
			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
			return;
		}
	}

will ignore the machine check on the other cpus ... assuming
that "cpu_is_offline(smp_processor_id())" does the right thing
in the kexec case where this is an "old" cpu that isn't online
in the new kernel.

-Tony



Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Luck, Tony
On Mon, Jan 23, 2017 at 03:50:56PM +0100, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
> > One possible timing sequence would be:
> > 1st kernel running on multiple cpus panicked
> > then the crash dump code starts
> > the crash dump code stops the others cpus except the crashing one
> > 2nd kernel boots up on the crash cpu with "nr_cpus=1"
> > some broadcasted mce comes on some cpu amongst the other cpus(not the 
> > crashing cpu)
> 
> Where does this broadcasted MCE come from?
> 
> The crash dump code triggered it? Or it happened before the panic()?
> 
> Are you talking about an *actual* sequence which you're experiencing on
> real hw or is this something hypothetical?

If the system had experienced some memory corruption, but
recovered ... then there would be some pages sitting around
that the old kernel had marked as POISON and stopped using.
The kexec'd kernel doesn't know about these, so may touch that
memory while taking a crash dump ... and then you have a
broadcast machine check (on older[1] Intel CPUs that don't support
local machine check).

This is hard to work around.  You really need all the CPUs to
have set CR4.MCE=1 (if any didn't, then they will force a reset
when they see the machine check). Also you need to make sure that
they jump to the copy of do_machine_check() in the new kernel, not
the old kernel.

A while ago I played with the nr_cpus=N code to have it bring
all the CPUs far enough online to get the machine check initialization
done, then any extras above "N" just go back offline again.
But I never got this to work reliably.

-Tony

[1] older == all released ones, at the moment.



RE: [PATCH v2] kdump: Fix crash_kexec - smp_send_stop race in panic

2011-11-02 Thread Luck, Tony
> Instead of introducing the panic lock, as an alternative we could move
> smp_send_stop() to the beginning of panic(). Eric told me that the
> function is currently insufficiently reliable for that, but perhaps we
> could make it more reliable.

That's tough to do.  We are in panic because something went horribly
wrong somewhere in the kernel - so we can make few assumptions about
which subsystems are still working. In the worst case (for this example)
our panic was caused by a failure in the code that sends cross-processor
interrupts ... so calling that same code to stop the other cpus is
likely to run into the same problem - perhaps causing a nested panic.

So what looks like a good fix for some panic scenarios actually makes
others worse.

-Tony



RE: [Patch 1/4][kernel][slimdump] Add new elf-note of type NT_NOCOREDUMP to capture slimdump

2011-10-11 Thread Luck, Tony
> Frankly, I don't think that it is undefined - you basically should be
> able to read DRAM albeit with the corrupted data in it. However, you
> probably best disable the whole DRAM error detection first by clearing
> a couple of bits in MC4_CTL_MASK (at least on AMD that should work, I
> dunno how Intel does that).

Intel is the same - disable machine check in CR4, and you can read
corrupted memory (multi-bit ECC error) without getting a machine check
(or any indication that you just got garbage).

Pages that were marked as poisoned can then be handled with appropriate
suspicion by your crash dump analysis tools.

Of course if there are any other memory errors that haven't been seen
yet - the pages won't be marked as poison - so the crash dump tool will
have no idea that it is looking at invalid data.  This could be a problem
if whatever caused the memory problem affected more than a single location.

So if you do disable machine check in order to get a crash dump - you should
be conservative and mark the whole file as possibly garbage.

-Tony


RE: [Patch 1/4][kernel][slimdump] Add new elf-note of type NT_NOCOREDUMP to capture slimdump

2011-10-11 Thread Luck, Tony
> So, in any case we may not be able to disable machine-check exceptions
> (MCEs) only within the context of kexec'ed kernel. Let me know if I've
> missed something here.

Linux sets the CR4.MCE bit - look for set_in_cr4(X86_CR4_MCE) for places
where it does so.  You can ask it not to do that with mce=off argument.

So we can control this from the OS level.

-Tony
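
To make the control point concrete, here is a user-space sketch that
treats CR4 as a plain integer. X86_CR4_MCE is bit 6 of CR4; the function
name and the `mce_off` parameter are hypothetical stand-ins for the
kernel's handling of the mce=off argument.

```c
/* CR4 bit 6: machine-check enable. */
#define X86_CR4_MCE (1UL << 6)

/*
 * Hypothetical model: the CR4 value after machine check init, given
 * whether "mce=off" was passed on the command line.
 */
static unsigned long cr4_after_mce_init(unsigned long cr4, int mce_off)
{
	if (mce_off)
		return cr4;		/* leave CR4.MCE as it was */
	return cr4 | X86_CR4_MCE;	/* enable #MC delivery */
}
```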



RE: [Patch 1/4][kernel][slimdump] Add new elf-note of type NT_NOCOREDUMP to capture slimdump

2011-10-05 Thread Luck, Tony
>> The plan is to pass-down the list of poisoned memory pages to the second
>> kernel using an elf-note so that these pages are left untouched during
>> dump capture. I'm working on an implementation of the same and should
>> have patches soon.
>
> I would say let us first figure out what happens while reading a poisoned
> page and is this a problem before working on a solution.

If the page is poisoned because of a real uncorrectable error in memory
(reported as SRAO machine check today, or by SRAR real-soon-now), then
accessing the page from the processor while taking a memory dump will
result in a machine check.

Note that a large memory system that had been running for a long time
may have built up a small stash of these land-mine pages - and we need
to worry about them even in the case where the panic is not machine
check related (in fact especially in this case ... we are in a case
where we actually do want the dump to diagnose the cause of the panic,
and we don't want to risk losing the crash dump because we aborted when
touching a page that the OS had safely avoided for days/weeks/months).

So passing a list of poisoned pages from the old kernel to the new kernel
is a good idea - and is independent of the cause of the crash (except that
in the fatal machine check case due to memory error the list is guaranteed
to be non-empty).

Passing some crash signature data - so the new kernel/dump-tools can make
a choice whether to even try to take a full dump is also interesting (but
independent from the bad page list).

-Tony
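
A minimal sketch of how a dump tool might use such a list, assuming the
old kernel hands over a sorted array of poisoned page frame numbers. The
function names and the array format are assumptions made for
illustration; the thread does not define the actual interface.

```c
#include <stdbool.h>
#include <stddef.h>

/* Binary search of a sorted PFN list handed over by the crashed kernel. */
static bool pfn_is_poisoned(const unsigned long *poisoned, size_t n,
			    unsigned long pfn)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (poisoned[mid] == pfn)
			return true;
		if (poisoned[mid] < pfn)
			lo = mid + 1;
		else
			hi = mid;
	}
	return false;
}

/* Count the pages in [start, end) that are safe to read for the dump. */
static size_t count_dumpable(const unsigned long *poisoned, size_t n,
			     unsigned long start, unsigned long end)
{
	size_t ok = 0;

	for (unsigned long pfn = start; pfn < end; pfn++)
		if (!pfn_is_poisoned(poisoned, n, pfn))
			ok++;	/* a real tool would copy the page here */
	return ok;
}
```

The lookup never touches the pages themselves, only their frame numbers,
so checking the list cannot trigger a machine check.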



RE: [Patch 1/4][kernel][slimdump] Add new elf-note of type NT_NOCOREDUMP to capture slimdump

2011-10-03 Thread Luck, Tony
> It totally doesn't make sense to do this in the kernel when we can
> filter this from userspace just fine.

Patch 1 is the kernel part that provides the clue for user space
tools to do this filtering.  The other three parts are patches to
tools that see the hint and act on it.

Eric: Do you see a better way for the kernel that just crashed from
a machine check to communicate the reason for the crash to the
successor kernel?  The Elf-note in vmcore needs quite a bit of
code to set up - but is otherwise fairly succinct.  We don't want
the successor kernel to have to poke through too much memory from
the crashed kernel to figure this out - the more we look at, the
higher the probability that we step on the landmine that crashed
the original kernel.

-Tony
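
For a sense of how compact an ELF note is once built, here is a sketch
that lays one out in a buffer using the standard note format: a header of
three 32-bit words, then the name and the descriptor, each padded to 4
bytes. The helper names, the note name "EXAMPLE", and the type value used
in the test are placeholders; the real NT_NOCOREDUMP value comes from the
patch and is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ELF note header: name size, descriptor size, note type. */
struct note_hdr {
	uint32_t namesz;
	uint32_t descsz;
	uint32_t type;
};

/* Round up to the 4-byte alignment that notes use. */
static size_t pad4(size_t n)
{
	return (n + 3) & ~(size_t)3;
}

/* Write one note into buf; returns bytes used, or 0 if it does not fit. */
static size_t put_note(void *buf, size_t buflen, const char *name,
		       uint32_t type, const void *desc, uint32_t descsz)
{
	struct note_hdr h = { (uint32_t)strlen(name) + 1, descsz, type };
	size_t total = sizeof(h) + pad4(h.namesz) + pad4(descsz);
	unsigned char *p = buf;

	if (total > buflen)
		return 0;
	memset(p, 0, total);			/* zero the padding bytes */
	memcpy(p, &h, sizeof(h));
	memcpy(p + sizeof(h), name, h.namesz);	/* includes the NUL */
	memcpy(p + sizeof(h) + pad4(h.namesz), desc, descsz);
	return total;
}
```

A reader in the successor kernel only has to parse the 12-byte header and
the short payload, which keeps the amount of old-kernel memory it touches
small.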



RE: [RFC] Kdump and memory error handling

2011-05-04 Thread Luck, Tony
Your first suggestion of a slim dump makes the most sense. The
purpose of a crash dump is a research resource to find out why
the system crashed - but in the case of a machine check, we already
have the reasons for the crash captured by the machine check handler.

Perhaps you could include __log_buf[] in the slim crash dump? Assuming
that the machine check is not a result of an uncorrectable error
in this memory range.

-Tony



RE: [PATCH 0/2][concept RFC] x86: BIOS-save kernel log to disk upon panic

2011-01-26 Thread Luck, Tony
> How is this more useful than a photograph of the backtrace ?

You can fit a lot more data into the 2-d barcode that will fit on
the screen.  You can also automate the recovery of the data (e.g.
for posting to kerneloops.org).

-Tony



RE: [PATCH 0/2][concept RFC] x86: BIOS-save kernel log to disk upon panic

2011-01-26 Thread Luck, Tony
> - The latest approach (proposed by Linus) is to forget the disk: jump to
>   real-mode, but display the kernel log in a fancy format (with scroll
>   ups and downs) instead.

A while ago (first Plumbers conference?) someone was talking about
using a 2-d barcode to display the tail of the kernel log & oops
register data - with the plan that you could capture the image with
a cell phone camera, and then get all the oops data without worrying
about transcription errors as you wrote down & re-typed all the hex.

Anyone know what happened to that plan?

-Tony





RE: [PATCH][EFI] Run EFI in physical mode

2010-08-13 Thread Luck, Tony
> does this affect ia64 in any way?

I remember Eric complaining that set_virtual_address_map() was a one
way trap door with no way to get back to physical mode ... and thus
this was a big problem to support kexec on ia64. And yet we still call
it, and ia64 can do kexec. So some other work around must have been
found. Can't immediately remember what it was though.

-Tony



RE: kdump broken on Altix 350

2008-09-29 Thread Luck, Tony
Maybe I'm starting to see what happened ... and it could well
be my fault.

I wanted to allocate the per-cpu memory for cpu0 statically
in the vmlinux ... so it would be available in head.S to set
up everything before we move to any C code that might try to
access per cpu variables.  To make life easy for myself I just
made this allocation in vmlinux.lds.S immediately before the
initialized block where all the percpu variables live (which
means no extra labels ... and I could initialize this data
with a simple copy of PERCPU_PAGE_SIZE bytes from (the poorly
named) __phys_per_cpu_start to the unnamed block before it
that will be the cpu0 copy).

But my extra allocation is in the percpu block in vmlinux.lds.S,
so it ends up in that PT_LOAD section.  Which ultimately confuses
the kexec code.

Probably the cpu0 percpu space should be placed in the data section.

-Tony



RE: kdump broken on Altix 350

2008-09-29 Thread Luck, Tony
Does this make kexec/kdump happier?  Bare minimum testing so far
(builds and boots on tiger ... didn't try kexec yet).



[IA64] Put the space for cpu0 per-cpu area into .data section

Initial fix for making sure that we can access percpu variables
in all C code (commit 10617bbe84628eb18ab5f723d3ba35005adde143)
inadvertently allocated the memory in the percpu section of
the vmlinux ELF executable.  This confused kexec.

Signed-off-by: Tony Luck [EMAIL PROTECTED]

diff --git a/arch/ia64/include/asm/sections.h b/arch/ia64/include/asm/sections.h
index f667998..1a873b3 100644
--- a/arch/ia64/include/asm/sections.h
+++ b/arch/ia64/include/asm/sections.h
@@ -11,6 +11,9 @@
 #include <asm-generic/sections.h>
 
 extern char __per_cpu_start[], __per_cpu_end[], __phys_per_cpu_start[];
+#ifdef CONFIG_SMP
+extern char __cpu0_per_cpu[];
+#endif
 extern char __start___vtop_patchlist[], __end___vtop_patchlist[];
 extern char __start___rse_patchlist[], __end___rse_patchlist[];
 extern char __start___mckinley_e9_bundles[], __end___mckinley_e9_bundles[];
diff --git a/arch/ia64/kernel/head.S b/arch/ia64/kernel/head.S
index 8bdea8e..66e491d 100644
--- a/arch/ia64/kernel/head.S
+++ b/arch/ia64/kernel/head.S
@@ -367,16 +367,17 @@ start_ap:
;;
 #else
 (isAP) br.few 2f
-   mov r20=r19
-   sub r19=r19,r18
+   movl r20=__cpu0_per_cpu
;;
shr.u r18=r18,3
 1:
-   ld8 r21=[r20],8;;
-   st8[r19]=r21,8
+   ld8 r21=[r19],8;;
+   st8[r20]=r21,8
adds r18=-1,r18;;
cmp4.lt p7,p6=0,r18
 (p7)   br.cond.dptk.few 1b
+   mov r19=r20
+   ;;
 2:
 #endif
tpa r19=r19
diff --git a/arch/ia64/kernel/vmlinux.lds.S b/arch/ia64/kernel/vmlinux.lds.S
index de71da8..10a7d47 100644
--- a/arch/ia64/kernel/vmlinux.lds.S
+++ b/arch/ia64/kernel/vmlinux.lds.S
@@ -215,9 +215,6 @@ SECTIONS
   /* Per-cpu data: */
   percpu : { } :percpu
   . = ALIGN(PERCPU_PAGE_SIZE);
-#ifdef CONFIG_SMP
-  . = . + PERCPU_PAGE_SIZE;/* cpu0 per-cpu space */
-#endif
   __phys_per_cpu_start = .;
   .data.percpu PERCPU_ADDR : AT(__phys_per_cpu_start - LOAD_OFFSET)
{
@@ -233,6 +230,11 @@ SECTIONS
   data : { } :data
   .data : AT(ADDR(.data) - LOAD_OFFSET)
{
+#ifdef CONFIG_SMP
+  . = ALIGN(PERCPU_PAGE_SIZE);
+   __cpu0_per_cpu = .;
+  . = . + PERCPU_PAGE_SIZE;/* cpu0 per-cpu space */
+#endif
DATA_DATA
*(.data1)
*(.gnu.linkonce.d*)
diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
index e566ff4..0ee085e 100644
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -163,7 +163,7 @@ per_cpu_init (void)
 * get_zeroed_page().
 */
if (first_time) {
-   void *cpu0_data = __phys_per_cpu_start - PERCPU_PAGE_SIZE;
+   void *cpu0_data = __cpu0_per_cpu;
 
first_time=0;
 
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 78026aa..d8c5fcd 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -144,7 +144,7 @@ static void *per_cpu_node_setup(void *cpu_data, int node)
 
for_each_possible_early_cpu(cpu) {
if (cpu == 0) {
-   void *cpu0_data = __phys_per_cpu_start - PERCPU_PAGE_SIZE;
+   void *cpu0_data = __cpu0_per_cpu;
__per_cpu_offset[cpu] = (char*)cpu0_data -
__per_cpu_start;
} else if (node == node_cpuid[cpu].nid) {



RE: kdump broken on Altix 350

2008-08-29 Thread Luck, Tony
> your commit
>
> commit 10617bbe84628eb18ab5f723d3ba35005adde143
> Author: Tony Luck [EMAIL PROTECTED]
> Date:   Tue Aug 12 10:34:20 2008 -0700
>
> [IA64] Ensure cpu0 can access per-cpu variables in early boot code
>
> broke kdump on our Altix 350. I get following early crash in kdump
> kernel

Sorry about that.  I'll try to reproduce it here.  Do you (or anyone
else reading this) know if the version of kexec that ships with RHEL5.2
works with current 2.6.27-rc kernels (perhaps not a politically correct
question to ask someone with a @suse.de address :-)

-Tony



RE: [PATCH 0/3] vmcoreinfo support for dump filtering

2007-10-17 Thread Luck, Tony
> This?

That does the trick, yes.

> (please tell me if you want me to send this to Linus)

I've put it in my tree now ... so I'll ask Linus to pull
it from there.

-Tony
