Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-21 Thread Borislav Petkov
On Tue, Feb 21, 2017 at 08:37:21PM +0800, Xunlei Pang wrote:
> -/* If this CPU is offline, just bail out. */
> -if (cpu_is_offline(smp_processor_id())) {
> +/*
> + * Cases to bail out to avoid rendezvous process timeout:
> + * 1)If crashing_cpu was set, e.g. entering kdump,
> +

Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-21 Thread Xunlei Pang
On 02/20/2017 at 07:09 PM, Borislav Petkov wrote:
> On Mon, Feb 20, 2017 at 02:10:37PM +0800, Xunlei Pang wrote:
>> @@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>> */
>> int lmce = 1;
>>
>> -/* If this CPU is offline, just bail out. */
>> -

Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-21 Thread Borislav Petkov
On Tue, Feb 21, 2017 at 09:28:20AM +0800, Xunlei Pang wrote:
> Not kdump kernel starts dumping, just during nmi_shootdown_cpus(), if some
> MCE comes after crashing_cpu was set and we don't skip crashing_cpu, then
> the crashing cpu will enter mce handler and trigger the synchronization issue.

Ok,

Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-20 Thread Xunlei Pang
On 02/21/2017 at 04:26 AM, Borislav Petkov wrote:
> On Mon, Feb 20, 2017 at 09:29:24PM +0800, Xunlei Pang wrote:
>> There is a small window between crash and kdump kernel boot, so
>> if a SRAO comes within this window it will also cause the mce
>> synchronization problem on the crashing cpu if we d

Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-20 Thread Xunlei Pang
On 02/20/2017 at 09:29 PM, Xunlei Pang wrote:
> On 02/20/2017 at 07:09 PM, Borislav Petkov wrote:
>> On Mon, Feb 20, 2017 at 02:10:37PM +0800, Xunlei Pang wrote:
>>> @@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>>> */
>>> int lmce = 1;
>>>
>>> -

Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-20 Thread Borislav Petkov
On Mon, Feb 20, 2017 at 09:29:24PM +0800, Xunlei Pang wrote:
> There is a small window between crash and kdump kernel boot, so
> if a SRAO comes within this window it will also cause the mce
> synchronization problem on the crashing cpu if we don't bail out the
> crashing cpu.

You mean, in the win

Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-20 Thread Xunlei Pang
On 02/20/2017 at 07:09 PM, Borislav Petkov wrote:
> On Mon, Feb 20, 2017 at 02:10:37PM +0800, Xunlei Pang wrote:
>> @@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>> */
>> int lmce = 1;
>>
>> -/* If this CPU is offline, just bail out. */
>> -

Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-20 Thread Borislav Petkov
On Mon, Feb 20, 2017 at 02:10:37PM +0800, Xunlei Pang wrote:
> @@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>    */
>   int lmce = 1;
>
> - /* If this CPU is offline, just bail out. */
> - if (cpu_is_offline(smp_processor_id())) {
> + /

[PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-19 Thread Xunlei Pang
We hit an issue with kdump: if a broadcast MCE arrives in the first kernel after the kdump kernel has booted, the other CPUs remaining in the first kernel enter the first kernel's old MCE handler, time out in MCE synchronization and panic, and finally reset the kdump CPUs. This patch lets cpus
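
For readers following the thread, the bail-out condition being discussed can be sketched as a small userspace model. This is only an illustration under assumptions, not the patch itself: `struct mce_cpu_state`, `mce_should_bail_out()` and `NO_CRASHING_CPU` are hypothetical names, and the `offline` field merely stands in for the `cpu_is_offline(smp_processor_id())` check quoted above. It assumes, as the replies suggest, that the crashing CPU itself also skips the rendezvous once `crashing_cpu` was set.

```c
#include <stdbool.h>

/* Hypothetical sentinel: no CPU has begun a crash shutdown yet. */
#define NO_CRASHING_CPU (-1)

/*
 * Userspace model of the state the handler would look at.
 * All names are illustrative and are not taken from the kernel.
 */
struct mce_cpu_state {
	int  this_cpu;     /* CPU that took the machine check */
	bool offline;      /* models cpu_is_offline(smp_processor_id()) */
	int  crashing_cpu; /* NO_CRASHING_CPU until a crash shutdown starts */
};

/* Return true if this CPU should skip the MCE rendezvous. */
bool mce_should_bail_out(const struct mce_cpu_state *s)
{
	/* Pre-existing case: an offline CPU just bails out. */
	if (s->offline)
		return true;

	/*
	 * Assumed new case: once crashing_cpu was set, the other CPUs
	 * may never reach the rendezvous, so waiting would only time out.
	 * This model applies it to every CPU, the crashing one included.
	 */
	if (s->crashing_cpu != NO_CRASHING_CPU)
		return true;

	return false;
}
```

In this model a CPU takes part in the rendezvous only while it is online and no crash shutdown has started; whether the real patch distinguishes the crashing CPU from the others is not visible in the truncated snippets above.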