Re: [PATCH v29 3/9] arm64: kdump: reserve memory for crash dump kernel

2017-01-23 Thread AKASHI Takahiro
On Mon, Jan 23, 2017 at 10:23:15AM +, Mark Rutland wrote:
> On Mon, Jan 23, 2017 at 06:51:46PM +0900, AKASHI Takahiro wrote:
> > Mark,
> > 
> > On Thu, Jan 19, 2017 at 11:28:50AM +, Mark Rutland wrote:
> > > On Thu, Jan 19, 2017 at 06:49:42PM +0900, AKASHI Takahiro wrote:
> > > > On Tue, Jan 17, 2017 at 11:54:42AM +, Mark Rutland wrote:
> > > > > On Tue, Jan 17, 2017 at 05:20:44PM +0900, AKASHI Takahiro wrote:
> > > > > > On Fri, Jan 13, 2017 at 11:39:15AM +, Mark Rutland wrote:
> 
> > > > > I don't think we have much code useful for unmapping. We could re-use 
> > > > > create_mapping_late for this, passing a set of prot bits that means 
> > > > > the
> > > > > entries are invalid (e.g. have a PAGE_KERNEL_INVALID).
> > > > 
> > > > Do you really think that we should totally invalidate mmu entries?
> > > > I guess that, given proper cache & TLB flush operations, RO attribute is
> > > > good enough for memory consistency, no?
> > > > (None accesses the region, as I said, except in the case of re-loading
> > > > crash dump kernel though.)
> > > 
> > > My worry is that the first kernel and kdump kernel may map (portions of)
> > > the region with potentially confliciting memory attributes. So it would
> > > be necessary to completely unmap the region.
> > 
> > I think that this can happen only if the second kernel boots up,
> > leaving non-crashed cpus still running for some reason.
> 
> Yes. I was considering a kdump case where a secondary was stuck in the
> first kernel.
> 
> > > You raise a good point that this would also mean we need to perform some
> > > cache maintenance, which makes that a little more painful. We'd need a
> > > sequence like:
> > > 
> > > * Unmap the region
> > > * TLB invalidation
> > > * Remap the region with non-cacheable attributes
> > > * Cache maintenance
> > > * Unmap the region
> > > * TLB invalidation
> > 
> > I don't get why we need to remap the region and do cache
> > maintenance here. Please elaborate a bit more?
> 
> I think I was wrong, and we don't need to. Sorry about that.
> 
> My thought was that to ensure that there aren't stale lines with
> differing attributes, we'd need to do a clean+invalidate while the
> caches are guaranteed to not allocate anything furthe. Hence, we'd need
> to use a non-cacheable mapping to perform the clean+invalidate.
> 
> However, I now think that so long as we unmap the range, this shouldn't
> matter. The new kernel can perform the maintenance if it wishes to use
> different attributes, similarly to what the first kernel must do per the
> boot protocol.
> 
> > My current implementation of arch_kexec_protect_crashkres() is:
> > 
> > kexec_segment_flush(kexec_crash_image);
> > create_mapping_late(crashk_res.start, ..., __pgprot(0));
> > or PAGE_KERNEL_INVALID
> > flush_tlb_all();
> > 
> > kexec_segment_flush() will eventually do dcache-flush for all the modified
> > data in crash dump kernel memory.
> 
> I now think this should be fine, per the above.

OK.
I think that now I can see a light of the goal :)

-Takahiro AKASHI

> Thanks,
> Mark.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v29 4/9] arm64: kdump: implement machine_crash_shutdown()

2017-01-23 Thread AKASHI Takahiro
Will,

On Mon, Jan 23, 2017 at 05:46:34PM +, Will Deacon wrote:
> On Thu, Jan 12, 2017 at 12:01:11PM +, Will Deacon wrote:
> > On Thu, Jan 12, 2017 at 01:21:44PM +0900, AKASHI Takahiro wrote:
> > > On Wed, Jan 11, 2017 at 10:54:05AM +, Will Deacon wrote:
> > > > On Wed, Jan 11, 2017 at 03:36:28PM +0900, AKASHI Takahiro wrote:
> > > > > On Tue, Jan 10, 2017 at 11:32:48AM +, Will Deacon wrote:
> > > > > > On Wed, Dec 28, 2016 at 01:36:01PM +0900, AKASHI Takahiro wrote:
> > > > > > > @@ -22,6 +25,7 @@
> > > > > > >  extern const unsigned char arm64_relocate_new_kernel[];
> > > > > > >  extern const unsigned long arm64_relocate_new_kernel_size;
> > > > > > >  
> > > > > > > +static bool in_crash_kexec;
> > > > > > 
> > > > > > Do you actually need this bool? Why not call kexec_crash_loaded() 
> > > > > > instead?
> > > > > 
> > > > > The two have different meanings:
> > > > > "in_crash_kexec" indicates that kdump is taking place, while
> > > > > kexec_crash_loaded() tells us only whether crash dump kernel has been
> > > > > loaded or not.
> > > > > 
> > > > > It is crucial to distinguish them especially for machine_kexec()
> > > > > which can be called on normal kexec even if kdump has been set up.
> > > > 
> > > > Ah, I see. So how about just doing:
> > > > 
> > > >   if (kimage == kexec_crash_image)
> > > > 
> > > > in machine_kexec?
> > > 
> > > Yeah, it should work.
> > > Do you want to merge the following hunk,
> > > or expect that I will re-send the whole patch series
> > > (with other changes if any)?
> > 
> > Thanks, I'll fold it in and shout if I run into any problems. My plan is
> > to queue this for 4.11.
> 
> Given the DT discussion with Mark, I assume you'll post a new version with
> this rolled in.

Yes, I will!

-Takahiro AKASHI

> Will

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/24/2017 at 02:14 AM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 10:01:53AM -0800, Luck, Tony wrote:
>> will ignore the machine check on the other cpus ... assuming
>> that "cpu_is_offline(smp_processor_id())" does the right thing
>> in the kexec case where this is an "old" cpu that isn't online
>> in the new kernel.
> Nice. And kdump did do the dumping on one CPU, AFAIR. So we should be
> good there.
>

"nr_cpus=N" will consume more memory, using very large N is almost
impossible for kdump to boot with considering the limited crash memory
reserved.

For some large machine, nr_cpus=1 might not be enough, we have to use
nr_cpus=4 or more, it is also helpful for the vmcore parallel dumping :-)

Regards,
Xunlei

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/24/2017 at 09:46 AM, Xunlei Pang wrote:
> On 01/24/2017 at 01:51 AM, Borislav Petkov wrote:
>> Hey Tony,
>>
>> a "welcome back" is in order? :-)
>>
>> On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
>>> If the system had experienced some memory corruption, but
>>> recovered ... then there would be some pages sitting around
>>> that the old kernel had marked as POISON and stopped using.
>>> The kexec'd kernel doesn't know about these, so may touch that
>>> memory while taking a crash dump ...
>> Hmm, pass a list of poisoned pages to the kdump kernel so as not to
>> touch. Looks like there's already functionality for that:
>>
>> "makedumpfile can exclude the following types of pages while copying
>> VMCORE to DUMPFILE, and a user can choose which type of pages will be
>> excluded.
>>
>> - Pages filled with zero
>> - Cache pages
>> - User process data pages
>> - Free pages"
>>
>>  (there is a makedumpfile manpage somewhere)
>>
>> And apparently crash knows about poisoned pages and handles them:
>>
>> static int __init crash_save_vmcoreinfo_init(void)
>> {
>>  ...
>> #ifdef CONFIG_MEMORY_FAILURE
>> VMCOREINFO_NUMBER(PG_hwpoison);
>> #endif
>>
>> so if that works, the kexeced kernel should know about that list.
> From the log in my previous reply, MCE occurred before makedumpfile dumping,
> so I guess if the poisoned ones belong to the crash reserved memory or other
> type of events?

Another possibility may be from any system.reserved/pcie memory
which are shared between 1st and 2nd kernel.

>
> Besides, some kdump kernel may not use makedumpfile, for example a simple "cp"
> is also allowed to process "/proc/vmcore".
>
>>> and then you have a broadcast machine check (on older[1] Intel CPUs
>>> that don't support local machine check).
>> Right.
>>
>>> This is hard to work around. You really need all the CPUs to have set
>>> CR4.MCE=1 (if any didn't, then they will force a reset when they see
>>> the machine check). Also you need to make sure that they jump to the
>>> copy of do_machine_check() in the new kernel, not the old kernel.
>> Doesn't matter, right? The new copy is as clueless as the old one about
>> those MCEs.
>>
> It's the code in mce_start(), it waits for all the online cpus including the 
> cpus
> that kdump boots on to synchronize.
>
> So for new mce handler of kdump kernel, it is fine as the number of online 
> cpus
> is correct; as for old mce handler of 1st kernel, it's not true because some 
> cpus
> which are regarded online from 1st kernel's view are running the 2nd kernel 
> now,
> they can't respond to the old mce handler which will timeout the old mce 
> handler.
>
> Regards,
> Xunlei


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/24/2017 at 01:51 AM, Borislav Petkov wrote:
> Hey Tony,
>
> a "welcome back" is in order? :-)
>
> On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
>> If the system had experienced some memory corruption, but
>> recovered ... then there would be some pages sitting around
>> that the old kernel had marked as POISON and stopped using.
>> The kexec'd kernel doesn't know about these, so may touch that
>> memory while taking a crash dump ...
> Hmm, pass a list of poisoned pages to the kdump kernel so as not to
> touch. Looks like there's already functionality for that:
>
> "makedumpfile can exclude the following types of pages while copying
> VMCORE to DUMPFILE, and a user can choose which type of pages will be
> excluded.
>
> - Pages filled with zero
> - Cache pages
> - User process data pages
> - Free pages"
>
>  (there is a makedumpfile manpage somewhere)
>
> And apparently crash knows about poisoned pages and handles them:
>
> static int __init crash_save_vmcoreinfo_init(void)
> {
>   ...
> #ifdef CONFIG_MEMORY_FAILURE
> VMCOREINFO_NUMBER(PG_hwpoison);
> #endif
>
> so if that works, the kexeced kernel should know about that list.

>From the log in my previous reply, MCE occurred before makedumpfile dumping,
so I guess if the poisoned ones belong to the crash reserved memory or other
type of events?

Besides, some kdump kernel may not use makedumpfile, for example a simple "cp"
is also allowed to process "/proc/vmcore".

>
>> and then you have a broadcast machine check (on older[1] Intel CPUs
>> that don't support local machine check).
> Right.
>
>> This is hard to work around. You really need all the CPUs to have set
>> CR4.MCE=1 (if any didn't, then they will force a reset when they see
>> the machine check). Also you need to make sure that they jump to the
>> copy of do_machine_check() in the new kernel, not the old kernel.
> Doesn't matter, right? The new copy is as clueless as the old one about
> those MCEs.
>

It's the code in mce_start(), it waits for all the online cpus including the 
cpus
that kdump boots on to synchronize.

So for new mce handler of kdump kernel, it is fine as the number of online cpus
is correct; as for old mce handler of 1st kernel, it's not true because some 
cpus
which are regarded online from 1st kernel's view are running the 2nd kernel now,
they can't respond to the old mce handler which will timeout the old mce 
handler.

Regards,
Xunlei

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Borislav Petkov
On Mon, Jan 23, 2017 at 10:01:53AM -0800, Luck, Tony wrote:
> will ignore the machine check on the other cpus ... assuming
> that "cpu_is_offline(smp_processor_id())" does the right thing
> in the kexec case where this is an "old" cpu that isn't online
> in the new kernel.

Nice. And kdump did do the dumping on one CPU, AFAIR. So we should be
good there.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Luck, Tony
On Mon, Jan 23, 2017 at 06:51:30PM +0100, Borislav Petkov wrote:
> Hey Tony,
> 
> a "welcome back" is in order? :-)

Yes - first day back today. Lots of catching up to do.

> And apparently crash knows about poisoned pages and handles them:
> 
> static int __init crash_save_vmcoreinfo_init(void)
> {
>   ...
> #ifdef CONFIG_MEMORY_FAILURE
> VMCOREINFO_NUMBER(PG_hwpoison);
> #endif
> 
> so if that works, the kexeced kernel should know about that list.

Oh good ... it is smarter than I thought.

> Doesn't matter, right? The new copy is as clueless as the old one about
> those MCEs.

If things are well enough initialized that we don't reset, and
get to do_machine_check(), then this code from Ashok:

/* If this CPU is offline, just bail out. */
if (cpu_is_offline(smp_processor_id())) {
u64 mcgstatus;

mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
if (mcgstatus & MCG_STATUS_RIPV) {
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
return;
}
}

will ignore the machine check on the other cpus ... assuming
that "cpu_is_offline(smp_processor_id())" does the right thing
in the kexec case where this is an "old" cpu that isn't online
in the new kernel.

-Tony

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Borislav Petkov
Hey Tony,

a "welcome back" is in order? :-)

On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
> If the system had experienced some memory corruption, but
> recovered ... then there would be some pages sitting around
> that the old kernel had marked as POISON and stopped using.
> The kexec'd kernel doesn't know about these, so may touch that
> memory while taking a crash dump ...

Hmm, pass a list of poisoned pages to the kdump kernel so as not to
touch. Looks like there's already functionality for that:

"makedumpfile can exclude the following types of pages while copying
VMCORE to DUMPFILE, and a user can choose which type of pages will be
excluded.

- Pages filled with zero
- Cache pages
- User process data pages
- Free pages"

 (there is a makedumpfile manpage somewhere)

And apparently crash knows about poisoned pages and handles them:

static int __init crash_save_vmcoreinfo_init(void)
{
...
#ifdef CONFIG_MEMORY_FAILURE
VMCOREINFO_NUMBER(PG_hwpoison);
#endif

so if that works, the kexeced kernel should know about that list.

> and then you have a broadcast machine check (on older[1] Intel CPUs
> that don't support local machine check).

Right.

> This is hard to work around. You really need all the CPUs to have set
> CR4.MCE=1 (if any didn't, then they will force a reset when they see
> the machine check). Also you need to make sure that they jump to the
> copy of do_machine_check() in the new kernel, not the old kernel.

Doesn't matter, right? The new copy is as clueless as the old one about
those MCEs.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Luck, Tony
On Mon, Jan 23, 2017 at 03:50:56PM +0100, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
> > One possible timing sequence would be:
> > 1st kernel running on multiple cpus panicked
> > then the crash dump code starts
> > the crash dump code stops the others cpus except the crashing one
> > 2nd kernel boots up on the crash cpu with "nr_cpus=1"
> > some broadcasted mce comes on some cpu amongst the other cpus(not the 
> > crashing cpu)
> 
> Where does this broadcasted MCE come from?
> 
> The crash dump code triggered it? Or it happened before the panic()?
> 
> Are you talking about an *actual* sequence which you're experiencing on
> real hw or is this something hypothetical?

If the system had experienced some memory corruption, but
recovered ... then there would be some pages sitting around
that the old kernel had marked as POISON and stopped using.
The kexec'd kernel doesn't know about these, so may touch that
memory while taking a crash dump ... and then you have a
broadcast machine check (on older[1] Intel CPUs that don't support
local machine check).

This is hard to work around.  You really need all the CPUs to
have set CR4.MCE=1 (if any didn't, then they will force a reset
when they see the machine check). Also you need to make sure that
they jump to the copy of do_machine_check() in the new kernel, not
the old kernel.

A while ago I played with the nr_cpus=N code to have it bring
all the CPUs far enough online to get the machine check initialization
done, then any extras above "N" just go back offline again.
But I never got this to work reliably.

-Tony

[1] older == all released ones, at the moment.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Borislav Petkov
On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
> One possible timing sequence would be:
> 1st kernel running on multiple cpus panicked
> then the crash dump code starts
> the crash dump code stops the others cpus except the crashing one
> 2nd kernel boots up on the crash cpu with "nr_cpus=1"
> some broadcasted mce comes on some cpu amongst the other cpus(not the 
> crashing cpu)

Where does this broadcasted MCE come from?

The crash dump code triggered it? Or it happened before the panic()?

Are you talking about an *actual* sequence which you're experiencing on
real hw or is this something hypothetical?

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/23/2017 at 08:51 PM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 04:01:51PM +0800, Xunlei Pang wrote:
>> We met an issue for kdump: after kdump kernel boots up,
>> and there comes a broadcasted mce in first kernel, the
> How does that even happen?
>
> Lemme try to understand this correctly: the first kernel gets an
> MCE, kdump starts and boots a *whole* kernel and *then* you get the
> broadcasted MCE? I have real hard time believing that.
>
> What happened to the approach of clearing CR4.MCE before loading the
> kdump kernel, in native_machine_shutdown() or wherever does the kdump
> gets loaded...
>

One possible timing sequence would be:
1st kernel running on multiple cpus panicked
then the crash dump code starts
the crash dump code stops the others cpus except the crashing one
2nd kernel boots up on the crash cpu with "nr_cpus=1"
some broadcasted mce comes on some cpu amongst the other cpus(not the crashing 
cpu)
the other cpus enter old mce handler of 1st kernel, while crash cpu enters new 
mce handler of 2nd kernel
the old mce handler of 1st kernel will timeout and panic due to mce 
syncrhonization under default setting

Regards,
Xunlei

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2] x86/crash: Update the stale comment in reserve_crashkernel()

2017-01-23 Thread Xunlei Pang
On 01/23/2017 at 04:48 PM, Dave Young wrote:
> Hi, Xunlei
>
> On 01/23/17 at 02:48pm, Xunlei Pang wrote:
>> CRASH_KERNEL_ADDR_MAX has been missing for a long time,
>> update it with more detailed explanation.
>>
>> Cc: Robert LeBlanc 
>> Cc: Baoquan He 
>> Signed-off-by: Xunlei Pang 
>> ---
>>  arch/x86/kernel/setup.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>> index 4cfba94..c32a167 100644
>> --- a/arch/x86/kernel/setup.c
>> +++ b/arch/x86/kernel/setup.c
>> @@ -575,7 +575,9 @@ static void __init reserve_crashkernel(void)
>>  /* 0 means: find the address automatically */
>>  if (crash_base <= 0) {
>>  /*
>> - *  kexec want bzImage is below CRASH_KERNEL_ADDR_MAX
>> + * Set CRASH_ADDR_LOW_MAX upper bound for crash memory
>> + * as old kexec-tools loads bzImage below that, unless
>> + * "crashkernel=size[KMG],high" is specified.
> There is already comment before the define of those macros, also
> there are 32bit case which has a different reason about 512M there as
> well.

If we see from the kexec's perspective, we have a common CRASH_ADDR_LOW_MAX
definition for both x86 32-bit and 64-bit(32-bit x86 has the same value defined 
for
CRASH_ADDR_LOW_MAX and CRASH_ADDR_HIGH_MAX), so old kexec will load below
CRASH_ADDR_LOW_MAX,  so I think the description is fine :-)

Regards,
Xunlei

>
> So it looks better to just drop the one line comment without adding
> further comments here.
>>   */
>>  crash_base = memblock_find_in_range(CRASH_ALIGN,
>>  high ? CRASH_ADDR_HIGH_MAX
>> -- 
>> 1.8.3.1
>>
> Thanks
> Dave
>
> ___
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Borislav Petkov
On Mon, Jan 23, 2017 at 04:01:51PM +0800, Xunlei Pang wrote:
> We met an issue for kdump: after kdump kernel boots up,
> and there comes a broadcasted mce in first kernel, the

How does that even happen?

Lemme try to understand this correctly: the first kernel gets an
MCE, kdump starts and boots a *whole* kernel and *then* you get the
broadcasted MCE? I have real hard time believing that.

What happened to the approach of clearing CR4.MCE before loading the
kdump kernel, in native_machine_shutdown() or wherever does the kdump
gets loaded...

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v29 3/9] arm64: kdump: reserve memory for crash dump kernel

2017-01-23 Thread Mark Rutland
On Mon, Jan 23, 2017 at 06:51:46PM +0900, AKASHI Takahiro wrote:
> Mark,
> 
> On Thu, Jan 19, 2017 at 11:28:50AM +, Mark Rutland wrote:
> > On Thu, Jan 19, 2017 at 06:49:42PM +0900, AKASHI Takahiro wrote:
> > > On Tue, Jan 17, 2017 at 11:54:42AM +, Mark Rutland wrote:
> > > > On Tue, Jan 17, 2017 at 05:20:44PM +0900, AKASHI Takahiro wrote:
> > > > > On Fri, Jan 13, 2017 at 11:39:15AM +, Mark Rutland wrote:

> > > > I don't think we have much code useful for unmapping. We could re-use 
> > > > create_mapping_late for this, passing a set of prot bits that means the
> > > > entries are invalid (e.g. have a PAGE_KERNEL_INVALID).
> > > 
> > > Do you really think that we should totally invalidate mmu entries?
> > > I guess that, given proper cache & TLB flush operations, RO attribute is
> > > good enough for memory consistency, no?
> > > (None accesses the region, as I said, except in the case of re-loading
> > > crash dump kernel though.)
> > 
> > My worry is that the first kernel and kdump kernel may map (portions of)
> > the region with potentially confliciting memory attributes. So it would
> > be necessary to completely unmap the region.
> 
> I think that this can happen only if the second kernel boots up,
> leaving non-crashed cpus still running for some reason.

Yes. I was considering a kdump case where a secondary was stuck in the
first kernel.

> > You raise a good point that this would also mean we need to perform some
> > cache maintenance, which makes that a little more painful. We'd need a
> > sequence like:
> > 
> > * Unmap the region
> > * TLB invalidation
> > * Remap the region with non-cacheable attributes
> > * Cache maintenance
> > * Unmap the region
> > * TLB invalidation
> 
> I don't get why we need to remap the region and do cache
> maintenance here. Please elaborate a bit more?

I think I was wrong, and we don't need to. Sorry about that.

My thought was that to ensure that there aren't stale lines with
differing attributes, we'd need to do a clean+invalidate while the
caches are guaranteed to not allocate anything furthe. Hence, we'd need
to use a non-cacheable mapping to perform the clean+invalidate.

However, I now think that so long as we unmap the range, this shouldn't
matter. The new kernel can perform the maintenance if it wishes to use
different attributes, similarly to what the first kernel must do per the
boot protocol.

> My current implementation of arch_kexec_protect_crashkres() is:
> 
> kexec_segment_flush(kexec_crash_image);
> create_mapping_late(crashk_res.start, ..., __pgprot(0));
> or PAGE_KERNEL_INVALID
> flush_tlb_all();
> 
> kexec_segment_flush() will eventually do dcache-flush for all the modified
> data in crash dump kernel memory.

I now think this should be fine, per the above.

Thanks,
Mark.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v29 3/9] arm64: kdump: reserve memory for crash dump kernel

2017-01-23 Thread AKASHI Takahiro
Mark,

On Thu, Jan 19, 2017 at 11:28:50AM +, Mark Rutland wrote:
> On Thu, Jan 19, 2017 at 06:49:42PM +0900, AKASHI Takahiro wrote:
> > On Tue, Jan 17, 2017 at 11:54:42AM +, Mark Rutland wrote:
> > > On Tue, Jan 17, 2017 at 05:20:44PM +0900, AKASHI Takahiro wrote:
> > > > On Fri, Jan 13, 2017 at 11:39:15AM +, Mark Rutland wrote:
> > > > > Great! I think it would be better to follow the approach of
> > > > > mark_rodata_ro(), rather than opening up set_memory_*(), but 
> > > > > otherwise,
> > > > > it looks like it should work.
> > > > 
> > > > I'm not quite sure what the approach of mark_rodata_ro() means, but
> > > > I found that using create_mapping_late() may cause two problems:
> > > > 
> > > > 1) it fails when PTE_CONT bits mismatch between an old and new mmu 
> > > > entry.
> > > >This can happen, say, if the memory range for crash dump kernel
> > > >starts in the mid of _continuous_ pages.
> > > 
> > > That should only happen if we try to remap a segment different to what
> > > we originally mapped.
> > > 
> > > I was intending that we'd explicitly map the reserved region separately
> > > in the boot path, like we do for kernel segments in map_kernel(). We
> > > would allow sections and/or CONT entires. 
> > > 
> > > Then, in __map_memblock() we'd then skip that range as we do for the
> > > linear map alias of the kernel image.
> > > 
> > > That way, we can later use create_mapping_late for that same region, and
> > > it should handle sections and/or CONT entries in the exact same way as
> > > it does for the kernel image segments in mark_rodata_ro().
> > 
> > I see.
> > Which one do you prefer, yours above or my (second) solution?
> > Either way, they do almost the same thing in terms of mapping.
> 
> While both should work, I'd prefer to match the existing map_kernel()
> logic, (i.e. my suggestion above), for consistency.

OK

> > > I don't think we have much code useful for unmapping. We could re-use 
> > > create_mapping_late for this, passing a set of prot bits that means the
> > > entries are invalid (e.g. have a PAGE_KERNEL_INVALID).
> > 
> > Do you really think that we should totally invalidate mmu entries?
> > I guess that, given proper cache & TLB flush operations, RO attribute is
> > good enough for memory consistency, no?
> > (None accesses the region, as I said, except in the case of re-loading
> > crash dump kernel though.)
> 
> My worry is that the first kernel and kdump kernel may map (portions of)
> the region with potentially confliciting memory attributes. So it would
> be necessary to completely unmap the region.

I think that this can happen only if the second kernel boots up,
leaving non-crashed cpus still running for some reason.

> You raise a good point that this would also mean we need to perform some
> cache maintenance, which makes that a little more painful. We'd need a
> sequence like:
> 
> * Unmap the region
> * TLB invalidation
> * Remap the region with non-cacheable attributes
> * Cache maintenance
> * Unmap the region
> * TLB invalidation

I don't get why we need to remap the region and do cache
maintenance here. Please elaborate a bit more?
My current implementation of arch_kexec_protect_crashkres() is:

kexec_segment_flush(kexec_crash_image);
create_mapping_late(crashk_res.start, ..., __pgprot(0));
or PAGE_KERNEL_INVALID
flush_tlb_all();

kexec_segment_flush() will eventually do dcache-flush for all the modified
data in crash dump kernel memory.

> > > We'd have to perform the TLB invalidation ourselves, but that shouldn't
> > > be too painful.
> > 
> > Do we need to invalidate TLBs not only before but also after changing
> > permission attributes as make_rodata_ro() does?
> 
> I believe we'd only have to perform the TLB invalidation after the
> change of attributes.

OK

Thanks,
-Takahiro AKASHI

> Thanks,
> Mark.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2] x86/crash: Update the stale comment in reserve_crashkernel()

2017-01-23 Thread Dave Young
Hi, Xunlei

On 01/23/17 at 02:48pm, Xunlei Pang wrote:
> CRASH_KERNEL_ADDR_MAX has been missing for a long time,
> update it with more detailed explanation.
> 
> Cc: Robert LeBlanc 
> Cc: Baoquan He 
> Signed-off-by: Xunlei Pang 
> ---
>  arch/x86/kernel/setup.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 4cfba94..c32a167 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -575,7 +575,9 @@ static void __init reserve_crashkernel(void)
>   /* 0 means: find the address automatically */
>   if (crash_base <= 0) {
>   /*
> -  *  kexec want bzImage is below CRASH_KERNEL_ADDR_MAX
> +  * Set CRASH_ADDR_LOW_MAX upper bound for crash memory
> +  * as old kexec-tools loads bzImage below that, unless
> +  * "crashkernel=size[KMG],high" is specified.

There is already comment before the define of those macros, also
there are 32bit case which has a different reason about 512M there as
well.

So it looks better to just drop the one line comment without adding
further comments here.
>*/
>   crash_base = memblock_find_in_range(CRASH_ALIGN,
>   high ? CRASH_ADDR_HIGH_MAX
> -- 
> 1.8.3.1
> 

Thanks
Dave

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2] kexec: implemented XEN KEXEC STATUS to determine if an image is loaded

2017-01-23 Thread Baoquan He
On 01/23/17 at 09:34am, Daniel Kiper wrote:
> On Sat, Jan 21, 2017 at 09:43:19AM +0800, Baoquan He wrote:
> > Hi,
> >
> > I don't strongly oppose against this, but could you tell what you have
> > met makes the kexec_loaded/kexec_crash_loaded checking not convenient to
> > you?
> 
> kexec_loaded/kexec_crash_loaded exposes Linux kernel kexec/crash state.
> It does not say anything about Xen kexec/crash state. So, we need a special
> approach to get the latter. Though for compatibility we provide similar
> functionality in kexec-tools for the former.
> 
> I hope that helps.

Thanks for telling.

Understood. Then it makes sense. Better put above explanation into log.

Thanks
Baoquan

> 
> > On 01/20/17 at 11:03am, Eric DeVolder wrote:
> > > Instead of the scripts having to poke at various fields we can
> > > provide that functionality via the -S parameter.
> > >
> > > Returns 0 if the payload is loaded. Can be used in combination
> > > with -l or -p to get the state of the proper kexec image.
> > >
> > > Signed-off-by: Konrad Rzeszutek Wilk 
> > > Signed-off-by: Eric DeVolder 
> > > ---
> > > Note: The corresponding Xen changes have been committed
> > > to the Xen staging branch. Follow this thread:
> > > https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg01570.html
> > >
> > > CC: Andrew Cooper 
> > > CC: kexec@lists.infradead.org
> > > CC: xen-de...@lists.xenproject.org
> > > CC: Daniel Kiper 
> > >
> > > v0: First version (internal product).
> > > v1: Posted on kexec mailing list. Changed -s to -S
> > > v2: Incorporated feedback from kexec mailing list
> > > ---
> > >  configure.ac  |  8 ++-
> > >  kexec/kexec-xen.c | 26 +++
> > >  kexec/kexec.8 |  6 ++
> > >  kexec/kexec.c | 62 
> > > ---
> > >  kexec/kexec.h |  5 -
> > >  5 files changed, 98 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/configure.ac b/configure.ac
> > > index 3044185..c6e864b 100644
> > > --- a/configure.ac
> > > +++ b/configure.ac
> > > @@ -165,8 +165,14 @@ fi
> > >  dnl find Xen control stack libraries
> > >  if test "$with_xen" = yes ; then
> > >   AC_CHECK_HEADER(xenctrl.h,
> > > - [AC_CHECK_LIB(xenctrl, xc_kexec_load, ,
> > > + [AC_CHECK_LIB(xenctrl, xc_kexec_load, [ have_xenctrl_h=yes ],
> > >   AC_MSG_NOTICE([Xen support disabled]))])
> > > +if test "$have_xenctrl_h" = yes ; then
> > > + AC_CHECK_LIB(xenctrl, xc_kexec_status,
> > > + AC_DEFINE(HAVE_KEXEC_CMD_STATUS, 1,
> > > + [The kexec_status call is available]),
> > > + AC_MSG_NOTICE([The kexec_status call is not available]))
> > > +fi
> 
> Eric, I have a feeling that you should add en extra indentation
> for lines starting from "+if test "$have_xenctrl_h"..." and
> ending at "+fi".
> 
> Daniel

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2] kexec: implemented XEN KEXEC STATUS to determine if an image is loaded

2017-01-23 Thread Daniel Kiper
On Sat, Jan 21, 2017 at 09:43:19AM +0800, Baoquan He wrote:
> Hi,
>
> I don't strongly oppose against this, but could you tell what you have
> met makes the kexec_loaded/kexec_crash_loaded checking not convenient to
> you?

kexec_loaded/kexec_crash_loaded exposes Linux kernel kexec/crash state.
It does not say anything about Xen kexec/crash state. So, we need a special
approach to get the latter. Though for compatibility we provide similar
functionality in kexec-tools for the former.

I hope that helps.

> On 01/20/17 at 11:03am, Eric DeVolder wrote:
> > Instead of the scripts having to poke at various fields we can
> > provide that functionality via the -S parameter.
> >
> > Returns 0 if the payload is loaded. Can be used in combination
> > with -l or -p to get the state of the proper kexec image.
> >
> > Signed-off-by: Konrad Rzeszutek Wilk 
> > Signed-off-by: Eric DeVolder 
> > ---
> > Note: The corresponding Xen changes have been committed
> > to the Xen staging branch. Follow this thread:
> > https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg01570.html
> >
> > CC: Andrew Cooper 
> > CC: kexec@lists.infradead.org
> > CC: xen-de...@lists.xenproject.org
> > CC: Daniel Kiper 
> >
> > v0: First version (internal product).
> > v1: Posted on kexec mailing list. Changed -s to -S
> > v2: Incorporated feedback from kexec mailing list
> > ---
> >  configure.ac  |  8 ++-
> >  kexec/kexec-xen.c | 26 +++
> >  kexec/kexec.8 |  6 ++
> >  kexec/kexec.c | 62 
> > ---
> >  kexec/kexec.h |  5 -
> >  5 files changed, 98 insertions(+), 9 deletions(-)
> >
> > diff --git a/configure.ac b/configure.ac
> > index 3044185..c6e864b 100644
> > --- a/configure.ac
> > +++ b/configure.ac
> > @@ -165,8 +165,14 @@ fi
> >  dnl find Xen control stack libraries
> >  if test "$with_xen" = yes ; then
> > AC_CHECK_HEADER(xenctrl.h,
> > -   [AC_CHECK_LIB(xenctrl, xc_kexec_load, ,
> > +   [AC_CHECK_LIB(xenctrl, xc_kexec_load, [ have_xenctrl_h=yes ],
> > AC_MSG_NOTICE([Xen support disabled]))])
> > +if test "$have_xenctrl_h" = yes ; then
> > +   AC_CHECK_LIB(xenctrl, xc_kexec_status,
> > +   AC_DEFINE(HAVE_KEXEC_CMD_STATUS, 1,
> > +   [The kexec_status call is available]),
> > +   AC_MSG_NOTICE([The kexec_status call is not available]))
> > +fi

Eric, I have a feeling that you should add en extra indentation
for lines starting from "+if test "$have_xenctrl_h"..." and
ending at "+fi".

Daniel

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
We met an issue for kdump: after kdump kernel boots up,
and there comes a broadcasted mce in first kernel, the
other cpus remaining in first kernel will enter the old
mce handler of first kernel, then timeout and panic due
to MCE synchronization, finally reset the kdump cpus.

This patch lets cpus stay quiet when panic happens, so
before crash cpu shots them down or after kdump boots,
they should not do anything except clearing MCG_STATUS
in case of broadcasted mce. This is useful for kdump
to let the vmcore dumping perform as hard as it can.

Previous efforts:
https://patchwork.kernel.org/patch/6167631/
https://lists.gt.net/linux/kernel/2146557

Cc: Naoya Horiguchi 
Signed-off-by: Xunlei Pang 
---
 arch/x86/kernel/cpu/mcheck/mce.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 00ef432..0c2bf77 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1157,6 +1157,23 @@ void do_machine_check(struct pt_regs *regs, long 
error_code)
 
mce_gather_info(, regs);
 
+   /*
+* Check if this MCE is signaled to only this logical processor,
+* on Intel only.
+*/
+   if (m.cpuvendor == X86_VENDOR_INTEL)
+   lmce = m.mcgstatus & MCG_STATUS_LMCES;
+
+   /*
+* Special treatment for Intel broadcasted machine check:
+* To avoid panic due to MCE synchronization in case of kdump,
+* after system panic, clear global status and bail out.
+*/
+   if (!lmce && atomic_read(_cpu) != PANIC_CPU_INVALID) {
+   wrmsrl(MSR_IA32_MCG_STATUS, 0);
+   goto out;
+   }
+
final = this_cpu_ptr(_seen);
*final = m;
 
@@ -1174,13 +1191,6 @@ void do_machine_check(struct pt_regs *regs, long 
error_code)
kill_it = 1;
 
/*
-* Check if this MCE is signaled to only this logical processor,
-* on Intel only.
-*/
-   if (m.cpuvendor == X86_VENDOR_INTEL)
-   lmce = m.mcgstatus & MCG_STATUS_LMCES;
-
-   /*
 * Go through all banks in exclusion of the other CPUs. This way we
 * don't report duplicated events on shared banks because the first one
 * to see it will clear it. If this is a Local MCE, then no need to
-- 
1.8.3.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec