Re: [PATCH v29 3/9] arm64: kdump: reserve memory for crash dump kernel
On Mon, Jan 23, 2017 at 10:23:15AM +, Mark Rutland wrote: > On Mon, Jan 23, 2017 at 06:51:46PM +0900, AKASHI Takahiro wrote: > > Mark, > > > > On Thu, Jan 19, 2017 at 11:28:50AM +, Mark Rutland wrote: > > > On Thu, Jan 19, 2017 at 06:49:42PM +0900, AKASHI Takahiro wrote: > > > > On Tue, Jan 17, 2017 at 11:54:42AM +, Mark Rutland wrote: > > > > > On Tue, Jan 17, 2017 at 05:20:44PM +0900, AKASHI Takahiro wrote: > > > > > > On Fri, Jan 13, 2017 at 11:39:15AM +, Mark Rutland wrote: > > > > > > I don't think we have much code useful for unmapping. We could re-use > > > > > create_mapping_late for this, passing a set of prot bits that means > > > > > the > > > > > entries are invalid (e.g. have a PAGE_KERNEL_INVALID). > > > > > > > > Do you really think that we should totally invalidate mmu entries? > > > > I guess that, given proper cache & TLB flush operations, RO attribute is > > > > good enough for memory consistency, no? > > > > (None accesses the region, as I said, except in the case of re-loading > > > > crash dump kernel though.) > > > > > > My worry is that the first kernel and kdump kernel may map (portions of) > > > the region with potentially confliciting memory attributes. So it would > > > be necessary to completely unmap the region. > > > > I think that this can happen only if the second kernel boots up, > > leaving non-crashed cpus still running for some reason. > > Yes. I was considering a kdump case where a secondary was stuck in the > first kernel. > > > > You raise a good point that this would also mean we need to perform some > > > cache maintenance, which makes that a little more painful. We'd need a > > > sequence like: > > > > > > * Unmap the region > > > * TLB invalidation > > > * Remap the region with non-cacheable attributes > > > * Cache maintenance > > > * Unmap the region > > > * TLB invalidation > > > > I don't get why we need to remap the region and do cache > > maintenance here. Please elaborate a bit more? > > I think I was wrong, and we don't need to. Sorry about that. > > My thought was that to ensure that there aren't stale lines with > differing attributes, we'd need to do a clean+invalidate while the > caches are guaranteed to not allocate anything furthe. Hence, we'd need > to use a non-cacheable mapping to perform the clean+invalidate. > > However, I now think that so long as we unmap the range, this shouldn't > matter. The new kernel can perform the maintenance if it wishes to use > different attributes, similarly to what the first kernel must do per the > boot protocol. > > > My current implementation of arch_kexec_protect_crashkres() is: > > > > kexec_segment_flush(kexec_crash_image); > > create_mapping_late(crashk_res.start, ..., __pgprot(0)); > > or PAGE_KERNEL_INVALID > > flush_tlb_all(); > > > > kexec_segment_flush() will eventually do dcache-flush for all the modified > > data in crash dump kernel memory. > > I now think this should be fine, per the above. OK. I think that now I can see a light of the goal :) -Takahiro AKASHI > Thanks, > Mark. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v29 4/9] arm64: kdump: implement machine_crash_shutdown()
Will, On Mon, Jan 23, 2017 at 05:46:34PM +, Will Deacon wrote: > On Thu, Jan 12, 2017 at 12:01:11PM +, Will Deacon wrote: > > On Thu, Jan 12, 2017 at 01:21:44PM +0900, AKASHI Takahiro wrote: > > > On Wed, Jan 11, 2017 at 10:54:05AM +, Will Deacon wrote: > > > > On Wed, Jan 11, 2017 at 03:36:28PM +0900, AKASHI Takahiro wrote: > > > > > On Tue, Jan 10, 2017 at 11:32:48AM +, Will Deacon wrote: > > > > > > On Wed, Dec 28, 2016 at 01:36:01PM +0900, AKASHI Takahiro wrote: > > > > > > > @@ -22,6 +25,7 @@ > > > > > > > extern const unsigned char arm64_relocate_new_kernel[]; > > > > > > > extern const unsigned long arm64_relocate_new_kernel_size; > > > > > > > > > > > > > > +static bool in_crash_kexec; > > > > > > > > > > > > Do you actually need this bool? Why not call kexec_crash_loaded() > > > > > > instead? > > > > > > > > > > The two have different meanings: > > > > > "in_crash_kexec" indicates that kdump is taking place, while > > > > > kexec_crash_loaded() tells us only whether crash dump kernel has been > > > > > loaded or not. > > > > > > > > > > It is crucial to distinguish them especially for machine_kexec() > > > > > which can be called on normal kexec even if kdump has been set up. > > > > > > > > Ah, I see. So how about just doing: > > > > > > > > if (kimage == kexec_crash_image) > > > > > > > > in machine_kexec? > > > > > > Yeah, it should work. > > > Do you want to merge the following hunk, > > > or expect that I will re-send the whole patch series > > > (with other changes if any)? > > > > Thanks, I'll fold it in and shout if I run into any problems. My plan is > > to queue this for 4.11. > > Given the DT discussion with Mark, I assume you'll post a new version with > this rolled in. Yes, I will! -Takahiro AKASHI > Will ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic
On 01/24/2017 at 02:14 AM, Borislav Petkov wrote: > On Mon, Jan 23, 2017 at 10:01:53AM -0800, Luck, Tony wrote: >> will ignore the machine check on the other cpus ... assuming >> that "cpu_is_offline(smp_processor_id())" does the right thing >> in the kexec case where this is an "old" cpu that isn't online >> in the new kernel. > Nice. And kdump did do the dumping on one CPU, AFAIR. So we should be > good there. > "nr_cpus=N" will consume more memory, using very large N is almost impossible for kdump to boot with considering the limited crash memory reserved. For some large machine, nr_cpus=1 might not be enough, we have to use nr_cpus=4 or more, it is also helpful for the vmcore parallel dumping :-) Regards, Xunlei ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic
On 01/24/2017 at 09:46 AM, Xunlei Pang wrote: > On 01/24/2017 at 01:51 AM, Borislav Petkov wrote: >> Hey Tony, >> >> a "welcome back" is in order? :-) >> >> On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote: >>> If the system had experienced some memory corruption, but >>> recovered ... then there would be some pages sitting around >>> that the old kernel had marked as POISON and stopped using. >>> The kexec'd kernel doesn't know about these, so may touch that >>> memory while taking a crash dump ... >> Hmm, pass a list of poisoned pages to the kdump kernel so as not to >> touch. Looks like there's already functionality for that: >> >> "makedumpfile can exclude the following types of pages while copying >> VMCORE to DUMPFILE, and a user can choose which type of pages will be >> excluded. >> >> - Pages filled with zero >> - Cache pages >> - User process data pages >> - Free pages" >> >> (there is a makedumpfile manpage somewhere) >> >> And apparently crash knows about poisoned pages and handles them: >> >> static int __init crash_save_vmcoreinfo_init(void) >> { >> ... >> #ifdef CONFIG_MEMORY_FAILURE >> VMCOREINFO_NUMBER(PG_hwpoison); >> #endif >> >> so if that works, the kexeced kernel should know about that list. > From the log in my previous reply, MCE occurred before makedumpfile dumping, > so I guess if the poisoned ones belong to the crash reserved memory or other > type of events? Another possibility may be from any system.reserved/pcie memory which are shared between 1st and 2nd kernel. > > Besides, some kdump kernel may not use makedumpfile, for example a simple "cp" > is also allowed to process "/proc/vmcore". > >>> and then you have a broadcast machine check (on older[1] Intel CPUs >>> that don't support local machine check). >> Right. >> >>> This is hard to work around. You really need all the CPUs to have set >>> CR4.MCE=1 (if any didn't, then they will force a reset when they see >>> the machine check). Also you need to make sure that they jump to the >>> copy of do_machine_check() in the new kernel, not the old kernel. >> Doesn't matter, right? The new copy is as clueless as the old one about >> those MCEs. >> > It's the code in mce_start(), it waits for all the online cpus including the > cpus > that kdump boots on to synchronize. > > So for new mce handler of kdump kernel, it is fine as the number of online > cpus > is correct; as for old mce handler of 1st kernel, it's not true because some > cpus > which are regarded online from 1st kernel's view are running the 2nd kernel > now, > they can't respond to the old mce handler which will timeout the old mce > handler. > > Regards, > Xunlei ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic
On 01/24/2017 at 01:51 AM, Borislav Petkov wrote: > Hey Tony, > > a "welcome back" is in order? :-) > > On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote: >> If the system had experienced some memory corruption, but >> recovered ... then there would be some pages sitting around >> that the old kernel had marked as POISON and stopped using. >> The kexec'd kernel doesn't know about these, so may touch that >> memory while taking a crash dump ... > Hmm, pass a list of poisoned pages to the kdump kernel so as not to > touch. Looks like there's already functionality for that: > > "makedumpfile can exclude the following types of pages while copying > VMCORE to DUMPFILE, and a user can choose which type of pages will be > excluded. > > - Pages filled with zero > - Cache pages > - User process data pages > - Free pages" > > (there is a makedumpfile manpage somewhere) > > And apparently crash knows about poisoned pages and handles them: > > static int __init crash_save_vmcoreinfo_init(void) > { > ... > #ifdef CONFIG_MEMORY_FAILURE > VMCOREINFO_NUMBER(PG_hwpoison); > #endif > > so if that works, the kexeced kernel should know about that list. >From the log in my previous reply, MCE occurred before makedumpfile dumping, so I guess if the poisoned ones belong to the crash reserved memory or other type of events? Besides, some kdump kernel may not use makedumpfile, for example a simple "cp" is also allowed to process "/proc/vmcore". > >> and then you have a broadcast machine check (on older[1] Intel CPUs >> that don't support local machine check). > Right. > >> This is hard to work around. You really need all the CPUs to have set >> CR4.MCE=1 (if any didn't, then they will force a reset when they see >> the machine check). Also you need to make sure that they jump to the >> copy of do_machine_check() in the new kernel, not the old kernel. > Doesn't matter, right? The new copy is as clueless as the old one about > those MCEs. > It's the code in mce_start(), it waits for all the online cpus including the cpus that kdump boots on to synchronize. So for new mce handler of kdump kernel, it is fine as the number of online cpus is correct; as for old mce handler of 1st kernel, it's not true because some cpus which are regarded online from 1st kernel's view are running the 2nd kernel now, they can't respond to the old mce handler which will timeout the old mce handler. Regards, Xunlei ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic
On Mon, Jan 23, 2017 at 10:01:53AM -0800, Luck, Tony wrote: > will ignore the machine check on the other cpus ... assuming > that "cpu_is_offline(smp_processor_id())" does the right thing > in the kexec case where this is an "old" cpu that isn't online > in the new kernel. Nice. And kdump did do the dumping on one CPU, AFAIR. So we should be good there. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic
On Mon, Jan 23, 2017 at 06:51:30PM +0100, Borislav Petkov wrote: > Hey Tony, > > a "welcome back" is in order? :-) Yes - first day back today. Lots of catching up to do. > And apparently crash knows about poisoned pages and handles them: > > static int __init crash_save_vmcoreinfo_init(void) > { > ... > #ifdef CONFIG_MEMORY_FAILURE > VMCOREINFO_NUMBER(PG_hwpoison); > #endif > > so if that works, the kexeced kernel should know about that list. Oh good ... it is smarter than I thought. > Doesn't matter, right? The new copy is as clueless as the old one about > those MCEs. If things are well enough initialized that we don't reset, and get to do_machine_check(), then this code from Ashok: /* If this CPU is offline, just bail out. */ if (cpu_is_offline(smp_processor_id())) { u64 mcgstatus; mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS); if (mcgstatus & MCG_STATUS_RIPV) { mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); return; } } will ignore the machine check on the other cpus ... assuming that "cpu_is_offline(smp_processor_id())" does the right thing in the kexec case where this is an "old" cpu that isn't online in the new kernel. -Tony ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic
Hey Tony, a "welcome back" is in order? :-) On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote: > If the system had experienced some memory corruption, but > recovered ... then there would be some pages sitting around > that the old kernel had marked as POISON and stopped using. > The kexec'd kernel doesn't know about these, so may touch that > memory while taking a crash dump ... Hmm, pass a list of poisoned pages to the kdump kernel so as not to touch. Looks like there's already functionality for that: "makedumpfile can exclude the following types of pages while copying VMCORE to DUMPFILE, and a user can choose which type of pages will be excluded. - Pages filled with zero - Cache pages - User process data pages - Free pages" (there is a makedumpfile manpage somewhere) And apparently crash knows about poisoned pages and handles them: static int __init crash_save_vmcoreinfo_init(void) { ... #ifdef CONFIG_MEMORY_FAILURE VMCOREINFO_NUMBER(PG_hwpoison); #endif so if that works, the kexeced kernel should know about that list. > and then you have a broadcast machine check (on older[1] Intel CPUs > that don't support local machine check). Right. > This is hard to work around. You really need all the CPUs to have set > CR4.MCE=1 (if any didn't, then they will force a reset when they see > the machine check). Also you need to make sure that they jump to the > copy of do_machine_check() in the new kernel, not the old kernel. Doesn't matter, right? The new copy is as clueless as the old one about those MCEs. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic
On Mon, Jan 23, 2017 at 03:50:56PM +0100, Borislav Petkov wrote: > On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote: > > One possible timing sequence would be: > > 1st kernel running on multiple cpus panicked > > then the crash dump code starts > > the crash dump code stops the others cpus except the crashing one > > 2nd kernel boots up on the crash cpu with "nr_cpus=1" > > some broadcasted mce comes on some cpu amongst the other cpus(not the > > crashing cpu) > > Where does this broadcasted MCE come from? > > The crash dump code triggered it? Or it happened before the panic()? > > Are you talking about an *actual* sequence which you're experiencing on > real hw or is this something hypothetical? If the system had experienced some memory corruption, but recovered ... then there would be some pages sitting around that the old kernel had marked as POISON and stopped using. The kexec'd kernel doesn't know about these, so may touch that memory while taking a crash dump ... and then you have a broadcast machine check (on older[1] Intel CPUs that don't support local machine check). This is hard to work around. You really need all the CPUs to have set CR4.MCE=1 (if any didn't, then they will force a reset when they see the machine check). Also you need to make sure that they jump to the copy of do_machine_check() in the new kernel, not the old kernel. A while ago I played with the nr_cpus=N code to have it bring all the CPUs far enough online to get the machine check initialization done, then any extras above "N" just go back offline again. But I never got this to work reliably. -Tony [1] older == all released ones, at the moment. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic
On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote: > One possible timing sequence would be: > 1st kernel running on multiple cpus panicked > then the crash dump code starts > the crash dump code stops the others cpus except the crashing one > 2nd kernel boots up on the crash cpu with "nr_cpus=1" > some broadcasted mce comes on some cpu amongst the other cpus(not the > crashing cpu) Where does this broadcasted MCE come from? The crash dump code triggered it? Or it happened before the panic()? Are you talking about an *actual* sequence which you're experiencing on real hw or is this something hypothetical? -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic
On 01/23/2017 at 08:51 PM, Borislav Petkov wrote: > On Mon, Jan 23, 2017 at 04:01:51PM +0800, Xunlei Pang wrote: >> We met an issue for kdump: after kdump kernel boots up, >> and there comes a broadcasted mce in first kernel, the > How does that even happen? > > Lemme try to understand this correctly: the first kernel gets an > MCE, kdump starts and boots a *whole* kernel and *then* you get the > broadcasted MCE? I have real hard time believing that. > > What happened to the approach of clearing CR4.MCE before loading the > kdump kernel, in native_machine_shutdown() or wherever does the kdump > gets loaded... > One possible timing sequence would be: 1st kernel running on multiple cpus panicked then the crash dump code starts the crash dump code stops the others cpus except the crashing one 2nd kernel boots up on the crash cpu with "nr_cpus=1" some broadcasted mce comes on some cpu amongst the other cpus(not the crashing cpu) the other cpus enter old mce handler of 1st kernel, while crash cpu enters new mce handler of 2nd kernel the old mce handler of 1st kernel will timeout and panic due to mce syncrhonization under default setting Regards, Xunlei ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2] x86/crash: Update the stale comment in reserve_crashkernel()
On 01/23/2017 at 04:48 PM, Dave Young wrote: > Hi, Xunlei > > On 01/23/17 at 02:48pm, Xunlei Pang wrote: >> CRASH_KERNEL_ADDR_MAX has been missing for a long time, >> update it with more detailed explanation. >> >> Cc: Robert LeBlanc>> Cc: Baoquan He >> Signed-off-by: Xunlei Pang >> --- >> arch/x86/kernel/setup.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c >> index 4cfba94..c32a167 100644 >> --- a/arch/x86/kernel/setup.c >> +++ b/arch/x86/kernel/setup.c >> @@ -575,7 +575,9 @@ static void __init reserve_crashkernel(void) >> /* 0 means: find the address automatically */ >> if (crash_base <= 0) { >> /* >> - * kexec want bzImage is below CRASH_KERNEL_ADDR_MAX >> + * Set CRASH_ADDR_LOW_MAX upper bound for crash memory >> + * as old kexec-tools loads bzImage below that, unless >> + * "crashkernel=size[KMG],high" is specified. > There is already comment before the define of those macros, also > there are 32bit case which has a different reason about 512M there as > well. If we see from the kexec's perspective, we have a common CRASH_ADDR_LOW_MAX definition for both x86 32-bit and 64-bit(32-bit x86 has the same value defined for CRASH_ADDR_LOW_MAX and CRASH_ADDR_HIGH_MAX), so old kexec will load below CRASH_ADDR_LOW_MAX, so I think the description is fine :-) Regards, Xunlei > > So it looks better to just drop the one line comment without adding > further comments here. >> */ >> crash_base = memblock_find_in_range(CRASH_ALIGN, >> high ? CRASH_ADDR_HIGH_MAX >> -- >> 1.8.3.1 >> > Thanks > Dave > > ___ > kexec mailing list > kexec@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/kexec ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic
On Mon, Jan 23, 2017 at 04:01:51PM +0800, Xunlei Pang wrote: > We met an issue for kdump: after kdump kernel boots up, > and there comes a broadcasted mce in first kernel, the How does that even happen? Lemme try to understand this correctly: the first kernel gets an MCE, kdump starts and boots a *whole* kernel and *then* you get the broadcasted MCE? I have real hard time believing that. What happened to the approach of clearing CR4.MCE before loading the kdump kernel, in native_machine_shutdown() or wherever does the kdump gets loaded... -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v29 3/9] arm64: kdump: reserve memory for crash dump kernel
On Mon, Jan 23, 2017 at 06:51:46PM +0900, AKASHI Takahiro wrote: > Mark, > > On Thu, Jan 19, 2017 at 11:28:50AM +, Mark Rutland wrote: > > On Thu, Jan 19, 2017 at 06:49:42PM +0900, AKASHI Takahiro wrote: > > > On Tue, Jan 17, 2017 at 11:54:42AM +, Mark Rutland wrote: > > > > On Tue, Jan 17, 2017 at 05:20:44PM +0900, AKASHI Takahiro wrote: > > > > > On Fri, Jan 13, 2017 at 11:39:15AM +, Mark Rutland wrote: > > > > I don't think we have much code useful for unmapping. We could re-use > > > > create_mapping_late for this, passing a set of prot bits that means the > > > > entries are invalid (e.g. have a PAGE_KERNEL_INVALID). > > > > > > Do you really think that we should totally invalidate mmu entries? > > > I guess that, given proper cache & TLB flush operations, RO attribute is > > > good enough for memory consistency, no? > > > (None accesses the region, as I said, except in the case of re-loading > > > crash dump kernel though.) > > > > My worry is that the first kernel and kdump kernel may map (portions of) > > the region with potentially confliciting memory attributes. So it would > > be necessary to completely unmap the region. > > I think that this can happen only if the second kernel boots up, > leaving non-crashed cpus still running for some reason. Yes. I was considering a kdump case where a secondary was stuck in the first kernel. > > You raise a good point that this would also mean we need to perform some > > cache maintenance, which makes that a little more painful. We'd need a > > sequence like: > > > > * Unmap the region > > * TLB invalidation > > * Remap the region with non-cacheable attributes > > * Cache maintenance > > * Unmap the region > > * TLB invalidation > > I don't get why we need to remap the region and do cache > maintenance here. Please elaborate a bit more? I think I was wrong, and we don't need to. Sorry about that. My thought was that to ensure that there aren't stale lines with differing attributes, we'd need to do a clean+invalidate while the caches are guaranteed to not allocate anything furthe. Hence, we'd need to use a non-cacheable mapping to perform the clean+invalidate. However, I now think that so long as we unmap the range, this shouldn't matter. The new kernel can perform the maintenance if it wishes to use different attributes, similarly to what the first kernel must do per the boot protocol. > My current implementation of arch_kexec_protect_crashkres() is: > > kexec_segment_flush(kexec_crash_image); > create_mapping_late(crashk_res.start, ..., __pgprot(0)); > or PAGE_KERNEL_INVALID > flush_tlb_all(); > > kexec_segment_flush() will eventually do dcache-flush for all the modified > data in crash dump kernel memory. I now think this should be fine, per the above. Thanks, Mark. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v29 3/9] arm64: kdump: reserve memory for crash dump kernel
Mark, On Thu, Jan 19, 2017 at 11:28:50AM +, Mark Rutland wrote: > On Thu, Jan 19, 2017 at 06:49:42PM +0900, AKASHI Takahiro wrote: > > On Tue, Jan 17, 2017 at 11:54:42AM +, Mark Rutland wrote: > > > On Tue, Jan 17, 2017 at 05:20:44PM +0900, AKASHI Takahiro wrote: > > > > On Fri, Jan 13, 2017 at 11:39:15AM +, Mark Rutland wrote: > > > > > Great! I think it would be better to follow the approach of > > > > > mark_rodata_ro(), rather than opening up set_memory_*(), but > > > > > otherwise, > > > > > it looks like it should work. > > > > > > > > I'm not quite sure what the approach of mark_rodata_ro() means, but > > > > I found that using create_mapping_late() may cause two problems: > > > > > > > > 1) it fails when PTE_CONT bits mismatch between an old and new mmu > > > > entry. > > > >This can happen, say, if the memory range for crash dump kernel > > > >starts in the mid of _continuous_ pages. > > > > > > That should only happen if we try to remap a segment different to what > > > we originally mapped. > > > > > > I was intending that we'd explicitly map the reserved region separately > > > in the boot path, like we do for kernel segments in map_kernel(). We > > > would allow sections and/or CONT entires. > > > > > > Then, in __map_memblock() we'd then skip that range as we do for the > > > linear map alias of the kernel image. > > > > > > That way, we can later use create_mapping_late for that same region, and > > > it should handle sections and/or CONT entries in the exact same way as > > > it does for the kernel image segments in mark_rodata_ro(). > > > > I see. > > Which one do you prefer, yours above or my (second) solution? > > Either way, they do almost the same thing in terms of mapping. > > While both should work, I'd prefer to match the existing map_kernel() > logic, (i.e. my suggestion above), for consistency. OK > > > I don't think we have much code useful for unmapping. We could re-use > > > create_mapping_late for this, passing a set of prot bits that means the > > > entries are invalid (e.g. have a PAGE_KERNEL_INVALID). > > > > Do you really think that we should totally invalidate mmu entries? > > I guess that, given proper cache & TLB flush operations, RO attribute is > > good enough for memory consistency, no? > > (None accesses the region, as I said, except in the case of re-loading > > crash dump kernel though.) > > My worry is that the first kernel and kdump kernel may map (portions of) > the region with potentially confliciting memory attributes. So it would > be necessary to completely unmap the region. I think that this can happen only if the second kernel boots up, leaving non-crashed cpus still running for some reason. > You raise a good point that this would also mean we need to perform some > cache maintenance, which makes that a little more painful. We'd need a > sequence like: > > * Unmap the region > * TLB invalidation > * Remap the region with non-cacheable attributes > * Cache maintenance > * Unmap the region > * TLB invalidation I don't get why we need to remap the region and do cache maintenance here. Please elaborate a bit more? My current implementation of arch_kexec_protect_crashkres() is: kexec_segment_flush(kexec_crash_image); create_mapping_late(crashk_res.start, ..., __pgprot(0)); or PAGE_KERNEL_INVALID flush_tlb_all(); kexec_segment_flush() will eventually do dcache-flush for all the modified data in crash dump kernel memory. > > > We'd have to perform the TLB invalidation ourselves, but that shouldn't > > > be too painful. > > > > Do we need to invalidate TLBs not only before but also after changing > > permission attributes as make_rodata_ro() does? > > I believe we'd only have to perform the TLB invalidation after the > change of attributes. OK Thanks, -Takahiro AKASHI > Thanks, > Mark. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2] x86/crash: Update the stale comment in reserve_crashkernel()
Hi, Xunlei On 01/23/17 at 02:48pm, Xunlei Pang wrote: > CRASH_KERNEL_ADDR_MAX has been missing for a long time, > update it with more detailed explanation. > > Cc: Robert LeBlanc> Cc: Baoquan He > Signed-off-by: Xunlei Pang > --- > arch/x86/kernel/setup.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c > index 4cfba94..c32a167 100644 > --- a/arch/x86/kernel/setup.c > +++ b/arch/x86/kernel/setup.c > @@ -575,7 +575,9 @@ static void __init reserve_crashkernel(void) > /* 0 means: find the address automatically */ > if (crash_base <= 0) { > /* > - * kexec want bzImage is below CRASH_KERNEL_ADDR_MAX > + * Set CRASH_ADDR_LOW_MAX upper bound for crash memory > + * as old kexec-tools loads bzImage below that, unless > + * "crashkernel=size[KMG],high" is specified. There is already comment before the define of those macros, also there are 32bit case which has a different reason about 512M there as well. So it looks better to just drop the one line comment without adding further comments here. >*/ > crash_base = memblock_find_in_range(CRASH_ALIGN, > high ? CRASH_ADDR_HIGH_MAX > -- > 1.8.3.1 > Thanks Dave ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2] kexec: implemented XEN KEXEC STATUS to determine if an image is loaded
On 01/23/17 at 09:34am, Daniel Kiper wrote: > On Sat, Jan 21, 2017 at 09:43:19AM +0800, Baoquan He wrote: > > Hi, > > > > I don't strongly oppose against this, but could you tell what you have > > met makes the kexec_loaded/kexec_crash_loaded checking not convenient to > > you? > > kexec_loaded/kexec_crash_loaded exposes Linux kernel kexec/crash state. > It does not say anything about Xen kexec/crash state. So, we need a special > approach to get the latter. Though for compatibility we provide similar > functionality in kexec-tools for the former. > > I hope that helps. Thanks for telling. Understood. Then it makes sense. Better put above explanation into log. Thanks Baoquan > > > On 01/20/17 at 11:03am, Eric DeVolder wrote: > > > Instead of the scripts having to poke at various fields we can > > > provide that functionality via the -S parameter. > > > > > > Returns 0 if the payload is loaded. Can be used in combination > > > with -l or -p to get the state of the proper kexec image. > > > > > > Signed-off-by: Konrad Rzeszutek Wilk> > > Signed-off-by: Eric DeVolder > > > --- > > > Note: The corresponding Xen changes have been committed > > > to the Xen staging branch. Follow this thread: > > > https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg01570.html > > > > > > CC: Andrew Cooper > > > CC: kexec@lists.infradead.org > > > CC: xen-de...@lists.xenproject.org > > > CC: Daniel Kiper > > > > > > v0: First version (internal product). > > > v1: Posted on kexec mailing list. Changed -s to -S > > > v2: Incorporated feedback from kexec mailing list > > > --- > > > configure.ac | 8 ++- > > > kexec/kexec-xen.c | 26 +++ > > > kexec/kexec.8 | 6 ++ > > > kexec/kexec.c | 62 > > > --- > > > kexec/kexec.h | 5 - > > > 5 files changed, 98 insertions(+), 9 deletions(-) > > > > > > diff --git a/configure.ac b/configure.ac > > > index 3044185..c6e864b 100644 > > > --- a/configure.ac > > > +++ b/configure.ac > > > @@ -165,8 +165,14 @@ fi > > > dnl find Xen control stack libraries > > > if test "$with_xen" = yes ; then > > > AC_CHECK_HEADER(xenctrl.h, > > > - [AC_CHECK_LIB(xenctrl, xc_kexec_load, , > > > + [AC_CHECK_LIB(xenctrl, xc_kexec_load, [ have_xenctrl_h=yes ], > > > AC_MSG_NOTICE([Xen support disabled]))]) > > > +if test "$have_xenctrl_h" = yes ; then > > > + AC_CHECK_LIB(xenctrl, xc_kexec_status, > > > + AC_DEFINE(HAVE_KEXEC_CMD_STATUS, 1, > > > + [The kexec_status call is available]), > > > + AC_MSG_NOTICE([The kexec_status call is not available])) > > > +fi > > Eric, I have a feeling that you should add en extra indentation > for lines starting from "+if test "$have_xenctrl_h"..." and > ending at "+fi". > > Daniel ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2] kexec: implemented XEN KEXEC STATUS to determine if an image is loaded
On Sat, Jan 21, 2017 at 09:43:19AM +0800, Baoquan He wrote: > Hi, > > I don't strongly oppose against this, but could you tell what you have > met makes the kexec_loaded/kexec_crash_loaded checking not convenient to > you? kexec_loaded/kexec_crash_loaded exposes Linux kernel kexec/crash state. It does not say anything about Xen kexec/crash state. So, we need a special approach to get the latter. Though for compatibility we provide similar functionality in kexec-tools for the former. I hope that helps. > On 01/20/17 at 11:03am, Eric DeVolder wrote: > > Instead of the scripts having to poke at various fields we can > > provide that functionality via the -S parameter. > > > > Returns 0 if the payload is loaded. Can be used in combination > > with -l or -p to get the state of the proper kexec image. > > > > Signed-off-by: Konrad Rzeszutek Wilk> > Signed-off-by: Eric DeVolder > > --- > > Note: The corresponding Xen changes have been committed > > to the Xen staging branch. Follow this thread: > > https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg01570.html > > > > CC: Andrew Cooper > > CC: kexec@lists.infradead.org > > CC: xen-de...@lists.xenproject.org > > CC: Daniel Kiper > > > > v0: First version (internal product). > > v1: Posted on kexec mailing list. Changed -s to -S > > v2: Incorporated feedback from kexec mailing list > > --- > > configure.ac | 8 ++- > > kexec/kexec-xen.c | 26 +++ > > kexec/kexec.8 | 6 ++ > > kexec/kexec.c | 62 > > --- > > kexec/kexec.h | 5 - > > 5 files changed, 98 insertions(+), 9 deletions(-) > > > > diff --git a/configure.ac b/configure.ac > > index 3044185..c6e864b 100644 > > --- a/configure.ac > > +++ b/configure.ac > > @@ -165,8 +165,14 @@ fi > > dnl find Xen control stack libraries > > if test "$with_xen" = yes ; then > > AC_CHECK_HEADER(xenctrl.h, > > - [AC_CHECK_LIB(xenctrl, xc_kexec_load, , > > + [AC_CHECK_LIB(xenctrl, xc_kexec_load, [ have_xenctrl_h=yes ], > > AC_MSG_NOTICE([Xen support disabled]))]) > > +if test "$have_xenctrl_h" = yes ; then > > + AC_CHECK_LIB(xenctrl, xc_kexec_status, > > + AC_DEFINE(HAVE_KEXEC_CMD_STATUS, 1, > > + [The kexec_status call is available]), > > + AC_MSG_NOTICE([The kexec_status call is not available])) > > +fi Eric, I have a feeling that you should add en extra indentation for lines starting from "+if test "$have_xenctrl_h"..." and ending at "+fi". Daniel ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic
We met an issue for kdump: after kdump kernel boots up, and there comes a broadcasted mce in first kernel, the other cpus remaining in first kernel will enter the old mce handler of first kernel, then timeout and panic due to MCE synchronization, finally reset the kdump cpus. This patch lets cpus stay quiet when panic happens, so before crash cpu shots them down or after kdump boots, they should not do anything except clearing MCG_STATUS in case of broadcasted mce. This is useful for kdump to let the vmcore dumping perform as hard as it can. Previous efforts: https://patchwork.kernel.org/patch/6167631/ https://lists.gt.net/linux/kernel/2146557 Cc: Naoya HoriguchiSigned-off-by: Xunlei Pang --- arch/x86/kernel/cpu/mcheck/mce.c | 24 +--- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 00ef432..0c2bf77 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -1157,6 +1157,23 @@ void do_machine_check(struct pt_regs *regs, long error_code) mce_gather_info(, regs); + /* +* Check if this MCE is signaled to only this logical processor, +* on Intel only. +*/ + if (m.cpuvendor == X86_VENDOR_INTEL) + lmce = m.mcgstatus & MCG_STATUS_LMCES; + + /* +* Special treatment for Intel broadcasted machine check: +* To avoid panic due to MCE synchronization in case of kdump, +* after system panic, clear global status and bail out. +*/ + if (!lmce && atomic_read(_cpu) != PANIC_CPU_INVALID) { + wrmsrl(MSR_IA32_MCG_STATUS, 0); + goto out; + } + final = this_cpu_ptr(_seen); *final = m; @@ -1174,13 +1191,6 @@ void do_machine_check(struct pt_regs *regs, long error_code) kill_it = 1; /* -* Check if this MCE is signaled to only this logical processor, -* on Intel only. -*/ - if (m.cpuvendor == X86_VENDOR_INTEL) - lmce = m.mcgstatus & MCG_STATUS_LMCES; - - /* * Go through all banks in exclusion of the other CPUs. This way we * don't report duplicated events on shared banks because the first one * to see it will clear it. If this is a Local MCE, then no need to -- 1.8.3.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec