Re: kexec regression since 4.9 caused by efi

2017-03-22 Thread Dave Young
On 03/22/17 at 04:10pm, Ard Biesheuvel wrote:
> On 21 March 2017 at 07:48, Dave Young  wrote:
> > On 03/20/17 at 10:14am, Dave Young wrote:
> >> On 03/17/17 at 01:32pm, Matt Fleming wrote:
> >> > On Fri, 17 Mar, at 10:09:51AM, Dave Young wrote:
> >> > >
> >> > > Matt, I think it should be fine although I think the md type checking 
> >> > > in
> >> > > efi_mem_desc_lookup() is causing confusion and not easy to understand..
> >> >
> >> > Could you make that a separate patch if you think of improvements
> >> > there?
> >>
> >> Duplicate the lookup function is indeed a little ugly, will do it when I
> >> have a better idea, we can leave it as is since it works.
> >
> > Matt, rethinking about this, how about doint something below, not
> > tested, just seeking for idea and opinons, in this way no need duplicate
> > a function, but there is an assumption that no overlapped mem ranges in
> > efi memmap.
> >
> > Also the case Omar reported is the esrt memory range type is
> > RUNTIME_DATA, that is a little different with the mem attribute of
> > RUNTIME which also includes BOOT_DATA which has been set the RUNTIME
> > attribute, like bgrt in kexec reboot. Should we distinguish the two
> > cases and give out some warnings or debug info?
> >
> >
> > ---
> >  arch/x86/platform/efi/quirks.c |5 +
> >  drivers/firmware/efi/efi.c |6 --
> >  drivers/firmware/efi/esrt.c|7 +++
> >  3 files changed, 12 insertions(+), 6 deletions(-)
> >
> > --- linux-x86.orig/drivers/firmware/efi/efi.c
> > +++ linux-x86/drivers/firmware/efi/efi.c
> > @@ -376,12 +376,6 @@ int __init efi_mem_desc_lookup(u64 phys_
> > u64 size;
> > u64 end;
> >
> > -   if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> > -   md->type != EFI_BOOT_SERVICES_DATA &&
> > -   md->type != EFI_RUNTIME_SERVICES_DATA) {
> > -   continue;
> > -   }
> > -
> > size = md->num_pages << EFI_PAGE_SHIFT;
> > end = md->phys_addr + size;
> > if (phys_addr >= md->phys_addr && phys_addr < end) {
> > --- linux-x86.orig/drivers/firmware/efi/esrt.c
> > +++ linux-x86/drivers/firmware/efi/esrt.c
> > @@ -258,6 +258,13 @@ void __init efi_esrt_init(void)
> > return;
> > }
> >
> > +   if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> > + md->type != EFI_BOOT_SERVICES_DATA &&
> > + md->type != EFI_RUNTIME_SERVICES_DATA) {
> > +   pr_err("ESRT header memory map type is invalid\n");
> > +   return;
> > +   }
> > +
> 
> This looks wrong to me. While the meanings get convoluted in practice,
> the EFI_MEMORY_RUNTIME attribute only means that the firmware requests
> a virtual mapping for the region. It is perfectly legal for a
> EFI_RUNTIME_SERVICES_DATA region not to have the EFI_MEMORY_RUNTIME
> attribute, if the region is never accessed by the runtime services
> themselves, and this is not entirely unlikely for tables that the
> firmware exposes to the OS

Thanks for the comment, if so "!(md->attribute & EFI_MEMORY_RUNTIME) &&"
should be dropped.

BTW, md->type should be md.type, bgrt reserving works fine with this
change but I have no esrt machine to test it. I would like to wait for
Matt's opinions about this first before an update. 

Also cc Peter about the esrt piece.
> 
> > max = efi_mem_desc_end();
> > if (max < efi.esrt) {
> > pr_err("EFI memory descriptor is invalid. (esrt: %p max: 
> > %p)\n",
> > --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> > +++ linux-x86/arch/x86/platform/efi/quirks.c
> > @@ -201,6 +201,11 @@ void __init efi_arch_mem_reserve(phys_ad
> > return;
> > }
> >
> > +   if (md->attribute & EFI_MEMORY_RUNTIME ||
> > + md->type != EFI_BOOT_SERVICES_DATA) {
> > +   return;
> > +   }
> > +
> > size += addr % EFI_PAGE_SIZE;
> > size = round_up(size, EFI_PAGE_SIZE);
> > addr = round_down(addr, EFI_PAGE_SIZE);
> >
> >>
> >> >
> >> > > How about move the if chunk early like below because it seems no need
> >> > > to sanity check the addr + size any more if the md is still RUNTIME?
> >> >
> >> > My original version did as you suggest, but I changed it because we
> >> > *really* want to know if someone tries to reserve a range that spans
> >> > regions. That would be totally unexpected and a warning about a
> >> > potential bug/issue.
> >>
> >> Matt, I'm fine if you prefer to capture the range checking errors.
> >> Would you like me to post it or just you send it out?
> >>
> >> Thanks
> >> Dave
> >
> > Thanks
> > Dave
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-efi" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

___
kexec mailing list

Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-22 Thread Michael Holzheu
Am Wed, 22 Mar 2017 12:30:04 +0800
schrieb Dave Young :

> On 03/21/17 at 10:18pm, Eric W. Biederman wrote:
> > Dave Young  writes:
> > 

[snip]

> > > I think makedumpfile is using it, but I also vote to remove the
> > > CRASHTIME. It is better not to do this while crashing and a makedumpfile
> > > userspace patch is needed to drop the use of it.
> > >
> > >> 
> > >> As we are looking at reliability concerns removing CRASHTIME should make
> > >> everything in vmcoreinfo a boot time constant.  Which should simplify
> > >> everything considerably.
> > >
> > > It is a nice improvement..
> > 
> > We also need to take a close look at what s390 is doing with vmcoreinfo.
> > As apparently it is reading it in a different kind of crashdump process.
> 
> Yes, need careful review from s390 and maybe ppc64 especially about
> patch 2/3, better to have comments from IBM about s390 dump tool and ppc
> fadump. Added more cc.

On s390 we have at least an issue with patch 1/3. For stand-alone dump
and also because we create the ELF header for kdump in the new
kernel we save the pointer to the vmcoreinfo note in the old kernel on a
defined memory address in our absolute zero lowcore.

This is done in arch/s390/kernel/setup.c:

static void __init setup_vmcoreinfo(void)
{
mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
}

Since with patch 1/3 paddr_vmcoreinfo_note() returns NULL at this point in
time we have a problem here.

To solve this - I think - we could move the initialization to
arch/s390/kernel/machine_kexec.c:

void arch_crash_save_vmcoreinfo(void)
{
VMCOREINFO_SYMBOL(lowcore_ptr);
VMCOREINFO_SYMBOL(high_memory);
VMCOREINFO_LENGTH(lowcore_ptr, NR_CPUS);
mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
}

Probably related to this is my observation that patch 3/3 leads to
an empty VMCOREINFO note for kdump on s390. The note is there ...

# readelf -n /var/crash/127.0.0.1-2017-03-22-21:14:39/vmcore | grep VMCORE
  VMCOREINFO   0x068e   Unknown note type: (0x)

But it contains only zeros.

Unfortunately I have not yet understood the reason for this.

Michael


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v33 00/14] add kdump support

2017-03-22 Thread Goel, Sameer
I tested this patch set on a QDT2400 device with 4k 4-level setup. This 
patchset worked fine with maxcpus=1 on the first kernel.

I don't think this is an issue with the patchset, I am still triaging this 
issue further.

On 3/15/2017 3:56 AM, AKASHI Takahiro wrote:
> This patch series adds kdump support on arm64.
> 
> To load a crash-dump kernel to the systems, a series of patches to
> kexec-tools[1] are also needed. Please use the latest one, v6 [2].
> For your convinience, you can pick them up from:
>https://git.linaro.org/people/takahiro.akashi/linux-aarch64.git arm64/kdump
>https://git.linaro.org/people/takahiro.akashi/kexec-tools.git arm64/kdump
> 
> To examine vmcore (/proc/vmcore) on a crash-dump kernel, you can use
>   - crash utility (v7.1.8 or later) [3]
> 
> I tested this patchset on fast model and hikey.
> 
> The previous versions were also:
> Tested-by: Pratyush Anand  (v32, mustang and seattle)
> Tested-by: James Morse  (v27/v32?, Juno)
> Tested-by: Sameer Goel (v32, QDT2400)
> 
> Changes for v33 (Mar 15, 2017)
>   o rebased to v4.11-rc2+
>   o arch_kexec_(un)protect_crashkres() now protects loaded data segments
> only along with moving copying of control_code_page back to 
> machine_kexec()
> (patch #6)
>   o reduce the size of hibernation image when kdump and hibernation are
> comfigured at the same time (patch #7)
>   o clearify that "linux,usable-memory-range" and "linux,elfcorehdr"
> have values of the size of root node's "#address-cells" and "#size-cells"
> (patch #13)
>   o add "efi/libstub/arm*: Set default address and size cells values for
> an empty dtb" from Sameer Goel (patch #14)
> (I didn't test the case though.)
> 
> Changes for v32 (Feb 7, 2017)
>   o isolate crash dump kernel memory as well as kernel text/data by using
> MEMBLOCK_MAP attribute to and then specifically map them in map_mem()
> (patch #1,6)
>   o delete remove_pgd_mapping() and instead modify create_pgd_mapping() to
> allowing for unmapping a kernel mapping (patch #5)
>   o correct a commit message as well as a comment in the source (patch#10)
>   o other trivial changes after Mark's comments (patch#3,4)
> 
> Changes for v31 (Feb 1, 2017)
>   o add/use remove_pgd_mapping() instead of modifying (__)create_pgd_mapping()
> to protect crash dump kernel memory (patch #4,5)
>   o fix an issue at the isolation of crash dump kernel memory in
> map_mem()/__map_memblock(), adding map_crashkernel() (patch#5)
>   o preserve the contents of crash dump kernel memory around hibernation
> (patch#6)
> 
> Changes for v30 (Jan 24, 2017)
>   o rebased to Linux-v4.10-rc5
>   o remove "linux,crashkernel-base/size" from exported device tree
>   o protect memory region for crash-dump kernel (adding patch#4,5)
>   o remove "in_crash_kexec" variable
>   o and other trivial changes
> 
> Changes for v29 (Dec 28, 2016)
>   o rebased to Linux-v4.10-rc1
>   o change asm constraints in crash_setup_regs() per Catalin
> 
> Changes for v28 (Nov 22, 2016)
>   o rebased to Linux-v4.9-rc6
>   o revamp patch #1 and merge memblock_cap_memory_range() with
> memblock_mem_limit_remove_map()
> 
> Changes for v27 (Nov 1, 2016)
>   o rebased to Linux-v4.9-rc3
>   o revert v26 change, i.e. revive "linux,usable-memory-range" property
> (patch #2/#3, updating patch #9)
>   o minor fixes per review comments (patch #3/#4/#6/#8)
>   o re-order patches and improve commit messages for readability
> 
> Changes for v26 (Sep 7, 2016):
>   o Use /reserved-memory instead of "linux,usable-memory-range" property
> (dropping v25's patch#2 and #3, updating ex-patch#9.)
> 
> Changes for v25 (Aug 29, 2016):
>   o Rebase to Linux-4.8-rc4
>   o Use memremap() instead of ioremap_cache() [patch#5]
> 
> Changes for v24 (Aug 9, 2016):
>   o Rebase to Linux-4.8-rc1
>   o Update descriptions about newly added DT proerties
> 
> Changes for v23 (July 26, 2016):
> 
>   o Move memblock_reserve() to a single place in reserve_crashkernel()
>   o Use  cpu_park_loop() in ipi_cpu_crash_stop()
>   o Always enforce ARCH_LOW_ADDRESS_LIMIT to the memory range of crash kernel
>   o Re-implement fdt_enforce_memory_region() to remove non-reserve regions
> (for ACPI) from usable memory at crash kernel
> 
> Changes for v22 (July 12, 2016):
> 
>   o Export "crashkernel-base" and "crashkernel-size" via device-tree,
> and add some descriptions about them in chosen.txt
>   o Rename "usable-memory" to "usable-memory-range" to avoid inconsistency
> with powerpc's "usable-memory"
>   o Make cosmetic changes regarding "ifdef" usage
>   o Correct some wordings in kdump.txt
> 
> Changes for v21 (July 6, 2016):
> 
>   o Remove kexec patches.
>   o Rebase to arm64's for-next/core (Linux-4.7-rc4 based).
>   o Clarify the description about kvm in kdump.txt.
> 
> See the link [4] for older changes.
> 
> 
> [1] https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
> [2] 

Re: kexec regression since 4.9 caused by efi

2017-03-22 Thread Ard Biesheuvel
On 21 March 2017 at 07:48, Dave Young  wrote:
> On 03/20/17 at 10:14am, Dave Young wrote:
>> On 03/17/17 at 01:32pm, Matt Fleming wrote:
>> > On Fri, 17 Mar, at 10:09:51AM, Dave Young wrote:
>> > >
>> > > Matt, I think it should be fine although I think the md type checking in
>> > > efi_mem_desc_lookup() is causing confusion and not easy to understand..
>> >
>> > Could you make that a separate patch if you think of improvements
>> > there?
>>
>> Duplicate the lookup function is indeed a little ugly, will do it when I
>> have a better idea, we can leave it as is since it works.
>
> Matt, rethinking about this, how about doint something below, not
> tested, just seeking for idea and opinons, in this way no need duplicate
> a function, but there is an assumption that no overlapped mem ranges in
> efi memmap.
>
> Also the case Omar reported is the esrt memory range type is
> RUNTIME_DATA, that is a little different with the mem attribute of
> RUNTIME which also includes BOOT_DATA which has been set the RUNTIME
> attribute, like bgrt in kexec reboot. Should we distinguish the two
> cases and give out some warnings or debug info?
>
>
> ---
>  arch/x86/platform/efi/quirks.c |5 +
>  drivers/firmware/efi/efi.c |6 --
>  drivers/firmware/efi/esrt.c|7 +++
>  3 files changed, 12 insertions(+), 6 deletions(-)
>
> --- linux-x86.orig/drivers/firmware/efi/efi.c
> +++ linux-x86/drivers/firmware/efi/efi.c
> @@ -376,12 +376,6 @@ int __init efi_mem_desc_lookup(u64 phys_
> u64 size;
> u64 end;
>
> -   if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> -   md->type != EFI_BOOT_SERVICES_DATA &&
> -   md->type != EFI_RUNTIME_SERVICES_DATA) {
> -   continue;
> -   }
> -
> size = md->num_pages << EFI_PAGE_SHIFT;
> end = md->phys_addr + size;
> if (phys_addr >= md->phys_addr && phys_addr < end) {
> --- linux-x86.orig/drivers/firmware/efi/esrt.c
> +++ linux-x86/drivers/firmware/efi/esrt.c
> @@ -258,6 +258,13 @@ void __init efi_esrt_init(void)
> return;
> }
>
> +   if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> + md->type != EFI_BOOT_SERVICES_DATA &&
> + md->type != EFI_RUNTIME_SERVICES_DATA) {
> +   pr_err("ESRT header memory map type is invalid\n");
> +   return;
> +   }
> +

This looks wrong to me. While the meanings get convoluted in practice,
the EFI_MEMORY_RUNTIME attribute only means that the firmware requests
a virtual mapping for the region. It is perfectly legal for a
EFI_RUNTIME_SERVICES_DATA region not to have the EFI_MEMORY_RUNTIME
attribute, if the region is never accessed by the runtime services
themselves, and this is not entirely unlikely for tables that the
firmware exposes to the OS

> max = efi_mem_desc_end();
> if (max < efi.esrt) {
> pr_err("EFI memory descriptor is invalid. (esrt: %p max: 
> %p)\n",
> --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> +++ linux-x86/arch/x86/platform/efi/quirks.c
> @@ -201,6 +201,11 @@ void __init efi_arch_mem_reserve(phys_ad
> return;
> }
>
> +   if (md->attribute & EFI_MEMORY_RUNTIME ||
> + md->type != EFI_BOOT_SERVICES_DATA) {
> +   return;
> +   }
> +
> size += addr % EFI_PAGE_SIZE;
> size = round_up(size, EFI_PAGE_SIZE);
> addr = round_down(addr, EFI_PAGE_SIZE);
>
>>
>> >
>> > > How about move the if chunk early like below because it seems no need
>> > > to sanity check the addr + size any more if the md is still RUNTIME?
>> >
>> > My original version did as you suggest, but I changed it because we
>> > *really* want to know if someone tries to reserve a range that spans
>> > regions. That would be totally unexpected and a warning about a
>> > potential bug/issue.
>>
>> Matt, I'm fine if you prefer to capture the range checking errors.
>> Would you like me to post it or just you send it out?
>>
>> Thanks
>> Dave
>
> Thanks
> Dave
> --
> To unsubscribe from this list: send the line "unsubscribe linux-efi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC][PATCH 3/4] kernel, power: disable printk_kthread in unsafe places

2017-03-22 Thread Petr Mladek
(added Eric, kexec mailing likst, and linux-pm mailing list
into CC because we touch their code)

On Mon 2017-03-06 21:45:53, Sergey Senozhatsky wrote:
> It's not always possible/safe to wake_up() printk kernel
> thread. For example, late suspend/early resume may printk()
> while timekeeping is not initialized yet, so calling into the
> scheduler may result in recursive warnings. Thus we need
> printk() to operate in old mode there and attempt to immediately
> flush pending kernel message to the console. This patch adds
> console_printing_thread_off/on sections.

This patch helps to avoid the warning. But IMHO, even more
important effect is that it helps to actually see the
massages when things go wrong, the kthread is never
called later, and we even do not end up in panic()
that would flush them to the console.

I am afraid that there are more locations where
the old mode need to be forced. It might take
a while until they are found and fixed.

I would really like to have an option that would
force the old mode all the time. It might help
a lot when you debug a system freeze and the messages
are not printed because of the deferred console handling.

Otherwise, the location of console_printing_thread_off()/on()
calls looks reasonable to me. But it would be great to get
some confirmation from kexec and pm people.

Best Regards,
Petr


> Signed-off-by: Sergey Senozhatsky 
> ---
>  kernel/kexec_core.c  | 4 
>  kernel/power/hibernate.c | 8 
>  kernel/power/suspend.c   | 4 
>  3 files changed, 16 insertions(+)
> 
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index bfe62d5b3872..b40fba31150c 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -1496,6 +1496,8 @@ int kernel_kexec(void)
>   goto Unlock;
>   }
>  
> + console_printing_thread_off();
> +
>  #ifdef CONFIG_KEXEC_JUMP
>   if (kexec_image->preserve_context) {
>   lock_system_sleep();
> @@ -1565,6 +1567,8 @@ int kernel_kexec(void)
>   }
>  #endif
>  
> + console_printing_thread_on();
> +
>   Unlock:
>   mutex_unlock(_mutex);
>   return error;
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index a8b978c35a6a..15634841ccc6 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -502,6 +502,7 @@ int hibernation_restore(int platform_mode)
>  {
>   int error;
>  
> + console_printing_thread_off();
>   pm_prepare_console();
>   suspend_console();
>   pm_restrict_gfp_mask();
> @@ -519,6 +520,7 @@ int hibernation_restore(int platform_mode)
>   pm_restore_gfp_mask();
>   resume_console();
>   pm_restore_console();
> + console_printing_thread_on();
>   return error;
>  }
>  
> @@ -542,6 +544,7 @@ int hibernation_platform_enter(void)
>   goto Close;
>  
>   entering_platform_hibernation = true;
> + console_printing_thread_off();
>   suspend_console();
>   error = dpm_suspend_start(PMSG_HIBERNATE);
>   if (error) {
> @@ -589,6 +592,7 @@ int hibernation_platform_enter(void)
>   entering_platform_hibernation = false;
>   dpm_resume_end(PMSG_RESTORE);
>   resume_console();
> + console_printing_thread_on();
>  
>   Close:
>   hibernation_ops->end();
> @@ -692,6 +696,7 @@ int hibernate(void)
>   goto Unlock;
>   }
>  
> + console_printing_thread_off();
>   pm_prepare_console();
>   error = __pm_notifier_call_chain(PM_HIBERNATION_PREPARE, -1, _calls);
>   if (error) {
> @@ -759,6 +764,7 @@ int hibernate(void)
>   Exit:
>   __pm_notifier_call_chain(PM_POST_HIBERNATION, nr_calls, NULL);
>   pm_restore_console();
> + console_printing_thread_on();
>   atomic_inc(_device_available);
>   Unlock:
>   unlock_system_sleep();
> @@ -868,6 +874,7 @@ static int software_resume(void)
>   goto Unlock;
>   }
>  
> + console_printing_thread_off();
>   pm_prepare_console();
>   error = __pm_notifier_call_chain(PM_RESTORE_PREPARE, -1, _calls);
>   if (error) {
> @@ -884,6 +891,7 @@ static int software_resume(void)
>   Finish:
>   __pm_notifier_call_chain(PM_POST_RESTORE, nr_calls, NULL);
>   pm_restore_console();
> + console_printing_thread_on();
>   atomic_inc(_device_available);
>   /* For success case, the suspend path will release the lock */
>   Unlock:
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 15e6baef5c73..7c6e228b7de4 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -433,6 +433,7 @@ int suspend_devices_and_enter(suspend_state_t state)
>   if (!sleep_state_supported(state))
>   return -ENOSYS;
>  
> + console_printing_thread_off();
>   error = platform_suspend_begin(state);
>   if (error)
>   goto Close;
> @@ -462,6 +463,7 @@ int suspend_devices_and_enter(suspend_state_t state)
>  
>   Close:
>   platform_resume_end(state);

Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-22 Thread Hari Bathini

Hi Dave,


On Wednesday 22 March 2017 10:00 AM, Dave Young wrote:

On 03/21/17 at 10:18pm, Eric W. Biederman wrote:

Dave Young  writes:


On 03/20/17 at 10:33pm, Eric W. Biederman wrote:

Xunlei Pang  writes:


As Eric said,
"what we need to do is move the variable vmcoreinfo_note out
of the kernel's .bss section.  And modify the code to regenerate
and keep this information in something like the control page.

Definitely something like this needs a page all to itself, and ideally
far away from any other kernel data structures.  I clearly was not
watching closely the data someone decided to keep this silly thing
in the kernel's .bss section."

This patch allocates extra pages for these vmcoreinfo_XXX variables,
one advantage is that it enhances some safety of vmcoreinfo, because
vmcoreinfo now is kept far away from other kernel data structures.

Can you preceed this patch with a patch that removes CRASHTIME from
vmcoreinfo?  If someone actually cares we can add a separate note that holds
a 64bit crashtime in the per cpu notes.

I think makedumpfile is using it, but I also vote to remove the
CRASHTIME. It is better not to do this while crashing and a makedumpfile
userspace patch is needed to drop the use of it.


As we are looking at reliability concerns removing CRASHTIME should make
everything in vmcoreinfo a boot time constant.  Which should simplify
everything considerably.

It is a nice improvement..

We also need to take a close look at what s390 is doing with vmcoreinfo.
As apparently it is reading it in a different kind of crashdump process.

Yes, need careful review from s390 and maybe ppc64 especially about
patch 2/3, better to have comments from IBM about s390 dump tool and ppc
fadump. Added more cc.


w.r.t powerpc/fadump, this patch-set works fine..

Thanks
Hari


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-22 Thread Xunlei Pang
On 03/22/2017 at 12:30 PM, Dave Young wrote:
> On 03/21/17 at 10:18pm, Eric W. Biederman wrote:
>> Dave Young  writes:
>>
>>> On 03/20/17 at 10:33pm, Eric W. Biederman wrote:
 Xunlei Pang  writes:

> As Eric said,
> "what we need to do is move the variable vmcoreinfo_note out
> of the kernel's .bss section.  And modify the code to regenerate
> and keep this information in something like the control page.
>
> Definitely something like this needs a page all to itself, and ideally
> far away from any other kernel data structures.  I clearly was not
> watching closely the data someone decided to keep this silly thing
> in the kernel's .bss section."
>
> This patch allocates extra pages for these vmcoreinfo_XXX variables,
> one advantage is that it enhances some safety of vmcoreinfo, because
> vmcoreinfo now is kept far away from other kernel data structures.
 Can you preceed this patch with a patch that removes CRASHTIME from
 vmcoreinfo?  If someone actually cares we can add a separate note that 
 holds
 a 64bit crashtime in the per cpu notes.  
>>> I think makedumpfile is using it, but I also vote to remove the
>>> CRASHTIME. It is better not to do this while crashing and a makedumpfile
>>> userspace patch is needed to drop the use of it.
>>>

By moving the CRASHTIME info to the cpu note of crashed cpu may be a good
way. In kdump kernel, notes of vmcore elfhdr will be merged into one big note
section, I don't know how makedumpfile or crash handle the big note section?
If they process the note in some order, breakage will definitely happen...

There is also a fadump may be affected.

Regards,
Xunlei

 As we are looking at reliability concerns removing CRASHTIME should make
 everything in vmcoreinfo a boot time constant.  Which should simplify
 everything considerably.
>>> It is a nice improvement..
>> We also need to take a close look at what s390 is doing with vmcoreinfo.
>> As apparently it is reading it in a different kind of crashdump process.
> Yes, need careful review from s390 and maybe ppc64 especially about
> patch 2/3, better to have comments from IBM about s390 dump tool and ppc
> fadump. Added more cc.
>
> Thanks
> Dave
>
> ___
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-22 Thread Xunlei Pang
On 03/22/2017 at 04:55 PM, Xunlei Pang wrote:
> On 03/21/2017 at 11:33 AM, Eric W. Biederman wrote:
>> Xunlei Pang  writes:
>>
>>> As Eric said,
>>> "what we need to do is move the variable vmcoreinfo_note out
>>> of the kernel's .bss section.  And modify the code to regenerate
>>> and keep this information in something like the control page.
>>>
>>> Definitely something like this needs a page all to itself, and ideally
>>> far away from any other kernel data structures.  I clearly was not
>>> watching closely the data someone decided to keep this silly thing
>>> in the kernel's .bss section."
>>>
>>> This patch allocates extra pages for these vmcoreinfo_XXX variables,
>>> one advantage is that it enhances some safety of vmcoreinfo, because
>>> vmcoreinfo now is kept far away from other kernel data structures.
>> Can you preceed this patch with a patch that removes CRASHTIME from
>> vmcoreinfo?  If someone actually cares we can add a separate note that holds
>> a 64bit crashtime in the per cpu notes.  
> Hi Eric,
>
> Thanks for your review, I took some time and did some investigation.
>
> Removing "CRASHTIME=X" from vmcoreinfo_note will break user-space tools.
> For example, makedumpfile gets vmcoreinfo note information by reading
> "/sys/kernel/vmcoreinfo"  its PA, then get its "VA = PA | PAGE_OFFSET",
> and then get the timestamp. This operates in the first kernel even before
> kdump is loaded.

Think more, this is not a problem for "makedumpfile --mem-usage",
as the system doesn't have "CRASHTIME" before crash. But still we
may have the following concerns.

>
> Actually, even moving vmcoreinfo_note[] into the crash memory, it
> may have problems, for example, on s390 system the crash memory
> range will be unmapped, so I guess it may cause some risks.
>
> Additionally, there is no available way for us to allocate a page from the
> crash memory during kernel initialization, we only can achieve this during
> the kexec syscalls. There is not a neat way to implement a function to
> allocate pages from the crash memory during kernel initialization without
> some hack code added, because user-space tools(like kexec-tools) can
> allocate the crash segment by their own ways from the crash memory.
>
> That's why I only copy vmcoreinfo_data[] into the crash memory, and
> not touch vmcoreinfo_note, so vmcoreinfo_data is well protected in
> the crash memory copy, then in crash_save_vmcoreinfo(), we copy
> this guaranteed copy into vmcoreinfo_note[], so the correctness of
> vmcoreinfo_note[] is guaranteed. This is what [PATCH v3 3/3] does.
>
> The current crash_save_vmcoreinfo() only involves memory(memcpy)
> operations even for get_seconds(no locks), the only risk I can think
> of now is that vmcoreinfo_note pointer may be corrupted. If it is a concern,
> I guess we can put it into struct kimage" just like vmcoreinfo_XXX_copy
> in this patch. After all if kimage structure was corrupted when crash happens,
> we can do nothing but have to accept the fate.
>
> So does it really deserve to eliminate crash_save_vmcoreinfo()?
>
> Regards,
> Xunlei
>
>> As we are looking at reliability concerns removing CRASHTIME should make
>> everything in vmcoreinfo a boot time constant.  Which should simplify
>> everything considerably.
>>
>> Which means we only need to worry abou the per-cpu notes being written
>> at the time of a crash.
>>
>>> Suggested-by: Eric Biederman 
>>> Signed-off-by: Xunlei Pang 
>>> ---
>>>  arch/ia64/kernel/machine_kexec.c |  5 -
>>>  arch/x86/kernel/crash.c  |  2 +-
>>>  include/linux/kexec.h|  2 +-
>>>  kernel/kexec_core.c  | 29 -
>>>  kernel/ksysfs.c  |  2 +-
>>>  5 files changed, 27 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/arch/ia64/kernel/machine_kexec.c 
>>> b/arch/ia64/kernel/machine_kexec.c
>>> index 599507b..c14815d 100644
>>> --- a/arch/ia64/kernel/machine_kexec.c
>>> +++ b/arch/ia64/kernel/machine_kexec.c
>>> @@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void)
>>>  #endif
>>>  }
>>>  
>>> -phys_addr_t paddr_vmcoreinfo_note(void)
>>> -{
>>> -   return ia64_tpa((unsigned long)(char *)_note);
>>> -}
>>> -
>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>> index 3741461..4d35fbb 100644
>>> --- a/arch/x86/kernel/crash.c
>>> +++ b/arch/x86/kernel/crash.c
>>> @@ -456,7 +456,7 @@ static int prepare_elf64_headers(struct crash_elf_data 
>>> *ced,
>>> bufp += sizeof(Elf64_Phdr);
>>> phdr->p_type = PT_NOTE;
>>> phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
>>> -   phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
>>> +   phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
>>> (ehdr->e_phnum)++;
>>>  
>>>  #ifdef CONFIG_X86_64
>>> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
>>> index e98e546..f1c601b 100644
>>> --- a/include/linux/kexec.h
>>> +++ b/include/linux/kexec.h
>>> @@ 

Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-22 Thread Xunlei Pang
On 03/22/2017 at 04:55 PM, Xunlei Pang wrote:
> On 03/21/2017 at 11:33 AM, Eric W. Biederman wrote:
>> Xunlei Pang  writes:
>>
>>> As Eric said,
>>> "what we need to do is move the variable vmcoreinfo_note out
>>> of the kernel's .bss section.  And modify the code to regenerate
>>> and keep this information in something like the control page.
>>>
>>> Definitely something like this needs a page all to itself, and ideally
>>> far away from any other kernel data structures.  I clearly was not
>>> watching closely the data someone decided to keep this silly thing
>>> in the kernel's .bss section."
>>>
>>> This patch allocates extra pages for these vmcoreinfo_XXX variables,
>>> one advantage is that it enhances some safety of vmcoreinfo, because
>>> vmcoreinfo now is kept far away from other kernel data structures.
>> Can you preceed this patch with a patch that removes CRASHTIME from
>> vmcoreinfo?  If someone actually cares we can add a separate note that holds
>> a 64bit crashtime in the per cpu notes.  
> Hi Eric,
>
> Thanks for your review, I took some time and did some investigation.
>
> Removing "CRASHTIME=X" from vmcoreinfo_note will break user-space tools.
> For example, makedumpfile gets vmcoreinfo note information by reading
> "/sys/kernel/vmcoreinfo"  its PA, then get its "VA = PA | PAGE_OFFSET",
> and then get the timestamp. This operates in the first kernel even before
> kdump is loaded.

Think more, this is not a problem for "makedumpfile --mem-usage",
as the system doesn't have "CRASHTIME" before crash. But still we
may have the following concerns.

>
> Actually, even moving vmcoreinfo_note[] into the crash memory, it
> may have problems, for example, on s390 system the crash memory
> range will be unmapped, so I guess it may cause some risks.
>
> Additionally, there is no available way for us to allocate a page from the
> crash memory during kernel initialization, we only can achieve this during
> the kexec syscalls. There is not a neat way to implement a function to
> allocate pages from the crash memory during kernel initialization without
> some hack code added, because user-space tools(like kexec-tools) can
> allocate the crash segment by their own ways from the crash memory.
>
> That's why I only copy vmcoreinfo_data[] into the crash memory, and
> not touch vmcoreinfo_note, so vmcoreinfo_data is well protected in
> the crash memory copy, then in crash_save_vmcoreinfo(), we copy
> this guaranteed copy into vmcoreinfo_note[], so the correctness of
> vmcoreinfo_note[] is guaranteed. This is what [PATCH v3 3/3] does.
>
> The current crash_save_vmcoreinfo() only involves memory(memcpy)
> operations even for get_seconds(no locks), the only risk I can think
> of now is that vmcoreinfo_note pointer may be corrupted. If it is a concern,
> I guess we can put it into struct kimage" just like vmcoreinfo_XXX_copy
> in this patch. After all if kimage structure was corrupted when crash happens,
> we can do nothing but have to accept the fate.
>
> So does it really deserve to eliminate crash_save_vmcoreinfo()?
>
> Regards,
> Xunlei
>
>> As we are looking at reliability concerns removing CRASHTIME should make
>> everything in vmcoreinfo a boot time constant.  Which should simplify
>> everything considerably.
>>
>> Which means we only need to worry abou the per-cpu notes being written
>> at the time of a crash.
>>
>>> Suggested-by: Eric Biederman 
>>> Signed-off-by: Xunlei Pang 
>>> ---
>>>  arch/ia64/kernel/machine_kexec.c |  5 -
>>>  arch/x86/kernel/crash.c  |  2 +-
>>>  include/linux/kexec.h|  2 +-
>>>  kernel/kexec_core.c  | 29 -
>>>  kernel/ksysfs.c  |  2 +-
>>>  5 files changed, 27 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/arch/ia64/kernel/machine_kexec.c 
>>> b/arch/ia64/kernel/machine_kexec.c
>>> index 599507b..c14815d 100644
>>> --- a/arch/ia64/kernel/machine_kexec.c
>>> +++ b/arch/ia64/kernel/machine_kexec.c
>>> @@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void)
>>>  #endif
>>>  }
>>>  
>>> -phys_addr_t paddr_vmcoreinfo_note(void)
>>> -{
>>> -   return ia64_tpa((unsigned long)(char *)_note);
>>> -}
>>> -
>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>> index 3741461..4d35fbb 100644
>>> --- a/arch/x86/kernel/crash.c
>>> +++ b/arch/x86/kernel/crash.c
>>> @@ -456,7 +456,7 @@ static int prepare_elf64_headers(struct crash_elf_data 
>>> *ced,
>>> bufp += sizeof(Elf64_Phdr);
>>> phdr->p_type = PT_NOTE;
>>> phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
>>> -   phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
>>> +   phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
>>> (ehdr->e_phnum)++;
>>>  
>>>  #ifdef CONFIG_X86_64
>>> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
>>> index e98e546..f1c601b 100644
>>> --- a/include/linux/kexec.h
>>> +++ b/include/linux/kexec.h
>>> @@ 

Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-22 Thread Xunlei Pang
On 03/21/2017 at 11:33 AM, Eric W. Biederman wrote:
> Xunlei Pang  writes:
>
>> As Eric said,
>> "what we need to do is move the variable vmcoreinfo_note out
>> of the kernel's .bss section.  And modify the code to regenerate
>> and keep this information in something like the control page.
>>
>> Definitely something like this needs a page all to itself, and ideally
>> far away from any other kernel data structures.  I clearly was not
>> watching closely the data someone decided to keep this silly thing
>> in the kernel's .bss section."
>>
>> This patch allocates extra pages for these vmcoreinfo_XXX variables,
>> one advantage is that it enhances some safety of vmcoreinfo, because
>> vmcoreinfo now is kept far away from other kernel data structures.
> Can you preceed this patch with a patch that removes CRASHTIME from
> vmcoreinfo?  If someone actually cares we can add a separate note that holds
> a 64bit crashtime in the per cpu notes.  

Hi Eric,

Thanks for your review, I took some time and did some investigation.

Removing "CRASHTIME=X" from vmcoreinfo_note will break user-space tools.
For example, makedumpfile gets vmcoreinfo note information by reading
"/sys/kernel/vmcoreinfo"  its PA, then get its "VA = PA | PAGE_OFFSET",
and then get the timestamp. This operates in the first kernel even before
kdump is loaded.

Actually, even moving vmcoreinfo_note[] into the crash memory, it
may have problems, for example, on s390 system the crash memory
range will be unmapped, so I guess it may cause some risks.

Additionally, there is no available way for us to allocate a page from the
crash memory during kernel initialization, we only can achieve this during
the kexec syscalls. There is not a neat way to implement a function to
allocate pages from the crash memory during kernel initialization without
some hack code added, because user-space tools(like kexec-tools) can
allocate the crash segment by their own ways from the crash memory.

That's why I only copy vmcoreinfo_data[] into the crash memory, and
not touch vmcoreinfo_note, so vmcoreinfo_data is well protected in
the crash memory copy, then in crash_save_vmcoreinfo(), we copy
this guaranteed copy into vmcoreinfo_note[], so the correctness of
vmcoreinfo_note[] is guaranteed. This is what [PATCH v3 3/3] does.

The current crash_save_vmcoreinfo() only involves memory(memcpy)
operations even for get_seconds(no locks), the only risk I can think
of now is that vmcoreinfo_note pointer may be corrupted. If it is a concern,
I guess we can put it into struct kimage" just like vmcoreinfo_XXX_copy
in this patch. After all if kimage structure was corrupted when crash happens,
we can do nothing but have to accept the fate.

So does it really deserve to eliminate crash_save_vmcoreinfo()?

Regards,
Xunlei

>
> As we are looking at reliability concerns removing CRASHTIME should make
> everything in vmcoreinfo a boot time constant.  Which should simplify
> everything considerably.
>
> Which means we only need to worry abou the per-cpu notes being written
> at the time of a crash.
>
>> Suggested-by: Eric Biederman 
>> Signed-off-by: Xunlei Pang 
>> ---
>>  arch/ia64/kernel/machine_kexec.c |  5 -
>>  arch/x86/kernel/crash.c  |  2 +-
>>  include/linux/kexec.h|  2 +-
>>  kernel/kexec_core.c  | 29 -
>>  kernel/ksysfs.c  |  2 +-
>>  5 files changed, 27 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/ia64/kernel/machine_kexec.c 
>> b/arch/ia64/kernel/machine_kexec.c
>> index 599507b..c14815d 100644
>> --- a/arch/ia64/kernel/machine_kexec.c
>> +++ b/arch/ia64/kernel/machine_kexec.c
>> @@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void)
>>  #endif
>>  }
>>  
>> -phys_addr_t paddr_vmcoreinfo_note(void)
>> -{
>> -return ia64_tpa((unsigned long)(char *)_note);
>> -}
>> -
>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>> index 3741461..4d35fbb 100644
>> --- a/arch/x86/kernel/crash.c
>> +++ b/arch/x86/kernel/crash.c
>> @@ -456,7 +456,7 @@ static int prepare_elf64_headers(struct crash_elf_data 
>> *ced,
>>  bufp += sizeof(Elf64_Phdr);
>>  phdr->p_type = PT_NOTE;
>>  phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
>> -phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
>> +phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
>>  (ehdr->e_phnum)++;
>>  
>>  #ifdef CONFIG_X86_64
>> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
>> index e98e546..f1c601b 100644
>> --- a/include/linux/kexec.h
>> +++ b/include/linux/kexec.h
>> @@ -317,7 +317,7 @@ extern void *kexec_purgatory_get_symbol_addr(struct 
>> kimage *image,
>>  extern struct resource crashk_low_res;
>>  typedef u32 note_buf_t[KEXEC_NOTE_BYTES/4];
>>  extern note_buf_t __percpu *crash_notes;
>> -extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
>> +extern u32 *vmcoreinfo_note;
>>  extern size_t 

Re: [PATCH] elf_info.c: fix memory leak in get_kcore_dump_loads()

2017-03-22 Thread Pingfan Liu




- Original Message -
> From: "Atsushi Kumagai" 
> To: "Pingfan Liu" 
> Cc: kexec@lists.infradead.org
> Sent: Wednesday, March 22, 2017 2:59:29 PM
> Subject: RE: [PATCH] elf_info.c: fix memory leak in get_kcore_dump_loads()
> 
> Hello Pingfan,
> 
> It would be helpful if you could mention that the patch is for
> makedumpfile when you post that, otherwise I could overlook that.
> 
Sorry, and I will add such info.

Regards,
Pingfan

> Anyway, I'll merge this patch into v1.6.2 after adding semicolon
> to the back of "free(pls)".
> 
> Thanks,
> Atsushi Kumagai
> 
> >Signed-off-by: Pingfan Liu 
> >---
> > elf_info.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> >diff --git a/elf_info.c b/elf_info.c
> >index d84553a..35e754b 100644
> >--- a/elf_info.c
> >+++ b/elf_info.c
> >@@ -893,12 +893,14 @@ int get_kcore_dump_loads(void)
> > || !is_phys_addr(p->virt_start))
> > continue;
> > if (j >= loads)
> >+free(pls)
> > return FALSE;
> >
> > if (j == 0) {
> > offset_pt_load_memory = p->file_offset;
> > if (offset_pt_load_memory == 0) {
> > ERRMSG("Can't get the offset of page data.\n");
> >+free(pls)
> > return FALSE;
> > }
> > }
> >--
> >2.7.4
> >
> >
> >___
> >kexec mailing list
> >kexec@lists.infradead.org
> >http://lists.infradead.org/mailman/listinfo/kexec
> 
> 
> ___
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
> 

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec