Re: [PATCH] s390/crash: Fix KEXEC_NOTE_BYTES definition

2017-06-22 Thread Xunlei Pang
On 06/22/2017 at 01:44 AM, Michael Holzheu wrote:
> On Fri,  9 Jun 2017 10:17:05 +0800,
> Xunlei Pang <xlp...@redhat.com> wrote:
>
>> S390 KEXEC_NOTE_BYTES is not used by note_buf_t as before, which
>> is now defined as follows:
>> typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
>> It was changed by the CONFIG_CRASH_CORE feature.
>>
>> This patch gets rid of all the old KEXEC_NOTE_BYTES stuff, and
>> renames KEXEC_NOTE_BYTES to CRASH_CORE_NOTE_BYTES for S390.
>>
>> Fixes: 692f66f26a4c ("crash: move crashkernel parsing and vmcore related code under CONFIG_CRASH_CORE")
>> Cc: Dave Young <dyo...@redhat.com>
>> Cc: Dave Anderson <ander...@redhat.com>
>> Cc: Hari Bathini <hbath...@linux.vnet.ibm.com>
>> Cc: Gustavo Luiz Duarte <gustav...@linux.vnet.ibm.com>
>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
> Hello Xunlei,
>
> As you already know on s390 we create the ELF header in the new kernel.
> Therefore we don't use the per-cpu buffers for ELF notes to store
> the register state.
>
> For RHEL7 we still store the registers in machine_kexec.c:add_elf_notes().
> Though we also use the ELF header from new kernel ...
>
> We assume your original problem with the "kmem -s" failure
> was caused by the memory overwrite due to the invalid size of the
> "crash_notes" per-cpu buffers.
>
> Therefore your patch looks good for RHEL7 but for upstream we propose the
> patch below.

Hi Michael,

Yes, we already did it this way.
Thanks for the confirmation; the patch below looks good to me.
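For reference, the per-cpu "crash_notes" buffers mentioned above are sized by
the generic definitions below (paraphrased from include/linux/crash_core.h and
kernel/kexec_core.c after the CONFIG_CRASH_CORE split; a sketch for context,
not a verbatim copy). Since an s390-private KEXEC_NOTE_BYTES no longer feeds
into this sizing, writing the larger s390 note set into the buffers overruns
the allocation:

/* generic per-cpu note buffer sizing after the CONFIG_CRASH_CORE split */
typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES / 4];

/* one buffer per CPU, allocated with alloc_percpu(note_buf_t) */
note_buf_t __percpu *crash_notes;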

Regards,
Xunlei

> ---
> [PATCH] s390/crash: Remove unused KEXEC_NOTE_BYTES
>
> After commit 692f66f26a4c19 ("crash: move crashkernel parsing and vmcore
> related code under CONFIG_CRASH_CORE") the KEXEC_NOTE_BYTES macro is not
> used anymore and for s390 we create the ELF header in the new kernel
> anyway. Therefore remove the macro.
>
> Reported-by: Xunlei Pang <xp...@redhat.com>
> Reviewed-by: Mikhail Zaslonko <zaslo...@linux.vnet.ibm.com>
> Signed-off-by: Michael Holzheu <holz...@linux.vnet.ibm.com>
> ---
>  arch/s390/include/asm/kexec.h | 18 --
>  include/linux/crash_core.h|  5 +
>  include/linux/kexec.h |  9 -
>  3 files changed, 5 insertions(+), 27 deletions(-)
>
> diff --git a/arch/s390/include/asm/kexec.h b/arch/s390/include/asm/kexec.h
> index 2f924bc30e35..dccf24ee26d3 100644
> --- a/arch/s390/include/asm/kexec.h
> +++ b/arch/s390/include/asm/kexec.h
> @@ -41,24 +41,6 @@
>  /* The native architecture */
>  #define KEXEC_ARCH KEXEC_ARCH_S390
>  
> -/*
> - * Size for s390x ELF notes per CPU
> - *
> - * Seven notes plus zero note at the end: prstatus, fpregset, timer,
> - * tod_cmp, tod_reg, control regs, and prefix
> - */
> -#define KEXEC_NOTE_BYTES \
> - (ALIGN(sizeof(struct elf_note), 4) * 8 + \
> -  ALIGN(sizeof("CORE"), 4) * 7 + \
> -  ALIGN(sizeof(struct elf_prstatus), 4) + \
> -  ALIGN(sizeof(elf_fpregset_t), 4) + \
> -  ALIGN(sizeof(u64), 4) + \
> -  ALIGN(sizeof(u64), 4) + \
> -  ALIGN(sizeof(u32), 4) + \
> -  ALIGN(sizeof(u64) * 16, 4) + \
> -  ALIGN(sizeof(u32), 4) \
> - )
> -
>  /* Provide a dummy definition to avoid build failures. */
>  static inline void crash_setup_regs(struct pt_regs *newregs,
>   struct pt_regs *oldregs) { }
> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> index 541a197ba4a2..4090a42578a8 100644
> --- a/include/linux/crash_core.h
> +++ b/include/linux/crash_core.h
> @@ -10,6 +10,11 @@
>  #define CRASH_CORE_NOTE_NAME_BYTES ALIGN(sizeof(CRASH_CORE_NOTE_NAME), 4)
>  #define CRASH_CORE_NOTE_DESC_BYTES ALIGN(sizeof(struct elf_prstatus), 4)
>  
> +/*
> + * The per-cpu notes area is a list of notes terminated by a "NULL"
> + * note header.  For kdump, the code in vmcore.c runs in the context
> + * of the second kernel to combine them into one note.
> + */
>  #define CRASH_CORE_NOTE_BYTES   ((CRASH_CORE_NOTE_HEAD_BYTES * 2) + \
>CRASH_CORE_NOTE_NAME_BYTES +   \
>CRASH_CORE_NOTE_DESC_BYTES)
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index c9481ebcbc0c..65888418fb69 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -63,15 +63,6 @@
>  #define KEXEC_CORE_NOTE_NAME CRASH_CORE_NOTE_NAME
>  
>  /*
> - * The per-cpu notes area is a list of notes terminated by a "NULL"
> - * note header.  For kdump, the code in vmcore.c runs in the context
> - * of the second kernel to combine them into one note.
> - */
> -#ifndef KEXEC_NOTE_BYTES
> -#define KEXEC_NOTE_BYTES CRASH_CORE_NOTE_BYTES
> -#endif
> -
> -/*
>   * This structure is used to hold the arguments that are used when loading
>   * kernel binaries.
>   */




Re: [PATCH] s390/crash: Fix KEXEC_NOTE_BYTES definition

2017-06-11 Thread Xunlei Pang
On 06/09/2017 at 03:45 PM, Dave Young wrote:
> On 06/09/17 at 10:29am, Dave Young wrote:
>> On 06/09/17 at 10:17am, Xunlei Pang wrote:
>>> S390 KEXEC_NOTE_BYTES is not used by note_buf_t as before, which
>>> is now defined as follows:
>>> typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
>>> It was changed by the CONFIG_CRASH_CORE feature.
>>>
>>> This patch gets rid of all the old KEXEC_NOTE_BYTES stuff, and
>>> renames KEXEC_NOTE_BYTES to CRASH_CORE_NOTE_BYTES for S390.
>>>
>>> Fixes: 692f66f26a4c ("crash: move crashkernel parsing and vmcore related code under CONFIG_CRASH_CORE")
>>> Cc: Dave Young <dyo...@redhat.com>
>>> Cc: Dave Anderson <ander...@redhat.com>
>>> Cc: Hari Bathini <hbath...@linux.vnet.ibm.com>
>>> Cc: Gustavo Luiz Duarte <gustav...@linux.vnet.ibm.com>
>>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>>> ---
>>>  arch/s390/include/asm/kexec.h |  2 +-
>>>  include/linux/crash_core.h|  7 +++
>>>  include/linux/kexec.h | 11 +--
>>>  3 files changed, 9 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/arch/s390/include/asm/kexec.h b/arch/s390/include/asm/kexec.h
>>> index 2f924bc..352deb8 100644
>>> --- a/arch/s390/include/asm/kexec.h
>>> +++ b/arch/s390/include/asm/kexec.h
>>> @@ -47,7 +47,7 @@
>>>   * Seven notes plus zero note at the end: prstatus, fpregset, timer,
>>>   * tod_cmp, tod_reg, control regs, and prefix
>>>   */
>>> -#define KEXEC_NOTE_BYTES \
>>> +#define CRASH_CORE_NOTE_BYTES \
>>> (ALIGN(sizeof(struct elf_note), 4) * 8 + \
>>>  ALIGN(sizeof("CORE"), 4) * 7 + \
>>>  ALIGN(sizeof(struct elf_prstatus), 4) + \
> I found that in mainline, since the commit below, the define above should be
> useless. A distribution with an older kernel may still need your fix, but in
> mainline the right fix should be dropping the s390 usage of these
> macros.

Indeed. Then I think we can remove this special definition for s390 to avoid
confusion.

Regards,
Xunlei

>
> Anyway, this needs a comment from Michael.
>
> commit 8a07dd02d7615d91d65d6235f7232e3f9b5d347f
> Author: Martin Schwidefsky <schwidef...@de.ibm.com>
> Date:   Wed Oct 14 15:53:06 2015 +0200
>
> s390/kdump: remove code to create ELF notes in the crashed system
> 
> The s390 architecture can store the CPU registers of the crashed
> system
> after the kdump kernel has been started and this is the preferred
> way.
> Remove the remaining code fragments that deal with storing CPU
> registers
> while the crashed system is still active.
> 
> Acked-by: Michael Holzheu <holz...@linux.vnet.ibm.com>
> Signed-off-by: Martin Schwidefsky <schwidef...@de.ibm.com>
>
>
>>> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
>>> index e9de6b4..dbc6e5c 100644
>>> --- a/include/linux/crash_core.h
>>> +++ b/include/linux/crash_core.h
>>> @@ -10,9 +10,16 @@
>>>  #define CRASH_CORE_NOTE_NAME_BYTES ALIGN(sizeof(CRASH_CORE_NOTE_NAME), 4)
>>>  #define CRASH_CORE_NOTE_DESC_BYTES ALIGN(sizeof(struct elf_prstatus), 4)
>>>  
>>> +/*
>>> + * The per-cpu notes area is a list of notes terminated by a "NULL"
>>> + * note header.  For kdump, the code in vmcore.c runs in the context
>>> + * of the second kernel to combine them into one note.
>>> + */
>>> +#ifndef CRASH_CORE_NOTE_BYTES
>>>  #define CRASH_CORE_NOTE_BYTES ((CRASH_CORE_NOTE_HEAD_BYTES * 2) + \
>>>  CRASH_CORE_NOTE_NAME_BYTES +   \
>>>  CRASH_CORE_NOTE_DESC_BYTES)
>>> +#endif
>>>  
>>>  #define VMCOREINFO_BYTES  PAGE_SIZE
>>>  #define VMCOREINFO_NOTE_NAME  "VMCOREINFO"
>>> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
>>> index 3ea8275..133df03 100644
>>> --- a/include/linux/kexec.h
>>> +++ b/include/linux/kexec.h
>>> @@ -14,7 +14,6 @@
>>>  
>>>  #if !defined(__ASSEMBLY__)
>>>  
>>> -#include 
>>>  #include 
>>>  
>>>  #include 
>>> @@ -25,6 +24,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  /* Verify architecture specific macros are defined */
>>>  
>>> @@ -63,15 +63,6 @@
>>>  #define KEXEC_CORE_NOTE_NAME   CRASH_CORE_NOTE_NAME
>>>  
>>>  /*
>>> - * The per-cpu notes area is a list of notes terminated by a "NULL"
>>> - * note header.  For kdump, the code in vmcore.c runs in the context
>>> - * of the second kernel to combine them into one note.
>>> - */
>>> -#ifndef KEXEC_NOTE_BYTES
>>> -#define KEXEC_NOTE_BYTES   CRASH_CORE_NOTE_BYTES
>>> -#endif
>> It is still not clear how s390 uses the crash_notes apart from this macro.
>> But from a code point of view we do need to update this as well after the
>> crash_core splitting.
>>
>> Acked-by: Dave Young <dyo...@redhat.com>
> Hold on the ack because of the new findings, wait for Michael's
> feedback.
>
> Thanks
> Dave




[PATCH] s390/crash: Fix KEXEC_NOTE_BYTES definition

2017-06-08 Thread Xunlei Pang
S390 KEXEC_NOTE_BYTES is not used by note_buf_t as before, which
is now defined as follows:
typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
It was changed by the CONFIG_CRASH_CORE feature.

This patch gets rid of all the old KEXEC_NOTE_BYTES stuff, and
renames KEXEC_NOTE_BYTES to CRASH_CORE_NOTE_BYTES for S390.

Fixes: 692f66f26a4c ("crash: move crashkernel parsing and vmcore related code under CONFIG_CRASH_CORE")
Cc: Dave Young <dyo...@redhat.com>
Cc: Dave Anderson <ander...@redhat.com>
Cc: Hari Bathini <hbath...@linux.vnet.ibm.com>
Cc: Gustavo Luiz Duarte <gustav...@linux.vnet.ibm.com>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 arch/s390/include/asm/kexec.h |  2 +-
 include/linux/crash_core.h|  7 +++
 include/linux/kexec.h | 11 +--
 3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/s390/include/asm/kexec.h b/arch/s390/include/asm/kexec.h
index 2f924bc..352deb8 100644
--- a/arch/s390/include/asm/kexec.h
+++ b/arch/s390/include/asm/kexec.h
@@ -47,7 +47,7 @@
  * Seven notes plus zero note at the end: prstatus, fpregset, timer,
  * tod_cmp, tod_reg, control regs, and prefix
  */
-#define KEXEC_NOTE_BYTES \
+#define CRASH_CORE_NOTE_BYTES \
(ALIGN(sizeof(struct elf_note), 4) * 8 + \
 ALIGN(sizeof("CORE"), 4) * 7 + \
 ALIGN(sizeof(struct elf_prstatus), 4) + \
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index e9de6b4..dbc6e5c 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -10,9 +10,16 @@
 #define CRASH_CORE_NOTE_NAME_BYTES ALIGN(sizeof(CRASH_CORE_NOTE_NAME), 4)
 #define CRASH_CORE_NOTE_DESC_BYTES ALIGN(sizeof(struct elf_prstatus), 4)
 
+/*
+ * The per-cpu notes area is a list of notes terminated by a "NULL"
+ * note header.  For kdump, the code in vmcore.c runs in the context
+ * of the second kernel to combine them into one note.
+ */
+#ifndef CRASH_CORE_NOTE_BYTES
 #define CRASH_CORE_NOTE_BYTES ((CRASH_CORE_NOTE_HEAD_BYTES * 2) +  \
 CRASH_CORE_NOTE_NAME_BYTES +   \
 CRASH_CORE_NOTE_DESC_BYTES)
+#endif
 
 #define VMCOREINFO_BYTES  PAGE_SIZE
 #define VMCOREINFO_NOTE_NAME  "VMCOREINFO"
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 3ea8275..133df03 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -14,7 +14,6 @@
 
 #if !defined(__ASSEMBLY__)
 
-#include 
 #include 
 
 #include 
@@ -25,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Verify architecture specific macros are defined */
 
@@ -63,15 +63,6 @@
 #define KEXEC_CORE_NOTE_NAME   CRASH_CORE_NOTE_NAME
 
 /*
- * The per-cpu notes area is a list of notes terminated by a "NULL"
- * note header.  For kdump, the code in vmcore.c runs in the context
- * of the second kernel to combine them into one note.
- */
-#ifndef KEXEC_NOTE_BYTES
-#define KEXEC_NOTE_BYTES   CRASH_CORE_NOTE_BYTES
-#endif
-
-/*
  * This structure is used to hold the arguments that are used when loading
  * kernel binaries.
  */
-- 
1.8.3.1




Re: [PATCH v5 28/32] x86/mm, kexec: Allow kexec to be used with SME

2017-05-31 Thread Xunlei Pang
On 05/31/2017 at 01:46 AM, Tom Lendacky wrote:
> On 5/25/2017 11:17 PM, Xunlei Pang wrote:
>> On 04/19/2017 at 05:21 AM, Tom Lendacky wrote:
>>> Provide support so that kexec can be used to boot a kernel when SME is
>>> enabled.
>>>
>>> Support is needed to allocate pages for kexec without encryption.  This
>>> is needed in order to be able to reboot in the kernel in the same manner
>>> as originally booted.
>>
>> Hi Tom,
>>
>> It looks like kdump will break; I didn't see similar handling for the kdump
>> case, see the kernel functions kimage_alloc_crash_control_pages(),
>> kimage_load_crash_segment(), etc.
>>
>> We need to support kdump with SME: the kdump
>> kernel/initramfs/purgatory/elfcorehdr/etc. are all loaded into the reserved
>> memory (see crashkernel=X) by the userspace kexec-tools.
>> I think a straightforward way would be to mark the whole reserved memory
>> range as unencrypted before loading all the kexec segments for kdump; I guess
>> we can handle this easily in arch_kexec_unprotect_crashkres().
>
> Yes, that would work.
>
>>
>> Moreover, now that "elfcorehdr=X" is left decrypted, it needs to be remapped
>> to the encrypted data.
>
> This is an area that I'm not familiar with, so I don't completely
> understand the flow in regards to where/when/how the ELF headers are
> copied and what needs to be done.
>
> Can you elaborate a bit on this?

"elfcorehdr" is generated by userspace 
kexec-tools(git://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git), 
it's
actually ELF CORE header data(elf header, PT_LOAD/PT_NOTE program header), see 
kexec/crashdump-elf.c::FUNC().

For kdump case, it will be put in some reserved crash memory allocated by 
kexec-tools, and passed the corresponding
start address of the allocated reserved crash memory to kdump kernel via 
"elfcorehdr=", please see kernel functions
setup_elfcorehdr() and vmcore_init() for how it is parsed by kdump kernel.
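To make the command-line format concrete, here is a stand-alone sketch of how
an "elfcorehdr=[size@]addr" value can be interpreted (illustrative only: the
helper name and the sample string are made up; the real parsing is done by the
kernel's setup_elfcorehdr() and by kexec-tools):

#include <stdio.h>
#include <stdlib.h>

static int parse_elfcorehdr(const char *arg, unsigned long long *addr,
			    unsigned long long *size)
{
	char *end;
	unsigned long long val = strtoull(arg, &end, 0);

	if (end == arg)
		return -1;
	if (*end == '@') {		/* "size@addr" form */
		*size = val;
		*addr = strtoull(end + 1, &end, 0);
	} else {			/* plain "addr" form */
		*size = 0;
		*addr = val;
	}
	return 0;
}

int main(void)
{
	unsigned long long addr, size;

	/* hypothetical value taken from a kdump kernel command line */
	if (!parse_elfcorehdr("0x2000@0x7f000000", &addr, &size))
		printf("ELF core header at %#llx, size %#llx\n", addr, size);
	return 0;
}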

Regards,
Xunlei

>>
>>>
>>> Additionally, when shutting down all of the CPUs we need to be sure to
>>> flush the caches and then halt. This is needed when booting from a state
>>> where SME was not active into a state where SME is active (or vice-versa).
>>> Without these steps, it is possible for cache lines to exist for the same
>>> physical location but tagged both with and without the encryption bit. This
>>> can cause random memory corruption when caches are flushed depending on
>>> which cacheline is written last.
>>>
>>> Signed-off-by: Tom Lendacky <thomas.lenda...@amd.com>
>>> ---
>>>   arch/x86/include/asm/init.h  |1 +
>>>   arch/x86/include/asm/irqflags.h  |5 +
>>>   arch/x86/include/asm/kexec.h |8 
>>>   arch/x86/include/asm/pgtable_types.h |1 +
>>>   arch/x86/kernel/machine_kexec_64.c   |   35 
>>> +-
>>>   arch/x86/kernel/process.c|   26 +++--
>>>   arch/x86/mm/ident_map.c  |   11 +++
>>>   include/linux/kexec.h|   14 ++
>>>   kernel/kexec_core.c  |7 +++
>>>   9 files changed, 101 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
>>> index 737da62..b2ec511 100644
>>> --- a/arch/x86/include/asm/init.h
>>> +++ b/arch/x86/include/asm/init.h
>>> @@ -6,6 +6,7 @@ struct x86_mapping_info {
>>>   void *context; /* context for alloc_pgt_page */
>>>   unsigned long pmd_flag; /* page flag for PMD entry */
>>>   unsigned long offset; /* ident mapping offset */
>>> +unsigned long kernpg_flag; /* kernel pagetable flag override */
>>>   };
>>> int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t 
>>> *pgd_page,
>>> diff --git a/arch/x86/include/asm/irqflags.h 
>>> b/arch/x86/include/asm/irqflags.h
>>> index ac7692d..38b5920 100644
>>> --- a/arch/x86/include/asm/irqflags.h
>>> +++ b/arch/x86/include/asm/irqflags.h
>>> @@ -58,6 +58,11 @@ static inline __cpuidle void native_halt(void)
>>>   asm volatile("hlt": : :"memory");
>>>   }
>>>   +static inline __cpuidle void native_wbinvd_halt(void)
>>> +{
>>> +asm volatile("wbinvd; hlt" : : : "memory");
>>> +}
>>> +
>>>   #endif
>>> #

Re: [PATCH v5 28/32] x86/mm, kexec: Allow kexec to be used with SME

2017-05-25 Thread Xunlei Pang
On 04/19/2017 at 05:21 AM, Tom Lendacky wrote:
> Provide support so that kexec can be used to boot a kernel when SME is
> enabled.
>
> Support is needed to allocate pages for kexec without encryption.  This
> is needed in order to be able to reboot in the kernel in the same manner
> as originally booted.

Hi Tom,

It looks like kdump will break; I didn't see similar handling for the kdump
case, see the kernel functions kimage_alloc_crash_control_pages(),
kimage_load_crash_segment(), etc.

We need to support kdump with SME: the kdump
kernel/initramfs/purgatory/elfcorehdr/etc. are all loaded into the reserved
memory (see crashkernel=X) by the userspace kexec-tools.
I think a straightforward way would be to mark the whole reserved memory range
as unencrypted before loading all the kexec segments for kdump; I guess we can
handle this easily in arch_kexec_unprotect_crashkres().

Moreover, now that "elfcorehdr=X" is left decrypted, it needs to be remapped to
the encrypted data.
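A rough sketch of that idea only (not an actual patch; it assumes a
set_memory_decrypted()-style helper from the SME series, and the exact hook
and naming are illustrative):

/* sketch: clear the encryption attribute on the whole crashkernel region
 * before the kdump segments are loaded into it */
void arch_kexec_unprotect_crashkres(void)
{
	unsigned long nr_pages;

	/* ... existing unprotect step (make the region writable again) ... */

	nr_pages = DIV_ROUND_UP(resource_size(&crashk_res), PAGE_SIZE);
	set_memory_decrypted((unsigned long)__va(crashk_res.start), nr_pages);
}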

Regards,
Xunlei

>
> Additionally, when shutting down all of the CPUs we need to be sure to
> flush the caches and then halt. This is needed when booting from a state
> where SME was not active into a state where SME is active (or vice-versa).
> Without these steps, it is possible for cache lines to exist for the same
> physical location but tagged both with and without the encryption bit. This
> can cause random memory corruption when caches are flushed depending on
> which cacheline is written last.
>
> Signed-off-by: Tom Lendacky 
> ---
>  arch/x86/include/asm/init.h  |1 +
>  arch/x86/include/asm/irqflags.h  |5 +
>  arch/x86/include/asm/kexec.h |8 
>  arch/x86/include/asm/pgtable_types.h |1 +
>  arch/x86/kernel/machine_kexec_64.c   |   35 
> +-
>  arch/x86/kernel/process.c|   26 +++--
>  arch/x86/mm/ident_map.c  |   11 +++
>  include/linux/kexec.h|   14 ++
>  kernel/kexec_core.c  |7 +++
>  9 files changed, 101 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
> index 737da62..b2ec511 100644
> --- a/arch/x86/include/asm/init.h
> +++ b/arch/x86/include/asm/init.h
> @@ -6,6 +6,7 @@ struct x86_mapping_info {
>   void *context;   /* context for alloc_pgt_page */
>   unsigned long pmd_flag;  /* page flag for PMD entry */
>   unsigned long offset;/* ident mapping offset */
> + unsigned long kernpg_flag;   /* kernel pagetable flag override */
>  };
>  
>  int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index ac7692d..38b5920 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -58,6 +58,11 @@ static inline __cpuidle void native_halt(void)
>   asm volatile("hlt": : :"memory");
>  }
>  
> +static inline __cpuidle void native_wbinvd_halt(void)
> +{
> + asm volatile("wbinvd; hlt" : : : "memory");
> +}
> +
>  #endif
>  
>  #ifdef CONFIG_PARAVIRT
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 70ef205..e8183ac 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -207,6 +207,14 @@ struct kexec_entry64_regs {
>   uint64_t r15;
>   uint64_t rip;
>  };
> +
> +extern int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages,
> +gfp_t gfp);
> +#define arch_kexec_post_alloc_pages arch_kexec_post_alloc_pages
> +
> +extern void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages);
> +#define arch_kexec_pre_free_pages arch_kexec_pre_free_pages
> +
>  #endif
>  
>  typedef void crash_vmclear_fn(void);
> diff --git a/arch/x86/include/asm/pgtable_types.h 
> b/arch/x86/include/asm/pgtable_types.h
> index ce8cb1c..0f326f4 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -213,6 +213,7 @@ enum page_cache_mode {
>  #define PAGE_KERNEL  __pgprot(__PAGE_KERNEL | _PAGE_ENC)
>  #define PAGE_KERNEL_RO   __pgprot(__PAGE_KERNEL_RO | _PAGE_ENC)
>  #define PAGE_KERNEL_EXEC __pgprot(__PAGE_KERNEL_EXEC | _PAGE_ENC)
> +#define PAGE_KERNEL_EXEC_NOENC   __pgprot(__PAGE_KERNEL_EXEC)
>  #define PAGE_KERNEL_RX   __pgprot(__PAGE_KERNEL_RX | _PAGE_ENC)
>  #define PAGE_KERNEL_NOCACHE  __pgprot(__PAGE_KERNEL_NOCACHE | _PAGE_ENC)
>  #define PAGE_KERNEL_LARGE__pgprot(__PAGE_KERNEL_LARGE | _PAGE_ENC)
> diff --git a/arch/x86/kernel/machine_kexec_64.c 
> b/arch/x86/kernel/machine_kexec_64.c
> index 085c3b3..11c0ca9 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -86,7 +86,7 @@ static int init_transition_pgtable(struct kimage *image, 
> pgd_t *pgd)
>  

Re: [Makedumpfile PATCH V2] elf_info: fix file_size if segment is excluded

2017-05-09 Thread Xunlei Pang
On 05/09/2017 at 09:53 PM, Pratyush Anand wrote:
> I received following on a specific x86_64 hp virtual machine while
> executing `makedumpfile --mem-usage /proc/kcore`.
>
> vtop4_x86_64: Can't get a valid pte.
> readmem: Can't convert a virtual address(88115860) to physical 
> address.
> readmem: type_addr: 0, addr:88115860, size:128
> get_nodes_online: Can't get the node online map.
>
> With some debug print in vtop4_x86_64() I noticed that pte value is read
> as 0, while crash reads the value correctly:
>
> from makedumpfile:
> vaddr=88115860
> page_dir=59eaff8
> pml4=59ed067
> pgd_paddr=59edff0
> pgd_pte=59ee063
> pmd_paddr=59ee200
> pmd_pte=3642f063
> pte_paddr=3642f8a8
> pte=0
>
> from crash
> crash> vtop 88115860
> VIRTUAL   PHYSICAL
> 88115860  5b15860
>
> PML4 DIRECTORY: 87fea000
> PAGE DIRECTORY: 59ed067
>PUD: 59edff0 => 59ee063
>PMD: 59ee200 => 3642f063
>PTE: 3642f8a8 => 5b15163
>   PAGE: 5b15000
>
> With some more debug prints in elf_info.c
>
> Before calling exclude_segment()
>
> LOAD (2)
>   phys_start : 10
>   phys_end   : dfffd000
>   virt_start : 8a5a4010
>   virt_end   : 8a5b1fffd000
>   file_offset: a5a40102000
>   file_size  : dfefd000
>
> exclude_segment() is called for Crash Kernel whose range is
> 2b00-350f.
>
> We see following after exclude_segment()
>
> LOAD (2)
>   phys_start : 10
>   phys_end   : 2aff
>   virt_start : 8a5a4010
>   virt_end   : 8a5a6aff
>   file_offset: a5a40102000
>   file_size  : dfefd000
> LOAD (3)
>   phys_start : 3510
>   phys_end   : dfffd000
>   virt_start : 8a5a7510
>   virt_end   : 8a5b1fffd000
>   file_offset: a5a75102000
>   file_size  : 0
>
> Since file_size is calculated wrong therefore readpage_elf() does not
> behave correctly.
>
> This patch fixes above wrong behavior.
>
> Signed-off-by: Pratyush Anand 
> ---
> v1->v2 : subtracted (end - start) from file_size as well
>
>  elf_info.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/elf_info.c b/elf_info.c
> index 8e2437622141..5494c4dcbebe 100644
> --- a/elf_info.c
> +++ b/elf_info.c
> @@ -826,9 +826,13 @@ static int exclude_segment(struct pt_load_segment 
> **pt_loads,
>   temp_seg.virt_end = vend;
>   temp_seg.file_offset = 
> (*pt_loads)[i].file_offset
>   + temp_seg.virt_start - 
> (*pt_loads)[i].virt_start;
> + temp_seg.file_size = temp_seg.phys_end
> + - temp_seg.phys_start;
>  
>   (*pt_loads)[i].virt_end = kvstart - 1;
>   (*pt_loads)[i].phys_end =  start - 1;
> + (*pt_loads)[i].file_size -= (temp_seg.file_size
> + + end - start);

Hi Pratyush,

Don't we need to move the "(*pt_loads)[i].file_size" minus "(end - start)" down
to the tail of the "if (kvstart < vend && kvend > vstart)" condition for all cases?

Regards,
Xunlei

>  
>   tidx = i+1;
>   } else if (kvstart != vstart) {




Re: [PATCH v3 2/2] x86_64/kexec: Use PUD level 1GB page for identity mapping if available

2017-05-08 Thread Xunlei Pang
On 05/08/2017 at 02:29 PM, Ingo Molnar wrote:
> * Xunlei Pang <xp...@redhat.com> wrote:
>
>> On 05/05/2017 at 05:20 PM, Ingo Molnar wrote:
>>> * Xunlei Pang <xp...@redhat.com> wrote:
>>>
>>>> On 05/05/2017 at 02:52 PM, Ingo Molnar wrote:
>>>>> * Xunlei Pang <xlp...@redhat.com> wrote:
>>>>>
>>>>>> @@ -122,6 +122,10 @@ static int init_pgtable(struct kimage *image, 
>>>>>> unsigned long start_pgtable)
>>>>>>  
>>>>>>  level4p = (pgd_t *)__va(start_pgtable);
>>>>>>  clear_page(level4p);
>>>>>> +
>>>>>> +if (direct_gbpages)
>>>>>> +info.direct_gbpages = true;
>>>>> No, this should be keyed off the CPU feature (X86_FEATURE_GBPAGES) 
>>>>> automatically, 
>>>>> not set blindly! AFAICS this patch will crash kexec on any CPU that does 
>>>>> not 
>>>>> support gbpages.
>>>> It should be fine, probe_page_size_mask() already takes care of this:
>>>> if (direct_gbpages && boot_cpu_has(X86_FEATURE_GBPAGES)) {
>>>> printk(KERN_INFO "Using GB pages for direct mapping\n");
>>>> page_size_mask |= 1 << PG_LEVEL_1G;
>>>> } else {
>>>> direct_gbpages = 0;
>>>> }
>>>>
>>>> So if X86_FEATURE_GBPAGES is not supported, direct_gbpages will be set to 
>>>> 0.
>>> So why is the introduction of the info.direct_gbpages flag necessary? 
>>> AFAICS it 
>>> just duplicates the kernel's direct_gbpages flag. One outcome is that 
>>> hibernation 
>>> won't use gbpages, which is silly.
>> boot/compressed/pagetable.c also uses kernel_ident_mapping_init() for KASLR;
>> at the moment we don't have a "direct_gbpages" definition or
>> X86_FEATURE_GBPAGES feature detection there.
>>
>> I thought we could change the other call sites when it turns out to be really needed.
> Ok, you are right - I'll use the original patches as submitted, with the 
> updated 
> changelogs.

Thanks!

Regards,
Xunlei



Re: [PATCH v3 2/2] x86_64/kexec: Use PUD level 1GB page for identity mapping if available

2017-05-05 Thread Xunlei Pang
On 05/05/2017 at 05:20 PM, Ingo Molnar wrote:
> * Xunlei Pang <xp...@redhat.com> wrote:
>
>> On 05/05/2017 at 02:52 PM, Ingo Molnar wrote:
>>> * Xunlei Pang <xlp...@redhat.com> wrote:
>>>
>>>> @@ -122,6 +122,10 @@ static int init_pgtable(struct kimage *image, 
>>>> unsigned long start_pgtable)
>>>>  
>>>>level4p = (pgd_t *)__va(start_pgtable);
>>>>clear_page(level4p);
>>>> +
>>>> +  if (direct_gbpages)
>>>> +  info.direct_gbpages = true;
>>> No, this should be keyed off the CPU feature (X86_FEATURE_GBPAGES) 
>>> automatically, 
>>> not set blindly! AFAICS this patch will crash kexec on any CPU that does 
>>> not 
>>> support gbpages.
>> It should be fine, probe_page_size_mask() already takes care of this:
>> if (direct_gbpages && boot_cpu_has(X86_FEATURE_GBPAGES)) {
>> printk(KERN_INFO "Using GB pages for direct mapping\n");
>> page_size_mask |= 1 << PG_LEVEL_1G;
>> } else {
>> direct_gbpages = 0;
>> }
>>
>> So if X86_FEATURE_GBPAGES is not supported, direct_gbpages will be set to 0.
> So why is the introduction of the info.direct_gbpages flag necessary? AFAICS 
> it 
> just duplicates the kernel's direct_gbpages flag. One outcome is that 
> hibernation 
> won't use gbpages, which is silly.

boot/compressed/pagetable.c also uses kernel_ident_mapping_init() for KASLR; at
the moment we don't have a "direct_gbpages" definition or X86_FEATURE_GBPAGES
feature detection there.

I thought we could change the other call sites when it turns out to be really needed.

Regards,
Xunlei




Re: [PATCH v3 2/2] x86_64/kexec: Use PUD level 1GB page for identity mapping if available

2017-05-05 Thread Xunlei Pang
On 05/05/2017 at 02:52 PM, Ingo Molnar wrote:
> * Xunlei Pang <xlp...@redhat.com> wrote:
>
>> @@ -122,6 +122,10 @@ static int init_pgtable(struct kimage *image, unsigned 
>> long start_pgtable)
>>  
>>  level4p = (pgd_t *)__va(start_pgtable);
>>  clear_page(level4p);
>> +
>> +if (direct_gbpages)
>> +info.direct_gbpages = true;
> No, this should be keyed off the CPU feature (X86_FEATURE_GBPAGES) 
> automatically, 
> not set blindly! AFAICS this patch will crash kexec on any CPU that does not 
> support gbpages.

It should be fine, probe_page_size_mask() already takes care of this:
if (direct_gbpages && boot_cpu_has(X86_FEATURE_GBPAGES)) {
printk(KERN_INFO "Using GB pages for direct mapping\n");
page_size_mask |= 1 << PG_LEVEL_1G;
} else {
direct_gbpages = 0;
}

So if X86_FEATURE_GBPAGES is not supported, direct_gbpages will be set to 0.

>
> I only noticed this problem after having fixed/enhanced all the changelogs - 
> so 
> please pick up the new changelog up from the log below.

Thanks for the rewrite, it looks better.

Regards,
Xunlei

>
> Thanks,
>
>   Ingo
>
>
> >
>
> Author: Xunlei Pang <xlp...@redhat.com>
>
> x86/mm: Add support for gbpages to kernel_ident_mapping_init()
>
> Kernel identity mappings on x86-64 kernels are created in two
> ways: by the early x86 boot code, or by kernel_ident_mapping_init().
>
> Native kernels (which is the dominant usecase) use the former,
> but the kexec and the hibernation code uses kernel_ident_mapping_init().
>
> There's a subtle difference between these two ways of how identity
> mappings are created, the current kernel_ident_mapping_init() code
> creates identity mappings always using 2MB page(PMD level) - while
> the native kernel boot path also utilizes gbpages where available.
>
> This difference is suboptimal both for performance and for memory
> usage: kernel_ident_mapping_init() needs to allocate pages for the
> page tables when creating the new identity mappings.
>
> This patch adds 1GB page(PUD level) support to kernel_ident_mapping_init()
> to address these concerns.
>
> The primary advantage would be better TLB coverage/performance,
> because we'd utilize 1GB TLBs instead of 2MB ones.
>
> It is also useful for machines with large number of memory to
> save paging structure allocations(around 4MB/TB using 2MB page)
> when setting identity mappings for all the memory, after using
> 1GB page it will consume only 8KB/TB.
>
> ( Note that this change alone does not activate gbpages in kexec,
>   we are doing that in a separate patch. )
>
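As a quick sanity check of the 4MB/TB vs. 8KB/TB numbers quoted above, here is
a stand-alone back-of-the-envelope calculation (not kernel code; it only counts
the 4KB PUD/PMD tables needed to identity-map 1TB):

#include <stdio.h>

int main(void)
{
	unsigned long long tb = 1ULL << 40;		/* map 1 TB */
	unsigned long long table = 4096;		/* one paging table */
	unsigned long long per_pmd = 512 * (2ULL << 20);  /* a PMD table maps 1 GB */
	unsigned long long per_pud = 512 * (1ULL << 30);  /* a PUD table maps 512 GB */

	/* 2 MB pages: one PMD table per GB, plus the PUD tables above them */
	unsigned long long with_2mb = (tb / per_pmd + tb / per_pud) * table;
	/* 1 GB pages: only the PUD tables are needed */
	unsigned long long with_1gb = (tb / per_pud) * table;

	printf("2MB pages: %llu KB per TB\n", with_2mb >> 10);	/* ~4 MB */
	printf("1GB pages: %llu KB per TB\n", with_1gb >> 10);	/* 8 KB */
	return 0;
}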




[PATCH v3 1/2] x86/mm/ident_map: Add PUD level 1GB page support

2017-05-03 Thread Xunlei Pang
The current kernel_ident_mapping_init() creates the identity
mapping always using 2MB page(PMD level), this patch adds the
1GB page(PUD level) support.

The primary advantage would be better TLB coverage/performance,
because we'd utilize 1GB TLBs instead of 2MB ones.

It is also useful for machines with large number of memory to
save paging structure allocations(around 4MB/TB using 2MB page)
when setting identity mappings for all the memory, after using
1GB page it will consume only 8KB/TB.

Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 arch/x86/boot/compressed/pagetable.c |  2 +-
 arch/x86/include/asm/init.h  |  3 ++-
 arch/x86/kernel/machine_kexec_64.c   |  2 +-
 arch/x86/mm/ident_map.c  | 14 +-
 arch/x86/power/hibernate_64.c|  2 +-
 5 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/x86/boot/compressed/pagetable.c 
b/arch/x86/boot/compressed/pagetable.c
index 56589d0..1d78f17 100644
--- a/arch/x86/boot/compressed/pagetable.c
+++ b/arch/x86/boot/compressed/pagetable.c
@@ -70,7 +70,7 @@ static void *alloc_pgt_page(void *context)
  * Due to relocation, pointers must be assigned at run time not build time.
  */
 static struct x86_mapping_info mapping_info = {
-   .pmd_flag   = __PAGE_KERNEL_LARGE_EXEC,
+   .page_flag   = __PAGE_KERNEL_LARGE_EXEC,
 };
 
 /* Locates and clears a region for a new top level page table. */
diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index 737da62..474eb8c 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -4,8 +4,9 @@
 struct x86_mapping_info {
void *(*alloc_pgt_page)(void *); /* allocate buf for page table */
void *context;   /* context for alloc_pgt_page */
-   unsigned long pmd_flag;  /* page flag for PMD entry */
+   unsigned long page_flag; /* page flag for PMD or PUD entry */
unsigned long offset;/* ident mapping offset */
+   bool direct_gbpages; /* PUD level 1GB page support */
 };
 
 int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index 085c3b3..1d4f2b0 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -113,7 +113,7 @@ static int init_pgtable(struct kimage *image, unsigned long 
start_pgtable)
struct x86_mapping_info info = {
.alloc_pgt_page = alloc_pgt_page,
.context= image,
-   .pmd_flag   = __PAGE_KERNEL_LARGE_EXEC,
+   .page_flag  = __PAGE_KERNEL_LARGE_EXEC,
};
unsigned long mstart, mend;
pgd_t *level4p;
diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index 04210a2..adab159 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -13,7 +13,7 @@ static void ident_pmd_init(struct x86_mapping_info *info, 
pmd_t *pmd_page,
if (pmd_present(*pmd))
continue;
 
-   set_pmd(pmd, __pmd((addr - info->offset) | info->pmd_flag));
+   set_pmd(pmd, __pmd((addr - info->offset) | info->page_flag));
}
 }
 
@@ -30,6 +30,18 @@ static int ident_pud_init(struct x86_mapping_info *info, 
pud_t *pud_page,
if (next > end)
next = end;
 
+   if (info->direct_gbpages) {
+   pud_t pudval;
+
+   if (pud_present(*pud))
+   continue;
+
+   addr &= PUD_MASK;
+   pudval = __pud((addr - info->offset) | info->page_flag);
+   set_pud(pud, pudval);
+   continue;
+   }
+
if (pud_present(*pud)) {
pmd = pmd_offset(pud, 0);
ident_pmd_init(info, pmd, addr, next);
diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c
index 6a61194..a6e21fe 100644
--- a/arch/x86/power/hibernate_64.c
+++ b/arch/x86/power/hibernate_64.c
@@ -104,7 +104,7 @@ static int set_up_temporary_mappings(void)
 {
struct x86_mapping_info info = {
.alloc_pgt_page = alloc_pgt_page,
-   .pmd_flag   = __PAGE_KERNEL_LARGE_EXEC,
+   .page_flag  = __PAGE_KERNEL_LARGE_EXEC,
.offset = __PAGE_OFFSET,
};
unsigned long mstart, mend;
-- 
1.8.3.1




Re: [Makedumpfile PATCH 1/2] makedumpfile: add runtime kaslr offset if it exists

2017-04-27 Thread Xunlei Pang
On 04/27/2017 at 02:15 PM, Pratyush Anand wrote:
> If we have to erase a symbol from vmcore whose address is not present in
> vmcoreinfo, then we need to pass vmlinux as well to get the symbol
> address.
> When kaslr is enabled, virtual address of all the kernel symbols are
> randomized with an offset. vmlinux  always has a static address, but all
> the arch specific calculation are based on run time kernel address. So
> we need to find a way to translate symbol address from vmlinux to kernel
> run time address.
>
> without this patch:
> # makedumpfile --split  -d 5 -x vmlinux --config scrub.conf vmcore 
> dumpfile_{1,2,3}
>
> readpage_kdump_compressed: pfn(f97ea) is excluded from vmcore.
> readmem: type_addr: 1, addr:f97eaff8, size:8
> vtop4_x86_64: Can't get pml4 (page_dir:f97eaff8).
> readmem: Can't convert a virtual address(819f1284) to physical 
> address.
> readmem: type_addr: 0, addr:819f1284, size:390
> check_release: Can't get the address of system_utsname.
>
> After this patch check_release() is ok, and also we are able to erase
> symbol from vmcore.
>
> Signed-off-by: Pratyush Anand 
> ---
>  arch/x86_64.c  | 23 +++
>  erase_info.c   |  1 +
>  makedumpfile.c | 44 
>  makedumpfile.h | 15 +++
>  4 files changed, 83 insertions(+)
>
> diff --git a/arch/x86_64.c b/arch/x86_64.c
> index e978a36f8878..ab5aae8f1b26 100644
> --- a/arch/x86_64.c
> +++ b/arch/x86_64.c
> @@ -33,6 +33,29 @@ get_xen_p2m_mfn(void)
>   return NOT_FOUND_LONG_VALUE;
>  }
>  
> +unsigned long
> +get_kaslr_offset_x86_64(unsigned long vaddr)
> +{
> + unsigned long sym_vmcoreinfo, sym_vmlinux;
> +
> + if (!info->kaslr_offset) {
> + sym_vmlinux = get_symbol_addr("_stext");
> + if (sym_vmlinux == NOT_FOUND_SYMBOL)
> + return 0;
> + sym_vmcoreinfo = read_vmcoreinfo_symbol(STR_SYMBOL("_stext"));
> + info->kaslr_offset = sym_vmcoreinfo - sym_vmlinux;
> + }
> + if (vaddr >= __START_KERNEL_map &&
> + vaddr < __START_KERNEL_map + info->kaslr_offset)
> + return info->kaslr_offset;
> + else
> + /*
> +  * TODO: we need to check if it is vmalloc/vmmemmap/module
> +  * address, we will have different offset
> +  */
> + return 0;
> +}
> +
>  static int
>  get_page_offset_x86_64(void)
>  {
> diff --git a/erase_info.c b/erase_info.c
> index f2ba9149e93e..60abfa1a1adf 100644
> --- a/erase_info.c
> +++ b/erase_info.c
> @@ -1088,6 +1088,7 @@ resolve_config_entry(struct config_entry *ce, unsigned 
> long long base_vaddr,
>   ce->line, ce->name);
>   return FALSE;
>   }
> + ce->sym_addr += get_kaslr_offset(ce->sym_addr);
>   ce->type_name = get_symbol_type_name(ce->name,
>   DWARF_INFO_GET_SYMBOL_TYPE,
>>> &ce->size, &ce->type_flag);
> diff --git a/makedumpfile.c b/makedumpfile.c
> index 301772a8820c..7e78641917d7 100644
> --- a/makedumpfile.c
> +++ b/makedumpfile.c
> @@ -3782,6 +3782,46 @@ free_for_parallel()
>  }
>  
>  int
> +find_kaslr_offsets()
> +{
> + off_t offset;
> + unsigned long size;
> + int ret = FALSE;
> +
>>> + get_vmcoreinfo(&offset, &size);
> +
> + if (!(info->name_vmcoreinfo = strdup(FILENAME_VMCOREINFO))) {
> + MSG("Can't duplicate strings(%s).\n", FILENAME_VMCOREINFO);
> + return FALSE;
> + }
> + if (!copy_vmcoreinfo(offset, size))
> + goto out;
> +
> + if (!open_vmcoreinfo("r"))
> + goto out;
> +
> + unlink(info->name_vmcoreinfo);
> +
> + /*
> +  * This arch specific function should update info->kaslr_offset. If
> +  * kaslr is not enabled then offset will be set to 0. arch specific
> +  * function might need to read from vmcoreinfo, therefore we have
> +  * called this function between open_vmcoreinfo() and
> +  * close_vmcoreinfo()
> +  */
> + get_kaslr_offset(SYMBOL(_stext));

Looks like acquiring "KERNELOFFSET" in read_vmcoreinfo() should be enough here.

We can get the kaslr offset directly from the vmcoreinfo, because the compressed
dumpfile also contains vmcoreinfo in the flag_refiltering case, and the x86_64
kernel already exports it via
vmcoreinfo_append_str("KERNELOFFSET=%lx\n", kaslr_offset());

Regards,
Xunlei

> +
> + close_vmcoreinfo();
> +
> + ret = TRUE;
> +out:
> + free(info->name_vmcoreinfo);
> + info->name_vmcoreinfo = NULL;
> +
> + return ret;
> +}
> +
> +int
>  initial(void)
>  {
>   off_t offset;
> @@ -3833,6 +3873,9 @@ initial(void)
>   set_dwarf_debuginfo("vmlinux", NULL,
>   info->name_vmlinux, info->fd_vmlinux);
>  
> + if (has_vmcoreinfo() && 

Re: [PATCH v4 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-04-27 Thread Xunlei Pang
On 04/27/2017 at 01:44 PM, Dave Young wrote:
> Hi Xunlei,
>
> On 04/27/17 at 01:25pm, Xunlei Pang wrote:
>> On 04/27/2017 at 11:06 AM, Dave Young wrote:
>>> [snip]
>>>>>>>  
>>>>>>>  static int __init crash_save_vmcoreinfo_init(void)
>>>>>>>  {
>>>>>>> +   /* One page should be enough for VMCOREINFO_BYTES under all 
>>>>>>> archs */
>>>>>> Can we add a comment in the VMCOREINFO_BYTES header file about the one
>>>>>> page assumption?
>>>>>>
>>>>>> Or just define the VMCOREINFO_BYTES as PAGE_SIZE instead of 4096
>>>>> Yes, I considered this before, but VMCOREINFO_BYTES is also used by 
>>>>> VMCOREINFO_NOTE_SIZE
>>>>> definition which is exported to sysfs, also some platform has larger page 
>>>>> size(64KB), so
>>>>> I didn't touch this 4096 value.
>>>>>
>>>>> I think I should use kmalloc() to allocate both of them, then move this 
>>>>> comment to Patch3 
>>>>> kimage_crash_copy_vmcoreinfo().
>>>> But on the other hand, using a separate page for them seems safer compared 
>>>> with
>>>> using frequently-used slab, what's your opinion?
>>> I feel current page based way is better.
>>>
>>> For 64k page the vmcore note size will increase it seems fine. Do you
>>> have concern in mind?
>> Since tools are supposed to acquire the vmcoreinfo note size from sysfs, it
>> should be safe to do so, except that there is some memory waste for a larger
>> PAGE_SIZE.
> Either way is fine to me, I think it is up to your implementation, if
> choose page alloc then modify the macro with PAGE_SIZE looks better.

OK, I will use PAGE_SIZE then, thanks for your comments.

>
> Thanks
> Dave
>




Re: [PATCH v4 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-04-26 Thread Xunlei Pang
On 04/27/2017 at 11:06 AM, Dave Young wrote:
> [snip]
>  
>  static int __init crash_save_vmcoreinfo_init(void)
>  {
> + /* One page should be enough for VMCOREINFO_BYTES under all archs */
 Can we add a comment in the VMCOREINFO_BYTES header file about the one
 page assumption?

 Or just define the VMCOREINFO_BYTES as PAGE_SIZE instead of 4096
>>> Yes, I considered this before, but VMCOREINFO_BYTES is also used by 
>>> VMCOREINFO_NOTE_SIZE
>>> definition which is exported to sysfs, also some platform has larger page 
>>> size(64KB), so
>>> I didn't touch this 4096 value.
>>>
>>> I think I should use kmalloc() to allocate both of them, then move this 
>>> comment to Patch3 
>>> kimage_crash_copy_vmcoreinfo().
>> But on the other hand, using a separate page for them seems safer compared 
>> with
>> using frequently-used slab, what's your opinion?
> I feel current page based way is better.
>
> For 64k page the vmcore note size will increase it seems fine. Do you
> have concern in mind?

Since tools are supposed to acquire the vmcoreinfo note size from sysfs, it
should be safe to do so, except that there is some memory waste for a larger
PAGE_SIZE.
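For example, a minimal sketch of how a tool could pick up the note location and
size (assuming the usual "<paddr> <size>" hex layout of /sys/kernel/vmcoreinfo;
not taken from any existing tool):

#include <stdio.h>

int main(void)
{
	unsigned long long paddr = 0, size = 0;
	FILE *f = fopen("/sys/kernel/vmcoreinfo", "r");

	if (!f || fscanf(f, "%llx %llx", &paddr, &size) != 2) {
		perror("vmcoreinfo");
		if (f)
			fclose(f);
		return 1;
	}
	printf("vmcoreinfo note at %#llx, %llu bytes\n", paddr, size);
	fclose(f);
	return 0;
}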

Regards,
Xunlei



Re: [PATCH v4 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-04-26 Thread Xunlei Pang
On 04/26/2017 at 05:51 PM, Xunlei Pang wrote:
> On 04/26/2017 at 03:19 PM, Dave Young wrote:
>> Add ia64i list,  and s390 list although Michael has tested it
>>
>> On 04/20/17 at 07:39pm, Xunlei Pang wrote:
>>> As Eric said,
>>> "what we need to do is move the variable vmcoreinfo_note out
>>> of the kernel's .bss section.  And modify the code to regenerate
>>> and keep this information in something like the control page.
>>>
>>> Definitely something like this needs a page all to itself, and ideally
>>> far away from any other kernel data structures.  I clearly was not
>>> watching closely the data someone decided to keep this silly thing
>>> in the kernel's .bss section."
>>>
>>> This patch allocates extra pages for these vmcoreinfo_XXX variables,
>>> one advantage is that it enhances some safety of vmcoreinfo, because
>>> vmcoreinfo now is kept far away from other kernel data structures.
>>>
>>> Suggested-by: Eric Biederman <ebied...@xmission.com>
>>> Cc: Michael Holzheu <holz...@linux.vnet.ibm.com>
>>> Cc: Juergen Gross <jgr...@suse.com>
>>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>>> ---
>>> v3->v4:
>>> -Rebased on the latest linux-next
>>> -Handle S390 vmcoreinfo_note properly
>>> -Handle the newly-added xen/mmu_pv.c
>>>
>>>  arch/ia64/kernel/machine_kexec.c |  5 -
>>>  arch/s390/kernel/machine_kexec.c |  1 +
>>>  arch/s390/kernel/setup.c |  6 --
>>>  arch/x86/kernel/crash.c  |  2 +-
>>>  arch/x86/xen/mmu_pv.c|  4 ++--
>>>  include/linux/crash_core.h   |  2 +-
>>>  kernel/crash_core.c  | 27 +++
>>>  kernel/ksysfs.c  |  2 +-
>>>  8 files changed, 29 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/arch/ia64/kernel/machine_kexec.c 
>>> b/arch/ia64/kernel/machine_kexec.c
>>> index 599507b..c14815d 100644
>>> --- a/arch/ia64/kernel/machine_kexec.c
>>> +++ b/arch/ia64/kernel/machine_kexec.c
>>> @@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void)
>>>  #endif
>>>  }
>>>  
>>> -phys_addr_t paddr_vmcoreinfo_note(void)
>>> -{
>>> -   return ia64_tpa((unsigned long)(char *)&vmcoreinfo_note);
>>> -}
>>> -
>>> diff --git a/arch/s390/kernel/machine_kexec.c 
>>> b/arch/s390/kernel/machine_kexec.c
>>> index 49a6bd4..3d0b14a 100644
>>> --- a/arch/s390/kernel/machine_kexec.c
>>> +++ b/arch/s390/kernel/machine_kexec.c
>>> @@ -246,6 +246,7 @@ void arch_crash_save_vmcoreinfo(void)
>>> VMCOREINFO_SYMBOL(lowcore_ptr);
>>> VMCOREINFO_SYMBOL(high_memory);
>>> VMCOREINFO_LENGTH(lowcore_ptr, NR_CPUS);
>>> +   mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
>>>  }
>>>  
>>>  void machine_shutdown(void)
>>> diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
>>> index 3ae756c..3d1d808 100644
>>> --- a/arch/s390/kernel/setup.c
>>> +++ b/arch/s390/kernel/setup.c
>>> @@ -496,11 +496,6 @@ static void __init setup_memory_end(void)
>>> pr_notice("The maximum memory size is %luMB\n", memory_end >> 20);
>>>  }
>>>  
>>> -static void __init setup_vmcoreinfo(void)
>>> -{
>>> -   mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
>>> -}
>>> -
>>>  #ifdef CONFIG_CRASH_DUMP
>>>  
>>>  /*
>>> @@ -939,7 +934,6 @@ void __init setup_arch(char **cmdline_p)
>>>  #endif
>>>  
>>> setup_resources();
>>> -   setup_vmcoreinfo();
>>> setup_lowcore();
>>> smp_fill_possible_mask();
>>> cpu_detect_mhz_feature();
>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>> index 22217ec..44404e2 100644
>>> --- a/arch/x86/kernel/crash.c
>>> +++ b/arch/x86/kernel/crash.c
>>> @@ -457,7 +457,7 @@ static int prepare_elf64_headers(struct crash_elf_data 
>>> *ced,
>>> bufp += sizeof(Elf64_Phdr);
>>> phdr->p_type = PT_NOTE;
>>> phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
>>> -   phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
>>> +   phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
>>> (ehdr->e_phnum)++;
>>>  
>>>  #ifdef CONFIG_X86_64
>>> di

Re: [PATCH v4 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-04-26 Thread Xunlei Pang
On 04/26/2017 at 03:19 PM, Dave Young wrote:
> Add ia64i list,  and s390 list although Michael has tested it
>
> On 04/20/17 at 07:39pm, Xunlei Pang wrote:
>> As Eric said,
>> "what we need to do is move the variable vmcoreinfo_note out
>> of the kernel's .bss section.  And modify the code to regenerate
>> and keep this information in something like the control page.
>>
>> Definitely something like this needs a page all to itself, and ideally
>> far away from any other kernel data structures.  I clearly was not
>> watching closely the data someone decided to keep this silly thing
>> in the kernel's .bss section."
>>
>> This patch allocates extra pages for these vmcoreinfo_XXX variables,
>> one advantage is that it enhances some safety of vmcoreinfo, because
>> vmcoreinfo now is kept far away from other kernel data structures.
>>
>> Suggested-by: Eric Biederman <ebied...@xmission.com>
>> Cc: Michael Holzheu <holz...@linux.vnet.ibm.com>
>> Cc: Juergen Gross <jgr...@suse.com>
>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>> ---
>> v3->v4:
>> -Rebased on the latest linux-next
>> -Handle S390 vmcoreinfo_note properly
>> -Handle the newly-added xen/mmu_pv.c
>>
>>  arch/ia64/kernel/machine_kexec.c |  5 -
>>  arch/s390/kernel/machine_kexec.c |  1 +
>>  arch/s390/kernel/setup.c |  6 --
>>  arch/x86/kernel/crash.c  |  2 +-
>>  arch/x86/xen/mmu_pv.c|  4 ++--
>>  include/linux/crash_core.h   |  2 +-
>>  kernel/crash_core.c  | 27 +++
>>  kernel/ksysfs.c  |  2 +-
>>  8 files changed, 29 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/ia64/kernel/machine_kexec.c 
>> b/arch/ia64/kernel/machine_kexec.c
>> index 599507b..c14815d 100644
>> --- a/arch/ia64/kernel/machine_kexec.c
>> +++ b/arch/ia64/kernel/machine_kexec.c
>> @@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void)
>>  #endif
>>  }
>>  
>> -phys_addr_t paddr_vmcoreinfo_note(void)
>> -{
>> -return ia64_tpa((unsigned long)(char *)&vmcoreinfo_note);
>> -}
>> -
>> diff --git a/arch/s390/kernel/machine_kexec.c 
>> b/arch/s390/kernel/machine_kexec.c
>> index 49a6bd4..3d0b14a 100644
>> --- a/arch/s390/kernel/machine_kexec.c
>> +++ b/arch/s390/kernel/machine_kexec.c
>> @@ -246,6 +246,7 @@ void arch_crash_save_vmcoreinfo(void)
>>  VMCOREINFO_SYMBOL(lowcore_ptr);
>>  VMCOREINFO_SYMBOL(high_memory);
>>  VMCOREINFO_LENGTH(lowcore_ptr, NR_CPUS);
>> +mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
>>  }
>>  
>>  void machine_shutdown(void)
>> diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
>> index 3ae756c..3d1d808 100644
>> --- a/arch/s390/kernel/setup.c
>> +++ b/arch/s390/kernel/setup.c
>> @@ -496,11 +496,6 @@ static void __init setup_memory_end(void)
>>  pr_notice("The maximum memory size is %luMB\n", memory_end >> 20);
>>  }
>>  
>> -static void __init setup_vmcoreinfo(void)
>> -{
>> -mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
>> -}
>> -
>>  #ifdef CONFIG_CRASH_DUMP
>>  
>>  /*
>> @@ -939,7 +934,6 @@ void __init setup_arch(char **cmdline_p)
>>  #endif
>>  
>>  setup_resources();
>> -setup_vmcoreinfo();
>>  setup_lowcore();
>>  smp_fill_possible_mask();
>>  cpu_detect_mhz_feature();
>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>> index 22217ec..44404e2 100644
>> --- a/arch/x86/kernel/crash.c
>> +++ b/arch/x86/kernel/crash.c
>> @@ -457,7 +457,7 @@ static int prepare_elf64_headers(struct crash_elf_data 
>> *ced,
>>  bufp += sizeof(Elf64_Phdr);
>>  phdr->p_type = PT_NOTE;
>>  phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
>> -phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
>> +phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
>>  (ehdr->e_phnum)++;
>>  
>>  #ifdef CONFIG_X86_64
>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>> index 9d9ae66..35543fa 100644
>> --- a/arch/x86/xen/mmu_pv.c
>> +++ b/arch/x86/xen/mmu_pv.c
>> @@ -2723,8 +2723,8 @@ void xen_destroy_contiguous_region(phys_addr_t pstart, 
>> unsigned int order)
>>  phys_addr_t paddr_vmcoreinfo_note(void)
>>  {
>>  if (xen_pv_domain())
>> -return virt_to_machine(&vmcoreinfo_note).maddr;
>>

Re: [PATCH v4 3/3] kdump: Protect vmcoreinfo data under the crash memory

2017-04-26 Thread Xunlei Pang
On 04/26/2017 at 03:09 PM, Dave Young wrote:
> On 04/20/17 at 07:39pm, Xunlei Pang wrote:
>> Currently vmcoreinfo data is updated at boot time via subsys_initcall(), so
>> it has the risk of being modified by some wrong code while the system
>> is running.
>>
>> As a result, vmcore dumped may contain the wrong vmcoreinfo. Later on,
>> when using "crash", "makedumpfile", etc utility to parse this vmcore,
>> we probably will get "Segmentation fault" or other unexpected errors.
>>
>> E.g. 1) wrong code overwrites vmcoreinfo_data; 2) further crashes the
>> system; 3) trigger kdump, then we obviously will fail to recognize the
>> crash context correctly due to the corrupted vmcoreinfo.
>>
>> Now except for vmcoreinfo, all the crash data is well protected(including
>> the cpu note which is fully updated in the crash path, thus its correctness
>> is guaranteed). Given that vmcoreinfo data is a large chunk prepared for
>> kdump, we better protect it as well.
>>
>> To solve this, we relocate and copy vmcoreinfo_data to the crash memory
>> when kdump is loading via kexec syscalls. Because the whole crash memory
>> will be protected by existing arch_kexec_protect_crashkres() mechanism,
>> we naturally protect vmcoreinfo_data from write(even read) access under
>> kernel direct mapping after kdump is loaded.
>>
>> Since kdump is usually loaded at the very early stage after boot, we can
>> trust the correctness of the vmcoreinfo data copied.
>>
>> On the other hand, we still need to operate the vmcoreinfo safe copy when
>> crash happens to generate vmcoreinfo_note again, we rely on vmap() to map
>> out a new kernel virtual address and update to use this new one instead in
>> the following crash_save_vmcoreinfo().
>>
>> BTW, we do not touch vmcoreinfo_note, because it will be fully updated
>> using the protected vmcoreinfo_data after crash which is surely correct
>> just like the cpu crash note.
>>
>> Cc: Michael Holzheu <holz...@linux.vnet.ibm.com>
>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>> ---
>> v3->v4:
>> -Rebased on the latest linux-next
>> -Copy vmcoreinfo after machine_kexec_prepare()
>>
>>  include/linux/crash_core.h |  2 +-
>>  include/linux/kexec.h  |  2 ++
>>  kernel/crash_core.c| 17 -
>>  kernel/kexec.c |  8 
>>  kernel/kexec_core.c| 39 +++
>>  kernel/kexec_file.c|  8 
>>  6 files changed, 74 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
>> index 7d6bc7b..5469adb 100644
>> --- a/include/linux/crash_core.h
>> +++ b/include/linux/crash_core.h
>> @@ -23,6 +23,7 @@
>>  
>>  typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
>>  
>> +void crash_update_vmcoreinfo_safecopy(void *ptr);
>>  void crash_save_vmcoreinfo(void);
>>  void arch_crash_save_vmcoreinfo(void);
>>  __printf(1, 2)
>> @@ -54,7 +55,6 @@
>>  vmcoreinfo_append_str("PHYS_BASE=%lx\n", (unsigned long)value)
>>  
>>  extern u32 *vmcoreinfo_note;
>> -extern size_t vmcoreinfo_size;
>>  
>>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>>void *data, size_t data_len);
>> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
>> index c9481eb..3ea8275 100644
>> --- a/include/linux/kexec.h
>> +++ b/include/linux/kexec.h
>> @@ -181,6 +181,7 @@ struct kimage {
>>  unsigned long start;
>>  struct page *control_code_page;
>>  struct page *swap_page;
>> +void *vmcoreinfo_data_copy; /* locates in the crash memory */
>>  
>>  unsigned long nr_segments;
>>  struct kexec_segment segment[KEXEC_SEGMENT_MAX];
>> @@ -250,6 +251,7 @@ extern void *kexec_purgatory_get_symbol_addr(struct 
>> kimage *image,
>>  int kexec_should_crash(struct task_struct *);
>>  int kexec_crash_loaded(void);
>>  void crash_save_cpu(struct pt_regs *regs, int cpu);
>> +extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
>>  
>>  extern struct kimage *kexec_image;
>>  extern struct kimage *kexec_crash_image;
>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
>> index 43cdb00..a29e9ad 100644
>> --- a/kernel/crash_core.c
>> +++ b/kernel/crash_core.c
>> @@ -15,9 +15,12 @@
>>  
>>  /* vmcoreinfo stuff */
>>  static unsigned char *vmcoreinfo_data;
>> -size_t vmcoreinfo_size;

[PATCH v2 1/2] x86/mm/ident_map: Add PUD level 1GB page support

2017-04-26 Thread Xunlei Pang
The current kernel_ident_mapping_init() creates the identity
mapping using 2MB page(PMD level), this patch adds the 1GB
page(PUD level) support.

This is useful on large machines to save some reserved memory
(as paging structures) in the kdump case when kexec setups up
identity mappings before booting into the new kernel.

We will utilize this new support in the following patch.

Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
v1->v2:
- Rename info.use_pud_page to info.direct_gbpages
- Align PUD_MASK before set_pud()

 arch/x86/boot/compressed/pagetable.c |  2 +-
 arch/x86/include/asm/init.h  |  3 ++-
 arch/x86/kernel/machine_kexec_64.c   |  2 +-
 arch/x86/mm/ident_map.c  | 14 +-
 arch/x86/power/hibernate_64.c|  2 +-
 5 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/x86/boot/compressed/pagetable.c 
b/arch/x86/boot/compressed/pagetable.c
index 56589d0..1d78f17 100644
--- a/arch/x86/boot/compressed/pagetable.c
+++ b/arch/x86/boot/compressed/pagetable.c
@@ -70,7 +70,7 @@ static void *alloc_pgt_page(void *context)
  * Due to relocation, pointers must be assigned at run time not build time.
  */
 static struct x86_mapping_info mapping_info = {
-   .pmd_flag   = __PAGE_KERNEL_LARGE_EXEC,
+   .page_flag   = __PAGE_KERNEL_LARGE_EXEC,
 };
 
 /* Locates and clears a region for a new top level page table. */
diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index 737da62..d6ead7b 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -4,8 +4,9 @@
 struct x86_mapping_info {
void *(*alloc_pgt_page)(void *); /* allocate buf for page table */
void *context;   /* context for alloc_pgt_page */
-   unsigned long pmd_flag;  /* page flag for PMD entry */
+   unsigned long page_flag; /* page flag for PMD or PUD entry */
unsigned long offset;/* ident mapping offset */
+   bool direct_gbpages;/* PUD level 1GB page support */
 };
 
 int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index 085c3b3..1d4f2b0 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -113,7 +113,7 @@ static int init_pgtable(struct kimage *image, unsigned long 
start_pgtable)
struct x86_mapping_info info = {
.alloc_pgt_page = alloc_pgt_page,
.context= image,
-   .pmd_flag   = __PAGE_KERNEL_LARGE_EXEC,
+   .page_flag  = __PAGE_KERNEL_LARGE_EXEC,
};
unsigned long mstart, mend;
pgd_t *level4p;
diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index 04210a2..adab159 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -13,7 +13,7 @@ static void ident_pmd_init(struct x86_mapping_info *info, 
pmd_t *pmd_page,
if (pmd_present(*pmd))
continue;
 
-   set_pmd(pmd, __pmd((addr - info->offset) | info->pmd_flag));
+   set_pmd(pmd, __pmd((addr - info->offset) | info->page_flag));
}
 }
 
@@ -30,6 +30,18 @@ static int ident_pud_init(struct x86_mapping_info *info, 
pud_t *pud_page,
if (next > end)
next = end;
 
+   if (info->direct_gbpages) {
+   pud_t pudval;
+
+   if (pud_present(*pud))
+   continue;
+
+   addr &= PUD_MASK;
+   pudval = __pud((addr - info->offset) | info->page_flag);
+   set_pud(pud, pudval);
+   continue;
+   }
+
if (pud_present(*pud)) {
pmd = pmd_offset(pud, 0);
ident_pmd_init(info, pmd, addr, next);
diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c
index 6a61194..a6e21fe 100644
--- a/arch/x86/power/hibernate_64.c
+++ b/arch/x86/power/hibernate_64.c
@@ -104,7 +104,7 @@ static int set_up_temporary_mappings(void)
 {
struct x86_mapping_info info = {
.alloc_pgt_page = alloc_pgt_page,
-   .pmd_flag   = __PAGE_KERNEL_LARGE_EXEC,
+   .page_flag  = __PAGE_KERNEL_LARGE_EXEC,
.offset = __PAGE_OFFSET,
};
unsigned long mstart, mend;
-- 
1.8.3.1




[PATCH v2 2/2] x86_64/kexec: Use PUD level 1GB page for identity mapping if available

2017-04-26 Thread Xunlei Pang
Kexec sets up all identity mappings before booting into the new
kernel, and this causes extra memory consumption for paging
structures, which is quite considerable on modern machines with
huge memory.

E.g. on one 32TB machine, in the kdump case, around 128MB (about
4MB/TB) of the reserved memory was consumed by the identity
mappings set up with the current 2MB pages; together with the
loaded kdump kernel, initramfs, etc., this caused the kexec
syscall to fail with -ENOMEM. As a result, we had to enlarge the
reserved memory via "crashkernel=X".

This causes some trouble for distributions that use policies
to evaluate the proper "crashkernel=X" value for users.

Given that the 1GB page feature is very likely available on
machines with a large amount of memory, and that
kernel_ident_mapping_init() now supports PUD level 1GB pages, we
solve this by using 1GB pages to create the identity mapping
page table for kdump whenever the feature is available.

Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 arch/x86/kernel/machine_kexec_64.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index 1d4f2b0..613df62 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -122,6 +122,11 @@ static int init_pgtable(struct kimage *image, unsigned 
long start_pgtable)
 
level4p = (pgd_t *)__va(start_pgtable);
clear_page(level4p);
+
+   /* Use PUD level page if available, to save crash memory for kdump */
+   if (direct_gbpages)
+   info.direct_gbpages = true;
+
for (i = 0; i < nr_pfn_mapped; i++) {
mstart = pfn_mapped[i].start << PAGE_SHIFT;
mend   = pfn_mapped[i].end << PAGE_SHIFT;
-- 
1.8.3.1




Re: [PATCH 1/2] x86/mm/ident_map: Add PUD level 1GB page support

2017-04-25 Thread Xunlei Pang
On 04/26/2017 at 03:49 AM, Yinghai Lu wrote:
> On Tue, Apr 25, 2017 at 2:13 AM, Xunlei Pang <xlp...@redhat.com> wrote:
>> The current kernel_ident_mapping_init() creates the identity
>> mapping using 2MB page(PMD level), this patch adds the 1GB
>> page(PUD level) support.
>>
>> This is useful on large machines to save some reserved memory
>> (as paging structures) in the kdump case when kexec setups up
>> identity mappings before booting into the new kernel.
>>
>> We will utilize this new support in the following patch.
>>
>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>> ---
>>  arch/x86/boot/compressed/pagetable.c |  2 +-
>>  arch/x86/include/asm/init.h  |  3 ++-
>>  arch/x86/kernel/machine_kexec_64.c   |  2 +-
>>  arch/x86/mm/ident_map.c  | 13 -
>>  arch/x86/power/hibernate_64.c|  2 +-
>>  5 files changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/boot/compressed/pagetable.c 
>> b/arch/x86/boot/compressed/pagetable.c
>> index 56589d0..1d78f17 100644
>> --- a/arch/x86/boot/compressed/pagetable.c
>> +++ b/arch/x86/boot/compressed/pagetable.c
>> @@ -70,7 +70,7 @@ static void *alloc_pgt_page(void *context)
>>   * Due to relocation, pointers must be assigned at run time not build time.
>>   */
>>  static struct x86_mapping_info mapping_info = {
>> -   .pmd_flag   = __PAGE_KERNEL_LARGE_EXEC,
>> +   .page_flag   = __PAGE_KERNEL_LARGE_EXEC,
>>  };
>>
>>  /* Locates and clears a region for a new top level page table. */
>> diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
>> index 737da62..46eab1a 100644
>> --- a/arch/x86/include/asm/init.h
>> +++ b/arch/x86/include/asm/init.h
>> @@ -4,8 +4,9 @@
>>  struct x86_mapping_info {
>> void *(*alloc_pgt_page)(void *); /* allocate buf for page table */
>> void *context;   /* context for alloc_pgt_page */
>> -   unsigned long pmd_flag;  /* page flag for PMD entry */
>> +   unsigned long page_flag; /* page flag for PMD or PUD entry */
>> unsigned long offset;/* ident mapping offset */
>> +   bool use_pud_page;  /* PUD level 1GB page support */
> how about use direct_gbpages instead?
> use_pud_page is confusing.

ok

>
>>  };
>>
>>  int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t 
>> *pgd_page,
>> diff --git a/arch/x86/kernel/machine_kexec_64.c 
>> b/arch/x86/kernel/machine_kexec_64.c
>> index 085c3b3..1d4f2b0 100644
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -113,7 +113,7 @@ static int init_pgtable(struct kimage *image, unsigned 
>> long start_pgtable)
>> struct x86_mapping_info info = {
>> .alloc_pgt_page = alloc_pgt_page,
>> .context= image,
>> -   .pmd_flag   = __PAGE_KERNEL_LARGE_EXEC,
>> +   .page_flag  = __PAGE_KERNEL_LARGE_EXEC,
>> };
>> unsigned long mstart, mend;
>> pgd_t *level4p;
>> diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
>> index 04210a2..0ad0280 100644
>> --- a/arch/x86/mm/ident_map.c
>> +++ b/arch/x86/mm/ident_map.c
>> @@ -13,7 +13,7 @@ static void ident_pmd_init(struct x86_mapping_info *info, 
>> pmd_t *pmd_page,
>> if (pmd_present(*pmd))
>> continue;
>>
>> -   set_pmd(pmd, __pmd((addr - info->offset) | info->pmd_flag));
>> +   set_pmd(pmd, __pmd((addr - info->offset) | info->page_flag));
>> }
>>  }
>>
>> @@ -30,6 +30,17 @@ static int ident_pud_init(struct x86_mapping_info *info, 
>> pud_t *pud_page,
>> if (next > end)
>> next = end;
>>
>> +   if (info->use_pud_page) {
>> +   pud_t pudval;
>> +
>> +   if (pud_present(*pud))
>> +   continue;
>> +
>> +   pudval = __pud((addr - info->offset) | 
>> info->page_flag);
>> +   set_pud(pud, pudval);
> should mask addr with PUD_MASK.
>	addr &= PUD_MASK;
>	set_pud(pud, __pud((addr - info->offset) | info->page_flag));

Yes, will update, thanks for the catch.

Regards,
Xunlei

>
>
>> +   continue;
>> +   }
>> +
>> if (pud_present(*pud)) {

[PATCH 2/2] x86_64/kexec: Use PUD level 1GB page for identity mapping if available

2017-04-25 Thread Xunlei Pang
Kexec sets up all identity mappings before booting into the new
kernel, and this causes extra memory consumption for paging
structures, which is quite considerable on modern machines with
huge memory.

E.g. on one 32TB machine, in the kdump case, around 128MB (about
4MB/TB) of the reserved memory was consumed by the identity
mappings set up with the current 2MB pages; together with the
loaded kdump kernel, initramfs, etc., this caused the kexec
syscall to fail with -ENOMEM. As a result, we had to enlarge the
reserved memory via "crashkernel=X".

This causes some trouble for distributions that use policies
to evaluate the proper "crashkernel=X" value for users.

Given that the 1GB page feature is very likely available on
machines with a large amount of memory, and that
kernel_ident_mapping_init() supports PUD level 1GB pages, we
solve this by using 1GB pages to create the identity mapping
page table for kdump whenever the feature is available.

Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 arch/x86/kernel/machine_kexec_64.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index 1d4f2b0..41f1ae7 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -122,6 +122,11 @@ static int init_pgtable(struct kimage *image, unsigned 
long start_pgtable)
 
level4p = (pgd_t *)__va(start_pgtable);
clear_page(level4p);
+
+   /* Use PUD level page if available, to save crash memory for kdump */
+   if (direct_gbpages)
+   info.use_pud_page = true;
+
for (i = 0; i < nr_pfn_mapped; i++) {
mstart = pfn_mapped[i].start << PAGE_SHIFT;
mend   = pfn_mapped[i].end << PAGE_SHIFT;
-- 
1.8.3.1




[PATCH 1/2] x86/mm/ident_map: Add PUD level 1GB page support

2017-04-25 Thread Xunlei Pang
The current kernel_ident_mapping_init() creates the identity
mapping using 2MB pages (PMD level); this patch adds 1GB
page (PUD level) support.

This is useful on large machines to save some reserved memory
(as paging structures) in the kdump case when kexec sets up
identity mappings before booting into the new kernel.

We will utilize this new support in the following patch.

Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 arch/x86/boot/compressed/pagetable.c |  2 +-
 arch/x86/include/asm/init.h  |  3 ++-
 arch/x86/kernel/machine_kexec_64.c   |  2 +-
 arch/x86/mm/ident_map.c  | 13 -
 arch/x86/power/hibernate_64.c|  2 +-
 5 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/boot/compressed/pagetable.c 
b/arch/x86/boot/compressed/pagetable.c
index 56589d0..1d78f17 100644
--- a/arch/x86/boot/compressed/pagetable.c
+++ b/arch/x86/boot/compressed/pagetable.c
@@ -70,7 +70,7 @@ static void *alloc_pgt_page(void *context)
  * Due to relocation, pointers must be assigned at run time not build time.
  */
 static struct x86_mapping_info mapping_info = {
-   .pmd_flag   = __PAGE_KERNEL_LARGE_EXEC,
+   .page_flag   = __PAGE_KERNEL_LARGE_EXEC,
 };
 
 /* Locates and clears a region for a new top level page table. */
diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index 737da62..46eab1a 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -4,8 +4,9 @@
 struct x86_mapping_info {
void *(*alloc_pgt_page)(void *); /* allocate buf for page table */
void *context;   /* context for alloc_pgt_page */
-   unsigned long pmd_flag;  /* page flag for PMD entry */
+   unsigned long page_flag; /* page flag for PMD or PUD entry */
unsigned long offset;/* ident mapping offset */
+   bool use_pud_page;  /* PUD level 1GB page support */
 };
 
 int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index 085c3b3..1d4f2b0 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -113,7 +113,7 @@ static int init_pgtable(struct kimage *image, unsigned long 
start_pgtable)
struct x86_mapping_info info = {
.alloc_pgt_page = alloc_pgt_page,
.context= image,
-   .pmd_flag   = __PAGE_KERNEL_LARGE_EXEC,
+   .page_flag  = __PAGE_KERNEL_LARGE_EXEC,
};
unsigned long mstart, mend;
pgd_t *level4p;
diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index 04210a2..0ad0280 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -13,7 +13,7 @@ static void ident_pmd_init(struct x86_mapping_info *info, 
pmd_t *pmd_page,
if (pmd_present(*pmd))
continue;
 
-   set_pmd(pmd, __pmd((addr - info->offset) | info->pmd_flag));
+   set_pmd(pmd, __pmd((addr - info->offset) | info->page_flag));
}
 }
 
@@ -30,6 +30,17 @@ static int ident_pud_init(struct x86_mapping_info *info, 
pud_t *pud_page,
if (next > end)
next = end;
 
+   if (info->use_pud_page) {
+   pud_t pudval;
+
+   if (pud_present(*pud))
+   continue;
+
+   pudval = __pud((addr - info->offset) | info->page_flag);
+   set_pud(pud, pudval);
+   continue;
+   }
+
if (pud_present(*pud)) {
pmd = pmd_offset(pud, 0);
ident_pmd_init(info, pmd, addr, next);
diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c
index 6a61194..a6e21fe 100644
--- a/arch/x86/power/hibernate_64.c
+++ b/arch/x86/power/hibernate_64.c
@@ -104,7 +104,7 @@ static int set_up_temporary_mappings(void)
 {
struct x86_mapping_info info = {
.alloc_pgt_page = alloc_pgt_page,
-   .pmd_flag   = __PAGE_KERNEL_LARGE_EXEC,
+   .page_flag  = __PAGE_KERNEL_LARGE_EXEC,
.offset = __PAGE_OFFSET,
};
unsigned long mstart, mend;
-- 
1.8.3.1




[PATCH v4 3/3] kdump: Protect vmcoreinfo data under the crash memory

2017-04-20 Thread Xunlei Pang
Currently vmcoreinfo data is updated at boot time via
subsys_initcall(), so it runs the risk of being modified by buggy
code while the system is running.

As a result, the dumped vmcore may contain wrong vmcoreinfo. Later on,
when using "crash", "makedumpfile", etc. to parse this vmcore, we will
probably get a "Segmentation fault" or other unexpected errors.

E.g. 1) wrong code overwrites vmcoreinfo_data; 2) further crashes the
system; 3) trigger kdump, then we obviously will fail to recognize the
crash context correctly due to the corrupted vmcoreinfo.

Now, except for vmcoreinfo, all the crash data is well protected
(including the cpu note, which is fully updated in the crash path,
so its correctness is guaranteed). Given that vmcoreinfo data is a
large chunk prepared for kdump, we had better protect it as well.

To solve this, we relocate and copy vmcoreinfo_data to the crash memory
when kdump is loaded via the kexec syscalls. Because the whole crash
memory is protected by the existing arch_kexec_protect_crashkres()
mechanism, we naturally protect vmcoreinfo_data from write (even read)
access under the kernel direct mapping after kdump is loaded.

Since kdump is usually loaded at the very early stage after boot, we can
trust the correctness of the vmcoreinfo data copied.

On the other hand, we still need to access the vmcoreinfo safe copy
when a crash happens, in order to generate vmcoreinfo_note again; for
that we rely on vmap() to map a new kernel virtual address and switch
to this new address in the subsequent crash_save_vmcoreinfo().

BTW, we do not touch vmcoreinfo_note, because it will be fully
regenerated from the protected vmcoreinfo_data after the crash, so it
is surely correct, just like the cpu crash note.
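
For reference, a minimal sketch of the copy path (modeled on the v3
version of kimage_crash_copy_vmcoreinfo(); the exact v4 code may
differ in details):

/* Sketch only: take one control page from the reserved crash memory,
 * map it through vmalloc space so it stays writable after
 * arch_kexec_protect_crashkres(), and remember the mapping. */
int kimage_crash_copy_vmcoreinfo(struct kimage *image)
{
	struct page *vmcoreinfo_page;
	void *safecopy;

	if (image->type != KEXEC_TYPE_CRASH)
		return 0;

	vmcoreinfo_page = kimage_alloc_control_pages(image, 0);
	if (!vmcoreinfo_page)
		return -ENOMEM;

	safecopy = vmap(&vmcoreinfo_page, 1, VM_MAP, PAGE_KERNEL);
	if (!safecopy)
		return -ENOMEM;

	image->vmcoreinfo_data_copy = safecopy;
	crash_update_vmcoreinfo_safecopy(safecopy);
	return 0;
}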

Cc: Michael Holzheu <holz...@linux.vnet.ibm.com>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
v3->v4:
-Rebased on the latest linux-next
-Copy vmcoreinfo after machine_kexec_prepare()

 include/linux/crash_core.h |  2 +-
 include/linux/kexec.h  |  2 ++
 kernel/crash_core.c| 17 -
 kernel/kexec.c |  8 
 kernel/kexec_core.c| 39 +++
 kernel/kexec_file.c|  8 
 6 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index 7d6bc7b..5469adb 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -23,6 +23,7 @@
 
 typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
 
+void crash_update_vmcoreinfo_safecopy(void *ptr);
 void crash_save_vmcoreinfo(void);
 void arch_crash_save_vmcoreinfo(void);
 __printf(1, 2)
@@ -54,7 +55,6 @@
vmcoreinfo_append_str("PHYS_BASE=%lx\n", (unsigned long)value)
 
 extern u32 *vmcoreinfo_note;
-extern size_t vmcoreinfo_size;
 
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
  void *data, size_t data_len);
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index c9481eb..3ea8275 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -181,6 +181,7 @@ struct kimage {
unsigned long start;
struct page *control_code_page;
struct page *swap_page;
+   void *vmcoreinfo_data_copy; /* locates in the crash memory */
 
unsigned long nr_segments;
struct kexec_segment segment[KEXEC_SEGMENT_MAX];
@@ -250,6 +251,7 @@ extern void *kexec_purgatory_get_symbol_addr(struct kimage 
*image,
 int kexec_should_crash(struct task_struct *);
 int kexec_crash_loaded(void);
 void crash_save_cpu(struct pt_regs *regs, int cpu);
+extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
 
 extern struct kimage *kexec_image;
 extern struct kimage *kexec_crash_image;
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 43cdb00..a29e9ad 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -15,9 +15,12 @@
 
 /* vmcoreinfo stuff */
 static unsigned char *vmcoreinfo_data;
-size_t vmcoreinfo_size;
+static size_t vmcoreinfo_size;
 u32 *vmcoreinfo_note;
 
+/* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
+static unsigned char *vmcoreinfo_data_safecopy;
+
 /*
  * parsing the "crashkernel" commandline
  *
@@ -323,11 +326,23 @@ static void update_vmcoreinfo_note(void)
final_note(buf);
 }
 
+void crash_update_vmcoreinfo_safecopy(void *ptr)
+{
+   if (ptr)
+   memcpy(ptr, vmcoreinfo_data, vmcoreinfo_size);
+
+   vmcoreinfo_data_safecopy = ptr;
+}
+
 void crash_save_vmcoreinfo(void)
 {
if (!vmcoreinfo_note)
return;
 
+   /* Use the safe copy to generate vmcoreinfo note if have */
+   if (vmcoreinfo_data_safecopy)
+   vmcoreinfo_data = vmcoreinfo_data_safecopy;
+
vmcoreinfo_append_str("CRASHTIME=%ld\n", get_seconds());
update_vmcoreinfo_note();
 }
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 980936a..e62ec4d 100644
--- a/kernel/kexec.c

[PATCH v4 2/3] powerpc/fadump: Use the correct VMCOREINFO_NOTE_SIZE for phdr

2017-04-20 Thread Xunlei Pang
vmcoreinfo_max_size refers to vmcoreinfo_data; the correct size to
export here is that of vmcoreinfo_note, whose total size is
VMCOREINFO_NOTE_SIZE.

As explained in commit 77019967f06b ("kdump: fix exported
size of vmcoreinfo note"), this should not affect the actual
functionality, but we had better fix it; the change should also be
safe and backward compatible.

After this, we can get rid of the variable vmcoreinfo_max_size and
use the corresponding macros directly; fewer variables means more
safety for vmcoreinfo operations.
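
For illustration, the exported note buffer has to cover the standard
ELF note framing, not just the vmcoreinfo text. A hedged sketch of the
layout (generic ELF names, not code from this patch):

#include <elf.h>
#include <stddef.h>

#define NOTE_ALIGN(x)	(((x) + 3) & ~3UL)

/* Rough size of the exported vmcoreinfo note: header + "VMCOREINFO"
 * name + the "KEY=value\n" payload, each 4-byte aligned, plus a
 * terminating empty note header. */
static size_t vmcoreinfo_note_bytes(size_t data_bytes)
{
	return sizeof(Elf64_Nhdr) +
	       NOTE_ALIGN(sizeof("VMCOREINFO")) +
	       NOTE_ALIGN(data_bytes) +
	       sizeof(Elf64_Nhdr);
}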

Cc: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
Cc: Hari Bathini <hbath...@linux.vnet.ibm.com>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
v3->v4:
-Rebased on the latest linux-next

 arch/powerpc/kernel/fadump.c | 3 +--
 include/linux/crash_core.h   | 1 -
 kernel/crash_core.c  | 3 +--
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 466569e..7bd6cd0 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -893,8 +893,7 @@ static int fadump_create_elfcore_headers(char *bufp)
 
phdr->p_paddr   = fadump_relocate(paddr_vmcoreinfo_note());
phdr->p_offset  = phdr->p_paddr;
-   phdr->p_memsz   = vmcoreinfo_max_size;
-   phdr->p_filesz  = vmcoreinfo_max_size;
+   phdr->p_memsz   = phdr->p_filesz = VMCOREINFO_NOTE_SIZE;
 
/* Increment number of program headers. */
(elf->e_phnum)++;
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index ba283a2..7d6bc7b 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -55,7 +55,6 @@
 
 extern u32 *vmcoreinfo_note;
 extern size_t vmcoreinfo_size;
-extern size_t vmcoreinfo_max_size;
 
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
  void *data, size_t data_len);
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 0321f04..43cdb00 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -16,7 +16,6 @@
 /* vmcoreinfo stuff */
 static unsigned char *vmcoreinfo_data;
 size_t vmcoreinfo_size;
-size_t vmcoreinfo_max_size = VMCOREINFO_BYTES;
 u32 *vmcoreinfo_note;
 
 /*
@@ -343,7 +342,7 @@ void vmcoreinfo_append_str(const char *fmt, ...)
r = vscnprintf(buf, sizeof(buf), fmt, args);
va_end(args);
 
-   r = min(r, vmcoreinfo_max_size - vmcoreinfo_size);
+   r = min(r, VMCOREINFO_BYTES - vmcoreinfo_size);
 
memcpy(&vmcoreinfo_data[vmcoreinfo_size], buf, r);
 
-- 
1.8.3.1




[PATCH v4 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-04-20 Thread Xunlei Pang
As Eric said,
"what we need to do is move the variable vmcoreinfo_note out
of the kernel's .bss section.  And modify the code to regenerate
and keep this information in something like the control page.

Definitely something like this needs a page all to itself, and ideally
far away from any other kernel data structures.  I clearly was not
watching closely the data someone decided to keep this silly thing
in the kernel's .bss section."

This patch allocates extra pages for these vmcoreinfo_XXX variables.
One advantage is that it improves the safety of vmcoreinfo, because
vmcoreinfo is now kept far away from other kernel data structures.
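
A minimal sketch of what allocating those extra pages can look like
(an assumption for illustration; the actual hunk in kernel/crash_core.c
may differ):

/* Sketch only: allocate the buffers at init time instead of placing
 * them in .bss, so they live in dynamically allocated pages far away
 * from other kernel data structures. */
static int __init crash_save_vmcoreinfo_init(void)
{
	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);
	if (!vmcoreinfo_data)
		return -ENOMEM;

	vmcoreinfo_note = alloc_pages_exact(VMCOREINFO_NOTE_SIZE,
					    GFP_KERNEL | __GFP_ZERO);
	if (!vmcoreinfo_note) {
		free_page((unsigned long)vmcoreinfo_data);
		vmcoreinfo_data = NULL;
		return -ENOMEM;
	}

	/* ... append the usual VMCOREINFO_* entries as before ... */
	return 0;
}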

Suggested-by: Eric Biederman <ebied...@xmission.com>
Cc: Michael Holzheu <holz...@linux.vnet.ibm.com>
Cc: Juergen Gross <jgr...@suse.com>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
v3->v4:
-Rebased on the latest linux-next
-Handle S390 vmcoreinfo_note properly
-Handle the newly-added xen/mmu_pv.c

 arch/ia64/kernel/machine_kexec.c |  5 -
 arch/s390/kernel/machine_kexec.c |  1 +
 arch/s390/kernel/setup.c |  6 --
 arch/x86/kernel/crash.c  |  2 +-
 arch/x86/xen/mmu_pv.c|  4 ++--
 include/linux/crash_core.h   |  2 +-
 kernel/crash_core.c  | 27 +++
 kernel/ksysfs.c  |  2 +-
 8 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/arch/ia64/kernel/machine_kexec.c b/arch/ia64/kernel/machine_kexec.c
index 599507b..c14815d 100644
--- a/arch/ia64/kernel/machine_kexec.c
+++ b/arch/ia64/kernel/machine_kexec.c
@@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void)
 #endif
 }
 
-phys_addr_t paddr_vmcoreinfo_note(void)
-{
-   return ia64_tpa((unsigned long)(char *)&vmcoreinfo_note);
-}
-
diff --git a/arch/s390/kernel/machine_kexec.c b/arch/s390/kernel/machine_kexec.c
index 49a6bd4..3d0b14a 100644
--- a/arch/s390/kernel/machine_kexec.c
+++ b/arch/s390/kernel/machine_kexec.c
@@ -246,6 +246,7 @@ void arch_crash_save_vmcoreinfo(void)
VMCOREINFO_SYMBOL(lowcore_ptr);
VMCOREINFO_SYMBOL(high_memory);
VMCOREINFO_LENGTH(lowcore_ptr, NR_CPUS);
+   mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
 }
 
 void machine_shutdown(void)
diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index 3ae756c..3d1d808 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -496,11 +496,6 @@ static void __init setup_memory_end(void)
pr_notice("The maximum memory size is %luMB\n", memory_end >> 20);
 }
 
-static void __init setup_vmcoreinfo(void)
-{
-   mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
-}
-
 #ifdef CONFIG_CRASH_DUMP
 
 /*
@@ -939,7 +934,6 @@ void __init setup_arch(char **cmdline_p)
 #endif
 
setup_resources();
-   setup_vmcoreinfo();
setup_lowcore();
smp_fill_possible_mask();
cpu_detect_mhz_feature();
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 22217ec..44404e2 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -457,7 +457,7 @@ static int prepare_elf64_headers(struct crash_elf_data *ced,
bufp += sizeof(Elf64_Phdr);
phdr->p_type = PT_NOTE;
phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
-   phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
+   phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
(ehdr->e_phnum)++;
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 9d9ae66..35543fa 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2723,8 +2723,8 @@ void xen_destroy_contiguous_region(phys_addr_t pstart, 
unsigned int order)
 phys_addr_t paddr_vmcoreinfo_note(void)
 {
if (xen_pv_domain())
-   return virt_to_machine(&vmcoreinfo_note).maddr;
+   return virt_to_machine(vmcoreinfo_note).maddr;
else
-   return __pa_symbol(&vmcoreinfo_note);
+   return __pa(vmcoreinfo_note);
 }
 #endif /* CONFIG_KEXEC_CORE */
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index eb71a70..ba283a2 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -53,7 +53,7 @@
 #define VMCOREINFO_PHYS_BASE(value) \
vmcoreinfo_append_str("PHYS_BASE=%lx\n", (unsigned long)value)
 
-extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
+extern u32 *vmcoreinfo_note;
 extern size_t vmcoreinfo_size;
 extern size_t vmcoreinfo_max_size;
 
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index fcbd568..0321f04 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -14,10 +14,10 @@
 #include 
 
 /* vmcoreinfo stuff */
-static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
-u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
+static unsigned char *vmcoreinfo_data;
 size_t vmcoreinfo_size;
-size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
+size_t vmcoreinfo_max_size = VMCOREINFO_BYTES;

Re: No man page for kdump (yet?)

2017-04-12 Thread Xunlei Pang
On 04/11/2017 at 05:26 AM, Philip Prindeville wrote:
>> On Apr 10, 2017, at 5:57 AM, Bhupesh Sharma  wrote:
>>
>> Hi,
>>
>> It seems the latest upstream kexec-tools does not support a complete man
>> page for kdump yet:
>>
>> DESCRIPTION
>>   kdump does not have a man page yet.
>>
>> I would propose having a man page for kdump (as the
>> feature is quite useful and has matured over time).
>>
>> I can work to cook up a patch for the same.
>>
>> Please let me know if this makes sense or if there are any objections
>> to the same.
>>
>> Regards,
>> Bhupesh
>
> Personally I’d welcome one.

This tool(/usr/sbin/kdump) looks like "cp /proc/vmcore" under kdump,
does anyone know its existent/potential users?

Regards,
Xunlei

>
> -Philip
>
>
> ___
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec




Re: [RFC PATCH] x86_64/mm/boot: Fix kernel_ident_mapping_init() failure for kexec

2017-03-30 Thread Xunlei Pang
On 03/30/2017 at 07:21 PM, Xunlei Pang wrote:
> On 03/24/2017 at 08:04 PM, Kirill A. Shutemov wrote:
>> On Mon, Mar 20, 2017 at 02:11:31PM +0800, Xunlei Pang wrote:
>>> I found that the kdump is broken on linux-4.11.0-rc2+
>> That's actually tip tree or linux-next. The problematic change is not in
>> Linus' tree.
>>
>>> , probably
>>> due to the 5level-paging feature that "#define p4d_present(p4d) 1",
>>> as a result in ident_p4d_init(), it will go into ident_pud_init()
>>> directly without allocating the new pud.
>>>
>>> Looks like this patch can make it work again.
>> Okay, that's bisectability issue. Uncovered by splitting my patchset into
>> parts.
>>
>> Could you check if applying "Part 2" of 5-level paging changes[1] would
>> help you?
> I confirmed that it works after applying your following patches:
>   x86: Convert the rest of the code to support p4d_t

To be exact, this one("x86: Convert the rest of the code to support p4d_t") 
fixes the issue.

>   x86/xen: Change __xen_pgd_walk() and xen_cleanmfnmap() to support p4d
>   x86/kasan: Prepare clear_pgds() to switch to <asm-generic/pgtable-nop4d.h>
>   x86/mm/pat: Add 5-level paging support
>   x86/efi: Add 5-level paging support
>   x86/kexec: Add 5-level paging support
>
> Regards,
> Xunlei
>
>> Making the code work with both <asm-generic/5level-fixup.h> and
>> <asm-generic/pgtable-nop4d.h> would make it even uglier. Not sure if it
>> makes sense to address it on its own if second part fixes the situation.
>>
>> [1] 
>> http://lkml.kernel.org/r/20170317185515.8636-1-kirill.shute...@linux.intel.com
>>
>
> ___
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec




Re: [RFC PATCH] x86_64/mm/boot: Fix kernel_ident_mapping_init() failure for kexec

2017-03-30 Thread Xunlei Pang
On 03/24/2017 at 08:04 PM, Kirill A. Shutemov wrote:
> On Mon, Mar 20, 2017 at 02:11:31PM +0800, Xunlei Pang wrote:
>> I found that the kdump is broken on linux-4.11.0-rc2+
> That's actually tip tree or linux-next. The problematic change is not in
> Linus' tree.
>
>> , probably
>> due to the 5level-paging feature that "#define p4d_present(p4d) 1",
>> as a result in ident_p4d_init(), it will go into ident_pud_init()
>> directly without allocating the new pud.
>>
>> Looks like this patch can make it work again.
> Okay, that's bisectability issue. Uncovered by splitting my patchset into
> parts.
>
> Could you check if applying "Part 2" of 5-level paging changes[1] would
> help you?

I confirmed that it works after applying your following patches:
  x86: Convert the rest of the code to support p4d_t
  x86/xen: Change __xen_pgd_walk() and xen_cleanmfnmap() to support p4d
  x86/kasan: Prepare clear_pgds() to switch to <asm-generic/pgtable-nop4d.h>
  x86/mm/pat: Add 5-level paging support
  x86/efi: Add 5-level paging support
  x86/kexec: Add 5-level paging support

Regards,
Xunlei

>
> Making the code work with both <asm-generic/5level-fixup.h> and
> <asm-generic/pgtable-nop4d.h> would make it even uglier. Not sure if it
> makes sense to address it on its own if second part fixes the situation.
>
> [1] 
> http://lkml.kernel.org/r/20170317185515.8636-1-kirill.shute...@linux.intel.com
>




Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-23 Thread Xunlei Pang
On 03/23/2017 at 04:48 AM, Michael Holzheu wrote:
> Am Wed, 22 Mar 2017 12:30:04 +0800
> schrieb Dave Young :
>
>> On 03/21/17 at 10:18pm, Eric W. Biederman wrote:
>>> Dave Young  writes:
>>>
> [snip]
>
 I think makedumpfile is using it, but I also vote to remove the
 CRASHTIME. It is better not to do this while crashing and a makedumpfile
 userspace patch is needed to drop the use of it.

> As we are looking at reliability concerns removing CRASHTIME should make
> everything in vmcoreinfo a boot time constant.  Which should simplify
> everything considerably.
 It is a nice improvement..
>>> We also need to take a close look at what s390 is doing with vmcoreinfo.
>>> As apparently it is reading it in a different kind of crashdump process.
>> Yes, need careful review from s390 and maybe ppc64 especially about
>> patch 2/3, better to have comments from IBM about s390 dump tool and ppc
>> fadump. Added more cc.
> On s390 we have at least an issue with patch 1/3. For stand-alone dump
> and also because we create the ELF header for kdump in the new
> kernel we save the pointer to the vmcoreinfo note in the old kernel on a
> defined memory address in our absolute zero lowcore.
>
> This is done in arch/s390/kernel/setup.c:
>
> static void __init setup_vmcoreinfo(void)
> {
> mem_assign_absolute(S390_lowcore.vmcore_info, 
> paddr_vmcoreinfo_note());
> }
>
> Since with patch 1/3 paddr_vmcoreinfo_note() returns NULL at this point in
> time we have a problem here.
>
> To solve this - I think - we could move the initialization to
> arch/s390/kernel/machine_kexec.c:
>
> void arch_crash_save_vmcoreinfo(void)
> {
> VMCOREINFO_SYMBOL(lowcore_ptr);
> VMCOREINFO_SYMBOL(high_memory);
> VMCOREINFO_LENGTH(lowcore_ptr, NR_CPUS);
> mem_assign_absolute(S390_lowcore.vmcore_info, 
> paddr_vmcoreinfo_note());
> }
>
> Probably related to this is my observation that patch 3/3 leads to
> an empty VMCOREINFO note for kdump on s390. The note is there ...
>
> # readelf -n /var/crash/127.0.0.1-2017-03-22-21:14:39/vmcore | grep VMCORE
>   VMCOREINFO   0x068e   Unknown note type: (0x)
>
> But it contains only zeros.

Yes, this is a good catch, I will do more tests.

Thanks,
Xunlei

>
> Unfortunately I have not yet understood the reason for this.
>
> Michael
>
>
> ___
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec




Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-22 Thread Xunlei Pang
On 03/22/2017 at 12:30 PM, Dave Young wrote:
> On 03/21/17 at 10:18pm, Eric W. Biederman wrote:
>> Dave Young <dyo...@redhat.com> writes:
>>
>>> On 03/20/17 at 10:33pm, Eric W. Biederman wrote:
>>>> Xunlei Pang <xlp...@redhat.com> writes:
>>>>
>>>>> As Eric said,
>>>>> "what we need to do is move the variable vmcoreinfo_note out
>>>>> of the kernel's .bss section.  And modify the code to regenerate
>>>>> and keep this information in something like the control page.
>>>>>
>>>>> Definitely something like this needs a page all to itself, and ideally
>>>>> far away from any other kernel data structures.  I clearly was not
>>>>> watching closely the data someone decided to keep this silly thing
>>>>> in the kernel's .bss section."
>>>>>
>>>>> This patch allocates extra pages for these vmcoreinfo_XXX variables,
>>>>> one advantage is that it enhances some safety of vmcoreinfo, because
>>>>> vmcoreinfo now is kept far away from other kernel data structures.
>>>> Can you preceed this patch with a patch that removes CRASHTIME from
>>>> vmcoreinfo?  If someone actually cares we can add a separate note that 
>>>> holds
>>>> a 64bit crashtime in the per cpu notes.  
>>> I think makedumpfile is using it, but I also vote to remove the
>>> CRASHTIME. It is better not to do this while crashing and a makedumpfile
>>> userspace patch is needed to drop the use of it.
>>>

Moving the CRASHTIME info to the cpu note of the crashed cpu may be a good
way. In the kdump kernel, the notes of the vmcore ELF header will be merged
into one big note section; I don't know how makedumpfile or crash handles
that big note section. If they process the notes in some fixed order,
breakage will definitely happen...

fadump may also be affected.

Regards,
Xunlei

>>>> As we are looking at reliability concerns removing CRASHTIME should make
>>>> everything in vmcoreinfo a boot time constant.  Which should simplify
>>>> everything considerably.
>>> It is a nice improvement..
>> We also need to take a close look at what s390 is doing with vmcoreinfo.
>> As apparently it is reading it in a different kind of crashdump process.
> Yes, need careful review from s390 and maybe ppc64 especially about
> patch 2/3, better to have comments from IBM about s390 dump tool and ppc
> fadump. Added more cc.
>
> Thanks
> Dave
>
> ___
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec




Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-22 Thread Xunlei Pang
On 03/22/2017 at 04:55 PM, Xunlei Pang wrote:
> On 03/21/2017 at 11:33 AM, Eric W. Biederman wrote:
>> Xunlei Pang <xlp...@redhat.com> writes:
>>
>>> As Eric said,
>>> "what we need to do is move the variable vmcoreinfo_note out
>>> of the kernel's .bss section.  And modify the code to regenerate
>>> and keep this information in something like the control page.
>>>
>>> Definitely something like this needs a page all to itself, and ideally
>>> far away from any other kernel data structures.  I clearly was not
>>> watching closely the data someone decided to keep this silly thing
>>> in the kernel's .bss section."
>>>
>>> This patch allocates extra pages for these vmcoreinfo_XXX variables,
>>> one advantage is that it enhances some safety of vmcoreinfo, because
>>> vmcoreinfo now is kept far away from other kernel data structures.
>> Can you preceed this patch with a patch that removes CRASHTIME from
>> vmcoreinfo?  If someone actually cares we can add a separate note that holds
>> a 64bit crashtime in the per cpu notes.  
> Hi Eric,
>
> Thanks for your review, I took some time and did some investigation.
>
> Removing "CRASHTIME=X" from vmcoreinfo_note will break user-space tools.
> For example, makedumpfile gets vmcoreinfo note information by reading
> "/sys/kernel/vmcoreinfo"  its PA, then get its "VA = PA | PAGE_OFFSET",
> and then get the timestamp. This operates in the first kernel even before
> kdump is loaded.

Think more, this is not a problem for "makedumpfile --mem-usage",
as the system doesn't have "CRASHTIME" before crash. But still we
may have the following concerns.

>
> Actually, even moving vmcoreinfo_note[] into the crash memory, it
> may have problems, for example, on s390 system the crash memory
> range will be unmapped, so I guess it may cause some risks.
>
> Additionally, there is no available way for us to allocate a page from the
> crash memory during kernel initialization, we only can achieve this during
> the kexec syscalls. There is not a neat way to implement a function to
> allocate pages from the crash memory during kernel initialization without
> some hack code added, because user-space tools(like kexec-tools) can
> allocate the crash segment by their own ways from the crash memory.
>
> That's why I only copy vmcoreinfo_data[] into the crash memory, and
> not touch vmcoreinfo_note, so vmcoreinfo_data is well protected in
> the crash memory copy, then in crash_save_vmcoreinfo(), we copy
> this guaranteed copy into vmcoreinfo_note[], so the correctness of
> vmcoreinfo_note[] is guaranteed. This is what [PATCH v3 3/3] does.
>
> The current crash_save_vmcoreinfo() only involves memory(memcpy)
> operations even for get_seconds(no locks), the only risk I can think
> of now is that vmcoreinfo_note pointer may be corrupted. If it is a concern,
> I guess we can put it into struct kimage" just like vmcoreinfo_XXX_copy
> in this patch. After all if kimage structure was corrupted when crash happens,
> we can do nothing but have to accept the fate.
>
> So does it really deserve to eliminate crash_save_vmcoreinfo()?
>
> Regards,
> Xunlei
>
>> As we are looking at reliability concerns removing CRASHTIME should make
>> everything in vmcoreinfo a boot time constant.  Which should simplify
>> everything considerably.
>>
>> Which means we only need to worry abou the per-cpu notes being written
>> at the time of a crash.
>>
>>> Suggested-by: Eric Biederman <ebied...@xmission.com>
>>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>>> ---
>>>  arch/ia64/kernel/machine_kexec.c |  5 -
>>>  arch/x86/kernel/crash.c  |  2 +-
>>>  include/linux/kexec.h|  2 +-
>>>  kernel/kexec_core.c  | 29 -
>>>  kernel/ksysfs.c  |  2 +-
>>>  5 files changed, 27 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/arch/ia64/kernel/machine_kexec.c 
>>> b/arch/ia64/kernel/machine_kexec.c
>>> index 599507b..c14815d 100644
>>> --- a/arch/ia64/kernel/machine_kexec.c
>>> +++ b/arch/ia64/kernel/machine_kexec.c
>>> @@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void)
>>>  #endif
>>>  }
>>>  
>>> -phys_addr_t paddr_vmcoreinfo_note(void)
>>> -{
>>> -   return ia64_tpa((unsigned long)(char *)&vmcoreinfo_note);
>>> -}
>>> -
>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>> index 3741461..4d35fbb 100644
>>> -

Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-22 Thread Xunlei Pang
On 03/21/2017 at 11:33 AM, Eric W. Biederman wrote:
> Xunlei Pang <xlp...@redhat.com> writes:
>
>> As Eric said,
>> "what we need to do is move the variable vmcoreinfo_note out
>> of the kernel's .bss section.  And modify the code to regenerate
>> and keep this information in something like the control page.
>>
>> Definitely something like this needs a page all to itself, and ideally
>> far away from any other kernel data structures.  I clearly was not
>> watching closely the data someone decided to keep this silly thing
>> in the kernel's .bss section."
>>
>> This patch allocates extra pages for these vmcoreinfo_XXX variables,
>> one advantage is that it enhances some safety of vmcoreinfo, because
>> vmcoreinfo now is kept far away from other kernel data structures.
> Can you preceed this patch with a patch that removes CRASHTIME from
> vmcoreinfo?  If someone actually cares we can add a separate note that holds
> a 64bit crashtime in the per cpu notes.  

Hi Eric,

Thanks for your review, I took some time and did some investigation.

Removing "CRASHTIME=X" from vmcoreinfo_note will break user-space tools.
For example, makedumpfile gets the vmcoreinfo note information by reading
"/sys/kernel/vmcoreinfo" to get its PA, then gets its "VA = PA | PAGE_OFFSET",
and then gets the timestamp. This operates in the first kernel even before
kdump is loaded.
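
For illustration, something along these lines (a userspace sketch; the
exact "<PA> <size>" output format of the sysfs file is an assumption
here):

#include <stdio.h>

int main(void)
{
	unsigned long long paddr;
	unsigned long size;
	FILE *f = fopen("/sys/kernel/vmcoreinfo", "r");

	if (!f)
		return 1;
	if (fscanf(f, "%llx %lx", &paddr, &size) != 2) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("vmcoreinfo note at PA 0x%llx, %lu bytes\n", paddr, size);
	return 0;
}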

Actually, even moving vmcoreinfo_note[] into the crash memory may
have problems; for example, on s390 systems the crash memory range
will be unmapped, so I guess it may cause some risks.

Additionally, there is no available way for us to allocate a page from the
crash memory during kernel initialization; we can only achieve this during
the kexec syscalls. There is no neat way to implement a function that
allocates pages from the crash memory during kernel initialization without
adding some hack code, because user-space tools (like kexec-tools) can
allocate the crash segments in their own ways from the crash memory.

That's why I only copy vmcoreinfo_data[] into the crash memory and
do not touch vmcoreinfo_note: vmcoreinfo_data is well protected in
the crash memory copy, and in crash_save_vmcoreinfo() we copy this
guaranteed copy into vmcoreinfo_note[], so the correctness of
vmcoreinfo_note[] is guaranteed. This is what [PATCH v3 3/3] does.

The current crash_save_vmcoreinfo() only involves memory (memcpy)
operations, and even get_seconds() takes no locks; the only risk I can
think of now is that the vmcoreinfo_note pointer may be corrupted. If
that is a concern, I guess we can put it into "struct kimage" just like
vmcoreinfo_XXX_copy in this patch. After all, if the kimage structure is
corrupted when the crash happens, we can do nothing but accept the fate.

So does it really deserve to eliminate crash_save_vmcoreinfo()?

Regards,
Xunlei

>
> As we are looking at reliability concerns removing CRASHTIME should make
> everything in vmcoreinfo a boot time constant.  Which should simplify
> everything considerably.
>
> Which means we only need to worry abou the per-cpu notes being written
> at the time of a crash.
>
>> Suggested-by: Eric Biederman <ebied...@xmission.com>
>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>> ---
>>  arch/ia64/kernel/machine_kexec.c |  5 -
>>  arch/x86/kernel/crash.c  |  2 +-
>>  include/linux/kexec.h|  2 +-
>>  kernel/kexec_core.c  | 29 -
>>  kernel/ksysfs.c  |  2 +-
>>  5 files changed, 27 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/ia64/kernel/machine_kexec.c 
>> b/arch/ia64/kernel/machine_kexec.c
>> index 599507b..c14815d 100644
>> --- a/arch/ia64/kernel/machine_kexec.c
>> +++ b/arch/ia64/kernel/machine_kexec.c
>> @@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void)
>>  #endif
>>  }
>>  
>> -phys_addr_t paddr_vmcoreinfo_note(void)
>> -{
>> -return ia64_tpa((unsigned long)(char *)&vmcoreinfo_note);
>> -}
>> -
>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>> index 3741461..4d35fbb 100644
>> --- a/arch/x86/kernel/crash.c
>> +++ b/arch/x86/kernel/crash.c
>> @@ -456,7 +456,7 @@ static int prepare_elf64_headers(struct crash_elf_data 
>> *ced,
>>  bufp += sizeof(Elf64_Phdr);
>>  phdr->p_type = PT_NOTE;
>>  phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
>> -phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
>> +phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
>>  (ehdr->e_phnum)++;
>>  
>>  #ifdef CONFIG_X86_64
>> diff --git a/include/linux/kexec.h b/include/linux/k

Re: [PATCH] kexec: Update vmcoreinfo after crash happened

2017-03-20 Thread Xunlei Pang
On 03/20/2017 at 09:04 PM, Petr Tesarik wrote:
> On Mon, 20 Mar 2017 10:17:42 +0800
> Xunlei Pang <xp...@redhat.com> wrote:
>
>> On 03/19/2017 at 02:23 AM, Petr Tesarik wrote:
>>> On Thu, 16 Mar 2017 21:40:58 +0800
>>> Xunlei Pang <xp...@redhat.com> wrote:
>>>
>>>> On 03/16/2017 at 09:18 PM, Baoquan He wrote:
>>>>> On 03/16/17 at 08:36pm, Xunlei Pang wrote:
>>>>>> On 03/16/2017 at 08:27 PM, Baoquan He wrote:
>>>>>>> Hi Xunlei,
>>>>>>>
>>>>>>> Did you really see this ever happened? Because the vmcore size estimate
>>>>>>> feature, namely --mem-usage option of makedumpfile, depends on the
>>>>>>> vmcoreinfo in 1st kernel, your change will break it.
>>>>>> Hi Baoquan,
>>>>>>
>>>>>> I can reproduce it using a kernel module which modifies the vmcoreinfo,
>>>>>> so it's a problem can actually happen.
>>>>>>
>>>>>>> If not, it could be not good to change that.
>>>>>> That's a good point, then I guess we can keep the 
>>>>>> crash_save_vmcoreinfo_init(),
>>>>>> and store again all the vmcoreinfo after crash. What do you think?
>>>>> Well, then it will make makedumpfile segfault happen too when execute
>>>>> below command in 1st kernel if it existed:
>>>>>   makedumpfile --mem-usage /proc/kcore
>>>> Yes, if the initial vmcoreinfo data was modified before "makedumpfile 
>>>> --mem-usage", it might happen,
>>>> after all the system is going something wrong. And that's why we deploy 
>>>> kdump service at the very
>>>> beginning when the system has a low possibility of going wrong.
>>>>
>>>> But we have to guarantee kdump vmcore can be generated correctly as 
>>>> possible as it can.
>>>>
>>>>> So we still need to face that problem and need fix it. vmcoreinfo_note
>>>>> is in kernel data area, how does module intrude into this area? And can
>>>>> we fix the module code?
>>>>>
>>>> Bugs always exist in products, we can't know what will happen and fix all 
>>>> the errors,
>>>> that's why we need kdump.
>>>>
>>>> I think the following update should guarantee the correct vmcoreinfo for 
>>>> kdump.
>>> I'm still not convinced. I would probably have more trust in a clean
>>> kernel (after boot) than a kernel that has already crashed (presumably
>>> because of a serious bug). How can be reliability improved by running
>>> more code in unsafe environment?
>> Correct, I realized that, so used crc32 to protect the original data,
>> but since Eric left a more reasonable idea, I will try that later.
>>
>>> If some code overwrites reserved areas (such as vmcoreinfo), then it's
>>> seriously buggy. And in my opinion, it is more difficult to identify
>>> such bugs if they are masked by re-initializing vmcoreinfo after crash.
>>> In fact, if makedumpfile in the kexec'ed kernel complains that it
>>> didn't find valid VMCOREINFO content, that's already a hint.
>>>
>>> As a side note, if you're debugging a vmcoreinfo corruption, it's
>>> possible to use a standalone VMCOREINFO file with makedumpfile, so you
>>> can pre-generate it and save it in the kdump initrd.
>>>
>>> In short, I don't see a compelling case for this change.
>> E.g. 1) wrong code overwrites vmcoreinfo_data; 2) further crashes the
>> system; 3) trigger kdump, then we obviously will fail to recognize the
>> crash context correctly due to the corrupted vmcoreinfo.  Everyone
>> will get confused if met such unfortunate customer-side issue.
>>
>> Although it's corner case, if it's easy to fix, then I think we better do it.
>>
>> Now except for vmcoreinfo, all the crash data is well protected (including
>> cpu note which is fully updated in the crash path, thus its correctness is
>> guaranteed).
> Hm, I think we shouldn't combine the two things.
>
> Protecting VMCOREINFO with SHA (just as the other information passed to
> the secondary kernel) sounds right to me. Re-creating the info while
> the kernel is already crashing does not sound particularly good.
>
> Yes, your patch may help in some scenarios, but in general it also
> increases the amount of code that must reliably work in a crashed
> environment. I can still recall why the LKCD approach (save the dump
> directly from the crashed kernel) was abandoned...

Agree on this point: there is nearly no extra code added to the crash path
in v3, maybe you can have a quick look.

>
> Apart, there's a lot of other information that might be corrupted (e.g.
> the purgatory code, elfcorehdr, secondary kernel, or the initrd).

Those are located in the crash memory; they can be protected by either
SHA or the arch_kexec_protect_crashkres() mechanism (if implemented).

>
> Why is this VMCOREINFO so special?

It is also a chunk passed to the 2nd kernel like the above-mentioned
information, so we had better treat it the same way.

Regards,
Xunlei



[RFC PATCH] x86_64/mm/boot: Fix kernel_ident_mapping_init() failure for kexec

2017-03-20 Thread Xunlei Pang
I found that kdump is broken on linux-4.11.0-rc2+, probably
due to the 5-level paging feature's "#define p4d_present(p4d) 1":
as a result, ident_p4d_init() goes into ident_pud_init()
directly without allocating the new pud.

Looks like this patch can make it work again.

Cc: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 arch/x86/mm/ident_map.c| 22 ++
 include/asm-generic/5level-fixup.h |  6 +++---
 2 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index 04210a2..d14bb52 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -97,22 +97,20 @@ int kernel_ident_mapping_init(struct x86_mapping_info 
*info, pgd_t *pgd_page,
continue;
}
 
-   p4d = (p4d_t *)info->alloc_pgt_page(info->context);
-   if (!p4d)
-   return -ENOMEM;
+   if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+   p4d = (p4d_t *)info->alloc_pgt_page(info->context);
+   if (!p4d)
+   return -ENOMEM;
+   } else {
+   p4d = pgd;
+   }
+
result = ident_p4d_init(info, p4d, addr, next);
if (result)
return result;
-   if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+
+   if (IS_ENABLED(CONFIG_X86_5LEVEL))
set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
-   } else {
-   /*
-* With p4d folded, pgd is equal to p4d.
-* The pgd entry has to point to the pud page table in 
this case.
-*/
-   pud_t *pud = pud_offset(p4d, 0);
-   set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
-   }
}
 
return 0;
diff --git a/include/asm-generic/5level-fixup.h 
b/include/asm-generic/5level-fixup.h
index b5ca82d..3131ada 100644
--- a/include/asm-generic/5level-fixup.h
+++ b/include/asm-generic/5level-fixup.h
@@ -17,9 +17,9 @@
 
 #define p4d_alloc(mm, pgd, address)(pgd)
 #define p4d_offset(pgd, start) (pgd)
-#define p4d_none(p4d)  0
-#define p4d_bad(p4d)   0
-#define p4d_present(p4d)   1
+#define p4d_none(p4d)  pgd_none(p4d)
+#define p4d_bad(p4d)   pgd_bad(p4d)
+#define p4d_present(p4d)   pgd_present(p4d)
 #define p4d_ERROR(p4d) do { } while (0)
 #define p4d_clear(p4d) pgd_clear(p4d)
 #define p4d_val(p4d)   pgd_val(p4d)
-- 
1.8.3.1




[PATCH v3 3/3] kdump: Relocate vmcoreinfo to the crash memory range

2017-03-19 Thread Xunlei Pang
Currently vmcoreinfo data is updated at boot time via
subsys_initcall(), so it runs the risk of being modified by buggy
code while the system is running.

As a result, the dumped vmcore may contain wrong vmcoreinfo. Later on,
when using "crash", "makedumpfile", etc. to parse this vmcore, we will
probably get a "Segmentation fault" or other unexpected errors.

E.g. 1) wrong code overwrites vmcoreinfo_data; 2) further crashes the
system; 3) trigger kdump, then we obviously will fail to recognize the
crash context correctly due to the corrupted vmcoreinfo.

Now, except for vmcoreinfo, all the crash data is well protected
(including the cpu note, which is fully updated in the crash path, so
its correctness is guaranteed). Given that vmcoreinfo data is a large
chunk, we had better protect it as well.

To solve this, we relocate and copy vmcoreinfo_data to the crash memory
when kdump is loaded via the kexec syscalls. Because the whole crash
memory is protected by the existing arch_kexec_protect_crashkres()
mechanism, we naturally protect vmcoreinfo_data from write (even read)
access under the kernel direct mapping after kdump is loaded.

Since kdump is usually loaded at the very early stage after boot, we can
trust the correctness of the vmcoreinfo data copied.

On the other hand, we still need to operate the vmcoreinfo safe copy when
crash happens to generate vmcoreinfo_note again, we rely on vmap() to map
out a new kernel virtual address and update to use this new one instead in
the following crash_save_vmcoreinfo().

BTW, we do not touch vmcoreinfo_note, because it will be fully updated
using the protected vmcoreinfo_data after crash which is surely correct
just like the cpu crash note.
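
A minimal sketch of how the safe copy is meant to be consumed at crash time
(illustrative only; the exact plumbing below is an assumption based on this
series, using the vmcoreinfo_data_copy field added to struct kimage):

void crash_save_vmcoreinfo(void)
{
        if (!vmcoreinfo_note)
                return;

        /*
         * If kdump loading set up a protected copy in the crash memory,
         * regenerate the note from it instead of the original buffer,
         * which may have been corrupted in the meantime.
         */
        if (kexec_crash_image && kexec_crash_image->vmcoreinfo_data_copy)
                vmcoreinfo_data = kexec_crash_image->vmcoreinfo_data_copy;

        vmcoreinfo_append_str("CRASHTIME=%ld\n", get_seconds());
        update_vmcoreinfo_note();
}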

Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 include/linux/kexec.h |  3 +++
 kernel/kexec.c|  3 +++
 kernel/kexec_core.c   | 52 +++
 kernel/kexec_file.c   |  3 +++
 4 files changed, 61 insertions(+)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 6918fda..fae2fc6 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -187,6 +187,8 @@ struct kimage {
unsigned long start;
struct page *control_code_page;
struct page *swap_page;
+   void *vmcoreinfo_data_copy; /* located in the crash memory */
+   size_t vmcoreinfo_size_copy;
 
unsigned long nr_segments;
struct kexec_segment segment[KEXEC_SEGMENT_MAX];
@@ -243,6 +245,7 @@ extern asmlinkage long sys_kexec_load(unsigned long entry,
 extern int kernel_kexec(void);
 extern struct page *kimage_alloc_control_pages(struct kimage *image,
unsigned int order);
+extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
 extern int kexec_load_purgatory(struct kimage *image, unsigned long min,
unsigned long max, int top_down,
unsigned long *load_addr);
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 980936a..e0c4dea 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -93,6 +93,9 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
pr_err("Could not allocate swap buffer\n");
goto out_free_control_pages;
}
+   } else {
+   if (kimage_crash_copy_vmcoreinfo(image) < 0)
+   goto out_free_image;
}
 
*rimage = image;
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index e503b48..7fad9f6 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -486,6 +486,45 @@ struct page *kimage_alloc_control_pages(struct kimage *image,
return pages;
 }
 
+int kimage_crash_copy_vmcoreinfo(struct kimage *image)
+{
+   struct page *vmcoreinfo_page;
+   void *safecopy;
+
+   WARN_ON(image->type != KEXEC_TYPE_CRASH);
+
+   if (!vmcoreinfo_size) {
+   pr_err("empty vmcoreinfo data\n");
+   return -ENOMEM;
+   }
+
+   /*
+* For kdump, allocate one vmcoreinfo safe copy from the
+* crash memory. As we have arch_kexec_protect_crashkres()
+* after kexec syscall, we naturally protect it from write
+* (even read) access under kernel direct mapping. But on
+* the other hand, we still need to operate it when crash
+* happens to generate vmcoreinfo note, hereby we rely on
+* vmap for this purpose.
+*/
+   vmcoreinfo_page = kimage_alloc_control_pages(image, 0);
+   if (!vmcoreinfo_page) {
+   pr_err("could not allocate vmcoreinfo buffer\n");
+   return -ENOMEM;
+   }
+   safecopy = vmap(&vmcoreinfo_page, 1, VM_MAP, PAGE_KERNEL);
+   if (!safecopy) {
+   pr_err("could not vmap vmcoreinfo buffer\n");
+   return -ENOMEM;
+   }
+
+   memcpy(safecopy, vmcoreinfo_data, vmcoreinfo_size);

[PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-19 Thread Xunlei Pang
As Eric said,
"what we need to do is move the variable vmcoreinfo_note out
of the kernel's .bss section.  And modify the code to regenerate
and keep this information in something like the control page.

Definitely something like this needs a page all to itself, and ideally
far away from any other kernel data structures.  I clearly was not
watching closely the data someone decided to keep this silly thing
in the kernel's .bss section."

This patch allocates extra pages for these vmcoreinfo_XXX variables. One
advantage is that it improves the safety of vmcoreinfo, because vmcoreinfo is
now kept far away from other kernel data structures.

Suggested-by: Eric Biederman <ebied...@xmission.com>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 arch/ia64/kernel/machine_kexec.c |  5 -
 arch/x86/kernel/crash.c  |  2 +-
 include/linux/kexec.h|  2 +-
 kernel/kexec_core.c  | 29 -
 kernel/ksysfs.c  |  2 +-
 5 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/arch/ia64/kernel/machine_kexec.c b/arch/ia64/kernel/machine_kexec.c
index 599507b..c14815d 100644
--- a/arch/ia64/kernel/machine_kexec.c
+++ b/arch/ia64/kernel/machine_kexec.c
@@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void)
 #endif
 }
 
-phys_addr_t paddr_vmcoreinfo_note(void)
-{
-   return ia64_tpa((unsigned long)(char *)&vmcoreinfo_note);
-}
-
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 3741461..4d35fbb 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -456,7 +456,7 @@ static int prepare_elf64_headers(struct crash_elf_data *ced,
bufp += sizeof(Elf64_Phdr);
phdr->p_type = PT_NOTE;
phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
-   phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
+   phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
(ehdr->e_phnum)++;
 
 #ifdef CONFIG_X86_64
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index e98e546..f1c601b 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -317,7 +317,7 @@ extern void *kexec_purgatory_get_symbol_addr(struct kimage *image,
 extern struct resource crashk_low_res;
 typedef u32 note_buf_t[KEXEC_NOTE_BYTES/4];
 extern note_buf_t __percpu *crash_notes;
-extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
+extern u32 *vmcoreinfo_note;
 extern size_t vmcoreinfo_size;
 extern size_t vmcoreinfo_max_size;
 
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index bfe62d5..e3a4bda 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -52,10 +52,10 @@
 note_buf_t __percpu *crash_notes;
 
 /* vmcoreinfo stuff */
-static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
-u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
+static unsigned char *vmcoreinfo_data;
 size_t vmcoreinfo_size;
-size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
+size_t vmcoreinfo_max_size = VMCOREINFO_BYTES;
+u32 *vmcoreinfo_note;
 
 /* Flag to indicate we are going to kexec a new kernel */
 bool kexec_in_progress = false;
@@ -1369,6 +1369,9 @@ static void update_vmcoreinfo_note(void)
 
 void crash_save_vmcoreinfo(void)
 {
+   if (!vmcoreinfo_note)
+   return;
+
vmcoreinfo_append_str("CRASHTIME=%ld\n", get_seconds());
update_vmcoreinfo_note();
 }
@@ -1397,13 +1400,29 @@ void vmcoreinfo_append_str(const char *fmt, ...)
 void __weak arch_crash_save_vmcoreinfo(void)
 {}
 
-phys_addr_t __weak paddr_vmcoreinfo_note(void)
+phys_addr_t paddr_vmcoreinfo_note(void)
 {
-   return __pa_symbol((unsigned long)(char *)&vmcoreinfo_note);
+   return __pa(vmcoreinfo_note);
 }
 
 static int __init crash_save_vmcoreinfo_init(void)
 {
+   /* One page should be enough for VMCOREINFO_BYTES under all archs */
+   vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);
+   if (!vmcoreinfo_data) {
+   pr_warn("Memory allocation for vmcoreinfo_data failed\n");
+   return -ENOMEM;
+   }
+
+   vmcoreinfo_note = alloc_pages_exact(VMCOREINFO_NOTE_SIZE,
+   GFP_KERNEL | __GFP_ZERO);
+   if (!vmcoreinfo_note) {
+   free_page((unsigned long)vmcoreinfo_data);
+   vmcoreinfo_data = NULL;
+   pr_warn("Memory allocation for vmcoreinfo_note failed\n");
+   return -ENOMEM;
+   }
+
VMCOREINFO_OSRELEASE(init_uts_ns.name.release);
VMCOREINFO_PAGESIZE(PAGE_SIZE);
 
diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index ee1bc1b..9de6fcc 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -130,7 +130,7 @@ static ssize_t vmcoreinfo_show(struct kobject *kobj,
 {
phys_addr_t vmcore_base = paddr_vmcoreinfo_note();
	return sprintf(buf, "%pa %x\n", &vmcore_base,
-  (unsigned int)sizeof(vmcoreinfo_note));
+  (unsigned int)VMCOREINFO_NOTE_SIZE);

[PATCH v3 2/3] powerpc/fadump: Use the correct VMCOREINFO_NOTE_SIZE for phdr

2017-03-19 Thread Xunlei Pang
vmcoreinfo_max_size stands for the size of vmcoreinfo_data, but the correct
thing to use here is vmcoreinfo_note, whose total size is
VMCOREINFO_NOTE_SIZE.

As explained in commit 77019967f06b ("kdump: fix exported size of vmcoreinfo
note"), it does not affect the actual functionality, but we had better fix
it; this change should also be safe and backward compatible.

After this, we can get rid of the variable vmcoreinfo_max_size and use the
macro VMCOREINFO_BYTES instead; fewer variables mean safer vmcoreinfo
handling.
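
For context, the note layout that VMCOREINFO_NOTE_SIZE describes is roughly
the following (an approximation of the kexec.h/crash_core.h definitions at the
time, shown only to make clear why sizing the phdr with the data-buffer size
under-reports the ELF note):

#define CRASH_CORE_NOTE_HEAD_BYTES	ALIGN(sizeof(struct elf_note), 4)
#define VMCOREINFO_BYTES		(4096)		/* raw "KEY=value\n" buffer */
#define VMCOREINFO_NOTE_NAME		"VMCOREINFO"
#define VMCOREINFO_NOTE_NAME_BYTES	ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
#define VMCOREINFO_NOTE_SIZE		((CRASH_CORE_NOTE_HEAD_BYTES * 2) +	\
					 VMCOREINFO_NOTE_NAME_BYTES +		\
					 VMCOREINFO_BYTES)

So the note is header(s) + aligned name + data; vmcoreinfo_max_size covers
only the data part.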

Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 arch/powerpc/kernel/fadump.c | 3 +--
 include/linux/kexec.h| 1 -
 kernel/kexec_core.c  | 3 +--
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 8ff0dd4..b8e15cf 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -906,8 +906,7 @@ static int fadump_create_elfcore_headers(char *bufp)
 
phdr->p_paddr   = fadump_relocate(paddr_vmcoreinfo_note());
phdr->p_offset  = phdr->p_paddr;
-   phdr->p_memsz   = vmcoreinfo_max_size;
-   phdr->p_filesz  = vmcoreinfo_max_size;
+   phdr->p_memsz   = phdr->p_filesz = VMCOREINFO_NOTE_SIZE;
 
/* Increment number of program headers. */
(elf->e_phnum)++;
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index f1c601b..6918fda 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -319,7 +319,6 @@ extern void *kexec_purgatory_get_symbol_addr(struct kimage *image,
 extern note_buf_t __percpu *crash_notes;
 extern u32 *vmcoreinfo_note;
 extern size_t vmcoreinfo_size;
-extern size_t vmcoreinfo_max_size;
 
 /* flag to track if kexec reboot is in progress */
 extern bool kexec_in_progress;
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index e3a4bda..e503b48 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -54,7 +54,6 @@
 /* vmcoreinfo stuff */
 static unsigned char *vmcoreinfo_data;
 size_t vmcoreinfo_size;
-size_t vmcoreinfo_max_size = VMCOREINFO_BYTES;
 u32 *vmcoreinfo_note;
 
 /* Flag to indicate we are going to kexec a new kernel */
@@ -1386,7 +1385,7 @@ void vmcoreinfo_append_str(const char *fmt, ...)
r = vscnprintf(buf, sizeof(buf), fmt, args);
va_end(args);
 
-   r = min(r, vmcoreinfo_max_size - vmcoreinfo_size);
+   r = min(r, VMCOREINFO_BYTES - vmcoreinfo_size);
 
	memcpy(&vmcoreinfo_data[vmcoreinfo_size], buf, r);
 
-- 
1.8.3.1




Re: [PATCH v2] kexec: Introduce vmcoreinfo signature verification

2017-03-19 Thread Xunlei Pang
On 03/20/2017 at 11:55 AM, Baoquan He wrote:
> On 03/20/17 at 10:39am, Xunlei Pang wrote:
>> On 03/20/2017 at 10:13 AM, Baoquan He wrote:
>>> On 03/17/17 at 12:22pm, Eric W. Biederman wrote:
>>>> Xunlei Pang <xlp...@redhat.com> writes:
>>>>
>>>>> Currently vmcoreinfo data is updated at boot time subsys_initcall(),
>>>>> it has the risk of being modified by some wrong code during system
>>>>> is running.
>>>>>
>>>>> As a result, vmcore dumped may contain the wrong vmcoreinfo. Later on,
>>>>> when using "crash" or "makedumpfile"(etc) utility to parse this vmcore,
>>>>> we probably will get "Segmentation fault" or other unexpected/confusing
>>>>> errors.
>>>> If this is a real concern and the previous discussion sounds like it is
>>>> part of what we need to do is move the variable vmcoreinfo_note out
>>>> of the kernel's .bss section.  And modify the code to regenerate
>>>> and keep this information in something like the control page.
>>> I guess this is not from a real issue, just from Xunlei's worry. But
>>> Xunlei didn't give a direct answer to this, and Petr's question. Not
>> It's easy to reproduce: write a kernel module to modify part content of
>> vmcoreinfo_data (we surely have many ways to acquire its VA). If it does
>> exist in theory, we will met it sooner or later in real world due to billions
>> of applications.
>>
>> Also there are bugs like this one
>> https://bugzilla.redhat.com/show_bug.cgi?id=1287097
>> Not sure if it is makedumpfile issue or this one, maybe we can't know 
>> forever.
> Well, kdump is not all-purpose. If you write code in module to stomp
> page init_level4_pgt is pointing at, you won't get a vmcore.

vmcoreinfo is a large data chunk prepared for kdump, not a normal-sized
variable; we had better protect it.

> And you are saying vmcoreinfo_data, it's a intermediate page, should be
> vmcoreinfo_note. If the wrong code you mentioned didn't change
> vmcoreinfo_note, but other kernel data which need be saved into
> vmcoreinfo_note, crash_save_vmcoreinfo_init is doing better than you
> re-saved one.

I am not going to touch vmcoreinfo_note; I am just trying to relocate
vmcoreinfo_data into the crash memory, then use it to update vmcoreinfo_note,
which is fully overwritten when a crash happens, just like the cpu crash
note.
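
Roughly what that regeneration looks like (a paraphrase of the existing
update_vmcoreinfo_note() in kernel/kexec_core.c, shown here only to illustrate
why a correct vmcoreinfo_data is all that is needed at crash time):

static void update_vmcoreinfo_note(void)
{
        u32 *buf = vmcoreinfo_note;

        if (!vmcoreinfo_size)
                return;
        /* The whole ELF note is rewritten from vmcoreinfo_data every time */
        buf = append_elf_note(buf, VMCOREINFO_NOTE_NAME, 0,
                              vmcoreinfo_data, vmcoreinfo_size);
        final_note(buf);
}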

Anyway, I will send v3 soon; let's discuss it further there, thanks!

Regards,
Xunlei

>
>> Regards,
>> Xunlei
>>
>>> very sure if this will impact other implementation. fadump will be
>>> impacted by this or other dump? Maybe yet or maybe not.
>>>
>>> I don't object this strongly, but please at least add code comment to
>>> explain why vmcoreinfo need be saved twice because it does look weird.
>>>
>>>> Definitely something like this needs a page all to itself, and ideally
>>>> far away from any other kernel data structures.  I clearly was not
>>>> watching closely the data someone decided to keep this silly thing
>>>> in the kernel's .bss section.
>>>>
>>>>> As vmcoreinfo is the most fundamental information for vmcore, we better
>>>>> double check its correctness. Here we generate a signature(using crc32)
>>>>> after it is saved, then verify it in crash_save_vmcoreinfo() to see if
>>>>> the signature was broken, if so we have to re-save the vmcoreinfo data
>>>>> to get the correct vmcoreinfo for kdump as possible as we can.
>>>> Sigh.  We already have a sha256 that is supposed to cover this sort of
>>>> thing.  The bug rather is that apparently it isn't covering this data.
>>>> That sounds like what we should be fixing.
>>>>
>>>> Please let's not invent new mechanisms we have to maintain.  Let's
>>>> reorganize this so this static data is protected like all other static
>>>> data in the kexec-on-panic path.  We have good mechanims and good
>>>> strategies for avoiding and detecting corruption we just need to use
>>>> them.
>>>>
>>>> Eric
>>>>
>>>>
>>>>
>>>>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>>>>> ---
>>>>> v1->v2:
>>>>> - Keep crash_save_vmcoreinfo_init() because "makedumpfile --mem-usage"
>>>>>   uses the information.
>>>>> - Add crc32 verification for vmcoreinfo, re-save when failure.
>>>>>
>>>>>  arch/Kconfig 

Re: [PATCH v2] kexec: Introduce vmcoreinfo signature verification

2017-03-19 Thread Xunlei Pang
On 03/20/2017 at 10:13 AM, Baoquan He wrote:
> On 03/17/17 at 12:22pm, Eric W. Biederman wrote:
>> Xunlei Pang <xlp...@redhat.com> writes:
>>
>>> Currently vmcoreinfo data is updated at boot time subsys_initcall(),
>>> it has the risk of being modified by some wrong code during system
>>> is running.
>>>
>>> As a result, vmcore dumped may contain the wrong vmcoreinfo. Later on,
>>> when using "crash" or "makedumpfile"(etc) utility to parse this vmcore,
>>> we probably will get "Segmentation fault" or other unexpected/confusing
>>> errors.
>> If this is a real concern and the previous discussion sounds like it is
>> part of what we need to do is move the variable vmcoreinfo_note out
>> of the kernel's .bss section.  And modify the code to regenerate
>> and keep this information in something like the control page.
> I guess this is not from a real issue, just from Xunlei's worry. But
> Xunlei didn't give a direct answer to this, and Petr's question. Not

It's easy to reproduce: write a kernel module that modifies part of the
content of vmcoreinfo_data (we surely have many ways to acquire its VA). If
it can happen in theory, we will meet it sooner or later in the real world,
given the billions of applications out there.

Also there are bugs like this one:
https://bugzilla.redhat.com/show_bug.cgi?id=1287097
Not sure whether it is a makedumpfile issue or this one; maybe we will never
know.

Regards,
Xunlei

> very sure if this will impact other implementation. fadump will be
> impacted by this or other dump? Maybe yet or maybe not.
>
> I don't object this strongly, but please at least add code comment to
> explain why vmcoreinfo need be saved twice because it does look weird.
>
>> Definitely something like this needs a page all to itself, and ideally
>> far away from any other kernel data structures.  I clearly was not
>> watching closely the data someone decided to keep this silly thing
>> in the kernel's .bss section.
>>
>>> As vmcoreinfo is the most fundamental information for vmcore, we better
>>> double check its correctness. Here we generate a signature(using crc32)
>>> after it is saved, then verify it in crash_save_vmcoreinfo() to see if
>>> the signature was broken, if so we have to re-save the vmcoreinfo data
>>> to get the correct vmcoreinfo for kdump as possible as we can.
>> Sigh.  We already have a sha256 that is supposed to cover this sort of
>> thing.  The bug rather is that apparently it isn't covering this data.
>> That sounds like what we should be fixing.
>>
>> Please let's not invent new mechanisms we have to maintain.  Let's
>> reorganize this so this static data is protected like all other static
>> data in the kexec-on-panic path.  We have good mechanims and good
>> strategies for avoiding and detecting corruption we just need to use
>> them.
>>
>> Eric
>>
>>
>>
>>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>>> ---
>>> v1->v2:
>>> - Keep crash_save_vmcoreinfo_init() because "makedumpfile --mem-usage"
>>>   uses the information.
>>> - Add crc32 verification for vmcoreinfo, re-save when failure.
>>>
>>>  arch/Kconfig|  1 +
>>>  kernel/kexec_core.c | 43 +++
>>>  2 files changed, 36 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/arch/Kconfig b/arch/Kconfig
>>> index c4d6833..66eb296 100644
>>> --- a/arch/Kconfig
>>> +++ b/arch/Kconfig
>>> @@ -4,6 +4,7 @@
>>>  
>>>  config KEXEC_CORE
>>> bool
>>> +   select CRC32
>>>  
>>>  config HAVE_IMA_KEXEC
>>> bool
>>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>>> index bfe62d5..012acbe 100644
>>> --- a/kernel/kexec_core.c
>>> +++ b/kernel/kexec_core.c
>>> @@ -38,6 +38,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  #include 
>>>  #include 
>>> @@ -53,9 +54,10 @@
>>>  
>>>  /* vmcoreinfo stuff */
>>>  static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
>>> -u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
>>> +static u32 vmcoreinfo_sig;
>>>  size_t vmcoreinfo_size;
>>>  size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
>>> +u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
>>>  
>>>  /* Flag to indicate we are going to kexec a new kernel */
>>>  bool kexec_in_progress = false;
>>> @@ -13

Re: [PATCH] x86_64, kexec: Avoid unnecessary identity mappings for kdump

2017-03-19 Thread Xunlei Pang
On 03/18/2017 at 01:38 AM, Eric W. Biederman wrote:
> Xunlei Pang <xlp...@redhat.com> writes:
>
>> kexec setups identity mappings for all the memory mapped in 1st kernel,
>> this is not necessary for the kdump case. Actually it can cause extra
>> memory consumption for paging structures, which is quite considerable
>> on modern machines with huge memory.
>>
>> E.g. On our 24TB machine, it will waste around 96MB (around 4MB/TB)
>> from the reserved memory range if setting all the identity mappings.
>>
>> It also causes some trouble for distributions that use an intelligent
>> policy to evaluate the proper "crashkernel=X" for users.
>>
>> To solve it, in case of kdump, we only setup identity mappings for the
>> crash memory and the ISA memory(may be needed by purgatory/kdump
>> boot).
> How about instead we detect the presence of 1GiB pages and use them
> if they are available.  We already use 2MiB pages.  If we can do that
> we will only need about 192K for page tables in the case you have
> described and this all becomes a non-issue.
>
> I strongly suspect that the presence of 24TiB of memory in an x86 system
> strongly correlates to the presence of 1GiB pages.
>
> In principle we certainly can use a less extensive mapping but that
> should not be something that differs between the two kexec cases.

Ok, will try gbpages for the identity mapping.
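
A minimal sketch of what that could look like in init_pgtable(), assuming
struct x86_mapping_info grows a direct_gbpages flag honoured by
ident_pud_init() (the names below are assumptions, not the final interface):

static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
{
        struct x86_mapping_info info = {
                .alloc_pgt_page = alloc_pgt_page,
                .context        = image,
                .page_flag      = __PAGE_KERNEL_LARGE_EXEC,
        };

        /* Use 1GiB identity mappings when the CPU supports them */
        if (boot_cpu_has(X86_FEATURE_GBPAGES))
                info.direct_gbpages = true;

        /* ... rest of the identity-mapping setup stays unchanged ... */
        return 0;
}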

Regards,
Xunlei

> I can see forcing the low 1MiB range in.  But calling it ISA range is
> very wrong and misleading.  The reasons that range are special during
> boot-up have nothing to do with ISA.  But have everything to do with
> where legacy page tables are mapped, and where we need identity pages to
> start other cpus.  I think the only user that actually cares is
> purgatory where it plays swapping games with the low 1MiB because we
> can't preload what we need to down there or it would mess up the running
> kernel.  So saying anything about the old ISA bus is wrong and
> misleading.  At the very very least we need accurate comments.
>
> Eric
>
>
>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>> ---
>>  arch/x86/kernel/machine_kexec_64.c | 34 ++
>>  1 file changed, 30 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kernel/machine_kexec_64.c 
>> b/arch/x86/kernel/machine_kexec_64.c
>> index 857cdbd..db77a76 100644
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -112,14 +112,40 @@ static int init_pgtable(struct kimage *image, unsigned 
>> long start_pgtable)
>>  
>>  level4p = (pgd_t *)__va(start_pgtable);
>>  clear_page(level4p);
>> -for (i = 0; i < nr_pfn_mapped; i++) {
>> -mstart = pfn_mapped[i].start << PAGE_SHIFT;
>> -mend   = pfn_mapped[i].end << PAGE_SHIFT;
>>  
>> +if (image->type == KEXEC_TYPE_CRASH) {
>> +/* Always map the ISA range */
>>  result = kernel_ident_mapping_init(,
>> - level4p, mstart, mend);
>> +level4p, 0, ISA_END_ADDRESS);
>>  if (result)
>>  return result;
>> +
>> +/* crashk_low_res may not be initialized when reaching here */
>> +if (crashk_low_res.end) {
>> +mstart = crashk_low_res.start;
>> +mend = crashk_low_res.end + 1;
>> +result = kernel_ident_mapping_init(,
>> +level4p, mstart, mend);
>> +if (result)
>> +return result;
>> +}
>> +
>> +mstart = crashk_res.start;
>> +mend = crashk_res.end + 1;
>> +result = kernel_ident_mapping_init(,
>> +level4p, mstart, mend);
>> +if (result)
>> +return result;
>> +} else {
>> +for (i = 0; i < nr_pfn_mapped; i++) {
>> +mstart = pfn_mapped[i].start << PAGE_SHIFT;
>> +mend   = pfn_mapped[i].end << PAGE_SHIFT;
>> +
>> +result = kernel_ident_mapping_init(,
>> + level4p, mstart, mend);
>> +if (result)
>> +return result;
>> +}
>>  }
>>  
>>  /*




Re: [PATCH] kexec: Update vmcoreinfo after crash happened

2017-03-19 Thread Xunlei Pang
On 03/19/2017 at 02:23 AM, Petr Tesarik wrote:
> On Thu, 16 Mar 2017 21:40:58 +0800
> Xunlei Pang <xp...@redhat.com> wrote:
>
>> On 03/16/2017 at 09:18 PM, Baoquan He wrote:
>>> On 03/16/17 at 08:36pm, Xunlei Pang wrote:
>>>> On 03/16/2017 at 08:27 PM, Baoquan He wrote:
>>>>> Hi Xunlei,
>>>>>
>>>>> Did you really see this ever happened? Because the vmcore size estimate
>>>>> feature, namely --mem-usage option of makedumpfile, depends on the
>>>>> vmcoreinfo in 1st kernel, your change will break it.
>>>> Hi Baoquan,
>>>>
>>>> I can reproduce it using a kernel module which modifies the vmcoreinfo,
>>>> so it's a problem can actually happen.
>>>>
>>>>> If not, it could be not good to change that.
>>>> That's a good point, then I guess we can keep the 
>>>> crash_save_vmcoreinfo_init(),
>>>> and store again all the vmcoreinfo after crash. What do you think?
>>> Well, then it will make makedumpfile segfault happen too when execute
>>> below command in 1st kernel if it existed:
>>> makedumpfile --mem-usage /proc/kcore
>> Yes, if the initial vmcoreinfo data was modified before "makedumpfile 
>> --mem-usage", it might happen,
>> after all the system is going something wrong. And that's why we deploy 
>> kdump service at the very
>> beginning when the system has a low possibility of going wrong.
>>
>> But we have to guarantee kdump vmcore can be generated correctly as possible 
>> as it can.
>>
>>> So we still need to face that problem and need fix it. vmcoreinfo_note
>>> is in kernel data area, how does module intrude into this area? And can
>>> we fix the module code?
>>>
>> Bugs always exist in products, we can't know what will happen and fix all 
>> the errors,
>> that's why we need kdump.
>>
>> I think the following update should guarantee the correct vmcoreinfo for 
>> kdump.
> I'm still not convinced. I would probably have more trust in a clean
> kernel (after boot) than a kernel that has already crashed (presumably
> because of a serious bug). How can be reliability improved by running
> more code in unsafe environment?

Correct, I realized that, so I used crc32 to protect the original data; but
since Eric suggested a more reasonable approach, I will try that later.

>
> If some code overwrites reserved areas (such as vmcoreinfo), then it's
> seriously buggy. And in my opinion, it is more difficult to identify
> such bugs if they are masked by re-initializing vmcoreinfo after crash.
> In fact, if makedumpfile in the kexec'ed kernel complains that it
> didn't find valid VMCOREINFO content, that's already a hint.
>
> As a side note, if you're debugging a vmcoreinfo corruption, it's
> possible to use a standalone VMCOREINFO file with makedumpfile, so you
> can pre-generate it and save it in the kdump initrd.
>
> In short, I don't see a compelling case for this change.

E.g. 1) wrong code overwrites vmcoreinfo_data; 2) the same bug further
crashes the system; 3) kdump is triggered, and we obviously fail to recognize
the crash context correctly due to the corrupted vmcoreinfo. Everyone will
get confused when hit by such an unfortunate customer-side issue.

Although it's a corner case, if it's easy to fix, then I think we had better
do it.

Now, except for vmcoreinfo, all the crash data is well protected (including
the cpu note, which is fully updated in the crash path, so its correctness is
guaranteed).

Regards,
Xunlei



Re: [PATCH v2] kexec: Introduce vmcoreinfo signature verification

2017-03-19 Thread Xunlei Pang
On 03/18/2017 at 01:22 AM, Eric W. Biederman wrote:
> Xunlei Pang <xlp...@redhat.com> writes:
>
>> Currently vmcoreinfo data is updated at boot time subsys_initcall(),
>> it has the risk of being modified by some wrong code during system
>> is running.
>>
>> As a result, vmcore dumped may contain the wrong vmcoreinfo. Later on,
>> when using "crash" or "makedumpfile"(etc) utility to parse this vmcore,
>> we probably will get "Segmentation fault" or other unexpected/confusing
>> errors.
> If this is a real concern and the previous discussion sounds like it is
> part of what we need to do is move the variable vmcoreinfo_note out
> of the kernel's .bss section.  And modify the code to regenerate
> and keep this information in something like the control page.
>
> Definitely something like this needs a page all to itself, and ideally
> far away from any other kernel data structures.  I clearly was not
> watching closely the data someone decided to keep this silly thing
> in the kernel's .bss section.
>
>> As vmcoreinfo is the most fundamental information for vmcore, we better
>> double check its correctness. Here we generate a signature(using crc32)
>> after it is saved, then verify it in crash_save_vmcoreinfo() to see if
>> the signature was broken, if so we have to re-save the vmcoreinfo data
>> to get the correct vmcoreinfo for kdump as possible as we can.
> Sigh.  We already have a sha256 that is supposed to cover this sort of
> thing.  The bug rather is that apparently it isn't covering this data.
> That sounds like what we should be fixing.
>
> Please let's not invent new mechanisms we have to maintain.  Let's
> reorganize this so this static data is protected like all other static
> data in the kexec-on-panic path.  We have good mechanims and good
> strategies for avoiding and detecting corruption we just need to use
> them.
>
> Eric
>

Yes, this idea looks way better, I will follow your suggestions, thanks!

Regards,
Xunlei

>
>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>> ---
>> v1->v2:
>> - Keep crash_save_vmcoreinfo_init() because "makedumpfile --mem-usage"
>>   uses the information.
>> - Add crc32 verification for vmcoreinfo, re-save when failure.
>>
>>  arch/Kconfig|  1 +
>>  kernel/kexec_core.c | 43 +++
>>  2 files changed, 36 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index c4d6833..66eb296 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -4,6 +4,7 @@
>>  
>>  config KEXEC_CORE
>>  bool
>> +select CRC32
>>  
>>  config HAVE_IMA_KEXEC
>>  bool
>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>> index bfe62d5..012acbe 100644
>> --- a/kernel/kexec_core.c
>> +++ b/kernel/kexec_core.c
>> @@ -38,6 +38,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -53,9 +54,10 @@
>>  
>>  /* vmcoreinfo stuff */
>>  static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
>> -u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
>> +static u32 vmcoreinfo_sig;
>>  size_t vmcoreinfo_size;
>>  size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
>> +u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
>>  
>>  /* Flag to indicate we are going to kexec a new kernel */
>>  bool kexec_in_progress = false;
>> @@ -1367,12 +1369,6 @@ static void update_vmcoreinfo_note(void)
>>  final_note(buf);
>>  }
>>  
>> -void crash_save_vmcoreinfo(void)
>> -{
>> -vmcoreinfo_append_str("CRASHTIME=%ld\n", get_seconds());
>> -update_vmcoreinfo_note();
>> -}
>> -
>>  void vmcoreinfo_append_str(const char *fmt, ...)
>>  {
>>  va_list args;
>> @@ -1402,7 +1398,7 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
>>  return __pa_symbol((unsigned long)(char *)_note);
>>  }
>>  
>> -static int __init crash_save_vmcoreinfo_init(void)
>> +static void do_crash_save_vmcoreinfo_init(void)
>>  {
>>  VMCOREINFO_OSRELEASE(init_uts_ns.name.release);
>>  VMCOREINFO_PAGESIZE(PAGE_SIZE);
>> @@ -1474,6 +1470,37 @@ static int __init crash_save_vmcoreinfo_init(void)
>>  #endif
>>  
>>  arch_crash_save_vmcoreinfo();
>> +}
>> +
>> +static u32 crash_calc_vmcoreinfo_sig(void)
>> +{
>> +return crc32(~0, vmcoreinfo_data, vmcoreinfo_size);
>> +}
>> +
>> +static bool crash_ver

[PATCH] x86_64, kexec: Avoid unnecessary identity mappings for kdump

2017-03-17 Thread Xunlei Pang
kexec sets up identity mappings for all the memory mapped in the 1st kernel,
which is not necessary for the kdump case. Actually it can cause extra memory
consumption for paging structures, which is quite considerable on modern
machines with huge memory.

E.g. on our 24TB machine, it wastes around 96MB (around 4MB/TB) of the
reserved memory range if all the identity mappings are set up.

It also causes some trouble for distributions that use an intelligent policy
to evaluate the proper "crashkernel=X" for users.

To solve this, in the kdump case we only set up identity mappings for the
crash memory and the ISA memory (which may be needed by the purgatory/kdump
boot).

Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 arch/x86/kernel/machine_kexec_64.c | 34 ++
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 857cdbd..db77a76 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -112,14 +112,40 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
 
level4p = (pgd_t *)__va(start_pgtable);
clear_page(level4p);
-   for (i = 0; i < nr_pfn_mapped; i++) {
-   mstart = pfn_mapped[i].start << PAGE_SHIFT;
-   mend   = pfn_mapped[i].end << PAGE_SHIFT;
 
+   if (image->type == KEXEC_TYPE_CRASH) {
+   /* Always map the ISA range */
	result = kernel_ident_mapping_init(&info,
-level4p, mstart, mend);
+   level4p, 0, ISA_END_ADDRESS);
if (result)
return result;
+
+   /* crashk_low_res may not be initialized when reaching here */
+   if (crashk_low_res.end) {
+   mstart = crashk_low_res.start;
+   mend = crashk_low_res.end + 1;
+   result = kernel_ident_mapping_init(&info,
+   level4p, mstart, mend);
+   if (result)
+   return result;
+   }
+
+   mstart = crashk_res.start;
+   mend = crashk_res.end + 1;
+   result = kernel_ident_mapping_init(&info,
+   level4p, mstart, mend);
+   if (result)
+   return result;
+   } else {
+   for (i = 0; i < nr_pfn_mapped; i++) {
+   mstart = pfn_mapped[i].start << PAGE_SHIFT;
+   mend   = pfn_mapped[i].end << PAGE_SHIFT;
+
+   result = kernel_ident_mapping_init(&info,
+level4p, mstart, mend);
+   if (result)
+   return result;
+   }
}
 
/*
-- 
1.8.3.1




[PATCH v2] kexec: Introduce vmcoreinfo signature verification

2017-03-16 Thread Xunlei Pang
Currently vmcoreinfo data is updated at boot time via subsys_initcall(), so
it runs the risk of being modified by some wrong code while the system is
running.

As a result, the dumped vmcore may contain wrong vmcoreinfo. Later on, when
using a utility such as "crash" or "makedumpfile" to parse this vmcore, we
will probably get a "Segmentation fault" or other unexpected/confusing
errors.

As vmcoreinfo is the most fundamental information for vmcore, we had better
double-check its correctness. Here we generate a signature (using crc32)
after it is saved, then verify it in crash_save_vmcoreinfo() to see whether
the signature was broken; if so, we re-save the vmcoreinfo data to get the
correct vmcoreinfo for kdump as best we can.

Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
v1->v2:
- Keep crash_save_vmcoreinfo_init() because "makedumpfile --mem-usage"
  uses the information.
- Add crc32 verification for vmcoreinfo, re-save when failure.

 arch/Kconfig|  1 +
 kernel/kexec_core.c | 43 +++
 2 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index c4d6833..66eb296 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -4,6 +4,7 @@
 
 config KEXEC_CORE
bool
+   select CRC32
 
 config HAVE_IMA_KEXEC
bool
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index bfe62d5..012acbe 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include <linux/crc32.h>
 
 #include 
 #include 
@@ -53,9 +54,10 @@
 
 /* vmcoreinfo stuff */
 static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
-u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
+static u32 vmcoreinfo_sig;
 size_t vmcoreinfo_size;
 size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
+u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
 
 /* Flag to indicate we are going to kexec a new kernel */
 bool kexec_in_progress = false;
@@ -1367,12 +1369,6 @@ static void update_vmcoreinfo_note(void)
final_note(buf);
 }
 
-void crash_save_vmcoreinfo(void)
-{
-   vmcoreinfo_append_str("CRASHTIME=%ld\n", get_seconds());
-   update_vmcoreinfo_note();
-}
-
 void vmcoreinfo_append_str(const char *fmt, ...)
 {
va_list args;
@@ -1402,7 +1398,7 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
	return __pa_symbol((unsigned long)(char *)&vmcoreinfo_note);
 }
 
-static int __init crash_save_vmcoreinfo_init(void)
+static void do_crash_save_vmcoreinfo_init(void)
 {
VMCOREINFO_OSRELEASE(init_uts_ns.name.release);
VMCOREINFO_PAGESIZE(PAGE_SIZE);
@@ -1474,6 +1470,37 @@ static int __init crash_save_vmcoreinfo_init(void)
 #endif
 
arch_crash_save_vmcoreinfo();
+}
+
+static u32 crash_calc_vmcoreinfo_sig(void)
+{
+   return crc32(~0, vmcoreinfo_data, vmcoreinfo_size);
+}
+
+static bool crash_verify_vmcoreinfo(void)
+{
+   if (crash_calc_vmcoreinfo_sig() == vmcoreinfo_sig)
+   return true;
+
+   return false;
+}
+
+void crash_save_vmcoreinfo(void)
+{
+   /* Re-save if verification fails */
+   if (!crash_verify_vmcoreinfo()) {
+   vmcoreinfo_size = 0;
+   do_crash_save_vmcoreinfo_init();
+   }
+
+   vmcoreinfo_append_str("CRASHTIME=%ld\n", get_seconds());
+   update_vmcoreinfo_note();
+}
+
+static int __init crash_save_vmcoreinfo_init(void)
+{
+   do_crash_save_vmcoreinfo_init();
+   vmcoreinfo_sig = crash_calc_vmcoreinfo_sig();
update_vmcoreinfo_note();
 
return 0;
-- 
1.8.3.1




Re: [PATCH] kexec: Update vmcoreinfo after crash happened

2017-03-16 Thread Xunlei Pang
On 03/16/2017 at 08:27 PM, Baoquan He wrote:
> Hi Xunlei,
>
> Did you really see this ever happened? Because the vmcore size estimate
> feature, namely --mem-usage option of makedumpfile, depends on the
> vmcoreinfo in 1st kernel, your change will break it.

Hi Baoquan,

I can reproduce it using a kernel module which modifies the vmcoreinfo, so
it's a problem that can actually happen.

> If not, it could be not good to change that.

That's a good point; then I guess we can keep crash_save_vmcoreinfo_init(),
and store all the vmcoreinfo again after the crash. What do you think?

Regards,
Xunlei

>
> Baoquan
>
> On 03/16/17 at 08:16pm, Xunlei Pang wrote:
>> Currently vmcoreinfo data is updated at boot time subsys_initcall(),
>> it has the risk of being modified by some wrong code during system
>> is running.
>>
>> As a result, vmcore dumped will contain the wrong vmcoreinfo. Later on,
>> when using "crash" utility to parse this vmcore, we probably will get
>> "Segmentation fault".
>>
>> Based on the fact that the value of each vmcoreinfo stays invariable
>> once kernel boots up, we safely move all the vmcoreinfo operations into
>> crash_save_vmcoreinfo() which is called after crash happened. In this
>> way, vmcoreinfo data correctness is always guaranteed.
>>
>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>> ---
>>  kernel/kexec_core.c | 14 +++---
>>  1 file changed, 3 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>> index bfe62d5..1bfdd96 100644
>> --- a/kernel/kexec_core.c
>> +++ b/kernel/kexec_core.c
>> @@ -1367,12 +1367,6 @@ static void update_vmcoreinfo_note(void)
>>  final_note(buf);
>>  }
>>  
>> -void crash_save_vmcoreinfo(void)
>> -{
>> -vmcoreinfo_append_str("CRASHTIME=%ld\n", get_seconds());
>> -update_vmcoreinfo_note();
>> -}
>> -
>>  void vmcoreinfo_append_str(const char *fmt, ...)
>>  {
>>  va_list args;
>> @@ -1402,7 +1396,7 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
>>  return __pa_symbol((unsigned long)(char *)_note);
>>  }
>>  
>> -static int __init crash_save_vmcoreinfo_init(void)
>> +void crash_save_vmcoreinfo(void)
>>  {
>>  VMCOREINFO_OSRELEASE(init_uts_ns.name.release);
>>  VMCOREINFO_PAGESIZE(PAGE_SIZE);
>> @@ -1474,13 +1468,11 @@ static int __init crash_save_vmcoreinfo_init(void)
>>  #endif
>>  
>>  arch_crash_save_vmcoreinfo();
>> -update_vmcoreinfo_note();
>> +vmcoreinfo_append_str("CRASHTIME=%ld\n", get_seconds());
>>  
>> -return 0;
>> +update_vmcoreinfo_note();
>>  }
>>  
>> -subsys_initcall(crash_save_vmcoreinfo_init);
>> -
>>  /*
>>   * Move into place and start executing a preloaded standalone
>>   * executable.  If nothing was preloaded return an error.
>> -- 
>> 1.8.3.1
>>




[PATCH] kexec: Update vmcoreinfo after crash happened

2017-03-16 Thread Xunlei Pang
Currently vmcoreinfo data is updated at boot time via subsys_initcall(), so
it runs the risk of being modified by some wrong code while the system is
running.

As a result, the dumped vmcore will contain the wrong vmcoreinfo. Later on,
when using the "crash" utility to parse this vmcore, we will probably get a
"Segmentation fault".

Based on the fact that the value of each vmcoreinfo item stays invariant once
the kernel boots up, we can safely move all the vmcoreinfo operations into
crash_save_vmcoreinfo(), which is called after a crash happens. In this way,
vmcoreinfo data correctness is always guaranteed.

Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 kernel/kexec_core.c | 14 +++---
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index bfe62d5..1bfdd96 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1367,12 +1367,6 @@ static void update_vmcoreinfo_note(void)
final_note(buf);
 }
 
-void crash_save_vmcoreinfo(void)
-{
-   vmcoreinfo_append_str("CRASHTIME=%ld\n", get_seconds());
-   update_vmcoreinfo_note();
-}
-
 void vmcoreinfo_append_str(const char *fmt, ...)
 {
va_list args;
@@ -1402,7 +1396,7 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
	return __pa_symbol((unsigned long)(char *)&vmcoreinfo_note);
 }
 
-static int __init crash_save_vmcoreinfo_init(void)
+void crash_save_vmcoreinfo(void)
 {
VMCOREINFO_OSRELEASE(init_uts_ns.name.release);
VMCOREINFO_PAGESIZE(PAGE_SIZE);
@@ -1474,13 +1468,11 @@ static int __init crash_save_vmcoreinfo_init(void)
 #endif
 
arch_crash_save_vmcoreinfo();
-   update_vmcoreinfo_note();
+   vmcoreinfo_append_str("CRASHTIME=%ld\n", get_seconds());
 
-   return 0;
+   update_vmcoreinfo_note();
 }
 
-subsys_initcall(crash_save_vmcoreinfo_init);
-
 /*
  * Move into place and start executing a preloaded standalone
  * executable.  If nothing was preloaded return an error.
-- 
1.8.3.1




[PATCH v4] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-23 Thread Xunlei Pang
We met an issue with kdump: after the kdump kernel boots up, if a broadcast
MCE arrives in the first kernel, the other cpus still remaining in the first
kernel will enter the old MCE handler of the first kernel, then time out and
panic due to MCE synchronization, and finally reset the cpus running the
kdump kernel.

This patch lets cpus stay quiet after nmi_shootdown_cpus(): after kdump
boots, cpus remaining in the 1st kernel should not do anything except clear
MCG_STATUS. This is useful for kdump, letting vmcore dumping proceed as far
as it can.
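
For context, the surrounding do_machine_check() code that the mce.c hunk
below feeds into looks roughly like this (a paraphrase of the existing
handler; the patch only extends the bail-out condition):

	if (cpu_is_offline(cpu) ||
	    (crashing_cpu != -1 && crashing_cpu != cpu)) {
		u64 mcgstatus;

		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
		if (mcgstatus & MCG_STATUS_RIPV) {
			/* Clear the status and quietly return, skipping rendezvous */
			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
			return;
		}
	}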

Previous efforts:
https://patchwork.kernel.org/patch/6167631/
https://lists.gt.net/linux/kernel/2146557

Cc: Naoya Horiguchi <n-horigu...@ah.jp.nec.com>
Suggested-by: Borislav Petkov <b...@alien8.de>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
v1->v2:
- Using crashing_cpu according to Borislav's suggestion.

v2->v3:
- Used crashing_cpu in mce.c explicitly, not skip crashing_cpu.
- Added some comments.

v3->v4:
- Added more code comments according to Tony's feedback.

 arch/x86/include/asm/reboot.h|  1 +
 arch/x86/kernel/cpu/mcheck/mce.c | 17 +++--
 arch/x86/kernel/reboot.c |  5 +++--
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index 2cb1cc2..fc62ba8 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -15,6 +15,7 @@ struct machine_ops {
 };
 
 extern struct machine_ops machine_ops;
+extern int crashing_cpu;
 
 void native_machine_crash_shutdown(struct pt_regs *regs);
 void native_machine_shutdown(void);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 8e9725c..b65505f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include <asm/reboot.h>
 
 #include "mce-internal.h"
 
@@ -1127,9 +1128,21 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 * on Intel.
 */
int lmce = 1;
+   int cpu = smp_processor_id();
 
-   /* If this CPU is offline, just bail out. */
-   if (cpu_is_offline(smp_processor_id())) {
+   /*
+* Cases to bail out to avoid rendezvous process timeout:
+* 1)If this CPU is offline.
+* 2)If crashing_cpu was set, e.g. entering kdump,
+*   we need to skip cpus remaining in 1st kernel.
+*   Note: there is a small window between kexecing
+*   and kdump kernel establishing new mce handler,
+*   if some MCE comes within the window, there is
+*   no valid mce handler due to pgtable changing,
+*   let's just face the fate.
+*/
+   if (cpu_is_offline(cpu) ||
+   (crashing_cpu != -1 && crashing_cpu != cpu)) {
u64 mcgstatus;
 
mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index e244c19..92ecf4b 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -749,10 +749,11 @@ void machine_crash_shutdown(struct pt_regs *regs)
 #endif
 
 
+/* This keeps a track of which one is crashing cpu. */
+int crashing_cpu = -1;
+
 #if defined(CONFIG_SMP)
 
-/* This keeps a track of which one is crashing cpu. */
-static int crashing_cpu;
 static nmi_shootdown_cb shootdown_callback;
 
 static atomic_t waiting_for_crash_ipi;
-- 
1.8.3.1




Re: [PATCH v3] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-22 Thread Xunlei Pang
On 02/23/2017 at 02:50 AM, Luck, Tony wrote:
> On Wed, Feb 22, 2017 at 12:11:14PM +0800, Xunlei Pang wrote:
>> +/*
>> + * Cases to bail out to avoid rendezvous process timeout:
>> + * 1)If this CPU is offline.
>> + * 2)If crashing_cpu was set, e.g. entering kdump,
>> + *   we need to skip cpus remaining in 1st kernel.
>> + */
>> +if (cpu_is_offline(cpu) ||
>> +(crashing_cpu != -1 && crashing_cpu != cpu)) {
>>  u64 mcgstatus;
>>  
>>  mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
>
> I think we should document the remaining race conditions. I don't
> think there is any good way to eliminate them, and they are already
> pretty small windows.
>
> I think the sequence of events looks like:
>
>  1Panic occurs
>  2nmi_shootdown_cpus() sets crashing_cpu
>  3send NMI to everyone else
>  4wait up to a second for other CPUs to take NMI
>  5go to kexec code
>  6start new kernel
>  7new kernel establishes #MC handler
>
> If one of the other cpus triggers a machine check while
> getting to, or in, the NMI handler ... then that cpu will
> skip processing (if RIPV is set).
>
> Between '2' and '5' if crashing_cpu gets a machine check it
> will execute in the old kernel handler, and do the right thing.
>
> There's a fuzzy area between '6' and '7' where a machine check
> might not end up in the right code.
>
> From '7' onwards the kexec kernel will handle and machine
> checks caused by kdump.
>

Agree, will update the comment.

Regards,
Xunlei



Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-21 Thread Xunlei Pang
On 02/22/2017 at 02:20 AM, Luck, Tony wrote:
>> It's from my understanding, I didn't get the explicit description from the 
>> intel SDM on this point.
>> If a broadcast SRAO comes on real hardware, will MSR_IA32_MCG_STATUS of each 
>> cpu have MCG_STATUS_RIPV bit set?
> MCG_STATUS is a per-thread MSR and will contain the status appropriate for 
> that thread when #MC is delivered.
> So the RIPV bit will be set if, and only if, the thread saved a valid return 
> address for this exception. The net result
> is that it is almost always set for "innocent bystander" CPUs that were 
> dragged into the exception handler because
> of a broadcast #MC. We make the test because if it isn't set, then the 
> do_machine_check() had better not return
> because we have no idea where it will return to - since there is not a valid 
> return IP.
>

Got it, thanks for the details.

Regards,
Xunlei



[PATCH v3] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-21 Thread Xunlei Pang
We met an issue with kdump: after the kdump kernel boots up, if a broadcast
MCE arrives in the first kernel, the other cpus still remaining in the first
kernel will enter the old MCE handler of the first kernel, then time out and
panic due to MCE synchronization, and finally reset the cpus running the
kdump kernel.

This patch lets cpus stay quiet after nmi_shootdown_cpus(): after kdump
boots, cpus remaining in the 1st kernel should not do anything except clear
MCG_STATUS. This is useful for kdump, letting vmcore dumping proceed as far
as it can.

Previous efforts:
https://patchwork.kernel.org/patch/6167631/
https://lists.gt.net/linux/kernel/2146557

Cc: Naoya Horiguchi <n-horigu...@ah.jp.nec.com>
Suggested-by: Borislav Petkov <b...@alien8.de>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
v1->v2:
Using crashing_cpu according to Borislav's suggestion.

v2->v3:
- Used crashing_cpu in mce.c explicitly, not skip crashing_cpu.
- Added some comments.

 arch/x86/include/asm/reboot.h|  1 +
 arch/x86/kernel/cpu/mcheck/mce.c | 12 ++--
 arch/x86/kernel/reboot.c |  5 +++--
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index 2cb1cc2..fc62ba8 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -15,6 +15,7 @@ struct machine_ops {
 };
 
 extern struct machine_ops machine_ops;
+extern int crashing_cpu;
 
 void native_machine_crash_shutdown(struct pt_regs *regs);
 void native_machine_shutdown(void);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 8e9725c..1493222 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include <asm/reboot.h>
 
 #include "mce-internal.h"
 
@@ -1127,9 +1128,16 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 * on Intel.
 */
int lmce = 1;
+   int cpu = smp_processor_id();
 
-   /* If this CPU is offline, just bail out. */
-   if (cpu_is_offline(smp_processor_id())) {
+   /*
+* Cases to bail out to avoid rendezvous process timeout:
+* 1)If this CPU is offline.
+* 2)If crashing_cpu was set, e.g. entering kdump,
+*   we need to skip cpus remaining in 1st kernel.
+*/
+   if (cpu_is_offline(cpu) ||
+   (crashing_cpu != -1 && crashing_cpu != cpu)) {
u64 mcgstatus;
 
mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index e244c19..92ecf4b 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -749,10 +749,11 @@ void machine_crash_shutdown(struct pt_regs *regs)
 #endif
 
 
+/* This keeps a track of which one is crashing cpu. */
+int crashing_cpu = -1;
+
 #if defined(CONFIG_SMP)
 
-/* This keeps a track of which one is crashing cpu. */
-static int crashing_cpu;
 static nmi_shootdown_cb shootdown_callback;
 
 static atomic_t waiting_for_crash_ipi;
-- 
1.8.3.1




Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-21 Thread Xunlei Pang
On 02/20/2017 at 07:09 PM, Borislav Petkov wrote:
> On Mon, Feb 20, 2017 at 02:10:37PM +0800, Xunlei Pang wrote:
>> @@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long 
>> error_code)
>>   */
>>  int lmce = 1;
>>  
>> -/* If this CPU is offline, just bail out. */
>> -if (cpu_is_offline(smp_processor_id())) {
>> +/* If nmi shootdown happened or this CPU is offline, just bail out. */
>> +if (cpus_shotdown() ||
> I don't like "cpus_shotdown" - it doesn't hint at all that this is
> special-handling crash/kdump.
>
> And more importantly, I want it to be obvious that we do let the
> crashing CPU into the MCE handler.

Hi Boris,

I made some improvements; what do you think of the following one?
If you think it is fine, I can send out v3. Thanks for your time!

---
 arch/x86/include/asm/reboot.h|  1 +
 arch/x86/kernel/cpu/mcheck/mce.c | 11 +--
 arch/x86/kernel/reboot.c |  5 +++--
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index 2cb1cc2..fc62ba8 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -15,6 +15,7 @@ struct machine_ops {
 };
 
 extern struct machine_ops machine_ops;
+extern int crashing_cpu;
 
 void native_machine_crash_shutdown(struct pt_regs *regs);
 void native_machine_shutdown(void);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 8e9725c..7f53145 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include <asm/reboot.h>
 
 #include "mce-internal.h"
 
@@ -1128,8 +1129,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)
  */
 int lmce = 1;
 
-/* If this CPU is offline, just bail out. */
-if (cpu_is_offline(smp_processor_id())) {
+/*
+ * Cases to bail out to avoid rendezvous process timeout:
+ * 1)If crashing_cpu was set, e.g. entering kdump,
+ *   we need to skip cpus remaining in 1st kernel.
+ * 2)If this CPU is offline.
+ */
+if (crashing_cpu != -1 ||
+cpu_is_offline(smp_processor_id())) {
 u64 mcgstatus;
 
 mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index e244c19..92ecf4b 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -749,10 +749,11 @@ void machine_crash_shutdown(struct pt_regs *regs)
 #endif
 
 
+/* This keeps a track of which one is crashing cpu. */
+int crashing_cpu = -1;
+
 #if defined(CONFIG_SMP)
 
-/* This keeps a track of which one is crashing cpu. */
-static int crashing_cpu;
 static nmi_shootdown_cb shootdown_callback;
 
 static atomic_t waiting_for_crash_ipi;
-- 
1.8.3.1



Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-20 Thread Xunlei Pang
On 02/21/2017 at 04:26 AM, Borislav Petkov wrote:
> On Mon, Feb 20, 2017 at 09:29:24PM +0800, Xunlei Pang wrote:
>> There is a small window between crash and kdump kernel boot, so
>> if a SRAO comes within this window it will also cause the mce
>> synchronization problem on the crashing cpu if we don't bail out the
>> crashing cpu.
> You mean, in the window between, kdump kernel starts writing out memory
> and the second, kexec-ed kernel?

No, not when the kdump kernel starts dumping; rather during
nmi_shootdown_cpus(): if an MCE arrives after crashing_cpu is set and we
don't skip the crashing cpu, then the crashing cpu will enter the MCE handler
and trigger the synchronization issue.
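
For reference, the shootdown sequence that opens this window looks roughly
like the following (a loose paraphrase of nmi_shootdown_cpus() in
arch/x86/kernel/reboot.c; error handling and details elided):

void nmi_shootdown_cpus(nmi_shootdown_cb callback)
{
	unsigned long msecs;

	local_irq_disable();
	crashing_cpu = safe_smp_processor_id();	/* the window opens here */

	shootdown_callback = callback;
	atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);

	/* NMI the other cpus; a broadcast MCE from now until the kdump
	 * kernel installs its own handler hits cpus parked in the old one */
	apic->send_IPI_allbutself(NMI_VECTOR);

	msecs = 1000;	/* wait up to one second for them to stop */
	while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
		mdelay(1);
		msecs--;
	}
}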

>
> If so, please add that information to the place in do_machine_check()
> where we check crashing_cpu so that we know why we're doing this
> temporary ignore of #MC.

Ok, will add, thanks for the feedback.

Regards,
Xunlei



Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-20 Thread Xunlei Pang
On 02/20/2017 at 09:29 PM, Xunlei Pang wrote:
> On 02/20/2017 at 07:09 PM, Borislav Petkov wrote:
>> On Mon, Feb 20, 2017 at 02:10:37PM +0800, Xunlei Pang wrote:
>>> @@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long 
>>> error_code)
>>>  */
>>> int lmce = 1;
>>>  
>>> -   /* If this CPU is offline, just bail out. */
>>> -   if (cpu_is_offline(smp_processor_id())) {
>>> +   /* If nmi shootdown happened or this CPU is offline, just bail out. */
>>> +   if (cpus_shotdown() ||
>> I don't like "cpus_shotdown" - it doesn't hint at all that this is
>> special-handling crash/kdump.
>>
>> And more importantly, I want it to be obvious that we do let the
>> crashing CPU into the MCE handler.
> Ok, I will export crashing_cpu and use it directly in mce handler.

Forgot to mention: one reason I introduced cpus_shotdown() is that
"crashing_cpu" is defined only with CONFIG_SMP=y, so we have to export it
unconditionally if we don't want to add conditional code (i.e. quoted with
#ifdef CONFIG_SMP) in mce.c.

Regards,
Xunlei

>
>> Why?
>>
>> If we didn't, you will not handle *any* MCE, even a fatal one, during
>> dumping memory so if that dump is corrupted from the MCE, you won't
>> know. And I don't want to be the one staring at the corrupted dump and
>> wondering why I'm seeing what I'm seeing.
>>
>> IOW, if we get a fatal MCE during dumping then we should go and die.
>> This is much better than silently corrupting the dump and not even
>> saying anything about it.
>>
> My thought is that it doesn't matter after kdump boots as new mce handler
> will be installed. If we get a fatal MCE during kdumping, the new handler will
> handle the cpus running kdump kernel correctly.
>
> There is a small window between crash and kdump kernel boot, so if a SRAO comes
> within this window it will also cause the mce synchronization problem on the
> crashing cpu if we don't bail out the crashing cpu.
>
> Regards,
> Xunlei




Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-20 Thread Xunlei Pang
On 02/20/2017 at 07:09 PM, Borislav Petkov wrote:
> On Mon, Feb 20, 2017 at 02:10:37PM +0800, Xunlei Pang wrote:
>> @@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>>   */
>>  int lmce = 1;
>>  
>> -/* If this CPU is offline, just bail out. */
>> -if (cpu_is_offline(smp_processor_id())) {
>> +/* If nmi shootdown happened or this CPU is offline, just bail out. */
>> +if (cpus_shotdown() ||
> I don't like "cpus_shotdown" - it doesn't hint at all that this is
> special-handling crash/kdump.
>
> And more importantly, I want it to be obvious that we do let the
> crashing CPU into the MCE handler.

Ok, I will export crashing_cpu and use it directly in mce handler.

>
> Why?
>
> If we didn't, you will not handle *any* MCE, even a fatal one, during
> dumping memory so if that dump is corrupted from the MCE, you won't
> know. And I don't want to be the one staring at the corrupted dump and
> wondering why I'm seeing what I'm seeing.
>
> IOW, if we get a fatal MCE during dumping then we should go and die.
> This is much better than silently corrupting the dump and not even
> saying anything about it.
>

My thought is that it doesn't matter after the kdump kernel boots, as a new mce handler
will be installed. If we get a fatal MCE while kdump is dumping, the new handler will
handle the cpus running the kdump kernel correctly.

There is a small window between the crash and the kdump kernel boot, so if an SRAO comes
within this window it will also cause the mce synchronization problem on the crashing
cpu if we don't bail out on the crashing cpu.

Regards,
Xunlei



[PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

2017-02-19 Thread Xunlei Pang
We met an issue with kdump: after the kdump kernel boots up
and a broadcast MCE arrives in the first kernel, the other
cpus remaining in the first kernel enter the old MCE handler
of the first kernel, then time out and panic due to MCE
synchronization, and finally reset the cpus running kdump.

This patch lets cpus stay quiet after nmi_shootdown_cpus(),
so before the crashing cpu shoots them down, or after kdump
boots, they should not do anything except clear MCG_STATUS
in case of a broadcast MCE. This helps kdump push the vmcore
dumping as far as it can.

Previous efforts:
https://patchwork.kernel.org/patch/6167631/
https://lists.gt.net/linux/kernel/2146557

Cc: Naoya Horiguchi <n-horigu...@ah.jp.nec.com>
Suggested-by: Borislav Petkov <b...@alien8.de>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
v1->v2:
Using crashing_cpu according to Borislav's suggestion.

 arch/x86/include/asm/reboot.h|  1 +
 arch/x86/kernel/cpu/mcheck/mce.c |  6 --
 arch/x86/kernel/reboot.c | 16 +++-
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index 2cb1cc2..ec8657b6 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -26,5 +26,6 @@ struct machine_ops {
 typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
 void nmi_shootdown_cpus(nmi_shootdown_cb callback);
 void run_crash_ipi_callback(struct pt_regs *regs);
+bool cpus_shotdown(void);
 
 #endif /* _ASM_X86_REBOOT_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 8e9725c..3b56710 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include <asm/reboot.h>
 
 #include "mce-internal.h"
 
@@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 */
int lmce = 1;
 
-   /* If this CPU is offline, just bail out. */
-   if (cpu_is_offline(smp_processor_id())) {
+   /* If nmi shootdown happened or this CPU is offline, just bail out. */
+   if (cpus_shotdown() ||
+   cpu_is_offline(smp_processor_id())) {
u64 mcgstatus;
 
mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index e244c19..b301c8d 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -752,7 +752,7 @@ void machine_crash_shutdown(struct pt_regs *regs)
 #if defined(CONFIG_SMP)
 
 /* This keeps a track of which one is crashing cpu. */
-static int crashing_cpu;
+static int crashing_cpu = -1;
 static nmi_shootdown_cb shootdown_callback;
 
 static atomic_t waiting_for_crash_ipi;
@@ -852,6 +852,14 @@ void nmi_panic_self_stop(struct pt_regs *regs)
}
 }
 
+bool cpus_shotdown(void)
+{
+   if (crashing_cpu != -1)
+   return true;
+
+   return false;
+}
+
 #else /* !CONFIG_SMP */
 void nmi_shootdown_cpus(nmi_shootdown_cb callback)
 {
@@ -861,4 +869,10 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
 void run_crash_ipi_callback(struct pt_regs *regs)
 {
 }
+
+bool cpus_shotdown(void)
+{
+   return false;
+}
+
 #endif
-- 
1.8.3.1




Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-17 Thread Xunlei Pang
On 02/17/2017 at 05:07 PM, Borislav Petkov wrote:
> On Fri, Feb 17, 2017 at 09:53:21AM +0800, Xunlei Pang wrote:
>> It changes the value of cpu_online_mask/etc which will cause confusion to 
>> vmcore analysis.
> Then export the crashing_cpu variable, initialize it to something
> invalid in the first kernel, -1 for example, and test it in the #MC
> handlier like this:
>
>   int cpu;
>
>   ...
>
>   cpu = smp_processor_id();
>
>   if (cpu_is_offline(cpu) ||
>       ((crashing_cpu != -1) && (crashing_cpu != cpu))) {
>           u64 mcgstatus;
>
>           mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
>           if (mcgstatus & MCG_STATUS_RIPV) {
>                   mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
>                   return;
>           }
>   }

Yes, it is doable, I will do some tests later.

>> Moreover, for the code(see comment inlined)
>>
>> if (cpu_is_offline(smp_processor_id())) {
>>         u64 mcgstatus;
>>
>>         mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
>>         if (mcgstatus & MCG_STATUS_RIPV) { // This condition may be not true: the mce triggered
>>                                            // on the kdump cpu doesn't need to have this bit set
>>                                            // for the other cpus remaining in the 1st kernel.
> Is this on kvm or on a real hardware? Because for kvm I don't care. And
> don't say "theoretically".
>

It's from my understanding; I didn't find an explicit description of this point in the
Intel SDM. If a broadcast SRAO comes on real hardware, will MSR_IA32_MCG_STATUS of each
cpu have the MCG_STATUS_RIPV bit set?

Regards,
Xunlei



Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-16 Thread Xunlei Pang
On 02/16/2017 at 08:22 PM, Borislav Petkov wrote:
> On Thu, Feb 16, 2017 at 07:52:09PM +0800, Xunlei Pang wrote:
>> then mce will be broadcast to the other cpus which are still running
>> in the first kernel(i.e. looping in crash_nmi_callback).
> Simple: the crash code should really mark CPUs as not being online:
>
> void do_machine_check(struct pt_regs *regs, long error_code)
>
>   ...
>
> /* If this CPU is offline, just bail out. */
> if (cpu_is_offline(smp_processor_id())) {
> u64 mcgstatus;
>
> mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
> if (mcgstatus & MCG_STATUS_RIPV) {
> mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> return;
> }
> }
>
> because looping in crash_nmi_callback() does not really denote them as
> CPUs being online.
>
> And just so that you don't disturb the machine too much during crashing,
> you could simply clear them from the online masks, i.e., perhaps call
> remove_cpu_from_maps() with the proper locking around it instead of
> doing a full cpu_down().

It changes the value of cpu_online_mask etc., which will cause confusion during
vmcore analysis.
Moreover, for the code (see the comment inlined):

if (cpu_is_offline(smp_processor_id())) {
        u64 mcgstatus;

        mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
        if (mcgstatus & MCG_STATUS_RIPV) { // This condition may be not true: the mce triggered
                                           // on the kdump cpu doesn't need to have this bit set
                                           // for the other cpus remaining in the 1st kernel.
                mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
                return;
        }
}


Regards,
Xunlei

>
> The machine will be killed anyway after kdump is done writing out
> memory.
>




Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-16 Thread Xunlei Pang
On 02/16/2017 at 06:18 PM, Borislav Petkov wrote:
> On Thu, Feb 16, 2017 at 01:36:37PM +0800, Xunlei Pang wrote:
>> I tried to use qemu to inject SRAO("mce -b 0 0 0xb100 0x5 0x0 
>> 0x0"),
>> it works well in 1st kernel, but it doesn't work for 1st kernel after kdump 
>> boots(seems
>> the cpus remain in 1st kernel don't respond to the simulated broadcasting 
>> mce).
>>
>> But in theory, we know cpus belong to kdump kernel can't respond to the
>> old mce handler, so a single SRAO injection in 1st kernel should be similar.
>> For example, I used "... -smp 2 -cpu Haswell" to launch a simulation with 
>> broadcast
>> mce supported, and inject SRAO to cpu0 only through qemu monitor
>> "mce 0 0 0xb100 0x5 0x0 0x0", cpu0 will timeout/panic and reboot
>> the machine as follows(running on linux-4.9):
>>   Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast 
>> exception handler
> Sounds to me like you're trying hard to prove some point of yours which
> doesn't make much sense to me. And when you say "in theory", that makes
> it even less believable. So I remember asking you for exact steps. That
> above doesn't read like steps but like some babbling and I've actually
> tried to make sense of it for a couple of minutes but failed.
>
> So lemme spell it out for ya. I'd like for you to give me this:
>
> 1. Build kernel with this config
> 2. Boot it in kvm with this settings
> 3. Do this in the guest
> 4. Do that in the guest
> 5. ...
> 6. ...
>
>
> And all should be exact commands so that I can do them here on my machine.
>

Sorry, missed your point.

The steps should be as follows:
1. Prepare a multi-core Intel machine with broadcast MCE support.
   Enable kdump (crashkernel=256M) and configure the kdump kernel to boot with "nr_cpus=1".
2. Activate kdump, and crash the first kernel on some cpu, say cpu1
   (taskset -c 1 echo c > /proc/sysrq-trigger), then kdump will boot on cpu1.
3. After kdump boots up (let it enter the shell), trigger an SRAO on cpu1
   (QEMU monitor cmd: mce -b 1 0 0xb100 0x5 0x0 0x0),
   then the mce will be broadcast to the other cpus which are still running
   in the first kernel (i.e. looping in crash_nmi_callback).
   If you own some hardware to inject an mce, that would be great, as QEMU does not
   work correctly for me.
4. Then something like below is expected to happen:

[1.468556] tsc: Refined TSC clocksource calibration: 2933.437 MHz
 Starting Kdump Vmcore Save Service...
kdump: saving to /sysroot//var/crash/127.0.0.1-2015-09-01-05:07:03/
kdump: saving vmcore-dmesg.txt
[   39.10] mce: [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 2: 
bd00017a
[   39.10] mce: [Hardware Error]: TSC 0 ADDR 6160 MISC 8c 
[   39.10] mce: [Hardware Error]: PROCESSOR 0:106a3 TIME 1441083980 SOCKET 
0 APIC 0 microcode 1
[   39.10] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[   39.10] Kernel panic - not syncing: Timeout: Not all CPUs entered 
broadcast exception handler
[   39.10] Shutting down cpus with NMI
[1.758463] Uhhuh. NMI received for unknown reason 20 on CPU 0.
[1.758463] Do you have a strange power saving mode enabled?
[1.758463] Dazed and confused, but trying to continue
[   39.10] Rebooting in 30 seconds..

Regards,
Xunlei



Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-15 Thread Xunlei Pang
On 01/26/2017 at 02:44 PM, Borislav Petkov wrote:
> On Thu, Jan 26, 2017 at 02:30:02PM +0800, Xunlei Pang wrote:
>> The hardware machine check is hard to reproduce, but the mce code of
>> RHEL7 is quite the same as that of tip/master, anyway we are able to
>> inject software mce to reproduce it.
> Please give me your exact steps so that I can try to reproduce it here
> too.
>

Hi Borislav,

I tried to use qemu to inject an SRAO ("mce -b 0 0 0xb100 0x5 0x0 0x0");
it works well in the 1st kernel, but it doesn't work for the 1st kernel after kdump
boots (it seems the cpus remaining in the 1st kernel don't respond to the simulated
broadcast mce).

But in theory, we know the cpus belonging to the kdump kernel can't respond to the
old mce handler, so a single SRAO injection in the 1st kernel should be similar.
For example, I used "... -smp 2 -cpu Haswell" to launch a guest with broadcast
mce supported, and injected an SRAO to cpu0 only through the qemu monitor
("mce 0 0 0xb100 0x5 0x0 0x0"); cpu0 will time out/panic and reboot
the machine as follows (running on linux-4.9):
  Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
  Kernel Offset: disabled
  Rebooting in 30 seconds..

Regards,
Xunlei



Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-25 Thread Xunlei Pang
On 01/24/2017 at 08:22 PM, Borislav Petkov wrote:
> On Tue, Jan 24, 2017 at 09:27:45AM +0800, Xunlei Pang wrote:
>> It occurred on real hardware when testing crash dump.
>>
>> 1) SysRq-c was injected for the test in 1st kernel
>> [ 49.897279] SysRq : Trigger a crash 2) The 2nd kernel started for kdump
>>[ 0.00] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-229.el7.x86_64 
>> root=UUID=976a15c8-8cbe-44ad-bb91-23f9b18e8789
> Yeah, no, I'm not debugging the RH Frankenstein kernel.
>
> Please retrigger this with latest tip/master first.
>

The hardware machine check is hard to reproduce, but the mce code of RHEL7 is quite
the same as that of tip/master; anyway, we are able to inject a software mce to
reproduce it.

It is also clear from a theoretical analysis of the code.

Regards,
Xunlei



Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-24 Thread Xunlei Pang
On 01/23/2017 at 10:50 PM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
>> One possible timing sequence would be:
>> 1st kernel running on multiple cpus panicked
>> then the crash dump code starts
>> the crash dump code stops the others cpus except the crashing one
>> 2nd kernel boots up on the crash cpu with "nr_cpus=1"
>> some broadcasted mce comes on some cpu amongst the other cpus(not the 
>> crashing cpu)
> Where does this broadcasted MCE come from?
>
> The crash dump code triggered it? Or it happened before the panic()?
>
> Are you talking about an *actual* sequence which you're experiencing on
> real hw or is this something hypothetical?
>

It occurred on real hardware when testing crash dump.

1) SysRq-c was injected for the test in the 1st kernel
   [ 49.897279] SysRq : Trigger a crash
2) The 2nd kernel started for kdump
   [ 0.00] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-229.el7.x86_64 root=UUID=976a15c8-8cbe-44ad-bb91-23f9b18e8789 ro console=ttyS1,115200 nmi_watchdog=0 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug disable_cpu_apicid=0 elfcorehdr=869772K
3) An MCE came to the 1st kernel, a timeout panic occurred, and the machine rebooted
   [6.095706] Dazed and confused, but trying to continue  // message of the 1st kernel
   [   81.655507] Kernel panic - not syncing: Timeout synchronizing machine check over CPUs
   [   82.729324] Shutting down cpus with NMI
   [   82.774539] drm_kms_helper: panic occurred, switching back to text console
   [   82.782257] Rebooting in 10 seconds..

Please see the attached for the full log.

Regards,
Xunlei

[   49.897279] SysRq : Trigger a crash 
[   49.901218] BUG: unable to handle kernel NULL pointer dereference at 
  (null) 
[   49.909988] IP: [] sysrq_handle_crash+0x16/0x20 
[   49.916805] PGD 868add067 PUD 867139067 PMD 0  
[   49.921805] Oops: 0002 [#1] SMP  
[   49.925432] Modules linked in: ipmi_devintf intel_powerclamp coretemp 
intel_rapl kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel 
ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd 
iTCO_wdt sb_edac iTCO_vendor_support ntb mei_me pcspkr edac_core ioatdma 
lpc_ich i2c_i801 ipmi_si mei mfd_core shpchp dca ipmi_msghandler acpi_pad 
acpi_power_meter xfs sd_mod sr_mod crc_t10dif cdrom crct10dif_common 
usb_storage mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit 
drm_kms_helper ata_generic ttm bnx2x pata_acpi mdio drm ata_piix ptp libata 
i2c_core pps_core libcrc32c 
[   49.984994] CPU: 9 PID: 9463 Comm: do-test.sh Not tainted 
3.10.0-229.el7.x86_64 #1 
[   49.993456] Hardware name: NEC Express5800/B120d-h [N8400-126Y]/G7LDV, BIOS 
4.6.2013 10/24/2012 
[   50.003164] task: 88043370 ti: 8808653b8000 task.ti: 
8808653b8000 
[   50.011514] RIP: 0010:[]  [] 
sysrq_handle_crash+0x16/0x20 
[   50.021045] RSP: 0018:8808653bbe80  EFLAGS: 00010046 
[   50.026976] RAX: 000f RBX: 819c18a0 RCX: 
 
[   50.034939] RDX:  RSI: 88087fc2d488 RDI: 
0063 
[   50.042908] RBP: 8808653bbe80 R08: 0092 R09: 
0608 
[   50.050870] R10: 0607 R11: 0003 R12: 
0063 
[   50.058837] R13: 0246 R14: 0007 R15: 
 
[   50.066799] FS:  7f0faaf54740() GS:88087fc2() 
knlGS: 
[   50.075828] CS:  0010 DS:  ES:  CR0: 80050033 
[   50.082244] CR2:  CR3: 000866d07000 CR4: 
000407e0 
[   50.090212] DR0:  DR1:  DR2: 
 
[   50.098173] DR3:  DR6: 0ff0 DR7: 
0400 
[   50.106133] Stack: 
[   50.108388]  8808653bbeb8 81397c32 0002 
7f0faaf58000 
[   50.116671]  8808653bbf48 0002  
8808653bbed0 
[   50.124963]  8139810f 8804674a6540 8808653bbef0 
8122de0d 
[   50.133257] Call Trace: 
[   50.135993]  [] __handle_sysrq+0xa2/0x170 
[   50.142219]  [] write_sysrq_trigger+0x2f/0x40 
[   50.148841]  [] pro c_reg_write+0x3] Code: eb 9b 45 01 f4 
45 39 65 34 75 e5 4c 89 ef e8 e2 f7 ff ff eb db 66 66 66 66 90 55 c7 05 50 d7 
59 00 01 00 00 00 48 89 e5 0f ae f8  04 25 00 00 00 00 01 5d c3 66 66 66 66 
90 55 31 c0 c7 05 ce  
[   50.194758] RIP  [] sysrq_handle_crash+0x16/0x20 
[   50.201669]  RSP  
[   50.205558] CR2:  
[0.00] Initializing cgroup subsys cpuset 
[0.00] Initializing cgroup subsys cpu 
[0.00] Initializing cgroup subsys cpuacct 
[0.00] Linux version 3.10.0-229.el7.x86_64 
(mockbu...@x86-035.build.eng.bos.redhat.com) (gcc version 4.8.3 20140911 (Red 
Hat 4.8.3-7) (GCC) ) #1 SMP Thu Jan 29 18:37:38 EST 

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/24/2017 at 02:14 AM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 10:01:53AM -0800, Luck, Tony wrote:
>> will ignore the machine check on the other cpus ... assuming
>> that "cpu_is_offline(smp_processor_id())" does the right thing
>> in the kexec case where this is an "old" cpu that isn't online
>> in the new kernel.
> Nice. And kdump did do the dumping on one CPU, AFAIR. So we should be
> good there.
>

"nr_cpus=N" will consume more memory, using very large N is almost
impossible for kdump to boot with considering the limited crash memory
reserved.

For some large machine, nr_cpus=1 might not be enough, we have to use
nr_cpus=4 or more, it is also helpful for the vmcore parallel dumping :-)

Regards,
Xunlei



Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/24/2017 at 09:46 AM, Xunlei Pang wrote:
> On 01/24/2017 at 01:51 AM, Borislav Petkov wrote:
>> Hey Tony,
>>
>> a "welcome back" is in order? :-)
>>
>> On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
>>> If the system had experienced some memory corruption, but
>>> recovered ... then there would be some pages sitting around
>>> that the old kernel had marked as POISON and stopped using.
>>> The kexec'd kernel doesn't know about these, so may touch that
>>> memory while taking a crash dump ...
>> Hmm, pass a list of poisoned pages to the kdump kernel so as not to
>> touch. Looks like there's already functionality for that:
>>
>> "makedumpfile can exclude the following types of pages while copying
>> VMCORE to DUMPFILE, and a user can choose which type of pages will be
>> excluded.
>>
>> - Pages filled with zero
>> - Cache pages
>> - User process data pages
>> - Free pages"
>>
>>  (there is a makedumpfile manpage somewhere)
>>
>> And apparently crash knows about poisoned pages and handles them:
>>
>> static int __init crash_save_vmcoreinfo_init(void)
>> {
>>  ...
>> #ifdef CONFIG_MEMORY_FAILURE
>> VMCOREINFO_NUMBER(PG_hwpoison);
>> #endif
>>
>> so if that works, the kexeced kernel should know about that list.
> From the log in my previous reply, MCE occurred before makedumpfile dumping,
> so I guess if the poisoned ones belong to the crash reserved memory or other
> type of events?

Another possibility may be some system-reserved/PCIe memory
which is shared between the 1st and 2nd kernels.

>
> Besides, some kdump kernel may not use makedumpfile, for example a simple "cp"
> is also allowed to process "/proc/vmcore".
>
>>> and then you have a broadcast machine check (on older[1] Intel CPUs
>>> that don't support local machine check).
>> Right.
>>
>>> This is hard to work around. You really need all the CPUs to have set
>>> CR4.MCE=1 (if any didn't, then they will force a reset when they see
>>> the machine check). Also you need to make sure that they jump to the
>>> copy of do_machine_check() in the new kernel, not the old kernel.
>> Doesn't matter, right? The new copy is as clueless as the old one about
>> those MCEs.
>>
> It's the code in mce_start(), it waits for all the online cpus including the 
> cpus
> that kdump boots on to synchronize.
>
> So for new mce handler of kdump kernel, it is fine as the number of online 
> cpus
> is correct; as for old mce handler of 1st kernel, it's not true because some 
> cpus
> which are regarded online from 1st kernel's view are running the 2nd kernel 
> now,
> they can't respond to the old mce handler which will timeout the old mce 
> handler.
>
> Regards,
> Xunlei




Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/24/2017 at 01:51 AM, Borislav Petkov wrote:
> Hey Tony,
>
> a "welcome back" is in order? :-)
>
> On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
>> If the system had experienced some memory corruption, but
>> recovered ... then there would be some pages sitting around
>> that the old kernel had marked as POISON and stopped using.
>> The kexec'd kernel doesn't know about these, so may touch that
>> memory while taking a crash dump ...
> Hmm, pass a list of poisoned pages to the kdump kernel so as not to
> touch. Looks like there's already functionality for that:
>
> "makedumpfile can exclude the following types of pages while copying
> VMCORE to DUMPFILE, and a user can choose which type of pages will be
> excluded.
>
> - Pages filled with zero
> - Cache pages
> - User process data pages
> - Free pages"
>
>  (there is a makedumpfile manpage somewhere)
>
> And apparently crash knows about poisoned pages and handles them:
>
> static int __init crash_save_vmcoreinfo_init(void)
> {
>   ...
> #ifdef CONFIG_MEMORY_FAILURE
> VMCOREINFO_NUMBER(PG_hwpoison);
> #endif
>
> so if that works, the kexeced kernel should know about that list.

From the log in my previous reply, the MCE occurred before makedumpfile started
dumping, so I wonder whether the poisoned pages belong to the crash reserved memory
or come from some other type of event?

Besides, some kdump kernels may not use makedumpfile; for example, a simple "cp"
is also allowed to process "/proc/vmcore".

>
>> and then you have a broadcast machine check (on older[1] Intel CPUs
>> that don't support local machine check).
> Right.
>
>> This is hard to work around. You really need all the CPUs to have set
>> CR4.MCE=1 (if any didn't, then they will force a reset when they see
>> the machine check). Also you need to make sure that they jump to the
>> copy of do_machine_check() in the new kernel, not the old kernel.
> Doesn't matter, right? The new copy is as clueless as the old one about
> those MCEs.
>

It's the code in mce_start(): it waits for all the online cpus, including the cpus
that kdump boots on, to synchronize.

So for the new mce handler of the kdump kernel it is fine, as its number of online cpus
is correct; but for the old mce handler of the 1st kernel it's not true, because some
cpus which are regarded as online from the 1st kernel's point of view are now running
the 2nd kernel, and they can't respond to the old mce handler, which will therefore
time out.
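
To show the kind of timeout I mean, here is a rough user-space analogy of a rendezvous
with a deadline (plain C with pthreads; this is only an illustration, not the kernel's
mce_start() code): if some of the expected participants never check in, the waiter
gives up, which is where the old handler panics.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    #define EXPECTED_CPUS 4   /* what the 1st kernel believes is online    */
    #define CHECKED_IN    3   /* only these actually enter the old handler */

    static atomic_int callin;

    static void *old_handler(void *arg)
    {
            (void)arg;
            atomic_fetch_add(&callin, 1);   /* "this cpu has entered the handler" */
            return NULL;
    }

    int main(void)
    {
            pthread_t t[CHECKED_IN];
            int spins = 0;

            for (int i = 0; i < CHECKED_IN; i++)
                    pthread_create(&t[i], NULL, old_handler, NULL);

            /* Rendezvous loop: wait for all "online" cpus, with a deadline. */
            while (atomic_load(&callin) != EXPECTED_CPUS) {
                    if (++spins > 1000) {
                            printf("Timeout: not all CPUs entered the handler\n");
                            return 1;       /* the kernel would panic here */
                    }
                    usleep(1000);
            }
            printf("all CPUs rendezvoused\n");

            for (int i = 0; i < CHECKED_IN; i++)
                    pthread_join(t[i], NULL);
            return 0;
    }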

Regards,
Xunlei



Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/23/2017 at 08:51 PM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 04:01:51PM +0800, Xunlei Pang wrote:
>> We met an issue for kdump: after kdump kernel boots up,
>> and there comes a broadcasted mce in first kernel, the
> How does that even happen?
>
> Lemme try to understand this correctly: the first kernel gets an
> MCE, kdump starts and boots a *whole* kernel and *then* you get the
> broadcasted MCE? I have real hard time believing that.
>
> What happened to the approach of clearing CR4.MCE before loading the
> kdump kernel, in native_machine_shutdown() or wherever does the kdump
> gets loaded...
>

One possible timing sequence would be:
the 1st kernel, running on multiple cpus, panicked
then the crash dump code starts
the crash dump code stops the other cpus except the crashing one
the 2nd kernel boots up on the crash cpu with "nr_cpus=1"
some broadcast mce comes in on some cpu amongst the other cpus (not the crashing cpu)
the other cpus enter the old mce handler of the 1st kernel, while the crash cpu enters the new mce handler of the 2nd kernel
the old mce handler of the 1st kernel will time out and panic due to mce synchronization under the default setting

Regards,
Xunlei



Re: [PATCH v2] x86/crash: Update the stale comment in reserve_crashkernel()

2017-01-23 Thread Xunlei Pang
On 01/23/2017 at 04:48 PM, Dave Young wrote:
> Hi, Xunlei
>
> On 01/23/17 at 02:48pm, Xunlei Pang wrote:
>> CRASH_KERNEL_ADDR_MAX has been missing for a long time,
>> update it with more detailed explanation.
>>
>> Cc: Robert LeBlanc <rob...@leblancnet.us>
>> Cc: Baoquan He <b...@redhat.com>
>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>> ---
>>  arch/x86/kernel/setup.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>> index 4cfba94..c32a167 100644
>> --- a/arch/x86/kernel/setup.c
>> +++ b/arch/x86/kernel/setup.c
>> @@ -575,7 +575,9 @@ static void __init reserve_crashkernel(void)
>>  /* 0 means: find the address automatically */
>>  if (crash_base <= 0) {
>>  /*
>> - *  kexec want bzImage is below CRASH_KERNEL_ADDR_MAX
>> + * Set CRASH_ADDR_LOW_MAX upper bound for crash memory
>> + * as old kexec-tools loads bzImage below that, unless
>> + * "crashkernel=size[KMG],high" is specified.
> There is already comment before the define of those macros, also
> there are 32bit case which has a different reason about 512M there as
> well.

If we look at it from kexec's perspective, we have a common CRASH_ADDR_LOW_MAX
definition for both x86 32-bit and 64-bit (32-bit x86 defines the same value for
CRASH_ADDR_LOW_MAX and CRASH_ADDR_HIGH_MAX), so old kexec will load below
CRASH_ADDR_LOW_MAX; so I think the description is fine :-)

Regards,
Xunlei

>
> So it looks better to just drop the one line comment without adding
> further comments here.
>>   */
>>  crash_base = memblock_find_in_range(CRASH_ALIGN,
>>  high ? CRASH_ADDR_HIGH_MAX
>> -- 
>> 1.8.3.1
>>
> Thanks
> Dave
>




[PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
We met an issue with kdump: after the kdump kernel boots up
and a broadcast MCE arrives in the first kernel, the other
cpus remaining in the first kernel enter the old MCE handler
of the first kernel, then time out and panic due to MCE
synchronization, and finally reset the cpus running kdump.

This patch lets cpus stay quiet when panic happens, so
before the crashing cpu shoots them down, or after kdump
boots, they should not do anything except clear MCG_STATUS
in case of a broadcast MCE. This helps kdump push the vmcore
dumping as far as it can.

Previous efforts:
https://patchwork.kernel.org/patch/6167631/
https://lists.gt.net/linux/kernel/2146557

Cc: Naoya Horiguchi <n-horigu...@ah.jp.nec.com>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 00ef432..0c2bf77 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1157,6 +1157,23 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 
mce_gather_info(&m, regs);
 
+   /*
+* Check if this MCE is signaled to only this logical processor,
+* on Intel only.
+*/
+   if (m.cpuvendor == X86_VENDOR_INTEL)
+   lmce = m.mcgstatus & MCG_STATUS_LMCES;
+
+   /*
+* Special treatment for Intel broadcasted machine check:
+* To avoid panic due to MCE synchronization in case of kdump,
+* after system panic, clear global status and bail out.
+*/
+   if (!lmce && atomic_read(&panic_cpu) != PANIC_CPU_INVALID) {
+   wrmsrl(MSR_IA32_MCG_STATUS, 0);
+   goto out;
+   }
+
final = this_cpu_ptr(&mces_seen);
*final = m;
 
@@ -1174,13 +1191,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
kill_it = 1;
 
/*
-* Check if this MCE is signaled to only this logical processor,
-* on Intel only.
-*/
-   if (m.cpuvendor == X86_VENDOR_INTEL)
-   lmce = m.mcgstatus & MCG_STATUS_LMCES;
-
-   /*
 * Go through all banks in exclusion of the other CPUs. This way we
 * don't report duplicated events on shared banks because the first one
 * to see it will clear it. If this is a Local MCE, then no need to
-- 
1.8.3.1




[PATCH v2] x86/crash: Update the stale comment in reserve_crashkernel()

2017-01-22 Thread Xunlei Pang
The comment still refers to CRASH_KERNEL_ADDR_MAX, which has been
gone for a long time; update it with a more detailed explanation.

Cc: Robert LeBlanc <rob...@leblancnet.us>
Cc: Baoquan He <b...@redhat.com>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 arch/x86/kernel/setup.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 4cfba94..c32a167 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -575,7 +575,9 @@ static void __init reserve_crashkernel(void)
/* 0 means: find the address automatically */
if (crash_base <= 0) {
/*
-*  kexec want bzImage is below CRASH_KERNEL_ADDR_MAX
+* Set CRASH_ADDR_LOW_MAX upper bound for crash memory
+* as old kexec-tools loads bzImage below that, unless
+* "crashkernel=size[KMG],high" is specified.
 */
crash_base = memblock_find_in_range(CRASH_ALIGN,
high ? CRASH_ADDR_HIGH_MAX
-- 
1.8.3.1




Re: [PATCH] Add +~800M crashkernel explaination

2017-01-11 Thread Xunlei Pang
On 01/12/2017 at 03:35 AM, Robert LeBlanc wrote:
> On Wed, Dec 14, 2016 at 4:17 PM, Xunlei Pang <xp...@redhat.com> wrote:
>> As I replied in another post, if you really want to detail the behaviour, 
>> should mention
>> "crashkernel=size[KMG][@offset[KMG]]" with @offset[KMG] specified 
>> explicitly, after
>> all, it's handled differently with no upper bound limitation, but doing this 
>> may put
>> the first kernel at the risk of lacking low memory(some devices require 
>> 32bit DMA),
>> must use it with care because the kernel will assume users are aware of what 
>> they
>> are doing and make a successful reservation as long as the given range is 
>> available.
> crashkernel=1024M@0x1000
>
> I can't get the offset to work. It seems that it allocates the space
> and loads the crash kernel, but I couldn't get it to actually boot
> into the crash kernel. Does it work for you? I'm using the 4.9 kernel.

Not sure what problem you hit, but the kdump kernel boots well using 4.9 on
my x86_64 machine.

Regards,
Xunlei



Re: [PATCH] x86/crash: Update the stale comment in reserve_crashkernel()

2016-12-23 Thread Xunlei Pang
On 12/22/2016 at 11:22 AM, Baoquan He wrote:
> On 12/15/16 at 11:30am, Xunlei Pang wrote:
>> CRASH_KERNEL_ADDR_MAX was missing for a long time, update it
>> with more detailed explanation.
>>
>> Cc: Robert LeBlanc <rob...@leblancnet.us>
>> Cc: Baoquan He <b...@redhat.com>
>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>> ---
>>  arch/x86/kernel/setup.c | 5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>> index 9c337b0..79ee507 100644
>> --- a/arch/x86/kernel/setup.c
>> +++ b/arch/x86/kernel/setup.c
>> @@ -575,7 +575,10 @@ static void __init reserve_crashkernel(void)
>>  /* 0 means: find the address automatically */
>>  if (crash_base <= 0) {
>>  /*
>> - *  kexec want bzImage is below CRASH_KERNEL_ADDR_MAX
>> + * Set CRASH_ADDR_LOW_MAX upper bound for crash range
>> + * as old kexec-tools loads bzImage below that, unless
>> + * "size,high" or "size@offset"(nonzero offset, see the
>> + * else leg below) is specified.
> Yes, this is a good catch. It might be better to add comment only about
> this if branch. If you want to say more about the upper bounds, better

OK, how about the following change?
/*
 * Set CRASH_ADDR_LOW_MAX upper bound for crash memory
 * as old kexec-tools loads bzImage below that, unless
 * "crashkernel=size[KMG],high" is specified.
 */

> discuss with Robert LeBlanc to see if it can be detailed in kdump.txt.

Yes, this is independent of Robert's documentation patch.

>
> Also please CC to x86 maintainers, or akpm. They can help merge this.

OK, thanks!

Regards,
Xunlei



Re: [PATCH v2] kexec: add cond_resched into kimage_alloc_crash_control_pages

2016-12-20 Thread Xunlei Pang
On 12/19/2016 at 11:23 AM, Baoquan He wrote:
> On 12/09/16 at 03:16pm, Xunlei Pang wrote:
>> On 12/09/2016 at 01:13 PM, zhong jiang wrote:
>>> On 2016/12/8 17:41, Xunlei Pang wrote:
>>>> On 12/08/2016 at 10:37 AM, zhongjiang wrote:
>>>>> From: zhong jiang <zhongji...@huawei.com>
>>>>>
>>>>> A soft lookup will occur when I run trinity in syscall kexec_load.
>>>>> the corresponding stack information is as follows.
>>>>>
>>>>> [  237.235937] BUG: soft lockup - CPU#6 stuck for 22s! [trinity-c6:13859]
>>>>> [  237.242699] Kernel panic - not syncing: softlockup: hung tasks
>>>>> [  237.248573] CPU: 6 PID: 13859 Comm: trinity-c6 Tainted: G   O 
>>>>> L V---   3.10.0-327.28.3.35.zhongjiang.x86_64 #1
>>>>> [  237.259984] Hardware name: Huawei Technologies Co., Ltd. Tecal BH622 
>>>>> V2/BC01SRSA0, BIOS RMIBV386 06/30/2014
>>>>> [  237.269752]  8187626b 18cfde31 88184c803e18 
>>>>> 81638f16
>>>>> [  237.277471]  88184c803e98 8163278f 0008 
>>>>> 88184c803ea8
>>>>> [  237.285190]  88184c803e48 18cfde31 88184c803e67 
>>>>> 
>>>>> [  237.292909] Call Trace:
>>>>> [  237.295404][] dump_stack+0x19/0x1b
>>>>> [  237.301352]  [] panic+0xd8/0x214
>>>>> [  237.306196]  [] watchdog_timer_fn+0x1cc/0x1e0
>>>>> [  237.312157]  [] ? watchdog_enable+0xc0/0xc0
>>>>> [  237.317955]  [] __hrtimer_run_queues+0xd2/0x260
>>>>> [  237.324087]  [] hrtimer_interrupt+0xb0/0x1e0
>>>>> [  237.329963]  [] ? call_softirq+0x1c/0x30
>>>>> [  237.335500]  [] local_apic_timer_interrupt+0x37/0x60
>>>>> [  237.342228]  [] smp_apic_timer_interrupt+0x3f/0x60
>>>>> [  237.348771]  [] apic_timer_interrupt+0x6d/0x80
>>>>> [  237.354967][] ? 
>>>>> kimage_alloc_control_pages+0x80/0x270
>>>>> [  237.362875]  [] ? kmem_cache_alloc_trace+0x1ce/0x1f0
>>>>> [  237.369592]  [] ? do_kimage_alloc_init+0x1f/0x90
>>>>> [  237.375992]  [] kimage_alloc_init+0x12a/0x180
>>>>> [  237.382103]  [] SyS_kexec_load+0x20a/0x260
>>>>> [  237.387957]  [] system_call_fastpath+0x16/0x1b
>>>>>
>>>>> the first time allocate control pages may take too much time because
>>>>> crash_res.end can be set to a higher value. we need to add cond_resched
>>>>> to avoid the issue.
>>>>>
>>>>> The patch have been tested and above issue is not appear.
>>>>>
>>>>> Signed-off-by: zhong jiang <zhongji...@huawei.com>
>>>>> ---
>>>>>  kernel/kexec_core.c | 2 ++
>>>>>  1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>>>>> index 5616755..bfc9621 100644
>>>>> --- a/kernel/kexec_core.c
>>>>> +++ b/kernel/kexec_core.c
>>>>> @@ -441,6 +441,8 @@ static struct page 
>>>>> *kimage_alloc_crash_control_pages(struct kimage *image,
>>>>>   while (hole_end <= crashk_res.end) {
>>>>>   unsigned long i;
>>>>>  
>>>>> + cond_resched();
>>>>> +
>>>> I can't see why it would take a long time to loop inside, the job it does 
>>>> is simply to find a control area
>>>> not overlapped with image->segment[], you can see the loop "for (i = 0; i 
>>>> < image->nr_segments; i++)",
>>>> @hole_end will be advanced to the end of its next nearby segment once 
>>>> overlap was detected each loop,
>>>> also there are limited (<=16) segments, so it won't take long to locate 
>>>> the right area.
>>>>
>>>> Am I missing something?
>>>>
>>>> Regards,
>>>> Xunlei
>>>   if the crashkernel = auto is set in cmdline.  it represent crashk_res.end 
>>> will exceed to 4G, the first allocate control pages will
>>>   loop  million times. if we set crashk_res.end to the higher value 
>>> manually,  you can image
>> How does "loop million times" happen? See my inlined comments prefixed with "pxl".
>>
>> kimage_alloc_crash_control_pages():
>> while (hole_end <= crashk_res.end) {
>> un
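
For reference, a rough user-space model of the control-area search we are discussing
(the segment layout and sizes are made up; this only illustrates how hole_end jumps
past an overlapping segment, it is not the kexec_core.c code):

    #include <stdio.h>

    struct segment { unsigned long mem, memsz; };

    int main(void)
    {
            /* Pretend two crash segments are already loaded in the reserved region. */
            struct segment seg[] = { { 0x0, 0x3000 }, { 0x3800, 0x1000 } };
            int nr_segments = 2;
            unsigned long size = 0x1000;            /* control area size          */
            unsigned long crash_end = 0x100000;     /* end of the reserved region */
            unsigned long hole_start = 0, hole_end = size - 1;

            while (hole_end <= crash_end) {
                    int i;

                    /* Does the candidate hole overlap any segment? */
                    for (i = 0; i < nr_segments; i++) {
                            unsigned long mstart = seg[i].mem;
                            unsigned long mend = mstart + seg[i].memsz - 1;

                            if (hole_end >= mstart && hole_start <= mend) {
                                    /* Jump straight past the segment. */
                                    hole_start = (mend + size) & ~(size - 1);
                                    hole_end = hole_start + size - 1;
                                    break;
                            }
                    }
                    if (i == nr_segments) {         /* no overlap: hole found */
                            printf("hole at 0x%lx\n", hole_start);
                            return 0;
                    }
            }
            printf("no hole found\n");
            return 1;
    }

So the outer loop advances only when an overlap is detected, at most once per segment,
which is why a handful of segments cannot make it spin for long.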

Re: [PATCHv2 2/2] [fs] proc/vmcore: check the dummy place holder for offline cpu to avoid warning

2016-12-20 Thread Xunlei Pang
On 12/21/2016 at 11:57 AM, Pratyush Anand wrote:
>
>
> On Wednesday 21 December 2016 08:56 AM, Xunlei Pang wrote:
>> On 12/20/2016 at 11:38 PM, Pratyush Anand wrote:
>>>
>>>
>>> On Monday 19 December 2016 08:10 AM, Dave Young wrote:
>>>> Hi, Pingfan
>>>>
>>>> On 12/19/16 at 10:08am, Pingfan Liu wrote:
>>>>>> kexec-tools always allocates program headers for present cpus. But
>>>>>> when crashing, offline cpus have dummy headers. We do not copy these
>>>>>> dummy notes into ELF file, also have no need of warning on them.
>>>> I still think it is not worth such a fix, if you feel a lot of warnings
>>>> in case large cpu numbers, I think you can change the pr_warn to
>>>> pr_warn_once, we do not care the null cpu notes if it has nothing bad
>>>> to the vmcore.
>>>>
>>>
>>> I agree. Warning is more like information here. May be, we can count the 
>>> number of times real_sz was 0, and then can print an info at the end in 
>>> stead of warning, like..."N number of CPUs would have been offline, PT_NOTE 
>>> entries was absent for them."
>>
>> Well, OTOH the warning may also be due to some user-space misuse, we can't 
>> distinguish that without extra information added.
>
> Yes, yes..I agree, I meant that the above info is just indicative. May be 
> "might have been" could be better word than "would have been" in the above 
> info print message.
>
>
>>
>> Another possible user-space fix would be: Firstly fix kexec-tools to add 
>> notes only for online cpus,
>> then utilize udev rules(cpu online/offline events) to automatically trigger 
>> kdump kernel reload.
>
> Hummm..this is certainly possible. But can we do much even when we get the 
> info that the PT_NOTE was compromised by user space?
>
> Therefore, I am of the view that if at all we are concerned about number of 
> warning messages in case of multiple offline cpu, then we can just print the 
> total number of NULL PT_NOTE at the end of loop.

Yes, agree.

Regards,
Xunlei



Re: [PATCHv2 2/2] [fs] proc/vmcore: check the dummy place holder for offline cpu to avoid warning

2016-12-20 Thread Xunlei Pang
On 12/20/2016 at 11:38 PM, Pratyush Anand wrote:
>
>
> On Monday 19 December 2016 08:10 AM, Dave Young wrote:
>> Hi, Pingfan
>>
>> On 12/19/16 at 10:08am, Pingfan Liu wrote:
>>> > kexec-tools always allocates program headers for present cpus. But
>>> > when crashing, offline cpus have dummy headers. We do not copy these
>>> > dummy notes into ELF file, also have no need of warning on them.
>> I still think it is not worth such a fix, if you feel a lot of warnings
>> in case large cpu numbers, I think you can change the pr_warn to
>> pr_warn_once, we do not care the null cpu notes if it has nothing bad
>> to the vmcore.
>>
>
> I agree. Warning is more like information here. May be, we can count the 
> number of times real_sz was 0, and then can print an info at the end in stead 
> of warning, like..."N number of CPUs would have been offline, PT_NOTE entries 
> was absent for them."

Well, OTOH the warning may also be due to some user-space misuse, we can't 
distinguish that without extra information added.

Another possible user-space fix would be: firstly fix kexec-tools to add notes only
for online cpus, then utilize udev rules (cpu online/offline events) to automatically
trigger a kdump kernel reload.
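
For example, a udev rule along these lines could do it (the file name, paths and
service name here are assumptions, just to illustrate the idea):

    # /etc/udev/rules.d/99-kdump-cpu.rules (hypothetical)
    SUBSYSTEM=="cpu", ACTION=="online",  RUN+="/usr/bin/systemctl try-restart kdump.service"
    SUBSYSTEM=="cpu", ACTION=="offline", RUN+="/usr/bin/systemctl try-restart kdump.service"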

Regards,
Xunlei



Re: [PATCH 2/2] [fs] proc/vmcore: check the dummy place holder for offline cpu to avoid warning

2016-12-14 Thread Xunlei Pang
On 12/14/2016 at 02:11 PM, Pingfan Liu wrote:
> kexec-tools always allocates program headers for possible cpus. But
> when crashing, offline cpus have dummy headers. We do not copy these
> dummy notes into ELF file, also have no need of warning on them.
>
> Signed-off-by: Pingfan Liu 
> ---
>  fs/proc/vmcore.c | 21 +
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 8ab782d..bbc9dad 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -526,9 +526,10 @@ static u64 __init get_vmcore_size(size_t elfsz, size_t 
> elfnotesegsz,
>   */
>  static int __init update_note_header_size_elf64(const Elf64_Ehdr *ehdr_ptr)
>  {
> - int i, rc=0;
> + int i, j, rc = 0;
>   Elf64_Phdr *phdr_ptr;
>   Elf64_Nhdr *nhdr_ptr;
> + bool warn;
>  
>   phdr_ptr = (Elf64_Phdr *)(ehdr_ptr + 1);
>   for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> @@ -536,6 +537,7 @@ static int __init update_note_header_size_elf64(const 
> Elf64_Ehdr *ehdr_ptr)
>   u64 offset, max_sz, sz, real_sz = 0;
>   if (phdr_ptr->p_type != PT_NOTE)
>   continue;
> + warn = true;
>   max_sz = phdr_ptr->p_memsz;
>   offset = phdr_ptr->p_offset;
>   notes_section = kmalloc(max_sz, GFP_KERNEL);
> @@ -547,7 +549,7 @@ static int __init update_note_header_size_elf64(const 
> Elf64_Ehdr *ehdr_ptr)
>   return rc;
>   }
>   nhdr_ptr = notes_section;
> - while (nhdr_ptr->n_namesz != 0) {
> + for (j = 0; nhdr_ptr->n_namesz != 0; j++) {

Hi Pingfan,

I think we don't need to be this complex. How about simply checking, before the while
loop, whether it is the cpu dummy note (initialize it with some magic), and then
handling it differently, e.g. setting a "nowarn" flag to use afterwards and making
sure it has zero p_memsz?

Also do the similar thing for update_note_header_size_elf32()?
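
Roughly the shape of the check I mean (an untested sketch, reusing the names from
your patch):

    nhdr_ptr = notes_section;
    /* A lone NT_DUMMY note marks an offline cpu's crash_notes buffer. */
    if (nhdr_ptr->n_type == NT_DUMMY) {
            phdr_ptr->p_memsz = 0;  /* don't copy the dummy note */
            kfree(notes_section);
            continue;               /* and skip the warning      */
    }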

Regards,
Xunlei

>   sz = sizeof(Elf64_Nhdr) +
>   (((u64)nhdr_ptr->n_namesz + 3) & ~3) +
>   (((u64)nhdr_ptr->n_descsz + 3) & ~3);
> @@ -559,11 +561,22 @@ static int __init update_note_header_size_elf64(const 
> Elf64_Ehdr *ehdr_ptr)
>   real_sz += sz;
>   nhdr_ptr = (Elf64_Nhdr*)((char*)nhdr_ptr + sz);
>   }
> + if (real_sz != 0)
> + warn = false;
> + if (j == 1) {
> + nhdr_ptr = notes_section;
> + if ((nhdr_ptr->n_type == NT_DUMMY)
> +   && !strncmp(KEXEC_CORE_NOTE_NAME,
> + (char *)nhdr_ptr + sizeof(Elf64_Nhdr),
> + strlen(KEXEC_CORE_NOTE_NAME))) {
> + /* do not copy this dummy note */
> + real_sz = 0;
> + }
> + }
>   kfree(notes_section);
>   phdr_ptr->p_memsz = real_sz;
> - if (real_sz == 0) {
> + if (warn)
>   pr_warn("Warning: Zero PT_NOTE entries found\n");
> - }
>   }
>  
>   return 0;




Re: [PATCH 1/2] kexec: add a dummy note for each offline cpu

2016-12-14 Thread Xunlei Pang
On 12/14/2016 at 05:13 PM, Liu ping fan wrote:
> [...]
>>> No. This patch just place a mark on these offline cpu. The next patch
>>> for capture kernel will recognize this case, and ignore this kind of
>>> pt_note by the code:
>>> real_sz = 0; // although the size of this kind of PT_NOTE is not zero,
>>> but it contains nothing useful, so just ignore it
>>> phdr_ptr->p_memsz = real_sz
>> If there is any other vmcore functional issue besides throwing "Warning: 
>> Zero PT_NOTE entries found"?
>>
> Not at present when I debugged.

Well, agree that we should fix it given that it produces many unnecessary 
warnings on some machines.

> I just think we can not suppose the behaviour of different archs, so
> just mark out the dummy pt_note. If some archs want to use these notes
> memory,
> they will just overwrite the dummy.

For cpu crash_notes, it should be arch-independent, and related to elf format.

>
> Thx,
> Pingfan
>




Re: [PATCH] Add +~800M crashkernel explaination

2016-12-14 Thread Xunlei Pang
On 12/15/2016 at 01:50 AM, Robert LeBlanc wrote:
> On Tue, Dec 13, 2016 at 8:08 PM, Xunlei Pang <xp...@redhat.com> wrote:
>> On 12/10/2016 at 01:20 PM, Robert LeBlanc wrote:
>>> On Fri, Dec 9, 2016 at 7:49 PM, Baoquan He <b...@redhat.com> wrote:
>>>> On 12/09/16 at 05:22pm, Robert LeBlanc wrote:
>>>>> When trying to configure crashkernel greater than about 800 MB, the
>>>>> kernel fails to allocate memory on x86 and x86_64. This is due to an
>>>>> undocumented limit that the crashkernel and other low memory items must
>>>>> be allocated below 896 MB unless the ",high" option is given. This
>>>>> updates the documentation to explain this and what I understand the
>>>>> limitations to be on the option.
>>>> This is true, but not very accurate. You found it's about 800M, it's
>>>> becasue usually the current kernel need about 40M space to run, and some
>>>> extra reservation before reserve_crashkernel invocation, another ~10M.
>>>> However it's normal case, people may build modules into or have some
>>>> special code to bloat kernel. This patch makes sense to address the
>>>> low|high issue, it might be not good so determined to say ~800M.
>>> My testing showed that I could go anywhere from about 830M to 880M,
>>> depending on distro, kernel version, and stuff that you mentioned. I
>>> just thought some rule of thumb of when to consider using high would
>>> be good. People may not think that 800 MB is 'large' when you have 512
>>> GB of RAM for instance. I thought about making 512 MB be the rule of
>>> thumb, but you can do a lot with ~300 MB.
>> Hi Robert,
>>
>> I think you are correct.
>>
>> For x86, the kernel uses memblock to locate the proper range starts from 
>> 16MB to some "end",
>> without "high" prefix, "end" is CRASH_ADDR_LOW_MAX, otherwise 
>> CRASH_ADDR_HIGH_MAX.
>>
>> You can find the definition for both 32-bit and 64-bit:
>> #ifdef CONFIG_X86_32
>> # define CRASH_ADDR_LOW_MAX (512 << 20)
>> # define CRASH_ADDR_HIGH_MAX(512 << 20)
>> #else
>> # define CRASH_ADDR_LOW_MAX (896UL << 20)
>> # define CRASH_ADDR_HIGH_MAXMAXMEM
>> #endif
>>
>> as some memory was already allocated by the kernel, which means it's highly 
>> likely to get a reservation
>> failure after specifying a crashkernel value near 800MB(for x86_64) which 
>> was what you met. But we can't
>> get the exact threshold, but it would be better if there is some explanation 
>> accordingly in the document.
> To make sure I'm understanding what you are say, you want me to go
> into a bit more detail about the limitation and specify the
> differences between x86 and x86_64, right?

Yeah, it would be better to have one, at least to mention the different upper 
bounds.

As I replied in another post, if you really want to detail the behaviour, you should
mention "crashkernel=size[KMG][@offset[KMG]]" with @offset[KMG] specified explicitly;
after all, it is handled differently, with no upper bound limitation. But doing this
may put the first kernel at the risk of lacking low memory (some devices require 32-bit
DMA), so it must be used with care: the kernel assumes users are aware of what they are
doing and makes the reservation as long as the given range is available.
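
For example (the sizes and offsets below are made up, just to show the three forms):

    crashkernel=512M              automatically placed, bounded by CRASH_ADDR_LOW_MAX (896M on x86_64)
    crashkernel=512M,high         automatically placed, bounded by CRASH_ADDR_HIGH_MAX
    crashkernel=512M@0x10000000   fixed base, no upper bound check, reserved only if that range is free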

>
>>> I'm happy to adjust the wording, what would you recommend? Also, I'm
>>> not 100% sure that I got the cases covered correctly. I was surprised
>>> that I could not get it to work with the "new" format with the
>>> multiple ranges, and that specifying an offset would't work either,
>>> although the offset kind of makes sense. Do you know for sure that it
>>> doesn't work with ranges?
>>>
>>> I tried,
>>>
>>> crashkernel=256M-1G:128M,high,1G-4G:256M,high,4G-:512M,high
>>>
>>> and
>>>
>>> crashkernel=256M-1G:128M,1G-4G:256M,4G-:512M,high
>>>
>>> and neither worked. It seems that a better separator would be ';'
>>> instead of ',' for ranges, then you could specify options better. Kind
>>> of hard to change now.
>> For "crashkernel=range1:size1[,range2:size2,...][@offset]"
>> I'm afraid it doesn't support "high" prefix in the current implementation, 
>> so there is no guarantee.
>> I guess we can drop a note to eliminate the confusion.
> I tried to express in the extended syntax section that ',high' is not
> available and you have to use the 'simple' format. Do you

Re: [PATCH 1/2] kexec: add a dummy note for each offline cpu

2016-12-14 Thread Xunlei Pang
On 12/14/2016 at 04:56 PM, Liu ping fan wrote:
> On Wed, Dec 14, 2016 at 4:48 PM, Xunlei Pang <xp...@redhat.com> wrote:
>> On 12/14/2016 at 02:11 PM, Pingfan Liu wrote:
>>> kexec-tools always allocates program headers for each possible cpu. This
>>> incurs zero PT_NOTE for offline cpu. We mark this case so that later,
>>> the capture kernel can distinguish it from the mistake of allocated
>>> program header.
>>> The counterpart of the capture kernel comes in next patch.
>> Hmm, we can initialize the cpu crash note buf in crash_notes_memory_init(), 
>> needless
>> to do it at the crash moment, right?
>>
> The cpus can be on-off-on.., We can not know the user's action.

I meant we can add the fake note into the cpu note buf in advance; then when the crash
happens, the online ones will be overwritten with the real note data, while the others
(!online) will still hold the fake note.
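
Something like the following in crash_notes_memory_init() is what I have in mind
(an untested sketch, reusing the helpers from your patch):

    int cpu;
    u32 *buf;

    /* Pre-fill every possible cpu's note buffer with a dummy note at init
     * time; crash_save_cpu() overwrites it for the cpus that are online
     * when the crash actually happens. */
    for_each_possible_cpu(cpu) {
            buf = (u32 *)per_cpu_ptr(crash_notes, cpu);
            buf = append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_DUMMY,
                                  NULL, 0);
            final_note(buf);
    }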

>
>> BTW, does this cause any issue, for example the crash utility can't parse 
>> the vmcore
>> properly? or just reproduce lots of warnings after offline multiple cpus?
>>
> No. This patch just place a mark on these offline cpu. The next patch
> for capture kernel will recognize this case, and ignore this kind of
> pt_note by the code:
> real_sz = 0; // although the size of this kind of PT_NOTE is not zero,
> but it contains nothing useful, so just ignore it
> phdr_ptr->p_memsz = real_sz

Is there any other vmcore functional issue besides throwing "Warning: Zero
PT_NOTE entries found"?

Regards,
Xunlei



Re: [PATCH 1/2] kexec: add a dummy note for each offline cpu

2016-12-14 Thread Xunlei Pang
On 12/14/2016 at 02:11 PM, Pingfan Liu wrote:
> kexec-tools always allocates program headers for each possible cpu. This
> incurs zero PT_NOTE for offline cpu. We mark this case so that later,
> the capture kernel can distinguish it from the mistake of allocated
> program header.
> The counterpart of the capture kernel comes in next patch.

Hmm, we can initialize the cpu crash note buf in crash_notes_memory_init(); there is
no need to do it at the crash moment, right?

BTW, does this cause any issue, for example the crash utility failing to parse the
vmcore properly? Or does it just produce lots of warnings after offlining multiple cpus?

Regards,
Xunlei

>
> Signed-off-by: Pingfan Liu 
> ---
> This unnecessary warning buzz on all archs when there is offline cpu
>
>  include/uapi/linux/elf.h | 1 +
>  kernel/kexec_core.c  | 9 +
>  2 files changed, 10 insertions(+)
>
> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> index b59ee07..9744f1e 100644
> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -367,6 +367,7 @@ typedef struct elf64_shdr {
>   * using the corresponding note types via the PTRACE_GETREGSET and
>   * PTRACE_SETREGSET requests.
>   */
> +#define NT_DUMMY 0
>  #define NT_PRSTATUS  1
>  #define NT_PRFPREG   2
>  #define NT_PRPSINFO  3
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 5616755..aeac16e 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -891,9 +891,12 @@ void __crash_kexec(struct pt_regs *regs)
>   if (mutex_trylock(_mutex)) {
>   if (kexec_crash_image) {
>   struct pt_regs fixed_regs;
> + unsigned int cpu;
>  
>   crash_setup_regs(_regs, regs);
>   crash_save_vmcoreinfo();
> + for_each_cpu_not(cpu, cpu_online_mask)
> + crash_save_cpu(NULL, cpu);
>   machine_crash_shutdown(_regs);
>   machine_kexec(kexec_crash_image);
>   }
> @@ -1040,6 +1043,12 @@ void crash_save_cpu(struct pt_regs *regs, int cpu)
>   buf = (u32 *)per_cpu_ptr(crash_notes, cpu);
>   if (!buf)
>   return;
> + if (regs == NULL) {
> + buf = append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_DUMMY,
> + NULL, 0);
> + final_note(buf);
> + return;
> + }
>   memset(, 0, sizeof(prstatus));
>   prstatus.pr_pid = current->pid;
>   elf_core_copy_kernel_regs(_reg, regs);




Re: [PATCH] Add +~800M crashkernel explaination

2016-12-13 Thread Xunlei Pang
On 12/14/2016 at 11:08 AM, Xunlei Pang wrote:
> On 12/10/2016 at 01:20 PM, Robert LeBlanc wrote:
>> On Fri, Dec 9, 2016 at 7:49 PM, Baoquan He <b...@redhat.com> wrote:
>>> On 12/09/16 at 05:22pm, Robert LeBlanc wrote:
>>>> When trying to configure crashkernel greater than about 800 MB, the
>>>> kernel fails to allocate memory on x86 and x86_64. This is due to an
>>>> undocumented limit that the crashkernel and other low memory items must
>>>> be allocated below 896 MB unless the ",high" option is given. This
>>>> updates the documentation to explain this and what I understand the
>>>> limitations to be on the option.
>>> This is true, but not very accurate. You found it's about 800M, it's
>>> becasue usually the current kernel need about 40M space to run, and some
>>> extra reservation before reserve_crashkernel invocation, another ~10M.
>>> However it's normal case, people may build modules into or have some
>>> special code to bloat kernel. This patch makes sense to address the
>>> low|high issue, it might be not good so determined to say ~800M.
>> My testing showed that I could go anywhere from about 830M to 880M,
>> depending on distro, kernel version, and stuff that you mentioned. I
>> just thought some rule of thumb of when to consider using high would
>> be good. People may not think that 800 MB is 'large' when you have 512
>> GB of RAM for instance. I thought about making 512 MB be the rule of
>> thumb, but you can do a lot with ~300 MB.
> Hi Robert,
>
> I think you are correct.
>
> For x86, the kernel uses memblock to locate a proper range starting from 16MB
> up to some "end"; without the ",high" option, "end" is CRASH_ADDR_LOW_MAX,
> otherwise it is CRASH_ADDR_HIGH_MAX.
>
> You can find the definitions for both 32-bit and 64-bit:
> #ifdef CONFIG_X86_32
> # define CRASH_ADDR_LOW_MAX	(512 << 20)
> # define CRASH_ADDR_HIGH_MAX	(512 << 20)
> #else
> # define CRASH_ADDR_LOW_MAX	(896UL << 20)
> # define CRASH_ADDR_HIGH_MAX	MAXMEM
> #endif
>
> Since some memory has already been allocated by the kernel, a reservation
> failure is highly likely when specifying a crashkernel value near 800MB
> (for x86_64), which is what you met. We can't state the exact threshold,
> but it would be better if the document had some explanation for this.

But there is another point:
If you specify the base using crashkernel=size[KMG][@offset[KMG]], for example
"crashkernel=1024M@0x1000", there is no such limitation, and you may get
a successful reservation. I have no idea why the design is so different.

Regards,
Xunlei

>
>> I'm happy to adjust the wording, what would you recommend? Also, I'm
>> not 100% sure that I got the cases covered correctly. I was surprised
>> that I could not get it to work with the "new" format with the
>> multiple ranges, and that specifying an offset wouldn't work either,
>> although the offset kind of makes sense. Do you know for sure that it
>> doesn't work with ranges?
>>
>> I tried,
>>
>> crashkernel=256M-1G:128M,high,1G-4G:256M,high,4G-:512M,high
>>
>> and
>>
>> crashkernel=256M-1G:128M,1G-4G:256M,4G-:512M,high
>>
>> and neither worked. It seems that a better separator would be ';'
>> instead of ',' for ranges, then you could specify options better. Kind
>> of hard to change now.
> For "crashkernel=range1:size1[,range2:size2,...][@offset]"
> I'm afraid it doesn't support "high" prefix in the current implementation, so 
> there is no guarantee.
> I guess we can drop a note to eliminate the confusion.
>
> Regards,
> Xunlei
>
>>>> Signed-off-by: Robert LeBlanc <rob...@leblancnet.us>
>>>> ---
>>>>  Documentation/kdump/kdump.txt | 22 +-
>>>>  1 file changed, 17 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
>>>> index b0eb27b..aa3efa8 100644
>>>> --- a/Documentation/kdump/kdump.txt
>>>> +++ b/Documentation/kdump/kdump.txt
>>>> @@ -256,7 +256,9 @@ While the "crashkernel=size[@offset]" syntax is 
>>>> sufficient for most
>>>>  configurations, sometimes it's handy to have the reserved memory dependent
>>>>  on the value of System RAM -- that's mostly for distributors that 
>>>> pre-setup
>>>>  the kernel command line to avoid a unbootable system after some memory has
>>>> -been remov

Re: [PATCH] Add +~800M crashkernel explaination

2016-12-13 Thread Xunlei Pang
On 12/10/2016 at 01:20 PM, Robert LeBlanc wrote:
> On Fri, Dec 9, 2016 at 7:49 PM, Baoquan He  wrote:
>> On 12/09/16 at 05:22pm, Robert LeBlanc wrote:
>>> When trying to configure crashkernel greater than about 800 MB, the
>>> kernel fails to allocate memory on x86 and x86_64. This is due to an
>>> undocumented limit that the crashkernel and other low memory items must
>>> be allocated below 896 MB unless the ",high" option is given. This
>>> updates the documentation to explain this and what I understand the
>>> limitations to be on the option.
>> This is true, but not very accurate. You found it's about 800M; that's
>> because the running kernel usually needs about 40M of space, plus some
>> extra reservations made before the reserve_crashkernel invocation, another
>> ~10M. However, that's the normal case; people may build modules in or have
>> special code that bloats the kernel. This patch makes sense to address the
>> low|high issue, but it might not be good to state ~800M so definitively.
> My testing showed that I could go anywhere from about 830M to 880M,
> depending on distro, kernel version, and stuff that you mentioned. I
> just thought some rule of thumb of when to consider using high would
> be good. People may not think that 800 MB is 'large' when you have 512
> GB of RAM for instance. I thought about making 512 MB be the rule of
> thumb, but you can do a lot with ~300 MB.

Hi Robert,

I think you are correct.

For x86, the kernel uses memblock to locate a proper range starting from 16MB
up to some "end"; without the ",high" option, "end" is CRASH_ADDR_LOW_MAX,
otherwise it is CRASH_ADDR_HIGH_MAX.

You can find the definitions for both 32-bit and 64-bit:
#ifdef CONFIG_X86_32
# define CRASH_ADDR_LOW_MAX	(512 << 20)
# define CRASH_ADDR_HIGH_MAX	(512 << 20)
#else
# define CRASH_ADDR_LOW_MAX	(896UL << 20)
# define CRASH_ADDR_HIGH_MAX	MAXMEM
#endif

Since some memory has already been allocated by the kernel, a reservation
failure is highly likely when specifying a crashkernel value near 800MB
(for x86_64), which is what you met. We can't state the exact threshold,
but it would be better if the document had some explanation for this.
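
For illustration, a back-of-the-envelope model using the rough numbers from
this thread (896MB ceiling, ~40MB running kernel, ~10MB other early
reservations); the exact values are assumptions and vary per system, but the
result is consistent with the observed 830M-880M range:

/* Rough model of the practical crashkernel ceiling without ",high". */
#include <stdio.h>

int main(void)
{
	unsigned long low_max_mb   = 896; /* CRASH_ADDR_LOW_MAX on x86_64 */
	unsigned long kernel_mb    = 40;  /* rough running-kernel footprint */
	unsigned long early_rsv_mb = 10;  /* other early reservations */

	printf("practical crashkernel ceiling without \",high\": ~%lu MB\n",
	       low_max_mb - kernel_mb - early_rsv_mb);
	return 0;
}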

>
> I'm happy to adjust the wording, what would you recommend? Also, I'm
> not 100% sure that I got the cases covered correctly. I was surprised
> that I could not get it to work with the "new" format with the
> multiple ranges, and that specifying an offset wouldn't work either,
> although the offset kind of makes sense. Do you know for sure that it
> doesn't work with ranges?
>
> I tried,
>
> crashkernel=256M-1G:128M,high,1G-4G:256M,high,4G-:512M,high
>
> and
>
> crashkernel=256M-1G:128M,1G-4G:256M,4G-:512M,high
>
> and neither worked. It seems that a better separator would be ';'
> instead of ',' for ranges, then you could specify options better. Kind
> of hard to change now.

For "crashkernel=range1:size1[,range2:size2,...][@offset]"
I'm afraid it doesn't support "high" prefix in the current implementation, so 
there is no guarantee.
I guess we can drop a note to eliminate the confusion.

Regards,
Xunlei

>>> Signed-off-by: Robert LeBlanc 
>>> ---
>>>  Documentation/kdump/kdump.txt | 22 +-
>>>  1 file changed, 17 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
>>> index b0eb27b..aa3efa8 100644
>>> --- a/Documentation/kdump/kdump.txt
>>> +++ b/Documentation/kdump/kdump.txt
>>> @@ -256,7 +256,9 @@ While the "crashkernel=size[@offset]" syntax is 
>>> sufficient for most
>>>  configurations, sometimes it's handy to have the reserved memory dependent
>>>  on the value of System RAM -- that's mostly for distributors that pre-setup
>>>  the kernel command line to avoid a unbootable system after some memory has
>>> -been removed from the machine.
>>> +been removed from the machine. If you need to allocate more than ~800M
>>> +for x86 or x86_64 then you must use the simple format as the format
>>> +',high' conflicts with the separators of ranges.
>>>
>>>  The syntax is:
>>>
>>> @@ -282,11 +284,21 @@ Boot into System Kernel
>>>  1) Update the boot loader (such as grub, yaboot, or lilo) configuration
>>> files as necessary.
>>>
>>> -2) Boot the system kernel with the boot parameter "crashkernel=Y@X",
>>> +2) Boot the system kernel with the boot parameter "crashkernel=Y[@X | 
>>> ,high]",
>>> where Y specifies how much memory to reserve for the dump-capture kernel
>>> -   and X specifies the beginning of this reserved memory. For example,
>>> -   "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
>>> -   starting at physical address 0x01000000 (16MB) for the dump-capture 
>>> kernel.
>>> +   and X specifies the beginning of this reserved memory or ',high' to 
>>> load in
>>> +   high memory. For example, "crashkernel=64M@16M" tells the system
>>> +   kernel to reserve 64 MB of memory starting at physical address
>>> +   

Re: [PATCH v2] kexec: add cond_resched into kimage_alloc_crash_control_pages

2016-12-08 Thread Xunlei Pang
On 12/09/2016 at 01:13 PM, zhong jiang wrote:
> On 2016/12/8 17:41, Xunlei Pang wrote:
>> On 12/08/2016 at 10:37 AM, zhongjiang wrote:
>>> From: zhong jiang <zhongji...@huawei.com>
>>>
>>> A soft lockup will occur when I run trinity on the kexec_load syscall.
>>> The corresponding stack information is as follows.
>>>
>>> [  237.235937] BUG: soft lockup - CPU#6 stuck for 22s! [trinity-c6:13859]
>>> [  237.242699] Kernel panic - not syncing: softlockup: hung tasks
>>> [  237.248573] CPU: 6 PID: 13859 Comm: trinity-c6 Tainted: G   O L 
>>> V---   3.10.0-327.28.3.35.zhongjiang.x86_64 #1
>>> [  237.259984] Hardware name: Huawei Technologies Co., Ltd. Tecal BH622 
>>> V2/BC01SRSA0, BIOS RMIBV386 06/30/2014
>>> [  237.269752]  8187626b 18cfde31 88184c803e18 
>>> 81638f16
>>> [  237.277471]  88184c803e98 8163278f 0008 
>>> 88184c803ea8
>>> [  237.285190]  88184c803e48 18cfde31 88184c803e67 
>>> 
>>> [  237.292909] Call Trace:
>>> [  237.295404][] dump_stack+0x19/0x1b
>>> [  237.301352]  [] panic+0xd8/0x214
>>> [  237.306196]  [] watchdog_timer_fn+0x1cc/0x1e0
>>> [  237.312157]  [] ? watchdog_enable+0xc0/0xc0
>>> [  237.317955]  [] __hrtimer_run_queues+0xd2/0x260
>>> [  237.324087]  [] hrtimer_interrupt+0xb0/0x1e0
>>> [  237.329963]  [] ? call_softirq+0x1c/0x30
>>> [  237.335500]  [] local_apic_timer_interrupt+0x37/0x60
>>> [  237.342228]  [] smp_apic_timer_interrupt+0x3f/0x60
>>> [  237.348771]  [] apic_timer_interrupt+0x6d/0x80
>>> [  237.354967][] ? 
>>> kimage_alloc_control_pages+0x80/0x270
>>> [  237.362875]  [] ? kmem_cache_alloc_trace+0x1ce/0x1f0
>>> [  237.369592]  [] ? do_kimage_alloc_init+0x1f/0x90
>>> [  237.375992]  [] kimage_alloc_init+0x12a/0x180
>>> [  237.382103]  [] SyS_kexec_load+0x20a/0x260
>>> [  237.387957]  [] system_call_fastpath+0x16/0x1b
>>>
>>> The first-time allocation of control pages may take too much time because
>>> crashk_res.end can be set to a high value. We need to add cond_resched
>>> to avoid the issue.
>>>
>>> The patch has been tested and the above issue does not appear.
>>>
>>> Signed-off-by: zhong jiang <zhongji...@huawei.com>
>>> ---
>>>  kernel/kexec_core.c | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>>> index 5616755..bfc9621 100644
>>> --- a/kernel/kexec_core.c
>>> +++ b/kernel/kexec_core.c
>>> @@ -441,6 +441,8 @@ static struct page 
>>> *kimage_alloc_crash_control_pages(struct kimage *image,
>>> while (hole_end <= crashk_res.end) {
>>> unsigned long i;
>>>  
>>> +   cond_resched();
>>> +
>> I can't see why it would take a long time to loop inside; the job it does is
>> simply to find a control area that doesn't overlap image->segment[]. You can
>> see the loop "for (i = 0; i < image->nr_segments; i++)": @hole_end is
>> advanced past the next overlapping segment whenever an overlap is detected,
>> and there is a limited number (<=16) of segments, so it won't take long to
>> locate the right area.
>>
>> Am I missing something?
>>
>> Regards,
>> Xunlei
>   If crashkernel=auto is set on the cmdline, crashk_res.end can exceed 4G, so
>   the first allocation of control pages will loop millions of times. If we set
>   crashk_res.end to a higher value manually, you can imagine

How does "loop million times" happen? See my inlined comments prefixed with 
"pxl".

kimage_alloc_crash_control_pages():
    while (hole_end <= crashk_res.end) {
        unsigned long i;

        if (hole_end > KEXEC_CRASH_CONTROL_MEMORY_LIMIT)
            break;
        /* See if I overlap any of the segments */
        for (i = 0; i < image->nr_segments; i++) {
            /* pxl: at most 16 loops; the existing segments do not overlap
             * each other, though they may not be sorted. */
            unsigned long mstart, mend;

            mstart = image->segment[i].mem;
            mend   = mstart + image->segment[i].memsz - 1;
            if ((hole_end >= mstart) && (hole_start <= mend)) {
                /* Advance the hole to the end of the segment */
                hole_start = (mend + (size - 1)) & ~(size - 1);
                hole_end   = hole_start + size - 1;
                /* pxl: if an overlap is found, break the for loop;
                 * @hole_end now starts right after the overlapped segment,
                 * and the while loop runs again. */
                break;
            }
        }
        /* If I don't overlap any segments I have found my hole! */
        if (i == image->nr_segments) {
            pages = pfn_to_page(hole_start >> PAGE_SHIFT);
            image->control_page = hole_end;
            /* pxl: no overlap with any segment, so take the result and
             * break the while loop. END. */
            break;
        }
    }

So, the worst "while" loops in theory would be (image->nr_segments + 1), no?

Regards,
Xunlei



Re: [PATCH v2] kexec: add cond_resched into kimage_alloc_crash_control_pages

2016-12-08 Thread Xunlei Pang
On 12/08/2016 at 10:37 AM, zhongjiang wrote:
> From: zhong jiang 
>
> A soft lockup will occur when I run trinity on the kexec_load syscall.
> The corresponding stack information is as follows.
>
> [  237.235937] BUG: soft lockup - CPU#6 stuck for 22s! [trinity-c6:13859]
> [  237.242699] Kernel panic - not syncing: softlockup: hung tasks
> [  237.248573] CPU: 6 PID: 13859 Comm: trinity-c6 Tainted: G   O L 
> V---   3.10.0-327.28.3.35.zhongjiang.x86_64 #1
> [  237.259984] Hardware name: Huawei Technologies Co., Ltd. Tecal BH622 
> V2/BC01SRSA0, BIOS RMIBV386 06/30/2014
> [  237.269752]  8187626b 18cfde31 88184c803e18 
> 81638f16
> [  237.277471]  88184c803e98 8163278f 0008 
> 88184c803ea8
> [  237.285190]  88184c803e48 18cfde31 88184c803e67 
> 
> [  237.292909] Call Trace:
> [  237.295404][] dump_stack+0x19/0x1b
> [  237.301352]  [] panic+0xd8/0x214
> [  237.306196]  [] watchdog_timer_fn+0x1cc/0x1e0
> [  237.312157]  [] ? watchdog_enable+0xc0/0xc0
> [  237.317955]  [] __hrtimer_run_queues+0xd2/0x260
> [  237.324087]  [] hrtimer_interrupt+0xb0/0x1e0
> [  237.329963]  [] ? call_softirq+0x1c/0x30
> [  237.335500]  [] local_apic_timer_interrupt+0x37/0x60
> [  237.342228]  [] smp_apic_timer_interrupt+0x3f/0x60
> [  237.348771]  [] apic_timer_interrupt+0x6d/0x80
> [  237.354967][] ? 
> kimage_alloc_control_pages+0x80/0x270
> [  237.362875]  [] ? kmem_cache_alloc_trace+0x1ce/0x1f0
> [  237.369592]  [] ? do_kimage_alloc_init+0x1f/0x90
> [  237.375992]  [] kimage_alloc_init+0x12a/0x180
> [  237.382103]  [] SyS_kexec_load+0x20a/0x260
> [  237.387957]  [] system_call_fastpath+0x16/0x1b
>
> The first-time allocation of control pages may take too much time because
> crashk_res.end can be set to a high value. We need to add cond_resched
> to avoid the issue.
>
> The patch has been tested and the above issue does not appear.
>
> Signed-off-by: zhong jiang 
> ---
>  kernel/kexec_core.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 5616755..bfc9621 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -441,6 +441,8 @@ static struct page 
> *kimage_alloc_crash_control_pages(struct kimage *image,
>   while (hole_end <= crashk_res.end) {
>   unsigned long i;
>  
> + cond_resched();
> +

I can't see why it would take a long time to loop inside; the job it does is
simply to find a control area that doesn't overlap image->segment[]. You can
see the loop "for (i = 0; i < image->nr_segments; i++)": @hole_end is advanced
past the next overlapping segment whenever an overlap is detected, and there
is a limited number (<=16) of segments, so it won't take long to locate the
right area.

Am I missing something?

Regards,
Xunlei

>   if (hole_end > KEXEC_CRASH_CONTROL_MEMORY_LIMIT)
>   break;
>   /* See if I overlap any of the segments */




Re: [PATCH] iommu/vt-d: Flush old iotlb for kdump when the device gets context mapped

2016-12-01 Thread Xunlei Pang
On 12/01/2016 at 06:33 PM, Joerg Roedel wrote:
> On Thu, Dec 01, 2016 at 10:15:45AM +0800, Xunlei Pang wrote:
>> index 3965e73..624eac9 100644
>> --- a/drivers/iommu/intel-iommu.c
>> +++ b/drivers/iommu/intel-iommu.c
>> @@ -2024,6 +2024,25 @@ static int domain_context_mapping_one(struct 
>> dmar_domain *domain,
>> if (context_present(context))
>> goto out_unlock;
>>  
>> +   /*
>> +* For kdump cases, old valid entries may be cached due to the
>> +* in-flight DMA and copied pgtable, but there is no unmapping
>> +* behaviour for them, thus we need an explicit cache flush for
>> +* the newly-mapped device. For kdump, at this point, the device
>> +* is supposed to finish reset at its driver probe stage, so no
>> +* in-flight DMA will exist, and we don't need to worry anymore
>> +* hereafter.
>> +*/
>> +   if (context_copied(context)) {
>> +   u16 did_old = context_domain_id(context);
>> +
>> +   if (did_old >= 0 && did_old < cap_ndoms(iommu->cap))
>> +   iommu->flush.flush_context(iommu, did_old,
>> +  (((u16)bus) << 8) | devfn,
>> +  DMA_CCMD_MASK_NOBIT,
>> +  DMA_CCMD_DEVICE_INVL);
>> +   }
>> +
>> pgd = domain->pgd;
> Yes, this looks better. Have you tested it the same way as the old
> patch?

Yes, I have tested it and it works; I will send v3 later.

Regards,
Xunlei



Re: [PATCH] iommu/vt-d: Flush old iotlb for kdump when the device gets context mapped

2016-11-30 Thread Xunlei Pang
On 11/30/2016 at 10:26 PM, Joerg Roedel wrote:
> On Wed, Nov 30, 2016 at 06:23:34PM +0800, Baoquan He wrote:
>> OK, talked with Xunlei. The old cache could be entry with present bit
>> set.
> -EPARSE
>
> Anyway, what I was trying to say is, that the IOMMU TLB is tagged with
> domain-ids, and that there is also a context-cache which maps device-ids
> to domain-ids.
>
> If we update the context entry then we need to flush only the context
> entry, as it will point to a new domain-id then and future IOTLB lookups
> in the IOMMU will be using the new domain-id and do not match the old
> entries.

Hi Joerg,

Thanks for the explanation. We still need to flush the context cache using the
old domain-id, right? How about the following update?

index 3965e73..624eac9 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2024,6 +2024,25 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain,
if (context_present(context))
goto out_unlock;
 
+   /*
+* For kdump cases, old valid entries may be cached due to the
+* in-flight DMA and copied pgtable, but there is no unmapping
+* behaviour for them, thus we need an explicit cache flush for
+* the newly-mapped device. For kdump, at this point, the device
+* is supposed to finish reset at its driver probe stage, so no
+* in-flight DMA will exist, and we don't need to worry anymore
+* hereafter.
+*/
+   if (context_copied(context)) {
+   u16 did_old = context_domain_id(context);
+
+   if (did_old >= 0 && did_old < cap_ndoms(iommu->cap))
+   iommu->flush.flush_context(iommu, did_old,
+  (((u16)bus) << 8) | devfn,
+  DMA_CCMD_MASK_NOBIT,
+  DMA_CCMD_DEVICE_INVL);
+   }
+
pgd = domain->pgd;




Re: [PATCH] iommu/vt-d: Flush old iotlb for kdump when the device gets context mapped

2016-11-30 Thread Xunlei Pang
On 11/29/2016 at 10:35 PM, Joerg Roedel wrote:
> On Thu, Nov 17, 2016 at 10:47:28AM +0800, Xunlei Pang wrote:
>> As per the comment, the code here only needs to flush context caches
>> for the special domain 0 which is used to tag the
>> non-present/erroneous caches, seems we should flush the old domain id
>> of present entries for kdump according to the analysis, other than the
>> new-allocated domain id. Let me ponder more on this.
> Flushing the context entry only is fine. The old domain-id will not be
> re-used anyway, so there is no point in reading it out of the context
> table and flush it.

Do you mean to flush the context entry using the newly-allocated domain-id?

Yes, old domain-ids will not be re-used as they were reserved during the copy,
but they may still be cached due to in-flight DMA access.

Here is how things seem to work from my understanding, and why I want to
flush using the old domain-id:
1) In kdump mode, the old tables are copied, and all the iommu caches are flushed.
2) Some in-flight DMA arrives before the device's new context is mapped, so
translation caches (context, iotlb, etc) tagged with the old domain-id are
created in the iommu hardware.
3) At the driver probe stage, the device is reset, and no in-flight DMA exists
afterwards. Here I assumed that the device reset won't flush the old caches
related to this device in the iommu hardware; I haven't found any relevant
specification, please correct me if I am wrong.
4) Then the new context is set up and new DMA is initiated, which hits the old
cache created in 2); since there is currently no such flush action, a DMAR
fault happens.

I already posted v2 to flush context/iotlb using the old domain-id:
https://lkml.org/lkml/2016/11/18/514

Regards,
Xunlei

>
> Also, please add a Fixes-tag when you re-post this patch.
>
>
>   Joerg
>




Re: [PATCH v2] iommu/vt-d: Flush old iommu caches for kdump when the device gets context mapped

2016-11-27 Thread Xunlei Pang
Ping Joerg/David, do you have any comment on it?
On 2016/11/19 at 00:23, Xunlei Pang wrote:
> We met the DMAR fault both on hpsa P420i and P421 SmartArray controllers
> under kdump, it can be steadily reproduced on several different machines,
> the dmesg log is like(running on 4.9.0-rc5+):
> HP HPSA Driver (v 3.4.16-0)
> hpsa :02:00.0: using doorbell to reset controller
> hpsa :02:00.0: board ready after hard reset.
> hpsa :02:00.0: Waiting for controller to respond to no-op
> DMAR: Setting identity map for device :02:00.0 [0xe8000 - 0xe8fff]
> DMAR: Setting identity map for device :02:00.0 [0xf4000 - 0xf4fff]
> DMAR: Setting identity map for device :02:00.0 [0xbdf6e000 - 0xbdf6efff]
> DMAR: Setting identity map for device :02:00.0 [0xbdf6f000 - 0xbdf7efff]
> DMAR: Setting identity map for device :02:00.0 [0xbdf7f000 - 0xbdf82fff]
> DMAR: Setting identity map for device :02:00.0 [0xbdf83000 - 0xbdf84fff]
> DMAR: DRHD: handling fault status reg 2
> DMAR: [DMA Read] Request device [02:00.0] fault addr f000 [fault reason 
> 06] PTE Read access is not set
> hpsa :02:00.0: controller message 03:00 timed out
> hpsa :02:00.0: no-op failed; re-trying
>
> After some debugging, we found that the fault addr is from DMA initiated at
> the driver probe stage after reset(not in-flight DMA), and the corresponding
> pte entry value is correct, the fault is likely due to the old iommu caches
> of the in-flight DMA before it.
>
> Thus we need to flush the old cache after context mapping is setup for the
> device, where the device is supposed to finish reset at its driver probe
> stage and no in-flight DMA exists hereafter.
>
> I'm not sure if the hardware is responsible for invalidating all the related
> caches allocated in iommu hardware during reset, but seems not the case for 
> hpsa,
> actually many device drivers even have problems properly resetting the 
> hardware.
> Anyway flushing (again) by software in kdump mode when the device gets context
> mapped which is a quite infrequent operation does little harm.
>
> With this patch, the problematic machine can survive the kdump tests.
>
> CC: Myron Stowe <myron.st...@redhat.com>
> CC: Joseph Szczypek <jszcz...@redhat.com>
> CC: Don Brace <don.br...@microsemi.com>
> CC: Baoquan He <b...@redhat.com>
> CC: Dave Young <dyo...@redhat.com>
> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
> ---
> v1 -> v2:
> Flush caches using old domain id.
>
>  drivers/iommu/intel-iommu.c | 22 ++
>  1 file changed, 22 insertions(+)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 3965e73..653304d 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -2024,6 +2024,28 @@ static int domain_context_mapping_one(struct 
> dmar_domain *domain,
>   if (context_present(context))
>   goto out_unlock;
>  
> + /*
> +  * For kdump cases, old valid entries may be cached due to the
> +  * in-flight DMA and copied pgtable, but there is no unmapping
> +  * behaviour for them, thus we need an explicit cache flush for
> +  * the newly-mapped device. For kdump, at this point, the device
> +  * is supposed to finish reset at its driver probe stage, so no
> +  * in-flight DMA will exist, and we don't need to worry anymore
> +  * hereafter.
> +  */
> + if (context_copied(context)) {
> + u16 did_old = context_domain_id(context);
> +
> + if (did_old >= 0 && did_old < cap_ndoms(iommu->cap)) {
> + iommu->flush.flush_context(iommu, did_old,
> +(((u16)bus) << 8) | devfn,
> +DMA_CCMD_MASK_NOBIT,
> +DMA_CCMD_DEVICE_INVL);
> + iommu->flush.flush_iotlb(iommu, did_old, 0, 0,
> +DMA_TLB_DSI_FLUSH);
> + }
> + }
> +
>   pgd = domain->pgd;
>  
>   context_clear_entry(context);




[PATCH v2] iommu/vt-d: Flush old iommu caches for kdump when the device gets context mapped

2016-11-18 Thread Xunlei Pang
We met the DMAR fault both on hpsa P420i and P421 SmartArray controllers
under kdump, it can be steadily reproduced on several different machines,
the dmesg log is like (running on 4.9.0-rc5+):
HP HPSA Driver (v 3.4.16-0)
hpsa 0000:02:00.0: using doorbell to reset controller
hpsa 0000:02:00.0: board ready after hard reset.
hpsa 0000:02:00.0: Waiting for controller to respond to no-op
DMAR: Setting identity map for device 0000:02:00.0 [0xe8000 - 0xe8fff]
DMAR: Setting identity map for device 0000:02:00.0 [0xf4000 - 0xf4fff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf6e000 - 0xbdf6efff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf6f000 - 0xbdf7efff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf7f000 - 0xbdf82fff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf83000 - 0xbdf84fff]
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [02:00.0] fault addr f000 [fault reason 06] PTE Read access is not set
hpsa 0000:02:00.0: controller message 03:00 timed out
hpsa 0000:02:00.0: no-op failed; re-trying

After some debugging, we found that the fault addr is from DMA initiated at
the driver probe stage after reset(not in-flight DMA), and the corresponding
pte entry value is correct, the fault is likely due to the old iommu caches
of the in-flight DMA before it.

Thus we need to flush the old cache after context mapping is set up for the
device; at that point the device is supposed to have finished reset at its
driver probe stage, and no in-flight DMA exists hereafter.

I'm not sure whether the hardware is responsible for invalidating all the
related caches allocated in the iommu hardware during reset, but that seems
not to be the case for hpsa; actually, many device drivers even have problems
properly resetting the hardware. Anyway, flushing (again) by software in kdump
mode when the device gets context mapped, which is a quite infrequent
operation, does little harm.

With this patch, the problematic machine can survive the kdump tests.

CC: Myron Stowe <myron.st...@redhat.com>
CC: Joseph Szczypek <jszcz...@redhat.com>
CC: Don Brace <don.br...@microsemi.com>
CC: Baoquan He <b...@redhat.com>
CC: Dave Young <dyo...@redhat.com>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
v1 -> v2:
Flush caches using old domain id.

 drivers/iommu/intel-iommu.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3965e73..653304d 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2024,6 +2024,28 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain,
if (context_present(context))
goto out_unlock;
 
+   /*
+* For kdump cases, old valid entries may be cached due to the
+* in-flight DMA and copied pgtable, but there is no unmapping
+* behaviour for them, thus we need an explicit cache flush for
+* the newly-mapped device. For kdump, at this point, the device
+* is supposed to finish reset at its driver probe stage, so no
+* in-flight DMA will exist, and we don't need to worry anymore
+* hereafter.
+*/
+   if (context_copied(context)) {
+   u16 did_old = context_domain_id(context);
+
+   if (did_old >= 0 && did_old < cap_ndoms(iommu->cap)) {
+   iommu->flush.flush_context(iommu, did_old,
+  (((u16)bus) << 8) | devfn,
+  DMA_CCMD_MASK_NOBIT,
+  DMA_CCMD_DEVICE_INVL);
+   iommu->flush.flush_iotlb(iommu, did_old, 0, 0,
+  DMA_TLB_DSI_FLUSH);
+   }
+   }
+
pgd = domain->pgd;
 
context_clear_entry(context);
-- 
1.8.3.1




Re: [PATCH] iommu/vt-d: Flush old iotlb for kdump when the device gets context mapped

2016-11-16 Thread Xunlei Pang
On 2016/11/16 at 22:58, Myron Stowe wrote:
> On Wed, Nov 16, 2016 at 2:13 AM, Xunlei Pang <xp...@redhat.com> wrote:
>> Ccing David
>> On 2016/11/16 at 17:02, Xunlei Pang wrote:
>>> We met the DMAR fault both on hpsa P420i and P421 SmartArray controllers
>>> under kdump, it can be steadily reproduced on several different machines,
>>> the dmesg log is like:
>>> HP HPSA Driver (v 3.4.16-0)
>>> hpsa :02:00.0: using doorbell to reset controller
>>> hpsa :02:00.0: board ready after hard reset.
>>> hpsa :02:00.0: Waiting for controller to respond to no-op
>>> DMAR: Setting identity map for device :02:00.0 [0xe8000 - 0xe8fff]
>>> DMAR: Setting identity map for device :02:00.0 [0xf4000 - 0xf4fff]
>>> DMAR: Setting identity map for device :02:00.0 [0xbdf6e000 - 0xbdf6efff]
>>> DMAR: Setting identity map for device :02:00.0 [0xbdf6f000 - 0xbdf7efff]
>>> DMAR: Setting identity map for device :02:00.0 [0xbdf7f000 - 0xbdf82fff]
>>> DMAR: Setting identity map for device :02:00.0 [0xbdf83000 - 0xbdf84fff]
>>> DMAR: DRHD: handling fault status reg 2
>>> DMAR: [DMA Read] Request device [02:00.0] fault addr f000 [fault reason 
>>> 06] PTE Read access is not set
>>> hpsa :02:00.0: controller message 03:00 timed out
>>> hpsa :02:00.0: no-op failed; re-trying
>>>
>>> After some debugging, we found that the corresponding pte entry value
>>> is correct, and the value of the iommu caching mode is 0, the fault is
>>> probably due to the old iotlb cache of the in-flight DMA.
>>>
>>> Thus need to flush the old iotlb after context mapping is setup for the
>>> device, where the device is supposed to finish reset at its driver probe
>>> stage and no in-flight DMA exists hereafter.
>>>
>>> With this patch, all our problematic machines can survive the kdump tests.
>>>
>>> CC: Myron Stowe <myron.st...@redhat.com>
>>> CC: Don Brace <don.br...@microsemi.com>
>>> CC: Baoquan He <b...@redhat.com>
>>> CC: Dave Young <dyo...@redhat.com>
>>> Tested-by: Joseph Szczypek <jszcz...@redhat.com>
>>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>>> ---
>>>  drivers/iommu/intel-iommu.c | 11 +--
>>>  1 file changed, 9 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
>>> index 3965e73..eb79288 100644
>>> --- a/drivers/iommu/intel-iommu.c
>>> +++ b/drivers/iommu/intel-iommu.c
>>> @@ -2067,9 +2067,16 @@ static int domain_context_mapping_one(struct 
>>> dmar_domain *domain,
>>>* It's a non-present to present mapping. If hardware doesn't cache
>>>* non-present entry we only need to flush the write-buffer. If the
>>>* _does_ cache non-present entries, then it does so in the special
> If this does get accepted then we should fix the above grammar also -
>   "If the _does_ cache ..." -> "If the hardware _does_ cache ..."

Yes, but this reminds me of something.
As per the comment, the code here only needs to flush context caches for the
special domain 0, which is used to tag the non-present/erroneous caches.
According to the analysis, it seems that for kdump we should flush the old
domain-id of the present entries rather than the newly-allocated domain-id.
Let me ponder more on this.

Regards,
Xunlei

>
>>> -  * domain #0, which we have to flush:
>>> +  * domain #0, which we have to flush.
>>> +  *
>>> +  * For kdump cases, present entries may be cached due to the in-flight
>>> +  * DMA and copied old pgtable, but there is no unmapping behaviour for
>>> +  * them, so we need an explicit iotlb flush for the newly-mapped 
>>> device.
>>> +  * For kdump, at this point, the device is supposed to finish reset at
>>> +  * the driver probe stage, no in-flight DMA will exist, thus we do not
>>> +  * need to worry about that anymore hereafter.
>>>*/
>>> - if (cap_caching_mode(iommu->cap)) {
>>> + if (is_kdump_kernel() || cap_caching_mode(iommu->cap)) {
>>>   iommu->flush.flush_context(iommu, 0,
>>>  (((u16)bus) << 8) | devfn,
>>>  DMA_CCMD_MASK_NOBIT,




Re: [PATCH] iommu/vt-d: Flush old iotlb for kdump when the device gets context mapped

2016-11-16 Thread Xunlei Pang
Ccing David
On 2016/11/16 at 17:02, Xunlei Pang wrote:
> We met the DMAR fault both on hpsa P420i and P421 SmartArray controllers
> under kdump, it can be steadily reproduced on several different machines,
> the dmesg log is like:
> HP HPSA Driver (v 3.4.16-0)
> hpsa :02:00.0: using doorbell to reset controller
> hpsa :02:00.0: board ready after hard reset.
> hpsa :02:00.0: Waiting for controller to respond to no-op
> DMAR: Setting identity map for device :02:00.0 [0xe8000 - 0xe8fff]
> DMAR: Setting identity map for device :02:00.0 [0xf4000 - 0xf4fff]
> DMAR: Setting identity map for device :02:00.0 [0xbdf6e000 - 0xbdf6efff]
> DMAR: Setting identity map for device :02:00.0 [0xbdf6f000 - 0xbdf7efff]
> DMAR: Setting identity map for device :02:00.0 [0xbdf7f000 - 0xbdf82fff]
> DMAR: Setting identity map for device :02:00.0 [0xbdf83000 - 0xbdf84fff]
> DMAR: DRHD: handling fault status reg 2
> DMAR: [DMA Read] Request device [02:00.0] fault addr f000 [fault reason 
> 06] PTE Read access is not set
> hpsa :02:00.0: controller message 03:00 timed out
> hpsa :02:00.0: no-op failed; re-trying
>
> After some debugging, we found that the corresponding pte entry value
> is correct, and the value of the iommu caching mode is 0, the fault is
> probably due to the old iotlb cache of the in-flight DMA.
>
> Thus need to flush the old iotlb after context mapping is setup for the
> device, where the device is supposed to finish reset at its driver probe
> stage and no in-flight DMA exists hereafter.
>
> With this patch, all our problematic machines can survive the kdump tests.
>
> CC: Myron Stowe <myron.st...@redhat.com>
> CC: Don Brace <don.br...@microsemi.com>
> CC: Baoquan He <b...@redhat.com>
> CC: Dave Young <dyo...@redhat.com>
> Tested-by: Joseph Szczypek <jszcz...@redhat.com>
> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
> ---
>  drivers/iommu/intel-iommu.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 3965e73..eb79288 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -2067,9 +2067,16 @@ static int domain_context_mapping_one(struct 
> dmar_domain *domain,
>* It's a non-present to present mapping. If hardware doesn't cache
>* non-present entry we only need to flush the write-buffer. If the
>* _does_ cache non-present entries, then it does so in the special
> -  * domain #0, which we have to flush:
> +  * domain #0, which we have to flush.
> +  *
> +  * For kdump cases, present entries may be cached due to the in-flight
> +  * DMA and copied old pgtable, but there is no unmapping behaviour for
> +  * them, so we need an explicit iotlb flush for the newly-mapped device.
> +  * For kdump, at this point, the device is supposed to finish reset at
> +  * the driver probe stage, no in-flight DMA will exist, thus we do not
> +  * need to worry about that anymore hereafter.
>*/
> - if (cap_caching_mode(iommu->cap)) {
> + if (is_kdump_kernel() || cap_caching_mode(iommu->cap)) {
>   iommu->flush.flush_context(iommu, 0,
>  (((u16)bus) << 8) | devfn,
>  DMA_CCMD_MASK_NOBIT,




[PATCH] iommu/vt-d: Flush old iotlb for kdump when the device gets context mapped

2016-11-16 Thread Xunlei Pang
We met the DMAR fault both on hpsa P420i and P421 SmartArray controllers
under kdump, it can be steadily reproduced on several different machines,
the dmesg log is like:
HP HPSA Driver (v 3.4.16-0)
hpsa 0000:02:00.0: using doorbell to reset controller
hpsa 0000:02:00.0: board ready after hard reset.
hpsa 0000:02:00.0: Waiting for controller to respond to no-op
DMAR: Setting identity map for device 0000:02:00.0 [0xe8000 - 0xe8fff]
DMAR: Setting identity map for device 0000:02:00.0 [0xf4000 - 0xf4fff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf6e000 - 0xbdf6efff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf6f000 - 0xbdf7efff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf7f000 - 0xbdf82fff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf83000 - 0xbdf84fff]
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [02:00.0] fault addr f000 [fault reason 06] PTE Read access is not set
hpsa 0000:02:00.0: controller message 03:00 timed out
hpsa 0000:02:00.0: no-op failed; re-trying

After some debugging, we found that the corresponding pte entry value
is correct, and the value of the iommu caching mode is 0, the fault is
probably due to the old iotlb cache of the in-flight DMA.

Thus we need to flush the old iotlb after context mapping is set up for the
device; at that point the device is supposed to have finished reset at its
driver probe stage, and no in-flight DMA exists hereafter.

With this patch, all our problematic machines can survive the kdump tests.

CC: Myron Stowe <myron.st...@redhat.com>
CC: Don Brace <don.br...@microsemi.com>
CC: Baoquan He <b...@redhat.com>
CC: Dave Young <dyo...@redhat.com>
Tested-by: Joseph Szczypek <jszcz...@redhat.com>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 drivers/iommu/intel-iommu.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3965e73..eb79288 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2067,9 +2067,16 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain,
 * It's a non-present to present mapping. If hardware doesn't cache
 * non-present entry we only need to flush the write-buffer. If the
 * _does_ cache non-present entries, then it does so in the special
-* domain #0, which we have to flush:
+* domain #0, which we have to flush.
+*
+* For kdump cases, present entries may be cached due to the in-flight
+* DMA and copied old pgtable, but there is no unmapping behaviour for
+* them, so we need an explicit iotlb flush for the newly-mapped device.
+* For kdump, at this point, the device is supposed to finish reset at
+* the driver probe stage, no in-flight DMA will exist, thus we do not
+* need to worry about that anymore hereafter.
 */
-   if (cap_caching_mode(iommu->cap)) {
+   if (is_kdump_kernel() || cap_caching_mode(iommu->cap)) {
iommu->flush.flush_context(iommu, 0,
   (((u16)bus) << 8) | devfn,
   DMA_CCMD_MASK_NOBIT,
-- 
1.8.3.1




Re: [PATCH v2] kexec: Increase the upper limit for RAM segments

2016-11-14 Thread Xunlei Pang
On 2016/11/12 at 06:21, Sameer Goel wrote:
> On a newer UEFI-based Qualcomm target the number of system ram regions
> retrieved from /proc/iomem is ~40. So increase the current hardcoded
> values from 16 to 64.

I am a little confused: the memory regions from /proc/iomem are governed by
MAX_MEMORY_RANGES, which is used for the elfcorehdr, while KEXEC_SEGMENT_MAX
stands for the kexec segments passed to the kexec syscall, such as the kernel
image, initrd image, purgatory, etc.

Do you mean KEXEC_SEGMENT_MAX or MAX_MEMORY_RANGES?

Regards,
Xunlei

>
> Signed-off-by: Sameer Goel 
> ---
>  kexec/arch/arm64/kexec-arm64.h | 2 +-
>  kexec/kexec-syscall.h  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kexec/arch/arm64/kexec-arm64.h b/kexec/arch/arm64/kexec-arm64.h
> index bac62f8..bd4c20e 100644
> --- a/kexec/arch/arm64/kexec-arm64.h
> +++ b/kexec/arch/arm64/kexec-arm64.h
> @@ -11,7 +11,7 @@
>  #include "image-header.h"
>  #include "kexec.h"
>  
> -#define KEXEC_SEGMENT_MAX 16
> +#define KEXEC_SEGMENT_MAX 64
>  
>  #define BOOT_BLOCK_VERSION 17
>  #define BOOT_BLOCK_LAST_COMP_VERSION 16
> diff --git a/kexec/kexec-syscall.h b/kexec/kexec-syscall.h
> index c0d0bea..f84c937 100644
> --- a/kexec/kexec-syscall.h
> +++ b/kexec/kexec-syscall.h
> @@ -115,7 +115,7 @@ static inline long kexec_file_load(int kernel_fd, int 
> initrd_fd,
>  #define KEXEC_ARCH_MIPS	( 8 << 16)
>  #define KEXEC_ARCH_CRIS	(76 << 16)
>  
> -#define KEXEC_MAX_SEGMENTS 16
> +#define KEXEC_MAX_SEGMENTS 64
>  
>  #ifdef __i386__
>  #define KEXEC_ARCH_NATIVE	KEXEC_ARCH_386




Re: [V4 PATCH 1/2] x86/panic: Replace smp_send_stop() with kdump friendly version in panic path

2016-09-20 Thread Xunlei Pang
On 2016/08/15/ at 19:22, Hidehiro Kawai wrote:
> Hi Dave,
>
> Thank you for the review.
>
>> From: Dave Young [mailto:dyo...@redhat.com]
>> Sent: Friday, August 12, 2016 12:17 PM
>>
>> Thanks for the update.
>> On 08/10/16 at 05:09pm, Hidehiro Kawai wrote:
>>> Daniel Walker reported problems which happens when
>>> crash_kexec_post_notifiers kernel option is enabled
>>> (https://lkml.org/lkml/2015/6/24/44).
>>>
>>> In that case, smp_send_stop() is called before entering kdump routines
>>> which assume other CPUs are still online.  As the result, for x86,
>>> kdump routines fail to save other CPUs' registers  and disable
>>> virtualization extensions.
>> Seems you simplified the changelog, but I think a little more details
>> will be helpful to understand the patch. You know sometimes lkml.org
>> does not work well.
> So, I'll try another archives when I post patch set next time.

Hi Hidehiro Kawai,

What's the status of this patch set, are you going to send an updated version?

Regards,
Xunlei

>>> To fix this problem, call a new kdump friendly function,
>>> crash_smp_send_stop(), instead of the smp_send_stop() when
>>> crash_kexec_post_notifiers is enabled.  crash_smp_send_stop() is a
>>> weak function, and it just call smp_send_stop().  Architecture
>>> codes should override it so that kdump can work appropriately.
>>> This patch only provides x86-specific version.
>>>
>>> For Xen's PV kernel, just keep the current behavior.
>> Could you explain a bit about above Xen PV kernel behavior?
>>
>> BTW, this version looks better,  I think I'm fine with this version
>> besides of the questions about changelog.
> As for Dom0 kernel, it doesn't use crash_kexec routines, and
> it relies on panic notifier chain.  At the end of the chain,
> xen_panic_event is called, and it issues a hypercall which
> requests Hypervisor to execute kdump.  This means whether
> crash_kexec_post_notifiers is set or not, panic notifiers
> are called after smp_send_stop.  Even if we save registers
> in Dom0 kernel, they seem to be ignored (Hypervisor is responsible
> for that).  This is why I kept the current behavior for Xen.
>
> For PV DomU kernel, kdump is not supported.  For PV HVM
> DomU, I'm not sure what will happen on panic because I
> couldn't boot PV HVM DomU and test it.  But I think it will
> work similarly to baremetal kernels with extra cleanups
> for Hypervisor.
>
> Best regards,
>
> Hidehiro Kawai
>
>>> Changes in V4:
>>> - Keep to use smp_send_stop if crash_kexec_post_notifiers is not set
>>> - Rename panic_smp_send_stop to crash_smp_send_stop
>>> - Don't change the behavior for Xen's PV kernel
>>>
>>> Changes in V3:
>>> - Revise comments, description, and symbol names
>>>
>>> Changes in V2:
>>> - Replace smp_send_stop() call with crash_kexec version which
>>>   saves cpu states and cleans up VMX/SVM
>>> - Drop a fix for Problem 1 at this moment
>>>
>>> Reported-by: Daniel Walker <dwal...@fifo99.com>
>>> Fixes: f06e5153f4ae (kernel/panic.c: add "crash_kexec_post_notifiers" 
>>> option)
>>> Signed-off-by: Hidehiro Kawai <hidehiro.kawai...@hitachi.com>
>>> Cc: Dave Young <dyo...@redhat.com>
>>> Cc: Baoquan He <b...@redhat.com>
>>> Cc: Vivek Goyal <vgo...@redhat.com>
>>> Cc: Eric Biederman <ebied...@xmission.com>
>>> Cc: Masami Hiramatsu <mhira...@kernel.org>
>>> Cc: Daniel Walker <dwal...@fifo99.com>
>>> Cc: Xunlei Pang <xp...@redhat.com>
>>> Cc: Thomas Gleixner <t...@linutronix.de>
>>> Cc: Ingo Molnar <mi...@redhat.com>
>>> Cc: "H. Peter Anvin" <h...@zytor.com>
>>> Cc: Borislav Petkov <b...@suse.de>
>>> Cc: David Vrabel <david.vra...@citrix.com>
>>> Cc: Toshi Kani <toshi.k...@hpe.com>
>>> Cc: Andrew Morton <a...@linux-foundation.org>
>>> ---
>>>  arch/x86/include/asm/kexec.h |1 +
>>>  arch/x86/include/asm/smp.h   |1 +
>>>  arch/x86/kernel/crash.c  |   22 +---
>>>  arch/x86/kernel/smp.c|5 
>>>  kernel/panic.c   |   47 
>>> --
>>>  5 files changed, 66 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
>>> index d2434c1..282630e 100644
>>> --- a/arch/x86/include/asm/kexec.h
>>> +++

Re: [PATCH v2 1/2] kexec: Introduce "/sys/kernel/kexec_crash_low_size"

2016-08-24 Thread Xunlei Pang
On 2016/08/24 at 16:20, Dave Young wrote:
> On 08/23/16 at 06:11pm, Yinghai Lu wrote:
>> On Wed, Aug 17, 2016 at 1:20 AM, Dave Young <dyo...@redhat.com> wrote:
>>> On 08/17/16 at 09:50am, Xunlei Pang wrote:
>>>> "/sys/kernel/kexec_crash_size" only handles crashk_res, it
>>>> is fine in most cases, but sometimes we have crashk_low_res.
>>>> For example, when "crashkernel=size[KMG],high" combined with
>>>> "crashkernel=size[KMG],low" is used for 64-bit x86.
>>>>
>>>> Like crashk_res, we introduce the corresponding sysfs file
>>>> "/sys/kernel/kexec_crash_low_size" for crashk_low_res.
>>>>
>>>> So, the exact total reserved memory is the sum of the two.
>>>>
>>>> crashk_low_res can also be shrunk via this new interface,
>>>> and users should be aware of what they are doing.
>> ...
>>>> @@ -218,6 +238,7 @@ static struct attribute * kernel_attrs[] = {
>>>>  #ifdef CONFIG_KEXEC_CORE
>>>>   &kexec_loaded_attr.attr,
>>>>   &kexec_crash_loaded_attr.attr,
>>>> + &kexec_crash_low_size_attr.attr,
>>>>   &kexec_crash_size_attr.attr,
>>>>   &vmcoreinfo_attr.attr,
>>>>  #endif
>> would be better if you can use attribute_group .is_visible to control 
>> showing of
>> crash_low_size only when the crash_base is above 4G.
> I have the same feeling that it looks odd to show the low value in sysfs when
> crashkernel=,high is not being used. Even though crashkernel=,high is used
> only on x86, the crashk_low resource is in common code. What do you think
> about moving it to x86?

If we want to put some restriction on it, I'd prefer to move crashk_low into
arch x86 code to make it x86-specific.

Alternatively, we can simply show the interface unconditionally: if it isn't
used, its size is 0, so it doesn't matter.
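
For illustration, a tiny user-space reader of the two sysfs files; note that
/sys/kernel/kexec_crash_low_size is only the file proposed in this patch, so
its presence is an assumption, and it is simply reported as 0 when absent:

#include <stdio.h>

static unsigned long read_size(const char *path)
{
	unsigned long val = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;       /* file absent or unreadable: treat as 0 */
	if (fscanf(f, "%lu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long main_sz = read_size("/sys/kernel/kexec_crash_size");
	unsigned long low_sz  = read_size("/sys/kernel/kexec_crash_low_size");

	printf("crashkernel reserved: %lu + %lu = %lu bytes\n",
	       main_sz, low_sz, main_sz + low_sz);
	return 0;
}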

Regards,
Xunlei

>
> Thanks
> Dave
>
>> Thanks
>>
>> Yinghai
>>




Re: [RFC 0/4] Kexec: Enable run time memory resrvation of crash kernel

2016-08-23 Thread Xunlei Pang
On 2016/08/22 at 18:59, Pratyush Anand wrote:
> On 12/08/2016:07:48:38 PM, Ronit Halder wrote:
>> Currently the linux kernel reserves memory for the crash kernel at boot time.
>> It would be very useful if we could reserve memory at run time. The user could
>> then reserve the memory whenever needed instead of reserving it at boot time.
>>
>> It is possible to reserve memory for crash kernel at the run time using
>> CMA (Contiguous Memory Allocator). CMA is capable of allocating big chunk 
>> of memory. At the boot time we will create one (if only low memory is used)
>> or two (if we use both high memory in case of x86_64) CMA areas of size 
>> given in "crashkernel" boot time command line parameter. This memory in CMA
>> areas can be used as movable pages (used for disk caches, process pages
>> etc) if not allocated. Then the user can reserve or free memory from those
>> CMA areas using "/sys/kernel/kexec_crash_size" sysfs entry. If the user
> But cma_alloc() is not a guaranteed allocation function, whereas the memblock
> api will guarantee that crashkernel memory is available.
> Moreover, most systems start the kdump service at boot time, so I'm not sure
> it would be useful enough. Let's see what others say.

Maybe this is useful for debugging purposes: after you shrink the memory and
realize you just made a mistake, you can use this function to expand it again
without rebooting to modify the cmdline. Otherwise, I can't think of other use
cases.

But it still relies on the "crashkernel" cmdline, and I think it would be more
useful (at least for me) if you could drop "crashkernel" entirely and use the
sysfs entry directly to reserve or expand the memory if possible. Sometimes
when I want to debug a kdump issue, I find that the system I am using didn't
specify the right "crashkernel" cmdline (none, or too small), so I must reboot
it.

Regards,
Xunlei

>
>> uses high memory it will automatically reserve at least 256MB of low memory
>> (needed for swiotlb and DMA buffers) when the user allocates memory using
>> the mentioned sysfs entry. In case of high memory reservation the user controls
>> the size of reserved region in high memory with
>> "/sys/kernel/kexec_crash_size" entry. If the size set is zero then the 
>> memory allocated in low memory will automatically be freed.
>>
>> As the pages under CMA area (when not allocated by CMA) can only be used by
>> movable pages. The pages won't be used for DMA. So, after allocating pages
>> from CMA area for loading the crash kernel, there won't be any chance of
>> DMA on the memory.
>>
>> This is a prototype patch. Please share your opinions on my approach. This
>> patch is only for x86 and x86_64. Please note, this patch is only a
>> prototype just to explain my approach and get the review. This patch is on
>> kernel version v4.4.11.
>>
>> CMA depends on page migration and only uses movable pages. But, the movable
>> pages become unmovable momentarily for pinning. The CMA fails for this
>> reason. I don't have any solution for that right now. This approach will
>> work when the this problems with CMA will be fixed. The patch is enabled
>> by a kernel configuration option CONFIG_KEXEC_CMA.
>>
>> Ronit Halder (4):
>>   Creating one or two CMA area at Boot time
>>   Functions for memory reservation and release
>>   Adding a new kernel configuration to enable the feature
>>   Enable memory allocation through sysfs interface
>>
>>  arch/x86/kernel/setup.c | 44 --
>>  include/linux/kexec.h   | 11 ++-
>>  kernel/kexec_core.c | 83 
>> +
>>  kernel/ksysfs.c | 23 +-
>>  mm/Kconfig  |  6 
>>  5 files changed, 162 insertions(+), 5 deletions(-)
> ~Pratyush
>




Re: [PATCH v7 2/2] Documentation: kdump: add description of enable multi-cpus support

2016-08-17 Thread Xunlei Pang
On 2016/08/18 at 09:50, Zhou Wenjian wrote:
> multi-cpu support is useful to improve the performance of kdump in
> some cases. So add a description of how to enable multi-cpu support in
> the dump-capture kernel.
>
> Signed-off-by: Zhou Wenjian 
> Acked-by: Baoquan He 
> ---
>  Documentation/kdump/kdump.txt | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> index 96da2b7..c93a6e0 100644
> --- a/Documentation/kdump/kdump.txt
> +++ b/Documentation/kdump/kdump.txt
> @@ -396,6 +396,13 @@ Notes on loading the dump-capture kernel:
>Note, though maxcpus always works, you should replace it by nr_cpus to
>save memory if supported by the current ARCH, such as x86.
>  
> +* You should enable multi-cpu support in dump-capture kernel if you intend
> +  to use multi-thread programs with it, such as parallel dump feature of
> +  makedumpfile. Otherwise, the multi-thread program may have a great
> +  performance degradation. To enable multi-cpu support, you should bring up
> +  a SMP dump-capture kernel and specify maxcpus\nr_cpus, 
> disable_cpu_apicid=[X]

s/a SMP/an SMP/
For "maxcpus\nr_cpus", I think using a slash instead of a backslash is better.

Otherwise, looks good to me.

Regards,
Xunlei

> +  options while loading it.
> +
>  * For s390x there are two kdump modes: If a ELF header is specified with
>the elfcorehdr= kernel parameter, it is used by the kdump kernel as it
>is done on all other architectures. If no elfcorehdr= kernel parameter is




Re: [PATCH v2 2/2] kexec: Consider crashk_low_res in sanity_check_segment_list()

2016-08-17 Thread Xunlei Pang
On 2016/08/17 at 15:24, Dave Young wrote:
> Hi, Xunlei,
>
> On 08/17/16 at 09:50am, Xunlei Pang wrote:
>> In most cases we only have crashk_res, but sometimes we also have
>> crashk_low_res.
>>
>> For example, on 64-bit x86 systems, when "crashkernel=32M,high"
>> is combined with "crashkernel=128M,low", some segments may end up
>> being loaded into the crashk_low_res area. We can't fail them as
>> memory violations in these cases.
>>
>> Thus, add a check that regards a segment as valid if it lies
>> within crashk_low_res.
> crashkernel low is meant for swiotlb; it can be reserved automatically
> in case only crashkernel high is specified on the cmdline. I'm not
> sure it is useful to use crashk_low_res for other purposes, and
> kdump can likely fail in that case.
>
> I'm not sure it is really necessary to add this check now, we may
> handle it only when there is an actual use case and bug report in
> the future.

Thanks for the review.
The reason I added this is that crashk_res is allowed to be shrunk, so a
segment may well fall into crashk_low_res if crashk_res was shrunk to a small
range.

Yes, this should be a corner case, but it seems adding this check does no
harm. Anyway, if you think it's not necessary, let's simply ignore it :-)

Regards,
Xunlei

>
> Thanks
> Dave
>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>> ---
>>  kernel/kexec_core.c | 11 ---
>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>> index 707d18e..9012a60 100644
>> --- a/kernel/kexec_core.c
>> +++ b/kernel/kexec_core.c
>> @@ -248,9 +248,14 @@ int sanity_check_segment_list(struct kimage *image)
>>  mstart = image->segment[i].mem;
>>  mend = mstart + image->segment[i].memsz - 1;
>>  /* Ensure we are within the crash kernel limits */
>> -if ((mstart < phys_to_boot_phys(crashk_res.start)) ||
>> -(mend > phys_to_boot_phys(crashk_res.end)))
>> -return -EADDRNOTAVAIL;
>> +if ((mstart >= phys_to_boot_phys(crashk_res.start)) &&
>> +(mend <= phys_to_boot_phys(crashk_res.end)))
>> +continue;
>> +if ((mstart >= phys_to_boot_phys(crashk_low_res.start)) 
>> &&
>> +(mend <= phys_to_boot_phys(crashk_low_res.end)))
>> +continue;
>> +
>> +return -EADDRNOTAVAIL;
>>  }
>>  }
>>  
>> -- 
>> 1.8.3.1
>>




[PATCH v2 1/2] kexec: Introduce "/sys/kernel/kexec_crash_low_size"

2016-08-16 Thread Xunlei Pang
"/sys/kernel/kexec_crash_size" only handles crashk_res, it
is fine in most cases, but sometimes we have crashk_low_res.
For example, when "crashkernel=size[KMG],high" combined with
"crashkernel=size[KMG],low" is used for 64-bit x86.

Like crashk_res, we introduce the corresponding sysfs file
"/sys/kernel/kexec_crash_low_size" for crashk_low_res.

So, the exact total reserved memory is the sum of the two.

crashk_low_res can also be shrunk via this new interface,
and users should be aware of what they are doing.

Suggested-by: Dave Young <dyo...@redhat.com>
Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 include/linux/kexec.h |  4 ++--
 kernel/kexec_core.c   | 23 ---
 kernel/ksysfs.c   | 25 +++--
 3 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d743777..4f271fc 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -304,8 +304,8 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram,
 		unsigned long long *crash_size, unsigned long long *crash_base);
 int parse_crashkernel_low(char *cmdline, unsigned long long system_ram,
 		unsigned long long *crash_size, unsigned long long *crash_base);
-int crash_shrink_memory(unsigned long new_size);
-size_t crash_get_memory_size(void);
+int crash_shrink_memory(struct resource *res, unsigned long new_size);
+size_t crash_get_memory_size(struct resource *res);
 void crash_free_reserved_phys_range(unsigned long begin, unsigned long end);
 
 int __weak arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 5616755..707d18e 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -925,13 +925,13 @@ void crash_kexec(struct pt_regs *regs)
 	}
 }
 
-size_t crash_get_memory_size(void)
+size_t crash_get_memory_size(struct resource *res)
 {
 	size_t size = 0;
 
 	mutex_lock(&kexec_mutex);
-	if (crashk_res.end != crashk_res.start)
-		size = resource_size(&crashk_res);
+	if (res->end != res->start)
+		size = resource_size(res);
 	mutex_unlock(&kexec_mutex);
 	return size;
 }
@@ -945,7 +945,7 @@ void __weak crash_free_reserved_phys_range(unsigned long begin,
 	free_reserved_page(boot_pfn_to_page(addr >> PAGE_SHIFT));
 }
 
-int crash_shrink_memory(unsigned long new_size)
+int crash_shrink_memory(struct resource *res, unsigned long new_size)
 {
 	int ret = 0;
 	unsigned long start, end;
@@ -958,8 +958,9 @@ int crash_shrink_memory(unsigned long new_size)
 		ret = -ENOENT;
 		goto unlock;
 	}
-	start = crashk_res.start;
-	end = crashk_res.end;
+
+	start = res->start;
+	end = res->end;
 	old_size = (end == 0) ? 0 : end - start + 1;
 	if (new_size >= old_size) {
 		ret = (new_size == old_size) ? 0 : -EINVAL;
@@ -975,17 +976,17 @@ int crash_shrink_memory(unsigned long new_size)
 	start = roundup(start, KEXEC_CRASH_MEM_ALIGN);
 	end = roundup(start + new_size, KEXEC_CRASH_MEM_ALIGN);
 
-	crash_free_reserved_phys_range(end, crashk_res.end);
+	crash_free_reserved_phys_range(end, res->end);
 
-	if ((start == end) && (crashk_res.parent != NULL))
-		release_resource(&crashk_res);
+	if ((start == end) && (res->parent != NULL))
+		release_resource(res);
 
 	ram_res->start = end;
-	ram_res->end = crashk_res.end;
+	ram_res->end = res->end;
 	ram_res->flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM;
 	ram_res->name = "System RAM";
 
-	crashk_res.end = end - 1;
+	res->end = end - 1;
 
 	insert_resource(&iomem_resource, ram_res);
 
diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index ee1bc1b..3336fd5 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -105,10 +105,30 @@ static ssize_t kexec_crash_loaded_show(struct kobject *kobj,
 }
 KERNEL_ATTR_RO(kexec_crash_loaded);
 
+static ssize_t kexec_crash_low_size_show(struct kobject *kobj,
+					 struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%zu\n", crash_get_memory_size(&crashk_low_res));
+}
+static ssize_t kexec_crash_low_size_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned long cnt;
+	int ret;
+
+	if (kstrtoul(buf, 0, &cnt))
+		return -EINVAL;
+
+	ret = crash_shrink_memory(&crashk_low_res, cnt);
+	return ret < 0 ? ret : count;
+}
+KERNEL_ATTR_RW(kexec_crash_low_size);
+
 static ssize_t kexec_crash_size_show(struct kobject *kobj,
 				     struct kobj_attribute *attr, char *buf)
 {
-	r
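
As a usage sketch for the proposed interface: the two sysfs paths below
come from the patch above, while the reader program itself is a
hypothetical user-space example; the total reservation is the sum of the
two files.

#include <stdio.h>

/* Read a single decimal size from a sysfs file; returns 0 on error. */
static unsigned long long read_size(const char *path)
{
	unsigned long long val = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;
	if (fscanf(f, "%llu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long long high = read_size("/sys/kernel/kexec_crash_size");
	unsigned long long low  = read_size("/sys/kernel/kexec_crash_low_size");

	printf("crashk_res:     %llu bytes\n", high);
	printf("crashk_low_res: %llu bytes\n", low);
	printf("total reserved: %llu bytes\n", high + low);
	return 0;
}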

[PATCH v2 2/2] kexec: Consider crashk_low_res in sanity_check_segment_list()

2016-08-16 Thread Xunlei Pang
We have only crashk_res in most cases, but sometimes we have
crashk_low_res as well.

For example, on 64-bit x86 systems, when "crashkernel=32M,high"
is combined with "crashkernel=128M,low", some segments may end up
being loaded into the crashk_low_res area. We can't fail these as
memory violations.

Thus, also regard a segment as valid if it lies within
crashk_low_res.

Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 kernel/kexec_core.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 707d18e..9012a60 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -248,9 +248,14 @@ int sanity_check_segment_list(struct kimage *image)
 			mstart = image->segment[i].mem;
 			mend = mstart + image->segment[i].memsz - 1;
 			/* Ensure we are within the crash kernel limits */
-			if ((mstart < phys_to_boot_phys(crashk_res.start)) ||
-			    (mend > phys_to_boot_phys(crashk_res.end)))
-				return -EADDRNOTAVAIL;
+			if ((mstart >= phys_to_boot_phys(crashk_res.start)) &&
+			    (mend <= phys_to_boot_phys(crashk_res.end)))
+				continue;
+			if ((mstart >= phys_to_boot_phys(crashk_low_res.start)) &&
+			    (mend <= phys_to_boot_phys(crashk_low_res.end)))
+				continue;
+
+			return -EADDRNOTAVAIL;
 		}
 	}
 
-- 
1.8.3.1




Re: [PATCH] kexec: Account crashk_low_res to kexec_crash_size

2016-08-15 Thread Xunlei Pang
On 2016/08/15 at 15:17, Dave Young wrote:
> Hi Xunlei,
>
> On 08/13/16 at 04:26pm, Xunlei Pang wrote:
>> "/sys/kernel/kexec_crash_size" only includes crashk_res, it
>> is fine in most cases, but sometimes we have crashk_low_res.
>> For example, when "crashkernel=size[KMG],high" combined with
>> "crashkernel=size[KMG],low" is used for 64-bit x86.
>>
>> Let "/sys/kernel/kexec_crash_size" reflect all the reserved
>> memory including crashk_low_res, this is more understandable
>> from its naming.
> Maybe export another file for kexec_crash_low_size so that we can
> clearly see how large the low area is.

I'm fine with it.

>> Although we can get all the crash memory from "/proc/iomem" by
>> filtering on the "Crash kernel" keyword, it is more convenient to use
>> this file, and the two should stay consistent.
> Shrinking the low area does not make much sense; one would either use
> it or shrink it to 0.
>
> Actually, thinking more about it, crashk_low_res is only used by x86;
> it might be even better to move it to x86 code instead of keeping it
> in common code.
>
> Opinion?

crashk_low_res is defined in kernel/kexec_core.c; it's an
architecture-independent definition, and although it's only used by x86
currently, it may be used by other architectures in the future. That's
why I'm not handling it specifically for x86.

I just tested the existing sysfs interface further, and the reservation
can be shrunk to zero. So I guess we can ease the restriction on
shrinking the low area as well.

What do you think?

Regards,
Xunlei

>
> Thanks
> Dave
>> Note that writing to "/sys/kernel/kexec_crash_size" shrinks
>> the reserved memory, and we want to shrink crashk_res only.
>> So add an additional check in crash_shrink_memory(), since
>> crashk_low_res is now involved.
>>
>> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
>> ---
>>  kernel/kexec_core.c | 15 ++++++++++++++-
>>  1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>> index 5616755..d5ae780 100644
>> --- a/kernel/kexec_core.c
>> +++ b/kernel/kexec_core.c
>> @@ -932,6 +932,8 @@ size_t crash_get_memory_size(void)
>>  	mutex_lock(&kexec_mutex);
>>  	if (crashk_res.end != crashk_res.start)
>>  		size = resource_size(&crashk_res);
>> +	if (crashk_low_res.end != crashk_low_res.start)
>> +		size += resource_size(&crashk_low_res);
>>  	mutex_unlock(&kexec_mutex);
>>  	return size;
>>  }
>> @@ -949,7 +951,7 @@ int crash_shrink_memory(unsigned long new_size)
>>  {
>>  	int ret = 0;
>>  	unsigned long start, end;
>> -	unsigned long old_size;
>> +	unsigned long low_size, old_size;
>>  	struct resource *ram_res;
>>  
>>  	mutex_lock(&kexec_mutex);
>> @@ -958,6 +960,17 @@ int crash_shrink_memory(unsigned long new_size)
>>  		ret = -ENOENT;
>>  		goto unlock;
>>  	}
>> +
>> +	start = crashk_low_res.start;
>> +	end = crashk_low_res.end;
>> +	low_size = (end == 0) ? 0 : end - start + 1;
>> +	/* Do not shrink crashk_low_res. */
>> +	if (new_size <= low_size) {
>> +		ret = -EINVAL;
>> +		goto unlock;
>> +	}
>> +
>> +	new_size -= low_size;
>>  	start = crashk_res.start;
>>  	end = crashk_res.end;
>>  	old_size = (end == 0) ? 0 : end - start + 1;
>> -- 
>> 1.8.3.1
>>
>>


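
A small hypothetical sketch of the "shrink it to 0" case mentioned
above, assuming the separate "/sys/kernel/kexec_crash_low_size" file
from the v2 series is available; writing 0 asks crash_shrink_memory()
to free and release the whole crashk_low_res range. The program below
is an illustration, not part of any patch.

#include <stdio.h>

int main(void)
{
	/* Requires root and the proposed sysfs file. */
	FILE *f = fopen("/sys/kernel/kexec_crash_low_size", "w");

	if (!f) {
		perror("open kexec_crash_low_size");
		return 1;
	}
	if (fprintf(f, "0\n") < 0)
		perror("write");
	fclose(f);
	return 0;
}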


[PATCH] kexec: Account crashk_low_res to kexec_crash_size

2016-08-13 Thread Xunlei Pang
"/sys/kernel/kexec_crash_size" only includes crashk_res, it
is fine in most cases, but sometimes we have crashk_low_res.
For example, when "crashkernel=size[KMG],high" combined with
"crashkernel=size[KMG],low" is used for 64-bit x86.

Let "/sys/kernel/kexec_crash_size" reflect all the reserved
memory including crashk_low_res, this is more understandable
from its naming.

Although we can get all the crash memory from "/proc/iomem"
by filtering all "Crash kernel" keyword, it is more convenient
to utilize this file, and the two ways should stay consistent.

Note that write to "/sys/kernel/kexec_crash_size" is to shrink
the reserved memory, and we want to shrink crashk_res only.
So we add some additional check in crash_shrink_memory() since
crashk_low_res now is involved.

Signed-off-by: Xunlei Pang <xlp...@redhat.com>
---
 kernel/kexec_core.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 5616755..d5ae780 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -932,6 +932,8 @@ size_t crash_get_memory_size(void)
 	mutex_lock(&kexec_mutex);
 	if (crashk_res.end != crashk_res.start)
 		size = resource_size(&crashk_res);
+	if (crashk_low_res.end != crashk_low_res.start)
+		size += resource_size(&crashk_low_res);
 	mutex_unlock(&kexec_mutex);
 	return size;
 }
@@ -949,7 +951,7 @@ int crash_shrink_memory(unsigned long new_size)
 {
 	int ret = 0;
 	unsigned long start, end;
-	unsigned long old_size;
+	unsigned long low_size, old_size;
 	struct resource *ram_res;
 
 	mutex_lock(&kexec_mutex);
@@ -958,6 +960,17 @@ int crash_shrink_memory(unsigned long new_size)
 		ret = -ENOENT;
 		goto unlock;
 	}
+
+	start = crashk_low_res.start;
+	end = crashk_low_res.end;
+	low_size = (end == 0) ? 0 : end - start + 1;
+	/* Do not shrink crashk_low_res. */
+	if (new_size <= low_size) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	new_size -= low_size;
 	start = crashk_res.start;
 	end = crashk_res.end;
 	old_size = (end == 0) ? 0 : end - start + 1;
-- 
1.8.3.1
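
To make the shrink arithmetic of this version concrete, here is a small
worked example with illustrative sizes (not taken from any real system);
it mirrors the low_size/old_size checks added above.

#include <stdio.h>

int main(void)
{
	unsigned long long mib = 1024ULL * 1024ULL;
	unsigned long long low_size = 128 * mib;   /* crashk_low_res        */
	unsigned long long old_size = 256 * mib;   /* crashk_res            */
	unsigned long long new_size = 200 * mib;   /* value written to sysfs */

	/* With the accounting change, the file reports the sum of both. */
	printf("kexec_crash_size reads %llu MiB\n",
	       (low_size + old_size) / mib);

	if (new_size <= low_size) {
		/* e.g. writing 128 MiB or less: crashk_low_res is never shrunk. */
		printf("write rejected with -EINVAL\n");
		return 0;
	}
	new_size -= low_size;                      /* only crashk_res shrinks */
	if (new_size == old_size)
		printf("no change\n");
	else if (new_size > old_size)
		printf("growing is not allowed: -EINVAL\n");
	else
		printf("crashk_res shrunk to %llu MiB\n", new_size / mib);
	return 0;
}

With the numbers above, writing 200 MiB keeps the 128 MiB low region
untouched and shrinks crashk_res to 72 MiB.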

