Re: [PATCH 2/2] arm64: Allocate crashkernel always in ZONE_DMA

2020-07-02 Thread chenzhou
Hi Bhupesh,


On 2020/7/3 3:22, Bhupesh Sharma wrote:
> Hi Will,
>
> On Thu, Jul 2, 2020 at 1:20 PM Will Deacon  wrote:
>> On Thu, Jul 02, 2020 at 03:44:20AM +0530, Bhupesh Sharma wrote:
>>> commit bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in
>>> ZONE_DMA32") allocates crashkernel for arm64 in the ZONE_DMA32.
>>>
>>> However as reported by Prabhakar, this breaks kdump kernel booting in
>>> ThunderX2 like arm64 systems. I have noticed this on another ampere
>>> arm64 machine. The OOM log in the kdump kernel looks like this:
>>>
>>>   [0.240552] DMA: preallocated 128 KiB GFP_KERNEL pool for atomic 
>>> allocations
>>>   [0.247713] swapper/0: page allocation failure: order:1, 
>>> mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
>>>   <..snip..>
>>>   [0.274706] Call trace:
>>>   [0.277170]  dump_backtrace+0x0/0x208
>>>   [0.280863]  show_stack+0x1c/0x28
>>>   [0.284207]  dump_stack+0xc4/0x10c
>>>   [0.287638]  warn_alloc+0x104/0x170
>>>   [0.291156]  __alloc_pages_slowpath.constprop.106+0xb08/0xb48
>>>   [0.296958]  __alloc_pages_nodemask+0x2ac/0x2f8
>>>   [0.301530]  alloc_page_interleave+0x20/0x90
>>>   [0.305839]  alloc_pages_current+0xdc/0xf8
>>>   [0.309972]  atomic_pool_expand+0x60/0x210
>>>   [0.314108]  __dma_atomic_pool_init+0x50/0xa4
>>>   [0.318504]  dma_atomic_pool_init+0xac/0x158
>>>   [0.322813]  do_one_initcall+0x50/0x218
>>>   [0.326684]  kernel_init_freeable+0x22c/0x2d0
>>>   [0.331083]  kernel_init+0x18/0x110
>>>   [0.334600]  ret_from_fork+0x10/0x18
>>>
>>> This patch limits the crashkernel allocation to the first 1GB of
>>> the RAM accessible (ZONE_DMA), as otherwise we might run into OOM
>>> issues when crashkernel is executed, as it might have been originally
>>> allocated from either a ZONE_DMA32 memory or mixture of memory chunks
>>> belonging to both ZONE_DMA and ZONE_DMA32.
>> How does this interact with this ongoing series:
>>
>> https://lore.kernel.org/r/20200628083458.40066-1-chenzho...@huawei.com
>>
>> (patch 4, in particular)
> Many thanks for having a look at this patchset. I was not aware that
> Chen had sent out a new version.
> I had noted in the v9 review of the high/low range allocation
>  that I was working
> on a generic solution (irrespective of the crashkernel, low and high
> range allocation) which resulted in this patchset.
>
> The issue is two-fold: OOPs in memcfg layer (PATCH 1/2, which has been
> Acked-by memcfg maintainer) and OOM in the kdump kernel due to
> crashkernel allocation in ZONE_DMA32 regions(s) which is addressed by
> this PATCH.
>
> I will have a closer look at the v10 patchset Chen shared, but seems
> it needs some rework as per Dave's review comments which he shared
> today.
> IMO, in the meanwhile this patchset  can be used to fix the existing
> kdump issue with upstream kernel.
Thanks for your work.
There is no progress on the issue for long time, so i sent my solution in v8 
comments
and sent v9 recently.

I think direct limiting the crashkernel in ZONE_DMA isn't a good idea:
1. For parameter "crashkernel=Y", reserving crashkernel in first 1G memory will 
increase
the probability of memory allocation failure.
Previous discuss from https://lkml.org/lkml/2019/10/21/725:
"With ZONE_DMA=y, this config will fail to reserve 512M CMA on a server"

2. For parameter "crashkernel=Y@X", limiting the crashkernel in ZONE_DMA is 
unreasonable
for someone really want to reserve crashkernel from specified start address.

I have sent v10: https://www.spinics.net/lists/arm-kernel/msg819408.html, any 
commets are welcome.

Thanks,
Chen Zhou
>
>>> Fixes: bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in ZONE_DMA32")
>>> Cc: Johannes Weiner 
>>> Cc: Michal Hocko 
>>> Cc: Vladimir Davydov 
>>> Cc: James Morse 
>>> Cc: Mark Rutland 
>>> Cc: Will Deacon 
>>> Cc: Catalin Marinas 
>>> Cc: cgro...@vger.kernel.org
>>> Cc: linux...@kvack.org
>>> Cc: linux-arm-ker...@lists.infradead.org
>>> Cc: linux-ker...@vger.kernel.org
>>> Cc: kexec@lists.infradead.org
>>> Reported-by: Prabhakar Kushwaha 
>>> Signed-off-by: Bhupesh Sharma 
>>> ---
>>>  arch/arm64/mm/init.c | 16 ++--
>>>  1 file changed, 14 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>>> index 1e93cfc7c47a..02ae4d623802 100644
>>> --- a/arch/arm64/mm/init.c
>>> +++ b/arch/arm64/mm/init.c
>>> @@ -91,8 +91,15 @@ static void __init reserve_crashkernel(void)
>>>   crash_size = PAGE_ALIGN(crash_size);
>>>
>>>   if (crash_base == 0) {
>>> - /* Current arm64 boot protocol requires 2MB alignment */
>>> - crash_base = memblock_find_in_range(0, arm64_dma32_phys_limit,
>>> + /* Current arm64 boot protocol requires 2MB alignment.
>>> +  * Also limit the crashkernel allocation to the first
>>> +  * 1GB of the RAM accessible (ZONE_DMA), as 

Re: [PATCH v10 5/5] kdump: update Documentation about crashkernel on arm64

2020-07-02 Thread Dave Young
On 07/03/20 at 12:46pm, Dave Young wrote:
> Hi,
> 
> Thanks for the update, but still some nitpicks :(
> 
> I'm sorry I did not catch them previously,  but maybe it is not worth to
> repost the whole series if no other changes needed.

Feel free to add my acks for the common kdump part:

Acked-by: Dave Young 

Thanks
Dave


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v10 5/5] kdump: update Documentation about crashkernel on arm64

2020-07-02 Thread Dave Young
Hi,

Thanks for the update, but still some nitpicks :(

I'm sorry I did not catch them previously,  but maybe it is not worth to
repost the whole series if no other changes needed.
On 07/03/20 at 11:58am, Chen Zhou wrote:
> Now we support crashkernel=X,[low] on arm64, update the Documentation.
> We could use parameters "crashkernel=X crashkernel=Y,low" to reserve
> memory above 4G.
> 
> Signed-off-by: Chen Zhou 
> Tested-by: John Donnelly 
> Tested-by: Prabhakar Kushwaha 
> ---
>  Documentation/admin-guide/kdump/kdump.rst   | 14 --
>  Documentation/admin-guide/kernel-parameters.txt | 17 +++--
>  2 files changed, 27 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kdump/kdump.rst 
> b/Documentation/admin-guide/kdump/kdump.rst
> index 2da65fef2a1c..e80fc9e28a9a 100644
> --- a/Documentation/admin-guide/kdump/kdump.rst
> +++ b/Documentation/admin-guide/kdump/kdump.rst
> @@ -299,7 +299,15 @@ Boot into System Kernel
> "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
> starting at physical address 0x0100 (16MB) for the dump-capture 
> kernel.
>  
> -   On x86 and x86_64, use "crashkernel=64M@16M".
> +   On x86 use "crashkernel=64M@16M".
> +
> +   On x86_64, use "crashkernel=Y" to select a region under 4G first, and
> +   fall back to reserve region above 4G.
> +   We can also use "crashkernel=X,high" to select a region above 4G, which
> +   also tries to allocate at least 256M below 4G automatically and
> +   "crashkernel=Y,low" can be used to allocate specified size low memory.
> +   Use "crashkernel=Y@X" if we really have to reserve memory from specified

s/we/you

> +   start address X.
>  
> On ppc64, use "crashkernel=128M@32M".
>  
> @@ -316,8 +324,10 @@ Boot into System Kernel
> kernel will automatically locate the crash kernel image within the
> first 512MB of RAM if X is not given.
>  
> -   On arm64, use "crashkernel=Y[@X]".  Note that the start address of
> +   On arm64, use "crashkernel=Y[@X]". Note that the start address of
> the kernel, X if explicitly specified, must be aligned to 2MiB (0x20).
> +   If crashkernel=Z,low is specified simultaneously, reserve spcified size

s/spcified/specified

> +   low memory firstly and then reserve memory above 4G.
>  
>  Load the Dump-capture Kernel
>  
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index fb95fad81c79..58a731eed011 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -722,6 +722,9 @@
>   [KNL, x86_64] select a region under 4G first, and
>   fall back to reserve region above 4G when '@offset'
>   hasn't been specified.
> + [KNL, arm64] If crashkernel=X,low is specified, reserve
> + spcified size low memory firstly, and then reserve 
> memory

s/spcified/specified

> + above 4G.
>   See Documentation/admin-guide/kdump/kdump.rst for 
> further details.
>  
>   crashkernel=range1:size1[,range2:size2,...][@offset]
> @@ -746,13 +749,23 @@
>   requires at least 64M+32K low memory, also enough extra
>   low memory is needed to make sure DMA buffers for 32-bit
>   devices won't run out. Kernel would try to allocate at
> - at least 256M below 4G automatically.
> + least 256M below 4G automatically.
>   This one let user to specify own low range under 4G
>   for second kernel instead.
>   0: to disable low allocation.
>   It will be ignored when crashkernel=X,high is not used
>   or memory reserved is below 4G.
> -
> + [KNL, arm64] range under 4G.
> + This one let user to specify own low range under 4G

s/own low/a low

> + for crash dump kernel instead.
> + Be different from x86_64, kernel reserves specified size
> + physical memory region only when this parameter is 
> specified
> + instead of trying to reserve at least 256M below 4G
> + automatically.
> + Use this parameter along with crashkernel=X when we want
> + to reserve crashkernel above 4G. If there are devices
> + need to use ZONE_DMA in crash dump kernel, it is also
> + a good choice.
>   cryptomgr.notests
>   [KNL] Disable crypto self-tests
>  
> -- 
> 2.20.1
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v10 1/5] x86: kdump: move reserve_crashkernel_low() into crash_core.c

2020-07-02 Thread Chen Zhou
In preparation for supporting reserve_crashkernel_low in arm64 as
x86_64 does, move reserve_crashkernel_low() into kernel/crash_core.c.

BTW, move x86_64 CRASH_ALIGN to 2M suggested by Dave. CONFIG_PHYSICAL_ALIGN
can be selected from 2M to 16M, move to the same as arm64.

Note, in arm64, we reserve low memory if and only if crashkernel=X,low
is specified. Different with x86_64, don't set low memory automatically.

Reported-by: kbuild test robot 
Signed-off-by: Chen Zhou 
Tested-by: John Donnelly 
Tested-by: Prabhakar Kushwaha 
Acked-by: Dave Young 
---
 arch/x86/kernel/setup.c| 66 -
 include/linux/crash_core.h |  3 ++
 include/linux/kexec.h  |  2 -
 kernel/crash_core.c| 85 ++
 kernel/kexec_core.c| 17 
 5 files changed, 96 insertions(+), 77 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a3767e74c758..33db99ae3035 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -401,8 +401,8 @@ static void __init 
memblock_x86_reserve_range_setup_data(void)
 
 #ifdef CONFIG_KEXEC_CORE
 
-/* 16M alignment for crash kernel regions */
-#define CRASH_ALIGNSZ_16M
+/* 2M alignment for crash kernel regions */
+#define CRASH_ALIGNSZ_2M
 
 /*
  * Keep the crash kernel below this limit.
@@ -425,59 +425,6 @@ static void __init 
memblock_x86_reserve_range_setup_data(void)
 # define CRASH_ADDR_HIGH_MAX   SZ_64T
 #endif
 
-static int __init reserve_crashkernel_low(void)
-{
-#ifdef CONFIG_X86_64
-   unsigned long long base, low_base = 0, low_size = 0;
-   unsigned long total_low_mem;
-   int ret;
-
-   total_low_mem = memblock_mem_size(1UL << (32 - PAGE_SHIFT));
-
-   /* crashkernel=Y,low */
-   ret = parse_crashkernel_low(boot_command_line, total_low_mem, 
_size, );
-   if (ret) {
-   /*
-* two parts from kernel/dma/swiotlb.c:
-* -swiotlb size: user-specified with swiotlb= or default.
-*
-* -swiotlb overflow buffer: now hardcoded to 32k. We round it
-* to 8M for other buffers that may need to stay low too. Also
-* make sure we allocate enough extra low memory so that we
-* don't run out of DMA buffers for 32-bit devices.
-*/
-   low_size = max(swiotlb_size_or_default() + (8UL << 20), 256UL 
<< 20);
-   } else {
-   /* passed with crashkernel=0,low ? */
-   if (!low_size)
-   return 0;
-   }
-
-   low_base = memblock_find_in_range(0, 1ULL << 32, low_size, CRASH_ALIGN);
-   if (!low_base) {
-   pr_err("Cannot reserve %ldMB crashkernel low memory, please try 
smaller size.\n",
-  (unsigned long)(low_size >> 20));
-   return -ENOMEM;
-   }
-
-   ret = memblock_reserve(low_base, low_size);
-   if (ret) {
-   pr_err("%s: Error reserving crashkernel low memblock.\n", 
__func__);
-   return ret;
-   }
-
-   pr_info("Reserving %ldMB of low memory at %ldMB for crashkernel (System 
low RAM: %ldMB)\n",
-   (unsigned long)(low_size >> 20),
-   (unsigned long)(low_base >> 20),
-   (unsigned long)(total_low_mem >> 20));
-
-   crashk_low_res.start = low_base;
-   crashk_low_res.end   = low_base + low_size - 1;
-   insert_resource(_resource, _low_res);
-#endif
-   return 0;
-}
-
 static void __init reserve_crashkernel(void)
 {
unsigned long long crash_size, crash_base, total_mem;
@@ -541,9 +488,12 @@ static void __init reserve_crashkernel(void)
return;
}
 
-   if (crash_base >= (1ULL << 32) && reserve_crashkernel_low()) {
-   memblock_free(crash_base, crash_size);
-   return;
+   if (crash_base >= (1ULL << 32)) {
+   if (reserve_crashkernel_low()) {
+   memblock_free(crash_base, crash_size);
+   return;
+   }
+   insert_resource(_resource, _low_res);
}
 
pr_info("Reserving %ldMB of memory at %ldMB for crashkernel (System 
RAM: %ldMB)\n",
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index 525510a9f965..4df8c0bff03e 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -63,6 +63,8 @@ phys_addr_t paddr_vmcoreinfo_note(void);
 extern unsigned char *vmcoreinfo_data;
 extern size_t vmcoreinfo_size;
 extern u32 *vmcoreinfo_note;
+extern struct resource crashk_res;
+extern struct resource crashk_low_res;
 
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
  void *data, size_t data_len);
@@ -74,5 +76,6 @@ int parse_crashkernel_high(char *cmdline, unsigned long long 
system_ram,
unsigned long long *crash_size, unsigned long long *crash_base);
 int 

[PATCH v10 5/5] kdump: update Documentation about crashkernel on arm64

2020-07-02 Thread Chen Zhou
Now we support crashkernel=X,[low] on arm64, update the Documentation.
We could use parameters "crashkernel=X crashkernel=Y,low" to reserve
memory above 4G.

Signed-off-by: Chen Zhou 
Tested-by: John Donnelly 
Tested-by: Prabhakar Kushwaha 
---
 Documentation/admin-guide/kdump/kdump.rst   | 14 --
 Documentation/admin-guide/kernel-parameters.txt | 17 +++--
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kdump/kdump.rst 
b/Documentation/admin-guide/kdump/kdump.rst
index 2da65fef2a1c..e80fc9e28a9a 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -299,7 +299,15 @@ Boot into System Kernel
"crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
starting at physical address 0x0100 (16MB) for the dump-capture kernel.
 
-   On x86 and x86_64, use "crashkernel=64M@16M".
+   On x86 use "crashkernel=64M@16M".
+
+   On x86_64, use "crashkernel=Y" to select a region under 4G first, and
+   fall back to reserve region above 4G.
+   We can also use "crashkernel=X,high" to select a region above 4G, which
+   also tries to allocate at least 256M below 4G automatically and
+   "crashkernel=Y,low" can be used to allocate specified size low memory.
+   Use "crashkernel=Y@X" if we really have to reserve memory from specified
+   start address X.
 
On ppc64, use "crashkernel=128M@32M".
 
@@ -316,8 +324,10 @@ Boot into System Kernel
kernel will automatically locate the crash kernel image within the
first 512MB of RAM if X is not given.
 
-   On arm64, use "crashkernel=Y[@X]".  Note that the start address of
+   On arm64, use "crashkernel=Y[@X]". Note that the start address of
the kernel, X if explicitly specified, must be aligned to 2MiB (0x20).
+   If crashkernel=Z,low is specified simultaneously, reserve spcified size
+   low memory firstly and then reserve memory above 4G.
 
 Load the Dump-capture Kernel
 
diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..58a731eed011 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -722,6 +722,9 @@
[KNL, x86_64] select a region under 4G first, and
fall back to reserve region above 4G when '@offset'
hasn't been specified.
+   [KNL, arm64] If crashkernel=X,low is specified, reserve
+   spcified size low memory firstly, and then reserve 
memory
+   above 4G.
See Documentation/admin-guide/kdump/kdump.rst for 
further details.
 
crashkernel=range1:size1[,range2:size2,...][@offset]
@@ -746,13 +749,23 @@
requires at least 64M+32K low memory, also enough extra
low memory is needed to make sure DMA buffers for 32-bit
devices won't run out. Kernel would try to allocate at
-   at least 256M below 4G automatically.
+   least 256M below 4G automatically.
This one let user to specify own low range under 4G
for second kernel instead.
0: to disable low allocation.
It will be ignored when crashkernel=X,high is not used
or memory reserved is below 4G.
-
+   [KNL, arm64] range under 4G.
+   This one let user to specify own low range under 4G
+   for crash dump kernel instead.
+   Be different from x86_64, kernel reserves specified size
+   physical memory region only when this parameter is 
specified
+   instead of trying to reserve at least 256M below 4G
+   automatically.
+   Use this parameter along with crashkernel=X when we want
+   to reserve crashkernel above 4G. If there are devices
+   need to use ZONE_DMA in crash dump kernel, it is also
+   a good choice.
cryptomgr.notests
[KNL] Disable crypto self-tests
 
-- 
2.20.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v10 0/5] support reserving crashkernel above 4G on arm64 kdump

2020-07-02 Thread Chen Zhou
This patch series enable reserving crashkernel above 4G in arm64.

There are following issues in arm64 kdump:
1. We use crashkernel=X to reserve crashkernel below 4G, which will fail
when there is no enough low memory.
2. Currently, crashkernel=Y@X can be used to reserve crashkernel above 4G,
in this case, if swiotlb or DMA buffers are required, crash dump kernel
will boot failure because there is no low memory available for allocation.
3. commit 1a8e1cef7603 ("arm64: use both ZONE_DMA and ZONE_DMA32") broken
the arm64 kdump. If the memory reserved for crash dump kernel falled in
ZONE_DMA32, the devices in crash dump kernel need to use ZONE_DMA will alloc
fail.

To solve these issues, introduce crashkernel=X,low to reserve specified
size low memory.
Crashkernel=X tries to reserve memory for the crash dump kernel under
4G. If crashkernel=Y,low is specified simultaneously, reserve spcified
size low memory for crash kdump kernel devices firstly and then reserve
memory above 4G.

When crashkernel is reserved above 4G in memory and crashkernel=X,low
is specified simultaneously, kernel should reserve specified size low memory
for crash dump kernel devices. So there may be two crash kernel regions, one
is below 4G, the other is above 4G.
In order to distinct from the high region and make no effect to the use of
kexec-tools, rename the low region as "Crash kernel (low)", and pass the
low region by reusing DT property "linux,usable-memory-range". We made the low
memory region as the last range of "linux,usable-memory-range" to keep
compatibility with existing user-space and older kdump kernels.

Besides, we need to modify kexec-tools:
arm64: support more than one crash kernel regions(see [1])

Another update is document about DT property 'linux,usable-memory-range':
schemas: update 'linux,usable-memory-range' node schema(see [2])

The previous changes and discussions can be retrieved from:

Changes since [v9]
- Patch 1 add Acked-by from Dave.
- Update patch 5 according to Dave's comments.
- Update chosen schema.

Changes since [v8]
- Reuse DT property "linux,usable-memory-range".
Suggested by Rob, reuse DT property "linux,usable-memory-range" to pass the low
memory region.
- Fix kdump broken with ZONE_DMA reintroduced.
- Update chosen schema.

Changes since [v7]
- Move x86 CRASH_ALIGN to 2M
Suggested by Dave and do some test, move x86 CRASH_ALIGN to 2M.
- Update Documentation/devicetree/bindings/chosen.txt.
Add corresponding documentation to Documentation/devicetree/bindings/chosen.txt
suggested by Arnd.
- Add Tested-by from Jhon and pk.

Changes since [v6]
- Fix build errors reported by kbuild test robot.

Changes since [v5]
- Move reserve_crashkernel_low() into kernel/crash_core.c.
- Delete crashkernel=X,high.
- Modify crashkernel=X,low.
If crashkernel=X,low is specified simultaneously, reserve spcified size low
memory for crash kdump kernel devices firstly and then reserve memory above 4G.
In addition, rename crashk_low_res as "Crash kernel (low)" for arm64, and then
pass to crash dump kernel by DT property "linux,low-memory-range".
- Update Documentation/admin-guide/kdump/kdump.rst.

Changes since [v4]
- Reimplement memblock_cap_memory_ranges for multiple ranges by Mike.

Changes since [v3]
- Add memblock_cap_memory_ranges back for multiple ranges.
- Fix some compiling warnings.

Changes since [v2]
- Split patch "arm64: kdump: support reserving crashkernel above 4G" as
two. Put "move reserve_crashkernel_low() into kexec_core.c" in a separate
patch.

Changes since [v1]:
- Move common reserve_crashkernel_low() code into kernel/kexec_core.c.
- Remove memblock_cap_memory_ranges() i added in v1 and implement that
in fdt_enforce_memory_region().
There are at most two crash kernel regions, for two crash kernel regions
case, we cap the memory range [min(regs[*].start), max(regs[*].end)]
and then remove the memory range in the middle.

[1]: http://lists.infradead.org/pipermail/kexec/2020-June/020737.html
[2]: https://github.com/robherring/dt-schema/pull/19 
[v1]: https://lkml.org/lkml/2019/4/2/1174
[v2]: https://lkml.org/lkml/2019/4/9/86
[v3]: https://lkml.org/lkml/2019/4/9/306
[v4]: https://lkml.org/lkml/2019/4/15/273
[v5]: https://lkml.org/lkml/2019/5/6/1360
[v6]: https://lkml.org/lkml/2019/8/30/142
[v7]: https://lkml.org/lkml/2019/12/23/411
[v8]: https://lkml.org/lkml/2020/5/21/213
[v9]: https://lkml.org/lkml/2020/6/28/73

Chen Zhou (5):
  x86: kdump: move reserve_crashkernel_low() into crash_core.c
  arm64: kdump: reserve crashkenel above 4G for crash dump kernel
  arm64: kdump: add memory for devices by DT property
linux,usable-memory-range
  arm64: kdump: fix kdump broken with ZONE_DMA reintroduced
  kdump: update Documentation about crashkernel on arm64

 Documentation/admin-guide/kdump/kdump.rst | 14 ++-
 .../admin-guide/kernel-parameters.txt | 17 +++-
 arch/arm64/kernel/setup.c |  8 +-
 arch/arm64/mm/init.c  | 74 ---
 

[PATCH v10 3/5] arm64: kdump: add memory for devices by DT property linux, usable-memory-range

2020-07-02 Thread Chen Zhou
If we want to reserve crashkernel above 4G, we could use parameters
"crashkernel=X crashkernel=Y,low", in this case, specified size low
memory is reserved for crash dump kernel devices and never mapped by
the first kernel. This memory range is advertised to crash dump kernel
via DT property under /chosen,
linux,usable-memory-range = 

We reused the DT property linux,usable-memory-range and made the low
memory region as the second range "BASE2 SIZE2", which keeps compatibility
with existing user-space and older kdump kernels.

Crash dump kernel reads this property at boot time and call memblock_add()
to add the low memory region after memblock_cap_memory_range() has been
called.

Signed-off-by: Chen Zhou 
Tested-by: John Donnelly 
Tested-by: Prabhakar Kushwaha 
---
 arch/arm64/mm/init.c | 43 +--
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index ce7ced85f5fb..f5b31e8f1f34 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -69,6 +69,15 @@ EXPORT_SYMBOL(vmemmap);
 phys_addr_t arm64_dma_phys_limit __ro_after_init;
 static phys_addr_t arm64_dma32_phys_limit __ro_after_init;
 
+/*
+ * The main usage of linux,usable-memory-range is for crash dump kernel.
+ * Originally, the number of usable-memory regions is one. Now crash dump
+ * kernel support at most two regions, low region and high region.
+ * To make compatibility with existing user-space and older kdump, the low
+ * region is always the last range of linux,usable-memory-range if exist.
+ */
+#define MAX_USABLE_RANGES  2
+
 #ifdef CONFIG_KEXEC_CORE
 /*
  * reserve_crashkernel() - reserves memory for crash kernel
@@ -272,9 +281,9 @@ early_param("mem", early_mem);
 static int __init early_init_dt_scan_usablemem(unsigned long node,
const char *uname, int depth, void *data)
 {
-   struct memblock_region *usablemem = data;
-   const __be32 *reg;
-   int len;
+   struct memblock_region *usable_rgns = data;
+   const __be32 *reg, *endp;
+   int len, nr = 0;
 
if (depth != 1 || strcmp(uname, "chosen") != 0)
return 0;
@@ -283,22 +292,36 @@ static int __init early_init_dt_scan_usablemem(unsigned 
long node,
if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
return 1;
 
-   usablemem->base = dt_mem_next_cell(dt_root_addr_cells, );
-   usablemem->size = dt_mem_next_cell(dt_root_size_cells, );
+   endp = reg + (len / sizeof(__be32));
+   while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
+   usable_rgns[nr].base = dt_mem_next_cell(dt_root_addr_cells, 
);
+   usable_rgns[nr].size = dt_mem_next_cell(dt_root_size_cells, 
);
+
+   if (++nr >= MAX_USABLE_RANGES)
+   break;
+   }
 
return 1;
 }
 
 static void __init fdt_enforce_memory_region(void)
 {
-   struct memblock_region reg = {
-   .size = 0,
+   struct memblock_region usable_rgns[MAX_USABLE_RANGES] = {
+   { .size = 0 },
+   { .size = 0 }
};
 
-   of_scan_flat_dt(early_init_dt_scan_usablemem, );
+   of_scan_flat_dt(early_init_dt_scan_usablemem, _rgns);
 
-   if (reg.size)
-   memblock_cap_memory_range(reg.base, reg.size);
+   /*
+* The first range of usable-memory regions is for crash dump
+* kernel with only one region or for high region with two regions,
+* the second range is dedicated for low region if exist.
+*/
+   if (usable_rgns[0].size)
+   memblock_cap_memory_range(usable_rgns[0].base, 
usable_rgns[0].size);
+   if (usable_rgns[1].size)
+   memblock_add(usable_rgns[1].base, usable_rgns[1].size);
 }
 
 void __init arm64_memblock_init(void)
-- 
2.20.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v10 4/5] arm64: kdump: fix kdump broken with ZONE_DMA reintroduced

2020-07-02 Thread Chen Zhou
commit 1a8e1cef7603 ("arm64: use both ZONE_DMA and ZONE_DMA32")
broken the arm64 kdump. If the memory reserved for crash dump kernel
falled in ZONE_DMA32, the devices in crash dump kernel need to use
ZONE_DMA will alloc fail.

This patch addressed the above issue based on "reserving crashkernel
above 4G". Originally, we reserve low memory below 4G, and now just need
to adjust memory limit to arm64_dma_phys_limit in reserve_crashkernel_low
if ZONE_DMA is enabled. That is, if there are devices need to use ZONE_DMA
in crash dump kernel, it is a good choice to use parameters
"crashkernel=X crashkernel=Y,low".

Signed-off-by: Chen Zhou 
---
 kernel/crash_core.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index a7580d291c37..e8ecbbc761a3 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -320,6 +320,7 @@ int __init reserve_crashkernel_low(void)
unsigned long long base, low_base = 0, low_size = 0;
unsigned long total_low_mem;
int ret;
+   phys_addr_t crash_max = 1ULL << 32;
 
total_low_mem = memblock_mem_size(1UL << (32 - PAGE_SHIFT));
 
@@ -352,7 +353,11 @@ int __init reserve_crashkernel_low(void)
return 0;
}
 
-   low_base = memblock_find_in_range(0, 1ULL << 32, low_size, CRASH_ALIGN);
+#ifdef CONFIG_ARM64
+   if (IS_ENABLED(CONFIG_ZONE_DMA))
+   crash_max = arm64_dma_phys_limit;
+#endif
+   low_base = memblock_find_in_range(0, crash_max, low_size, CRASH_ALIGN);
if (!low_base) {
pr_err("Cannot reserve %ldMB crashkernel low memory, please try 
smaller size.\n",
   (unsigned long)(low_size >> 20));
-- 
2.20.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v10 2/5] arm64: kdump: reserve crashkenel above 4G for crash dump kernel

2020-07-02 Thread Chen Zhou
Crashkernel=X tries to reserve memory for the crash dump kernel under
4G. If crashkernel=X,low is specified simultaneously, reserve spcified
size low memory for crash kdump kernel devices firstly and then reserve
memory above 4G.

Suggested by James, just introduced crashkernel=X,low to arm64. As memtioned
above, if crashkernel=X,low is specified simultaneously, reserve spcified
size low memory for crash kdump kernel devices firstly and then reserve
memory above 4G, which is much simpler.

Signed-off-by: Chen Zhou 
Tested-by: John Donnelly 
Tested-by: Prabhakar Kushwaha 
---
 arch/arm64/kernel/setup.c |  8 +++-
 arch/arm64/mm/init.c  | 31 +--
 2 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 93b3844cf442..4dc51a2ac012 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -238,7 +238,13 @@ static void __init request_standard_resources(void)
kernel_data.end <= res->end)
request_resource(res, _data);
 #ifdef CONFIG_KEXEC_CORE
-   /* Userspace will find "Crash kernel" region in /proc/iomem. */
+   /*
+* Userspace will find "Crash kernel" region in /proc/iomem.
+* Note: the low region is renamed as Crash kernel (low).
+*/
+   if (crashk_low_res.end && crashk_low_res.start >= res->start &&
+   crashk_low_res.end <= res->end)
+   request_resource(res, _low_res);
if (crashk_res.end && crashk_res.start >= res->start &&
crashk_res.end <= res->end)
request_resource(res, _res);
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 1e93cfc7c47a..ce7ced85f5fb 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -81,6 +81,7 @@ static void __init reserve_crashkernel(void)
 {
unsigned long long crash_base, crash_size;
int ret;
+   phys_addr_t crash_max = arm64_dma32_phys_limit;
 
ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
_size, _base);
@@ -88,12 +89,38 @@ static void __init reserve_crashkernel(void)
if (ret || !crash_size)
return;
 
+   ret = reserve_crashkernel_low();
+   if (!ret && crashk_low_res.end) {
+   /*
+* If crashkernel=X,low specified, there may be two regions,
+* we need to make some changes as follows:
+*
+* 1. rename the low region as "Crash kernel (low)"
+* In order to distinct from the high region and make no effect
+* to the use of existing kexec-tools, rename the low region as
+* "Crash kernel (low)".
+*
+* 2. change the upper bound for crash memory
+* Set MEMBLOCK_ALLOC_ACCESSIBLE upper bound for crash memory.
+*
+* 3. mark the low region as "nomap"
+* The low region is intended to be used for crash dump kernel
+* devices, just mark the low region as "nomap" simply.
+*/
+   const char *rename = "Crash kernel (low)";
+
+   crashk_low_res.name = rename;
+   crash_max = MEMBLOCK_ALLOC_ACCESSIBLE;
+   memblock_mark_nomap(crashk_low_res.start,
+   resource_size(_low_res));
+   }
+
crash_size = PAGE_ALIGN(crash_size);
 
if (crash_base == 0) {
/* Current arm64 boot protocol requires 2MB alignment */
-   crash_base = memblock_find_in_range(0, arm64_dma32_phys_limit,
-   crash_size, SZ_2M);
+   crash_base = memblock_find_in_range(0, crash_max, crash_size,
+   SZ_2M);
if (crash_base == 0) {
pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
crash_size);
-- 
2.20.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v9 5/5] kdump: update Documentation about crashkernel on arm64

2020-07-02 Thread chenzhou
Hi Dave,


On 2020/7/2 10:59, Dave Young wrote:
> Hi Chen,
> On 06/28/20 at 04:34pm, Chen Zhou wrote:
>> Now we support crashkernel=X,[low] on arm64, update the Documentation.
>> We could use parameters "crashkernel=X crashkernel=Y,low" to reserve
>> memory above 4G.
>>
>> Signed-off-by: Chen Zhou 
>> Tested-by: John Donnelly 
>> Tested-by: Prabhakar Kushwaha 
>> ---
>>  Documentation/admin-guide/kdump/kdump.rst   | 13 +++--
>>  Documentation/admin-guide/kernel-parameters.txt | 17 +++--
>>  2 files changed, 26 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/kdump/kdump.rst 
>> b/Documentation/admin-guide/kdump/kdump.rst
>> index 2da65fef2a1c..6ba294d425c9 100644
>> --- a/Documentation/admin-guide/kdump/kdump.rst
>> +++ b/Documentation/admin-guide/kdump/kdump.rst
>> @@ -299,7 +299,13 @@ Boot into System Kernel
>> "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
>> starting at physical address 0x0100 (16MB) for the dump-capture 
>> kernel.
>>  
>> -   On x86 and x86_64, use "crashkernel=64M@16M".
>> +   On x86 use "crashkernel=64M@16M".
>> +
>> +   On x86_64, use "crashkernel=Y[@X]" to select a region under 4G first, and
>> +   fall back to reserve region above 4G when '@offset' hasn't been 
>> specified.
> Actually crashkernel=Y without the offset works well, I do not see why
> we need the offset, it should be some legacy thing.  So it should be
> better just use the Y without offset here, and just leave a note
> somewhere people can use [@X] offset when they really have to.
>
>> +   We can also use "crashkernel=X,high" to select a region above 4G, which
>> +   also tries to allocate at least 256M below 4G automatically and
>> +   "crashkernel=Y,low" can be used to allocate specified size low memory.
>>  
>> On ppc64, use "crashkernel=128M@32M".
>>  
>> @@ -316,8 +322,11 @@ Boot into System Kernel
>> kernel will automatically locate the crash kernel image within the
>> first 512MB of RAM if X is not given.
>>  
>> -   On arm64, use "crashkernel=Y[@X]".  Note that the start address of
>> +   On arm64, use "crashkernel=Y[@X]". Note that the start address of
>> the kernel, X if explicitly specified, must be aligned to 2MiB 
>> (0x20).
>> +   If crashkernel=Z,low is specified simultaneously, reserve spcified size
>> +   low memory for crash kdump kernel devices firstly and then reserve memory
> "devices" seems not very accurate, maybe just drop the "for crash kdump
> kernel devices" since it is clear based on the context.
>
>> +   above 4G.
>>  
>>  Load the Dump-capture Kernel
>>  
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
>> b/Documentation/admin-guide/kernel-parameters.txt
>> index fb95fad81c79..335431a351c0 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -722,6 +722,9 @@
>>  [KNL, x86_64] select a region under 4G first, and
>>  fall back to reserve region above 4G when '@offset'
>>  hasn't been specified.
>> +[KNL, arm64] If crashkernel=X,low is specified, reserve
>> +spcified size low memory for crash kdump kernel devices
> Ditto.
>
>> +firstly, and then reserve memory above 4G.
>>  See Documentation/admin-guide/kdump/kdump.rst for 
>> further details.
>>  
>>  crashkernel=range1:size1[,range2:size2,...][@offset]
>> @@ -746,13 +749,23 @@
>>  requires at least 64M+32K low memory, also enough extra
>>  low memory is needed to make sure DMA buffers for 32-bit
>>  devices won't run out. Kernel would try to allocate at
>> -at least 256M below 4G automatically.
>> +least 256M below 4G automatically.
>>  This one let user to specify own low range under 4G
>>  for second kernel instead.
>>  0: to disable low allocation.
>>  It will be ignored when crashkernel=X,high is not used
>>  or memory reserved is below 4G.
>> -
>> +[KNL, arm64] range under 4G.
>> +This one let user to specify own low range under 4G
>> +for crash dump kernel instead.
>> +Different with x86_64, kernel allocates specified size
> sounds better:
> s/Different with/Be different from
>
> s/allocates/reserves
>
>> +physical memory region only when this parameter is 
>> specified
>> +instead of trying to allocate at least 256M below 4G
> s/allocate/reserve
>
>> +automatically.
>> +This parameter is used along with crashkernel=X when we
> Could change the passive sentence to below:
> "Use this parameter along with"
Thanks 

[PATCH v2 10/12] ppc64/kexec_file: prepare elfcore header for crashing kernel

2020-07-02 Thread Hari Bathini
Prepare elf headers for the crashing kernel's core file using
crash_prepare_elf64_headers() and pass on this info to kdump
kernel by updating its command line with elfcorehdr parameter.
Also, add elfcorehdr location to reserve map to avoid it from
being stomped on while booting.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* Tried merging adjacent memory ranges on hitting maximum ranges limit
  to reduce reallocations for memory ranges and also, minimize PT_LOAD
  segments for elfcore.
* Updated add_rtas_mem_range() & add_opal_mem_range() callsites based on
  the new prototype for these functions.


 arch/powerpc/include/asm/kexec.h  |6 +
 arch/powerpc/kexec/elf_64.c   |   12 ++
 arch/powerpc/kexec/file_load.c|   49 ++
 arch/powerpc/kexec/file_load_64.c |  181 +
 4 files changed, 248 insertions(+)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 037cf2b..8b0a6d6 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -112,12 +112,18 @@ struct kimage_arch {
unsigned long backup_start;
void *backup_buf;
 
+   unsigned long elfcorehdr_addr;
+   unsigned long elf_headers_sz;
+   void *elf_headers;
+
 #ifdef CONFIG_IMA_KEXEC
phys_addr_t ima_buffer_addr;
size_t ima_buffer_size;
 #endif
 };
 
+char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
+ unsigned long cmdline_len);
 int setup_purgatory(struct kimage *image, const void *slave_code,
const void *fdt, unsigned long kernel_load_addr,
unsigned long fdt_load_addr);
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index 4838b42..40a028c 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -36,6 +36,7 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
void *fdt;
const void *slave_code;
struct elfhdr ehdr;
+   char *modified_cmdline = NULL;
struct kexec_elf_info elf_info;
struct kexec_buf kbuf = { .image = image, .buf_min = 0,
  .buf_max = ppc64_rma_size };
@@ -74,6 +75,16 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
pr_err("Failed to load kdump kernel segments\n");
goto out;
}
+
+   /* Setup cmdline for kdump kernel case */
+   modified_cmdline = setup_kdump_cmdline(image, cmdline,
+  cmdline_len);
+   if (!modified_cmdline) {
+   pr_err("Setting up cmdline for kdump kernel failed\n");
+   ret = -EINVAL;
+   goto out;
+   }
+   cmdline = modified_cmdline;
}
 
if (initrd != NULL) {
@@ -130,6 +141,7 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
pr_err("Error setting up the purgatory.\n");
 
 out:
+   kfree(modified_cmdline);
kexec_free_elf_info(_info);
 
/* Make kimage_file_post_load_cleanup free the fdt buffer for us. */
diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c
index 99a2c4d..2e74992 100644
--- a/arch/powerpc/kexec/file_load.c
+++ b/arch/powerpc/kexec/file_load.c
@@ -17,11 +17,46 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define SLAVE_CODE_SIZE256 /* First 0x100 bytes */
 
 /**
+ * setup_kdump_cmdline - Prepend "elfcorehdr= " to command line
+ *   of kdump kernel for exporting the core.
+ * @image:   Kexec image
+ * @cmdline: Command line parameters to update.
+ * @cmdline_len: Length of the cmdline parameters.
+ *
+ * kdump segment must be setup before calling this function.
+ *
+ * Returns new cmdline buffer for kdump kernel on success, NULL otherwise.
+ */
+char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
+ unsigned long cmdline_len)
+{
+   int elfcorehdr_strlen;
+   char *cmdline_ptr;
+
+   cmdline_ptr = kzalloc(COMMAND_LINE_SIZE, GFP_KERNEL);
+   if (!cmdline_ptr)
+   return NULL;
+
+   elfcorehdr_strlen = sprintf(cmdline_ptr, "elfcorehdr=0x%lx ",
+   image->arch.elfcorehdr_addr);
+
+   if (elfcorehdr_strlen + cmdline_len > COMMAND_LINE_SIZE) {
+   pr_err("Appending elfcorehdr= exceeds cmdline size\n");
+   kfree(cmdline_ptr);
+   return NULL;
+   }
+
+   memcpy(cmdline_ptr + elfcorehdr_strlen, cmdline, cmdline_len);
+   return cmdline_ptr;
+}
+
+/**
  * setup_purgatory - initialize the purgatory's global variables
  * @image: kexec image.
  * @slave_code:Slave code for the purgatory.
@@ -215,6 +250,20 @@ int setup_new_fdt(const struct kimage 

[PATCH v2 11/12] ppc64/kexec_file: add appropriate regions for memory reserve map

2020-07-02 Thread Hari Bathini
While initrd, elfcorehdr and backup regions are already added to the
reserve map, there are a few missing regions that need to be added to
the memory reserve map. Add them here. And now that all the changes
to load panic kernel are in place, claim likewise.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* Updated add_rtas_mem_range() & add_opal_mem_range() callsites based on
  the new prototype for these functions.


 arch/powerpc/kexec/file_load_64.c |   58 ++---
 1 file changed, 53 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kexec/file_load_64.c 
b/arch/powerpc/kexec/file_load_64.c
index 6f895fa..d3b29e0 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -193,6 +193,34 @@ static int get_crash_memory_ranges(struct crash_mem 
**mem_ranges)
 }
 
 /**
+ * get_reserved_memory_ranges - Get reserve memory ranges. This list includes
+ *  memory regions that should be added to the
+ *  memory reserve map to ensure the region is
+ *  protected from any mischeif.
+ * @mem_ranges: Range list to add the memory ranges to.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int get_reserved_memory_ranges(struct crash_mem **mem_ranges)
+{
+   int ret;
+
+   ret = add_rtas_mem_range(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_tce_mem_ranges(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_reserved_ranges(mem_ranges);
+out:
+   if (ret)
+   pr_err("Failed to setup reserved memory ranges\n");
+   return ret;
+}
+
+/**
  * __locate_mem_hole_top_down - Looks top down for a large enough memory hole
  *  in the memory regions between buf_min & buf_max
  *  for the buffer. If found, sets kbuf->mem.
@@ -1200,8 +1228,8 @@ int setup_new_fdt_ppc64(const struct kimage *image, void 
*fdt,
unsigned long initrd_load_addr,
unsigned long initrd_len, const char *cmdline)
 {
-   struct crash_mem *umem = NULL;
-   int chosen_node, ret;
+   struct crash_mem *umem = NULL, *rmem = NULL;
+   int i, nr_ranges, chosen_node, ret;
 
/* Remove memory reservation for the current device tree. */
ret = delete_fdt_mem_rsv(fdt, __pa(initial_boot_params),
@@ -1247,6 +1275,25 @@ int setup_new_fdt_ppc64(const struct kimage *image, void 
*fdt,
}
}
 
+   /* Update memory reserve map */
+   ret = get_reserved_memory_ranges();
+   if (ret)
+   goto out;
+
+   nr_ranges = rmem ? rmem->nr_ranges : 0;
+   for (i = 0; i < nr_ranges; i++) {
+   u64 base, size;
+
+   base = rmem->ranges[i].start;
+   size = rmem->ranges[i].end - base + 1;
+   ret = fdt_add_mem_rsv(fdt, base, size);
+   if (ret) {
+   pr_err("Error updating memory reserve map: %s\n",
+  fdt_strerror(ret));
+   goto out;
+   }
+   }
+
ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len,
cmdline, _node);
if (ret)
@@ -1257,6 +1304,7 @@ int setup_new_fdt_ppc64(const struct kimage *image, void 
*fdt,
pr_err("Failed to update device-tree with 
linux,booted-from-kexec\n");
 out:
kfree(umem);
+   kfree(rmem);
return ret;
 }
 
@@ -1434,10 +1482,10 @@ int arch_kexec_kernel_image_probe(struct kimage *image, 
void *buf,
 
/* Get exclude memory ranges needed for setting up kdump 
segments */
ret = get_exclude_memory_ranges(&(image->arch.exclude_ranges));
-   if (ret)
+   if (ret) {
pr_err("Failed to setup exclude memory ranges for 
buffer lookup\n");
-   /* Return this until all changes for panic kernel are in */
-   return -EOPNOTSUPP;
+   return ret;
+   }
}
 
return kexec_image_probe_default(image, buf, buf_len);


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2 12/12] ppc64/kexec_file: fix kexec load failure with lack of memory hole

2020-07-02 Thread Hari Bathini
The kexec purgatory has to run in real mode. Only the first memory
block maybe accessible in real mode. And, unlike the case with panic
kernel, no memory is set aside for regular kexec load. Another thing
to note is, the memory for crashkernel is reserved at an offset of
128MB. So, when crashkernel memory is reserved, the memory ranges to
load kexec segments shrink further as the generic code only looks for
memblock free memory ranges and in all likelihood only a tiny bit of
memory from 0 to 128MB would be available to load kexec segments.

With kdump being used by default in general, kexec file load is likely
to fail almost always. This can be fixed by changing the memory hole
lookup logic for regular kexec to use the same method as kdump. This
would mean that most kexec segments will overlap with crashkernel
memory region. That should still be ok as the pages, whose destination
address isn't available while loading, are placed in an intermediate
location till a flush to the actual destination address happens during
kexec boot sequence.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* New patch to fix locating memory hole for kexec_file_load (kexec -s -l)
  when memory is reserved for crashkernel.


 arch/powerpc/kexec/file_load_64.c |   33 ++---
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/kexec/file_load_64.c 
b/arch/powerpc/kexec/file_load_64.c
index d3b29e0..746c16f 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -1326,13 +1326,6 @@ int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
u64 buf_min, buf_max;
int ret;
 
-   /*
-* Use the generic kexec_locate_mem_hole for regular
-* kexec_file_load syscall
-*/
-   if (kbuf->image->type != KEXEC_TYPE_CRASH)
-   return kexec_locate_mem_hole(kbuf);
-
/* Look up the exclude ranges list while locating the memory hole */
emem = &(kbuf->image->arch.exclude_ranges);
if (!(*emem) || ((*emem)->nr_ranges == 0)) {
@@ -1340,11 +1333,15 @@ int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
return 0;
}
 
+   buf_min = kbuf->buf_min;
+   buf_max = kbuf->buf_max;
/* Segments for kdump kernel should be within crashkernel region */
-   buf_min = (kbuf->buf_min < crashk_res.start ?
-  crashk_res.start : kbuf->buf_min);
-   buf_max = (kbuf->buf_max > crashk_res.end ?
-  crashk_res.end : kbuf->buf_max);
+   if (kbuf->image->type == KEXEC_TYPE_CRASH) {
+   buf_min = (buf_min < crashk_res.start ?
+  crashk_res.start : buf_min);
+   buf_max = (buf_max > crashk_res.end ?
+  crashk_res.end : buf_max);
+   }
 
if (buf_min > buf_max) {
pr_err("Invalid buffer min and/or max values\n");
@@ -1477,15 +1474,13 @@ int arch_kexec_apply_relocations_add(struct 
purgatory_info *pi,
 int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
  unsigned long buf_len)
 {
-   if (image->type == KEXEC_TYPE_CRASH) {
-   int ret;
+   int ret;
 
-   /* Get exclude memory ranges needed for setting up kdump 
segments */
-   ret = get_exclude_memory_ranges(&(image->arch.exclude_ranges));
-   if (ret) {
-   pr_err("Failed to setup exclude memory ranges for 
buffer lookup\n");
-   return ret;
-   }
+   /* Get exclude memory ranges needed for setting up kexec segments */
+   ret = get_exclude_memory_ranges(&(image->arch.exclude_ranges));
+   if (ret) {
+   pr_err("Failed to setup exclude memory ranges for buffer 
lookup\n");
+   return ret;
}
 
return kexec_image_probe_default(image, buf, buf_len);


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2 09/12] ppc64/kexec_file: setup backup region for kdump kernel

2020-07-02 Thread Hari Bathini
Though kdump kernel boots from loaded address, the first 64K bytes
of it is copied down to real 0. So, setup a backup region to copy
the first 64K bytes of crashed kernel, in purgatory, before booting
into kdump kernel. Also, update reserve map with backup region and
crashed kernel's memory to avoid kdump kernel from accidentially
using that memory.

Reported-by: kernel test robot 
[lkp: In v1, purgatory() declaration was missing]
Signed-off-by: Hari Bathini 
---

Changes in v2:
* Check if backup region is available before branching out. This is
  to keep `kexec -l -s` flow as before as much as possible. This would
  eventually change with more testing and addition of sha256 digest
  verification support.
* Fixed missing prototype for purgatory() as reported by lkp.
  lkp report for reference:
- https://lore.kernel.org/patchwork/patch/1264423/


 arch/powerpc/include/asm/crashdump-ppc64.h |5 +
 arch/powerpc/include/asm/kexec.h   |7 ++
 arch/powerpc/include/asm/purgatory.h   |   11 +++
 arch/powerpc/kexec/elf_64.c|9 +++
 arch/powerpc/kexec/file_load_64.c  |   95 
 arch/powerpc/purgatory/Makefile|   28 
 arch/powerpc/purgatory/purgatory_64.c  |   36 +++
 arch/powerpc/purgatory/trampoline_64.S |   28 +++-
 8 files changed, 211 insertions(+), 8 deletions(-)
 create mode 100644 arch/powerpc/include/asm/purgatory.h
 create mode 100644 arch/powerpc/purgatory/purgatory_64.c

diff --git a/arch/powerpc/include/asm/crashdump-ppc64.h 
b/arch/powerpc/include/asm/crashdump-ppc64.h
index 90deb46..fcc5fce 100644
--- a/arch/powerpc/include/asm/crashdump-ppc64.h
+++ b/arch/powerpc/include/asm/crashdump-ppc64.h
@@ -2,6 +2,11 @@
 #ifndef _ASM_POWERPC_CRASHDUMP_PPC64_H
 #define _ASM_POWERPC_CRASHDUMP_PPC64_H
 
+/* Backup region - first 64K bytes of System RAM. */
+#define BACKUP_SRC_START   0
+#define BACKUP_SRC_END 0x
+#define BACKUP_SRC_SIZE(BACKUP_SRC_END - BACKUP_SRC_START + 1)
+
 /* min & max addresses for kdump load segments */
 #define KDUMP_BUF_MIN  (crashk_res.start)
 #define KDUMP_BUF_MAX  ((crashk_res.end < ppc64_rma_size) ? \
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index e78cd0a..037cf2b 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -109,6 +109,9 @@ extern const struct kexec_file_ops kexec_elf64_ops;
 struct kimage_arch {
struct crash_mem *exclude_ranges;
 
+   unsigned long backup_start;
+   void *backup_buf;
+
 #ifdef CONFIG_IMA_KEXEC
phys_addr_t ima_buffer_addr;
size_t ima_buffer_size;
@@ -124,6 +127,10 @@ int setup_new_fdt(const struct kimage *image, void *fdt,
 int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
 
 #ifdef CONFIG_PPC64
+struct kexec_buf;
+
+int load_crashdump_segments_ppc64(struct kimage *image,
+ struct kexec_buf *kbuf);
 int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
  const void *fdt, unsigned long kernel_load_addr,
  unsigned long fdt_load_addr);
diff --git a/arch/powerpc/include/asm/purgatory.h 
b/arch/powerpc/include/asm/purgatory.h
new file mode 100644
index 000..076d150
--- /dev/null
+++ b/arch/powerpc/include/asm/purgatory.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_POWERPC_PURGATORY_H
+#define _ASM_POWERPC_PURGATORY_H
+
+#ifndef __ASSEMBLY__
+#include 
+
+void purgatory(void);
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_POWERPC_PURGATORY_H */
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index c695f94..4838b42 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -67,6 +67,15 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
 
pr_debug("Loaded purgatory at 0x%lx\n", pbuf.mem);
 
+   /* Setup additional segments needed for panic kernel */
+   if (image->type == KEXEC_TYPE_CRASH) {
+   ret = load_crashdump_segments_ppc64(image, );
+   if (ret) {
+   pr_err("Failed to load kdump kernel segments\n");
+   goto out;
+   }
+   }
+
if (initrd != NULL) {
kbuf.buffer = initrd;
kbuf.bufsz = kbuf.memsz = initrd_len;
diff --git a/arch/powerpc/kexec/file_load_64.c 
b/arch/powerpc/kexec/file_load_64.c
index f06dcf1..f91530e 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -858,6 +859,69 @@ static int kexec_do_relocs_ppc64(unsigned long my_r2, 
const Elf_Sym *sym,
 }
 
 /**
+ * load_backup_segment - Initialize backup segment of crashing kernel.
+ * @image:   Kexec image.
+ * @kbuf:Buffer contents 

[PATCH v2 06/12] ppc64/kexec_file: restrict memory usage of kdump kernel

2020-07-02 Thread Hari Bathini
Kdump kernel, used for capturing the kernel core image, is supposed
to use only specific memory regions to avoid corrupting the image to
be captured. The regions are crashkernel range - the memory reserved
explicitly for kdump kernel, memory used for the tce-table, the OPAL
region and RTAS region as applicable. Restrict kdump kernel memory
to use only these regions by setting up usable-memory DT property.
Also, tell the kdump kernel to run at the loaded address by setting
the magic word at 0x5c.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* Fixed off-by-one error while setting up usable-memory properties.
* Updated add_rtas_mem_range() & add_opal_mem_range() callsites based on
  the new prototype for these functions.


 arch/powerpc/kexec/file_load_64.c |  401 +
 1 file changed, 399 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kexec/file_load_64.c 
b/arch/powerpc/kexec/file_load_64.c
index 932e0e5..08c71be 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -17,10 +17,22 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
+#include 
 #include 
 #include 
 
+struct umem_info {
+   uint64_t *buf; /* data buffer for usable-memory property */
+   uint32_t idx;  /* current index */
+   uint32_t size; /* size allocated for the data buffer */
+
+   /* usable memory ranges to look up */
+   const struct crash_mem *umrngs;
+};
+
 const struct kexec_file_ops * const kexec_file_loaders[] = {
_elf64_ops,
NULL
@@ -76,6 +88,38 @@ static int get_exclude_memory_ranges(struct crash_mem 
**mem_ranges)
 }
 
 /**
+ * get_usable_memory_ranges - Get usable memory ranges. This list includes
+ *regions like crashkernel, opal/rtas & tce-table,
+ *that kdump kernel could use.
+ * @mem_ranges:   Range list to add the memory ranges to.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int get_usable_memory_ranges(struct crash_mem **mem_ranges)
+{
+   int ret;
+
+   /* First memory block & crashkernel region */
+   ret = add_mem_range(mem_ranges, 0, crashk_res.end + 1);
+   if (ret)
+   goto out;
+
+   ret = add_rtas_mem_range(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_opal_mem_range(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_tce_mem_ranges(mem_ranges);
+out:
+   if (ret)
+   pr_err("Failed to setup usable memory ranges\n");
+   return ret;
+}
+
+/**
  * __locate_mem_hole_top_down - Looks top down for a large enough memory hole
  *  in the memory regions between buf_min & buf_max
  *  for the buffer. If found, sets kbuf->mem.
@@ -261,6 +305,322 @@ static int locate_mem_hole_bottom_up_ppc64(struct 
kexec_buf *kbuf,
 }
 
 /**
+ * check_realloc_usable_mem - Reallocate buffer if it can't accommodate entries
+ * @um_info:  Usable memory buffer and ranges info.
+ * @cnt:  No. of entries to accommodate.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static uint64_t *check_realloc_usable_mem(struct umem_info *um_info, int cnt)
+{
+   void *tbuf;
+
+   if (um_info->size >=
+   ((um_info->idx + cnt) * sizeof(*(um_info->buf
+   return um_info->buf;
+
+   um_info->size += MEM_RANGE_CHUNK_SZ;
+   tbuf = krealloc(um_info->buf, um_info->size, GFP_KERNEL);
+   if (!tbuf) {
+   um_info->size -= MEM_RANGE_CHUNK_SZ;
+   return NULL;
+   }
+
+   memset(tbuf + um_info->idx, 0, MEM_RANGE_CHUNK_SZ);
+   return tbuf;
+}
+
+/**
+ * add_usable_mem - Add the usable memory ranges within the given memory range
+ *  to the buffer
+ * @um_info:Usable memory buffer and ranges info.
+ * @base:   Base address of memory range to look for.
+ * @end:End address of memory range to look for.
+ * @cnt:No. of usable memory ranges added to buffer.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int add_usable_mem(struct umem_info *um_info, uint64_t base,
+ uint64_t end, int *cnt)
+{
+   uint64_t loc_base, loc_end, *buf;
+   const struct crash_mem *umrngs;
+   int i, add;
+
+   *cnt = 0;
+   umrngs = um_info->umrngs;
+   for (i = 0; i < umrngs->nr_ranges; i++) {
+   add = 0;
+   loc_base = umrngs->ranges[i].start;
+   loc_end = umrngs->ranges[i].end;
+   if (loc_base >= base && loc_end <= end)
+   add = 1;
+   else if (base < loc_end && end > loc_base) {
+   if (loc_base < base)
+   loc_base = base;
+   if (loc_end > end)
+   loc_end = end;
+

[PATCH v2 05/12] powerpc/drmem: make lmb walk a bit more flexible

2020-07-02 Thread Hari Bathini
Currently, numa & prom are the users of drmem lmb walk code. Loading
kdump with kexec_file also needs to walk the drmem LMBs to setup the
usable memory ranges for kdump kernel. But there are couple of issues
in using the code as is. One, walk_drmem_lmb() code is built into the
.init section currently, while kexec_file needs it later. Two, there
is no scope to pass data to the callback function for processing and/
or erroring out on certain conditions.

Fix that by, moving drmem LMB walk code out of .init section, adding
scope to pass data to the callback function and bailing out when
an error is encountered in the callback function.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* No changes.


 arch/powerpc/include/asm/drmem.h |9 ++--
 arch/powerpc/kernel/prom.c   |   13 +++---
 arch/powerpc/mm/drmem.c  |   87 +-
 arch/powerpc/mm/numa.c   |   13 +++---
 4 files changed, 78 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209..17ccc64 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -90,13 +90,14 @@ static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb)
 }
 
 u64 drmem_lmb_memory_max(void);
-void __init walk_drmem_lmbs(struct device_node *dn,
-   void (*func)(struct drmem_lmb *, const __be32 **));
+int walk_drmem_lmbs(struct device_node *dn, void *data,
+   int (*func)(struct drmem_lmb *, const __be32 **, void *));
 int drmem_update_dt(void);
 
 #ifdef CONFIG_PPC_PSERIES
-void __init walk_drmem_lmbs_early(unsigned long node,
-   void (*func)(struct drmem_lmb *, const __be32 **));
+int __init
+walk_drmem_lmbs_early(unsigned long node, void *data,
+ int (*func)(struct drmem_lmb *, const __be32 **, void *));
 #endif
 
 static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 9cc49f2..7df78de 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -468,8 +468,9 @@ static bool validate_mem_limit(u64 base, u64 *size)
  * This contains a list of memory blocks along with NUMA affinity
  * information.
  */
-static void __init early_init_drmem_lmb(struct drmem_lmb *lmb,
-   const __be32 **usm)
+static int  __init early_init_drmem_lmb(struct drmem_lmb *lmb,
+   const __be32 **usm,
+   void *data)
 {
u64 base, size;
int is_kexec_kdump = 0, rngs;
@@ -484,7 +485,7 @@ static void __init early_init_drmem_lmb(struct drmem_lmb 
*lmb,
 */
if ((lmb->flags & DRCONF_MEM_RESERVED) ||
!(lmb->flags & DRCONF_MEM_ASSIGNED))
-   return;
+   return 0;
 
if (*usm)
is_kexec_kdump = 1;
@@ -499,7 +500,7 @@ static void __init early_init_drmem_lmb(struct drmem_lmb 
*lmb,
 */
rngs = dt_mem_next_cell(dt_root_size_cells, usm);
if (!rngs) /* there are no (base, size) duple */
-   return;
+   return 0;
}
 
do {
@@ -524,6 +525,8 @@ static void __init early_init_drmem_lmb(struct drmem_lmb 
*lmb,
if (lmb->flags & DRCONF_MEM_HOTREMOVABLE)
memblock_mark_hotplug(base, size);
} while (--rngs);
+
+   return 0;
 }
 #endif /* CONFIG_PPC_PSERIES */
 
@@ -534,7 +537,7 @@ static int __init early_init_dt_scan_memory_ppc(unsigned 
long node,
 #ifdef CONFIG_PPC_PSERIES
if (depth == 1 &&
strcmp(uname, "ibm,dynamic-reconfiguration-memory") == 0) {
-   walk_drmem_lmbs_early(node, early_init_drmem_lmb);
+   walk_drmem_lmbs_early(node, NULL, early_init_drmem_lmb);
return 0;
}
 #endif
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 59327ce..b2eeea3 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -14,6 +14,8 @@
 #include 
 #include 
 
+static int n_root_addr_cells, n_root_size_cells;
+
 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
 
@@ -189,12 +191,13 @@ int drmem_update_dt(void)
return rc;
 }
 
-static void __init read_drconf_v1_cell(struct drmem_lmb *lmb,
+static void read_drconf_v1_cell(struct drmem_lmb *lmb,
   const __be32 **prop)
 {
const __be32 *p = *prop;
 
-   lmb->base_addr = dt_mem_next_cell(dt_root_addr_cells, );
+   lmb->base_addr = of_read_number(p, n_root_addr_cells);
+   p += n_root_addr_cells;
lmb->drc_index = of_read_number(p++, 1);
 
p++; /* skip reserved field */
@@ -205,29 +208,33 @@ static void __init read_drconf_v1_cell(struct drmem_lmb 
*lmb,
*prop = p;
 }
 
-static void __init 

[PATCH v2 08/12] ppc64/kexec_file: setup the stack for purgatory

2020-07-02 Thread Hari Bathini
To avoid any weird errors, the purgatory should run with its own
stack. Set one up by adding the stack buffer to .data section of
the purgatory. Also, setup opal base & entry values in r8 & r9
registers to help early OPAL debugging.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* Setting up opal base & entry values in r8 & r9 for early OPAL debug.


 arch/powerpc/include/asm/kexec.h   |4 
 arch/powerpc/kexec/file_load_64.c  |   29 +
 arch/powerpc/purgatory/trampoline_64.S |   32 
 3 files changed, 65 insertions(+)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index bf47a01..e78cd0a 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -45,6 +45,10 @@
 #define KEXEC_ARCH KEXEC_ARCH_PPC
 #endif
 
+#ifdef CONFIG_KEXEC_FILE
+#define KEXEC_PURGATORY_STACK_SIZE 16384   /* 16KB stack size */
+#endif
+
 #define KEXEC_STATE_NONE 0
 #define KEXEC_STATE_IRQS_OFF 1
 #define KEXEC_STATE_REAL_MODE 2
diff --git a/arch/powerpc/kexec/file_load_64.c 
b/arch/powerpc/kexec/file_load_64.c
index adca9c0..f06dcf1 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -873,6 +873,8 @@ int setup_purgatory_ppc64(struct kimage *image, const void 
*slave_code,
  const void *fdt, unsigned long kernel_load_addr,
  unsigned long fdt_load_addr)
 {
+   struct device_node *dn;
+   void *stack_buf;
uint64_t val;
int ret;
 
@@ -896,10 +898,37 @@ int setup_purgatory_ppc64(struct kimage *image, const 
void *slave_code,
goto out;
}
 
+   /* Setup the stack top */
+   stack_buf = kexec_purgatory_get_symbol_addr(image, "stack_buf");
+   if (!stack_buf)
+   goto out;
+
+   val = (u64)stack_buf + KEXEC_PURGATORY_STACK_SIZE;
+   ret = kexec_purgatory_get_set_symbol(image, "stack", , sizeof(val),
+false);
+   if (ret)
+   goto out;
+
/* Setup the TOC pointer */
val = get_toc_ptr(image->purgatory_info.ehdr);
ret = kexec_purgatory_get_set_symbol(image, "my_toc", , sizeof(val),
 false);
+   if (ret)
+   goto out;
+
+   /* Setup OPAL base & entry values */
+   dn = of_find_node_by_path("/ibm,opal");
+   if (dn) {
+   of_property_read_u64(dn, "opal-base-address", );
+   ret = kexec_purgatory_get_set_symbol(image, "opal_base", ,
+sizeof(val), false);
+   if (ret)
+   goto out;
+
+   of_property_read_u64(dn, "opal-entry-address", );
+   ret = kexec_purgatory_get_set_symbol(image, "opal_entry", ,
+sizeof(val), false);
+   }
 out:
if (ret)
pr_err("Failed to setup purgatory symbols");
diff --git a/arch/powerpc/purgatory/trampoline_64.S 
b/arch/powerpc/purgatory/trampoline_64.S
index 7b4a5f7..83e93b7 100644
--- a/arch/powerpc/purgatory/trampoline_64.S
+++ b/arch/powerpc/purgatory/trampoline_64.S
@@ -9,6 +9,7 @@
  * Copyright (C) 2013, Anton Blanchard, IBM Corporation
  */
 
+#include 
 #include 
 
.machine ppc64
@@ -53,6 +54,8 @@ master:
 
ld  %r2,(my_toc - 0b)(%r18) /* setup toc */
 
+   ld  %r1,(stack - 0b)(%r18)  /* setup stack */
+
/* load device-tree address */
ld  %r3, (dt_offset - 0b)(%r18)
mr  %r16,%r3/* save dt address in reg16 */
@@ -63,6 +66,11 @@ master:
li  %r4,28
STWX_BE %r17,%r3,%r4/* Store my cpu as __be32 at byte 28 */
 1:
+
+   /* Load opal base and entry values in r8 & r9 respectively */
+   ld  %r8,(opal_base - 0b)(%r18)
+   ld  %r9,(opal_entry - 0b)(%r18)
+
/* load the kernel address */
ld  %r4,(kernel - 0b)(%r18)
 
@@ -111,6 +119,24 @@ my_toc:
.8byte  0x0
.size my_toc, . - my_toc
 
+   .balign 8
+   .globl stack
+stack:
+   .8byte  0x0
+   .size stack, . - stack
+
+   .balign 8
+   .globl opal_base
+opal_base:
+   .8byte  0x0
+   .size opal_base, . - opal_base
+
+   .balign 8
+   .globl opal_entry
+opal_entry:
+   .8byte  0x0
+   .size opal_entry, . - opal_entry
+
.data
.balign 8
 .globl purgatory_sha256_digest
@@ -123,3 +149,9 @@ purgatory_sha256_digest:
 purgatory_sha_regions:
.skip   8 * 2 * 16
.size purgatory_sha_regions, . - purgatory_sha_regions
+
+   .balign 8
+.globl stack_buf
+stack_buf:
+   .skip   KEXEC_PURGATORY_STACK_SIZE
+   .size stack_buf, . - stack_buf


___
kexec mailing list
kexec@lists.infradead.org

[PATCH v2 07/12] ppc64/kexec_file: add support to relocate purgatory

2020-07-02 Thread Hari Bathini
Right now purgatory implementation is only minimal. But if purgatory
code is to be enhanced to copy memory to the backup region and verify
sha256 digest, relocations may have to be applied to the purgatory.
So, add support to relocate purgatory in kexec_file_load system call
by setting up TOC pointer and applying RELA relocations as needed.

Reported-by: kernel test robot 
[lkp: In v1, 'struct mem_sym' was declared in parameter list]
Signed-off-by: Hari Bathini 
---

Changes in v2:
* Fixed wrong use of 'struct mem_sym' in local_entry_offset() as
  reported by lkp. lkp report for reference:
- https://lore.kernel.org/patchwork/patch/1264421/


 arch/powerpc/kexec/file_load_64.c  |  338 
 arch/powerpc/purgatory/trampoline_64.S |8 +
 2 files changed, 346 insertions(+)

diff --git a/arch/powerpc/kexec/file_load_64.c 
b/arch/powerpc/kexec/file_load_64.c
index 08c71be..adca9c0 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -621,6 +622,242 @@ static int update_usable_mem_fdt(void *fdt, struct 
crash_mem *usable_mem)
 }
 
 /**
+ * get_toc_section - Look for ".toc" symbol and return the corresponding 
section
+ * @ehdr:ELF header.
+ *
+ * Returns TOC section on success, NULL otherwise.
+ */
+static const Elf_Shdr *get_toc_section(const Elf_Ehdr *ehdr)
+{
+   const Elf_Shdr *sechdrs;
+   const char *secstrings;
+   int i;
+
+   if (!ehdr) {
+   pr_err("Purgatory elf load info missing?\n");
+   return NULL;
+   }
+
+   sechdrs = (void *)ehdr + ehdr->e_shoff;
+   secstrings = (void *)ehdr + sechdrs[ehdr->e_shstrndx].sh_offset;
+
+   for (i = 0; i < ehdr->e_shnum; i++) {
+   if ((sechdrs[i].sh_size != 0) &&
+   (strcmp(secstrings + sechdrs[i].sh_name, ".toc") == 0)) {
+   /* Return the ".toc" section */
+   pr_debug("TOC section number is %d\n", i);
+   return [i];
+   }
+   }
+
+   return NULL;
+}
+
+/**
+ * get_toc_ptr - r2 is the TOC pointer: it points 0x8000 into the TOC
+ * @ehdr:ELF header.
+ *
+ * Returns r2 on success, 0 otherwise.
+ */
+static unsigned long get_toc_ptr(const Elf_Ehdr *ehdr)
+{
+   const Elf_Shdr *sechdr;
+
+   sechdr = get_toc_section(ehdr);
+   if (!sechdr) {
+   pr_err("Could not get the TOC section!\n");
+   return 0;
+   }
+
+   return sechdr->sh_addr + 0x8000;
+}
+
+/* Helper functions to apply relocations */
+static int do_relative_toc(unsigned long val, uint16_t *loc,
+  unsigned long mask, int complain_signed)
+{
+   if (complain_signed && (val + 0x8000 > 0x)) {
+   pr_err("TOC16 relocation overflows (%lu)\n", val);
+   return -ENOEXEC;
+   }
+
+   if ((~mask & 0x) & val) {
+   pr_err("Bad TOC16 relocation (%lu)\n", val);
+   return -ENOEXEC;
+   }
+
+   *loc = (*loc & ~mask) | (val & mask);
+   return 0;
+}
+#ifdef PPC64_ELF_ABI_v2
+/* PowerPC64 specific values for the Elf64_Sym st_other field.  */
+#define STO_PPC64_LOCAL_BIT5
+#define STO_PPC64_LOCAL_MASK   (7 << STO_PPC64_LOCAL_BIT)
+#define PPC64_LOCAL_ENTRY_OFFSET(other)
\
+   (((1 << (((other) & STO_PPC64_LOCAL_MASK) >> STO_PPC64_LOCAL_BIT)) \
+>> 2) << 2)
+
+static unsigned int local_entry_offset(const Elf64_Sym *sym)
+{
+   /* If this symbol has a local entry point, use it. */
+   return PPC64_LOCAL_ENTRY_OFFSET(sym->st_other);
+}
+#else
+static unsigned int local_entry_offset(const Elf64_Sym *sym)
+{
+   return 0;
+}
+#endif
+
+/**
+ * kexec_do_relocs_ppc64 - Apply relocations based on relocation type.
+ * @my_r2: TOC pointer.
+ * @sym:   Symbol to relocate.
+ * @r_type:Relocation type.
+ * @loc:   Location to modify.
+ * @val:   Relocated symbol value.
+ * @addr:  Final location after relocation.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int kexec_do_relocs_ppc64(unsigned long my_r2, const Elf_Sym *sym,
+int r_type, void *loc, unsigned long val,
+unsigned long addr)
+{
+   int ret = 0;
+
+   switch (r_type) {
+   case R_PPC64_ADDR32:
+   /* Simply set it */
+   *(uint32_t *)loc = val;
+   break;
+
+   case R_PPC64_ADDR64:
+   /* Simply set it */
+   *(uint64_t *)loc = val;
+   break;
+
+   case R_PPC64_REL64:
+   *(uint64_t *)loc = val - (uint64_t)loc;
+   break;
+
+   case R_PPC64_REL32:
+   /* Convert value to relative */
+   

[PATCH v2 00/12] ppc64: enable kdump support for kexec_file_load syscall

2020-07-02 Thread Hari Bathini
This patch series enables kdump support for kexec_file_load system
call (kexec -s -p) on PPC64. The changes are inspired from kexec-tools
code but heavily modified for kernel consumption. There is scope to
expand purgatory to verify sha256 digest along with other improvements
in purgatory code. Will deal with those changes in a separate patch
series later.

The first patch adds a weak arch_kexec_locate_mem_hole() function to
override locate memory hole logic suiting arch needs. There are some
special regions in ppc64 which should be avoided while loading buffer
& there are multiple callers to kexec_add_buffer making it complicated
to maintain range sanity and using generic lookup at the same time.

The second patch marks ppc64 specific code within arch/powerpc/kexec
and arch/powerpc/purgatory to make the subsequent code changes easy
to understand.

The next patch adds helper function to setup different memory ranges
needed for loading kdump kernel, booting into it and exporting the
crashing kernel's elfcore.

The fourth patch overrides arch_kexec_locate_mem_hole() function to
locate memory hole for kdump segments by accounting for the special
memory regions, referred to as excluded memory ranges, and sets
kbuf->mem when a suitable memory region is found.

The fifth patch moves walk_drmem_lmbs() out of .init section with
a few changes to reuse it for setting up kdump kernel's usable memory
ranges. The next patch uses walk_drmem_lmbs() to look up the LMBs
and set linux,drconf-usable-memory & linux,usable-memory properties
in order to restrict kdump kernel's memory usage.

The seventh patch adds relocation support for the purgatory. Patch 8
helps setup the stack for the purgatory. The next patch setups up
backup region as a segment while loading kdump kernel and teaches
purgatory to copy it from source to destination.

Patch 10 builds the elfcore header for the running kernel & passes
the info to kdump kernel via "elfcorehdr=" parameter to export as
/proc/vmcore file. The next patch sets up the memory reserve map
for the kexec kernel and also claims kdump support for kdump as
all the necessary changes are added.

The last patch fixes a lookup issue for `kexec -l -s` case when
memory is reserved for crashkernel.

Tested the changes successfully on P8, P9 lpars, couple of OpenPOWER
boxes and a simulator.

Changes in v2:
* Introduced arch_kexec_locate_mem_hole() for override and dropped
  weak arch_kexec_add_buffer().
* Addressed warnings reported by lkp.
* Added patch to address kexec load issue when memory is reserved
  for crashkernel.
* Used the appropriate license header for the new files added.
* Added an option to merge ranges to minimize reallocations while
  adding memory ranges.
* Dropped within_crashkernel parameter for add_opal_mem_range() &
  add_rtas_mem_range() functions as it is not really needed.

---

Hari Bathini (12):
  kexec_file: allow archs to handle special regions while locating memory 
hole
  powerpc/kexec_file: mark PPC64 specific code
  powerpc/kexec_file: add helper functions for getting memory ranges
  ppc64/kexec_file: avoid stomping memory used by special regions
  powerpc/drmem: make lmb walk a bit more flexible
  ppc64/kexec_file: restrict memory usage of kdump kernel
  ppc64/kexec_file: add support to relocate purgatory
  ppc64/kexec_file: setup the stack for purgatory
  ppc64/kexec_file: setup backup region for kdump kernel
  ppc64/kexec_file: prepare elfcore header for crashing kernel
  ppc64/kexec_file: add appropriate regions for memory reserve map
  ppc64/kexec_file: fix kexec load failure with lack of memory hole


 arch/powerpc/include/asm/crashdump-ppc64.h |   15 
 arch/powerpc/include/asm/drmem.h   |9 
 arch/powerpc/include/asm/kexec.h   |   35 +
 arch/powerpc/include/asm/kexec_ranges.h|   18 
 arch/powerpc/include/asm/purgatory.h   |   11 
 arch/powerpc/kernel/prom.c |   13 
 arch/powerpc/kexec/Makefile|2 
 arch/powerpc/kexec/elf_64.c|   35 +
 arch/powerpc/kexec/file_load.c |   78 +
 arch/powerpc/kexec/file_load_64.c  | 1509 
 arch/powerpc/kexec/ranges.c|  397 +++
 arch/powerpc/mm/drmem.c|   87 +-
 arch/powerpc/mm/numa.c |   13 
 arch/powerpc/purgatory/Makefile|   28 -
 arch/powerpc/purgatory/purgatory_64.c  |   36 +
 arch/powerpc/purgatory/trampoline.S|  117 --
 arch/powerpc/purgatory/trampoline_64.S |  175 +++
 include/linux/kexec.h  |   29 -
 kernel/kexec_file.c|   16 
 19 files changed, 2413 insertions(+), 210 deletions(-)
 create mode 100644 arch/powerpc/include/asm/crashdump-ppc64.h
 create mode 100644 arch/powerpc/include/asm/kexec_ranges.h
 create mode 100644 arch/powerpc/include/asm/purgatory.h
 create mode 100644 arch/powerpc/kexec/file_load_64.c
 create 

[PATCH v2 03/12] powerpc/kexec_file: add helper functions for getting memory ranges

2020-07-02 Thread Hari Bathini
In kexec case, the kernel to be loaded uses the same memory layout as
the running kernel. So, passing on the DT of the running kernel would
be good enough.

But in case of kdump, different memory ranges are needed to manage
loading the kdump kernel, booting into it and exporting the elfcore
of the crashing kernel. The ranges are exlude memory ranges, usable
memory ranges, reserved memory ranges and crash memory ranges.

Exclude memory ranges specify the list of memory ranges to avoid while
loading kdump segments. Usable memory ranges list the memory ranges
that could be used for booting kdump kernel. Reserved memory ranges
list the memory regions for the loading kernel's reserve map. Crash
memory ranges list the memory ranges to be exported as the crashing
kernel's elfcore.

Add helper functions for setting up the above mentioned memory ranges.
This helpers facilitate in understanding the subsequent changes better
and make it easy to setup the different memory ranges listed above, as
and when appropriate.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* Added an option to merge ranges while sorting to minimize reallocations
  for memory ranges list.
* Dropped within_crashkernel option for add_opal_mem_range() &
  add_rtas_mem_range() as it is not really needed.


 arch/powerpc/include/asm/kexec_ranges.h |   18 +
 arch/powerpc/kexec/Makefile |2 
 arch/powerpc/kexec/ranges.c |  397 +++
 3 files changed, 416 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/kexec_ranges.h
 create mode 100644 arch/powerpc/kexec/ranges.c

diff --git a/arch/powerpc/include/asm/kexec_ranges.h 
b/arch/powerpc/include/asm/kexec_ranges.h
new file mode 100644
index 000..799dc40
--- /dev/null
+++ b/arch/powerpc/include/asm/kexec_ranges.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_POWERPC_KEXEC_RANGES_H
+#define _ASM_POWERPC_KEXEC_RANGES_H
+
+#define MEM_RANGE_CHUNK_SZ 2048/* Memory ranges size chunk */
+
+struct crash_mem *realloc_mem_ranges(struct crash_mem **mem_ranges);
+int add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size);
+int add_tce_mem_ranges(struct crash_mem **mem_ranges);
+int add_initrd_mem_range(struct crash_mem **mem_ranges);
+int add_htab_mem_range(struct crash_mem **mem_ranges);
+int add_kernel_mem_range(struct crash_mem **mem_ranges);
+int add_rtas_mem_range(struct crash_mem **mem_ranges);
+int add_opal_mem_range(struct crash_mem **mem_ranges);
+int add_reserved_ranges(struct crash_mem **mem_ranges);
+void sort_memory_ranges(struct crash_mem *mrngs, bool merge);
+
+#endif /* _ASM_POWERPC_KEXEC_RANGES_H */
diff --git a/arch/powerpc/kexec/Makefile b/arch/powerpc/kexec/Makefile
index 67c3553..4aff684 100644
--- a/arch/powerpc/kexec/Makefile
+++ b/arch/powerpc/kexec/Makefile
@@ -7,7 +7,7 @@ obj-y   += core.o crash.o core_$(BITS).o
 
 obj-$(CONFIG_PPC32)+= relocate_32.o
 
-obj-$(CONFIG_KEXEC_FILE)   += file_load.o file_load_$(BITS).o elf_$(BITS).o
+obj-$(CONFIG_KEXEC_FILE)   += file_load.o ranges.o file_load_$(BITS).o 
elf_$(BITS).o
 
 ifdef CONFIG_HAVE_IMA_KEXEC
 ifdef CONFIG_IMA
diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c
new file mode 100644
index 000..a704819
--- /dev/null
+++ b/arch/powerpc/kexec/ranges.c
@@ -0,0 +1,397 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * powerpc code to implement the kexec_file_load syscall
+ *
+ * Copyright (C) 2004  Adam Litke (a...@us.ibm.com)
+ * Copyright (C) 2004  IBM Corp.
+ * Copyright (C) 2004,2005  Milton D Miller II, IBM Corporation
+ * Copyright (C) 2005  R Sharada (shar...@in.ibm.com)
+ * Copyright (C) 2006  Mohan Kumar M (mo...@in.ibm.com)
+ * Copyright (C) 2020  IBM Corporation
+ *
+ * Based on kexec-tools' kexec-ppc64.c, fs2dt.c.
+ * Heavily modified for the kernel by
+ * Hari Bathini .
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/**
+ * get_max_nr_ranges - Get the max no. of ranges crash_mem structure
+ * could hold, given the size allocated for it.
+ * @size:  Allocation size of crash_mem structure.
+ *
+ * Returns the maximum no. of ranges.
+ */
+static inline unsigned int get_max_nr_ranges(size_t size)
+{
+   return ((size - sizeof(struct crash_mem)) /
+   sizeof(struct crash_mem_range));
+}
+
+/**
+ * get_mem_rngs_size - Get the allocated size of mrngs based on
+ * max_nr_ranges and chunk size.
+ * @mrngs: Memory ranges.
+ *
+ * Returns the maximum no. of ranges.
+ */
+static inline size_t get_mem_rngs_size(struct crash_mem *mrngs)
+{
+   size_t size;
+
+   if (!mrngs)
+   return 0;
+
+   size = (sizeof(struct crash_mem) +
+   (mrngs->max_nr_ranges * sizeof(struct crash_mem_range)));
+
+   /*
+* Memory is allocated in size multiple of MEM_RANGE_CHUNK_SZ.
+* So, align to get the actual 

[PATCH v2 04/12] ppc64/kexec_file: avoid stomping memory used by special regions

2020-07-02 Thread Hari Bathini
crashkernel region could have an overlap with special memory regions
like  opal, rtas, tce-table & such. These regions are referred to as
exclude memory ranges. Setup this ranges during image probe in order
to avoid them while finding the buffer for different kdump segments.
Override arch_kexec_locate_mem_hole() to locate a memory hole taking
these ranges into account.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* Did arch_kexec_locate_mem_hole() override to handle special regions.
* Ensured holes in the memory are accounted for while locating mem hole.
* Updated add_rtas_mem_range() & add_opal_mem_range() callsites based on
  the new prototype for these functions.


 arch/powerpc/include/asm/crashdump-ppc64.h |   10 +
 arch/powerpc/include/asm/kexec.h   |7 -
 arch/powerpc/kexec/elf_64.c|7 +
 arch/powerpc/kexec/file_load_64.c  |  324 
 4 files changed, 344 insertions(+), 4 deletions(-)
 create mode 100644 arch/powerpc/include/asm/crashdump-ppc64.h

diff --git a/arch/powerpc/include/asm/crashdump-ppc64.h 
b/arch/powerpc/include/asm/crashdump-ppc64.h
new file mode 100644
index 000..90deb46
--- /dev/null
+++ b/arch/powerpc/include/asm/crashdump-ppc64.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_POWERPC_CRASHDUMP_PPC64_H
+#define _ASM_POWERPC_CRASHDUMP_PPC64_H
+
+/* min & max addresses for kdump load segments */
+#define KDUMP_BUF_MIN  (crashk_res.start)
+#define KDUMP_BUF_MAX  ((crashk_res.end < ppc64_rma_size) ? \
+crashk_res.end : (ppc64_rma_size - 1))
+
+#endif /* __ASM_POWERPC_CRASHDUMP_PPC64_H */
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 7008ea1..bf47a01 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -100,14 +100,16 @@ void relocate_new_kernel(unsigned long indirection_page, 
unsigned long reboot_co
 #ifdef CONFIG_KEXEC_FILE
 extern const struct kexec_file_ops kexec_elf64_ops;
 
-#ifdef CONFIG_IMA_KEXEC
 #define ARCH_HAS_KIMAGE_ARCH
 
 struct kimage_arch {
+   struct crash_mem *exclude_ranges;
+
+#ifdef CONFIG_IMA_KEXEC
phys_addr_t ima_buffer_addr;
size_t ima_buffer_size;
-};
 #endif
+};
 
 int setup_purgatory(struct kimage *image, const void *slave_code,
const void *fdt, unsigned long kernel_load_addr,
@@ -125,6 +127,7 @@ int setup_new_fdt_ppc64(const struct kimage *image, void 
*fdt,
unsigned long initrd_load_addr,
unsigned long initrd_len, const char *cmdline);
 #endif /* CONFIG_PPC64 */
+
 #endif /* CONFIG_KEXEC_FILE */
 
 #else /* !CONFIG_KEXEC_CORE */
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index 23ad04c..c695f94 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static void *elf64_load(struct kimage *image, char *kernel_buf,
unsigned long kernel_len, char *initrd,
@@ -46,6 +47,12 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
if (ret)
goto out;
 
+   if (image->type == KEXEC_TYPE_CRASH) {
+   /* min & max buffer values for kdump case */
+   kbuf.buf_min = pbuf.buf_min = KDUMP_BUF_MIN;
+   kbuf.buf_max = pbuf.buf_max = KDUMP_BUF_MAX;
+   }
+
ret = kexec_elf_load(image, , _info, , _load_addr);
if (ret)
goto out;
diff --git a/arch/powerpc/kexec/file_load_64.c 
b/arch/powerpc/kexec/file_load_64.c
index e6bff960..932e0e5 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -17,6 +17,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 const struct kexec_file_ops * const kexec_file_loaders[] = {
_elf64_ops,
@@ -24,6 +27,240 @@ const struct kexec_file_ops * const kexec_file_loaders[] = {
 };
 
 /**
+ * get_exclude_memory_ranges - Get exclude memory ranges. This list includes
+ * regions like opal/rtas, tce-table, initrd,
+ * kernel, htab which should be avoided while
+ * setting up kexec load segments.
+ * @mem_ranges:Range list to add the memory ranges to.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int get_exclude_memory_ranges(struct crash_mem **mem_ranges)
+{
+   int ret;
+
+   ret = add_tce_mem_ranges(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_initrd_mem_range(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_htab_mem_range(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_kernel_mem_range(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_rtas_mem_range(mem_ranges);
+   if (ret)
+   goto out;

[PATCH v2 02/12] powerpc/kexec_file: mark PPC64 specific code

2020-07-02 Thread Hari Bathini
Some of the kexec_file_load code isn't PPC64 specific. Move PPC64
specific code from kexec/file_load.c to kexec/file_load_64.c. Also,
rename purgatory/trampoline.S to purgatory/trampoline_64.S in the
same spirit.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* No changes.


 arch/powerpc/include/asm/kexec.h   |   11 +++
 arch/powerpc/kexec/Makefile|2 -
 arch/powerpc/kexec/elf_64.c|7 +-
 arch/powerpc/kexec/file_load.c |   37 ++
 arch/powerpc/kexec/file_load_64.c  |  108 ++
 arch/powerpc/purgatory/Makefile|4 +
 arch/powerpc/purgatory/trampoline.S|  117 
 arch/powerpc/purgatory/trampoline_64.S |  117 
 8 files changed, 248 insertions(+), 155 deletions(-)
 create mode 100644 arch/powerpc/kexec/file_load_64.c
 delete mode 100644 arch/powerpc/purgatory/trampoline.S
 create mode 100644 arch/powerpc/purgatory/trampoline_64.S

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index c684768..7008ea1 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -114,8 +114,17 @@ int setup_purgatory(struct kimage *image, const void 
*slave_code,
unsigned long fdt_load_addr);
 int setup_new_fdt(const struct kimage *image, void *fdt,
  unsigned long initrd_load_addr, unsigned long initrd_len,
- const char *cmdline);
+ const char *cmdline, int *node);
 int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
+
+#ifdef CONFIG_PPC64
+int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
+ const void *fdt, unsigned long kernel_load_addr,
+ unsigned long fdt_load_addr);
+int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
+   unsigned long initrd_load_addr,
+   unsigned long initrd_len, const char *cmdline);
+#endif /* CONFIG_PPC64 */
 #endif /* CONFIG_KEXEC_FILE */
 
 #else /* !CONFIG_KEXEC_CORE */
diff --git a/arch/powerpc/kexec/Makefile b/arch/powerpc/kexec/Makefile
index 86380c6..67c3553 100644
--- a/arch/powerpc/kexec/Makefile
+++ b/arch/powerpc/kexec/Makefile
@@ -7,7 +7,7 @@ obj-y   += core.o crash.o core_$(BITS).o
 
 obj-$(CONFIG_PPC32)+= relocate_32.o
 
-obj-$(CONFIG_KEXEC_FILE)   += file_load.o elf_$(BITS).o
+obj-$(CONFIG_KEXEC_FILE)   += file_load.o file_load_$(BITS).o elf_$(BITS).o
 
 ifdef CONFIG_HAVE_IMA_KEXEC
 ifdef CONFIG_IMA
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index 3072fd6..23ad04c 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -88,7 +88,8 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
goto out;
}
 
-   ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len, cmdline);
+   ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
+ initrd_len, cmdline);
if (ret)
goto out;
 
@@ -107,8 +108,8 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
pr_debug("Loaded device tree at 0x%lx\n", fdt_load_addr);
 
slave_code = elf_info.buffer + elf_info.proghdrs[0].p_offset;
-   ret = setup_purgatory(image, slave_code, fdt, kernel_load_addr,
- fdt_load_addr);
+   ret = setup_purgatory_ppc64(image, slave_code, fdt, kernel_load_addr,
+   fdt_load_addr);
if (ret)
pr_err("Error setting up the purgatory.\n");
 
diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c
index 143c917..99a2c4d 100644
--- a/arch/powerpc/kexec/file_load.c
+++ b/arch/powerpc/kexec/file_load.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
- * ppc64 code to implement the kexec_file_load syscall
+ * powerpc code to implement the kexec_file_load syscall
  *
  * Copyright (C) 2004  Adam Litke (a...@us.ibm.com)
  * Copyright (C) 2004  IBM Corp.
@@ -16,26 +16,10 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
 
-#define SLAVE_CODE_SIZE256
-
-const struct kexec_file_ops * const kexec_file_loaders[] = {
-   _elf64_ops,
-   NULL
-};
-
-int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
- unsigned long buf_len)
-{
-   /* We don't support crash kernels yet. */
-   if (image->type == KEXEC_TYPE_CRASH)
-   return -EOPNOTSUPP;
-
-   return kexec_image_probe_default(image, buf, buf_len);
-}
+#define SLAVE_CODE_SIZE256 /* First 0x100 bytes */
 
 /**
  * setup_purgatory - initialize the purgatory's global variables
@@ -127,24 +111,17 @@ int delete_fdt_mem_rsv(void *fdt, unsigned long start, 
unsigned long size)
  * @initrd_len:

[PATCH v2 01/12] kexec_file: allow archs to handle special regions while locating memory hole

2020-07-02 Thread Hari Bathini
Some architectures may have special memory regions, within the given
memory range, which can't be used for the buffer in a kexec segment.
Implement weak arch_kexec_locate_mem_hole() definition which arch code
may override, to take care of special regions, while trying to locate
a memory hole.

Also, add the missing declarations for arch overridable functions and
and drop the __weak descriptors in the declarations to avoid non-weak
definitions from becoming weak.

Reported-by: kernel test robot 
[lkp: In v1, arch_kimage_file_post_load_cleanup() declaration was missing]
Signed-off-by: Hari Bathini 
---

Changes in v2:
* Introduced arch_kexec_locate_mem_hole() for override and dropped
  weak arch_kexec_add_buffer().
* Dropped __weak identifier for arch overridable functions.
* Fixed the missing declaration for arch_kimage_file_post_load_cleanup()
  reported by lkp. lkp report for reference:
- https://lore.kernel.org/patchwork/patch/1264418/


 include/linux/kexec.h |   29 ++---
 kernel/kexec_file.c   |   16 ++--
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index ea67910..9e93bef 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -183,17 +183,24 @@ int kexec_purgatory_get_set_symbol(struct kimage *image, 
const char *name,
   bool get_value);
 void *kexec_purgatory_get_symbol_addr(struct kimage *image, const char *name);
 
-int __weak arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
-unsigned long buf_len);
-void * __weak arch_kexec_kernel_image_load(struct kimage *image);
-int __weak arch_kexec_apply_relocations_add(struct purgatory_info *pi,
-   Elf_Shdr *section,
-   const Elf_Shdr *relsec,
-   const Elf_Shdr *symtab);
-int __weak arch_kexec_apply_relocations(struct purgatory_info *pi,
-   Elf_Shdr *section,
-   const Elf_Shdr *relsec,
-   const Elf_Shdr *symtab);
+/* Architectures may override the below functions */
+int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+ unsigned long buf_len);
+void *arch_kexec_kernel_image_load(struct kimage *image);
+int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
+Elf_Shdr *section,
+const Elf_Shdr *relsec,
+const Elf_Shdr *symtab);
+int arch_kexec_apply_relocations(struct purgatory_info *pi,
+Elf_Shdr *section,
+const Elf_Shdr *relsec,
+const Elf_Shdr *symtab);
+int arch_kimage_file_post_load_cleanup(struct kimage *image);
+#ifdef CONFIG_KEXEC_SIG
+int arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
+unsigned long buf_len);
+#endif
+int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf);
 
 extern int kexec_add_buffer(struct kexec_buf *kbuf);
 int kexec_locate_mem_hole(struct kexec_buf *kbuf);
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 09cc78d..e89912d 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -636,6 +636,19 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
 }
 
 /**
+ * arch_kexec_locate_mem_hole - Find free memory to place the segments.
+ * @kbuf:   Parameters for the memory search.
+ *
+ * On success, kbuf->mem will have the start address of the memory region 
found.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+int __weak arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
+{
+   return kexec_locate_mem_hole(kbuf);
+}
+
+/**
  * kexec_add_buffer - place a buffer in a kexec segment
  * @kbuf:  Buffer contents and memory parameters.
  *
@@ -647,7 +660,6 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
  */
 int kexec_add_buffer(struct kexec_buf *kbuf)
 {
-
struct kexec_segment *ksegment;
int ret;
 
@@ -675,7 +687,7 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
kbuf->buf_align = max(kbuf->buf_align, PAGE_SIZE);
 
/* Walk the RAM ranges and allocate a suitable range for the buffer */
-   ret = kexec_locate_mem_hole(kbuf);
+   ret = arch_kexec_locate_mem_hole(kbuf);
if (ret)
return ret;
 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] arm64/defconfig: Enable CONFIG_KEXEC_FILE

2020-07-02 Thread Bhupesh Sharma
Hi Catalin,

On Fri, May 15, 2020 at 2:44 PM Bhupesh Sharma  wrote:
>
> Hi Arnd,
>
> On Thu, Apr 30, 2020 at 10:05 AM Bhupesh Sharma  wrote:
> >
> > On Tue, Apr 28, 2020 at 3:37 PM Catalin Marinas  
> > wrote:
> > >
> > > On Tue, Apr 28, 2020 at 01:55:58PM +0530, Bhupesh Sharma wrote:
> > > > On Wed, Apr 8, 2020 at 4:17 PM Mark Rutland  
> > > > wrote:
> > > > > On Tue, Apr 07, 2020 at 04:01:40AM +0530, Bhupesh Sharma wrote:
> > > > > >  arch/arm64/configs/defconfig | 1 +
> > > > > >  1 file changed, 1 insertion(+)
> > > > > >
> > > > > > diff --git a/arch/arm64/configs/defconfig 
> > > > > > b/arch/arm64/configs/defconfig
> > > > > > index 24e534d85045..fa122f4341a2 100644
> > > > > > --- a/arch/arm64/configs/defconfig
> > > > > > +++ b/arch/arm64/configs/defconfig
> > > > > > @@ -66,6 +66,7 @@ CONFIG_SCHED_SMT=y
> > > > > >  CONFIG_NUMA=y
> > > > > >  CONFIG_SECCOMP=y
> > > > > >  CONFIG_KEXEC=y
> > > > > > +CONFIG_KEXEC_FILE=y
> > > > > >  CONFIG_CRASH_DUMP=y
> > > > > >  CONFIG_XEN=y
> > > > > >  CONFIG_COMPAT=y
> > > > > > --
> > > > > > 2.7.4
> > > >
> > > > Thanks a lot  Mark.
> > > >
> > > > Hi Catalin, Will,
> > > >
> > > > Can you please help pick this patch in the arm tree. We have an
> > > > increasing number of user-cases from distro users
> > > > who want to use kexec_file_load() as the default interface for
> > > > kexec/kdump on arm64.
> > >
> > > We could pick it up if it doesn't conflict with the arm-soc tree. They
> > > tend to pick most of the defconfig changes these days (and could as well
> > > pick this one).
> >
> > Thanks Catalin.
> > (+Cc Arnd)
> >
> > Hi Arnd,
> >
> > Can you please help pick this change via the arm-soc tree?
>
> Ping. Any updates on this defconfig patch.

Ping. Seems there is no reply from Arnd on this patch.
Can you please help pull in this one as well. It has been pending for
quite some time now.

Thanks,
Bhupesh


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 2/2] arm64: Allocate crashkernel always in ZONE_DMA

2020-07-02 Thread Bhupesh Sharma
Hi Will,

On Thu, Jul 2, 2020 at 1:20 PM Will Deacon  wrote:
>
> On Thu, Jul 02, 2020 at 03:44:20AM +0530, Bhupesh Sharma wrote:
> > commit bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in
> > ZONE_DMA32") allocates crashkernel for arm64 in the ZONE_DMA32.
> >
> > However as reported by Prabhakar, this breaks kdump kernel booting in
> > ThunderX2 like arm64 systems. I have noticed this on another ampere
> > arm64 machine. The OOM log in the kdump kernel looks like this:
> >
> >   [0.240552] DMA: preallocated 128 KiB GFP_KERNEL pool for atomic 
> > allocations
> >   [0.247713] swapper/0: page allocation failure: order:1, 
> > mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
> >   <..snip..>
> >   [0.274706] Call trace:
> >   [0.277170]  dump_backtrace+0x0/0x208
> >   [0.280863]  show_stack+0x1c/0x28
> >   [0.284207]  dump_stack+0xc4/0x10c
> >   [0.287638]  warn_alloc+0x104/0x170
> >   [0.291156]  __alloc_pages_slowpath.constprop.106+0xb08/0xb48
> >   [0.296958]  __alloc_pages_nodemask+0x2ac/0x2f8
> >   [0.301530]  alloc_page_interleave+0x20/0x90
> >   [0.305839]  alloc_pages_current+0xdc/0xf8
> >   [0.309972]  atomic_pool_expand+0x60/0x210
> >   [0.314108]  __dma_atomic_pool_init+0x50/0xa4
> >   [0.318504]  dma_atomic_pool_init+0xac/0x158
> >   [0.322813]  do_one_initcall+0x50/0x218
> >   [0.326684]  kernel_init_freeable+0x22c/0x2d0
> >   [0.331083]  kernel_init+0x18/0x110
> >   [0.334600]  ret_from_fork+0x10/0x18
> >
> > This patch limits the crashkernel allocation to the first 1GB of
> > the RAM accessible (ZONE_DMA), as otherwise we might run into OOM
> > issues when crashkernel is executed, as it might have been originally
> > allocated from either a ZONE_DMA32 memory or mixture of memory chunks
> > belonging to both ZONE_DMA and ZONE_DMA32.
>
> How does this interact with this ongoing series:
>
> https://lore.kernel.org/r/20200628083458.40066-1-chenzho...@huawei.com
>
> (patch 4, in particular)

Many thanks for having a look at this patchset. I was not aware that
Chen had sent out a new version.
I had noted in the v9 review of the high/low range allocation
 that I was working
on a generic solution (irrespective of the crashkernel, low and high
range allocation) which resulted in this patchset.

The issue is two-fold: OOPs in memcfg layer (PATCH 1/2, which has been
Acked-by memcfg maintainer) and OOM in the kdump kernel due to
crashkernel allocation in ZONE_DMA32 regions(s) which is addressed by
this PATCH.

I will have a closer look at the v10 patchset Chen shared, but seems
it needs some rework as per Dave's review comments which he shared
today.
IMO, in the meanwhile this patchset  can be used to fix the existing
kdump issue with upstream kernel.

> > Fixes: bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in ZONE_DMA32")
> > Cc: Johannes Weiner 
> > Cc: Michal Hocko 
> > Cc: Vladimir Davydov 
> > Cc: James Morse 
> > Cc: Mark Rutland 
> > Cc: Will Deacon 
> > Cc: Catalin Marinas 
> > Cc: cgro...@vger.kernel.org
> > Cc: linux...@kvack.org
> > Cc: linux-arm-ker...@lists.infradead.org
> > Cc: linux-ker...@vger.kernel.org
> > Cc: kexec@lists.infradead.org
> > Reported-by: Prabhakar Kushwaha 
> > Signed-off-by: Bhupesh Sharma 
> > ---
> >  arch/arm64/mm/init.c | 16 ++--
> >  1 file changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 1e93cfc7c47a..02ae4d623802 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -91,8 +91,15 @@ static void __init reserve_crashkernel(void)
> >   crash_size = PAGE_ALIGN(crash_size);
> >
> >   if (crash_base == 0) {
> > - /* Current arm64 boot protocol requires 2MB alignment */
> > - crash_base = memblock_find_in_range(0, arm64_dma32_phys_limit,
> > + /* Current arm64 boot protocol requires 2MB alignment.
> > +  * Also limit the crashkernel allocation to the first
> > +  * 1GB of the RAM accessible (ZONE_DMA), as otherwise we
> > +  * might run into OOM issues when crashkernel is executed,
> > +  * as it might have been originally allocated from
> > +  * either a ZONE_DMA32 memory or mixture of memory
> > +  * chunks belonging to both ZONE_DMA and ZONE_DMA32.
> > +  */
>
> This comment needs help. Why does putting the crashkernel in ZONE_DMA
> prevent "OOM issues"?

Sure, I can work on adding more details in the comment so that it
explains the potential OOM issue(s) better.

Thanks,
Bhupesh


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/2] mm/memcontrol: Fix OOPS inside mem_cgroup_get_nr_swap_pages()

2020-07-02 Thread Bhupesh Sharma
Hi Michal,

On Thu, Jul 2, 2020 at 11:30 AM Michal Hocko  wrote:
>
> On Thu 02-07-20 03:44:19, Bhupesh Sharma wrote:
> > Prabhakar reported an OOPS inside mem_cgroup_get_nr_swap_pages()
> > function in a corner case seen on some arm64 boards when kdump kernel
> > runs with "cgroup_disable=memory" passed to the kdump kernel via
> > bootargs.
> >
> > The root-cause behind the same is that currently mem_cgroup_swap_init()
> > function is implemented as a subsys_initcall() call instead of a
> > core_initcall(), this means 'cgroup_memory_noswap' still
> > remains set to the default value (false) even when memcg is disabled via
> > "cgroup_disable=memory" boot parameter.
> >
> > This may result in premature OOPS inside mem_cgroup_get_nr_swap_pages()
> > function in corner cases:
> >
> >   [0.265617] Unable to handle kernel NULL pointer dereference at 
> > virtual address 0188
> >   [0.274495] Mem abort info:
> >   [0.277311]   ESR = 0x9606
> >   [0.280389]   EC = 0x25: DABT (current EL), IL = 32 bits
> >   [0.285751]   SET = 0, FnV = 0
> >   [0.288830]   EA = 0, S1PTW = 0
> >   [0.291995] Data abort info:
> >   [0.294897]   ISV = 0, ISS = 0x0006
> >   [0.298765]   CM = 0, WnR = 0
> >   [0.301757] [0188] user address but active_mm is swapper
> >   [0.308174] Internal error: Oops: 9606 [#1] SMP
> >   [0.313097] Modules linked in:
> >   <..snip..>
> >   [0.331384] pstate: 0049 (nzcv daif +PAN -UAO BTYPE=--)
> >   [0.337014] pc : mem_cgroup_get_nr_swap_pages+0x9c/0xf4
> >   [0.342289] lr : mem_cgroup_get_nr_swap_pages+0x68/0xf4
> >   [0.347564] sp : fe0012b6f800
> >   [0.350905] x29: fe0012b6f800 x28: fe00116b3000
> >   [0.356268] x27: fe0012b6fb00 x26: 0020
> >   [0.361631] x25:  x24: fc00723ffe28
> >   [0.366994] x23: fe0010d5b468 x22: fe00116bfa00
> >   [0.372357] x21: fe0010aabda8 x20: 
> >   [0.377720] x19:  x18: 0010
> >   [0.383082] x17: 43e612f2 x16: a9863ed7
> >   [0.388445] x15:  x14: 202c303d70617773
> >   [0.393808] x13: 6f6e5f79726f6d65 x12: 6d5f70756f726763
> >   [0.399170] x11: 2073656761705f70 x10: 6177735f726e5f74
> >   [0.404533] x9 : fe00100e9580 x8 : fe0010628160
> >   [0.409895] x7 : 00a8 x6 : fe00118f5e5e
> >   [0.415258] x5 : 0001 x4 : 
> >   [0.420621] x3 :  x2 : 
> >   [0.425983] x1 :  x0 : fc0060079000
> >   [0.431346] Call trace:
> >   [0.433809]  mem_cgroup_get_nr_swap_pages+0x9c/0xf4
> >   [0.438735]  shrink_lruvec+0x404/0x4f8
> >   [0.442516]  shrink_node+0x1a8/0x688
> >   [0.446121]  do_try_to_free_pages+0xe8/0x448
> >   [0.450429]  try_to_free_pages+0x110/0x230
> >   [0.454563]  __alloc_pages_slowpath.constprop.106+0x2b8/0xb48
> >   [0.460366]  __alloc_pages_nodemask+0x2ac/0x2f8
> >   [0.464938]  alloc_page_interleave+0x20/0x90
> >   [0.469246]  alloc_pages_current+0xdc/0xf8
> >   [0.473379]  atomic_pool_expand+0x60/0x210
> >   [0.477514]  __dma_atomic_pool_init+0x50/0xa4
> >   [0.481910]  dma_atomic_pool_init+0xac/0x158
> >   [0.486220]  do_one_initcall+0x50/0x218
> >   [0.490091]  kernel_init_freeable+0x22c/0x2d0
> >   [0.494489]  kernel_init+0x18/0x110
> >   [0.498007]  ret_from_fork+0x10/0x18
> >   [0.501614] Code: aa1403e3 91106000 97f82a27 1411 (f940c663)
> >   [0.507770] ---[ end trace 9795948475817de4 ]---
> >   [0.512429] Kernel panic - not syncing: Fatal exception
> >   [0.517705] Rebooting in 10 seconds..
> >
> > Cc: Johannes Weiner 
> > Cc: Michal Hocko 
> > Cc: Vladimir Davydov 
> > Cc: James Morse 
> > Cc: Mark Rutland 
> > Cc: Will Deacon 
> > Cc: Catalin Marinas 
> > Cc: cgro...@vger.kernel.org
> > Cc: linux...@kvack.org
> > Cc: linux-arm-ker...@lists.infradead.org
> > Cc: linux-ker...@vger.kernel.org
> > Cc: kexec@lists.infradead.org
>
> Fixes: eccb52e78809 ("mm: memcontrol: prepare swap controller setup for 
> integration")
>
> > Reported-by: Prabhakar Kushwaha 
> > Signed-off-by: Bhupesh Sharma 
>
> This is subtle as hell, I have to say. I find the ordering in the init
> calls very unintuitive and extremely hard to follow. The above commit
> has introduced the problem but the code previously has worked mostly by
> a luck because our default was flipped.
>
> Acked-by: Michal Hocko 

Thanks for reviewing the patch. Indeed its quite a corner case seen
only selected arm64 machines.

Regards,
Bhupesh

> > ---
> >  mm/memcontrol.c | 9 -
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 19622328e4b5..8323e4b7b390 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -7186,6 +7186,13 @@ static struct 

Re: [PATCH v6 0/2] Append new variables to vmcoreinfo (TCR_EL1.T1SZ for arm64 and MAX_PHYSMEM_BITS for all archs)

2020-07-02 Thread Bhupesh Sharma
On Thu, Jul 2, 2020 at 10:45 PM Catalin Marinas  wrote:
>
> On Thu, 14 May 2020 00:22:35 +0530, Bhupesh Sharma wrote:
> > Apologies for the delayed update. Its been quite some time since I
> > posted the last version (v5), but I have been really caught up in some
> > other critical issues.
> >
> > Changes since v5:
> > 
> > - v5 can be viewed here:
> >   http://lists.infradead.org/pipermail/kexec/2019-November/024055.html
> > - Addressed review comments from James Morse and Boris.
> > - Added Tested-by received from John on v5 patchset.
> > - Rebased against arm64 (for-next/ptr-auth) branch which has Amit's
> >   patchset for ARMv8.3-A Pointer Authentication feature vmcoreinfo
> >   applied.
> >
> > [...]
>
> Applied to arm64 (for-next/vmcoreinfo), thanks!
>
> [1/2] crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo
>   https://git.kernel.org/arm64/c/1d50e5d0c505
> [2/2] arm64/crash_core: Export TCR_EL1.T1SZ in vmcoreinfo
>   https://git.kernel.org/arm64/c/bbdbc11804ff

Thanks Catalin for pulling in the changes.

Dave and James, many thanks for reviewing the same as well.

Regards,
Bhupesh


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v6 0/2] Append new variables to vmcoreinfo (TCR_EL1.T1SZ for arm64 and MAX_PHYSMEM_BITS for all archs)

2020-07-02 Thread Catalin Marinas
On Thu, 14 May 2020 00:22:35 +0530, Bhupesh Sharma wrote:
> Apologies for the delayed update. Its been quite some time since I
> posted the last version (v5), but I have been really caught up in some
> other critical issues.
> 
> Changes since v5:
> 
> - v5 can be viewed here:
>   http://lists.infradead.org/pipermail/kexec/2019-November/024055.html
> - Addressed review comments from James Morse and Boris.
> - Added Tested-by received from John on v5 patchset.
> - Rebased against arm64 (for-next/ptr-auth) branch which has Amit's
>   patchset for ARMv8.3-A Pointer Authentication feature vmcoreinfo
>   applied.
> 
> [...]

Applied to arm64 (for-next/vmcoreinfo), thanks!

[1/2] crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo
  https://git.kernel.org/arm64/c/1d50e5d0c505
[2/2] arm64/crash_core: Export TCR_EL1.T1SZ in vmcoreinfo
  https://git.kernel.org/arm64/c/bbdbc11804ff

-- 
Catalin


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v6 1/2] crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo

2020-07-02 Thread Catalin Marinas
On Thu, Jul 02, 2020 at 08:08:55PM +0800, Dave Young wrote:
> Hi Catalin,
> On 07/02/20 at 12:00pm, Catalin Marinas wrote:
> > On Thu, May 14, 2020 at 12:22:36AM +0530, Bhupesh Sharma wrote:
> > > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > > index 9f1557b98468..18175687133a 100644
> > > --- a/kernel/crash_core.c
> > > +++ b/kernel/crash_core.c
> > > @@ -413,6 +413,7 @@ static int __init crash_save_vmcoreinfo_init(void)
> > >   VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
> > >   VMCOREINFO_STRUCT_SIZE(mem_section);
> > >   VMCOREINFO_OFFSET(mem_section, section_mem_map);
> > > + VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
> > >  #endif
> > >   VMCOREINFO_STRUCT_SIZE(page);
> > >   VMCOREINFO_STRUCT_SIZE(pglist_data);
> > 
> > I can queue this patch via the arm64 tree (together with the second one)
> > but I'd like an ack from the kernel/crash_core.c maintainers. They don't
> > seem to have been cc'ed either (only the kexec list).
> 
> For the VMCOREINFO part, I'm fine with the changes, but since I do not
> understand the arm64 pieces so I would like to leave to arm64 people to
> review.  If arm64 bits are good enough, feel free to add:
> 
> Acked-by: Dave Young 

Thanks.

-- 
Catalin

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

2020-07-02 Thread Petr Mladek
On Thu 2020-07-02 17:43:22, lijiang wrote:
> 在 2020年07月02日 17:02, John Ogness 写道:
> > On 2020-07-02, lijiang  wrote:
> >> About the VMCOREINFO part, I made some tests based on the kernel patch
> >> v3, the makedumpfile and crash-utility can work as expected with your
> >> patch(userspace patch), but, unfortunately, the vmcore-dmesg(kexec-tools)
> >> can't correctly read the printk ring buffer information, and get the
> >> following error:
> >>
> >> "Missing the log_buf symbol"
> >>
> >> The kexec-tools(vmcore-dmesg) should also have a similar patch, just like
> >> in the makedumpfile and crash-utility.
> > 
> > Yes, a patch for this is needed (as well as for any other related
> > software floating around the internet).
> > 
> > I have no RFC patches for vmcore-dmesg. Looking at the code, I think it
> > would be quite straight forward to port the makedumpfile patch. I will
> 
> Yes, it should be a similar patch.
> 
> > try to make some time for this.
> > 
> That would be nice. Thank you, John Ogness.
> 
> > I do not want to patch any other software for this. I think with 3
> > examples (crash, makedumpfile, vmcore-dmesg), others should be able to
> 
> It's good enough to have the patch for the makedumpfile, crash and 
> vmcore-dmesg,
> which can ensure the kdump(userspace) work well.

I agree that this three are the most important ones and should be
enough.

Thanks a lot for working on it and testing it.

Best Regards,
Petr

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: buffer allocation: was: [PATCH v3 3/3] printk: use the lockless ringbuffer

2020-07-02 Thread Petr Mladek
On Mon 2020-06-29 23:57:59, John Ogness wrote:
> On 2020-06-29, Petr Mladek  wrote:
> >> @@ @@ void __init setup_log_buf(int early)
> >> +  prb_init(_rb_dynamic,
> >> +   new_log_buf, order_base_2(new_log_buf_len),
> >> +   new_dict_buf, order_base_2(new_log_buf_len),
> >> +   new_descs, order_base_2(new_descs_count));
> >
> > order_base_2() is safe. But the result might be tat some allocated
> > space is not used.
> >
> > I would prefer to make sure that new_log_buf_len is rounded, e.g.
> > by roundup_pow_of_two(), at the beginning of the function. Then we
> > could use ilog2() here.
> 
> new_log_buf_len can only be set within log_buf_len_update(), and it
> is already doing exactly what you want:
> 
> if (size)
> size = roundup_pow_of_two(size);
> if (size > log_buf_len)
> new_log_buf_len = (unsigned long)size;

I know. I though about repeating this in setup_log_bug() to make it
error proof. But it is not really needed. I do not resist on it.

> I can switch to ilog2() instead of the more conservative order_base_2().

Yes, please change it to ilog2().

Best Regards,
Petr

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v6 1/2] crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo

2020-07-02 Thread Dave Young
Hi Catalin,
On 07/02/20 at 12:00pm, Catalin Marinas wrote:
> On Thu, May 14, 2020 at 12:22:36AM +0530, Bhupesh Sharma wrote:
> > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > index 9f1557b98468..18175687133a 100644
> > --- a/kernel/crash_core.c
> > +++ b/kernel/crash_core.c
> > @@ -413,6 +413,7 @@ static int __init crash_save_vmcoreinfo_init(void)
> > VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
> > VMCOREINFO_STRUCT_SIZE(mem_section);
> > VMCOREINFO_OFFSET(mem_section, section_mem_map);
> > +   VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
> >  #endif
> > VMCOREINFO_STRUCT_SIZE(page);
> > VMCOREINFO_STRUCT_SIZE(pglist_data);
> 
> I can queue this patch via the arm64 tree (together with the second one)
> but I'd like an ack from the kernel/crash_core.c maintainers. They don't
> seem to have been cc'ed either (only the kexec list).

For the VMCOREINFO part, I'm fine with the changes, but since I do not
understand the arm64 pieces so I would like to leave to arm64 people to
review.  If arm64 bits are good enough, feel free to add:

Acked-by: Dave Young 

Thanks
Dave


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 04/11] ppc64/kexec_file: avoid stomping memory used by special regions

2020-07-02 Thread Dave Young
> > I'm confused about the "overlap with crashkernel memory", does that mean
> > those normal kernel used memory could be put in crashkernel reserved
> > memory range?  If so why can't just skip those areas while crashkernel
> > doing the reservation?
> I raised the same question in another mail. As Hari's answer, "kexec -p"
> skips these ranges in user space. And the same logic should be done in
> "kexec -s -p"

See it, thanks!  The confusion also applied to the userspace
implementation though.  Seems they have to be special cases because of 
the powerpc crashkernel reservation implemtation in kernel limitation

Thanks
Dave


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 04/11] ppc64/kexec_file: avoid stomping memory used by special regions

2020-07-02 Thread Dave Young
On 07/01/20 at 11:48pm, Hari Bathini wrote:
> 
> 
> On 01/07/20 1:10 pm, Dave Young wrote:
> > Hi Hari,
> > On 06/27/20 at 12:35am, Hari Bathini wrote:
> >> crashkernel region could have an overlap with special memory regions
> >> like  opal, rtas, tce-table & such. These regions are referred to as
> >> exclude memory ranges. Setup this ranges during image probe in order
> >> to avoid them while finding the buffer for different kdump segments.
> >> Implement kexec_locate_mem_hole_ppc64() that locates a memory hole
> >> accounting for these ranges. Also, override arch_kexec_add_buffer()
> >> to locate a memory hole & later call __kexec_add_buffer() function
> >> with kbuf->mem set to skip the generic locate memory hole lookup.
> >>
> >> Signed-off-by: Hari Bathini 
> >> ---
> >>  arch/powerpc/include/asm/crashdump-ppc64.h |   10 +
> >>  arch/powerpc/include/asm/kexec.h   |7 -
> >>  arch/powerpc/kexec/elf_64.c|7 +
> >>  arch/powerpc/kexec/file_load_64.c  |  292 
> >> 
> >>  4 files changed, 312 insertions(+), 4 deletions(-)
> >>  create mode 100644 arch/powerpc/include/asm/crashdump-ppc64.h
> >>
> > [snip]
> >>  /**
> >> + * get_exclude_memory_ranges - Get exclude memory ranges. This list 
> >> includes
> >> + * regions like opal/rtas, tce-table, initrd,
> >> + * kernel, htab which should be avoided while
> >> + * setting up kexec load segments.
> >> + * @mem_ranges:Range list to add the memory ranges to.
> >> + *
> >> + * Returns 0 on success, negative errno on error.
> >> + */
> >> +static int get_exclude_memory_ranges(struct crash_mem **mem_ranges)
> >> +{
> >> +  int ret;
> >> +
> >> +  ret = add_tce_mem_ranges(mem_ranges);
> >> +  if (ret)
> >> +  goto out;
> >> +
> >> +  ret = add_initrd_mem_range(mem_ranges);
> >> +  if (ret)
> >> +  goto out;
> >> +
> >> +  ret = add_htab_mem_range(mem_ranges);
> >> +  if (ret)
> >> +  goto out;
> >> +
> >> +  ret = add_kernel_mem_range(mem_ranges);
> >> +  if (ret)
> >> +  goto out;
> >> +
> >> +  ret = add_rtas_mem_range(mem_ranges, false);
> >> +  if (ret)
> >> +  goto out;
> >> +
> >> +  ret = add_opal_mem_range(mem_ranges, false);
> >> +  if (ret)
> >> +  goto out;
> >> +
> >> +  ret = add_reserved_ranges(mem_ranges);
> >> +  if (ret)
> >> +  goto out;
> >> +
> >> +  /* exclude memory ranges should be sorted for easy lookup */
> >> +  sort_memory_ranges(*mem_ranges);
> >> +out:
> >> +  if (ret)
> >> +  pr_err("Failed to setup exclude memory ranges\n");
> >> +  return ret;
> >> +}
> > 
> > I'm confused about the "overlap with crashkernel memory", does that mean
> > those normal kernel used memory could be put in crashkernel reserved
> 
> There are regions that could overlap with crashkernel region but they are
> not normal kernel used memory though. These are regions that kernel and/or
> f/w chose to place at a particular address for real mode accessibility
> and/or memory layout between kernel & f/w kind of thing.
> 
> > memory range?  If so why can't just skip those areas while crashkernel
> > doing the reservation?
> 
> crashkernel region has a dependency to be in the first memory block for it
> to be accessible in real mode. Accommodating this requirement while addressing
> other requirements would mean something like what we have now. A list of
> possible special memory regions in crashkernel region to take care of.
> 
> I have plans to split crashkernel region into low & high to have exclusive
> regions for crashkernel, even if that means to have two of them. But that
> is for another day with its own set of complexities to deal with...

Ok, I was not aware the powerpc crashkernel reservation is not
dynamically reserved.  But seems powerpc need those tricks at least for
the time being like you said.

Thanks
Dave


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 01/11] kexec_file: allow archs to handle special regions while locating memory hole

2020-07-02 Thread Dave Young
On 07/02/20 at 12:01am, Hari Bathini wrote:
> 
> 
> On 01/07/20 1:16 pm, Dave Young wrote:
> > On 06/29/20 at 05:26pm, Hari Bathini wrote:
> >> Hi Petr,
> >>
> >> On 29/06/20 5:09 pm, Petr Tesarik wrote:
> >>> Hi Hari,
> >>>
> >>> is there any good reason to add two more functions with a very similar
> >>> name to an existing function? AFAICS all you need is a way to call a
> >>> PPC64-specific function from within kexec_add_buffer (PATCH 4/11), so
> >>> you could add something like this:
> >>>
> >>> int __weak arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
> >>> {
> >>>   return 0;
> >>> }
> >>>
> >>> Call this function from kexec_add_buffer where appropriate and then
> >>> override it for PPC64 (it roughly corresponds to your
> >>> kexec_locate_mem_hole_ppc64() from PATCH 4/11).
> >>>
> >>> FWIW it would make it easier for me to follow the resulting code.
> >>
> >> Right, Petr.
> >>
> >> I was trying out a few things before I ended up with what I sent here.
> >> Bu yeah.. I did realize arch_kexec_locate_mem_hole() would have been better
> >> after sending out v1. Will take care of that in v2.
> > 
> > Another way is use arch private function to locate mem hole, then set
> > kbuf->mem, and then call kexec_add_buf, it will skip the common locate
> > hole function.
> 
> Dave, I did think about it. But there are a couple of places this can get
> tricky. One is ima_add_kexec_buffer() and the other is kexec_elf_load().
> These call sites could be updated to set kbuf->mem before kexec_add_buffer().
> But the current approach seemed like the better option for it creates a
> single point of control in setting up segment buffers and also, makes adding
> any new segments simpler, arch-specific segments or otherwise.
> 

Ok, thanks for the explanation.


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v6 1/2] crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo

2020-07-02 Thread Catalin Marinas
On Thu, May 14, 2020 at 12:22:36AM +0530, Bhupesh Sharma wrote:
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 9f1557b98468..18175687133a 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -413,6 +413,7 @@ static int __init crash_save_vmcoreinfo_init(void)
>   VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
>   VMCOREINFO_STRUCT_SIZE(mem_section);
>   VMCOREINFO_OFFSET(mem_section, section_mem_map);
> + VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
>  #endif
>   VMCOREINFO_STRUCT_SIZE(page);
>   VMCOREINFO_STRUCT_SIZE(pglist_data);

I can queue this patch via the arm64 tree (together with the second one)
but I'd like an ack from the kernel/crash_core.c maintainers. They don't
seem to have been cc'ed either (only the kexec list).

Thanks.

-- 
Catalin

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

2020-07-02 Thread lijiang
在 2020年07月02日 17:02, John Ogness 写道:
> On 2020-07-02, lijiang  wrote:
>> About the VMCOREINFO part, I made some tests based on the kernel patch
>> v3, the makedumpfile and crash-utility can work as expected with your
>> patch(userspace patch), but, unfortunately, the vmcore-dmesg(kexec-tools)
>> can't correctly read the printk ring buffer information, and get the
>> following error:
>>
>> "Missing the log_buf symbol"
>>
>> The kexec-tools(vmcore-dmesg) should also have a similar patch, just like
>> in the makedumpfile and crash-utility.
> 
> Yes, a patch for this is needed (as well as for any other related
> software floating around the internet).
> 
> I have no RFC patches for vmcore-dmesg. Looking at the code, I think it
> would be quite straight forward to port the makedumpfile patch. I will

Yes, it should be a similar patch.

> try to make some time for this.
> 
That would be nice. Thank you, John Ogness.

> I do not want to patch any other software for this. I think with 3
> examples (crash, makedumpfile, vmcore-dmesg), others should be able to

It's good enough to have the patch for the makedumpfile, crash and vmcore-dmesg,
which can ensure the kdump(userspace) work well.

Thanks.
Lianbo

> implement the changes to their software without needing my help.
> 
> John Ogness
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

2020-07-02 Thread John Ogness
On 2020-07-02, lijiang  wrote:
> About the VMCOREINFO part, I made some tests based on the kernel patch
> v3, the makedumpfile and crash-utility can work as expected with your
> patch(userspace patch), but, unfortunately, the vmcore-dmesg(kexec-tools)
> can't correctly read the printk ring buffer information, and get the
> following error:
>
> "Missing the log_buf symbol"
>
> The kexec-tools(vmcore-dmesg) should also have a similar patch, just like
> in the makedumpfile and crash-utility.

Yes, a patch for this is needed (as well as for any other related
software floating around the internet).

I have no RFC patches for vmcore-dmesg. Looking at the code, I think it
would be quite straight forward to port the makedumpfile patch. I will
try to make some time for this.

I do not want to patch any other software for this. I think with 3
examples (crash, makedumpfile, vmcore-dmesg), others should be able to
implement the changes to their software without needing my help.

John Ogness

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

2020-07-02 Thread lijiang
Hi, John Ogness

About the VMCOREINFO part, I made some tests based on the kernel patch
v3, the makedumpfile and crash-utility can work as expected with your
patch(userspace patch), but, unfortunately, the vmcore-dmesg(kexec-tools)
can't correctly read the printk ring buffer information, and get the
following error:

"Missing the log_buf symbol"

The kexec-tools(vmcore-dmesg) should also have a similar patch, just like
in the makedumpfile and crash-utility.

BTW: If you already have a patch for the kexec-tools, would you mind sharing
it? I will make a test for the vmcore-dmesg.

Thanks.
Lianbo

在 2020年06月18日 22:49, John Ogness 写道:
> Replace the existing ringbuffer usage and implementation with
> lockless ringbuffer usage. Even though the new ringbuffer does not
> require locking, all existing locking is left in place. Therefore,
> this change is purely replacing the underlining ringbuffer.
> 
> Changes that exist due to the ringbuffer replacement:
> 
> - The VMCOREINFO has been updated for the new structures.
> 
> - Dictionary data is now stored in a separate data buffer from the
>   human-readable messages. The dictionary data buffer is set to the
>   same size as the message buffer. Therefore, the total required
>   memory for both dictionary and message data is
>   2 * (2 ^ CONFIG_LOG_BUF_SHIFT) for the initial static buffers and
>   2 * log_buf_len (the kernel parameter) for the dynamic buffers.
> 
> - Record meta-data is now stored in a separate array of descriptors.
>   This is an additional 72 * (2 ^ (CONFIG_LOG_BUF_SHIFT - 5)) bytes
>   for the static array and 72 * (log_buf_len >> 5) bytes for the
>   dynamic array.
> 
> Signed-off-by: John Ogness 
> ---
>  include/linux/kmsg_dump.h |   2 -
>  kernel/printk/printk.c| 944 --
>  2 files changed, 497 insertions(+), 449 deletions(-)
> 
> diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
> index 3378bcbe585e..c9b0abe5ca91 100644
> --- a/include/linux/kmsg_dump.h
> +++ b/include/linux/kmsg_dump.h
> @@ -45,8 +45,6 @@ struct kmsg_dumper {
>   bool registered;
>  
>   /* private state of the kmsg iterator */
> - u32 cur_idx;
> - u32 next_idx;
>   u64 cur_seq;
>   u64 next_seq;
>  };
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 8c14835be46c..7642ef634956 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -55,6 +55,7 @@
>  #define CREATE_TRACE_POINTS
>  #include 
>  
> +#include "printk_ringbuffer.h"
>  #include "console_cmdline.h"
>  #include "braille.h"
>  #include "internal.h"
> @@ -294,30 +295,24 @@ enum con_msg_format_flags {
>  static int console_msg_format = MSG_FORMAT_DEFAULT;
>  
>  /*
> - * The printk log buffer consists of a chain of concatenated variable
> - * length records. Every record starts with a record header, containing
> - * the overall length of the record.
> + * The printk log buffer consists of a sequenced collection of records, each
> + * containing variable length message and dictionary text. Every record
> + * also contains its own meta-data (@info).
>   *
> - * The heads to the first and last entry in the buffer, as well as the
> - * sequence numbers of these entries are maintained when messages are
> - * stored.
> + * Every record meta-data carries the timestamp in microseconds, as well as
> + * the standard userspace syslog level and syslog facility. The usual kernel
> + * messages use LOG_KERN; userspace-injected messages always carry a matching
> + * syslog facility, by default LOG_USER. The origin of every message can be
> + * reliably determined that way.
>   *
> - * If the heads indicate available messages, the length in the header
> - * tells the start next message. A length == 0 for the next message
> - * indicates a wrap-around to the beginning of the buffer.
> + * The human readable log message of a record is available in @text, the
> + * length of the message text in @text_len. The stored message is not
> + * terminated.
>   *
> - * Every record carries the monotonic timestamp in microseconds, as well as
> - * the standard userspace syslog level and syslog facility. The usual
> - * kernel messages use LOG_KERN; userspace-injected messages always carry
> - * a matching syslog facility, by default LOG_USER. The origin of every
> - * message can be reliably determined that way.
> - *
> - * The human readable log message directly follows the message header. The
> - * length of the message text is stored in the header, the stored message
> - * is not terminated.
> - *
> - * Optionally, a message can carry a dictionary of properties (key/value 
> pairs),
> - * to provide userspace with a machine-readable message context.
> + * Optionally, a record can carry a dictionary of properties (key/value
> + * pairs), to provide userspace with a machine-readable message context. The
> + * length of the dictionary is available in @dict_len. The dictionary is not
> + * terminated.
>   *
>   * Examples for 

Re: [PATCH v3 2/3] printk: add lockless ringbuffer

2020-07-02 Thread Petr Mladek
On Thu 2020-06-18 16:55:18, John Ogness wrote:
> Introduce a multi-reader multi-writer lockless ringbuffer for storing
> the kernel log messages. Readers and writers may use their API from
> any context (including scheduler and NMI). This ringbuffer will make
> it possible to decouple printk() callers from any context, locking,
> or console constraints. It also makes it possible for readers to have
> full access to the ringbuffer contents at any time and context (for
> example from any panic situation).

Please, find few comments below. They are primary about improving
the comments. There are also 3 ideas how to clean up the code
a bit[*].

Everything might be solved by followup patches. Well, I would prefer
to update the comments in the next respin if you do it because
of the third patch.[**]

The small code changes can be done later. I do not want to introduce
eventual problems at this stage.

Feel free to use:

Reviewed-by: Petr Mladek 


[*] I have pretty good feeling about the code. And I really like the extensive
comments. The remaining ideas are mostly from the nits department.

[**] We are getting close to push this into linux-next. I think that
 it makes sense only when also the 3rd patch is ready.

 Also we should have a solution for the crashdump/crash tools at hands.
 But I think that this is already the case.


> --- /dev/null
> +++ b/kernel/printk/printk_ringbuffer.c
> + * Data Rings
> + * ~~
> + * The two data rings (text and dictionary) function identically. They exist
> + * separately so that their buffer sizes can be individually set and they do
> + * not affect one another.
> + *
> + * Data rings are byte arrays composed of data blocks. Data blocks are
> + * referenced by blk_lpos structs that point to the logical position of the
> + * beginning of a data block and the beginning of the next adjacent data
> + * block. Logical positions are mapped directly to index values of the byte
> + * array ringbuffer.
> + *
> + * Each data block consists of an ID followed by the raw data. The ID is the
> + * identifier of a descriptor that is associated with the data block. A data
> + * block is considered valid if all of the following conditions are met:
> + *
> + *   1) The descriptor associated with the data block is in the committed
> + *  or reusable queried state.

This sounds strange. The information in the descriptor is valid when
the descriptor is in the committed or reusable state. But the data
in the data block are valid only when the related descriptor is in
the committed state.

The data block might get reused faster than the descriptor that owned
it earlier.

> + *   2) The blk_lpos struct within the descriptor associated with the data
> + *  block references back to the same data block.

This is true. But this cross check is needed only when the assocoated
descriptor is found via the ID stored in the data block. It is not
needed when the data block is found via id/seq number stored in
the descriptor that is in commited state.


> + *   3) The data block is within the head/tail logical position range.

Also this is true. But it works by design. It does not need to be
checked when validating the data.

I would like to make this clear. So, I would write something like:


  The raw data stored in a data block can be considered valid when the
  the associated descriptor is in the committed state.

  The above condition is enough when the data block is found via
  information from the descriptor that was chosen by its ID.

  One more check is needed when the data block is found via
  blk_lpos.next stored in the descriptor for the previous data block.
  The data block might point to a descriptor in a committed state just
  by chance. It is the right one only when blk_lpos.begin points back
  to the data block.

I do not know. It might be to complicated. I actually did not mind
about the original text in previous versions. But I realize now that
it might have been the reason why the checks in data_make_reusable()
were too complicated in the previous versions.



> + * Sample reader code::
> + *
> + *   struct printk_info info;
> + *   struct printk_record r;
> + *   char text_buf[32];
> + *   char dict_buf[32];
> + *   u64 seq;
> + *
> + *   prb_rec_init_rd(, , _buf[0], sizeof(text_buf),
> + *   _buf[0], sizeof(dict_buf));
> + *
> + *   prb_for_each_record(0, _rb, , ) {
> + *   if (info.seq != seq)
> + *   pr_warn("lost %llu records\n", info.seq - seq);
> + *
> + *   if (info.text_len > r.text_buf_size) {
> + *   pr_warn("record %llu text truncated\n", info.seq);
> + *   text_buf[sizeof(text_buf) - 1] = 0;
> + *   }

AFAIK, the trailing '\0' is missing even the the text is not
truncated. We need to always add it here. Otherwise, we could not
print the text using %s below.

Also it means that it is not safe to copy the text using
memcpy(new_buf, _buf[0], r.text_len) because 

Re: [PATCH 2/2] arm64: Allocate crashkernel always in ZONE_DMA

2020-07-02 Thread Will Deacon
On Thu, Jul 02, 2020 at 03:44:20AM +0530, Bhupesh Sharma wrote:
> commit bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in
> ZONE_DMA32") allocates crashkernel for arm64 in the ZONE_DMA32.
> 
> However as reported by Prabhakar, this breaks kdump kernel booting in
> ThunderX2 like arm64 systems. I have noticed this on another ampere
> arm64 machine. The OOM log in the kdump kernel looks like this:
> 
>   [0.240552] DMA: preallocated 128 KiB GFP_KERNEL pool for atomic 
> allocations
>   [0.247713] swapper/0: page allocation failure: order:1, 
> mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
>   <..snip..>
>   [0.274706] Call trace:
>   [0.277170]  dump_backtrace+0x0/0x208
>   [0.280863]  show_stack+0x1c/0x28
>   [0.284207]  dump_stack+0xc4/0x10c
>   [0.287638]  warn_alloc+0x104/0x170
>   [0.291156]  __alloc_pages_slowpath.constprop.106+0xb08/0xb48
>   [0.296958]  __alloc_pages_nodemask+0x2ac/0x2f8
>   [0.301530]  alloc_page_interleave+0x20/0x90
>   [0.305839]  alloc_pages_current+0xdc/0xf8
>   [0.309972]  atomic_pool_expand+0x60/0x210
>   [0.314108]  __dma_atomic_pool_init+0x50/0xa4
>   [0.318504]  dma_atomic_pool_init+0xac/0x158
>   [0.322813]  do_one_initcall+0x50/0x218
>   [0.326684]  kernel_init_freeable+0x22c/0x2d0
>   [0.331083]  kernel_init+0x18/0x110
>   [0.334600]  ret_from_fork+0x10/0x18
> 
> This patch limits the crashkernel allocation to the first 1GB of
> the RAM accessible (ZONE_DMA), as otherwise we might run into OOM
> issues when crashkernel is executed, as it might have been originally
> allocated from either a ZONE_DMA32 memory or mixture of memory chunks
> belonging to both ZONE_DMA and ZONE_DMA32.

How does this interact with this ongoing series:

https://lore.kernel.org/r/20200628083458.40066-1-chenzho...@huawei.com

(patch 4, in particular)

> Fixes: bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in ZONE_DMA32")
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Vladimir Davydov 
> Cc: James Morse 
> Cc: Mark Rutland 
> Cc: Will Deacon 
> Cc: Catalin Marinas 
> Cc: cgro...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> Cc: kexec@lists.infradead.org
> Reported-by: Prabhakar Kushwaha 
> Signed-off-by: Bhupesh Sharma 
> ---
>  arch/arm64/mm/init.c | 16 ++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 1e93cfc7c47a..02ae4d623802 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -91,8 +91,15 @@ static void __init reserve_crashkernel(void)
>   crash_size = PAGE_ALIGN(crash_size);
>  
>   if (crash_base == 0) {
> - /* Current arm64 boot protocol requires 2MB alignment */
> - crash_base = memblock_find_in_range(0, arm64_dma32_phys_limit,
> + /* Current arm64 boot protocol requires 2MB alignment.
> +  * Also limit the crashkernel allocation to the first
> +  * 1GB of the RAM accessible (ZONE_DMA), as otherwise we
> +  * might run into OOM issues when crashkernel is executed,
> +  * as it might have been originally allocated from
> +  * either a ZONE_DMA32 memory or mixture of memory
> +  * chunks belonging to both ZONE_DMA and ZONE_DMA32.
> +  */

This comment needs help. Why does putting the crashkernel in ZONE_DMA
prevent "OOM issues"?

Will

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/2] mm/memcontrol: Fix OOPS inside mem_cgroup_get_nr_swap_pages()

2020-07-02 Thread Michal Hocko
On Thu 02-07-20 03:44:19, Bhupesh Sharma wrote:
> Prabhakar reported an OOPS inside mem_cgroup_get_nr_swap_pages()
> function in a corner case seen on some arm64 boards when kdump kernel
> runs with "cgroup_disable=memory" passed to the kdump kernel via
> bootargs.
> 
> The root-cause behind the same is that currently mem_cgroup_swap_init()
> function is implemented as a subsys_initcall() call instead of a
> core_initcall(), this means 'cgroup_memory_noswap' still
> remains set to the default value (false) even when memcg is disabled via
> "cgroup_disable=memory" boot parameter.
> 
> This may result in premature OOPS inside mem_cgroup_get_nr_swap_pages()
> function in corner cases:
> 
>   [0.265617] Unable to handle kernel NULL pointer dereference at virtual 
> address 0188
>   [0.274495] Mem abort info:
>   [0.277311]   ESR = 0x9606
>   [0.280389]   EC = 0x25: DABT (current EL), IL = 32 bits
>   [0.285751]   SET = 0, FnV = 0
>   [0.288830]   EA = 0, S1PTW = 0
>   [0.291995] Data abort info:
>   [0.294897]   ISV = 0, ISS = 0x0006
>   [0.298765]   CM = 0, WnR = 0
>   [0.301757] [0188] user address but active_mm is swapper
>   [0.308174] Internal error: Oops: 9606 [#1] SMP
>   [0.313097] Modules linked in:
>   <..snip..>
>   [0.331384] pstate: 0049 (nzcv daif +PAN -UAO BTYPE=--)
>   [0.337014] pc : mem_cgroup_get_nr_swap_pages+0x9c/0xf4
>   [0.342289] lr : mem_cgroup_get_nr_swap_pages+0x68/0xf4
>   [0.347564] sp : fe0012b6f800
>   [0.350905] x29: fe0012b6f800 x28: fe00116b3000
>   [0.356268] x27: fe0012b6fb00 x26: 0020
>   [0.361631] x25:  x24: fc00723ffe28
>   [0.366994] x23: fe0010d5b468 x22: fe00116bfa00
>   [0.372357] x21: fe0010aabda8 x20: 
>   [0.377720] x19:  x18: 0010
>   [0.383082] x17: 43e612f2 x16: a9863ed7
>   [0.388445] x15:  x14: 202c303d70617773
>   [0.393808] x13: 6f6e5f79726f6d65 x12: 6d5f70756f726763
>   [0.399170] x11: 2073656761705f70 x10: 6177735f726e5f74
>   [0.404533] x9 : fe00100e9580 x8 : fe0010628160
>   [0.409895] x7 : 00a8 x6 : fe00118f5e5e
>   [0.415258] x5 : 0001 x4 : 
>   [0.420621] x3 :  x2 : 
>   [0.425983] x1 :  x0 : fc0060079000
>   [0.431346] Call trace:
>   [0.433809]  mem_cgroup_get_nr_swap_pages+0x9c/0xf4
>   [0.438735]  shrink_lruvec+0x404/0x4f8
>   [0.442516]  shrink_node+0x1a8/0x688
>   [0.446121]  do_try_to_free_pages+0xe8/0x448
>   [0.450429]  try_to_free_pages+0x110/0x230
>   [0.454563]  __alloc_pages_slowpath.constprop.106+0x2b8/0xb48
>   [0.460366]  __alloc_pages_nodemask+0x2ac/0x2f8
>   [0.464938]  alloc_page_interleave+0x20/0x90
>   [0.469246]  alloc_pages_current+0xdc/0xf8
>   [0.473379]  atomic_pool_expand+0x60/0x210
>   [0.477514]  __dma_atomic_pool_init+0x50/0xa4
>   [0.481910]  dma_atomic_pool_init+0xac/0x158
>   [0.486220]  do_one_initcall+0x50/0x218
>   [0.490091]  kernel_init_freeable+0x22c/0x2d0
>   [0.494489]  kernel_init+0x18/0x110
>   [0.498007]  ret_from_fork+0x10/0x18
>   [0.501614] Code: aa1403e3 91106000 97f82a27 1411 (f940c663)
>   [0.507770] ---[ end trace 9795948475817de4 ]---
>   [0.512429] Kernel panic - not syncing: Fatal exception
>   [0.517705] Rebooting in 10 seconds..
> 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Vladimir Davydov 
> Cc: James Morse 
> Cc: Mark Rutland 
> Cc: Will Deacon 
> Cc: Catalin Marinas 
> Cc: cgro...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> Cc: kexec@lists.infradead.org

Fixes: eccb52e78809 ("mm: memcontrol: prepare swap controller setup for 
integration")

> Reported-by: Prabhakar Kushwaha 
> Signed-off-by: Bhupesh Sharma 

This is subtle as hell, I have to say. I find the ordering in the init
calls very unintuitive and extremely hard to follow. The above commit
has introduced the problem but the code previously has worked mostly by
a luck because our default was flipped.

Acked-by: Michal Hocko 

> ---
>  mm/memcontrol.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 19622328e4b5..8323e4b7b390 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7186,6 +7186,13 @@ static struct cftype memsw_files[] = {
>   { },/* terminate */
>  };
>  
> +/*
> + * If mem_cgroup_swap_init() is implemented as a subsys_initcall()
> + * instead of a core_initcall(), this could mean cgroup_memory_noswap still
> + * remains set to false even when memcg is disabled via 
> "cgroup_disable=memory"
> + * boot parameter. This may result in premature OOPS inside 
> + *