Re: [PATCH v10 0/5] PowerPC: In-kernel handling of CPU/Memory hotplug/online/offline events for kdump kernel

2023-04-24 Thread Sourabh Jain



On 24/04/23 19:35, Eric DeVolder wrote:



On 4/23/23 05:52, Sourabh Jain wrote:

To realise this feature, the kdump udev rule must be updated to avoid
reloading the kdump kernel on CPU/Memory hotplug/online/offline events.

   RHEL: /usr/lib/udev/rules.d/98-kexec.rules

-SUBSYSTEM=="cpu", ACTION=="online", GOTO="kdump_reload_cpu"
-SUBSYSTEM=="memory", ACTION=="online", GOTO="kdump_reload_mem"
-SUBSYSTEM=="memory", ACTION=="offline", GOTO="kdump_reload_mem"
+SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
+SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"




I didn't see in the patch series where you would have the equivalent to the following (needed for the sysfs crash_hotplug entries):


#ifdef CONFIG_HOTPLUG_CPU
static inline int crash_hotplug_cpu_support(void) { return 1; }
#define crash_hotplug_cpu_support crash_hotplug_cpu_support
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
static inline int crash_hotplug_memory_support(void) { return 1; }
#define crash_hotplug_memory_support crash_hotplug_memory_support
#endif


I missed the above diff in my testing environment. Thank you for bringing it
to my attention. I will fix this in the next version.

- Sourabh Jain


Re: [PATCH v10 0/5] PowerPC: In-kernel handling of CPU/Memory hotplug/online/offline events for kdump kernel

2023-04-24 Thread Eric DeVolder




On 4/23/23 05:52, Sourabh Jain wrote:

To realise this feature, the kdump udev rule must be updated to avoid
reloading the kdump kernel on CPU/Memory hotplug/online/offline events.

   RHEL: /usr/lib/udev/rules.d/98-kexec.rules

-SUBSYSTEM=="cpu", ACTION=="online", GOTO="kdump_reload_cpu"
-SUBSYSTEM=="memory", ACTION=="online", GOTO="kdump_reload_mem"
-SUBSYSTEM=="memory", ACTION=="offline", GOTO="kdump_reload_mem"
+SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
+SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"



I didn't see in the patch series where you would have the equivalent to the following (needed for the sysfs crash_hotplug entries):


#ifdef CONFIG_HOTPLUG_CPU
static inline int crash_hotplug_cpu_support(void) { return 1; }
#define crash_hotplug_cpu_support crash_hotplug_cpu_support
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
static inline int crash_hotplug_memory_support(void) { return 1; }
#define crash_hotplug_memory_support crash_hotplug_memory_support
#endif
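For context, the `#define crash_hotplug_cpu_support crash_hotplug_cpu_support` line in the snippet quoted above follows the common kernel pattern for arch overrides: generic code can test the name with `#ifndef` and fall back to a default. The sketch below pairs that arch half with an assumed generic fallback that returns 0. The fallback, the hard-coded CONFIG_* defines, and `crash_hotplug_attr_value()` are illustrative assumptions, not code taken from either series.

```c
#include <assert.h>

/* Assumed build configuration for this standalone demo: CPU hotplug is
 * enabled, memory hotplug is not. In the kernel these come from Kconfig. */
#define CONFIG_HOTPLUG_CPU 1
/* #define CONFIG_MEMORY_HOTPLUG 1 */

/* Arch half: same pattern as the quoted snippet. Defining a macro with
 * the function's own name signals to generic code that an override exists. */
#ifdef CONFIG_HOTPLUG_CPU
static inline int crash_hotplug_cpu_support(void) { return 1; }
#define crash_hotplug_cpu_support crash_hotplug_cpu_support
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
static inline int crash_hotplug_memory_support(void) { return 1; }
#define crash_hotplug_memory_support crash_hotplug_memory_support
#endif

/* Generic half (assumed): provide a default of 0 only when the arch
 * did not define its own helper. */
#ifndef crash_hotplug_cpu_support
static inline int crash_hotplug_cpu_support(void) { return 0; }
#endif

#ifndef crash_hotplug_memory_support
static inline int crash_hotplug_memory_support(void) { return 0; }
#endif

/* Hypothetical helper: the value a crash_hotplug sysfs attribute could
 * report for the CPU (is_cpu != 0) or memory subsystem. */
static int crash_hotplug_attr_value(int is_cpu)
{
	return is_cpu ? crash_hotplug_cpu_support()
		      : crash_hotplug_memory_support();
}
```

Without the arch definitions, both attributes would report 0, and a udev rule matching `ATTRS{crash_hotplug}=="1"` would never skip the reload.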



[PATCH v10 0/5] PowerPC: In-kernel handling of CPU/Memory hotplug/online/offline events for kdump kernel

2023-04-23 Thread Sourabh Jain
The Problem:
============
After CPU/Memory hot plug/unplug and online/offline events, the loaded
kdump kernel image holds stale information about the system. Dump
collection with a stale kdump kernel might end up in dump capture failure
or an inaccurate dump collection.

Existing solution:
==================
The existing solution keeps the kdump kernel up-to-date by monitoring
CPU/Memory hotplug/online/offline events via a udev rule and triggering a
full kdump kernel reload for every hotplug event.

Shortcomings:
=============
- Leaves a window, until the reload completes, where a kernel crash might
  not lead to a successful dump collection.
- Reloading all kexec components for each hotplug event is inefficient.
- udev rules are prone to races if hotplug events are frequent.

More details about the issues with the existing solution are posted here:
 - https://lkml.org/lkml/2020/12/14/532
 - https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-February/240254.html

Proposed Solution:
==================
Instead of reloading all kexec segments on every CPU/Memory
hotplug/online/offline event, this patch series updates only the relevant
kexec segment. Once the kexec segments are loaded into the crash kernel
reserved area, an arch-specific hotplug handler updates the relevant kexec
segment based on the hotplug event type.
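The flow described above can be sketched as follows. This is a userspace model, not kernel code: the enum values, `struct kdump_image`, and `handle_hotplug_event()` are hypothetical, and the event-to-segment mapping (CPU add rewrites the FDT, memory add/remove rewrites the elfcorehdr, CPU removal needs no update) is an assumption made for illustration.

```c
#include <assert.h>

/* Hypothetical hotplug action codes (illustrative only). */
enum hp_action { HP_ADD_CPU, HP_REMOVE_CPU, HP_ADD_MEMORY, HP_REMOVE_MEMORY };

/* Hypothetical kinds of segments in a loaded kdump image. */
enum seg_kind {
	SEG_NONE = -1,		/* event needs no segment update */
	SEG_KERNEL,
	SEG_INITRD,
	SEG_FDT,
	SEG_ELFCOREHDR,
	SEG_COUNT
};

/* Simplified stand-in for a loaded kdump image: it only records how
 * many times each segment has been rewritten. */
struct kdump_image {
	int rewrites[SEG_COUNT];
};

/* Assumed mapping from event type to the single affected segment. */
static enum seg_kind segment_for_action(enum hp_action action)
{
	switch (action) {
	case HP_ADD_CPU:
		return SEG_FDT;		/* new CPU node in the device tree */
	case HP_REMOVE_CPU:
		return SEG_NONE;	/* assumed: nothing to update */
	case HP_ADD_MEMORY:
	case HP_REMOVE_MEMORY:
		return SEG_ELFCOREHDR;	/* memory ranges in the ELF core header */
	}
	return SEG_NONE;		/* unknown action: leave the image alone */
}

/* Rewrite only the relevant segment instead of reloading everything.
 * Returns 1 if a segment was updated, 0 otherwise. */
static int handle_hotplug_event(struct kdump_image *img, enum hp_action action)
{
	enum seg_kind kind = segment_for_action(action);

	if (kind == SEG_NONE)
		return 0;
	assert(kind >= 0 && kind < SEG_COUNT);
	img->rewrites[kind]++;
	return 1;
}
```

Each event touches at most one segment, so in this model the kernel and initrd segments are never rewritten.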

Series Dependencies:
====================
This patch series implements the crash hotplug handler on PowerPC. The
generic crash hotplug handler is introduced by the
https://lkml.org/lkml/2023/4/4/1136 patch series.

Git tree for testing:
=====================
The git tree below has this patch series applied on top of the dependent
patch series:
https://github.com/sourabhjains/linux/tree/e21-s10

To realise this feature, the kdump udev rule must be updated to avoid
reloading the kdump kernel on CPU/Memory hotplug/online/offline events.

  RHEL: /usr/lib/udev/rules.d/98-kexec.rules

-SUBSYSTEM=="cpu", ACTION=="online", GOTO="kdump_reload_cpu"
-SUBSYSTEM=="memory", ACTION=="online", GOTO="kdump_reload_mem"
-SUBSYSTEM=="memory", ACTION=="offline", GOTO="kdump_reload_mem"
+SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
+SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
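With the new rules, udev skips the reload whenever a `crash_hotplug` attribute on the device or one of its parents reads 1. As a rough userspace illustration of that same decision, the sketch below reads such an attribute file. The helper names are hypothetical, and the sysfs paths in the comment are an assumption about where the generic series exposes the attribute.

```c
#include <assert.h>
#include <stdio.h>

/* Return 1 if the crash_hotplug attribute at `path` reads "1", 0 if it
 * reads anything else, and -1 if it cannot be opened. The attribute is
 * assumed to live at /sys/devices/system/cpu/crash_hotplug and
 * /sys/devices/system/memory/crash_hotplug. */
static int crash_hotplug_enabled(const char *path)
{
	FILE *f = fopen(path, "r");
	int value = 0;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = 0;	/* unparsable content: treat as unsupported */
	fclose(f);
	return value == 1;
}

/* Hypothetical helper mirroring the udev decision: a full kdump reload is
 * needed unless the kernel reports that it updates the image itself. */
static int needs_full_reload(const char *attr_path)
{
	return crash_hotplug_enabled(attr_path) != 1;
}
```

Treating a missing or unreadable attribute as "reload needed" keeps kernels without the attribute on the existing full-reload path.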

Note: only the kexec_file_load syscall is supported. For kexec_load, minor
changes are required in the kexec tool.

---
Changelog:

v10:
  - Drop the patch that adds the fdt_index attribute to struct kimage_arch;
    find the FDT segment index when needed.
  - Added more details to the commit messages.
  - Rebased onto 6.3.0-rc5

v9:
  - Removed the patch that prepares elfcorehdr crash notes for possible
    CPUs. The patch has moved to the generic patch series that introduces
    the generic infrastructure for in-kernel crash update.
  - Removed the patch that passes the hotplug action type to the arch crash
    hotplug handler function. The generic patch series has introduced the
    hotplug action type in struct kimage.
  - Added detailed commit messages for better understanding.

v8:
  - Restrict fdt_index initialization to machine_kexec_post_load so that
    it works for both kexec_load and kexec_file_load. [3/8] Laurent Dufour

  - Updated the logic to find the number of offline cores. [6/8]

  - Changed the logic to find the elfcore program header to accommodate
    future memory ranges due to memory hotplug events. [8/8]

v7:
  - Added a new config option for this feature.
  - Pass the hotplug action type to the arch-specific handler.

v6:
  - Added crash memory hotplug support.

v5:
  - Replace CONFIG_CRASH_HOTPLUG with CONFIG_HOTPLUG_CPU.
  - Move FDT segment identification for the kexec_load case to the load
    path instead of the crash hotplug handler.
  - Keep the new kimage_arch attribute that tracks the FDT segment under
    the CONFIG_HOTPLUG_CPU config.

v4:
  - Update the logic to find the additional space needed for hot-added CPUs
    after kexec load. Refer to the "[RFC v4 PATCH 4/5] powerpc/crash hp: add
    crash hotplug support for kexec_file_load" patch to know more about the
    change.
  - Fix a couple of typos.
  - Replace pr_err with pr_info_once to warn the user about memory hotplug
    support.
  - In the crash hotplug handler, exit the for loop once the FDT segment is
    found.

v3:
  - Move the fdt_index and fdt_index_vaild variables to the kimage_arch struct.
  - Rebased the patches on top of https://lkml.org/lkml/2022/3/3/674 [v5]
  - Fixed warnings reported by the checkpatch script.

v2:
  - Use the generic hotplug handler introduced by
    https://lkml.org/lkml/2022/2/9/1406, a significant change from v1.
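The v10 entry says the FDT segment index is now found when needed rather than stored in struct kimage_arch. A minimal sketch of such a lookup follows, assuming the segment contents are directly readable and identifying the FDT by the standard device tree header magic (0xd00dfeed, stored big-endian). The struct and function names are hypothetical, and the cover letter does not say the series uses exactly this check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FDT_MAGIC 0xd00dfeedU	/* device tree blob header magic */

/* Hypothetical, simplified view of a loaded kexec segment. */
struct seg {
	const unsigned char *buf;
	size_t size;
};

/* Read a big-endian 32-bit value, as stored in the FDT header. */
static uint32_t be32_at(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return the index of the first segment that starts with the FDT magic,
 * or -1 if no segment does. */
static int find_fdt_segment(const struct seg *segs, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		if (segs[i].buf && segs[i].size >= 4 &&
		    be32_at(segs[i].buf) == FDT_MAGIC)
			return (int)i;
	}
	return -1;
}
```

Scanning on demand costs one short loop over the loaded segments per hotplug event, in exchange for no extra field in struct kimage_arch.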

Sourabh Jain (5):
  powerpc/kexec: turn some static helper functions public
  powerpc/crash: introduce a new config option CRASH_HOTPLUG
  powerpc/crash: add crash CPU hotplug support
  crash: forward memory_notify args to arch crash hotplug handler
  powerpc/kexec: add crash memory hotplug support

 arch/powerpc/Kconfig                    |  12 +
 arch/powerpc/include/asm/kexec.h        |  10 +
 arch/powerpc/include/asm/kexec_ranges.h |   1 +
 arch/powerpc/kexec/core_64.c            | 301