subject:"Re\: \[PATCH v12 7\/7\] x86\/crash\: Add x86 crash hotplug support"

On Fri, Oct 28, 2022 at 04:22:54PM -0500, Eric DeVolder wrote:
> /*
>  * For the kexec_file_load() syscall path, specify the maximum number of
>  * memory regions that the elfcorehdr buffer/segment can accommodate.
>  * These regions are obtained via walk_system_ram_res(); eg. the
>  * 'System RAM' entries in /proc/iomem.
>  * This value is combined with NR_CPUS and multiplied by sizeof(Elf64_Phdr)

NR_CPUS_DEFAULT

>  * to determine the final elfcorehdr memory buffer/segment size.
>  * The value 8192, for example, covers a (sparsely populated) 1TiB system
>  * consisting of 128MiB memblock size, while resulting in an elfcorehdr
>  * memory buffer/segment size under 1MiB.

... and it is a sane choice trying to accomodate both actual baremetal
and VM configurations."

Yeah, it's a good start.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support





On 10/28/22 15:30, Borislav Petkov wrote:

On Fri, Oct 28, 2022 at 02:26:58PM -0500, Eric DeVolder wrote:

config CRASH_MAX_MEMORY_RANGES
 depends on CRASH_DUMP && KEXEC_FILE && MEMORY_HOTPLUG
 int
 default 8192
 help
   For the kexec_file_load path, specify the maximum number of
   memory regions, eg. as represented by the 'System RAM' entries
   in /proc/iomem, that the elfcorehdr buffer/segment can accommodate.
   This value is combined with NR_CPUS and multiplied by Elf64_Phdr
   size to determine the final buffer size.


No, do this:

config CRASH_MEMORY_HOTPLUG_SUPPORT
 depends on CRASH_DUMP && KEXEC_FILE && MEMORY_HOTPLUG
 help
   Help text explaining what this feature is

this thing will simply get enabled when the user enables MEMORY_HOTPLUG
and CRASH_DUMP.

and then you do in the code:

/*
  * A comment explaining how the 8192 value has been selected.
  */
#define CRASH_MAX_MEMORY_RANGES 8192

Thx.



How is this comment?

/*
 * For the kexec_file_load() syscall path, specify the maximum number of
 * memory regions that the elfcorehdr buffer/segment can accommodate.
 * These regions are obtained via walk_system_ram_res(); eg. the
 * 'System RAM' entries in /proc/iomem.
 * This value is combined with NR_CPUS and multiplied by sizeof(Elf64_Phdr)
 * to determine the final elfcorehdr memory buffer/segment size.
 * The value 8192, for example, covers a (sparsely populated) 1TiB system
 * consisting of 128MiB memblock size, while resulting in an elfcorehdr
 * memory buffer/segment size under 1MiB.
 */
#define CRASH_MAX_MEMORY_RANGES 8192

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support





On 10/28/22 15:30, Borislav Petkov wrote:

On Fri, Oct 28, 2022 at 02:26:58PM -0500, Eric DeVolder wrote:

config CRASH_MAX_MEMORY_RANGES
 depends on CRASH_DUMP && KEXEC_FILE && MEMORY_HOTPLUG
 int
 default 8192
 help
   For the kexec_file_load path, specify the maximum number of
   memory regions, eg. as represented by the 'System RAM' entries
   in /proc/iomem, that the elfcorehdr buffer/segment can accommodate.
   This value is combined with NR_CPUS and multiplied by Elf64_Phdr
   size to determine the final buffer size.


No, do this:

config CRASH_MEMORY_HOTPLUG_SUPPORT
 depends on CRASH_DUMP && KEXEC_FILE && MEMORY_HOTPLUG
 help
   Help text explaining what this feature is

this thing will simply get enabled when the user enables MEMORY_HOTPLUG
and CRASH_DUMP.

and then you do in the code:

/*
  * A comment explaining how the 8192 value has been selected.
  */
#define CRASH_MAX_MEMORY_RANGES 8192

Thx.


ok, will do!
thanks!
eric

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

On Fri, Oct 28, 2022 at 02:26:58PM -0500, Eric DeVolder wrote:
> config CRASH_MAX_MEMORY_RANGES
> depends on CRASH_DUMP && KEXEC_FILE && MEMORY_HOTPLUG
> int
> default 8192
> help
>   For the kexec_file_load path, specify the maximum number of
>   memory regions, eg. as represented by the 'System RAM' entries
>   in /proc/iomem, that the elfcorehdr buffer/segment can accommodate.
>   This value is combined with NR_CPUS and multiplied by Elf64_Phdr
>   size to determine the final buffer size.

No, do this:

config CRASH_MEMORY_HOTPLUG_SUPPORT
depends on CRASH_DUMP && KEXEC_FILE && MEMORY_HOTPLUG
help
  Help text explaining what this feature is

this thing will simply get enabled when the user enables MEMORY_HOTPLUG
and CRASH_DUMP.

and then you do in the code:

/*
 * A comment explaining how the 8192 value has been selected.
 */
#define CRASH_MAX_MEMORY_RANGES 8192

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support





On 10/28/22 12:06, Borislav Petkov wrote:

On Fri, Oct 28, 2022 at 10:29:45AM -0500, Eric DeVolder wrote:

So it is with this in mind that I suggest we stay with the statically sized 
elfcorehdr buffer.

If that can be agreed upon, then it is "just a matter" of picking a useful
elfcorehdr size. Currently that size is derived from the NR_DEFAULT_CPUS and
CRASH_MAX_MEMORY_RANGES. So, there is still the CRASH_MAX_MEMORY_RANGES knob
to help a dial in size, should there be some issue with the default
value/size.


Let's see

 kbuf.memsz =
 (CONFIG_NR_CPUS_DEFAULT + CONFIG_CRASH_MAX_MEMORY_RANGES) *
 sizeof(Elf64_Phdr);

which, IINM, is

(8192 + 32768) * 56

which is something like 2M.

(CONFIG_NR_CPUS_DEFAULT = 8192 - this is because of MAXSMP which gets
set on distro kernels)

Now, since userspace kexec tools uses 2048 for max memory ranges, that
size becomes smaller - around half a Mb. And since y'all wanna be on the
safe side, you can quadruple it and have

(8192 + 8192) * 56

which is still under a megabyte. And that's fine, I guess, on a big
server.


Excellent, I'll set CRASH_MAX_MEMORY_RANGES to 8192! That seems a quite fair trade off of elfcorehdr 
size vs system size (ie 1TiB w/ 128MiB memblock size).





Or if there is desire to drop computing the size from NR_DEFAULT_CPUs and


I think you should leave the dependency on the Kconfig size so that
smaller machines which are configured this way, don't end up wasting
unnecessary memory.


Excellent, I'll leave the computation as NR_DEFAULT_CPUS + 
CRASH_MAX_MEMORY_RANGES.




It is my intention to correct the CRASH_MAX_MEMORY_RANGES (if we keep it) as 
such:

config CRASH_MAX_MEMORY_RANGES
 depends on CRASH_DUMP && KEXEC_FILE && MEMORY_HOTPLUG


Yes, but don't leave it to the user to decide what number to choose
- choose a high enough number, explain why you've chosen this with a
comment and that's it.


I currently have the Kconfig item as:

config CRASH_MAX_MEMORY_RANGES
depends on CRASH_DUMP && KEXEC_FILE && MEMORY_HOTPLUG
int
default 8192
help
  For the kexec_file_load path, specify the maximum number of
  memory regions, eg. as represented by the 'System RAM' entries
  in /proc/iomem, that the elfcorehdr buffer/segment can accommodate.
  This value is combined with NR_CPUS and multiplied by Elf64_Phdr
  size to determine the final buffer size.

I'll work to provide information a better explanation as to the 8192 number.

Thank you!
eric



Thx.



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

On Fri, Oct 28, 2022 at 10:29:45AM -0500, Eric DeVolder wrote:
> So it is with this in mind that I suggest we stay with the statically sized 
> elfcorehdr buffer.
> 
> If that can be agreed upon, then it is "just a matter" of picking a useful
> elfcorehdr size. Currently that size is derived from the NR_DEFAULT_CPUS and
> CRASH_MAX_MEMORY_RANGES. So, there is still the CRASH_MAX_MEMORY_RANGES knob
> to help a dial in size, should there be some issue with the default
> value/size.

Let's see

kbuf.memsz =
(CONFIG_NR_CPUS_DEFAULT + CONFIG_CRASH_MAX_MEMORY_RANGES) *
sizeof(Elf64_Phdr);

which, IINM, is

(8192 + 32768) * 56

which is something like 2M.

(CONFIG_NR_CPUS_DEFAULT = 8192 - this is because of MAXSMP which gets
set on distro kernels)

Now, since userspace kexec tools uses 2048 for max memory ranges, that
size becomes smaller - around half a Mb. And since y'all wanna be on the
safe side, you can quadruple it and have

(8192 + 8192) * 56

which is still under a megabyte. And that's fine, I guess, on a big
server.

> Or if there is desire to drop computing the size from NR_DEFAULT_CPUs and

I think you should leave the dependency on the Kconfig size so that
smaller machines which are configured this way, don't end up wasting
unnecessary memory.

> It is my intention to correct the CRASH_MAX_MEMORY_RANGES (if we keep it) as 
> such:
> 
> config CRASH_MAX_MEMORY_RANGES
> depends on CRASH_DUMP && KEXEC_FILE && MEMORY_HOTPLUG

Yes, but don't leave it to the user to decide what number to choose
- choose a high enough number, explain why you've chosen this with a
comment and that's it.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

On 10/28/22 05:19, Borislav Petkov wrote:

On Thu, Oct 27, 2022 at 02:24:11PM -0500, Eric DeVolder wrote:

Be aware, in reality, that if the system was fully populated, it would not
actually consume all 8192 phdrs. Rather /proc/iomem would essentially show a
large contiguous address space which would require just a single phdr.

Then that from below:

pnum += CONFIG_CRASH_MAX_MEMORY_RANGES;

which then would end up allocating 8192 would be a total waste.

So why don't you make that number dynamic then?

You start with something sensible:

total_num_pheaders = num_online_cpus() + "some number of regions" + "some
few others"

I.e., a number which is a good compromise on the majority of machines.

Then, on hotplug events you count how many new regions are coming in
and when you reach the total_num_pheaders number, you double it (or some
other increase stragegy), reallocate the ELF header buffers etc needed
for kdump and you're good.

This way, you don't waste memory unnecessarily on the majority of
systems and those who need more, get to allocate more.

This patch series sizes and allocates the memory buffer/segment for the elfcorehdr once, at kdump
load time.

In order to dynamically resize the elcorehdr memory buffer/segment, that causes the following ripple
effects:

- Permitting resizing of the elfcorehdr requires a means to "allocate" a new size buffer from
within the crash kernel reserved area. There is no allocator today; currently it is a kind of
one-pass placement process that happens at load time. The critical side effect of allocating a new
elfcorehdr buffer memory/segment is that it creates a new address for the elfcorehdr.

- The elfcorehdr is passed to the crash kernel via the elfcorehdr= kernel cmdline option. As such,
a dynamic change to the size of the elfcorehdr size necessarily invites a change of address of that
buffer, and therefore a change to rewrite the crash kernel cmdline to reflect the new elfcorehdr
buffer address.

- A change to the cmdline, also invites a possible change of address of the buffer containing the
cmdline, and thus a change to the x86 boot_params, which contains the cmdline pointer.

- A change to the cmdline and/or boot_params, which are *not* excluded from the hash/digest, means
that the purgatory hash/digest needs to be recomputed, and purgatory re-linked with the new
hash/digest and replaced.

A fair amount of work, but I have had this working in the past, around the v1 patch series
timeframe. However, it only worked for the kexec_file_load() syscall as all the needed pieces of
information were available; but for kexec_load(), it isn't possible to relink purgatory as by that
point purgatory is but a user-space binary blob.

It was feedback on the v1/v2 that pointed out that by excluding the elfcorehdr from the hash/digest,
the "change of address" problem with the elfcorehdr buffer/segment goes away, and, in turn, negates
the need to: introduce an allocator for the crash kernel reserved space, rewrite the crash kernel
cmdline with a new elfcorehdr, update boot_params with a new cmdline and re-link and replace
purgatory with the updated digest. And it enables this hotplug efforts to support kexec_load()
syscall as well.

So it is with this in mind that I suggest we stay with the statically sized
elfcorehdr buffer.

If that can be agreed upon, then it is "just a matter" of picking a useful elfcorehdr size.
Currently that size is derived from the NR_DEFAULT_CPUS and CRASH_MAX_MEMORY_RANGES. So, there is
still the CRASH_MAX_MEMORY_RANGES knob to help a dial in size, should there be some issue with the
default value/size.

Or if there is desire to drop computing the size from NR_DEFAULT_CPUs and CRASH_MAX_MEMORY_RANGES
and simply go with CRASH_HOTPLUG_ELFCOREHDR_SZ which simply specifies the buffer size, then I'm also
good with that.

I still owe a much better explanation of how to size the elfcorehdr. I can use the comments and
ideas from the discussion to provide the necessary insight when choosing this value, whether that be
CRASH_MAX_MEMORY_RANGES or CRASH_HOTPLUG_ELFCOREHDR_SZ.

I'd prefer keeping CRASH_MAX_MEMORY_RANGES as that allow the maximum phdr
number value to be reflective of CPUs and/or memory; not all systems support
both CPU and memory hotplug. For example, I have queued up this change to
reflect this:

if (IS_ENABLED(CONFIG_HOTPLUG_CPU) || IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) {

If you're going to keep CRASH_MAX_MEMORY_RANGES, then you can test only
that thing as it expresses the dependency on CONFIG_HOTPLUG_CPU and
CONFIG_MEMORY_HOTPLUG already.

If you end up making the number dynamic, then you could make that a
different Kconfig item which contains all that crash code as most of the
people won't need it anyway.

It is my intention to correct the CRASH_MAX_MEMORY_RANGES (if we keep it) as
such:

config CRASH_MAX_MEMORY_RANGES
depends on CRASH_DUMP && KEXEC_FILE &&

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

On Thu, Oct 27, 2022 at 02:24:11PM -0500, Eric DeVolder wrote:
> Be aware, in reality, that if the system was fully populated, it would not
> actually consume all 8192 phdrs. Rather /proc/iomem would essentially show a
> large contiguous address space which would require just a single phdr.

Then that from below:

pnum += CONFIG_CRASH_MAX_MEMORY_RANGES;

which then would end up allocating 8192 would be a total waste.

So why don't you make that number dynamic then?

You start with something sensible:

total_num_pheaders = num_online_cpus() + "some number of regions" + 
"some few others"

I.e., a number which is a good compromise on the majority of machines.

Then, on hotplug events you count how many new regions are coming in
and when you reach the total_num_pheaders number, you double it (or some
other increase stragegy), reallocate the ELF header buffers etc needed
for kdump and you're good.

This way, you don't waste memory unnecessarily on the majority of
systems and those who need more, get to allocate more.

> I'd prefer keeping CRASH_MAX_MEMORY_RANGES as that allow the maximum phdr
> number value to be reflective of CPUs and/or memory; not all systems support
> both CPU and memory hotplug. For example, I have queued up this change to
> reflect this:
> 
> if (IS_ENABLED(CONFIG_HOTPLUG_CPU) || IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) {

If you're going to keep CRASH_MAX_MEMORY_RANGES, then you can test only
that thing as it expresses the dependency on CONFIG_HOTPLUG_CPU and
CONFIG_MEMORY_HOTPLUG already.

If you end up making the number dynamic, then you could make that a
different Kconfig item which contains all that crash code as most of the
people won't need it anyway.

Hmm?

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

2022-10-27 Thread Eric DeVolder





On 10/27/22 08:52, Baoquan He wrote:

On 10/26/22 at 04:54pm, David Hildenbrand wrote:

On 26.10.22 16:48, Baoquan He wrote:

On 10/25/22 at 12:31pm, Borislav Petkov wrote:

On Thu, Oct 13, 2022 at 10:57:28AM +0800, Baoquan He wrote:

The concern to range number mainly is on Virt guest systems.


And why would virt emulate 1K hotpluggable DIMM slots and not emulate a
real machine?


IIRC, ACPI only allows for 256 slots. PPC dlpar might provide more.



Well, currently, mem hotpug is an important feature on virt system to
dynamically increase/shrink memory on the system. If only emulating real
machine, it won't be different than bare metal system.

IIRC, the ballon driver or virtio-mem feature can add memory board, e.g
1G, block size is 128M, 8 blocks added. When shrinking this 1G memory
later, it will take best effort way to hot remove memory. Means if any
memory block is occupied, it will be kept there. Finally we could only
remove every second blocks, 4 blocks altogether. Then the left
un-removed blocks will produce 4 separate memory regions. Like this, a
virt guest could have many memory regions in kernel after memory
being added/removed.

If I am wrong, Please correct me, David.


Yes, virtio-mem (but also PPC dlpar) can result in many individual memory
blocks with holes in between after hotunplug. Hotplug OTOH, usually tries to
"plug" these holes and reduce the total number of memory blocks. It might be
rare that our range will be heavily fragmented after unplug, but it's
certainly possible.

[...]



Yes, now assume we have a HPE SGI system and it has memory hotplug
capacity. The system itself has already got memory regions more than
1024. Then when we hot add extra memory board, we want to include the
newly added memory regions into elfcorehdr so that it will be dumped out
in kdump kernel.

That's why I earlier suggested 2048 for number of memory regions.


The more the better, unless "it hurts". Assuming a single memory block is
128 MiB, that would be 256 GiB.

Usually, on big systems, the memory block size is 2 GiB. So 4 TiB.


Thanks a lot for these valuable inputs, David.

Hi Boris, Eric

So what's your suggested value for the Kconfig option?

1) cpu number, 1024?
2) memory regions, 2048?

About below draft, any comment? We can decide a value based on our
knowledge, can adjust later if any real system has more than the number.

+config CRASH_ELF_CORE_PHDRS_NUM
+   depends on CRASH_DUMP && KEXEC_FILE && (HOTPLUG_CPU || MEMORY_HOTPLUG)
+   int
+   default 3072
+   help
+ For the kexec_file_load path, specify the default number of
+ phdr for the vmcore. E.g the memory regions represented by the
+ 'System RAM' entries in /proc/iomem, the cpu notes of each
+ present cpu stored in /sys/devices/system/cpu/cpuX/crash_notes.

Thanks
Baoquan



I prefer to keep CRASH_MAX_MEMORY_RANGES, as explained in my response to your 
message on October 26.
eric

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

2022-10-27 Thread Eric DeVolder

On 10/26/22 09:48, Baoquan He wrote:

On 10/25/22 at 12:31pm, Borislav Petkov wrote:

On Thu, Oct 13, 2022 at 10:57:28AM +0800, Baoquan He wrote:

The concern to range number mainly is on Virt guest systems.

And why would virt emulate 1K hotpluggable DIMM slots and not emulate a
real machine?

Well, currently, mem hotpug is an important feature on virt system to
dynamically increase/shrink memory on the system. If only emulating real
machine, it won't be different than bare metal system.

IIRC, the ballon driver or virtio-mem feature can add memory board, e.g
1G, block size is 128M, 8 blocks added. When shrinking this 1G memory
later, it will take best effort way to hot remove memory. Means if any
memory block is occupied, it will be kept there. Finally we could only
remove every second blocks, 4 blocks altogether. Then the left
un-removed blocks will produce 4 separate memory regions. Like this, a
virt guest could have many memory regions in kernel after memory
being added/removed.

If I am wrong, Please correct me, David.

On baremetal system, basically only very high end server support
memory hotplug. I ever visited customer's lab and saw one server,
it owns 8 slots, on each slot a box containing about 20 cpus and 2T
memory at most can be plugged in at one time. So people won't make too
many slots for hotplugging since it's too expensive.

There you have it - the persuading argument.

So after re-reading the exchanges, many times, I think I realized that I have introduced confusion
by using "hotplug", and specifically "memory hotplug" and DIMMs in the same breath, and thus that
perhaps equated hotplug with ACPI DIMMs in these discussions.

Allow me to state that "hotplug" in this patch series refers to CPU and memory hot un/plug, and does
*not* explicitly refer to any particular underlying technology to generate CPU and/or memory hot
un/plug events.

For example, I have been using DIMMs as my example because that has been my test vehicle for
exercising this code; as such I think the discussion cornered itself into real world vs virt
discussion about DIMMs, with no end in sight. To be plain, this patch series does not intend to
convey or change anything specific about ACPI DIMMs.

In reality, when I state "hotplug" in these patches, I am talking generically and therefore
inclusive of any technology that can hot un/plug CPUs or memory. For memory specifically, this
includes ACPI DIMMs (whether baremetal or QEMU), ballooning, virtio-mem, PPC dlpar (per David),
Microsoft DynamicMemory, and the upcoming CXL.mem technology. Probably others that I am not aware.
Any of these technologies can add or remove memory from a bare metal and/or virtual system.

I apologize.

What is important is the number of memory regions (ie. /proc/iomem entries) that can be considered
to be the maximum. There is no kernel definition of such. The need to identify a maximum number is
so that the buffer containing the elfcorehdr can be sized and allocated at kdump load time, *once*.
This elfcorehdr buffer is then modified/updated repeatedly as hot un/plug events occur. It is *not*
re-allocated on each hot un/plug event; that is what the current solution does, sort of.

I checked user space kexec code, the maximum memory range number is
honored to x86_64 because of a HPE SGI system. After that, nobody
complains about it. Please see below user space kexec-tools commit in
https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git

The memory ranges may be not all made by different DIMM slots, could be
firmware reservatoin, e.g efi/BIOS diggged out physical memory,

I don't know what that means.

If it is firmware crap, you want to exclude that from kdump anyway.

Yes, now assume we have a HPE SGI system and it has memory hotplug
capacity. The system itself has already got memory regions more than
1024. Then when we hot add extra memory board, we want to include the
newly added memory regions into elfcorehdr so that it will be dumped out
in kdump kernel.

That's why I earlier suggested 2048 for number of memory regions.

Now CONFIG_NR_CPUS has the maximum number as 8192. And user space
kexec-tools has maximum memory range number as 2048. We can take
the current 8192 + 2048 = 10K as default value conservatively. Or
take 8192 + 2048 * 2 = 12K which has two times of maximum memory range
bumber in kexec-tools. What do you think?

I still think that we should stick to reality and support what is
possible not what is potentially and theoretically there.

Yes, agree. We should try to get a number which satisfies needs in
reality.

For Kconfig CRASH_MAX_MEMORY_RANGES in this patch, I have three items to
suggest:

1) the name is not good, it doesn't reflect the fact that it's the
number of program headers of elfcorehdr which includes the cpu note
numbers and memory region numers.

The total number of program headers is,

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

2022-10-26 Thread David Hildenbrand


On 26.10.22 16:48, Baoquan He wrote:

On 10/25/22 at 12:31pm, Borislav Petkov wrote:

On Thu, Oct 13, 2022 at 10:57:28AM +0800, Baoquan He wrote:

The concern to range number mainly is on Virt guest systems.


And why would virt emulate 1K hotpluggable DIMM slots and not emulate a
real machine?


IIRC, ACPI only allows for 256 slots. PPC dlpar might provide more.



Well, currently, mem hotpug is an important feature on virt system to
dynamically increase/shrink memory on the system. If only emulating real
machine, it won't be different than bare metal system.

IIRC, the ballon driver or virtio-mem feature can add memory board, e.g
1G, block size is 128M, 8 blocks added. When shrinking this 1G memory
later, it will take best effort way to hot remove memory. Means if any
memory block is occupied, it will be kept there. Finally we could only
remove every second blocks, 4 blocks altogether. Then the left
un-removed blocks will produce 4 separate memory regions. Like this, a
virt guest could have many memory regions in kernel after memory
being added/removed.

If I am wrong, Please correct me, David.


Yes, virtio-mem (but also PPC dlpar) can result in many individual 
memory blocks with holes in between after hotunplug. Hotplug OTOH, 
usually tries to "plug" these holes and reduce the total number of 
memory blocks. It might be rare that our range will be heavily 
fragmented after unplug, but it's certainly possible.


[...]



Yes, now assume we have a HPE SGI system and it has memory hotplug
capacity. The system itself has already got memory regions more than
1024. Then when we hot add extra memory board, we want to include the
newly added memory regions into elfcorehdr so that it will be dumped out
in kdump kernel.

That's why I earlier suggested 2048 for number of memory regions.


The more the better, unless "it hurts". Assuming a single memory block 
is 128 MiB, that would be 256 GiB.


Usually, on big systems, the memory block size is 2 GiB. So 4 TiB.

--
Thanks,

David / dhildenb


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

2022-10-26 Thread Baoquan He

On 10/25/22 at 12:31pm, Borislav Petkov wrote:
> On Thu, Oct 13, 2022 at 10:57:28AM +0800, Baoquan He wrote:
> > The concern to range number mainly is on Virt guest systems.
> 
> And why would virt emulate 1K hotpluggable DIMM slots and not emulate a
> real machine?

Well, currently, mem hotpug is an important feature on virt system to
dynamically increase/shrink memory on the system. If only emulating real
machine, it won't be different than bare metal system.

IIRC, the ballon driver or virtio-mem feature can add memory board, e.g
1G, block size is 128M, 8 blocks added. When shrinking this 1G memory
later, it will take best effort way to hot remove memory. Means if any
memory block is occupied, it will be kept there. Finally we could only
remove every second blocks, 4 blocks altogether. Then the left
un-removed blocks will produce 4 separate memory regions. Like this, a
virt guest could have many memory regions in kernel after memory
being added/removed.

If I am wrong, Please correct me, David.

> 
> > On baremetal system, basically only very high end server support
> > memory hotplug. I ever visited customer's lab and saw one server,
> > it owns 8 slots, on each slot a box containing about 20 cpus and 2T
> > memory at most can be plugged in at one time. So people won't make too
> > many slots for hotplugging since it's too expensive.
> 
> There you have it - the persuading argument.
> 
> > I checked user space kexec code, the maximum memory range number is
> > honored to x86_64 because of a HPE SGI system. After that, nobody
> > complains about it. Please see below user space kexec-tools commit in
> > https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
> > 
> > The memory ranges may be not all made by different DIMM slots, could be
> > firmware reservatoin, e.g efi/BIOS diggged out physical memory,
>   ^
> 
> I don't know what that means.
> 
> If it is firmware crap, you want to exclude that from kdump anyway.

Yes, now assume we have a HPE SGI system and it has memory hotplug
capacity. The system itself has already got memory regions more than
1024. Then when we hot add extra memory board, we want to include the
newly added memory regions into elfcorehdr so that it will be dumped out
in kdump kernel.

That's why I earlier suggested 2048 for number of memory regions.

> 
> > Now CONFIG_NR_CPUS has the maximum number as 8192. And user space
> > kexec-tools has maximum memory range number as 2048. We can take
> > the current 8192 + 2048  = 10K as default value conservatively. Or
> > take 8192 + 2048 * 2 = 12K which has two times of maximum memory range
> > bumber in kexec-tools. What do you think?
> 
> I still think that we should stick to reality and support what is
> possible not what is potentially and theoretically there.

Yes, agree. We should try to get a number which satisfies needs in
reality.

For Kconfig CRASH_MAX_MEMORY_RANGES in this patch, I have three items to
suggest:

1) the name is not good, it doesn't reflect the fact that it's the
number of program headers of elfcorehdr which includes the cpu note 
numbers and memory region numers.

2) default cpu number, I suggest 512 or 1024. The biggest number I
ever saw in reality is 384. On virt system, it won't be too big. Below
is abstracted from arch/x86/Kconfig. A smaller one is also OK, we can
enlarge it when people really have a super machine and run into the
problem.

   config NR_CPUS_DEFAULT
   int
   depends on X86_64
   default 8192 if  MAXSMP
   default   64 if  SMP
   default1 if !SMP

3) For memory regions, I would suggest 2048. Likewise, smaller value is
also fine, we can enlarge it when a real system run into this.

I made a draft here for reference, with my undertanding. Please feel
free to change it.

+config CRASH_ELF_CORE_PHDRS_NUM
+   depends on CRASH_DUMP && KEXEC_FILE && (HOTPLUG_CPU || MEMORY_HOTPLUG)
+   int
+   default 3072
+   help
+ For the kexec_file_load path, specify the default number of
+ phdr for the vmcore. E.g the memory regions represented by the
+ 'System RAM' entries in /proc/iomem, the cpu notes of each
+ present cpu stored in /sys/devices/system/cpu/cpuX/crash_notes.

Thanks

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

2022-10-25 Thread Borislav Petkov

On Wed, Oct 12, 2022 at 11:20:59AM -0500, Eric DeVolder wrote:
> I once had CONFIG_CRASH_HOTPLUG, but you disagreed.
> 
> https://lore.kernel.org/lkml/ylgot+ludql+g%2...@zn.tnic/
> 
> From there I simply went with
> 
>  #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG)
> 
> which route do you prefer?

If you do a single Kconfig item which depends on those two, it probably
is cleaner this way. And if the max memory ranges are hardcoded you
don't need the other prompt asking the user something she most likely
doesn't know how to answer properly.

That is, unless you wanna have that crash hotplug built in all the time.

Because CONFIG_HOTPLUG_CPU is pretty much always enabled so you might
just as well add the crash hotplug support unconditionally, without any
Kconfig ifdeffery whatsoever except CONFIG_MEMORY_HOTPLUG as that is
special and not present on the majority of hardware.

But on a plain simple laptop or workstation which has CPU hotplug, would
it make sense for the crash ranges to get updated too when CPUs are
offlined?

If so, I think you want this code present there too, without a Kconfig
item.

If this is server-only anyway, then a single Kconfig item sounds like
not such a bad idea.

I hope that makes some sense.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

2022-10-25 Thread Borislav Petkov

On Thu, Oct 13, 2022 at 10:57:28AM +0800, Baoquan He wrote:
> The concern to range number mainly is on Virt guest systems.

And why would virt emulate 1K hotpluggable DIMM slots and not emulate a
real machine?

> On baremetal system, basically only very high end server support
> memory hotplug. I ever visited customer's lab and saw one server,
> it owns 8 slots, on each slot a box containing about 20 cpus and 2T
> memory at most can be plugged in at one time. So people won't make too
> many slots for hotplugging since it's too expensive.

There you have it - the persuading argument.

> I checked user space kexec code, the maximum memory range number is
> honored to x86_64 because of a HPE SGI system. After that, nobody
> complains about it. Please see below user space kexec-tools commit in
> https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
> 
> The memory ranges may be not all made by different DIMM slots, could be
> firmware reservatoin, e.g efi/BIOS diggged out physical memory,
^

I don't know what that means.

If it is firmware crap, you want to exclude that from kdump anyway.

> Now CONFIG_NR_CPUS has the maximum number as 8192. And user space
> kexec-tools has maximum memory range number as 2048. We can take
> the current 8192 + 2048  = 10K as default value conservatively. Or
> take 8192 + 2048 * 2 = 12K which has two times of maximum memory range
> bumber in kexec-tools. What do you think?

I still think that we should stick to reality and support what is
possible not what is potentially and theoretically there.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

2022-10-17 Thread Sourabh Jain



On 08/10/22 01:03, Eric DeVolder wrote:



On 9/19/22 02:06, Sourabh Jain wrote:


On 10/09/22 02:35, Eric DeVolder wrote:

For x86_64, when CPU or memory is hot un/plugged, the crash
elfcorehdr, which describes the CPUs and memory in the system,
must also be updated.

When loading the crash kernel via kexec_load or kexec_file_load,
the elfcorehdr is identified at run time in
crash_core:handle_hotplug_event().

To update the elfcorehdr for x86_64, a new elfcorehdr must be
generated from the available CPUs and memory. The new elfcorehdr
is prepared into a buffer, and then installed over the top of
the existing elfcorehdr.

In the patch 'kexec: exclude elfcorehdr from the segment digest'
the need to update purgatory due to the change in elfcorehdr was
eliminated.  As a result, no changes to purgatory or boot_params
(as the elfcorehdr= kernel command line parameter pointer
remains unchanged and correct) are needed, just elfcorehdr.

To accommodate a growing number of resources via hotplug, the
elfcorehdr segment must be sufficiently large enough to accommodate
changes, see the CRASH_MAX_MEMORY_RANGES configure item.

With this change, crash hotplug for kexec_file_load syscall
is supported. The kexec_load is also supported, but also
requires a corresponding change to userspace kexec-tools.

Signed-off-by: Eric DeVolder 
Acked-by: Baoquan He 
---
  arch/x86/Kconfig |  11 
  arch/x86/include/asm/kexec.h |  20 +++
  arch/x86/kernel/crash.c  | 102 
+++

  3 files changed, 133 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f9920f1341c8..cdfc9b2fdf98 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2056,6 +2056,17 @@ config CRASH_DUMP
    (CONFIG_RELOCATABLE=y).
    For more details see Documentation/admin-guide/kdump/kdump.rst
+config CRASH_MAX_MEMORY_RANGES
+    depends on CRASH_DUMP && KEXEC_FILE && (HOTPLUG_CPU || 
MEMORY_HOTPLUG)

+    int
+    default 32768
+    help
+  For the kexec_file_load path, specify the maximum number of
+  memory regions, eg. as represented by the 'System RAM' entries
+  in /proc/iomem, that the elfcorehdr buffer/segment can 
accommodate.

+  This value is combined with NR_CPUS and multiplied by Elf64_Phdr
+  size to determine the final buffer size.
+
  config KEXEC_JUMP
  bool "kexec jump"
  depends on KEXEC && HIBERNATION
diff --git a/arch/x86/include/asm/kexec.h 
b/arch/x86/include/asm/kexec.h

index a3760ca796aa..432073385b2d 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -212,6 +212,26 @@ typedef void crash_vmclear_fn(void);
  extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
  extern void kdump_nmi_shootdown_cpus(void);
+void *arch_map_crash_pages(unsigned long paddr, unsigned long size);
+#define arch_map_crash_pages arch_map_crash_pages
+
+void arch_unmap_crash_pages(void **ptr);
+#define arch_unmap_crash_pages arch_unmap_crash_pages
+
+void arch_crash_handle_hotplug_event(struct kimage *image,
+    unsigned int hp_action);
+#define arch_crash_handle_hotplug_event 
arch_crash_handle_hotplug_event

+
+#ifdef CONFIG_HOTPLUG_CPU
+static inline int crash_hotplug_cpu_support(void) { return 1; }
+#define crash_hotplug_cpu_support crash_hotplug_cpu_support
+#endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static inline int crash_hotplug_memory_support(void) { return 1; }
+#define crash_hotplug_memory_support crash_hotplug_memory_support
+#endif
+
  #endif /* __ASSEMBLY__ */
  #endif /* _ASM_X86_KEXEC_H */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 9ceb93c176a6..8fc7d678ac72 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -25,6 +25,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
@@ -397,7 +398,18 @@ int crash_load_segments(struct kimage *image)
  image->elf_headers = kbuf.buffer;
  image->elf_headers_sz = kbuf.bufsz;
+#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG)
+    /* Ensure elfcorehdr segment large enough for hotplug changes */
+    kbuf.memsz =
+    (CONFIG_NR_CPUS_DEFAULT + CONFIG_CRASH_MAX_MEMORY_RANGES) *
+    sizeof(Elf64_Phdr);
+    /* Mark as usable to crash kernel, else crash kernel fails on 
boot */

+    image->elf_headers_sz = kbuf.memsz;
+    image->elfcorehdr_index = image->nr_segments;
+    image->elfcorehdr_index_valid = true;
+#else
  kbuf.memsz = kbuf.bufsz;
+#endif
  kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
  kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
  ret = kexec_add_buffer();
@@ -412,3 +424,93 @@ int crash_load_segments(struct kimage *image)
  return ret;
  }
  #endif /* CONFIG_KEXEC_FILE */
+
+#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG)
+/*
+ * NOTE: The addresses and sizes passed to this routine have
+ * already been fully aligned on page boundaries. There is no
+ * need for massaging the address or size.
+ */
+void *arch_map_crash_pages(unsigned long

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support





On 10/12/22 15:19, Eric DeVolder wrote:



On 10/12/22 12:46, Borislav Petkov wrote:

On Sat, Oct 08, 2022 at 10:35:14AM +0800, Baoquan He wrote:

Memory hptlug is not limited by a certin or a max number of memory
regions, but limited by how large of the linear mapping range which
physical can be mapped into.


Memory hotplug is not limited by some abstract range but by the *actual*
possibility of how many DIMM slots on any motherboard can hotplug
memory. Certainly not 32K.

So you can choose a sane default which covers *all* actual systems out
there.



We run here QEMU with the ability for 1024 DIMM slots. A DIMM can be any
reasonable power of 2 size, and then that DIMM is further divided into 
memblocks,
typically 128MiB.

So, for example, 1TiB requires 1024 DIMMs of 1GiB each with 128MiB memblocks, 
that results
in 8K possible memory regions. So just going to 4TiB reaches 32K memory regions.

This I can attest for virtualized DIMMs, not sure about other memory hotplug 
technologies
like virtio-mem or DynamicMemory. But it seems reasonable that those 
technologies could
also easily reach into these number ranges.

Eric


Oh, to be fair, if the above were fully populated, it would essentially 
coalescence
into a single reported region via crash_prepare_elf64_headers(). But in the 
sadistic
case, where every other memblock was offlined, that would result in the need to
report half of the memory regions via the elfcorehdr.

$0.02.
eric






For the Kconfig CRASH_MAX_MEMORY_RANGES Eric added, it's meaningful to
me to set a fixed value which is enough in reality.


Yes, exactly.


For extreme testing with special purpose, it could be broken easily,
people need decide by self whether the CONFIG_CRASH_MAX_MEMORY_RANGES
is enlarged or not.


I don't want for people to decide on one more thing where they have to
go and read a bunch of specs just to know what is a good value. So we
should set a sane, *practical* upper limit and simply go with it.

Everything else is testing stuff and if you test the kernel, then you
can change limits and values and so on as you want to.

Thx.



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

2022-10-12 Thread Borislav Petkov

On Wed, Oct 12, 2022 at 03:19:19PM -0500, Eric DeVolder wrote:
> We run here QEMU with the ability for 1024 DIMM slots.

QEMU, haha.

What is the highest count of DIMM slots which are hotpluggable on a
real, *physical* system today? Are you saying you can have 1K DIMM slots
on a board?

I hardly doubt that.

> So, for example, 1TiB requires 1024 DIMMs of 1GiB each with 128MiB
> memblocks, that results in 8K possible memory regions. So just going
> to 4TiB reaches 32K memory regions.

Lemme see if I understand this correctly: when a system like that
crashes, you want to kdump *all* those 4TiB in a vmcore? How long would
that dump take to complete? A day?

IOW, how does a realistic use case of this look like - not a QEMU one?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support





On 10/12/22 12:46, Borislav Petkov wrote:

On Sat, Oct 08, 2022 at 10:35:14AM +0800, Baoquan He wrote:

Memory hptlug is not limited by a certin or a max number of memory
regions, but limited by how large of the linear mapping range which
physical can be mapped into.


Memory hotplug is not limited by some abstract range but by the *actual*
possibility of how many DIMM slots on any motherboard can hotplug
memory. Certainly not 32K.

So you can choose a sane default which covers *all* actual systems out
there.



We run here QEMU with the ability for 1024 DIMM slots. A DIMM can be any
reasonable power of 2 size, and then that DIMM is further divided into 
memblocks,
typically 128MiB.

So, for example, 1TiB requires 1024 DIMMs of 1GiB each with 128MiB memblocks, 
that results
in 8K possible memory regions. So just going to 4TiB reaches 32K memory regions.

This I can attest for virtualized DIMMs, not sure about other memory hotplug 
technologies
like virtio-mem or DynamicMemory. But it seems reasonable that those 
technologies could
also easily reach into these number ranges.

Eric




For the Kconfig CRASH_MAX_MEMORY_RANGES Eric added, it's meaningful to
me to set a fixed value which is enough in reality.


Yes, exactly.


For extreme testing with special purpose, it could be broken easily,
people need decide by self whether the CONFIG_CRASH_MAX_MEMORY_RANGES
is enlarged or not.


I don't want for people to decide on one more thing where they have to
go and read a bunch of specs just to know what is a good value. So we
should set a sane, *practical* upper limit and simply go with it.

Everything else is testing stuff and if you test the kernel, then you
can change limits and values and so on as you want to.

Thx.



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

2022-10-12 Thread Borislav Petkov

On Sat, Oct 08, 2022 at 10:35:14AM +0800, Baoquan He wrote:
> Memory hptlug is not limited by a certin or a max number of memory
> regions, but limited by how large of the linear mapping range which
> physical can be mapped into.

Memory hotplug is not limited by some abstract range but by the *actual*
possibility of how many DIMM slots on any motherboard can hotplug
memory. Certainly not 32K.

So you can choose a sane default which covers *all* actual systems out
there.

> For the Kconfig CRASH_MAX_MEMORY_RANGES Eric added, it's meaningful to
> me to set a fixed value which is enough in reality.

Yes, exactly.

> For extreme testing with special purpose, it could be broken easily,
> people need decide by self whether the CONFIG_CRASH_MAX_MEMORY_RANGES
> is enlarged or not.

I don't want for people to decide on one more thing where they have to
go and read a bunch of specs just to know what is a good value. So we
should set a sane, *practical* upper limit and simply go with it.

Everything else is testing stuff and if you test the kernel, then you
can change limits and values and so on as you want to.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support




On 10/11/22 23:55, Sourabh Jain wrote:



If kmap_local_page() works for all archs, then I'm happy to drop these
arch_ variants and use it directly.


Yes, pls do.


I'll check with Sourabh to see if PPC can work with kmap_local_page().

I think kmap_local_page do support on  PowerPC. But can you explain why we need 
this
function here, aren't the reserve memory already available to use?


On x86, attempts to access the elfcorehdr without mapping it did not work 
(resulted
in a fault).

Let me know if using kmap_local_page() in place of __va() in 
arch_map_crash_pages().
If it does, then I can eliminate arch_un/map_crash_pages() and use 
kmap_local_page()
directly.

Hello Eric,

Atleast on ppc64 we have direct mapping available and hence just by doing page 
shift
on physical address (__va) we can get valid virtual address on powerpc. In 
short we don't
have to generate mapping again to access reserved region.

Regardless let's go with kdump_local_page API, it is supported on powerpc.

Thanks,
Sourabh Jain


Ok, I will go that route.
Thanks!
eric

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support