Re: [PATCH v2 1/5] arm64: kexec_file: Forbid non-crash kernels

2021-06-04 Thread James Morse
On 31/05/2021 10:57, Marc Zyngier wrote:
> It has been reported that kexec_file doesn't really work on arm64.
> It completely ignores any of the existing reservations, which results
> in the secondary kernel being loaded where the GICv3 LPI tables live,


> or even corrupting the ACPI tables.

I'd like to know how the ACPI tables bit happens.

ACPI tables should be in EFI_ACPI_RECLAIM_MEMORY or EFI_ACPI_MEMORY_NVS (which isn't treated as usable).

EFI's reserve_regions() does this:
|   if (!is_usable_memory(md))
|   memblock_mark_nomap(paddr, size);
|
|   /* keep ACPI reclaim memory intact for kexec etc. */
|   if (md->type == EFI_ACPI_RECLAIM_MEMORY)
|   memblock_reserve(paddr, size);

which is called via efi_init(), and all those regions end up listed as reserved in /proc/iomem. (This is why arm64 doesn't call acpi_reserve_initial_tables().)

If your firmware puts ACPI tables in EFI_CONVENTIONAL_MEMORY, you have bigger problems, as the kernel could get relocated over the top of them during boot, and even if it doesn't, nothing stops that memory being allocated for user-space.

Even acpi_table_upgrade() calls memblock_reserve(), and happens early enough not to be a problem.


Please share ... enjoyment, optional.

(boot with efi=debug and post the EFI memory map and the 'ACPI: FOO 0xphysicaladdress' stuff at the top of the boot log)


Thanks,

James


> Since only crash kernels are imune to this as they use a reserved
> memory region, disable the non-crash kernel use case. Further
> patches will try and restore the functionality.



Re: [PATCH v2 0/5] arm64: Make kexec_file_load honor iomem reservations

2021-06-04 Thread James Morse
Hi Marc,

On 31/05/2021 10:57, Marc Zyngier wrote:
> This series is a complete departure from the approach I initially sent
> almost a month ago[0]. Instead of trying to teach EFI, ACPI and other
> subsystem to use memblock, I've decided to stick with the iomem
> resource tree and use that exclusively for arm64.

> This means that my current approach is (despite what I initially
> replied to both Dave and Catalin) to provide an arm64-specific
> implementation of arch_kexec_locate_mem_hole() which walks the
> resource tree and excludes ranges of RAM that have been registered for
> any odd purpose. This is exactly what the userspace implementation
> does, and I don't really see a good reason to diverge from it.

Because in the ideal world we'd have only one 'is it reserved?' list to check against.
Memblock has been extended before. The resource-list is overly stringy, and I'm not sure we can shove everything in the resource list.

Kexec already has problems on arm64 with memory hotplug. Fixing this for regular kexec in /proc/iomem was rejected, and memblock's memblock_is_hotpluggable() is broken because free_low_memory_core_early() does this:
|   memblock_clear_hotplug(0, -1)

Once that has been unpicked, it's clear kexec_file_load() can use memblock_is_hotpluggable(). (It's on the todo list, well, jira.)
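
A hedged, untested sketch of what that check could look like once the flags survive boot; the helper name and the overlap test are invented for illustration:

| #include <linux/memblock.h>
| #include <linux/pfn.h>
|
| /* True if [start, end) overlaps any hotpluggable memblock region. */
| static bool kexec_range_overlaps_hotplug(phys_addr_t start, phys_addr_t end)
| {
|         struct memblock_region *r;
|
|         for_each_mem_region(r) {
|                 if (!memblock_is_hotpluggable(r))
|                         continue;
|                 if (start < PFN_PHYS(memblock_region_memory_end_pfn(r)) &&
|                     end > PFN_PHYS(memblock_region_memory_base_pfn(r)))
|                         return true;
|         }
|
|         return false;
| }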


I'd prefer to keep kexec using memblock because it _shouldn't_ change after boot. Having an "I want to reserve this and make it persistent over kexec" call that can happen at any time can't work if the kexec image has already been loaded.
Practically, once user-space has started, you can't have new things you want to reserve over kexec.


I don't see how the ACPI tables can escape, short of a firmware bug. Could someone with an affected platform boot with efi=debug and post the EFI memory map and the 'ACPI: FOO 0xphysicaladdress' stuff at the top of the boot log?


efi_mem_reserve_persistent() has one caller for the GIC ITS stuff.

For the ITS, the reservations look like they are behind irqchip_init(), which is well before the arch_initcall() that updates the resource tree from memblock. Your v1's first patch should be sufficient.


> Again, this allows my Synquacer board to reliably use kexec_file_load
> with as little as 256M, something that would always fail before as it
> would overwrite most of the reserved tables.
> 
> Although this series still targets 5.14, the initial patch is a
> -stable candidate, and disables non-kdump uses of kexec_file_load. I
> have limited it to 5.10, as earlier kernels will require a different,
> probably more invasive approach.
> 
> Catalin, Ard: although this series has changed a bit compared to v1,
> I've kept your AB/RB tags. Should anything seem odd, please let me
> know and I'll drop them.


Thanks,

James


[0] I'm pretty sure this is enough. (Not tested)
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 4b7ee3fa9224..3ed45153ce7f 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -893,7 +893,7 @@ static int __init efi_memreserve_map_root(void)
return 0;
 }

-static int efi_mem_reserve_iomem(phys_addr_t addr, u64 size)
+static int __efi_mem_reserve_iomem(phys_addr_t addr, u64 size)
 {
struct resource *res, *parent;

@@ -911,6 +911,16 @@ static int efi_mem_reserve_iomem(phys_addr_t addr, u64 size)
return parent ? request_resource(parent, res) : 0;
 }

+static int efi_mem_reserve_iomem(phys_addr_t addr, u64 size)
+{
+   int err = __efi_mem_reserve_iomem(addr, size);
+
+   if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK) && !err)
+   memblock_reserve(addr, size);
+
+   return err;
+}
+
 int __ref efi_mem_reserve_persistent(phys_addr_t addr, u64 size)
 {
struct linux_efi_memreserve *rsv;



Re: [PATCH 0/2] arm64: kexec_file_load vs memory reservations

2021-06-02 Thread James Morse
Hi Marc,

On 02/06/2021 16:59, Marc Zyngier wrote:
> On Wed, 02 Jun 2021 15:22:00 +0100,
> James Morse  wrote:
>> On 29/04/2021 14:35, Marc Zyngier wrote:
>>> It recently became apparent that using kexec with kexec_file_load() on
>>> arm64 is pretty similar to playing Russian roulette.
>>>
>>> Depending on the amount of memory, the HW supported and the firmware
>>> interface used, your secondary kernel may overwrite critical memory
>>> regions without which the secondary kernel cannot boot (the GICv3 LPI
>>> tables being a prime example of such reserved regions).
>>>
>>> It turns out that there is at least two ways for reserved memory
>>> regions to be described to kexec: /proc/iomem for the userspace
>>> implementation, and memblock.reserved for kexec_file. 
>>
>> One is spilled into the other by request_standard_resources()...
>>
>>
>>> And of course,
>>> our LPI tables are only reserved using the resource tree, leading to
>>> the aforementioned stamping.
>>
>> Presumably well after efi_init() has run...
> 
> Yup, much later. And we can keep on reserving memory as long as we
> boot new CPUs. Having it as a one-off sync doesn't really help here.

It might need doing for all possible CPUs up-front... otherwise someone loads a kexec kernel and correctly picks a safe location ... then a CPU comes online and reserves a hole in the middle of it: kexec isn't using the selected location until you reboot().
(memory hotplug has some 'fun' in this area, which can only be fixed by using memblock, which ought to know about removable memory ranges ... but doesn't)

There does need to be a point where the list of reserved memory stops changing.


>>> Similar things could happen with ACPI tables as well.
>>
>> efi_init() calls reserve_regions(), which has:
>> |/* keep ACPI reclaim memory intact for kexec etc. */
>> |if (md->type == EFI_ACPI_RECLAIM_MEMORY)
>> |memblock_reserve(paddr, size);
>>
>> This is also what stops mm from allocating them, as
>> memblock-reserved gets copied into the PG_Reserved flag by
>> free_low_memory_core_early()'s calls to reserve_bootmem_region().
>>
>> Is your machines firmware putting them in a region with a different type?

> Good question. Moritz (cc'd) saw the tables being overwritten on his
> system (which I don't have access to), so I guess this is not entirely
> clear cut how this happens.

If we have systems that store the tables in 'conventional memory' we have bigger problems!


> My SQ box reports the ACPI region as "ACPI Reclaim", so I guess it
> works as expected here.
> 
>> (The UEFI spec has something to say: see 2.3.6 "AArch64 Platforms":
>> | ACPI Tables loaded at boot time can be contained in memory of type EfiACPIReclaimMemory
>> | (recommended) or EfiACPIMemoryNVS
>>
>> NVS would fail the is_usable_memory() check earlier, so gets treated
>> as nomap)

> Note that I've since changed tactics and proposed that we fully rely
> on the resource tree instead[1].

Yup - I came back here to work out why you gave up on memblock-reserving the reserved regions...



Thanks,

James



Re: [PATCH 0/2] arm64: kexec_file_load vs memory reservations

2021-06-02 Thread James Morse
Hi Marc,

On 29/04/2021 14:35, Marc Zyngier wrote:
> It recently became apparent that using kexec with kexec_file_load() on
> arm64 is pretty similar to playing Russian roulette.
> 
> Depending on the amount of memory, the HW supported and the firmware
> interface used, your secondary kernel may overwrite critical memory
> regions without which the secondary kernel cannot boot (the GICv3 LPI
> tables being a prime example of such reserved regions).
> 
> It turns out that there is at least two ways for reserved memory
> regions to be described to kexec: /proc/iomem for the userspace
> implementation, and memblock.reserved for kexec_file. 

One is spilled into the other by request_standard_resources()...


> And of course,
> our LPI tables are only reserved using the resource tree, leading to
> the aforementioned stamping.

Presumably well after efi_init() has run...

> Similar things could happen with ACPI tables as well.

efi_init() calls reserve_regions(), which has:
|   /* keep ACPI reclaim memory intact for kexec etc. */
|   if (md->type == EFI_ACPI_RECLAIM_MEMORY)
|   memblock_reserve(paddr, size);

This is also what stops mm from allocating them, as memblock-reserved gets copied into the PG_Reserved flag by free_low_memory_core_early()'s calls to reserve_bootmem_region().

Is your machine's firmware putting them in a region with a different type?

(The UEFI spec has something to say: see 2.3.6 "AArch64 Platforms":
| ACPI Tables loaded at boot time can be contained in memory of type EfiACPIReclaimMemory
| (recommended) or EfiACPIMemoryNVS

NVS would fail the is_usable_memory() check earlier, so gets treated as nomap)


Thanks,

James

> On my 24xA53 system artificially limited to 256MB of RAM (yes, it
> boots with that little memory), trying to kexec a secondary kernel
> failed every times. I can only presume that this was mostly tested
> using kdump, which preserves the entire kernel memory range.
> 
> This small series aims at triggering a discussion on what are the
> expectations for kexec_file, and whether we should unify the two
> reservation mechanisms.
> 
> And in the meantime, it gets things going...
> 
> Marc Zyngier (2):
>   firmware/efi: Tell memblock about EFI reservations
>   ACPI: arm64: Reserve the ACPI tables in memblock
> 
>  arch/arm64/kernel/acpi.c   |  1 +
>  drivers/firmware/efi/efi.c | 23 ++-
>  2 files changed, 23 insertions(+), 1 deletion(-)
> 




Re: [PATCH v11 0/6] arm64: MMU enabled kexec relocation

2021-02-01 Thread James Morse
Hi Pavel,

On 27/01/2021 17:27, Pavel Tatashin wrote:
> Enable MMU during kexec relocation in order to improve reboot performance.
> 
> If kexec functionality is used for a fast system update, with a minimal
> downtime, the relocation of kernel + initramfs takes a significant portion
> of reboot.
> 
> The reason for slow relocation is because it is done without MMU, and thus
> not benefiting from D-Cache.
> 
> Performance data
> 
> For this experiment, the size of kernel plus initramfs is small, only 25M.
> If initramfs was larger, than the improvements would be greater, as time
> spent in relocation is proportional to the size of relocation.
> 
> Previously:
> kernel shutdown   0.022131328s
> relocation0.440510736s
> kernel startup0.294706768s
> 
> Relocation was taking: 58.2% of reboot time
> 
> Now:
> kernel shutdown   0.032066576s
> relocation0.022158152s
> kernel startup0.296055880s
> 
> Now: Relocation takes 6.3% of reboot time
> 
> Total reboot is x2.16 times faster.
> 
> With bigger userland (fitImage 380M), the reboot time is improved by 3.57s,
> and is reduced from 3.9s down to 0.33s

> Previous approaches and discussions
> ---

The problem I see with this is rewriting the relocation code. It needs to work whether the machine has enough memory to enable the MMU during kexec, or not.

In off-list mail to Pavel I proposed an alternative implementation here:
https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec+mmu/v0

By using a copy of the linear map, and passing the phys_to_virt offset into arm64_relocate_new_kernel(), it's possible to use the same code when we fail to allocate the page tables, and run with the MMU off as it does today.
I'm convinced someone will crawl out of the woodwork screaming 'regression' if we substantially increase the amount of memory needed to kexec at all.

From that discussion: this didn't meet Pavel's timing needs.
If you depend on having all the src/dst pages lined up in a single line, it sounds like you've over-tuned this to depend on the CPU's streaming mode. What causes the CPU to start/stop that stuff is very implementation specific (and firmware configurable).
I don't think we should let this rule out systems that can kexec today, but don't have enough extra memory for the page tables.
Having two copies of the relocation code is obviously a bad idea.


(As before:) Instead of trying to make the relocations run quickly, can we reduce them? This would benefit other architectures too.

Can the kexec core code allocate higher order pages, instead of doing everything a page at a time?

If you have a crash kernel reservation, can we use that to eliminate the relocations completely?
(I think this suggestion has been lost in translation each time I make it.
I mean like this:
https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec/kexec_in_crashk/v0
Runes to test it:
| sudo ./kexec -p -u
| sudo cat /proc/iomem | grep Crash
|  b020-f01f : Crash kernel
| sudo ./kexec --mem-min=0xb020 --mem-max=0xf01ff -l ~/Image --reuse-cmdline

I bet it's even faster!)


I think 'as fast as possible' and 'memory constrained' are mutually exclusive requirements. We need to make the page tables optional with a single implementation.


Thanks,

James



Re: [PATCH 1/2] kexec: Add quick kexec support for kernel

2020-08-17 Thread James Morse
Hi guys,

On 14/08/2020 20:22, Pavel Tatashin wrote:
> On Fri, Aug 14, 2020 at 7:24 AM Dave Young  wrote:
>> On 08/14/20 at 04:21pm, Sang Yan wrote:
>>> On 08/14/20 14:58, Dave Young wrote:
>>>> On 08/14/20 at 01:52am, Sang Yan wrote:
>>> Yes, it's particularly obvious on arm64. I will add it to the patch log,
>>> and test how long it takes on x86 and other arch.

Earlier versions of kexec-tools had the in-purgatory checksum enabled unconditionally. More recent versions let you disable it; I think the parameter is called no-checks. This saves some time, but the relocations still have to be done.


>>>> About the arm64 problem, I know Pavel Tatashin is working on a patchset
>>>> to improve the performance with enabling mmu.
>>>>
>>>> I added Pavel in cc, can you try his patches?
>>>>
>>> Thanks for your tips, I will try these patches. @Pavel.
>>> Disable mmu after finishing copying pages?

>>>>> We introduce quick kexec to save time of copying memory as above,
>>>>> just like kdump(kexec on crash), by using reserved memory
>>>>> "Quick Kexec".
>>>>
>>>> This approach may have gain, but it also introduce extra requirements to
>>>> pre-reserve a memory region.  I wonder how Eric thinks about the idea.
>>>>
>>>> Anyway the "quick" name sounds not very good, I would suggest do not
>>>> introduce a new param, and the code can check if pre-reserved region
>>>> exist then use it, if not then fallback to old way.
>>>>
>>> aha. I agree with it, but I thought it may change the old behaviors of
>>> kexec_load.
>>>
>>> I will update a new patch without introducing new flags and new params.
>>
>> Frankly I'm still not sure it is worth to introduce a new interface if the
>> improvement can be done in arch code like Pavel is doing.  Can you try
>> that first?

> My patches will fix this issue. This is an ARM64 specific problem and
> I did not see this to be performance problem on x86 during kexec
> relocation. This happens because on ARM64 relocation is performed with
> MMU disabled, and when MMU is disabled the caching is disabled as
> well.

> I have a patch series that fixes this entirely, but James Morse
> (+CCed) and I still have not agreed on the final approach. We had an
> off-list conversation about it, and we need to continue it in public
> ML.
> 
> Here is some history:
> 
> This is the original series that I sent a year ago. It basically
> proposes the same thing as this series from Sang Yan:
> https://lore.kernel.org/lkml/20190709182014.16052-1-pasha.tatas...@soleen.com/
> 
> Once, I realized that with enabling MMU the relocation is issue is
> gone completely, I sent a new series, and this is the latest version
> of that series:
> https://lore.kernel.org/lkml/20200326032420.27220-1-pasha.tatas...@soleen.com/
> 
> It has been tested in production, and several people from different
> companies commented to me that they are using it as well.
> 
> After my patch series was sent out, James created a new branch in his
> tree with his approach of enabling MMU without having a new VA space,
> but instead re-use what the kernel  has now. I have not tested that
> branch yet.

For context, that is here:
http://www.linux-arm.org/git?p=linux-jm.git;a=shortlog;h=refs/heads/kexec%2Bmmu/v0

I think we can maintain this approach, but it doesn't work for Pavel, as he has extra requirements. I stopped looking at it because it became a solution no-one needed.


> Here are some comments from James Morse and the off-list discussion we had:
> ---
> It sounds like you are depending on write streaming mode to meet your
> target performance.
> This isn't even CPU specific, its cache and firmware configuration specific!
> I don't think we should optimise a general purpose operating system
> based on things like this.
> ..
> I think the best approach is going to be to eliminate the relocations entirely.
> ...
> I'm afraid I view this kexec-map thing as high-risk duct-tape over the
> kexec core code
> deliberately scattering the kexec payload.
> I'd prefer any approach that causes the payload to be stored in-place
> from the beginning
> as that benefits other architectures too.
> ---

The 'eliminate relocations' comment goes with some of the context you removed.


> It appears James is leaning to the approach of not performing
> relocation at all and use what is proposed by Sang Yan and me during
> my first approach for this problem.

The background to that is Pavel's timing requirements: Enabling the MMU isn't enough, from his description...

Re: [PATCH v6 2/2] arm64/crash_core: Export TCR_EL1.T1SZ in vmcoreinfo

2020-07-01 Thread James Morse
Hi Bhupesh,

On 13/05/2020 19:52, Bhupesh Sharma wrote:
> vabits_actual variable on arm64 indicates the actual VA space size,
> and allows a single binary to support both 48-bit and 52-bit VA
> spaces.

I'd prefer the commit message not to refer to this 'vabits_actual' thing at all. By the time a git-archaeologist comes to read this, it may be long gone.

Ideally this would refer to: TCR_EL1.TxSZ, which controls the VA space size, and can be configured by a single kernel image to support either 48-bit or 52-bit VA space.


> If the ARMv8.2-LVA optional feature is present, and we are running
> with a 64KB page size; then it is possible to use 52-bits of address
> space for both userspace and kernel addresses. However, any kernel
> binary that supports 52-bit must also be able to fall back to 48-bit
> at early boot time if the hardware feature is not present.
> 
> Since TCR_EL1.T1SZ indicates the size offset of the memory region
> addressed by TTBR1_EL1 (and hence can be used for determining the
> vabits_actual value) it makes more sense to export the same in
> vmcoreinfo rather than vabits_actual variable, as the name of the
> variable can change in future kernel versions, but the architectural
> constructs like TCR_EL1.T1SZ can be used better to indicate intended
> specific fields to user-space.
> 
> User-space utilities like makedumpfile and crash-utility, need to
> read this value from vmcoreinfo for determining if a virtual
> address lies in the linear map range.
> 
> While at it also add documentation for TCR_EL1.T1SZ variable being
> added to vmcoreinfo.
> 
> It indicates the size offset of the memory region addressed by TTBR1_EL1

> diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
> index 1f646b07e3e9..314391a156ee 100644
> --- a/arch/arm64/kernel/crash_core.c
> +++ b/arch/arm64/kernel/crash_core.c
> @@ -7,6 +7,14 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +
> +static inline u64 get_tcr_el1_t1sz(void);
> +
> +static inline u64 get_tcr_el1_t1sz(void)
> +{
> + return (read_sysreg(tcr_el1) & TCR_T1SZ_MASK) >> TCR_T1SZ_OFFSET;
> +}
>  
>  void arch_crash_save_vmcoreinfo(void)
>  {
> @@ -16,6 +24,8 @@ void arch_crash_save_vmcoreinfo(void)
>   kimage_voffset);
>   vmcoreinfo_append_str("NUMBER(PHYS_OFFSET)=0x%llx\n",
>   PHYS_OFFSET);
> + vmcoreinfo_append_str("NUMBER(TCR_EL1_T1SZ)=0x%llx\n",
> + get_tcr_el1_t1sz());
>   vmcoreinfo_append_str("KERNELOFFSET=%lx\n", kaslr_offset());
>   vmcoreinfo_append_str("NUMBER(KERNELPACMASK)=0x%llx\n",
>       system_supports_address_auth() ?

(I think second-guessing the kernel memory map is a Sisyphean effort), but this register isn't going to disappear, or change its meaning!:

Reviewed-by: James Morse 

You may need to re-post this to get the maintainer's attention, as it's normally safe to assume patches posted before rc1 no longer apply. (In this case, this one does.)


Thanks,

James



Re: [PATCH v8 5/5] dt-bindings: chosen: Document linux,low-memory-range for arm64 kdump

2020-05-29 Thread James Morse
Hi guys,

On 26/05/2020 22:18, Rob Herring wrote:
> On Fri, May 22, 2020 at 11:24:11AM +0800, chenzhou wrote:
>> On 2020/5/21 21:29, Rob Herring wrote:
>>> On Thu, May 21, 2020 at 3:35 AM Chen Zhou  wrote:
>>>> Add documentation for DT property used by arm64 kdump:
>>>> linux,low-memory-range.
>>>> "linux,low-memory-range" is an another memory region used for crash
>>>> dump kernel devices.
>>>>
>>>> diff --git a/Documentation/devicetree/bindings/chosen.txt b/Documentation/devicetree/bindings/chosen.txt
>>>> index 45e79172a646..bfe6fb6976e6 100644
>>>> --- a/Documentation/devicetree/bindings/chosen.txt
>>>> +++ b/Documentation/devicetree/bindings/chosen.txt
>>>>
>>>> +linux,low-memory-range
>>>> +--
>>>> +This property (arm64 only) holds a base address and size, describing a
>>>> +limited region below 4G. Similar to "linux,usable-memory-range", it is
>>>> +an another memory range which may be considered available for use by the
>>>> +kernel.

>>> Why can't you just add a range to "linux,usable-memory-range"? It
>>> shouldn't be hard to figure out which part is below 4G.

>> The comments from James:
>> Won't this break if your kdump kernel doesn't know what the extra parameters 
>> are?
>> Or if it expects two ranges, but only gets one? These DT properties should 
>> be treated as
>> ABI between kernel versions, we can't really change it like this.
>>
>> I think the 'low' region is an optional-extra, that is never mapped by the 
>> first kernel. I
>> think the simplest thing to do is to add an 'linux,low-memory-range' that we
>> memblock_add() after memblock_cap_memory_range() has been called.
>> If its missing, or the new kernel doesn't know what its for, everything 
>> keeps working.
> 
> 
> I don't think there's a compatibility issue here though. The current 
> kernel doesn't care if the property is longer than 1 base+size. It only 
> checks if the size is less than 1 base+size.

Aha! I missed that.


> And yes, we can rely on 
> that implementation detail. It's only an ABI if an existing user 
> notices.
> 
> Now, if the low memory is listed first, then an older kdump kernel 
> would get a different memory range. If that's a problem, then define 
> that low memory goes last. 

This first entry would need to be the 'crashkernel' range where the kdump kernel is placed, otherwise an older kernel won't boot. The rest can be optional extras, as long as we are tolerant of it being missing...

I'll try and look at the rest of this series on Monday,


Thanks,

James



Re: [PATCH v9 14/18] arm64: kexec: offset for relocation function

2020-05-07 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> Soon, relocation function will share the same page with EL2 vectors.

The EL2 vectors would only be executed with the MMU off, so they don't need to be mapped anywhere in particular. (This is something hibernate probably does sloppily.)


> Add offset within this page to arm64_relocate_new_kernel, and also
> the total size of relocation code which will include both the function
> and the EL2 vectors.

See arch/arm64/kernel/vmlinux.lds.S and sections.h for an example of how the idmap, hibernate and the non-KVM hyp code do this.

If we're going to change this, I'd prefer it be the same as the other users...


> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
> index 990185744148..d944c2e289b2 100644
> --- a/arch/arm64/include/asm/kexec.h
> +++ b/arch/arm64/include/asm/kexec.h
> @@ -90,6 +90,13 @@ static inline void crash_prepare_suspend(void) {}
>  static inline void crash_post_resume(void) {}
>  #endif
>  
> +#if defined(CONFIG_KEXEC_CORE)
> +/* The beginning and size of relcation code to stage 2 kernel */
> +extern const unsigned long kexec_relocate_code_size;
> +extern const unsigned char kexec_relocate_code_start[];
> +extern const unsigned long kexec_kern_reloc_offset;
> +#endif

This makes these things visible globally ... but drops the arm64_ prefix!


> diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
> index 22ccdcb106d3..aa9f2b2cd77c 100644
> --- a/arch/arm64/kernel/relocate_kernel.S
> +++ b/arch/arm64/kernel/relocate_kernel.S
> @@ -14,6 +14,9 @@
>  #include 
>  #include 
>  
> +.globl kexec_relocate_code_start
> +kexec_relocate_code_start:

Which of the fancy new asm-annotations should this be?



> @@ -86,13 +89,16 @@ ENTRY(arm64_relocate_new_kernel)
>  .ltorg
>  END(arm64_relocate_new_kernel)
>  
> -.Lcopy_end:
> +.Lkexec_relocate_code_end:
>  .org KEXEC_CONTROL_PAGE_SIZE
>  .align 3 /* To keep the 64-bit values below naturally aligned. */

>  /*
> - * arm64_relocate_new_kernel_size - Number of bytes to copy to the
> + * kexec_relocate_code_size - Number of bytes to copy to the
>   * control_code_page.
>   */
> -.globl arm64_relocate_new_kernel_size
> -arm64_relocate_new_kernel_size:
> - .quad   .Lcopy_end - arm64_relocate_new_kernel
> +.globl kexec_relocate_code_size
> +kexec_relocate_code_size:
> + .quad   .Lkexec_relocate_code_end - kexec_relocate_code_start
> +.globl kexec_kern_reloc_offset
> +kexec_kern_reloc_offset:
> + .quad   arm64_relocate_new_kernel - kexec_relocate_code_start
> 

Can't we calculate this from the start/end markers? These variables held in assembly-generated code are pretty manky.
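
A minimal sketch of that suggestion, computing both values in C from the section markers instead; 'kexec_relocate_code_end' is a hypothetical global marker mirroring the local .Lkexec_relocate_code_end label, and the char-array declarations are the usual linker-symbol idiom:

| /* Markers placed around the relocation code in relocate_kernel.S */
| extern const char kexec_relocate_code_start[];
| extern const char kexec_relocate_code_end[];
| extern const char arm64_relocate_new_kernel[];
|
| #define KEXEC_RELOCATE_CODE_SIZE \
|         ((unsigned long)(kexec_relocate_code_end - kexec_relocate_code_start))
| #define KEXEC_KERN_RELOC_OFFSET \
|         ((unsigned long)(arm64_relocate_new_kernel - kexec_relocate_code_start))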


Thanks,

James



Re: [PATCH v9 16/18] arm64: kexec: configure trans_pgd page table for kexec

2020-05-07 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> Configure a page table located in kexec-safe memory that has
> the following mappings:
> 
> 1. identity mapping for text of relocation function with executable
>permission.
> 2. linear mappings for all source ranges
> 3. linear mappings for all destination ranges.

It's taken this long to work out that your definition of 'linear' here doesn't match the way the rest of the arch code uses the term.

You are using the MMU to re-assemble the scattered kexec image in VA space, so that the relocation code doesn't have to walk the list.

While it's a cool trick, I don't think this is a good idea: it makes it much harder to debug, as we have a new definition for VA->PA instead of re-using the kernel's. We should do the least surprising thing. The person debugging a problem's first assumptions should be correct. Doing this means any debug information printed before kexec() is suddenly useless for debugging a problem that occurs during relocation.

...

Let me hack together what I've been describing and we can discuss whether it's simpler.
(Most of next week is gone already though...)

(some Nits below)

> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
> index 0f758fd51518..8f4332ac607a 100644
> --- a/arch/arm64/include/asm/kexec.h
> +++ b/arch/arm64/include/asm/kexec.h
> @@ -108,6 +108,12 @@ extern const unsigned long kexec_el2_vectors_offset;
>   * el2_vectorIf present means that relocation routine will go to EL1
>   *   from EL2 to do the copy, and then back to EL2 to do the jump
>   *   to new world.
> + * trans_ttbr0   idmap for relocation function and its argument
> + * trans_ttbr1   linear map for source/destination addresses.
> + * trans_t0szt0sz for idmap page in trans_ttbr0

You should be able to load TTBR0_EL1 (and the corresponding TCR_EL1.T0SZ) before kicking off the relocation code. There should be no need to pass it in to assembly.

For example, hibernate sets TTBR0_EL1 in create_safe_exec_page().
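
A minimal sketch of the same sequence for kexec, mirroring what hibernate does; 'trans_ttbr0' and 't0sz' are assumed to come from trans_pgd_idmap_page(), and this is untested:

| cpu_set_reserved_ttbr0();     /* park TTBR0 on the reserved (zero) page */
| local_flush_tlb_all();
| __cpu_set_tcr_t0sz(t0sz);     /* widen TTBR0's VA range if required */
| write_sysreg(phys_to_ttbr(trans_ttbr0), ttbr0_el1);
| isb();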


> + * src_addr  linear map for source pages.
> + * dst_addr  linear map for destination pages.
> + * copy_len  Number of bytes that need to be copied
>   */
>  struct kern_reloc_arg {
>   phys_addr_t head;

> @@ -70,10 +71,90 @@ static void *kexec_page_alloc(void *arg)
>   return page_address(page);
>  }
>  
> +/*
> + * Map source segments starting from src_va, and map destination
> + * segments starting from dst_va, and return size of copy in
> + * *copy_len argument.
> + * Relocation function essentially needs to do:
> + * memcpy(dst_va, src_va, copy_len);
> + */
> +static int map_segments(struct kimage *kimage, pgd_t *pgdp,
> + struct trans_pgd_info *info,
> + unsigned long src_va,
> + unsigned long dst_va,
> + unsigned long *copy_len)
> +{
> + unsigned long *ptr = 0;
> + unsigned long dest = 0;
> + unsigned long len = 0;
> + unsigned long entry, addr;
> + int rc;
> +
> + for (entry = kimage->head; !(entry & IND_DONE); entry = *ptr++) {
> + addr = entry & PAGE_MASK;
> +
> + switch (entry & IND_FLAGS) {
> + case IND_DESTINATION:
> + dest = addr;
> + break;

So we hope to always find a destination first?


> + case IND_INDIRECTION:
> + ptr = __va(addr);
> + if (rc)
> + return rc;

Where does rc come from?

> + break;

> + case IND_SOURCE:
> + rc = trans_pgd_map_page(info, pgdp, __va(addr),
> + src_va, PAGE_KERNEL);
> + if (rc)
> + return rc;
> + rc = trans_pgd_map_page(info, pgdp, __va(dest),
> + dst_va, PAGE_KERNEL);
> + if (rc)
> + return rc;
> + dest += PAGE_SIZE;
> + src_va += PAGE_SIZE;
> + dst_va += PAGE_SIZE;
> + len += PAGE_SIZE;
> + }
> + }
> + *copy_len = len;
> +
> + return 0;
> +}
> +
> @@ -89,9 +170,18 @@ int machine_kexec_post_load(struct kimage *kimage)
>   kern_reloc_arg->el2_vector = __pa(reloc_code)
>   + kexec_el2_vectors_offset;
>   }
> +
> + /*
> +  * If relocation is not needed, we do not need to enable MMU in

Strictly you aren't enabling it, but disabling it _after_ the relocation.


> +  * relocation routine, therefore do not create page tables for
> +  * scenarios such as crash kernel
> +  */
> + if (!(kimage->head & IND_DONE))
> + rc = mmu_relocate_setup(kimage, reloc_code, kern_reloc_arg);
> +
>   kexec_image_info(kimage);
>  
> - 

Re: [PATCH v9 12/18] arm64: kexec: arm64_relocate_new_kernel don't use x0 as temp

2020-05-07 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> x0 will contain the only argument to arm64_relocate_new_kernel; don't
> use it as a temp. Reassigned registers to free-up x0.

The missing bit of motivation is _why_ you need x0 to keep its value until the end of this code.

With that covered,
Reviewed-by: James Morse 


Thanks,

James



Re: [PATCH v9 13/18] arm64: kexec: add expandable argument to relocation function

2020-05-07 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> Currently, kexec relocation function (arm64_relocate_new_kernel) accepts
> the following arguments:
> 
> head: start of array that contains relocation information.
> entry:entry point for new kernel or purgatory.
> dtb_mem:  first and only argument to entry.

> The number of arguments cannot be easily expended, because this
> function is also called from HVC_SOFT_RESTART, which preserves only
> three arguments. And, also arm64_relocate_new_kernel is written in
> assembly but called without stack, thus no place to move extra
> arguments to free registers.
> 
> Soon, we will need to pass more arguments: once we enable MMU we
> will need to pass information about page tables.


> Another benefit of allowing this function to accept more arguments, is that
> kernel can actually accept up to 4 arguments (x0-x3), however currently
> only one is used, but if in the future we will need for more (for example,
> pass information about when previous kernel exited to have a precise
> measurement in time spent in purgatory), we won't be easilty do that
> if arm64_relocate_new_kernel can't accept more arguments.

This is a niche debug hack.
We really don't want an ABI with purgatory. I think the register values it gets were added early for compatibility with kexec_file_load().


> So, add a new struct: kern_reloc_arg, and place it in kexec safe page (i.e
> memory that is not overwritten during relocation).
> Thus, make arm64_relocate_new_kernel to only take one argument, that
> contains all the needed information.

Do we really not have enough registers?

The PCS[0] gives you 8 arguments. In this patch you use 6.


If this is really about the hyp-stub abi, please state that.


> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> index cee3be586384..b1122eea627e 100644
> --- a/arch/arm64/kernel/machine_kexec.c
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -59,13 +60,35 @@ void machine_kexec_cleanup(struct kimage *kimage)

>  int machine_kexec_post_load(struct kimage *kimage)
>  {
>   void *reloc_code = page_to_virt(kimage->control_code_page);
> + struct kern_reloc_arg *kern_reloc_arg = kexec_page_alloc(kimage);
> +
> + if (!kern_reloc_arg)
> + return -ENOMEM;
>  
>   memcpy(reloc_code, arm64_relocate_new_kernel,
>  arm64_relocate_new_kernel_size);
>   kimage->arch.kern_reloc = __pa(reloc_code);
> + kimage->arch.kern_reloc_arg = __pa(kern_reloc_arg);
> + kern_reloc_arg->head = kimage->head;
> + kern_reloc_arg->entry_addr = kimage->start;
> + kern_reloc_arg->kern_arg0 = kimage->arch.dtb_mem;

These kern_reloc_arg values are written via the cacheable linear map. They are read in arm64_relocate_new_kernel(), where the MMU is disabled and all memory accesses are non-cacheable.

To ensure you read the values you wrote, you must clean kern_reloc_arg to the PoC.
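
A minimal sketch, using the helper arm64 had at the time; one call at the end of machine_kexec_post_load() once the structure has been filled in:

| /* Clean the whole argument structure to the Point of Coherency */
| __flush_dcache_area(kern_reloc_arg, sizeof(struct kern_reloc_arg));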


>   kexec_image_info(kimage);
>  
>   return 0;

Thanks,

James

[0]
https://developer.arm.com/docs/ihi0055/d/procedure-call-standard-for-the-arm-64-bit-architecture



Re: [PATCH v9 11/18] arm64: kexec: arm64_relocate_new_kernel clean-ups

2020-05-07 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> Remove excessive empty lines from arm64_relocate_new_kernel.

To make it harder to read? Or just for the churn ...

> Also, use comments on the same lines with instructions where
> appropriate.

Churn,


> Change ENDPROC to END it never returns.

It might be more useful to convert this to the new-style annotations, which should be a separate patch. See Documentation/asm-annotations.rst.


> copy_page(dest, src, tmps...)
> Increments dest and src by PAGE_SIZE, so no need to store dest
> prior to calling copy_page and increment it after. Also, src is not
> used after a copy, not need to copy either.

This bit sounds like cleanup, but I can't isolate it from the noise below


> Call raw_dcache_line_size()  only when relocation is actually going to
> happen.

Why?

The pattern in this code is to set up registers that don't change at the top, then do all the work. I think this was an attempt to make it more readable.

Nothing branches back to that label, so this is fine, it's just less obviously correct.


> Since '.align 3' is intended to align globals at the end of the file,
> move it there.


Please don't make noisy changes to whitespace and comments, it's never worth it.


Thanks,

James


> diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
> index c1d7db71a726..e9c974ea4717 100644
> --- a/arch/arm64/kernel/relocate_kernel.S
> +++ b/arch/arm64/kernel/relocate_kernel.S
> @@ -8,7 +8,6 @@
>  
>  #include 
>  #include 
> -
>  #include 
>  #include 
>  #include 
> @@ -17,25 +16,21 @@
>  /*
>   * arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
>   *
> - * The memory that the old kernel occupies may be overwritten when coping the
> + * The memory that the old kernel occupies may be overwritten when copying the
>   * new image to its final location.  To assure that the
>   * arm64_relocate_new_kernel routine which does that copy is not overwritten,
>   * all code and data needed by arm64_relocate_new_kernel must be between the
>   * symbols arm64_relocate_new_kernel and arm64_relocate_new_kernel_end.  The
>   * machine_kexec() routine will copy arm64_relocate_new_kernel to the kexec
> - * control_code_page, a special page which has been set up to be preserved
> - * during the copy operation.
> + * safe memory that has been set up to be preserved during the copy operation.
>   */
>  ENTRY(arm64_relocate_new_kernel)
> -
>   /* Setup the list loop variables. */
>   mov x18, x2 /* x18 = dtb address */
>   mov x17, x1 /* x17 = kimage_start */
>   mov x16, x0 /* x16 = kimage_head */
> - raw_dcache_line_size x15, x0    /* x15 = dcache line size */
>   mov x14, xzr/* x14 = entry ptr */
>   mov x13, xzr/* x13 = copy dest */
> -
>   /* Clear the sctlr_el2 flags. */
>   mrs x0, CurrentEL
>   cmp x0, #CurrentEL_EL2
> @@ -46,14 +41,11 @@ ENTRY(arm64_relocate_new_kernel)
>   pre_disable_mmu_workaround
>   msr sctlr_el2, x0
>   isb
> -1:
> -
> - /* Check if the new image needs relocation. */
> +1:   /* Check if the new image needs relocation. */
>   tbnz    x16, IND_DONE_BIT, .Ldone
> -
> + raw_dcache_line_size x15, x1    /* x15 = dcache line size */
>  .Lloop:
>   and x12, x16, PAGE_MASK /* x12 = addr */
> -
>   /* Test the entry flags. */
>  .Ltest_source:
>   tbz x16, IND_SOURCE_BIT, .Ltest_indirection
> @@ -69,34 +61,18 @@ ENTRY(arm64_relocate_new_kernel)
>   b.lo    2b
>   dsb sy
>  
> - mov x20, x13
> - mov x21, x12
> - copy_page x20, x21, x0, x1, x2, x3, x4, x5, x6, x7
> -
> - /* dest += PAGE_SIZE */
> - add x13, x13, PAGE_SIZE
> + copy_page x13, x12, x0, x1, x2, x3, x4, x5, x6, x7
>   b   .Lnext
> -
>  .Ltest_indirection:
>   tbz x16, IND_INDIRECTION_BIT, .Ltest_destination
> -
> - /* ptr = addr */
> - mov x14, x12
> + mov x14, x12    /* ptr = addr */
>   b   .Lnext
> -
>  .Ltest_destination:
>   tbz x16, IND_DESTINATION_BIT, .Lnext
> -
> - /* dest = addr */
> - mov x13, x12
> -
> + mov x13, x12    /* dest = addr */
>  .Lnext:
> - /* entry = *ptr++ */
> - ldr x16, [x14], #8
> -
> - /* while (!(entry & DONE)) */
> - tbz x16, IND_DONE_BIT, .Lloop
> -
> + ldr x16, [x14], #8  /* entry = *ptr++ */
> + tbz x16, IND_DONE_BIT, .Lloop   /* while (!(entry & DONE)) */
>  .Ldone:
>   /* wait for writes from copy_page to finish */
>   dsb nsh
> @@ -110,16 +86,12 @@ ENTRY(arm64_relocate_new_kernel)
>   mov x2, xzr
>   mov x3, xzr
>   br  x17
> -
> -ENDPROC(arm64_relocate_new_kernel)
> -
>  .ltorg
> -
> 

Re: [PATCH v9 15/18] arm64: kexec: kexec EL2 vectors

2020-05-07 Thread James Morse
Hi Pavel,

What happened to the subject?
(it really needs a verb to make any sense)

On 26/03/2020 03:24, Pavel Tatashin wrote:
> If we have a EL2 mode without VHE, the EL2 vectors are needed in order
> to switch to EL2 and jump to new world with hyperivsor privileges.

Yes, but the hyp-stub has an API to let you do this... but you need your own version.

Could you explain why in the commit message?

(spelling: hyperivsor)


> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> index ab571fca9bd1..bd398def7627 100644
> --- a/arch/arm64/kernel/machine_kexec.c
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -84,6 +84,11 @@ int machine_kexec_post_load(struct kimage *kimage)
>   kern_reloc_arg->head = kimage->head;
>   kern_reloc_arg->entry_addr = kimage->start;
>   kern_reloc_arg->kern_arg0 = kimage->arch.dtb_mem;
> + /* Setup vector table only when EL2 is available, but no VHE */
> + if (is_hyp_mode_available() && !is_kernel_in_hyp_mode()) {
> + kern_reloc_arg->el2_vector = __pa(reloc_code)
> + + kexec_el2_vectors_offset;
> + }

Why does the asm relocation code need to know where the vector is? It must access it via HVC.
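
For reference, the hyp-stub already wraps that HVC (see arch/arm64/include/asm/virt.h); a minimal sketch, where 'el2_vectors_phys' is a made-up name for the physical address of the copied vector table:

| /* el2_vectors_phys: physical address of the copied EL2 vector table */
| if (is_hyp_mode_available() && !is_kernel_in_hyp_mode())
|         __hyp_set_vectors(el2_vectors_phys);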




Thanks,

James



[PATCH] kexec: Discard loaded image on memory hotplug

2020-05-01 Thread James Morse
On x86, the kexec payload contains a copy of the current memory map.
If memory is added or removed, this copy of the memory map becomes
stale. Getting this wrong may prevent the next kernel from booting.
The first kernel may die if it tries to re-assemble the next kernel
in memory that has been removed.

Discard the loaded kexec image when the memory map changes, user-space
should reload it.

Kdump is unaffected, as it is placed within the crashkernel reserved
memory area and only uses this memory. The stale memory map may affect
generation of the vmcore, but the kdump kernel should be in a position
to validate it.

Signed-off-by: James Morse 
---
This patch obsoletes:
 * kexec/memory_hotplug: Prevent removal and accidental use
https://lore.kernel.org/linux-arm-kernel/20200326180730.4754-1-james.mo...@arm.com/

 kernel/kexec_core.c | 40 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index c19c0dad1ebe..e1901e5bd4b5 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -22,10 +23,12 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1219,3 +1222,40 @@ void __weak arch_kexec_protect_crashkres(void)
 
 void __weak arch_kexec_unprotect_crashkres(void)
 {}
+
+/*
+ * If the memory layout changes, any loaded kexec image should be evicted
+ * as it may contain a copy of the (now stale) memory map. This also means
+ * we don't need to check the memory is still present when re-assembling the
+ * new kernel at machine_kexec() time.
+ */
+static int mem_change_cb(struct notifier_block *nb, unsigned long action,
+                        void *data)
+{
+   /*
+    * Actions are either a change, or a change being cancelled.
+    * A second discard for 'cancel's is harmless.
+    */
+
+   mutex_lock(&kexec_mutex);
+   if (kexec_image) {
+           kimage_free(xchg(&kexec_image, NULL));
+           pr_warn("loaded image discarded due to memory hotplug");
+   }
+   mutex_unlock(&kexec_mutex);
+
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block mem_change_nb = {
+   .notifier_call = mem_change_cb,
+};
+
+static int __init register_mem_change_cb(void)
+{
+   if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
+      return register_memory_notifier(&mem_change_nb);
+
+   return 0;
+}
+device_initcall(register_mem_change_cb);
-- 
2.26.1




Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-05-01 Thread James Morse
Hi guys,

On 22/04/2020 17:40, David Hildenbrand wrote:
>>>> Usually somewhere in the loaded image
>>>> is a copy of the memory map at the time the kexec kernel was loaded.
>>>> That will invalidate the memory map as well.
>>>
>>> Ah, unconditionally. Sure, x86 needs this.
>>> (arm64 re-discovers the memory map from firmware tables after kexec)

> Does this include hotplugged DIMMs e.g., under KVM?

If you advertise hotplugged memory to the guest using ACPI, yes.

We don't have a practical mechanism to pass 'facts' about the platform between kernels; instead we rely on those facts being discoverable, or described by firmware.


>>>> All of this should be for a very brief window of a few seconds, as
>>>> the loaded kexec image is quite short.
>>>
>>> It seems I'm the outlier anticipating anything could happen between
>>> those syscalls.
>>
>> The design is:
>>  sys_kexec_load()
>>  shutdown scripts
>> sys_reboot(LINUX_REBOOT_CMD_KEXEC);
>>
>> There are two system call simply so that the shutdown scripts can run.
>> Now maybe someone somewhere does something different but that is not
>> expected.

[...]

> Yes, and AFAIK, memory blocks which hold the reserved crashkernel area
> can usually not get offlined and, therefore, the memory cannot get removed.

The crashkernel area on arm64 will always land in un-removable memory. We set PG_Reserved on it too.


Thanks,

James



Re: [PATCH v9 10/18] arm64: kexec: cpu_soft_restart change argument types

2020-04-29 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> Change argument types from unsigned long to a more descriptive
> phys_addr_t.

For 'entry', which is a physical addresses, sure...

> diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
> index ed50e9587ad8..38cbd4068019 100644
> --- a/arch/arm64/kernel/cpu-reset.h
> +++ b/arch/arm64/kernel/cpu-reset.h
> @@ -10,17 +10,17 @@
>  
>  #include 
>  
> -void __cpu_soft_restart(unsigned long el2_switch, unsigned long entry,
> - unsigned long arg0, unsigned long arg1, unsigned long arg2);

> +void __cpu_soft_restart(phys_addr_t el2_switch, phys_addr_t entry,
> + phys_addr_t arg0, phys_addr_t arg1, phys_addr_t arg2);

This looks weird because it's re-using the hyp-stub API, as it might call the hyp-stub from the idmap. 'entry' is passed in, so this isn't tied to kexec. Without tying it to kexec, how do you know arg2 is a physical address?
I think it tried to be re-usable because 32bit has more users for this.

arg0-2 are unsigned long because the hyp-stub is just moving general purpose registers around.

This is to avoid casting?
Sure, it's only got one caller. This thing evolved because the platform-has-EL2 and kdump-while-KVM-was-running code was bolted on as they were discovered.


> -static inline void __noreturn cpu_soft_restart(unsigned long entry,
> -unsigned long arg0,
> -unsigned long arg1,
> -unsigned long arg2)
> +static inline void __noreturn cpu_soft_restart(phys_addr_t entry,
> +phys_addr_t arg0,
> +phys_addr_t arg1,
> +phys_addr_t arg2)
>  {
>   typeof(__cpu_soft_restart) *restart;
>  
> - unsigned long el2_switch = !is_kernel_in_hyp_mode() &&
> + phys_addr_t el2_switch = !is_kernel_in_hyp_mode() &&
>   is_hyp_mode_available();

What on earth happened here!?


>   restart = (void *)__pa_symbol(__cpu_soft_restart);


Thanks,

James



Re: [PATCH v9 09/18] arm64: kexec: call kexec_image_info only once

2020-04-29 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> Currently, kexec_image_info() is called during load time, and
> right before kernel is being kexec'ed. There is no need to do both.

I think the original logic was: if debugging, you'd see the load-time value in dmesg, and the kexec-time values when the machine panic()d and you hadn't made a note of the previous set. But I'm not sure why you'd have these debug messages enabled unless you were debugging kexec.


> So, call it only once when segments are loaded and the physical
> location of page with copy of arm64_relocate_new_kernel is known.

Sure, keep the earlier version that is more likely to be seen.

Acked-by: James Morse 


Thanks,

James



Re: [PATCH v9 07/18] arm64: trans_pgd: hibernate: idmap the single page that holds the copy page routines

2020-04-29 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> From: James Morse 
> 
> To resume from hibernate, the contents of memory are restored from
> the swap image. This may overwrite any page, including the running
> kernel and its page tables.
> 
> Hibernate copies the code it uses to do the restore into a single
> page that it knows won't be overwritten, and maps it with page tables
> built from pages that won't be overwritten.
> 
> Today the address it uses for this mapping is arbitrary, but to allow
> kexec to reuse this code, it needs to be idmapped. To idmap the page
> we must avoid the kernel helpers that have VA_BITS baked in.
> 
> Convert create_single_mapping() to take a single PA, and idmap it.
> The page tables are built in the reverse order to normal using
> pfn_pte() to stir in any bits between 52:48. T0SZ is always increased
> to cover 48bits, or 52 if the copy code has bits 52:48 in its PA.
> 
> Pasha: The original patch from James
> inux-arm-kernel/20200115143322.214247-4-james.mo...@arm.com

-EBROKENLINK

The convention is to use a 'Link:' tag in the signed-off area.
e.g. 5a3577039cbe

> Adopted it to trans_pgd, so it can be commonly used by both Kexec
> and Hibernate. Some minor clean-ups.

Please describe your changes just before your SoB. This means each author signs off on the stuff above their SoB, and it's obvious who made which changes.

Search for 'Lucky K Maintainer' in process/submitting-patches.rst for an example.


> diff --git a/arch/arm64/include/asm/trans_pgd.h b/arch/arm64/include/asm/trans_pgd.h
> index 97a7ea73b289..4912d3caf0ca 100644
> --- a/arch/arm64/include/asm/trans_pgd.h
> +++ b/arch/arm64/include/asm/trans_pgd.h
> @@ -32,4 +32,7 @@ int trans_pgd_create_copy(struct trans_pgd_info *info, pgd_t **trans_pgd,
>  int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t *trans_pgd,
>  void *page, unsigned long dst_addr, pgprot_t pgprot);

This trans_pgd_map_page() used to be create_single_mapping(), which is where the original patch made its changes.

You should only need one of these, not both.


> +int trans_pgd_idmap_page(struct trans_pgd_info *info, phys_addr_t *trans_ttbr0,
> +  unsigned long *t0sz, void *page);
> +
>  #endif /* _ASM_TRANS_TABLE_H */

> diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
> index 37d7d1c60f65..c2517d1af2af 100644
> --- a/arch/arm64/mm/trans_pgd.c
> +++ b/arch/arm64/mm/trans_pgd.c
> @@ -242,3 +242,52 @@ int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t *trans_pgd,
>  
>   return 0;
>  }
> +
> +/*
> + * The page we want to idmap may be outside the range covered by VA_BITS that
> + * can be built using the kernel's p?d_populate() helpers. As a one off, for a
> + * single page, we build these page tables bottom up and just assume that will
> + * need the maximum T0SZ.
> + *
> + * Returns 0 on success, and -ENOMEM on failure.
> + * On success trans_ttbr0 contains page table with idmapped page, t0sz is set to

> + * maxumum T0SZ for this page.

maxumum

> + */


Thanks,

James



Re: [PATCH v9 08/18] arm64: kexec: move relocation function setup

2020-04-29 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> Currently, kernel relocation function is configured in machine_kexec()
> at the time of kexec reboot by using control_code_page.
> 
> This operation, however, is more logical to be done during kexec_load,
> and thus remove from reboot time. Move, setup of this function to
> newly added machine_kexec_post_load().

This would avoid the need to special-case the cache maintenance, so it's a good cleanup...


> Because once MMU is enabled, kexec control page will contain more than
> relocation kernel, but also vector table, add pointer to the actual
> function within this page arch.kern_reloc. Currently, it equals to the
> beginning of page, we will add offsets later, when vector table is
> added.

If the vector table always comes second, wouldn't this be extra work to hold the value 0?
You can control the layout of this relocation code, as it has to be written in assembly.
I don't get why this would be necessary.


> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> index ae1bad0156cd..ec71a153cc2d 100644
> --- a/arch/arm64/kernel/machine_kexec.c
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -58,6 +59,17 @@ void machine_kexec_cleanup(struct kimage *kimage)
>   /* Empty routine needed to avoid build errors. */
>  }
>  
> +int machine_kexec_post_load(struct kimage *kimage)
> +{
> + void *reloc_code = page_to_virt(kimage->control_code_page);
> +
> + memcpy(reloc_code, arm64_relocate_new_kernel,
> +arm64_relocate_new_kernel_size);
> + kimage->arch.kern_reloc = __pa(reloc_code);

Could we move the two cache maintenance calls for this area in here too? Keeping it next to the modification makes it clearer why it is required.

In this case we can use flush_icache_range() instead of its __variant, because this now happens much earlier.
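
A minimal sketch of the copy and the maintenance sitting together in machine_kexec_post_load(), untested:

| memcpy(reloc_code, arm64_relocate_new_kernel,
|        arm64_relocate_new_kernel_size);
| flush_icache_range((unsigned long)reloc_code,
|                    (unsigned long)reloc_code +
|                    arm64_relocate_new_kernel_size);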


> + return 0;
> +}

Regardless,
Reviewed-by: James Morse 


Thanks,

James



Re: [PATCH v9 05/18] arm64: trans_pgd: pass NULL instead of init_mm to *_populate functions

2020-04-29 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> trans_pgd_* should be independent from mm context because the tables that
> are created by this code are used when there are no mm context around, as
> it is between kernels. Simply replace mm_init's with NULL.

arm64's p?d_populate() helpers don't use the mm parameter, so it doesn't make any difference. This was originally done so that if we ever needed anything from the mm, we didn't get a NULL dereference or EL0 behaviour due to a future '!= &init_mm'.

If you think it matters,
Acked-by: James Morse 


Thanks,

James



Re: [PATCH v9 03/18] arm64: trans_pgd: make trans_pgd_map_page generic

2020-04-29 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> kexec is going to use a different allocator, so make

> trans_pgd_map_page to accept allocator as an argument, and also
> kexec is going to use a different map protection, so also pass
> it via argument.

This trans_pgd_map_page() used to be create_single_mapping(). It creates page tables that map one page: the relocation code.

Why do you need a different pgprot? Surely PAGE_KERNEL_EXEC is exactly what you want.


> diff --git a/arch/arm64/include/asm/trans_pgd.h b/arch/arm64/include/asm/trans_pgd.h
> index 23153c13d1ce..ad5194ad178d 100644
> --- a/arch/arm64/include/asm/trans_pgd.h
> +++ b/arch/arm64/include/asm/trans_pgd.h
> @@ -12,10 +12,24 @@
>  #include 
>  #include 
>  
> +/*
> + * trans_alloc_page
> + *   - Allocator that should return exactly one zeroed page, if this
> + *    allocator fails, trans_pgd returns -ENOMEM error.

trans_pgd is what you pass in to trans_pgd_map_page() or trans_pgd_create_copy().
Do you mean what those functions return?


> + *
> + * trans_alloc_arg
> + *   - Passed to trans_alloc_page as an argument
> + */

> diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
> index 3d6f0fd73591..607bb1fbc349 100644
> --- a/arch/arm64/kernel/hibernate.c
> +++ b/arch/arm64/kernel/hibernate.c
> @@ -195,6 +200,11 @@ static int create_safe_exec_page(void *src_start, size_t 
> length,
>unsigned long dst_addr,
>phys_addr_t *phys_dst_addr)
>  {
> + struct trans_pgd_info trans_info = {
> + .trans_alloc_page   = hibernate_page_alloc,
> + .trans_alloc_arg= (void *)GFP_ATOMIC,
> + };

As you need another copy of this in the next patch, is it worth declaring this 
globally
and making it const?
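
i.e. something like this sketch (name invented, and the trans_pgd_* calls
would need to take a const pointer):

static const struct trans_pgd_info hibernate_trans_info = {
	.trans_alloc_page	= hibernate_page_alloc,
	.trans_alloc_arg	= (void *)GFP_ATOMIC,
};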


> diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
> index d20e48520cef..275a79935d7e 100644
> --- a/arch/arm64/mm/trans_pgd.c
> +++ b/arch/arm64/mm/trans_pgd.c
> @@ -180,8 +185,18 @@ int trans_pgd_create_copy(pgd_t **dst_pgdp, unsigned 
> long start,
>   return rc;
>  }
>  
> -int trans_pgd_map_page(pgd_t *trans_pgd, void *page, unsigned long dst_addr,
> -pgprot_t pgprot)
> +/*
> + * Add map entry to trans_pgd for a base-size page at PTE level.
> + * info: contains allocator and its argument
> + * trans_pgd:page table in which new map is added.
> + * page: page to be mapped.

> + * dst_addr: new VA address for the pages

~s/pages/page/

This thing only maps one page.


> + * pgprot:   protection for the page.
> + *
> + * Returns 0 on success, and -ENOMEM on failure.
> + */
> +int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t *trans_pgd,
> +void *page, unsigned long dst_addr, pgprot_t pgprot)
>  {
>   pgd_t *pgdp;
>   pud_t *pudp;



Thanks,

James



Re: [PATCH v9 02/18] arm64: hibernate: move page handling function to new trans_pgd.c

2020-04-29 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> Now, that we abstracted the required functions move them to a new home.
> Later, we will generalize these function in order to be useful outside
> of hibernation.

Reviewed-by: James Morse 


Thanks,

James



Re: [PATCH v9 04/18] arm64: trans_pgd: pass allocator trans_pgd_create_copy

2020-04-29 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> Make trans_pgd_create_copy and its subroutines to use allocator that is
> passed as an argument

Reviewed-by: James Morse 


Thanks,

James



Re: [PATCH v9 01/18] arm64: kexec: make dtb_mem always enabled

2020-04-29 Thread James Morse
Hi Pavel,

On 26/03/2020 03:24, Pavel Tatashin wrote:
> Currently, dtb_mem is enabled only when CONFIG_KEXEC_FILE is
> enabled. This adds ugly ifdefs to c files.

~s/dtb_mem/ARCH_HAS_KIMAGE_ARCH/ ?
dtb_mem is just one member of struct kimage_arch.


> Always enabled dtb_mem, when it is not used, it is NULL.
> Change the dtb_mem to phys_addr_t, as it is a physical address.

Regardless,
Reviewed-by: James Morse 


Thanks,

James



Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-04-22 Thread James Morse
Hi Eric,

On 22/04/2020 14:04, Eric W. Biederman wrote:
> James Morse  writes:
>> On 15/04/2020 21:29, Eric W. Biederman wrote:
>>> James Morse  writes:
>>>> arm64 recently queued support for memory hotremove, which led to some
>>>> new corner cases for kexec.
>>>>
>>>> If the kexec segments are loaded for a removable region, that region may
>>>> be removed before kexec actually occurs. This causes the first kernel to
>>>> lockup when applying the relocations. (I've triggered this on x86 too).
>>>>
>>>> The first patch adds a memory notifier for kexec so that it can refuse
>>>> to allow in-use regions to be taken offline.
>>>>
>>>>
>>>> This doesn't solve the problem for arm64, where the new kernel must
>>>> initially rely on the data structures from the first boot to describe
>>>> memory. These don't describe hotpluggable memory.
>>>> If kexec places the kernel in one of these regions, it must also provide
>>>> a DT that describes the region in which the kernel was mapped as memory.
>>>> (and somehow ensure its always present in the future...)
>>>>
>>>> To prevent this from happening accidentally with unaware user-space,
>>>> patches two and three allow arm64 to give these regions a different
>>>> name.
>>>>
>>>> This is a change in behaviour for arm64 as memory hotadd and hotremove
>>>> were added separately.
>>>>
>>>>
>>>> I haven't tried kdump.
>>>> Unaware kdump from user-space probably won't describe the hotplug
>>>> regions if the name is different, which saves us from problems if
>>>> the memory is no longer present at kdump time, but means the vmcore
>>>> is incomplete.
>>>>
>>>>
>>>> These patches are based on arm64's for-next/core branch, but can all
>>>> be merged independently.
>>>
>>> So I just looked through these quickly and I think there are real
>>> problems here we can fix, and that are worth fixing.
>>>
>>> However I am not thrilled with the fixes you propose.
>>
>> Sure. Unfortunately /proc/iomem is the only trick arm64 has to keep the 
>> existing
>> kexec-tools working.
>> (We've had 'unthrilling' patches like this before to prevent user-space from 
>> loading the
>> kernel over the top of the in-memory firmware tables.)
>>
>> arm64 expects the description of memory to come from firmware, be that UEFI 
>> for memory
>> present at boot, or the ACPI AML methods for memory that was added
>> later.
>>
>> On arm64 there is no standard location for memory. The kernel has to be 
>> handed a pointer
>> to the firmware tables that describe it. The kernel expects to boot from 
>> memory that was
>> present at boot.

> What do you do when the firmware is wrong? 

The firmware gets fixed. It's the only source of facts about the platform.


> Does arm64 support the
> mem=xxx@yyy kernel command line options?

Only the debug option to reduce the available memory.


> If you want to handle the general case of memory hotplug having a
> limitation that you have to boot from memory that was present at boot is
> a bug, because the memory might not be there.

arm64's arch code prevents the memory described by the UEFI memory map from 
being taken
offline/removed.

Memory present at boot may have firmware reservations that are being used by some other
agent in the system. Firmware-first RAS errors are one, the interrupt controllers'
property and pending tables are another.

The UEFI memory map's description of memory may have been incomplete, as there may have
been regions carved out and not described at all, instead of described as reserved.

The UEFI runtime services will live in memory described by the UEFI memory map.


>> Modifying the firmware tables at runtime doesn't solve the problem as we may 
>> need to move
>> the firmware-reserved memory region that describes memory. User-space may 
>> still load and
>> kexec either side of that update.
>>
>> Even if we could modify the structures at runtime, we can't update a loaded 
>> kexec image.
>> We have no idea which blob from userspace is the DT. It may not even be 
>> linux that has
>> been loaded.
> 
> What can be done and very reasonably so is on memory hotplug:
> - Unloaded any loaded kexec image.
> - Block loading any new image until the hotplug operation completes.
> 
> That is simple and generic, and can be done for all architectures.

Yes, certainly.
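
A rough sketch of what I think you are describing (function name invented;
blocking a new load while the operation is in flight would also be needed):

static int kexec_hotplug_cb(struct notifier_block *nb, unsigned long action,
			    void *data)
{
	if (action != MEM_GOING_OFFLINE && action != MEM_GOING_ONLINE)
		return NOTIFY_DONE;

	/* Discard any loaded image: user-space must load it again. */
	mutex_lock(&kexec_mutex);
	kimage_free(xchg(&kexec_image, NULL));
	mutex_unlock(&kexec_mutex);

	return NOTIFY_OK;
}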



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-22 Thread James Morse
Hi Eric,

On 15/04/2020 21:33, Eric W. Biederman wrote:
> James Morse  writes:
> 
>> An image loaded for kexec is not stored in place, instead its segments
>> are scattered through memory, and are re-assembled when needed. In the
>> meantime, the target memory may have been removed.
>>
>> Because mm is not aware that this memory is still in use, it allows it
>> to be removed.
>>
>> Add a memory notifier to prevent the removal of memory regions that
>> overlap with a loaded kexec image segment. e.g., when triggered from the
>> Qemu console:
>> | kexec_core: memory region in use
>> | memory memory32: Offline failed.
>>
>> Signed-off-by: James Morse 
> 
> Given that we are talking about the destination pages for kexec
> not where the loaded kernel is currently stored the description is
> confusing.

I think David has some better wording to cover this. I thought I had it with 
'scattered
and re-assembled'.


> Beyond that I think it would be better to simply unload the loaded
> kernel at memory hotunplug time.

Unconditionally, or if it aliases the removed region?

I don't particularly like it. User-space has asked for two impossible things; we are
changing the answer to the first when we see the second. It's a bit spooky. (maybe no one
will notice)


> Usually somewhere in the loaded image
> is a copy of the memory map at the time the kexec kernel was loaded.
> That will invalidate the memory map as well.

Ah, unconditionally. Sure, x86 needs this.
(arm64 re-discovers the memory map from firmware tables after kexec)

If that's an acceptable change in behaviour, sure, lets do that.


> All of this should be for a very brief window of a few seconds, as
> the loaded kexec image is quite short.

It seems I'm the outlier anticipating anything could happen between those 
syscalls.


Thanks,

James



Re: [PATCH 3/3] arm64: memory: Give hotplug memory a different resource name

2020-04-22 Thread James Morse
Hi Eric,

On 15/04/2020 21:37, Eric W. Biederman wrote:
> James Morse  writes:
> 
>> If kexec chooses to place the kernel in a memory region that was
>> added after boot, we fail to boot as the kernel is running from a
>> location that is not described as memory by the UEFI memory map or
>> the original DT.
>>
>> To prevent unaware user-space kexec from doing this accidentally,
>> give these regions a different name.
> 
> Please fix the problem and don't hack around it.

The problem is firmware didn't describe memory that wasn't present at boot.

arm64 relies on the firmware description of memory well before it can go poking 
around in
ACPI to find out where extra memory was added to the system.

We already need kexec to not overwrite in-memory structures left by firmware (like the
memory map). We do this by naming them reserved in /proc/iomem.

Doing the same for hotadded memory means existing kexec user-space can't do this
accidentally. The shape of /proc/iomem is the only trick in the book for arm64's kexec
userspace, as it's the only thing it looks at.



Thanks,

James








Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-04-22 Thread James Morse
Hi Eric,

On 15/04/2020 21:29, Eric W. Biederman wrote:
> James Morse  writes:
> 
>> Hello!
>>
>> arm64 recently queued support for memory hotremove, which led to some
>> new corner cases for kexec.
>>
>> If the kexec segments are loaded for a removable region, that region may
>> be removed before kexec actually occurs. This causes the first kernel to
>> lockup when applying the relocations. (I've triggered this on x86 too).
>>
>> The first patch adds a memory notifier for kexec so that it can refuse
>> to allow in-use regions to be taken offline.
>>
>>
>> This doesn't solve the problem for arm64, where the new kernel must
>> initially rely on the data structures from the first boot to describe
>> memory. These don't describe hotpluggable memory.
>> If kexec places the kernel in one of these regions, it must also provide
>> a DT that describes the region in which the kernel was mapped as memory.
>> (and somehow ensure its always present in the future...)
>>
>> To prevent this from happening accidentally with unaware user-space,
>> patches two and three allow arm64 to give these regions a different
>> name.
>>
>> This is a change in behaviour for arm64 as memory hotadd and hotremove
>> were added separately.
>>
>>
>> I haven't tried kdump.
>> Unaware kdump from user-space probably won't describe the hotplug
>> regions if the name is different, which saves us from problems if
>> the memory is no longer present at kdump time, but means the vmcore
>> is incomplete.
>>
>>
>> These patches are based on arm64's for-next/core branch, but can all
>> be merged independently.
> 
> So I just looked through these quickly and I think there are real
> problems here we can fix, and that are worth fixing.
> 
> However I am not thrilled with the fixes you propose.

Sure. Unfortunately /proc/iomem is the only trick arm64 has to keep the existing
kexec-tools working.
(We've had 'unthrilling' patches like this before to prevent user-space from 
loading the
kernel over the top of the in-memory firmware tables.)


arm64 expects the description of memory to come from firmware, be that UEFI for 
memory
present at boot, or the ACPI AML methods for memory that was added later.

On arm64 there is no standard location for memory. The kernel has to be handed 
a pointer
to the firmware tables that describe it. The kernel expects to boot from memory 
that was
present at boot.

Modifying the firmware tables at runtime doesn't solve the problem as we may 
need to move
the firmware-reserved memory region that describes memory. User-space may still 
load and
kexec either side of that update.

Even if we could modify the structures at runtime, we can't update a loaded 
kexec image.
We have no idea which blob from userspace is the DT. It may not even be linux 
that has
been loaded.

We can't emulate parts of UEFI's handover because kexec's purgatory isn't an 
EFI program.


I can't see a path through all this. If we have to modify existing user-space, 
I'd rather
leave it broken. We can detect the problem in the arch code and print a warning 
at load time.


James



Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names

2020-04-22 Thread James Morse
Hi Eric,

On 15/04/2020 21:36, Eric W. Biederman wrote:
> James Morse  writes:
> 
>> Memory added to the system by hotplug has a 'System RAM' resource created
>> for it. This is exposed to user-space via /proc/iomem.
>>
>> This poses problems for kexec on arm64. If kexec decides to place the
>> kernel in one of these newly onlined regions, the new kernel will find
>> itself booting from a region not described as memory in the firmware
>> tables.
>>
>> Arm64 doesn't have a structure like the e820 memory map that can be
>> re-written when memory is brought online. Instead arm64 uses the UEFI
>> memory map, or the memory node from the DT, sometimes both. We never
>> rewrite these.
>>
>> Allow an architecture to specify a different name for these hotplug
>> regions.
> 
> Gah.  No.
> 
> Please find a way to pass the current memory map to the loaded kexec'd
> kernel.

> Starting a kernel with no way for it to know what the current memory map
> is just plain scary.

We have one. Firmware tables are the source of all this information. We don't 
tamper with
them.

Firmware describes memory present at boot in the UEFI memory map or DT. On 
systems with
ACPI, regions that were added after booting are discovered by running AML 
methods. (for
which we need to allocate memory, so you can't describe boot memory like this)

This doesn't work if you kexec from a hot-added region. You've booted from 
memory that
wasn't present at boot.

I don't think this is fixable within this set of constraints.


Thanks,

James



Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-04-14 Thread James Morse
Hi Dave,

On 31/03/2020 04:46, Dave Young wrote:
> I agreed that file load is still not widely used,  but in the long run
> we should not maintain both of them all the future time.  Especially
> when some kernel-userspace interfaces need to be introduced, file load
> will have the natural advantage.  We may keep the kexec_load for other
> misc usecases, but we can use file load for the major modern
> linux-to-linux loading.  I'm not saying we can do it immediately, just
> thought we should reduce the duplicate effort and try to avoid hacking if
> possible.

Sure. My aim here is to never debug this problem again.


> Anyway about this particular issue, I wonder if we can just reload with
> a udev rule as replied in another mail.

What if it doesn't? I can't find such a rule on my debian machine.
I don't think user-space can be relied on for something like this.

The best we could hope for here is a dying gasp from the old kernel:
| kexec: memory layout changed since kexec load, this may not work.
| Bye!

... assuming anyone sees such a message.


Thanks,

James



Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names

2020-04-14 Thread James Morse
Hi Dave, Pingfan,

On 02/04/2020 07:12, piliu wrote:
> On 04/02/2020 01:49 PM, Dave Young wrote:
>> On 03/26/20 at 06:07pm, James Morse wrote:
>>> Memory added to the system by hotplug has a 'System RAM' resource created
>>> for it. This is exposed to user-space via /proc/iomem.
>>>
>>> This poses problems for kexec on arm64. If kexec decides to place the
>>> kernel in one of these newly onlined regions, the new kernel will find
>>> itself booting from a region not described as memory in the firmware
>>> tables.
>>>
>>> Arm64 doesn't have a structure like the e820 memory map that can be
>>> re-written when memory is brought online. Instead arm64 uses the UEFI
>>> memory map, or the memory node from the DT, sometimes both. We never
>>> rewrite these.
>>
>> Could arm64 use similar way to update DT, or a cooked UEFI maps?

>> Add pingfan in cc, he said ppc64 update the DT after a memremove thus it
>> would be good to just redo a kexec load.

> Yes, the memory changes will be observed through device-node under
> /proc/device-tree/ (which is for powerpc).
> 
> Later if running kexec -l/-p , it can build new dtb with the latest info
> from /proc/device-tree
For arm64, the device-tree is set in stone. We don't have the runtime parts of
open-firmware that powerpc does. (my knowledge in this area is extremely sparse)
arm64 platforms where stuff like this changes tend to use ACPI instead, and these all have
to boot with UEFI, which means it's the UEFI memory map that has authority.

We don't cook a fake UEFI memory map when things change because we treat it 
like the
set-in-stone DT. This means we only have discrepancies in firmware to work around, instead
of any we introduce ourselves.

One of the UEFI configuration tables describes addresses Linux programmed into 
hardware
that can't be reset. Newer versions of Linux know how to pick these up on 
kexec... but
older versions don't know how to parse/rewrite/move that table. Cooking up new 
versions of
these tables would prevent us doing stuff like this, which we need to work around hardware
that didn't get the 'kexec exists' memo.


Thanks,

James



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-14 Thread James Morse
Hi guys,

On 14/04/2020 08:05, David Hildenbrand wrote:
> On 10.04.20 21:10, Andrew Morton wrote:
>> It's unclear (to me) what is the status of this patchset.  But it does 
>> appear that
>> an new version can be expected?

> I'd suggest to unqueue the patches until we have a consensus.

Certainly!


> While there are a couple of ideas floating around here, my current
> suggestion would be either
> 
> 1. Indicate all hotplugged memory as "System RAM (hotplugged)" in
> /proc/iomem and the firmware memmap (on all architectures). This will
> require kexec changes,

> but I would have assume that kexec has to be
> updated in lock-step with the kernel

News to me: I was using the version I first built when arm64's support was new. 
I've only
had to update it once when we had to change user-space.

I don't think debian updates kexec-tools when it updates the kernel.


Making changes to /proc/iomem means updating user-space again, (for kdump). I'd 
like to
avoid that if its at all possible.


> just like e.g., makedumpfile.
> Modify kexec() to not place the kexec kernel on these areas (easy) but
> still consider them as crash regions to dump. When loading a kexec
> kernel, validate in the kernel that the memory is appropriate.


> 2. Make kexec() reload the the kernel whenever we e.g., get a udev event
> for removal of memory in /sys/devices/system/memory/.

I don't think we can rely on user-space to do something.


> On every remove_memory(), invalidate the loaded kernel in the kernel.

This is an option, ... but it's a change of behaviour. If user-space asks for two
impossible things, the second request should fail. Having the first one disappear is a bit
spooky...

Fortunately user-space checks the 'kexec -l' bit happened before it calls 
reboot() behind
'kexec -e'. So this works, but is not intuitive.
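
(That bit is /sys/kernel/kexec_loaded. A minimal user-space check, for
illustration:)

#include <stdio.h>

/* Returns 1 if an image is loaded, 0 if not, -1 on error. */
static int kexec_loaded(void)
{
	FILE *f = fopen("/sys/kernel/kexec_loaded", "r");
	int loaded = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &loaded) != 1)
		loaded = -1;
	fclose(f);
	return loaded;
}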

("Did I load it? What changed and when? oh, half a mile up in dmesg is a 
message saying
the kernel discarded the kexec kernel last wednesday.")


> As I mentioned somewhere, 1. will be interesting for virtio-mem, where
> we don't want any kexec kernel to be placed on virtio-mem-added memory.

Do these virtio-mem-added regions need to be accessible by kdump?
(do we already need a user-space change for that?)


A third option, along the line of what I posted:

Split the 'offline' and 'removed' ideas, which David mentioned somewhere. We'd 
end up with
(yet) another notifier chain, that prevents the memory being removed, but you 
can still
mark it as offline in /sys/. (...I'm not quite sure why you would do that...)

This would need hooking up for ACPI (which covers x86 and arm64), and other 
architectures
mechanisms for doing this...
arm64 can then switch its arch hook that prevents 'bootmem' being removed to this new
notifier chain, as the kernel can only boot from memory that was present at boot.


My preference is 3, then 2.

I think 1 is slightly less desirable than a message at kexec time that the 
memory layout
has changed since load, and this might not work...



Thanks,

James



Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names

2020-03-30 Thread James Morse
Hi David,

On 3/30/20 2:23 PM, David Hildenbrand wrote:
>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>> index 0a54ffac8c68..69b03dd7fc74 100644
>>>> --- a/mm/memory_hotplug.c
>>>> +++ b/mm/memory_hotplug.c
>>>> @@ -42,6 +42,10 @@
>>>>  #include "internal.h"
>>>>  #include "shuffle.h"
>>>>  
>>>> +#ifndef MEMORY_HOTPLUG_RES_NAME
>>>> +#define MEMORY_HOTPLUG_RES_NAME "System RAM"
>>>> +#endif
>>>
>>> So I assume changing this for all architectures would result in some
>>> user space tool breaking? Are we aware of any?
>>
>> Last time we had to touch arm64's /proc/iomem strings I went through debian's
>> codesearch for stuff that reads it, kexec-tools was the only thing that 
>> parsed
>> it in anger. (From memory, the other tools were looking for PCIe windows to 
>> do
>> firmware flashing..)
>>
>> Looking again, having qualifiers on the end of 'System RAM' looks like it 
>> could
>> confuse 's390-tools's detect_mem_chunks parser.
> 
> Good to know, we should find out if this could work.
> 
>>
>> It looks like the strings that come out of 'FIRMWARE_MEMMAP' are a duplicate 
>> set.
>>
>>
>>> I do wonder if we should simply change it for all architectures if possible.
>>
>> If its possible that would be great. But I suspect that ship has sailed,
>> changing it on other architectures could break some fragile parsing code.
> 
> I assume any parser has to be prepared for new types showing up.
> Otherwise these would not be future proof. The question is if a common
> prefix is problematic.
> 
> E.g., Use "Hotplugged System RAM" instead of "System RAM (hotplug)"

Aha, I went for a (suffix) because that is what 32bit Arm did for the boot 
alias.


>> I'm wary of changing it on arm64, the only thing that makes it tolerable is 
>> that
>> memory hot-add was relatively recently merged, and we don't anticipate it 
>> being
>> widely used until you can remove memory as well.
>>
>> Changing it on arm64 is to prevent today's versions of kexec-tools from
>> accidentally placing the new kernel in memory that wasn't described at boot.
>> This leads to an unhandled exception during boot[0] because the kernel can't
>> access itself via the mapping of all memory. (hotpluggable regions are only
>> discovered by suitably configured ACPI systems much later)

> I want the very same for virtio-mem (initially x86-only, but later open
> for all archs). Can also be interesting for Hyper-V. kexec should not
> try to use hotplugged memory as kexec target, as memory blocks can be
> partially inaccessible.

Great! I assumed these placement requirements would be arm64 specific.


> Of course, I can provide an interface to override the name via
> add_memory(), but having it on all architectures handled in a similar
> way right from the start would be nicer.

I agree having it the same on all architectures would be good.

It sounds like virtio-mem is a better argument for doing this than arm64's
firmware memory description.

I'll have a read, and maybe post something to linux-arch to do this at rc1.
(I assume we'll have a few weeks to make sure arm64 at least uses the same name
if it goes on longer)


Thanks,

James



Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-03-30 Thread James Morse
Hi Baoquan,

On 3/30/20 2:55 PM, Baoquan He wrote:
> On 03/26/20 at 06:07pm, James Morse wrote:
>> arm64 recently queued support for memory hotremove, which led to some
>> new corner cases for kexec.
>>
>> If the kexec segments are loaded for a removable region, that region may
>> be removed before kexec actually occurs. This causes the first kernel to
>> lockup when applying the relocations. (I've triggered this on x86 too).
>>
>> The first patch adds a memory notifier for kexec so that it can refuse
>> to allow in-use regions to be taken offline.
> 
> I talked about this with Dave Young. Currently, we tend to use
> kexec_file_load more in the future since most of its implementation is
> in kernel, we can get information about kernel more easilier. For the
> kexec kernel loaded into hotpluggable area, we can fix it in
> kexec_file_load side, we know the MOVABLE zone's start and end. As for
> the old kexec_load, we would like to keep it for back compatibility. At
> least in our distros, we have switched to kexec_file_load, will
> gradually obsolete kexec_load.

> So for this one, I suggest avoiding those
> MOVZBLE memory region when searching place for kexec kernel.

How does today's user-space know?


> Not sure if arm64 will still have difficulty.

arm64 added support for kexec_load first, then kexec_file_load (evidently a
mistake).
kexec_file_load support was only added in the last year or so; I'd hazard most
people using this are using the regular load kind. (and probably don't know or
care).



Thanks,

James



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-03-30 Thread James Morse
Hi David,

On 3/27/20 6:52 PM, David Hildenbrand wrote:
>>> 2. You do the kexec. The kexec kernel will only operate on a reserved
>>> memory region (reserved via e.g., kernel cmdline crashkernel=128M).
>>
>> I think you are merging the kexec and kdump behaviours.
>> (Wrong terminology? The things behind 'kexec -l Image' and 'kexec -p Image')
> 
> Oh, I see - I think your example below clarifies things. Something like
> that should go in the cover letter if we end up in this patch being
> required :)

Do you mean the commit message? I think it's far too long...

Adding a sentence about the way kexec load works may help; the first paragraph
would read:

| Kexec allows user-space to specify the address that the kexec image should be
| loaded to. Because this memory may be in use, an image loaded for kexec is not
| stored in place, instead its segments are scattered through memory, and are
| re-assembled when needed. In the meantime, the target memory may have been
| removed.

Do you think that's clearer?


> (I missed that the problematic part is "random" addresses passed by user
> space to the kernel, where it wants data to be loaded to on kexec -e)

[...]

>> Load kexec:
>> | root@vm:/sys/devices/system/memory# kexec -l /root/bzImage --reuse-cmdline
>>
> 
> I assume this will trigger
> 
> kexec_load -> do_kexec_load -> kimage_load_segment ->
> kimage_load_normal_segment -> kimage_alloc_page -> kimage_alloc_pages
> 
> Which will just allocate a bunch of pages and mark them reserved.
> 
> Now, AFAIKs, all allocations will be unmovable. So none of the kexec
> segment allocations will actually end up on your DIMM (as it is onlined
> online_movable).
> 
> So, the loaded image (with its segments) from user won't be problematic
> and not get placed on your DIMM.
> 
> 
> Now, the problematic part is (via man kexec_load) "mem and memsz specify
> a physical address range that is the target of the copy."
> 
> So the place where the image will be "assembled" at when doing the
> reboot. Understood :)

Yup.

[...]

> I wonder if we should instead make the "kexec -e" fail. It tries to
> touch random system memory.

Heh, isn't touching random system memory what kexec does?!

It's all described to user-space as 'System RAM'. Teaching it to probe
/sys/devices/memory/... would require a user-space change.


> Denying to offline MOVABLE memory should be avoided - and what kexec
> does here sounds dangerous to me (allowing it to write random system
> memory).

> Roughly what I am thinking is this:
> 
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index ba1d91e868ca..70c39a5307e5 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -1135,6 +1135,10 @@ int kernel_kexec(void)
> error = -EINVAL;
> goto Unlock;
> }
> +   if (!kexec_image_validate()) {
> +   error = -EINVAL;
> +   goto Unlock;
> +   }
> 
>  #ifdef CONFIG_KEXEC_JUMP
> if (kexec_image->preserve_context) {
> 
> 
> kexec_image_validate() would go over all segments and validate that the
> involved pages are actual valid memory (pfn_to_online_page()).
> 
> All we have to do is protect from memory hotplug until we switch to the
> new kernel.
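
(For concreteness, I read the proposed check as something like this untested
sketch:)

static bool kexec_image_validate(struct kimage *image)
{
	unsigned long pfn, end_pfn;
	int i;

	for (i = 0; i < image->nr_segments; i++) {
		pfn = PFN_DOWN(image->segment[i].mem);
		end_pfn = PFN_UP(image->segment[i].mem +
				 image->segment[i].memsz);

		/* every destination page must still be online memory */
		for (; pfn < end_pfn; pfn++)
			if (!pfn_to_online_page(pfn))
				return false;
	}

	return true;
}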

(migrate_to_reboot_cpu() can sleep), I think you'd end up with something like
this patch, but only while kexec_in_progress. I don't think letting kexec fail
if the events occur in a different order is good for user-space.


> Will probably need some thought. But it will actually also bail out when
> user space passes wrong physical memory addresses, instead of
> triple-faulting silently.

With this change, the reboot(LINUX_REBOOT_CMD_KEXEC) call would fail. This
thing doesn't usually return, so we're likely to trigger error-handling that has
never run before.

(Last time I debugged one of these, it turned out kexec had taken the network
interfaces down, meaning the nfsroot was no longer accessible)

How can user-space know whether kexec is going to succeed, or fail like this?
Any loaded kexec kernel could secretly be in this broken state.

Can user-space know what caused this to become unreliable? (without reading the
kernel source)


Given kexec can be unloaded by user-space, I think it's better to prevent us
getting into the broken state, preferably giving the hint that kexec is using
that memory. The user can 'kexec -u', then retry removing the memory.

I think forbidding the memory-offline is simpler for user-space to deal with.


Thanks,

James



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-03-27 Thread James Morse
Hi David,

On 3/27/20 5:06 PM, David Hildenbrand wrote:
> On 27.03.20 17:56, James Morse wrote:
>> On 3/27/20 9:30 AM, David Hildenbrand wrote:
>>> On 26.03.20 19:07, James Morse wrote:
>>>> An image loaded for kexec is not stored in place, instead its segments
>>>> are scattered through memory, and are re-assembled when needed. In the
>>>> meantime, the target memory may have been removed.
>>>>
>>>> Because mm is not aware that this memory is still in use, it allows it
>>>> to be removed.
>>>>
>>>> Add a memory notifier to prevent the removal of memory regions that
>>>> overlap with a loaded kexec image segment. e.g., when triggered from the
>>>> Qemu console:
>>>> | kexec_core: memory region in use
>>>> | memory memory32: Offline failed.

>>>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>>>> index c19c0dad1ebe..ba1d91e868ca 100644
>>>> --- a/kernel/kexec_core.c
>>>> +++ b/kernel/kexec_core.c
>>
>>> E.g., in kernel/kexec_core.c:kimage_alloc_pages()
>>>
>>> "SetPageReserved(pages + i);"
>>>
>>> Pages that are reserved cannot get offlined. How are you able to trigger
>>> that before this patch? (where is the allocation path for kexec, which
>>> will not set the pages reserved?)
>>
>> This sets page reserved on the memory it gets back from
>> alloc_pages() in kimage_alloc_pages(). This is when you load the image[0].
>>
>> The problem I see is for the target or destination memory once you execute 
>> the
>> image. Once machine_kexec() runs, it tries to write to this, assuming it is
>> still present...

> Let's recap
> 
> 1. You load the image. You allocate memory for e.g., the kexec kernel.
> The pages will be marked PG_reserved, so they cannot be offlined.
> 
> 2. You do the kexec. The kexec kernel will only operate on a reserved
> memory region (reserved via e.g., kernel cmdline crashkernel=128M).

I think you are merging the kexec and kdump behaviours.
(Wrong terminology? The things behind 'kexec -l Image' and 'kexec -p Image')

For kdump, yes, the new kernel is loaded into the crashkernel reservation, and
confined to it.


For regular kexec, the new kernel can be loaded anywhere in memory. There might
be a difference with how this works on arm64.

The regular kexec kernel isn't stored in its final location when it's loaded; it's
relocated there when the image is executed. The target/destination memory may
have been removed in the meantime.

(an example recipe below should clarify this)


> Is it that in 2., the reserved memory region (for the crashkernel) could
> have been offlined in the meantime?

No, for kdump: the crashkernel reservation is PG_reserved, and its not something
mm knows how to move, so that region can't be taken offline.

(On arm64 we additionally prevent the boot-memory from being removed as it is
all described as present by UEFI. The crashkernel reservation would always be
from this type of memory)


This is about a regular kexec, any crashdump reservation is irrelevant.
This kexec kernel is temporarily stored out of line, then relocated when 
executed.

A recipe so that we're at least on the same terminal! This is on a TX2 running
arm64's for-next/core using Qemu-TCG to emulate x86. (Sorry for the bizarre
config, its because Qemu supports hotremove on x86, but not yet on arm64).


Insert the memory:
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1

| root@vm:~# free -m
|   totalusedfree  shared  ...
| Mem:918  52 814   0  ...
| Swap: 0   0   0


Bring it online:
| root@vm:~# cd /sys/devices/system/memory/
| root@vm:/sys/devices/system/memory# for F in memory3*; do echo \
| online_movable > $F/state; done

| Built 1 zonelists, mobility grouping on.  Total pages: 251049
| Policy zone: DMA32

| -bash: echo: write error: Invalid argument
| root@vm:/sys/devices/system/memory# free -m
|   totalusedfree  shared  ...
| Mem:   1942  531836   0  ...
| Swap: 0   0   0


Load kexec:
| root@vm:/sys/devices/system/memory# kexec -l /root/bzImage --reuse-cmdline

Press the Attention button to request removal:

(qemu) device_del dimm1

| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Built 1 zonelists, mobility grouping on.  Total pages: 233728
| Policy zone: DMA32

The memory is gone:
| root@vm:/sys/devices/system/memory# free -m
|   total

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-03-27 Thread James Morse
Hi David,

On 3/27/20 9:30 AM, David Hildenbrand wrote:
> On 26.03.20 19:07, James Morse wrote:
>> An image loaded for kexec is not stored in place, instead its segments
>> are scattered through memory, and are re-assembled when needed. In the
>> meantime, the target memory may have been removed.
>>
>> Because mm is not aware that this memory is still in use, it allows it
>> to be removed.
>>
>> Add a memory notifier to prevent the removal of memory regions that
>> overlap with a loaded kexec image segment. e.g., when triggered from the
>> Qemu console:
>> | kexec_core: memory region in use
>> | memory memory32: Offline failed.
>>
>> Signed-off-by: James Morse 
>> ---
>>  kernel/kexec_core.c | 56 +
>>  1 file changed, 56 insertions(+)
>>
>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>> index c19c0dad1ebe..ba1d91e868ca 100644
>> --- a/kernel/kexec_core.c
>> +++ b/kernel/kexec_core.c

> E.g., in kernel/kexec_core.c:kimage_alloc_pages()
> 
> "SetPageReserved(pages + i);"
> 
> Pages that are reserved cannot get offlined. How are you able to trigger
> that before this patch? (where is the allocation path for kexec, which
> will not set the pages reserved?)

This sets page reserved on the memory it gets back from
alloc_pages() in kimage_alloc_pages(). This is when you load the image[0].

The problem I see is for the target or destination memory once you execute the
image. Once machine_kexec() runs, it tries to write to this, assuming it is
still present...


How can I make the commit message clearer?

're-assembled' and 'target memory' aren't quite cutting it; is there a
correct term to use?
(destination?)



Thanks,

James


[0] Just to convince myself:
| kimage_alloc_pages+0x30/0x15c
| kimage_alloc_page+0x210/0x7d8
| kimage_load_segment+0x14c/0x8c8
| __arm64_sys_kexec_load+0x4f0/0x720
| do_el0_svc+0x13c/0x3c0
| el0_sync_handler+0x9c/0x3c0
| el0_sync+0x158/0x180



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-03-27 Thread James Morse
Hi Anshuman,

On 3/27/20 12:43 AM, Anshuman Khandual wrote:
> On 03/26/2020 11:37 PM, James Morse wrote:
>> An image loaded for kexec is not stored in place, instead its segments
>> are scattered through memory, and are re-assembled when needed. In the
>> meantime, the target memory may have been removed.
>>
>> Because mm is not aware that this memory is still in use, it allows it
>> to be removed.

> Why the isolation process does not fail when these pages are currently
> being used by kexec ?

Kexec isn't using them right now, but it will once kexec is triggered. Those
physical addresses are held in some internal kexec data structures until then.
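
The destination lives in each segment until then (include/linux/kexec.h,
abridged from memory):

struct kexec_segment {
	union {
		void __user *buf;	/* the data, stashed elsewhere */
		void *kbuf;
	};
	size_t bufsz;
	unsigned long mem;		/* physical destination address */
	size_t memsz;
};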


>> Add a memory notifier to prevent the removal of memory regions that
>> overlap with a loaded kexec image segment. e.g., when triggered from the
>> Qemu console:
>> | kexec_core: memory region in use
>> | memory memory32: Offline failed.
> 
> Yes this is definitely an added protection for these kexec loaded kernels
> memory areas from being offlined but I would have expected the preceding
> offlining to have failed as well.

kexec hasn't allocated the memory; part of the regions user-space may specify for
the next kernel may be in use. There is nothing to stop the memory being used in
the meantime.


Any other way of doing this would prevent us saying why it failed.
Like this, the user can spot the 'kexec: Memory region in use' message, and
unload kexec.


>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>> index c19c0dad1ebe..ba1d91e868ca 100644
>> --- a/kernel/kexec_core.c
>> +++ b/kernel/kexec_core.c
>> @@ -1219,3 +1222,56 @@ void __weak arch_kexec_protect_crashkres(void)
>>  
>>  void __weak arch_kexec_unprotect_crashkres(void)
>>  {}
>> +
>> +/*
>> + * If user-space wants to offline memory that is in use by a loaded kexec
>> + * image, it should unload the image first.
>> + */

> Probably this would need kexec user manual and related system call man pages
> update as well.

I can't see anything relevant under Documentation. (kdump yes, kexec no...)


>> +static int mem_remove_cb(struct notifier_block *nb, unsigned long action,
>> + void *data)
>> +{
>> +int rv = NOTIFY_OK, i;
>> +struct memory_notify *arg = data;
>> +unsigned long pfn = arg->start_pfn;
>> +unsigned long nr_segments, sstart, send;
>> +unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
>> +
>> +might_sleep();
> 
> Required ?

Habit, and I think best practice. We take a mutex, so might_sleep(), but we also
conditionally return before lockdep would see the mutex. Having this annotation
means a dangerous change to the way this is called triggers a warning without
having to test memory hotplug explicitly.


>> +
>> +if (action != MEM_GOING_OFFLINE)
>> +return NOTIFY_DONE;
>> +
>> +mutex_lock(&kexec_mutex);
>> +if (kexec_image) {
>> +nr_segments = kexec_image->nr_segments;
>> +
>> +for (i = 0; i < nr_segments; i++) {
>> +sstart = PFN_DOWN(kexec_image->segment[i].mem);
>> +send = PFN_UP(kexec_image->segment[i].mem +
>> +  kexec_image->segment[i].memsz);
>> +
>> +if ((pfn <= sstart && sstart < end_pfn) ||
>> +(pfn <= send && send < end_pfn)) {
>> +pr_warn("Memory region in use\n");
>> +rv = NOTIFY_BAD;
>> +break;
>> +}
>> +}
>> +}
>> +mutex_unlock(&kexec_mutex);
>> +
>> +return rv;
> 
> Variable 'rv' is redundant, should use NOTIFY_[BAD|OK] directly instead.

You'd prefer a mutex_unlock() in the middle of the loop? ... or goto?
(I'm not convinced)
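
For the record, the single-exit goto version still keeps 'rv' around (a sketch
of the middle of the function):

	mutex_lock(&kexec_mutex);
	if (!kexec_image)
		goto out;

	for (i = 0; i < kexec_image->nr_segments; i++) {
		sstart = PFN_DOWN(kexec_image->segment[i].mem);
		send = PFN_UP(kexec_image->segment[i].mem +
			      kexec_image->segment[i].memsz);

		if ((pfn <= sstart && sstart < end_pfn) ||
		    (pfn <= send && send < end_pfn)) {
			pr_warn("Memory region in use\n");
			rv = NOTIFY_BAD;
			goto out;
		}
	}
out:
	mutex_unlock(&kexec_mutex);
	return rv;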

>> +}
>> +
>> +static struct notifier_block mem_remove_nb = {
>> +.notifier_call = mem_remove_cb,
>> +};
>> +
>> +static int __init register_mem_remove_cb(void)
>> +{
>> +if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
> 
> Should not all these new code here be wrapped with CONFIG_MEMORY_HOTREMOVE
> to reduce the scope as well as final code size when the config is disabled.

The compiler is really good at this. "if (false)" means this is all dead code,
and static means it's not exported, so the compiler is free to remove it.

Not #ifdef-ing it makes it much more readable, and means the compiler checks it's
valid C before it removes it. This avoids weird header include problems that
only show up on some rand-config builds.


Thanks,

James

>> +return register_memory_notifier(&mem_remove_nb);
>> +
>> +return 0;
>> +}
>> +device_initcall(register_mem_remove_cb);
>>
> 




Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-03-27 Thread James Morse
Hi David,

On 3/27/20 9:27 AM, David Hildenbrand wrote:
> On 26.03.20 19:07, James Morse wrote:
>> arm64 recently queued support for memory hotremove, which led to some
>> new corner cases for kexec.
>>
>> If the kexec segments are loaded for a removable region, that region may
>> be removed before kexec actually occurs. This causes the first kernel to
>> lockup when applying the relocations. (I've triggered this on x86 too).
>>
>> The first patch adds a memory notifier for kexec so that it can refuse
>> to allow in-use regions to be taken offline.

> IIRC other architectures handle that by setting the affected pages
> PageReserved. Any reason why to not stick to the same?

Hmm, I didn't spot this. How come core code doesn't do it if it's needed?

Doesn't PG_Reserved prevent the page from being used for regular allocations?
(or is that only if it's done early)

I prefer the runtime check as the dmesg output gives the user some chance of
knowing why their memory-offline failed, and doing something about it!


>> This doesn't solve the problem for arm64, where the new kernel must
>> initially rely on the data structures from the first boot to describe
>> memory. These don't describe hotpluggable memory.
>> If kexec places the kernel in one of these regions, it must also provide
>> a DT that describes the region in which the kernel was mapped as memory.
>> (and somehow ensure its always present in the future...)
>>
>> To prevent this from happening accidentally with unaware user-space,
>> patches two and three allow arm64 to give these regions a different
>> name.
>>
>> This is a change in behaviour for arm64 as memory hotadd and hotremove
>> were added separately.
>>
>>
>> I haven't tried kdump.
>> Unaware kdump from user-space probably won't describe the hotplug
>> regions if the name is different, which saves us from problems if
>> the memory is no longer present at kdump time, but means the vmcore
>> is incomplete.

> Whenever memory is added/removed, kdump.service is to be restarted from
> user space, which will fixup the data structures such that kdump will
> not try to dump unplugged memory.

Cunning.


> Also, makedumpfile will check if the
> sections are still around IIRC.

Curious. I thought the vmcore was virtually addressed; how does it know which
linear-map portions correspond to sysfs memory nodes with KASLR?


> Not sure what you mean by "Unaware kdump from user-space".

The existing kexec-tools binaries, which (I assume) don't go probing to find out
if 'System RAM' is removable or not, loading a kdump kernel along with the
user-space generated blob that describes the first kernel's memory usage to the
second kernel.

'user-space' here to distinguish all this from kexec_file_load().



Thanks,

James



Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names

2020-03-27 Thread James Morse
Hi David,

On 3/27/20 9:59 AM, David Hildenbrand wrote:
> On 26.03.20 19:07, James Morse wrote:
>> Memory added to the system by hotplug has a 'System RAM' resource created
>> for it. This is exposed to user-space via /proc/iomem.
>>
>> This poses problems for kexec on arm64. If kexec decides to place the
>> kernel in one of these newly onlined regions, the new kernel will find
>> itself booting from a region not described as memory in the firmware
>> tables.
>>
>> Arm64 doesn't have a structure like the e820 memory map that can be
>> re-written when memory is brought online. Instead arm64 uses the UEFI
>> memory map, or the memory node from the DT, sometimes both. We never
>> rewrite these.
>>
>> Allow an architecture to specify a different name for these hotplug
>> regions.


>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 0a54ffac8c68..69b03dd7fc74 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -42,6 +42,10 @@
>>  #include "internal.h"
>>  #include "shuffle.h"
>>  
>> +#ifndef MEMORY_HOTPLUG_RES_NAME
>> +#define MEMORY_HOTPLUG_RES_NAME "System RAM"
>> +#endif
> 
> So I assume changing this for all architectures would result in some
> user space tool breaking? Are we aware of any?

Last time we had to touch arm64's /proc/iomem strings I went through debian's
codesearch for stuff that reads it, kexec-tools was the only thing that parsed
it in anger. (From memory, the other tools were looking for PCIe windows to do
firmware flashing..)

Looking again, having qualifiers on the end of 'System RAM' looks like it could
confuse 's390-tools's detect_mem_chunks parser.

It looks like the strings that come out of 'FIRMWARE_MEMMAP' are a duplicate 
set.


> I do wonder if we should simply change it for all architectures if possible.

If it's possible that would be great. But I suspect that ship has sailed,
changing it on other architectures could break some fragile parsing code.

I'm wary of changing it on arm64; the only thing that makes it tolerable is that
memory hot-add was relatively recently merged, and we don't anticipate it being
widely used until you can remove memory as well.

Changing it on arm64 is to prevent today's versions of kexec-tools from
accidentally placing the new kernel in memory that wasn't described at boot.
This leads to an unhandled exception during boot[0] because the kernel can't
access itself via the mapping of all memory. (hotpluggable regions are only
discovered by suitably configured ACPI systems much later)


Thanks,

James

[0]
| NUMA: NODE_DATA [mem 0x7fdf1780-0x7fdf3fff]
| Unable to handle kernel paging request at virtual address 4230aff8
| Mem abort info:
|   ESR = 0x9606
|   EC = 0x25: DABT (current EL), IL = 32 bits
|   SET = 0, FnV = 0
|   EA = 0, S1PTW = 0
| Data abort info:
|   ISV = 0, ISS = 0x0006
|   CM = 0, WnR = 0
| swapper pgtable: 4k pages, 48-bit VAs, pgdp=8181d000
| [4230aff8] pgd=7fff9003, pud=7fdf7003, pmd=
| Internal error: Oops: 9606 [#1] PREEMPT SMP
| Modules linked in:
| CPU: 0 PID: 0 Comm: swapper Not tainted 5.6.0-rc3-00098-g3f6c690f5dfe #11618
| Hardware name: linux,dummy-virt (DT)
| pstate: 80400085 (Nzcv daIf +PAN -UAO BTYPE=--)
| pc : vmemmap_pud_populate+0x2c/0xa0
| lr : vmemmap_populate+0x78/0x154

| Call trace:
|  vmemmap_pud_populate+0x2c/0xa0
|  vmemmap_populate+0x78/0x154
|  __populate_section_memmap+0x3c/0x60
|  sparse_init_nid+0x29c/0x414
|  sparse_init+0x154/0x170
|  bootmem_init+0x78/0xdc
|  setup_arch+0x280/0x5d0
|  start_kernel+0x98/0x4f8
| Code: f9469a84 92748e73 8b010e61 cb040033 (f9400261)
| random: get_random_bytes called from print_oops_end_marker+0x34/0x60 with crng_init=0
| ---[ end trace  ]---
| Kernel panic - not syncing: Attempted to kill the idle task!
| ---[ end Kernel panic - not syncing: Attempted to kill the idle task!





Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-03-27 Thread James Morse
Hi Baoquan,

On 3/27/20 2:11 AM, Baoquan He wrote:
> On 03/26/20 at 06:07pm, James Morse wrote:
>> arm64 recently queued support for memory hotremove, which led to some
>> new corner cases for kexec.
>>
>> If the kexec segments are loaded for a removable region, that region may
>> be removed before kexec actually occurs. This causes the first kernel to
>> lockup when applying the relocations. (I've triggered this on x86 too).

> Do you mean you use 'kexec -l /boot/vmlinuz- --initrd ...' to load a
> kernel, next you hot remove some memory regions, then you execute
> 'kexec -e' to trigger kexec reboot?

Yes. But to make it more fun, get someone else to trigger the hot-remove behind
your back!


> I may not get the point clearly, but we usually do the loading and
> triggering of kexec-ed kernel at the same time. 

But it's two syscalls. Should the second one fail if the memory layout has
changed since the first?

(UEFI does this for exit-boot-services: there is a handshake to prove you know
what the current memory map is)


>> The first patch adds a memory notifier for kexec so that it can refuse
>> to allow in-use regions to be taken offline.
>>
>>
>> This doesn't solve the problem for arm64, where the new kernel must
>> initially rely on the data structures from the first boot to describe
>> memory. These don't describe hotpluggable memory.
>> If kexec places the kernel in one of these regions, it must also provide
>> a DT that describes the region in which the kernel was mapped as memory.
>> (and somehow ensure its always present in the future...)
>>
>> To prevent this from happening accidentally with unaware user-space,
>> patches two and three allow arm64 to give these regions a different
>> name.
>>
>> This is a change in behaviour for arm64 as memory hotadd and hotremove
>> were added separately.
>>
>>
>> I haven't tried kdump.
>> Unaware kdump from user-space probably won't describe the hotplug
>> regions if the name is different, which saves us from problems if
>> the memory is no longer present at kdump time, but means the vmcore
>> is incomplete.

> Currently, we will monitor udev events of mem hot add/remove, then
> reload kdump kernel. That reloading is only update the elfcorehdr,
> because crashkernel has to be reserved during 1st kernel bootup. I don't
> think this will have problem.

Great. I don't think there is much the kernel can do for the kdump case, so it's
good to know the tools already exist for detecting and restarting the kdump load
when the memory layout changes.

For kdump via kexec-file-load, we would need to regenerate the elfcorehdr; I'm
hoping that can be done in core code.


Thanks,

James



[PATCH 3/3] arm64: memory: Give hotplug memory a different resource name

2020-03-26 Thread James Morse
If kexec chooses to place the kernel in a memory region that was
added after boot, we fail to boot as the kernel is running from a
location that is not described as memory by the UEFI memory map or
the original DT.

To prevent unaware user-space kexec from doing this accidentally,
give these regions a different name.

Signed-off-by: James Morse 
---
This is a change in behaviour as seen by user-space, because memory hot-add
has already been merged.

 arch/arm64/include/asm/memory.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 2be67b232499..ef1686518469 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -166,6 +166,17 @@
 #define IOREMAP_MAX_ORDER  (PMD_SHIFT)
 #endif
 
+/*
+ * Memory hotplug allows new regions of 'System RAM' to be added to the system.
+ * These aren't described as memory by the UEFI memory map, or DT memory node.
+ * If we kexec from one of these regions, the new kernel boots from a location
+ * that isn't described as RAM.
+ *
+ * Give these resources a different name, so unaware kexec doesn't do this by
+ * accident.
+ */
+#define MEMORY_HOTPLUG_RES_NAME "System RAM (hotplug)"
+
 #ifndef __ASSEMBLY__
 extern u64 vabits_actual;
 #define PAGE_END   (_PAGE_END(vabits_actual))
-- 
2.25.1




[PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names

2020-03-26 Thread James Morse
Memory added to the system by hotplug has a 'System RAM' resource created
for it. This is exposed to user-space via /proc/iomem.

This poses problems for kexec on arm64. If kexec decides to place the
kernel in one of these newly onlined regions, the new kernel will find
itself booting from a region not described as memory in the firmware
tables.

Arm64 doesn't have a structure like the e820 memory map that can be
re-written when memory is brought online. Instead arm64 uses the UEFI
memory map, or the memory node from the DT, sometimes both. We never
rewrite these.

Allow an architecture to specify a different name for these hotplug
regions.

Signed-off-by: James Morse 
---
 mm/memory_hotplug.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0a54ffac8c68..69b03dd7fc74 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,6 +42,10 @@
 #include "internal.h"
 #include "shuffle.h"
 
+#ifndef MEMORY_HOTPLUG_RES_NAME
+#define MEMORY_HOTPLUG_RES_NAME "System RAM"
+#endif
+
 /*
  * online_page_callback contains pointer to current page onlining function.
  * Initially it is generic_online_page(). If it is required it could be
@@ -103,7 +107,7 @@ static struct resource *register_memory_resource(u64 start, u64 size)
 {
struct resource *res;
unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-   char *resource_name = "System RAM";
+   char *resource_name = MEMORY_HOTPLUG_RES_NAME;
 
if (start + size > max_mem_size)
return ERR_PTR(-E2BIG);
-- 
2.25.1




[PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-03-26 Thread James Morse
An image loaded for kexec is not stored in place, instead its segments
are scattered through memory, and are re-assembled when needed. In the
meantime, the target memory may have been removed.

Because mm is not aware that this memory is still in use, it allows it
to be removed.

Add a memory notifier to prevent the removal of memory regions that
overlap with a loaded kexec image segment. e.g., when triggered from the
Qemu console:
| kexec_core: memory region in use
| memory memory32: Offline failed.

Signed-off-by: James Morse 
---
 kernel/kexec_core.c | 56 +
 1 file changed, 56 insertions(+)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index c19c0dad1ebe..ba1d91e868ca 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -22,10 +23,12 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1219,3 +1222,56 @@ void __weak arch_kexec_protect_crashkres(void)
 
 void __weak arch_kexec_unprotect_crashkres(void)
 {}
+
+/*
+ * If user-space wants to offline memory that is in use by a loaded kexec
+ * image, it should unload the image first.
+ */
+static int mem_remove_cb(struct notifier_block *nb, unsigned long action,
+void *data)
+{
+   int rv = NOTIFY_OK, i;
+   struct memory_notify *arg = data;
+   unsigned long pfn = arg->start_pfn;
+   unsigned long nr_segments, sstart, send;
+   unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
+
+   might_sleep();
+
+   if (action != MEM_GOING_OFFLINE)
+   return NOTIFY_DONE;
+
+   mutex_lock(&kexec_mutex);
+   if (kexec_image) {
+   nr_segments = kexec_image->nr_segments;
+
+   for (i = 0; i < nr_segments; i++) {
+   sstart = PFN_DOWN(kexec_image->segment[i].mem);
+   send = PFN_UP(kexec_image->segment[i].mem +
+ kexec_image->segment[i].memsz);
+
+   if ((pfn <= sstart && sstart < end_pfn) ||
+   (pfn <= send && send < end_pfn)) {
+   pr_warn("Memory region in use\n");
+   rv = NOTIFY_BAD;
+   break;
+   }
+   }
+   }
+   mutex_unlock(&kexec_mutex);
+
+   return rv;
+}
+
+static struct notifier_block mem_remove_nb = {
+   .notifier_call = mem_remove_cb,
+};
+
+static int __init register_mem_remove_cb(void)
+{
+   if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
+   return register_memory_notifier(&mem_remove_nb);
+
+   return 0;
+}
+device_initcall(register_mem_remove_cb);
-- 
2.25.1




[PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-03-26 Thread James Morse
Hello!

arm64 recently queued support for memory hotremove, which led to some
new corner cases for kexec.

If the kexec segments are loaded for a removable region, that region may
be removed before kexec actually occurs. This causes the first kernel to
lockup when applying the relocations. (I've triggered this on x86 too).

The first patch adds a memory notifier for kexec so that it can refuse
to allow in-use regions to be taken offline.


This doesn't solve the problem for arm64, where the new kernel must
initially rely on the data structures from the first boot to describe
memory. These don't describe hotpluggable memory.
If kexec places the kernel in one of these regions, it must also provide
a DT that describes the region in which the kernel was mapped as memory.
(and somehow ensure it's always present in the future...)

To prevent this from happening accidentally with unaware user-space,
patches two and three allow arm64 to give these regions a different
name.

This is a change in behaviour for arm64 as memory hotadd and hotremove
were added separately.


I haven't tried kdump.
Unaware kdump from user-space probably won't describe the hotplug
regions if the name is different, which saves us from problems if
the memory is no longer present at kdump time, but means the vmcore
is incomplete.


These patches are based on arm64's for-next/core branch, but can all
be merged independently.

Thanks,

James Morse (3):
  kexec: Prevent removal of memory in use by a loaded kexec image
  mm/memory_hotplug: Allow arch override of non boot memory resource
names
  arm64: memory: Give hotplug memory a different resource name

 arch/arm64/include/asm/memory.h | 11 +++
 kernel/kexec_core.c | 56 +
 mm/memory_hotplug.c |  6 +++-
 3 files changed, 72 insertions(+), 1 deletion(-)

-- 
2.25.1




Re: [PATCH v7 1/4] x86: kdump: move reserve_crashkernel_low() into crash_core.c

2020-01-16 Thread James Morse
Hi guys,

On 28/12/2019 09:32, Dave Young wrote:
> On 12/27/19 at 07:04pm, Chen Zhou wrote:
>> On 2019/12/27 13:54, Dave Young wrote:
>>> On 12/23/19 at 11:23pm, Chen Zhou wrote:
 In preparation for supporting reserve_crashkernel_low in arm64 as
 x86_64 does, move reserve_crashkernel_low() into kernel/crash_core.c.

 Note, in arm64, we reserve low memory if and only if crashkernel=X,low
 is specified. Different with x86_64, don't set low memory automatically.
>>>
>>> Do you have any reason for the difference?  I'd expect we have same
>>> logic if possible and remove some of the ifdefs.
>>
>> In x86_64, if we reserve crashkernel above 4G, then we call 
>> reserve_crashkernel_low()
>> to reserve low memory.
>>
>> In arm64, to simplify, we call reserve_crashkernel_low() at the beginning of 
>> reserve_crashkernel()
>> and then relax the arm64_dma32_phys_limit if reserve_crashkernel_low() 
>> allocated something.
>> In this case, if reserve crashkernel below 4G there will be 256M low memory 
>> set automatically
>> and this needs extra considerations.

> Sorry that I did not read the old thread details and thought that is
> arch dependent.  But rethink about that, it would be better that we can
> have same semantic about crashkernel parameters across arches.  If we
> make them different then it causes confusion, especially for
> distributions.

Surely distros also want one crashkernel* string they can use on all platforms 
without
having to detect the kernel version, platform or changeable memory layout...


> OTOH, I thought if we reserve high memory then the low memory should be
> needed.  There might be some exceptions, but I do not know the exact
> one,

> can we make the behavior same, and special case those systems which
> do not need low memory reservation.

It's tricky to work out which systems are the 'normal' ones.

We don't have a fixed memory layout for arm64. Some systems have no memory 
below 4G.
Others have no memory above 4G.

Chen Zhou's machine has some memory below 4G, but its too precious to reserve a 
large
chunk for kdump. Without any memory below 4G some of the drivers won't work.

I don't see what distros can set as their default for all platforms if high/low 
are
mutually exclusive with the 'crashkernel=' in use today. How did x86 navigate 
this, ... or
was it so long ago?

No one else has reported a problem with the existing placement logic, hence 
treating this
'low' thing as the 'in addition' special case.
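
(For reference, the x86-style split being discussed looks like this on the
command line, with sizes invented for illustration:

| crashkernel=512M,high crashkernel=256M,low

i.e. a big reservation placed anywhere, plus a small chunk below 4G for the
devices that need it.)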


>> previous discusses:
>>  https://lkml.org/lkml/2019/6/5/670
>>  https://lkml.org/lkml/2019/6/13/229
> 
> Another concern from James:
> "
> With both crashk_low_res and crashk_res, we end up with two entries in 
> /proc/iomem called
> "Crash kernel". Because its sorted by address, and kexec-tools stops 
> searching when it
> find "Crash kernel", you are always going to get the kernel placed in the 
> lower portion.
> "
> 
> The kexec-tools code is iterating all "Crash kernel" ranges and add them
> in an array.  In X86 code, it uses the higher range to locate memory.

Then my hurried reading of what the user-space code does was wrong!

If kexec-tools places the kernel in the low region, there may not be enough 
memory left
for whatever purpose it was reserved for. This was the motivation for giving it 
a
different name.
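
(To check my reading this time: the x86 kexec-tools behaviour Dave describes
amounts to something like the sketch below. Illustrative C only; the struct
and helper are invented, this is not the actual kexec-tools code:

| struct memory_range { unsigned long long start, end; };
|
| /* Given every range named "Crash kernel" from /proc/iomem,
|  * prefer the highest one when placing the kernel. */
| static struct memory_range pick_crash_range(const struct memory_range *r,
| 					    int nr)
| {
| 	struct memory_range best = r[0];
| 	int i;
|
| 	for (i = 1; i < nr; i++)
| 		if (r[i].start > best.start)
| 			best = r[i];
| 	return best;
| }

With only the resource name to go on, unaware user-space can't make that
choice, hence giving the low region a different name.)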


Thanks,

James



Re: [PATCH v8 00/25] arm64: MMU enabled kexec relocation

2020-01-15 Thread James Morse
Hi Pavel,

On 08/01/2020 17:59, Pavel Tatashin wrote:
> On Wed, Jan 8, 2020 at 12:32 PM Will Deacon  wrote:
>> On Wed, Dec 04, 2019 at 10:59:13AM -0500, Pavel Tatashin wrote:
>>> Many changes compared to version 6, so I decided to send it out now.
>>> James Morse raised an important issue to which I do not have a solution
>>> yet. But would like to discuss it.

(Christmas was badly timed relative to my holiday, so it's taken a while for me
to catch up)

The memory out of range of the idmap?
I've posted an RFC here[0] that makes hibernate's idmap a ttbr0 page. This
should let you
reuse that code and test it without a machine with a funny memory layout.


Thanks,

James

[0] 
https://lore.kernel.org/linux-arm-kernel/20200115143322.214247-1-james.mo...@arm.com/



Re: [RESEND PATCH v5 2/5] arm64/crash_core: Export TCR_EL1.T1SZ in vmcoreinfo

2020-01-10 Thread James Morse
Hi Bhupesh,

On 25/12/2019 19:01, Bhupesh Sharma wrote:
> On 12/12/2019 04:02 PM, James Morse wrote:
>> On 29/11/2019 19:59, Bhupesh Sharma wrote:
>>> vabits_actual variable on arm64 indicates the actual VA space size,
>>> and allows a single binary to support both 48-bit and 52-bit VA
>>> spaces.
>>>
>>> If the ARMv8.2-LVA optional feature is present, and we are running
>>> with a 64KB page size; then it is possible to use 52-bits of address
>>> space for both userspace and kernel addresses. However, any kernel
>>> binary that supports 52-bit must also be able to fall back to 48-bit
>>> at early boot time if the hardware feature is not present.
>>>
>>> Since TCR_EL1.T1SZ indicates the size offset of the memory region
>>> addressed by TTBR1_EL1 (and hence can be used for determining the
>>> vabits_actual value) it makes more sense to export the same in
>>> vmcoreinfo rather than vabits_actual variable, as the name of the
>>> variable can change in future kernel versions, but the architectural
>>> constructs like TCR_EL1.T1SZ can be used better to indicate intended
>>> specific fields to user-space.
>>>
>>> User-space utilities like makedumpfile and crash-utility, need to
>>> read/write this value from/to vmcoreinfo
>>
>> (write?)
> 
> Yes, also write so that the vmcoreinfo from an (crashing) arm64 system can be 
> used for
> analysis of the root-cause of panic/crash on say an x86_64 host using 
> utilities like
> crash-utility/gdb.

I read this as "User-space [...] needs to write to vmcoreinfo".


>>> for determining if a virtual address lies in the linear map range.
>>
>> I think this is a fragile example. The debugger shouldn't need to know this.
> 
> Well that the current user-space utility design, so I am not sure we can 
> tweak that too much.
> 
>>> The user-space computation for determining whether an address lies in
>>> the linear map range is the same as we have in kernel-space:
>>>
>>>    #define __is_lm_address(addr)    (!(((u64)addr) & BIT(vabits_actual - 
>>> 1)))
>>
>> This was changed with 14c127c957c1 ("arm64: mm: Flip kernel VA space"). If 
>> user-space
>> tools rely on 'knowing' the kernel memory layout, they will constantly have
>> to be fixed
>> and updated. This is a poor argument for adding this to something that ends 
>> up as ABI.
> 
> See above. The user-space has to rely on some ABI/guaranteed hardware-symbols 
> which can be
> used for 'determining' the kernel memory layout.

I disagree. Everything and anything in the kernel will change. The ABI rules 
apply to
stuff exposed via syscalls and kernel filesystems. They do not apply to kernel
internals,
like the memory layout we used yesterday. 14c127c957c1 is a case in point.

A debugger trying to rely on this sort of thing would have to play catchup 
whenever it
changes.

I'm looking for a justification that isn't paper-thin. Putting 'for guessing 
the memory
map' in the commit message gives the impression we can support that.


>> I think a better argument is walking the kernel page tables from the core 
>> dump.
>> Core code's vmcoreinfo exports the location of the kernel page tables, but 
>> in the example
>> above you can't walk them without knowing how T1SZ was configured.
> 
> Sure, both makedumpfile and crash-utility (which walks the kernel page tables 
> from the
> core dump) use this (and similar) information currently in the user-space.

[...]

>> (From-memory: one of vmcore/kcore is virtually addressed, the other 
>> physically. Does this
>> fix your problem in both cases?)
>>
>>
>>> diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
>>> index ca4c3e12d8c5..f78310ba65ea 100644
>>> --- a/arch/arm64/kernel/crash_core.c
>>> +++ b/arch/arm64/kernel/crash_core.c
>>> @@ -7,6 +7,13 @@
>>>   #include 
>>>   #include 
>>
>> You need to include asm/sysreg.h for read_sysreg(), and asm/pgtable-hwdef.h 
>> for the macros
>> you added.
> 
> Ok. Will check as I did not get any compilation errors without the same and 
> build-bot also
> did not raise a flag for the missing include files.

Don't trust the header jungle!


>>> +static inline u64 get_tcr_el1_t1sz(void);
> 
>> Why do you need to do this?
> 
> Without this I was getting a missing declaration error, while compiling the 
> code.

Missing declaration?

>>> +static inline u64 get_tcr_el1_t1sz(void)
>>> +{
>>> +    return (read_sysreg(tcr_el1) & TCR_T1SZ_MASK) >> TCR_T1SZ_OFFSET;
>>> +}

Here it is! (I must be going mad...)


Thanks,

James



Re: [RESEND PATCH v5 2/5] arm64/crash_core: Export TCR_EL1.T1SZ in vmcoreinfo

2019-12-12 Thread James Morse
Hi Bhupesh,

On 29/11/2019 19:59, Bhupesh Sharma wrote:
> vabits_actual variable on arm64 indicates the actual VA space size,
> and allows a single binary to support both 48-bit and 52-bit VA
> spaces.
> 
> If the ARMv8.2-LVA optional feature is present, and we are running
> with a 64KB page size; then it is possible to use 52-bits of address
> space for both userspace and kernel addresses. However, any kernel
> binary that supports 52-bit must also be able to fall back to 48-bit
> at early boot time if the hardware feature is not present.
> 
> Since TCR_EL1.T1SZ indicates the size offset of the memory region
> addressed by TTBR1_EL1 (and hence can be used for determining the
> vabits_actual value) it makes more sense to export the same in
> vmcoreinfo rather than vabits_actual variable, as the name of the
> variable can change in future kernel versions, but the architectural
> constructs like TCR_EL1.T1SZ can be used better to indicate intended
> specific fields to user-space.
> 
> User-space utilities like makedumpfile and crash-utility, need to
> read/write this value from/to vmcoreinfo

(write?)

> for determining if a virtual address lies in the linear map range.

I think this is a fragile example. The debugger shouldn't need to know this.


> The user-space computation for determining whether an address lies in
> the linear map range is the same as we have in kernel-space:
> 
>   #define __is_lm_address(addr)   (!(((u64)addr) & BIT(vabits_actual - 
> 1)))

This was changed with 14c127c957c1 ("arm64: mm: Flip kernel VA space"). If 
user-space
tools rely on 'knowing' the kernel memory layout, they will constantly have to
be fixed
and updated. This is a poor argument for adding this to something that ends up 
as ABI.


I think a better argument is walking the kernel page tables from the core dump.
Core code's vmcoreinfo exports the location of the kernel page tables, but in 
the example
above you can't walk them without knowing how T1SZ was configured.

On older kernels, user-space that needs this would have to assume the value it 
computes
from VA_BITs (also in vmcoreinfo) is the value in use.
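
(To make that concrete: the arithmetic user-space does with the exported value
is just the architectural definition of T1SZ. A sketch, using the field name
from the patch below:

| /* TCR_EL1.T1SZ is the size offset of the TTBR1 region: */
| uint64_t vabits_actual = 64 - tcr_el1_t1sz;
|
| /* e.g. T1SZ == 16 -> 48-bit VA, T1SZ == 12 -> 52-bit VA */

which is what a page-table walk needs in order to pick the right starting
level.)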


---%<---
> I have sent out user-space patches for makedumpfile and crash-utility
> to add features for obtaining vabits_actual value from TCR_EL1.T1SZ (see
> [0] and [1]).
> 
> Akashi reported that he was able to use this patchset and the user-space
> changes to get user-space working fine with the 52-bit kernel VA
> changes (see [2]).
> 
> [0]. http://lists.infradead.org/pipermail/kexec/2019-November/023966.html
> [1]. http://lists.infradead.org/pipermail/kexec/2019-November/024006.html
> [2]. http://lists.infradead.org/pipermail/kexec/2019-November/023992.html
---%<---

This probably belongs in the cover letter instead of the commit log.

(From-memory: one of vmcore/kcore is virtually addressed, the other physically. 
Does this
fix your problem in both cases?)


> diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
> index ca4c3e12d8c5..f78310ba65ea 100644
> --- a/arch/arm64/kernel/crash_core.c
> +++ b/arch/arm64/kernel/crash_core.c
> @@ -7,6 +7,13 @@
>  #include 
>  #include 

You need to include asm/sysreg.h for read_sysreg(), and asm/pgtable-hwdef.h for 
the macros
you added.


> +static inline u64 get_tcr_el1_t1sz(void);

Why do you need to do this?


> +static inline u64 get_tcr_el1_t1sz(void)
> +{
> + return (read_sysreg(tcr_el1) & TCR_T1SZ_MASK) >> TCR_T1SZ_OFFSET;
> +}

(We don't modify this one, and its always the same one very CPU, so this is 
fine.
This function is only called once when the stringy vmcoreinfo elf_note is 
created...)


>  void arch_crash_save_vmcoreinfo(void)
>  {
>   VMCOREINFO_NUMBER(VA_BITS);
> @@ -15,5 +22,7 @@ void arch_crash_save_vmcoreinfo(void)
>   kimage_voffset);
>   vmcoreinfo_append_str("NUMBER(PHYS_OFFSET)=0x%llx\n",
>   PHYS_OFFSET);
> + vmcoreinfo_append_str("NUMBER(tcr_el1_t1sz)=0x%llx\n",
> + get_tcr_el1_t1sz());

You document the name as being upper-case.
The two values either side are upper-case.


>   vmcoreinfo_append_str("KERNELOFFSET=%lx\n", kaslr_offset());
>  }


Thanks,

James



Re: [PATCH] efi/memreserve: register reservations as 'reserved' in /proc/iomem

2019-12-04 Thread James Morse
Hi Masa,

On 04/12/2019 17:17, Masayoshi Mizuma wrote:
> Thank you for sending the patch, but unfortunately it doesn't work for the 
> issue...
> 
> After applied your patch, the LPI tables are marked as reserved in
> /proc/iomem like as:
> 
> 8030-a1fd : System RAM
>   8048-8134 : Kernel code
>   8135-817b : reserved
>   817c-82ac : Kernel data
>   830f-830f : reserved # Property table
>   8348-83480fff : reserved # Pending table
>   8349-8349 : reserved # Pending table
> 
> However, kexec tries to allocate memory from System RAM, it doesn't care
> the reserved in System RAM.

> I'm not sure why kexec doesn't care the reserved in System RAM, however,

Hmm, we added these to fix a problem with the UEFI memory map, and more 
recently ACPI
tables being overwritten by kexec.

Which version of kexec-tools are you using? Could you try:
https://git.linaro.org/people/takahiro.akashi/kexec-tools.git/commit/?h=arm64/resv_mem


> if the kexec behaivor is right, the LPI tables should not belong to
> System RAM.

> Like as:
> 
> 8030-830e : System RAM
>   8048-8134 : Kernel code
>   8135-817b : reserved
>   817c-82ac : Kernel data
> 830f-830f : reserved # Property table
> 8348-83480fff : reserved # Pending table
> 8349-8349 : reserved # Pending table
> 834a-a1fd : System RAM
> 
> I don't have ideas to separete LPI tables from System RAM... so I tried
> to add a new file to inform the LPI tables to userspace.

This is how 'nomap' memory appears, we carve it out of System RAM. A side 
effect of this
is kdump can't touch it, as you've told it this isn't memory.

As these tables are memory, mapped by the linear map, I think Ard's patch is 
the right
thing to do ... I suspect your kexec-tools doesn't have those patches from 
Akashi to make
it honour all second level entries.
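
(By 'honour all second level entries' I mean excluding the indented children
of "System RAM" when building the list of ranges kexec can load into. Roughly
the check below; this is a sketch, not the actual kexec-tools parser:

| #include <stdio.h>
| #include <string.h>
|
| static void scan_iomem(void)
| {
| 	FILE *f = fopen("/proc/iomem", "r");
| 	char line[128];
|
| 	if (!f)
| 		return;
| 	while (fgets(line, sizeof(line), f)) {
| 		/* children of "System RAM" are indented */
| 		if (line[0] == ' ' && strstr(line, " : reserved"))
| 			printf("exclude: %s", line);
| 	}
| 	fclose(f);
| }

With that in place the 'reserved' second-level entries from Ard's patch get
avoided instead of ignored.)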


Thanks,

James



Re: [PATCH V4 0/2] Add support for arm64 to carry ima measurement

2019-10-25 Thread James Morse
Hi Pavel,

On 15/10/2019 19:47, Pavel Tatashin wrote:
>> I think the UEFI persistent-memory-reservations thing is a better fit for 
>> this [0][1].

> Thank you for your thought. As I understand you propose the to use the
> existing method as such:
> 1. Use the existing kexec ABI to pass reservation from kernel to
> kernel using EFI the same as is done for GICv3 tables.
> 2. Allow this memory to be reservable only during first Linux boot via
> EFI memory reserve
> 3. Allow to have this memory pre-reserved by firmware or to be
> embedded into device tree.
> 
> A question I have is how to tell that a reserved region is reserved
> for IMA use. With GICv3 it is done by reading the registers, finding
> the interrupt tables memory, and check that the memory ranges are
> indeed pre-reserved.

Good point, efi_mem_reserve_persistent() has no way of describing what a region 
is for,
you have to know that from somewhere else.


> Is there a way to name memory with the current ABI that you think is 
> acceptable?

This would need to go in the chosen node of the DT, like power-pc already does. 
This would
work on arm64:ACPI systems too (because they have a DT chosen node).
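
(Shape-wise I'd expect the load side to look something like this sketch. The
property name and variables are invented here; whatever we pick becomes ABI:

| /* Reserve the IMA buffer so a future kernel won't reuse it... */
| ret = efi_mem_reserve_persistent(ima_buf_pa, ima_buf_size);
| if (ret)
| 	return ret;
|
| /* ...and tell the next kernel where it is, via the chosen node. */
| ret = fdt_setprop_u64(fdt, chosen_offset, "linux,ima-kexec-buffer",
| 		      ima_buf_pa);

with the size either encoded in the property or carried in a second one.)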


I'd like to understand why removing these entries is needed, it doesn't look 
like we have
an API call to remove them from the efi mem-reserve...

If it had a fixed position in memory its the sort of thing we'd expect firmware 
to
reserved during boot. (e.g. ramoops).

~

From ima_add_kexec_buffer() this really is a transient memory reservation over
kexec.
I think the efi mem-reserve and a DT-chosen node entry with the PA is the only 
way to make
this work smoothly between DT<->ACPI systems.

We'd need a way of removing the efi mem-reserve in ima_free_kexec_buffer(), 
otherwise the
memory remains lost. The DT-chosen node entry should have its pointer zero'd 
out once
we've done this. (like we do for the KASLR seed).


Not considered is parsing the DT-chosen node entry as if it were a memreserve 
during early
boot. This wouldn't work if you kexec something that doesn't know what the node 
is, it
would overwrite the memory and may not remove the node for the next kexec,
which does.


Thanks,

James



Re: [PATCH V4 0/2] Add support for arm64 to carry ima measurement

2019-10-25 Thread James Morse
Hi Mimi,

On 16/10/2019 02:44, Mimi Zohar wrote:
> On Tue, 2019-10-15 at 18:39 +0100, James Morse wrote:
>> If SecureBoot isn't relevant, I'm confused as to why kexec_file_load() is.
>>
>> I thought kexec_file_load() only existed because SecureBoot systems need to 
>> validate the
>> new OS images's signature before loading it, and we can't trust user-space 
>> calling Kexec
>> to do this.
>>
>> If there is no secure boot, why does this thing only work with 
>> kexec_file_load()?
>> (good news! With the UEFI memreserve table, it should work transparently with
>> regular kexec
>> too)

> I'm so sorry for the confusion.  IMA was originally limited to
> extending trusted boot concepts to the OS.  As of Linux 3.10, IMA
> added support for extending secure boot concepts and auditing file
> hashes (commit e7c568e0fd0cf).
> 
> True, kexec_file_load is required for verifying the kexec kernel
> image, but it is also required for measuring the kexec kernel image as
> well.
> 
> After reading the kernel image into memory (kernel_read_file_from_fd),
> the hash is calculated and then added to the IMA measurement list and
> used to extend the TPM.  All of this is based on the IMA policy,
> including the TPM PCR.

Don't we get a set of segments with the regular kexec syscall? These could 
equally be
hashed and measured, and logged via IMA and/or extending the TPMs measurements.

(obviously this would include the command-line and maybe purgatory, which makes 
it less
predictable, but these are still the binary blobs that were given privileged 
access to the
system).


>>> I am not sure if i addressed all your concerns, please let me know
>>> if i missed anything. To me most concerns look to be towards the kexec case 
>>> and dependency
>>> on hardware(ACPI/TPM) during boot and early boot services, where as 
>>> carrying the logs is
>>> only during the kexec_file_load sys call and does not interfere with that 
>>> code path.
>>> IMA documentation: https://sourceforge.net/p/linux-ima/wiki/Home/
>>
>> Supporting ACPI in the same way is something we need to do from day one. 
>> kexec_file_load()
>> already does this. I'm not sure "only kexec_file_load()" is a justifiable 
>> restriction...

> The TPM PCRs are not reset on a soft reboot.  As a result, in order to
> validate the IMA measurement list against the TPM PCRs, the IMA
> measurement list is saved on kexec load, restored on boot, and then
> the memory allocated for carrying the measurement list across kexec is
> freed.

Hmm, this is why the reserved memory gets freed.

What happens to stuff that happens between kexec-load and boot?
There is a comment:
| /* segment size can't change between kexec load and execute */

But I can't see anywhere that enforces that. I guess those measurements will go 
missing,
and the TPM value will not match after kexec.



Thanks,

James



Re: [PATCH V4 0/2] Add support for arm64 to carry ima measurement

2019-10-15 Thread James Morse
Hi Prakhar,

(CC: +Ard : passing reserved memory between kernels using Kexec?)

On 15/10/2019 02:31, prsriva wrote:
> On 10/14/19 11:02 AM, James Morse wrote:
>> On 11/10/2019 01:35, Prakhar Srivastava wrote:
>>> Add support to carry ima measurement log
>>> to the next kexec'ed session triggered via kexec_file_load.
>>
>> I don't know much about 'ima', I'm assuming it's the list of 'stuff' that has
>> already been
>> fed into the TPM as part of SecureBoot. Please forgive the stupid questions,
>>
> The IMA logs are event logs for module load time signature validation(based 
> on policies)
> which are backed by the TPM. No SecureBoot information is present in the log 
> other than
> the boot aggregate.

Okay, so SecureBoot is optional with this thing.


>>> Currently during kexec the kernel file signatures are/can be validated
>>> prior to actual load, the information(PE/ima signature) is not carried
>>> to the next session. This lead to loss of information.
>>>
>>> Carrying forward the ima measurement log to the next kexec'ed session
>>> allows a verifying party to get the entire runtime event log since the
>>> last full reboot, since that is when PCRs were last reset.
>>
>> Hmm, You're adding this as a linux-specific thing in the chosen node, which 
>> points at a
>> memreserve.
>>
>> The question that normally needs answering when adding to the stuff we have 
>> to treat as
>> ABI over kexec is: how would this work from a bootloader that isn't kexec? 
>> Does it need to
>> work for non-linux OS?

> This change is only intended to be executed in the path of kexec_file_load 
> and not
> intended to be executed by any boot loader.(Not very aware of boot loader 
> calls.).

kexec_file_load only means something to the first kernel. If you boot something 
that isn't
linux, does it need to delete this stuff from the DT?
Even if you kexec_file_load linux, it could go on to regular-kexec something 
that is
not... what should that do with these things?

Other than the chosen node, the DT is treated as a cast-iron description of the 
platform,
we shouldn't be tinkering with it.

If its not describing hardware, it probably doesn't belong in the DT.


> The logs are non intended to be injected by the boot loader at all.

You're using linux as a bootloader with kexec. We have to treat the stuff that 
gets passed
between kernels as ABI, as people expect to be able to kexec to a newer kernel.

Is linux-as-a-bootloader special? Or should we work out what any bootloader 
should do here
first. This avoids having to change this when it turns out someone wants to log 
UEFI
DXE-drivers/modules in the TPM too.

From the git-log of the ima code it looks like this is some linux-specific
format.
Are we certain it will never change, and nothing else ever needs to support it?
(e.g. the DXE driver example above. Is there another way that sort of thing 
would work?).


> The change is configurable(CONFIG_IMA_KEXEC) under the IMA subsection and can 
> be disabled
> if not needed.

Sure, but not needed isn't the same as not supported.
If we support it at all, we need to cover everything that needs supporting. If 
it's ABI (we
treat data passed between kernels as if it is), we need to get it right first 
time.

(my point? We need to get the ACPI story sorted before we add any support... 
otherwise we
end up with two incompatible ways of doing this).

[...]

>> Sharing with powerpc is a great starting point ... but, how does this work 
>> for ACPI
>> systems?
>> How does this work if I keep kexecing between ACPI and DT?

> I don't have an answer to this, just going through the call stack i dont 
> believe it
> depends on ACPI as such. I am not the expert here, but more than willing to 
> try out the
> scenario in question.(Can you point me to some documentation to setup some 
> environment to
> test this.)

Yup: Documentation/arm64/arm-acpi.rst

Arm64's ACPI support depends on UEFI. As a starter:
https://wiki.ubuntu.com/ARM64/QEMU

You may need to pass 'acpi=on' on the commandline if your UEFI build supports 
both DT and
ACPI. The x86 name for UEFI-in-Qemu is OVMF, which helps when googling.


> Kexec_file_load call depends purely on DT implementation.

Heh. And it works with ACPI too! You'll note it only touches things in the 
chosen node...

An ACPI system boots without a DT. Linux's EFI-stub can make API calls and poke 
around in
the UEFI structures to find out about the system. When it finishes, the 
EFI-stub needs to
pass on a set of values to the kernel... we need some kind of key-value store 
...

To avoid re-inventing the wheel, the EFI-stub creates an empty DT, and shoves 
the cmdline,
the initrd etc in there... just like a DT bootloader would have done.


Re: [PATCH V4 0/2] Add support for arm64 to carry ima measurement

2019-10-14 Thread James Morse
Hi Prakhar,

(You've CC'd a few folk who work for 'arm.org'...)

On 11/10/2019 01:35, Prakhar Srivastava wrote:
> Add support to carry ima measurement log
> to the next kexec'ed session triggered via kexec_file_load.

I don't know much about 'ima', I'm assuming it's the list of 'stuff' that has
already been
fed into the TPM as part of SecureBoot. Please forgive the stupid questions,


> Currently during kexec the kernel file signatures are/can be validated
> prior to actual load, the information(PE/ima signature) is not carried
> to the next session. This lead to loss of information.
> 
> Carrying forward the ima measurement log to the next kexec'ed session 
> allows a verifying party to get the entire runtime event log since the
> last full reboot, since that is when PCRs were last reset.

Hmm, You're adding this as a linux-specific thing in the chosen node, which 
points at a
memreserve.

The question that normally needs answering when adding to the stuff we have to 
treat as
ABI over kexec is: how would this work from a bootloader that isn't kexec? Does 
it need to
work for non-linux OS?

Changing anything other than the chosen node of the DT isn't something the 
kernel should
be doing. I suspect if you need reserved memory for this stuff, it should be 
carved out by
the bootloader, and like all other memreserves: should not be moved or deleted.

('fdt_delete_mem_rsv()' is a terrifying idea, we depend on the memreserve nodes 
to tell
us which 'memory' we shouldn't touch!)


Sharing with powerpc is a great starting point ... but, how does this work for 
ACPI systems?
How does this work if I keep kexecing between ACPI and DT?

I'd prefer it we only had one way this works on arm64, so whatever we do has to 
cover both.

Does ima work without UEFI secure-boot?
If not, the Linux-specific UEFI 'memreserve' table might be a better fit, this 
would be
the same for both DT and ACPI systems. Given U-boot supports the UEFI API too, 
its
probably the right thing to do regardless of secure-boot.

It looks like x86 doesn't support this either yet. If we have to add something 
to support
ACPI, it would be good if it covers both firmware mechanisms for arm64, and 
works for x86
in the same way.

(How does this thing interact with EFI's existing efi_tpm_eventlog_init()?)


Thanks,

James



Re: [PATCH v6 16/17] arm64: kexec: configure trans_pgd page table for kexec

2019-10-11 Thread James Morse
Hi Pavel,

On 04/10/2019 19:52, Pavel Tatashin wrote:
> Configure a page table located in kexec-safe memory that has
> the following mappings:
> 
> 1. identity mapping for text of relocation function with executable
>permission.
> 2. identity mapping for argument for relocation function.
> 3. linear mappings for all source ranges
> 4. linear mappings for all destination ranges.
> 
> Also, configure el2_vector, that is used to jump to new kernel from EL2 on
> non-VHE kernels.


> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
> index d5b79d4c7fae..450d8440f597 100644
> --- a/arch/arm64/include/asm/kexec.h
> +++ b/arch/arm64/include/asm/kexec.h
> @@ -90,6 +90,23 @@ static inline void crash_prepare_suspend(void) {}
>  static inline void crash_post_resume(void) {}
>  #endif
>  
> +#if defined(CONFIG_KEXEC_CORE)
> +/* Global variables for the arm64_relocate_new_kernel routine. */
> +extern const unsigned char arm64_relocate_new_kernel[];
> +extern const unsigned long arm64_relocate_new_kernel_size;
> +
> +/* Body of the vector for escalating to EL2 from relocation routine */
> +extern const unsigned char kexec_el1_sync[];
> +extern const unsigned long kexec_el1_sync_size;

> +#define KEXEC_EL2_VECTOR_TABLE_SIZE  2048


> +#define KEXEC_EL2_SYNC_OFFSET (KEXEC_EL2_VECTOR_TABLE_SIZE / 2)

Yuck.

Please don't generate one-off vectors like this. Create _all_ of them, and have 
the ones
that should never happen spin round a branch. Someone will hit one eventually;
it's a lot
easier to work out what happened if it stops on the first fault, instead of 
executing junk
and flying off into the weeds.

git grep invalid_vector

Having the vectors at a known offset in the page that does the relocation means 
you have a
fair idea what happened from just the PC.


> diff --git a/arch/arm64/kernel/machine_kexec.c 
> b/arch/arm64/kernel/machine_kexec.c
> index fb6138a1c9ff..71479013dd24 100644
> --- a/arch/arm64/kernel/machine_kexec.c
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -74,15 +71,124 @@ static void *kexec_page_alloc(void *arg)

> +/*
> + * Map source segments starting from KEXEC_SRC_START, and map destination
> + * segments starting from KEXEC_DST_START, and return size of copy in
> + * *copy_len argument.
> + * Relocation function essentially needs to do:
> + * memcpy(KEXEC_DST_START, KEXEC_SRC_START, copy_len);
> + */
> +static int map_segments(struct kimage *kimage, pgd_t *pgdp,
> + struct trans_pgd_info *info,
> + unsigned long *copy_len)
> +{
> + unsigned long *ptr = 0;
> + unsigned long dest = 0;
> + unsigned long src_va = KEXEC_SRC_START;
> + unsigned long dst_va = KEXEC_DST_START;
> + unsigned long len = 0;
> + unsigned long entry, addr;
> + int rc;
> +
> + for (entry = kimage->head; !(entry & IND_DONE); entry = *ptr++) {
> + addr = entry & PAGE_MASK;
> +
> + switch (entry & IND_FLAGS) {
> + case IND_DESTINATION:
> + dest = addr;
> + break;
> + case IND_INDIRECTION:
> + ptr = __va(addr);
> + if (rc)
> + return rc;
> + break;

> + case IND_SOURCE:
> + rc = trans_pgd_map_page(info, pgdp, __va(addr),
> + src_va, PAGE_KERNEL);
> + if (rc)
> + return rc;
> + rc = trans_pgd_map_page(info, pgdp, __va(dest),
> + dst_va, PAGE_KERNEL);
> + if (rc)
> + return rc;
> + dest += PAGE_SIZE;
> + src_va += PAGE_SIZE;
> + dst_va += PAGE_SIZE;
> + len += PAGE_SIZE;
> + }

It looks like you're building a swiss cheese.

If you disable RODATA_FULL_DEFAULT_ENABLED, the kernel will use block mappings 
for the
linear map. This dramatically reduces the amount of memory in use. On Juno 
running with
39bit/4K, there is typically 6G of contiguous memory with no firmware/uefi 
holes in it.
This is mapped by 6 1G block mappings, which take up no additional memory.

For the first go at supporting this in mainline please keep as close as 
possible to the
existing hibernate code. Please use the helpers that copy the linear map.
(if you can't do pa->va in the relocation assembly you'd need to generate a
virtually
addressed structure, which could then use hibernate's relocation assembly)

If all this extra code turns out to be a significant performance improvement, 
I'd like to
see the numbers. We can come back to it after we've got the simplest way of 
running
kexec's relocation with the MMU on merged.


> +static int mmu_relocate_setup(struct kimage *kimage, unsigned long 
> kern_reloc,
> +   struct 

Re: [PATCH v6 15/17] arm64: kexec: add expandable argument to relocation function

2019-10-11 Thread James Morse
Hi Pavel,

On 04/10/2019 19:52, Pavel Tatashin wrote:
> Currently, kexec relocation function (arm64_relocate_new_kernel) accepts
> the following arguments:
> 
> head: start of array that contains relocation information.
> entry:entry point for new kernel or purgatory.
> dtb_mem:  first and only argument to entry.
> 
> The number of arguments cannot be easily expended, because this
> function is also called from HVC_SOFT_RESTART, which preserves only
> three arguments. And, also arm64_relocate_new_kernel is written in
> assembly but called without stack, thus no place to move extra
> arguments to free registers.
> 
> Soon, we will need to pass more arguments: once we enable MMU we
> will need to pass information about page tables.


> Another benefit of allowing this function to accept more arguments, is that
> kernel can actually accept up to 4 arguments (x0-x3), however currently
> only one is used, but if in the future we will need for more (for example,
> pass information about when previous kernel exited to have a precise
> measurement in time spent in purgatory), we won't be easilty do that
> if arm64_relocate_new_kernel can't accept more arguments.

(That is just a little niche)


> So, add a new struct: kern_reloc_arg, and place it in kexec safe page (i.e
> memory that is not overwritten during relocation).
> Thus, make arm64_relocate_new_kernel to only take one argument, that
> contains all the needed information.


> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
> index d15ca1ca1e83..d5b79d4c7fae 100644
> --- a/arch/arm64/include/asm/kexec.h
> +++ b/arch/arm64/include/asm/kexec.h
> @@ -90,12 +90,30 @@ static inline void crash_prepare_suspend(void) {}
>  static inline void crash_post_resume(void) {}
>  #endif
>  
> +/*
> + * kern_reloc_arg is passed to kernel relocation function as an argument.
> + * head  kimage->head, allows to traverse through relocation 
> segments.
> + * entry_addrkimage->start, where to jump from relocation function 
> (new
> + *   kernel, or purgatory entry address).
> + * kern_arg0 first argument to kernel is its dtb address. The other
> + *   arguments are currently unused, and must be set to 0
> + */
> +struct kern_reloc_arg {
> + unsigned long   head;
> + unsigned long   entry_addr;
> + unsigned long   kern_arg0;
> + unsigned long   kern_arg1;
> + unsigned long   kern_arg2;
> + unsigned long   kern_arg3;

... at least one of these should be phys_addr_t!

While the sizes are the same on arm64, this reminds the reader what kind of 
address this
is, and lets the compiler warn you if you make a mistake.


> +};
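
The shape I'd expect, as a sketch (same fields as your patch, just typed so
the compiler can catch mistakes):

| struct kern_reloc_arg {
| 	unsigned long	head;
| 	phys_addr_t	entry_addr;	/* jump target: a physical address */
| 	phys_addr_t	kern_arg0;	/* dtb: also a physical address */
| 	unsigned long	kern_arg1;
| 	unsigned long	kern_arg2;
| 	unsigned long	kern_arg3;
| };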

> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> index 214685760e1c..900394907fd8 100644
> --- a/arch/arm64/kernel/asm-offsets.c
> +++ b/arch/arm64/kernel/asm-offsets.c
> @@ -23,6 +23,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  int main(void)
>  {
> @@ -126,6 +127,14 @@ int main(void)
>  #ifdef CONFIG_ARM_SDE_INTERFACE
>DEFINE(SDEI_EVENT_INTREGS, offsetof(struct sdei_registered_event, 
> interrupted_regs));
>DEFINE(SDEI_EVENT_PRIORITY,offsetof(struct sdei_registered_event, 
> priority));
> +#endif
> +#ifdef CONFIG_KEXEC_CORE
> +  DEFINE(KRELOC_HEAD,offsetof(struct kern_reloc_arg, head));
> +  DEFINE(KRELOC_ENTRY_ADDR,  offsetof(struct kern_reloc_arg, entry_addr));
> +  DEFINE(KRELOC_KERN_ARG0,   offsetof(struct kern_reloc_arg, kern_arg0));
> +  DEFINE(KRELOC_KERN_ARG1,   offsetof(struct kern_reloc_arg, kern_arg1));
> +  DEFINE(KRELOC_KERN_ARG2,   offsetof(struct kern_reloc_arg, kern_arg2));
> +  DEFINE(KRELOC_KERN_ARG3,   offsetof(struct kern_reloc_arg, kern_arg3));

Please use kexec as the prefix. The kernel also applies relocations during 
early boot.
These are global values, and in isolation they don't imply kexec.


>  #endif
>return 0;
>  }

> diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
> index ed50e9587ad8..7a8720ff186f 100644
> --- a/arch/arm64/kernel/cpu-reset.h
> +++ b/arch/arm64/kernel/cpu-reset.h
> @@ -11,12 +11,10 @@
>  #include 
>  
>  void __cpu_soft_restart(unsigned long el2_switch, unsigned long entry,
> - unsigned long arg0, unsigned long arg1, unsigned long arg2);
> + unsigned long arg);

phys_addr_t


>  static inline void __noreturn cpu_soft_restart(unsigned long entry,
> -unsigned long arg0,
> -unsigned long arg1,
> -unsigned long arg2)
> +unsigned long arg)

phys_addr_t


>  {
>   typeof(__cpu_soft_restart) *restart;
>  

> diff --git a/arch/arm64/kernel/relocate_kernel.S 
> b/arch/arm64/kernel/relocate_kernel.S
> index c1d7db71a726..d352faf7cbe6 100644
> --- a/arch/arm64/kernel/relocate_kernel.S
> +++ 

Re: [PATCH v6 14/17] arm64: kexec: move relocation function setup and clean up

2019-10-11 Thread James Morse
Hi Pavel,

On 04/10/2019 19:52, Pavel Tatashin wrote:
> Currently, kernel relocation function is configured in machine_kexec()
> at the time of kexec reboot by using control_code_page.
> 
> This operation, however, is more logical to be done during kexec_load,
> and thus remove from reboot time. Move, setup of this function to
> newly added machine_kexec_post_load().
> 
> In addition, do some cleanup: add infor about reloction function to

infor ? reloction?


> kexec_image_info(), and remove extra messages from machine_kexec().


> Make dtb_mem, always available, if CONFIG_KEXEC_FILE is not configured
> dtb_mem is set to zero anyway.

This is unrelated cleanup; please do it as an earlier patch to make it clearer
what you
are changing here.

(I'm not convinced you need to cache va<->pa)


> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
> index 12a561a54128..d15ca1ca1e83 100644
> --- a/arch/arm64/include/asm/kexec.h
> +++ b/arch/arm64/include/asm/kexec.h
> @@ -90,14 +90,15 @@ static inline void crash_prepare_suspend(void) {}
>  static inline void crash_post_resume(void) {}
>  #endif
>  
> -#ifdef CONFIG_KEXEC_FILE
>  #define ARCH_HAS_KIMAGE_ARCH
>  
>  struct kimage_arch {
>   void *dtb;
>   unsigned long dtb_mem;

> + unsigned long kern_reloc;

This is caching the physical address of an all-architectures value from
struct kimage,
in the arch-specific part of struct kimage. Why?

(You must have the struct kimage on hand to access this thing at all!)

If its supposed to be a physical address, please use phys_addr_t.

>  };


> diff --git a/arch/arm64/kernel/machine_kexec.c 
> b/arch/arm64/kernel/machine_kexec.c
> index 0df8493624e0..9b41da50e6f7 100644
> --- a/arch/arm64/kernel/machine_kexec.c
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -42,6 +42,7 @@ static void _kexec_image_info(const char *func, int line,
>   pr_debug("start:   %lx\n", kimage->start);
>   pr_debug("head:%lx\n", kimage->head);
>   pr_debug("nr_segments: %lu\n", kimage->nr_segments);
> + pr_debug("kern_reloc: %pa\n", >arch.kern_reloc);
>  
>   for (i = 0; i < kimage->nr_segments; i++) {
>   pr_debug("  segment[%lu]: %016lx - %016lx, 0x%lx bytes, %lu 
> pages\n",
> @@ -58,6 +59,19 @@ void machine_kexec_cleanup(struct kimage *kimage)
>   /* Empty routine needed to avoid build errors. */
>  }
>  
> +int machine_kexec_post_load(struct kimage *kimage)
> +{
> + unsigned long kern_reloc;
> +
> + kern_reloc = page_to_phys(kimage->control_code_page);

kern_reloc should be phys_addr_t.


> + memcpy(__va(kern_reloc), arm64_relocate_new_kernel,
> +arm64_relocate_new_kernel_size);
> + kimage->arch.kern_reloc = kern_reloc;


Please move the cache maintenance in here too. This will save us doing it late 
during
kdump. This will also group the mmu-on changes together.


> +}
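
Roughly what I have in mind, as a sketch (not tested):

| int machine_kexec_post_load(struct kimage *kimage)
| {
| 	void *reloc_code = page_to_virt(kimage->control_code_page);
|
| 	memcpy(reloc_code, arm64_relocate_new_kernel,
| 	       arm64_relocate_new_kernel_size);
| 	kimage->arch.kern_reloc = __pa(reloc_code);
|
| 	/* Flush now, rather than late during kdump: */
| 	__flush_dcache_area(reloc_code, arm64_relocate_new_kernel_size);
| 	flush_icache_range((unsigned long)reloc_code,
| 			   (unsigned long)reloc_code +
| 			   arm64_relocate_new_kernel_size);
|
| 	return 0;
| }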


Thanks,

James



Re: [PATCH v6 03/17] arm64: hibernate: check pgd table allocation

2019-10-11 Thread James Morse
Hi Pavel,

On 04/10/2019 19:52, Pavel Tatashin wrote:
> There is a bug in create_safe_exec_page(), when page table is allocated
> it is not checked that table is allocated successfully:
> 
> But it is dereferenced in: pgd_none(READ_ONCE(*pgdp)).  Check that
> allocation was successful.


> Fixes: 82869ac57b5d ("arm64: kernel: Add support for 
> hibernate/suspend-to-disk")
> 
> Signed-off-by: Pavel Tatashin 

Nit: Please remove the stray newline so all the tags appear together.


> diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
> index d52f69462c8f..ef46ce66d7e8 100644
> --- a/arch/arm64/kernel/hibernate.c
> +++ b/arch/arm64/kernel/hibernate.c
> @@ -217,6 +217,11 @@ static int create_safe_exec_page(void *src_start, size_t length,
>   __flush_icache_range(dst, dst + length);
>  
>   trans_pgd = allocator(mask);
> + if (!trans_pgd) {
> + rc = -ENOMEM;
> + goto out;
> + }
> +
>   pgdp = pgd_offset_raw(trans_pgd, dst_addr);
>   if (pgd_none(READ_ONCE(*pgdp))) {
>   pudp = allocator(mask);
> 

Thanks for splitting [0] into two ... but this fix depends on the previous 
patch - which
isn't an issue that anyone can hit, and doesn't match Greg's 
'stable-kernel-rules'.

Please separate out this patch - and post it on its own as a stand-alone fix 
that can be
sent to the stable trees.


Mixing fixes with other patches leads to problems like this. It isn't possible 
to pick
this fix independently of the cleanup in the previous patch.


Thanks,

James

[0] 
https://lore.kernel.org/linux-arm-kernel/ddd81093-89fc-5146-0b33-ad3bd9a1c...@arm.com/


Re: [PATCH v3 06/17] arm64, hibernate: add trans_pgd public functions

2019-10-11 Thread James Morse
Hi Pavel,

On 06/09/2019 17:00, Pavel Tatashin wrote:
> On Fri, Sep 6, 2019 at 11:18 AM James Morse  wrote:
>> On 21/08/2019 19:31, Pavel Tatashin wrote:
>>> trans_pgd_create_copy() and trans_pgd_map_page() are going to be
>>> the basis for public interface of new subsystem that handles page
>>
>> Please don't call this a subsystem. 'sound' and 'mm' are subsystems, this is 
>> just some
>> shared code.

> Sounds good: just could not find a better term to call trans_pgd_*.

I don't like the trans_pgd_ name either, but I can't think of anything better, 
and it's
only a name.


> I won't fix log commits.

Please avoid the word 'subsystem',


>>> diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
>>> index 750ecc7f2cbe..2e29d620b56c 100644
>>> --- a/arch/arm64/kernel/hibernate.c
>>> +++ b/arch/arm64/kernel/hibernate.c
>>> @@ -182,39 +182,15 @@ int arch_hibernation_header_restore(void *addr)
>>
>>> +int trans_pgd_map_page(pgd_t *trans_pgd, void *page,
>>> +unsigned long dst_addr,
>>> +pgprot_t pgprot)
>>
>> If this thing is going to be exposed, its name should reflect that its 
>> creating a set of
>> page tables, to map a single page.
>>
>> A function called 'map_page' with this prototype should 'obviously' map 
>> @page at @dst_addr
>> in @trans_pgd using the provided @pgprot... but it doesn't.
> 
> Answered below...
> 
>>
>> This is what 'create' was doing in the old name, if that wasn't obvious, its 
>> because
>> naming things is hard!
>> | trans_create_single_page_mapping()?
>>
>> (might be too verbose)
>>
>> I think this bites you in patch 8, where you 'generalise' this.

> The new naming makes more sense to me. The old code had function named:
> 
> create_safe_exec_page()
> 
> It was doing four things: 1. creating the actual page via provided
> allocator, 2. copying content from the provided page to new page, 3
> creating a new page table. 4 mapping it to a new page table at
> specified destination address

Yup, all implied in the work of creation.


> After, I generalize this the function the prototype looks like this:
> 
> int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t *trans_pgd,
>  void *page, unsigned long
> dst_addr, pgprot_t pgprot)
> 
> The function only does the "4" from the old code: map the specified
> page at dst_addr.


> The trans_pgd is already created.

Which one is this?
The existing hibernate code has two PGD. One for the copy of the linear-map, 
one for this
safe page that contains the code doing the copying.


> Of course, and
> mapping function will have to allocate missing tables in the page
> tables when necessary.

I think you are over generalising this, to support a case that doesn't exist.

Hibernate needs a copy of the linear map to relocate memory, without stepping in
the page tables, and an executable page it can do that from.

To get kexec to relocate the kernel with the MMU on... you need the same.

When do you need to add an arbitrary page to either of these sets of tables? 
It's either a
copy of the linear-map, or the single executable page.

When does 'trans_pgd_map_page()' get used outside those two?

(looking in your later series, I see you are using it to try and idmap stuff 
into the low
memory. We can't do stuff like this because there may not be any memory in 
range of the
page table helpers. More details in that patch)


Thanks,

James



Re: [PATCH v6 02/17] arm64: hibernate: pass the allocated pgdp to ttbr0

2019-10-11 Thread James Morse
Hi Pavel,

On 04/10/2019 19:52, Pavel Tatashin wrote:
> ttbr0 should be set to the beginning of pgdp, however, currently
> in create_safe_exec_page it is set to pgdp after pgd_offset_raw(),
> which works by accident.
> 
> Fixes: 0194e760f7d2 ("arm64: hibernate: avoid potential TLB conflict")

(That was a 'break before make' fix, the affected code comes from:
 82869ac57b5d ("arm64: kernel: Add support for hibernate/suspend-to-disk"))

But it works in the one circumstance it's used: we know all the top bits will
be zero.
I agree it's by accident and we should fix it.

I don't think we should send it to stable.
Please drop the fixes tag, with that:
Reviewed-by: James Morse 


Thanks,

James





Re: [PATCH v3 08/17] arm64, trans_pgd: make trans_pgd_map_page generic

2019-10-11 Thread James Morse
Hi Pavel,

On 06/09/2019 19:58, Pavel Tatashin wrote:
> On Fri, Sep 6, 2019 at 11:20 AM James Morse  wrote:
>> On 21/08/2019 19:31, Pavel Tatashin wrote:
>>> Currently, trans_pgd_map_page has assumptions that are relevant to
>>> hibernate. But, to make it generic we must allow it to use any allocator

>>> and also, can't assume that entries do not exist in the page table
>>> already.

[...]

>> Please don't use the page tables as an array: this is what the offset 
>> helpers are for.
> 
> Sure, I can use:
> 
> pte_offset_kernel()
> pmd_offset()
> pud_offset()
> pgd_offset_raw()

> The code becomes a little less efficient, because offsets return
> pointer to the entry after READ_ONCE, and we need to use another
> READ_ONCE() to read its content to parse its value in for example
> pud_table(), pud_none() etc . In my case we use READ_ONCE() only one
> time  per entry and operate on the content multiple times. Also,
> because of unfortunate differences in macro names, the code become a
> little less symmetric. Still, I can change the code to use _offsets
> here. Please let me know if you still think it is better to use them
> here.

We should make this as clearly readable as possible, that way reviewers can 
spot the bugs.
Using the helpers makes this more maintainable, as the helpers may be where 
strange things
like 52bit VA get implemented.

Making it fast is the compilers job. I agree it can't remove READ_ONCE()es, but 
I think
the difference between one and two READ_ONCE()es per leaf-entry is 
insignificant when we
go on to copy megabytes worth of data.
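
To be concrete, the idiom I'm asking for is the one the existing hibernate
code already uses: take the offset helper once, then increment. A sketch,
with the loop body elided:

| pte_t *ptep = pte_offset_kernel(pmdp, addr);
|
| do {
| 	pte_t pte = READ_ONCE(*ptep);
|
| 	/* ... inspect pte, set_pte() into the copy as needed ... */
| } while (ptep++, addr += PAGE_SIZE, addr != end);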


>> The copy_p?d() functions should decide if they should manipulate _this_ 
>> entry based on
>> _this_ entry and the kernel configuration. This is only really done in 
>> _copy_pte(), which
>> is where it should stay.
> 
> I am sorry, I do not understand this comment. Could you please
> elaborate what would you like me to change.

Consider the current _copy_pte():
|   } else if (debug_pagealloc_enabled() && !pte_none(pte)) {
|   /*
|* debug_pagealloc will removed the PTE_VALID bit if
|* the page isn't in use by the resume kernel. It may have
|* been in use by the original kernel, in which case we need
|* to put it back in our copy to do the restore.
|*
|* Before marking this entry valid, check the pfn should
|* be mapped.
|*/
|   BUG_ON(!pfn_valid(pte_pfn(pte)));
|
|   set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte)));
|   }

From this it is very obvious that we only put the valid bits back into the
page table if
debug_pagealloc is enabled and the not-valid PTE's PFN points to memory that 
was part of
the linear map.

If this logic gets moved apart, and strung together with global variables, its 
not at all
clear what happens.


>>> diff --git a/arch/arm64/include/asm/trans_pgd.h 
>>> b/arch/arm64/include/asm/trans_pgd.h
>>> index c7b5402b7d87..e3d022b1b526 100644
>>> --- a/arch/arm64/include/asm/trans_pgd.h
>>> +++ b/arch/arm64/include/asm/trans_pgd.h
>>> @@ -11,10 +11,45 @@
>>>  #include 
>>>  #include 
>>>
>>> +/*
>>> + * trans_alloc_page
>>> + *   - Allocator that should return exactly one uninitilaized page, if this
>>> + *allocator fails, trans_pgd returns -ENOMEM error.
>>> + *
>>> + * trans_alloc_arg
>>> + *   - Passed to trans_alloc_page as an argument
>>
>> This is very familiar.
> 
> Sorry, What do you mean?

This stuff used to take a pointer to a function that allocates a page, and an 
argument for
that allocator ... until patch 2 when you squashed it all in... only to undo it 
here. This
looks like churn.


>>> + * trans_flags

[...]

> I re-evaluated "flags", and figured that they are indeed not needed.
> So, I will embed them into the code directly.

Great!



>>> diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
>>> index 00b62d8640c2..dbabccd78cc4 100644
>>> --- a/arch/arm64/mm/trans_pgd.c
>>> +++ b/arch/arm64/mm/trans_pgd.c
>>> @@ -17,6 +17,16 @@
>>>  #include 
>>>  #include 

>>>
>>> -int trans_pgd_map_page(pgd_t *trans_pgd, void *page, unsigned long 
>>> dst_addr,
>>> -pgprot_t pgprot)
>>> +int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t *trans_pgd,
>>> +void *page, unsigned long dst_addr, pgprot_t pgprot)
>>>  {
>>> - pgd_t *pgdp;
>>> - pud_t *pudp;
>>> - pmd_t *pmdp;
>>> - pt

Re: [PATCH v3 12/17] arm64, trans_pgd: complete generalization of trans_pgds

2019-09-06 Thread James Morse
Hi Pavel,

On 21/08/2019 19:31, Pavel Tatashin wrote:
> Make the last private functions in page table copy path generlized for use
> outside of hibernate.
> 
> Switch to use the provided allocator, flags, and source page table. Also,
> unify all copy function implementations to reduce the possibility of bugs.

By changing it? No one has reported any problems. We're more likely to break it 
making
unnecessary changes.

Why is this necessary?


> All page table levels are implemented symmetrically.


> diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
> index efd42509d069..ccd9900f8edb 100644
> --- a/arch/arm64/mm/trans_pgd.c
> +++ b/arch/arm64/mm/trans_pgd.c
> @@ -27,139 +27,157 @@ static void *trans_alloc(struct trans_pgd_info *info)

> -static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
> +static int copy_pte(struct trans_pgd_info *info, pte_t *dst_ptep,
> + pte_t *src_ptep, unsigned long start, unsigned long end)
>  {
> - pte_t pte = READ_ONCE(*src_ptep);
> -
> - if (pte_valid(pte)) {
> - /*
> -  * Resume will overwrite areas that may be marked
> -  * read only (code, rodata). Clear the RDONLY bit from
> -  * the temporary mappings we use during restore.
> -  */
> - set_pte(dst_ptep, pte_mkwrite(pte));
> - } else if (debug_pagealloc_enabled() && !pte_none(pte)) {
> - /*
> -  * debug_pagealloc will removed the PTE_VALID bit if
> -  * the page isn't in use by the resume kernel. It may have
> -  * been in use by the original kernel, in which case we need
> -  * to put it back in our copy to do the restore.
> -  *
> -  * Before marking this entry valid, check the pfn should
> -  * be mapped.
> -  */
> - BUG_ON(!pfn_valid(pte_pfn(pte)));
> -
> - set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte)));
> - }
> -}

> -static int copy_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp, unsigned long start,
> - unsigned long end)
> -{
> - pte_t *src_ptep;
> - pte_t *dst_ptep;
>   unsigned long addr = start;
> + int i = pte_index(addr);
>  
> - dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
> - if (!dst_ptep)
> - return -ENOMEM;
> - pmd_populate_kernel(&init_mm, dst_pmdp, dst_ptep);
> - dst_ptep = pte_offset_kernel(dst_pmdp, start);
> -
> - src_ptep = pte_offset_kernel(src_pmdp, start);
>   do {
> - _copy_pte(dst_ptep, src_ptep, addr);
> - } while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr != end);
> + pte_t src_pte = READ_ONCE(src_ptep[i]);
> +
> + if (pte_none(src_pte))
> + continue;

> + if (info->trans_flags & TRANS_MKWRITE)
> + src_pte = pte_mkwrite(src_pte);

This should be unconditional. The purpose of this thing is to create a set of 
page tables
you can use to overwrite all of memory. Why would you want to keep the RDONLY 
flag for
normal memory?


> + if (info->trans_flags & TRANS_MKVALID)
> + src_pte = pte_mkpresent(src_pte);
> + if (info->trans_flags & TRANS_CHECKPFN) {
> + if (!pfn_valid(pte_pfn(src_pte)))
> + return -ENXIO;
> + }

This lets you skip the pfn_valid() check if you want to create bogus mappings. 
This should
not be conditional.
This removes the BUG_ON(), which is there to make sure we stop if we find 
page-table
corruption.

Please keep the shape of _copy_pte() as it is. Putting a different mapping in 
the copied
tables is risky, the code that does it should all be in one place, along with 
the
justification of why its doing this. Anything else is harder to debug when it 
goes wrong.


> + set_pte(&dst_ptep[i], src_pte);
> + } while (addr += PAGE_SIZE, i++, addr != end && i < PTRS_PER_PTE);

Incrementing pte/pud/pmd/pgd pointers is a common pattern in the kernel's page
table walkers. Why do we need to change this to index it like an array?

This needs to look like walk_page_range() as the eventual aim is to remove it, 
and use the
core-code page table walker.

(at the time it was merged the core-code page table walker removed block 
mappings it
didn't like, which didn't go well.)

This is a backwards step as it makes any attempt to remove this arch-specific 
walker harder.
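
For reference, a minimal sketch of the walker shape being asked for here: it is
just the existing hibernate loop, with the pointers advanced in lock-step
rather than indexed as arrays. (_copy_pte() is as quoted above; copy_pte_range()
is only a label for the sketch:)

| static void copy_pte_range(pmd_t *dst_pmdp, pmd_t *src_pmdp,
|                            unsigned long start, unsigned long end)
| {
|         pte_t *dst_ptep = pte_offset_kernel(dst_pmdp, start);
|         pte_t *src_ptep = pte_offset_kernel(src_pmdp, start);
|         unsigned long addr = start;
|
|         /* one entry per page, both pointers advance together */
|         do {
|                 _copy_pte(dst_ptep, src_ptep, addr);
|         } while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr != end);
| }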


>  
>   return 0;
>  }



Thanks,

James


Re: [PATCH v3 11/17] arm64, trans_pgd: add PUD_SECT_RDONLY

2019-09-06 Thread James Morse
Hi Pavel,

On 21/08/2019 19:31, Pavel Tatashin wrote:
> Thre is PMD_SECT_RDONLY that is used in pud_* function which is confusing.

Nit: There

I bet it was equally confusing before you moved it! Could you do this earlier
in the series with the rest of the cleanup?

With that,
Acked-by: James Morse 


Thanks,

James



Re: [PATCH v3 10/17] arm64, trans_pgd: adjust trans_pgd_create_copy interface

2019-09-06 Thread James Morse
Hi Pavel,

On 21/08/2019 19:31, Pavel Tatashin wrote:
> Make trans_pgd_create_copy inline with the other functions in
> trans_pgd: use the trans_pgd_info argument, and also use the
> trans_pgd_create_empty.
> 
> Note, that the functions that are called by trans_pgd_create_copy are
> not yet adjusted to be compliant with trans_pgd: they do not yet use
> the provided allocator, do not check for generic errors, and do not yet
> use the flags in info argument.


> diff --git a/arch/arm64/include/asm/trans_pgd.h 
> b/arch/arm64/include/asm/trans_pgd.h
> index 26e5a63676b5..f4a5f255d4a7 100644
> --- a/arch/arm64/include/asm/trans_pgd.h
> +++ b/arch/arm64/include/asm/trans_pgd.h
> @@ -43,7 +43,12 @@ struct trans_pgd_info {
>  /* Create and empty trans_pgd page table */
>  int trans_pgd_create_empty(struct trans_pgd_info *info, pgd_t **trans_pgd);
>  
> -int trans_pgd_create_copy(pgd_t **dst_pgdp, unsigned long start,
> +/*
> + * Create trans_pgd and copy entries from from_table to trans_pgd in range
> + * [start, end)
> + */
> +int trans_pgd_create_copy(struct trans_pgd_info *info, pgd_t **trans_pgd,
> +   pgd_t *from_table, unsigned long start,
> unsigned long end);

This creates a copy of the linear-map. Why does it need to be told from_table?


> diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
> index 8c2641a9bb09..8bb602e91065 100644
> --- a/arch/arm64/kernel/hibernate.c
> +++ b/arch/arm64/kernel/hibernate.c
> @@ -323,15 +323,42 @@ int swsusp_arch_resume(void)
>   phys_addr_t phys_hibernate_exit;
>   void __noreturn (*hibernate_exit)(phys_addr_t, phys_addr_t, void *,
> void *, phys_addr_t, phys_addr_t);
> + struct trans_pgd_info trans_info = {
> + .trans_alloc_page   = hibernate_page_alloc,
> + .trans_alloc_arg= (void *)GFP_ATOMIC,
> + /*
> +  * Resume will overwrite areas that may be marked read only
> +  * (code, rodata). Clear the RDONLY bit from the temporary
> +  * mappings we use during restore.
> +  */
> + .trans_flags= TRANS_MKWRITE,
> + };


> + /*
> +  * debug_pagealloc will removed the PTE_VALID bit if the page isn't in
> +  * use by the resume kernel. It may have been in use by the original
> +  * kernel, in which case we need to put it back in our copy to do the
> +  * restore.
> +  *
> +  * Before marking this entry valid, check the pfn should be mapped.
> +  */
> + if (debug_pagealloc_enabled())
> + trans_info.trans_flags |= (TRANS_MKVALID | TRANS_CHECKPFN);

The debug_pagealloc_enabled() check should be with the code that generates a 
different
entry. Whether the different entry is correct needs to be considered with
debug_pagealloc_enabled() in mind. You are making this tricky logic less clear.

There is no way the existing code invents an entry for a !pfn_valid() page. 
With your
'checkpfn' flag, this thing can. You don't need to generalise this for 
hypothetical users.


If kexec needs to create mappings for bogus pages, I'd like to know why.


>   /*
>* Restoring the memory image will overwrite the ttbr1 page tables.
>* Create a second copy of just the linear map, and use this when
>* restoring.
>*/
> - rc = trans_pgd_create_copy(&tmp_pg_dir, PAGE_OFFSET, 0);
> - if (rc)
> + rc = trans_pgd_create_copy(&trans_info, &tmp_pg_dir, init_mm.pgd,
> +PAGE_OFFSET, 0);

> + if (rc) {
> + if (rc == -ENOMEM)
> + pr_err("Failed to allocate memory for temporary page 
> tables.\n");
> + else if (rc == -ENXIO)
> + pr_err("Tried to set PTE for PFN that does not 
> exist\n");
>   goto out;
> + }

If you think the distinction for this error message is useful, it would be 
clearer to
change it in the current hibernate code before you move it. (_copy_pte() to 
return an
error, instead of silently failing). Done here, this is unrelated noise.
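
(Something like this, as a minimal sketch of that stand-alone change: the
current hibernate _copy_pte(), with the BUG_ON() turned into an error return
that the callers would then propagate. Not code from this series:)

| static int _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
| {
|         pte_t pte = READ_ONCE(*src_ptep);
|
|         if (pte_valid(pte)) {
|                 set_pte(dst_ptep, pte_mkwrite(pte));
|         } else if (debug_pagealloc_enabled() && !pte_none(pte)) {
|                 /* return an error for a suspect entry, rather than
|                  * BUG_ON() */
|                 if (!pfn_valid(pte_pfn(pte)))
|                         return -ENXIO;
|
|                 set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte)));
|         }
|
|         return 0;
| }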

I doubt this is specific to kexec.


Thanks,

James



Re: [PATCH v3 09/17] arm64, trans_pgd: add trans_pgd_create_empty

2019-09-06 Thread James Morse
Hi Pavel,

On 21/08/2019 19:31, Pavel Tatashin wrote:
> This functions returns a zeroed trans_pgd using the allocator that is
> specified in the info argument.
> 
> trans_pgds should be created by using this function.

This function takes the allocator you give it, and calls it once.

Given both users need one pgd, and have to provide the allocator, it seems 
strange that
they aren't trusted to call it.

I don't think this patch is necessary.

Let the caller pass in the pgd_t to the helpers.
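
(i.e. something like this at each call-site, sketched with hibernate's
allocator and the surrounding variables; the helper then just takes the pgd_t
pointer:)

| pgd_t *trans_pgd = (void *)get_safe_page(GFP_ATOMIC);
|
| if (!trans_pgd)
|         return -ENOMEM;
|
| rc = trans_pgd_map_page(&trans_info, trans_pgd, page, dst_addr,
|                         PAGE_KERNEL_EXEC);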


Thanks,

James



Re: [PATCH v3 08/17] arm64, trans_pgd: make trans_pgd_map_page generic

2019-09-06 Thread James Morse
Hi Pavel,

On 21/08/2019 19:31, Pavel Tatashin wrote:
> Currently, trans_pgd_map_page has assumptions that are relevant to
> hibernate. But, to make it generic we must allow it to use any allocator

Sounds familiar: you removed this in patch 2.


> and also, can't assume that entries do not exist in the page table
> already.

This thing creates a set of page tables to map one page: the relocation code.
This is mapped in TTBR0_EL1.
It can assume entries do not already exist, because it creates the
single-entry levels as it goes.
need to
generalise this.

'trans_pgd_create_copy()' is what creates a copy of the linear map. This is
mapped in TTBR1_EL1.

There is no reason for kexec to behave differently here.


> Also, we can't use init_mm here.

Why not? arm64's pgd_populate() doesn't use the mm. It's only there to make it 
obvious
this is an EL1 mapping we are creating. We use the kernel-asid with the new 
mapping.

The __ version is a lot less readable. Please don't use the page tables as an 
array: this
is what the offset helpers are for.


> Also, add "flags" for trans_pgd_info, they are going to be used
> in copy functions once they are generalized.

You don't need to 'generalize' this to support hypothetical users.
There are only two: hibernate and kexec, both of which are very specialised. 
Making these
things top-level marionette strings will tangle the logic.

The copy_p?d() functions should decide if they should manipulate _this_ entry 
based on
_this_ entry and the kernel configuration. This is only really done in 
_copy_pte(), which
is where it should stay.


> diff --git a/arch/arm64/include/asm/trans_pgd.h 
> b/arch/arm64/include/asm/trans_pgd.h
> index c7b5402b7d87..e3d022b1b526 100644
> --- a/arch/arm64/include/asm/trans_pgd.h
> +++ b/arch/arm64/include/asm/trans_pgd.h
> @@ -11,10 +11,45 @@
>  #include 
>  #include 
>  
> +/*
> + * trans_alloc_page
> + *   - Allocator that should return exactly one uninitilaized page, if this
> + *allocator fails, trans_pgd returns -ENOMEM error.
> + *
> + * trans_alloc_arg
> + *   - Passed to trans_alloc_page as an argument

This is very familiar.


> + * trans_flags
> + *   - bitmap with flags that control how page table is filled.
> + * TRANS_MKWRITE: during page table copy make PTE, PME, and PUD page
> + *writeable by removing RDONLY flag from PTE.

Why would you ever keep the read-only flags in a set of page tables that exist 
to let you
overwrite memory?


> + * TRANS_MKVALID: during page table copy, if PTE present, but not valid,
> + *make it valid.

Please keep this logic together with the !pte_none(pte) and 
debug_pagealloc_enabled()
check, where it is today.

Making an entry valid without those checks should never be necessary.


> + * TRANS_CHECKPFN: During page table copy, for every PTE entry check that
> + * PFN that this PTE points to is valid. Otherwise return
> + * -ENXIO

Hibernate does this when inventing a new mapping. This is how we check the 
kernel
should be able to read/write this page. If !pfn_valid(), the page should not be 
mapped.

Why do you need to turn this off?

It is only necessary at the leaf level, and only if debug-pagealloc is in use.
Please keep all these bits together, as it's much harder to understand why this
entry needs inventing when it's split up like this.



> diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
> index 6ee81bbaa37f..17426dc8cb54 100644
> --- a/arch/arm64/kernel/hibernate.c
> +++ b/arch/arm64/kernel/hibernate.c
> @@ -179,6 +179,12 @@ int arch_hibernation_header_restore(void *addr)
>  }
>  EXPORT_SYMBOL(arch_hibernation_header_restore);
>  
> +static void *
> +hibernate_page_alloc(void *arg)
> +{
> + return (void *)get_safe_page((gfp_t)(unsigned long)arg);
> +}
> +
>  /*
>   * Copies length bytes, starting at src_start into an new page,
>   * perform cache maintentance, then maps it at the specified address low
> @@ -195,6 +201,11 @@ static int create_safe_exec_page(void *src_start, size_t 
> length,
>unsigned long dst_addr,
>phys_addr_t *phys_dst_addr)
>  {
> + struct trans_pgd_info trans_info = {
> + .trans_alloc_page   = hibernate_page_alloc,
> + .trans_alloc_arg= (void *)GFP_ATOMIC,
> + .trans_flags= 0,
> + };
>   void *page = (void *)get_safe_page(GFP_ATOMIC);
>   pgd_t *trans_pgd;
>   int rc;
> @@ -209,7 +220,7 @@ static int create_safe_exec_page(void *src_start, size_t 
> length,
>   if (!trans_pgd)
>   return -ENOMEM;
>  
> - rc = trans_pgd_map_page(trans_pgd, page, dst_addr,
> + rc = trans_pgd_map_page(&trans_info, trans_pgd, page, dst_addr,
>   PAGE_KERNEL_EXEC);
>   if (rc)
>   return rc;
> diff 

Re: [PATCH v3 07/17] arm64, hibernate: move page handling function to new trans_pgd.c

2019-09-06 Thread James Morse
Hi Pavel,

On 21/08/2019 19:31, Pavel Tatashin wrote:
> Now, that we abstracted the required functions move them to a new home.
> Later, we will generalize these function in order to be useful outside
> of hibernation.

> diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
> new file mode 100644
> index ..00b62d8640c2
> --- /dev/null
> +++ b/arch/arm64/mm/trans_pgd.c
> @@ -0,0 +1,211 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright (c) 2019, Microsoft Corporation.
> + * Pavel Tatashin 

Hmmm, while line-count isn't a useful metric: this file contains 41% of the 
code that was
in hibernate.c, but has stripped the substantial copyright-pedigree that the 
hibernate
code had built up over the years.
(counting lines identified by 'cloc' as code, not comments or blank)

If you are copying or moving a non trivial quantity of code, you need to 
preserve the
copyright. Something like 'Derived from the arm64 hibernate support which 
has:'


> + */
> +
> +/*
> + * Transitional tables are used during system transferring from one world to
> + * another: such as during hibernate restore, and kexec reboots. During these
> + * phases one cannot rely on page table not being overwritten.

I think you need to mention that hibernate and kexec are rewriting memory, and 
may
overwrite the live page tables, therefore ...


> + *
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 

#include 
#include 
#include 


> +static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
> +{
> + pte_t pte = READ_ONCE(*src_ptep);
> +

> + if (pte_valid(pte)) {

> + /*
> +  * Resume will overwrite areas that may be marked
> +  * read only (code, rodata). Clear the RDONLY bit from
> +  * the temporary mappings we use during restore.
> +  */
> + set_pte(dst_ptep, pte_mkwrite(pte));

> + } else if (debug_pagealloc_enabled() && !pte_none(pte)) {

> + /*
> +  * debug_pagealloc will removed the PTE_VALID bit if
> +  * the page isn't in use by the resume kernel. It may have
> +  * been in use by the original kernel, in which case we need
> +  * to put it back in our copy to do the restore.
> +  *
> +  * Before marking this entry valid, check the pfn should
> +  * be mapped.
> +  */

> + BUG_ON(!pfn_valid(pte_pfn(pte)));


Thanks,

James



Re: [PATCH v3 06/17] arm64, hibernate: add trans_pgd public functions

2019-09-06 Thread James Morse
Hi Pavel,

On 21/08/2019 19:31, Pavel Tatashin wrote:
> trans_pgd_create_copy() and trans_pgd_map_page() are going to be
> the basis for public interface of new subsystem that handles page

Please don't call this a subsystem. 'sound' and 'mm' are subsystems, this is 
just some
shared code.

> tables for cases which are between kernels: kexec, and hibernate.

Even though you've baked the get_safe_page() calls into trans_pgd_map_page()?


> diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
> index 750ecc7f2cbe..2e29d620b56c 100644
> --- a/arch/arm64/kernel/hibernate.c
> +++ b/arch/arm64/kernel/hibernate.c
> @@ -182,39 +182,15 @@ int arch_hibernation_header_restore(void *addr)

> +int trans_pgd_map_page(pgd_t *trans_pgd, void *page,
> +unsigned long dst_addr,
> +pgprot_t pgprot)

If this thing is going to be exposed, its name should reflect that it's
creating a set of page tables, to map a single page.

A function called 'map_page' with this prototype should 'obviously' map @page 
at @dst_addr
in @trans_pgd using the provided @pgprot... but it doesn't.

This is what 'create' was doing in the old name; if that wasn't obvious, it's
because naming things is hard!
| trans_create_single_page_mapping()?

(might be too verbose)

I think this bites you in patch 8, where you 'generalise' this.


>  {
> - void *page = (void *)get_safe_page(GFP_ATOMIC);
> - pgd_t *trans_pgd;
>   pgd_t *pgdp;
>   pud_t *pudp;
>   pmd_t *pmdp;
>   pte_t *ptep;
>  
> - if (!page)
> - return -ENOMEM;
> -
> - memcpy(page, src_start, length);
> - __flush_icache_range((unsigned long)page, (unsigned long)page + length);
> -
> - trans_pgd = (void *)get_safe_page(GFP_ATOMIC);
> - if (!trans_pgd)
> - return -ENOMEM;
> -
>   pgdp = pgd_offset_raw(trans_pgd, dst_addr);
>   if (pgd_none(READ_ONCE(*pgdp))) {
>   pudp = (void *)get_safe_page(GFP_ATOMIC);
> @@ -242,6 +218,44 @@ static int create_safe_exec_page(void *src_start, size_t 
> length,
>   ptep = pte_offset_kernel(pmdp, dst_addr);
>   set_pte(ptep, pfn_pte(virt_to_pfn(page), PAGE_KERNEL_EXEC));
>  
> + return 0;
> +}
> +
> +/*
> + * Copies length bytes, starting at src_start into an new page,
> + * perform cache maintentance, then maps it at the specified address low

Could you fix the spelling of maintenance as git thinks you've moved it?


> + * address as executable.
> + *
> + * This is used by hibernate to copy the code it needs to execute when
> + * overwriting the kernel text. This function generates a new set of page
> + * tables, which it loads into ttbr0.
> + *
> + * Length is provided as we probably only want 4K of data, even on a 64K
> + * page system.
> + */
> +static int create_safe_exec_page(void *src_start, size_t length,
> +  unsigned long dst_addr,
> +  phys_addr_t *phys_dst_addr)
> +{


Thanks,

James


Re: [PATCH v3 01/17] kexec: quiet down kexec reboot

2019-09-06 Thread James Morse
Hi Pavel,

On 21/08/2019 19:31, Pavel Tatashin wrote:
> Here is a regular kexec command sequence and output:
> =
> $ kexec --reuse-cmdline -i --load Image
> $ kexec -e
> [  161.342002] kexec_core: Starting new kernel
> 
> Welcome to Buildroot
> buildroot login:
> =
> 
> Even when "quiet" kernel parameter is specified, "kexec_core: Starting
> new kernel" is printed.
> 
> This message has  KERN_EMERG level, but there is no emergency, it is a
> normal kexec operation, so quiet it down to appropriate KERN_NOTICE.

As this doesn't depend on the rest of the series, you may want to post it
separately so it can be picked up independently.


Thanks,

James



Re: [PATCH v3 05/17] arm64, hibernate: check pgd table allocation

2019-09-06 Thread James Morse
Hi Pavel,

On 21/08/2019 19:31, Pavel Tatashin wrote:
> There is a bug in create_safe_exec_page(), when page table is allocated
> it is not checked that table is allocated successfully:
> 
> But it is dereferenced in: pgd_none(READ_ONCE(*pgdp)).

If there is a bug, it shouldn't be fixed part way through a series. This makes 
it
difficult to backport the fix.

Please split this out as an independent patch with a 'Fixes:' tag for the 
commit that
introduced the bug.


> Another issue,

So this patch does two things? That is rarely a good idea. Again, this makes it 
difficult
to backport the fix.


> is that phys_to_ttbr() uses an offset in page table instead
> of pgd directly.

If you were going to reuse this, that would be a bug. But because the only
page that is being mapped is mapped to PAGE_SIZE, all the top bits will be 0.
The offset calls are boiler-plate. It doesn't look intentional, but it's
harmless.


Please separate out the potential NULL-dereference bits so there is a clean 
stand-alone
fix that can be sent to the stable trees.


Thanks,

James


Re: [PATCH v3 04/17] arm64, hibernate: rename dst to page in create_safe_exec_page

2019-09-06 Thread James Morse
Hi Pavel,

On 21/08/2019 19:31, Pavel Tatashin wrote:
> create_safe_exec_page() allocates a safe page and maps it at a
> specific location, also this function returns the physical address
> of newly allocated page.
> 
> The destination VA, and PA are specified in arguments: dst_addr,
> phys_dst_addr
> 
> However, within the function it uses "dst" which has unsigned long
> type, but is actually a pointers in the current virtual space. This
> is confusing to read.

The type? There are plenty of places in the kernel where an unsigned long is
actually a pointer. This isn't unusual.


> Rename dst to more appropriate page (page that is created), and also
> change its time to "void *"

If you think its clearer,
Reviewed-by: James Morse 


Thanks,

James


Re: [PATCH v3 03/17] arm64, hibernate: remove gotos in create_safe_exec_page

2019-09-06 Thread James Morse
Hi Pavel,

On 21/08/2019 19:31, Pavel Tatashin wrote:
> Usually, gotos are used to handle cleanup after exception, but
> in case of create_safe_exec_page there are no clean-ups. So,
> simply return the errors directly.

Reviewed-by: James Morse 


Thanks,

James


Re: [PATCH v3 02/17] arm64, hibernate: use get_safe_page directly

2019-09-06 Thread James Morse
Hi Pavel,

Nit: The pattern for the subject prefix should be "arm64: hibernate:".
It's usually possible to spot the pattern from "git log --oneline $file".

On 21/08/2019 19:31, Pavel Tatashin wrote:
> create_safe_exec_page is a local function that uses the
> get_safe_page() to allocate page table and pages and one pages
> that is getting mapped.

I can't parse this.

create_safe_exec_page() uses hibernate's allocator to create a set of page 
table to map a
single page that will contain the relocation code.


> Remove the allocator related arguments, and use get_safe_page
> directly, as it is done in other local functions in this
> file.
... because kexec can't use this as it doesn't have a working allocator.
Removing this function pointer makes it easier to refactor the code later.

(this thing is only a function pointer so kexec could use it too ... It looks 
like you're
creating extra work. Patch 7 moves these new calls out to a new file... 
presumably so
another patch can remove them again)

As stand-alone cleanup the patch looks fine, but you probably don't need to do 
this.


Thanks,

James


Re: [PATCH v1 0/8] arm64: MMU enabled kexec relocation

2019-08-16 Thread James Morse
Hi Pavel,

On 15/08/2019 21:09, Pavel Tatashin wrote:
>>> Also, I'd appreciate if anyone could test this series on vhe hardware
>>> with vhe kernel, it does not look like QEMU can emulate it yet
>>
>> This locks up during resume from hibernate on my AMD Seattle, a regular v8.0 
>> machine.
> 
> Thanks for reporting a bug I will root cause and fix it.

>> Please try and build the series to reduce review time. What you have here is 
>> an all-new
>> page-table generation API, which you switch hibernate and kexec to. This is 
>> effectively a
>> new implementation of hibernate and kexec. There are three things here that 
>> need review.
>>
>> You have a regression in your all-new implementation of hibernate. It took 
>> six months (and
>> lots of review) to get the existing code right, please don't rip it out if 
>> there is
>> nothing wrong with it.
> 
>> Instead, please just move the hibernate copy_page_tables() code, and then 
>> wire kexec up.
>> You shouldn't need to change anything in the copy_page_tables() code as the 
>> linear map is
>> the same in both cases.

> It is not really an all-new implementation of hibernate (for kexec it
> is true though). I used the current implementation of hibernate as
> bases, and simply generalized the functions by providing a flexible
> interface. So what you are asking is actually exactly what I am doing.

I disagree. The resume page-table code is the bulk of the complexity in 
hibernate.c. Your
first patch dumps ~200 lines of differently-complex code, and your second 
switches
hibernate over to it.

Instead, please move that code, keeping it as it is. git will spot the move, 
and the
generated diffstat should only reflect the build-system changes. You don't need 
to 'switch
hibernate to transitional page tables.'

Adding kexec will then show-up what needs changing, each change comes with a 
commit
message explaining why. Having these as 'generalisations' in the first patch is 
a mess.

There is existing code that we don't want to break. Any changes need to be done 
as a
sequence of small incremental changes. It can't be reviewed any other way.


> I realize, that I introduced a bug that I will fix.

Done as a sequence of small incremental changes, I could bisect it to the patch 
that
introduces the bug, and probably fix it from the description in the commit 
message.


>> It looks like you are creating the page tables just after the kexec:segments 
>> have been
>> loaded. This will go horribly wrong if anything changes between then and 
>> kexec time. (e.g.
>> memory you've got mapped gets hot-removed).
>> This needs to be done as late as possible, so we don't waste memory, and the 
>> world can't
>> change around us. Reboot notifiers run before kexec, can't we do the 
>> memory-allocation there?

> Kexec by design does not allow allocate during kexec time. This is
> because we cannot fail during kexec syscall.

This problem needs solving.

| Reboot notifiers run before kexec, can't we do the memory-allocation there?
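
(A rough sketch of that idea; kexec_alloc_transition_pgtables() is a made-up
name for whatever would build the tables:)

| #include <linux/notifier.h>
| #include <linux/reboot.h>
|
| static int kexec_pgtable_notify(struct notifier_block *nb,
|                                 unsigned long action, void *data)
| {
|         /* The world has (almost) stopped changing: build the relocation
|          * page tables now, rather than at image-load time. */
|         if (kexec_alloc_transition_pgtables())
|                 pr_warn("kexec: falling back to MMU-off relocation\n");
|
|         return NOTIFY_DONE;
| }
|
| static struct notifier_block kexec_pgtable_nb = {
|         .notifier_call = kexec_pgtable_notify,
| };
|
| /* register_reboot_notifier(&kexec_pgtable_nb); -- once, at load time */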


> All allocations must be done during kexec load time.

This increases the memory footprint. I don't think we should waste ~2MB per GB
of kernel memory on this feature. (Assuming 4K pages and rodata_full: each 4K
page of PTEs maps 2MB, so a page-granular copy of the linear map costs 512 PTE
pages, ~2MB, for every GB mapped.)

Another option is to allocate this memory at load time, but then free it so it 
can be used
in the meantime. You can keep the list of allocated pfn, as we know they aren't 
in use by
the running kernel, kexec metadata, loaded images etc.

Memory hotplug would need handling carefully, as would anything that 'donates' 
memory to
another agent. (I suspect the TEE stuff does this, I don't know how it 
interacts with kexec)


> Kernel memory cannot be hot-removed, as
> it is not part of ZONE_MOVABLE, and cannot be migrated.

Today, yes. Tomorrow? "arm64/mm: Enable memory hot remove":
https://lore.kernel.org/r/1563171470-3117-1-git-send-email-anshuman.khand...@arm.com


 Previously:
 kernel shutdown 0.022131328s
 relocation  0.440510736s
 kernel startup  0.294706768s

 Relocation was taking: 58.2% of reboot time

 Now:
 kernel shutdown 0.032066576s
 relocation  0.022158152s
 kernel startup  0.296055880s

 Now: Relocation takes 6.3% of reboot time

 Total reboot is x2.16 times faster.
>>
>> When I first saw these numbers they were ~'0.29s', which I wrongly assumed 
>> was 29 seconds.
>> Savings in milliseconds, for _reboot_ is a hard sell. I'm hoping that on the 
>> machines that
>> take minutes to kexec we'll get numbers that make this change more 
>> convincing.

> Sure, this userland is very small kernel+userland is only 47M. Here is
> another data point: fitImage: 380M, it contains a larger userland.
> The numbers for kernel shutdown and startup are the same as this is
> the same kernel, but relocation takes: 3.58s
> shutdown: 0.02s
> relocation: 3.58s
> startup:  0.30s
> 
> Relocation take 88% of reboot time. And, we must have it under one second.

Where does this one second requirement come from?

Re: [PATCH v1 0/8] arm64: MMU enabled kexec relocation

2019-08-15 Thread James Morse
Hi Pavel,

On 08/08/2019 19:44, Pavel Tatashin wrote:
> Just a friendly reminder, please send your comments on this series.

(Please don't top-post)


> It's been a week since I sent out these patches, and no feedback yet.

A week is not a lot of time, people are busy, go to conferences, some even dare 
to take
holiday!


> Also, I'd appreciate if anyone could test this series on vhe hardware
> with vhe kernel, it does not look like QEMU can emulate it yet

This locks up during resume from hibernate on my AMD Seattle, a regular v8.0 
machine.


Please try and build the series to reduce review time. What you have here is an 
all-new
page-table generation API, which you switch hibernate and kexec too. This is 
effectively a
new implementation of hibernate and kexec. There are three things here that 
need review.

You have a regression in your all-new implementation of hibernate. It took six 
months (and
lots of review) to get the existing code right, please don't rip it out if 
there is
nothing wrong with it.


Instead, please just move the hibernate copy_page_tables() code, and then wire 
kexec up.
You shouldn't need to change anything in the copy_page_tables() code as the 
linear map is
the same in both cases.


It looks like you are creating the page tables just after the kexec:segments 
have been
loaded. This will go horribly wrong if anything changes between then and kexec 
time. (e.g.
memory you've got mapped gets hot-removed).
This needs to be done as late as possible, so we don't waste memory, and the 
world can't
change around us. Reboot notifiers run before kexec, can't we do the 
memory-allocation there?


> On Thu, Aug 1, 2019 at 11:24 AM Pavel Tatashin
>  wrote:
>>
>> Enable MMU during kexec relocation in order to improve reboot performance.
>>
>> If kexec functionality is used for a fast system update, with a minimal
>> downtime, the relocation of kernel + initramfs takes a significant portion
>> of reboot.
>>
>> The reason for slow relocation is because it is done without MMU, and thus
>> not benefiting from D-Cache.
>>
>> Performance data
>> 
>> For this experiment, the size of kernel plus initramfs is small, only 25M.
>> If initramfs was larger, than the improvements would be greater, as time
>> spent in relocation is proportional to the size of relocation.
>>
>> Previously:
>> kernel shutdown 0.022131328s
>> relocation  0.440510736s
>> kernel startup  0.294706768s
>>
>> Relocation was taking: 58.2% of reboot time
>>
>> Now:
>> kernel shutdown 0.032066576s
>> relocation  0.022158152s
>> kernel startup  0.296055880s
>>
>> Now: Relocation takes 6.3% of reboot time
>>
>> Total reboot is x2.16 times faster.

When I first saw these numbers they were ~'0.29s', which I wrongly assumed was 
29 seconds.
Savings in milliseconds, for _reboot_ is a hard sell. I'm hoping that on the 
machines that
take minutes to kexec we'll get numbers that make this change more convincing.


Thanks,

James



Re: [PATCH v1 2/8] arm64, mm: transitional tables

2019-08-15 Thread James Morse
Hi Pavel,

On 01/08/2019 16:24, Pavel Tatashin wrote:
> There are cases where normal kernel pages tables, i.e. idmap_pg_dir
> and swapper_pg_dir are not sufficient because they may be overwritten.
> 
> This happens when we transition from one world to another: for example
> during kexec kernel relocation transition, and also during hibernate
> kernel restore transition.
> 
> In these cases, if MMU is needed, the page table memory must be allocated
> from a safe place. Transitional tables is intended to allow just that.

> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> b/arch/arm64/include/asm/pgtable-hwdef.h
> index db92950bb1a0..dcb4f13c7888 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -110,6 +110,7 @@
>  #define PUD_TABLE_BIT(_AT(pudval_t, 1) << 1)
>  #define PUD_TYPE_MASK(_AT(pudval_t, 3) << 0)
>  #define PUD_TYPE_SECT(_AT(pudval_t, 1) << 0)
> +#define PUD_SECT_RDONLY  (_AT(pudval_t, 1) << 7) /* 
> AP[2] */

This shouldn't be needed. As far as I'm aware, we only get read-only pages in 
the linear
map from debug-pagealloc, and the module aliases. Both of which require the 
linear map to
be made of page-size mappings.

Where are you seeing these?


> diff --git a/arch/arm64/include/asm/trans_table.h 
> b/arch/arm64/include/asm/trans_table.h
> new file mode 100644
> index ..c7aef70587a1
> --- /dev/null
> +++ b/arch/arm64/include/asm/trans_table.h
> @@ -0,0 +1,68 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Copyright (c) 2019, Microsoft Corporation.
> + * Pavel Tatashin 
> + */
> +
> +#ifndef _ASM_TRANS_TABLE_H
> +#define _ASM_TRANS_TABLE_H
> +
> +#include 
> +#include 
> +
> +/*
> + * trans_alloc_page
> + *   - Allocator that should return exactly one uninitilaized page, if this
> + *allocator fails, trans_table returns -ENOMEM error.
> + *
> + * trans_alloc_arg
> + *   - Passed to trans_alloc_page as an argument
> + *
> + * trans_flags
> + *   - bitmap with flags that control how page table is filled.
> + * TRANS_MKWRITE: during page table copy make PTE, PME, and PUD page
> + *writeable by removing RDONLY flag from PTE.
> + * TRANS_MKVALID: during page table copy, if PTE present, but not valid,
> + *make it valid.
> + * TRANS_CHECKPFN: During page table copy, for every PTE entry check that
> + * PFN that this PTE points to is valid. Otherwise return
> + * -ENXIO

Adding top-level global knobs to manipulate the copied linear map is going to 
lead to
bugs. The existing code will only change the PTE in specific circumstances, 
that it tests
for, that only happen at the PTE level.


> + * TRANS_FORCEMAP: During page map, if translation exists, force
> + * overwrite it. Otherwise -ENXIO may be returned by
> + * trans_table_map_* functions if conflict is detected.

This one, sounds like a very bad idea.


Thanks,

James


Re: [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation

2019-07-26 Thread James Morse
Hi Pavel,

On 17/07/2019 20:13, Pavel Tatashin wrote:
>> After a quick skim:
>>
>> This will map 'nomap' regions of memory with cacheable attributes. This is a 
>> non-starter.
>> These regions were described by firmware as having content that was/is 
>> written with
>> different attributes. The attributes must match whenever it is mapped, 
>> otherwise we have a
>> loss of coherency. Mapping this stuff as cacheable means the CPU can 
>> prefetch it into the
>> cache whenever it likes.
> 
>> It may be important that we do not ever map some of these regions, even 
>> though it's
>> described as memory. On AMD-Seattle the bottom page of memory is reserved by 
>> firmware for
>> its own use; it is made secure-only, and any access causes an
>> external-abort/machine-check. UEFI describes this as 'Reserved', and we 
>> preserve this in
>> the kernel as 'nomap'. The equivalent DT support uses memreserve, possibly 
>> with the
>> 'nomap' attribute.
>>
>> Mapping a 'new'/unknown region with cacheable attributes can never be safe, 
>> even if we
>> trusted kexec-tool to only write the kernel to memory. The host may be using 
>> a bigger page
>> size causing more memory to become cacheable than was intended.
>> Linux's EFI support rounds the UEFI memory map to the largest supported page
>> size (and winges about firmware bugs).
>> If we're allowing kexec to load images in a region not described as 
>> IORESOURCE_SYSTEM_RAM,
>> that is a bug we should fix.
> 
> We are allowing this. If you consider this to be a bug, I will fix it,
> and this will actually simplify the idmap page table. User will
> receive an error during kexec load if a request is made to load into
> !IORESOURCE_SYSTEM_RAM region.

I consider this a bug, but we can see what others think.
This suggests kexec-tools can open /proc/iomem, find a likely looking gap, and 
try to load
the new kernel between two platform devices.


>> The only way to do this properly is to copy the linear mapping. The arch 
>> code has lots of
>> complex code to generate it correctly at boot, we do not want to duplicate 
>> it.
>> (this is why hibernate copies the linear mapping)
> 
> As I understand, you would like to take a copy of idmap page table,
> and add entries only for segment
> sources and destinations into the new page table?

I don't think there is a need to idmap memory at all. We should copy the linear 
map so you
know you won't overwrite its page tables as part of loading the new kernel.


> If so, there is a slight problem: arch hook machine_kexec_prepare() is
> called prior to loading segments from userland. We can solve this by
> adding another hook that is called after kimage_terminate().

Yes, all this would need doing as machine_kexec() runs. We preferably need to 
allocate
memory in this path, or at least have a bitmap of what we can/can't overwrite.


>> These patches do not remove the running page tables from TTBR1. As you 
>> overwrite the live
>> page tables you will corrupt the state of the CPU. The page-table walker may 
>> access things
>> that aren't memory, cache memory that shouldn't be cached (see above), and 
>> allocate
>> conflicting entries in the TLB.
> 
> Indeed. However, I was following what is done in create_safe_exec_page():
> https://soleen.com/source/xref/linux/arch/arm64/kernel/hibernate.c?r=af873fce#263
> 
> ttbr1 is not removed there. Am I missing something, or is not yet
> configured there?

Hibernate maps a single executable page in ttbr0_el1 that holds its relocation 
code.
The relocation code then switches ttbr1_el1 to point to the copy of the linear 
map. See
the 'break_before_make_ttbr_switch' macro in swsusp_arch_suspend_exit().
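
(Roughly what that macro does, sketched in C with the kernel's sysreg helpers;
zero_page and new_tables stand for the physical addresses involved. The real
thing is assembly, executed from the TTBR0 mapping so it survives the switch:)

| /* break-before-make: never let two sets of tables be live in TTBR1 */
| write_sysreg(phys_to_ttbr(zero_page), ttbr1_el1);
| isb();
| __tlbi(vmalle1);                /* flush walks cached from the old tables */
| dsb(nsh);
| write_sysreg(phys_to_ttbr(new_tables), ttbr1_el1);
| isb();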


> I will set ttbr1 to zero page.
> 
>> You cannot use the mm page table helpers to build an idmap on arm64. The mm 
>> page table
>> helpers have a compile-time VA_BITS, and we support systems where there is 
>> no memory below
>> 1<<VA_BITS. With a 39bit VA kernel,
>> the idmap will have more page table levels than the page table helpers can 
>> build. This is
>> why there are special helpers to load the idmap, and twiddle TCR_EL1.T0SZ.
>> You already need to copy the linear-map, so using an idmap is extra work. 
>> You want to work
>> with linear-map addresses, you probably need to add the field to the 
>> appropriate structure.
> 
> OK. Makes sense. I will do the way hibernate setup this table. I was
> indeed following x86, hoping that eventually it would be possible to
> unite: kasan, hibernate, and kexec implementation of this page table.

Our kasan and hibernate code already went a different way. I doubt we can bring 
them back
in to look like x86, they have different problems to solve.


>> The kexec relocation code still runs at EL2. You can't use a copy of the 
>> linear map here
>> as there is only one TTBR on v8.0, and you'd need to setup EL2 as it's been 
>> torn back to
>> the hyp-stub.
> 
> As I understand normally on baremetal kexec runs at EL1 not EL2. On my
> machine is_kernel_in_hyp_mode() 

Re: [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation

2019-07-17 Thread James Morse
Hi Pavel,

On 16/07/2019 17:56, Pavel Tatashin wrote:
> Added identity mapped page table, and keep MMU enabled while
> kernel is being relocated from sparse pages to the final
> destination during kexec.

The 'tl;dr' version of this: I strongly urge you to start with the hibernate 
code that
already covers all these known corner cases. x86 was not a good starting point.


After a quick skim:

This will map 'nomap' regions of memory with cacheable attributes. This is a 
non-starter.
These regions were described by firmware as having content that was/is written 
with
different attributes. The attributes must match whenever it is mapped, 
otherwise we have a
loss of coherency. Mapping this stuff as cacheable means the CPU can prefetch 
it into the
cache whenever it likes.
It may be important that we do not ever map some of these regions, even though 
it's
described as memory. On AMD-Seattle the bottom page of memory is reserved by 
firmware for
its own use; it is made secure-only, and any access causes an
external-abort/machine-check. UEFI describes this as 'Reserved', and we 
preserve this in
the kernel as 'nomap'. The equivalent DT support uses memreserve, possibly with 
the
'nomap' attribute.

Mapping a 'new'/unknown region with cacheable attributes can never be safe, 
even if we
trusted kexec-tool to only write the kernel to memory. The host may be using a 
bigger page
size causing more memory to become cacheable than was intended.
Linux's EFI support rounds the UEFI memory map to the largest supported page
size (and winges about firmware bugs).
If we're allowing kexec to load images in a region not described as 
IORESOURCE_SYSTEM_RAM,
that is a bug we should fix.

The only way to do this properly is to copy the linear mapping. The arch code 
has lots of
complex code to generate it correctly at boot, we do not want to duplicate it.
(this is why hibernate copies the linear mapping)


These patches do not remove the running page tables from TTBR1. As you 
overwrite the live
page tables you will corrupt the state of the CPU. The page-table walker may 
access things
that aren't memory, cache memory that shouldn't be cached (see above), and 
allocate
conflicting entries in the TLB.

You cannot use the mm page table helpers to build an idmap on arm64. The mm 
page table
helpers have a compile-time VA_BITS, and we support systems where there is no 
memory below
1<<VA_BITS. With a 39bit VA kernel, the idmap will have more page table levels
than the page table helpers can build. This is why there are special helpers
to load the idmap, and twiddle TCR_EL1.T0SZ. You already need to copy the
linear-map, so using an idmap is extra work. You want to work with linear-map
addresses, you probably need to add the field to the appropriate structure.

> This patch series works in terms, that I can kexec-reboot both in QEMU

I wouldn't expect Qemu's emulation of the MMU and caches to be performance 
accurate.


> and on a physical machine. However, I do not see performance improvement
> during relocation. The performance is just as slow as before with disabled
> caches.

> Am I missing something? Perhaps, there is some flag that I should also
> enable in page table? Please provide me with any suggestions.

Some information about the physical machine you tested this on would help.
I'm guessing it's v8.0, and booted at EL2.


Thanks,

James



Re: [v1 0/5] allow to reserve memory for normal kexec kernel

2019-07-10 Thread James Morse
Hi Pasha,

On 09/07/2019 14:07, Pavel Tatashin wrote:
>>> Enabling MMU and D-Cache for relocation  would essentially require the
>>> same changes in kernel. Could you please share exactly why these were
>>> not accepted upstream into kexec-tools?
>>
>> Because '--no-checks' is a much simpler alternative.
>>
>> More of the discussion:
>> https://lore.kernel.org/linux-arm-kernel/5599813d-f83c-d154-287a-c131c4829...@arm.com/
>>
>> While you can make purgatory a fully-fledged operating system, it doesn't 
>> really need to
>> do anything on arm64. Errata-workarounds alone are a reason not to start 
>> down this path.
> 
> Thank you James. I will summaries the information gathered from the
> yesterday's/today's discussion and add it to the cover letter together
> with ARM64 tag. I think, the patch series makes sense for ARM64 only,
> unless there are other platforms that disable caching/MMU during
> relocation.

I'd prefer not to reserve additional memory for regular kexec just to avoid the 
relocation.
If the kernel's relocation work is so painful we can investigate doing it while 
the MMU is
enabled. If you can compare regular-kexec with kexec_file_load() you eliminate 
the
purgatory part of the work.
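
(For a like-for-like measurement, assuming a kexec-tools built with the file
syscall option, that is just:)

| $ kexec -s -l Image --reuse-cmdline    # kexec_file_load(), no purgatory
| $ kexec -l Image --reuse-cmdline       # kexec_load(), via purgatory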


Thanks,

James


Re: [v1 0/5] allow to reserve memory for normal kexec kernel

2019-07-09 Thread James Morse
Hi Pavel,

On 09/07/2019 11:55, Pavel Tatashin wrote:
> On Tue, Jul 9, 2019 at 6:36 AM Bhupesh Sharma  wrote:
>> On Tue, Jul 9, 2019 at 2:46 AM Pavel Tatashin  
>> wrote:
>>> Currently, it is only allowed to reserve memory for crash kernel, because
>>> it is a requirement in order to be able to boot into crash kernel without
>>> touching memory of crashed kernel is to have memory reserved.
>>>
>>> The second benefit for having memory reserved for kexec kernel is
>>> that it does not require a relocation after segments are loaded into
>>> memory.
>>>
>>> If kexec functionality is used for a fast system update, with a minimal
>>> downtime, the relocation of kernel + initramfs might take a significant
>>> portion of reboot.
>>>
>>> In fact, on the machine that we are using, that has ARM64 processor
>>> it takes 0.35s to relocate during kexec, thus taking 52% of kernel reboot
>>> time:
>>>
>>> kernel shutdown 0.03s
>>> relocation  0.35s
>>> kernel startup  0.29s
>>>
>>> Image: 13M and initramfs is 24M. If initramfs increases, the relocation
>>> time increases proportionally.
>>>
>>> While, it is possible to add 'kexeckernel=' parameters support to other
>>> architectures by modifying reserve_crashkernel(), in this series this is
>>> done for arm64 only.

>>
>> This seems like an issue with time spent while doing sha256
>> verification while in purgatory.
>>
>> Can you please try the following two patches which enable D-cache in
>> purgatory before SHA verification and disable it before switching to
>> kernel:
>>
>> http://lists.infradead.org/pipermail/kexec/2017-May/018839.html
>> http://lists.infradead.org/pipermail/kexec/2017-May/018840.html
> 
> Hi Bhupesh,
> 
> The verification was taking 2.31s. This is why it is disabled via
> kexec's '-i' flag. Therefore 0.35s is only the relocation part where
> time is spent, and with my patches the time is completely gone.
> Actually, I am glad you showed these patches to me because I might
> pull them and enable verification for our needs.
> 
>>
>> Note that these were not accepted upstream but are included in several
>> distros in some form or the other :)
> 
> Enabling MMU and D-Cache for relocation  would essentially require the
> same changes in kernel. Could you please share exactly why these were
> not accepted upstream into kexec-tools?

Because '--no-checks' is a much simpler alternative.

More of the discussion:
https://lore.kernel.org/linux-arm-kernel/5599813d-f83c-d154-287a-c131c4829...@arm.com/

While you can make purgatory a fully-fledged operating system, it doesn't 
really need to
do anything on arm64. Errata-workarounds alone are a reason not to start down 
this path.


Thanks,

James


Re: [v1 0/5] allow to reserve memory for normal kexec kernel

2019-07-09 Thread James Morse
Hi Pavel, Eric,

(Subject-Nit: 'arm64:' is needed to match the style for arm64's arch code. 
Without it the
maintainer is likely to skip the patches as being for core code.)

On 09/07/2019 01:09, Pavel Tatashin wrote:
>> Something is very very wrong there.
>>
>> Last I measured memory bandwidth seriously I could touch a Gigabyte per
>> second easily, and that was nearly 20 years ago.  Did you manage to
>> disable caching or have some particularly slow code that does the
>> reolocations.
>>
>> There is a serious cost to reserving memory in that it is simply not
>> available at other times.  For kexec on panic there is no other reliable
>> way to get memory that won't be DMA'd to.

> Indeed, but sometimes fast reboot is more important than the cost of
> reserving 32M-64M of memory.

>> We have options in this case and I would strongly encourage you to track
>> down why that copy in relocation is so very slow.  I suspect a 4KiB page
>> size is large enough that it can swamp pointer following costs.
>>
>> My back of the napkin math says even 20 years ago your copying costs
>> should be only 0.037s.  The only machine I have ever tested on where
>> the copy costs were noticable was my old 386.

>> Maybe I am out to lunch here but a claim that your memory only runs
>> at 100MiB/s (the speed of my spinning rust hard drive) is rather
>> incredible.

> I agree,  my measurement on this machine was 2,857MB/s. Perhaps when
> MMU is disabled ARM64 also has caching disabled? The function that
> loops through array of pages and relocates them to final destination
> is this:

> A comment before calling it:
> 
> 205   /*
> 206   * cpu_soft_restart will shutdown the MMU, disable data caches, then
> 207   * transfer control to the reboot_code_buffer which contains a copy of
> 208   * the arm64_relocate_new_kernel routine.  arm64_relocate_new_kernel
> 209   * uses physical addressing to relocate the new image to its final
> 210   * position and transfers control to the image entry point when the
> 211   * relocation is complete.
> 212   * In kexec case, kimage->start points to purgatory assuming that
> 213   * kernel entry and dtb address are embedded in purgatory by
> 214   * userspace (kexec-tools).
> 215   * In kexec_file case, the kernel starts directly without purgatory.
> 216   */

> So, as I understand at least data caches are disabled, and MMU is
> disabled, perhaps this is why this function is so incredibly slow?

Yup, spot on.

Kexec typically wants to place the new kernel over the top of the old one, so
it's guaranteed to overwrite the live swapper_pg_dir.
There is also nothing to prevent the other parts of the page-tables being
overwritten as we relocate the kernel. The way the kexec series chose to make
this safe was the simplest: turn the MMU off. We need to enter purgatory with
the MMU off anyway.

(It's worth checking your kexec-tools purgatory isn't spending a decade 
generating a SHA256
of the kernel while the MMU is off. This is pointless as we don't suspect the 
previous
kernel of corrupting memory, and we can't debug/report the problem if we detect 
a
different SHA256. Newer kexec-tools have some commandline option to turn this 
thing off.)


> Perhaps, there is a better way to fix this problem by keeping caches
> enabled while still relocating? Any suggestions from Aarch64
> developers?

Turning the MMU off is the simplest. The alternative is a lot more complicated:

(To get the benefit of the caches, we need the MMU enabled to tell the hardware 
what the
cache-ability attributes of each page of memory are.)

We'd need to copy the page tables to build a new set out of memory we know 
won't get
overwritten. Switching to this 'safe set' is tricky, as it also maps the code 
we're
executing. To do that we'd need to use TTBR0 to hold another 'safe mapping' of 
the code
we're running, while we change our view of the linear-map.

Hibernate does exactly this, so it's possible to re-use some of that logic. From 
memory, I
think the reason that didn't get done is kexec doesn't provide an allocator, 
and needs the
MMU off at some point anyway.


Thanks,

James


Re: [Query] arm64: Right approach to support Image.gz file type via kexec_file_load()

2019-06-24 Thread James Morse
Hi Bhupesh,

On 24/06/2019 06:59, Bhupesh Sharma wrote:
> On 06/20/2019 08:58 PM, James Morse wrote:
>> On 19/06/2019 22:23, Bhupesh Sharma wrote:
>>> Since most distributions use 'make zinstall' rule inside 
>>> 'arch/arm64/boot/Makefile' (see
>>> [1] for details) to install the arm64 Image.gz compressed file inside the 
>>> boot destination
>>> directory (for e.g. /boot), currently we cannot use kexec_file_load() to 
>>> load vmlinuz (or
>>> Image.gz):
>>
>> It's not just kexec_file_load(), we don't support booting from compressed or 
>> elf image
>> formats either: the bootloader has to decompress any Image.gz before it can 
>> run it.
> 
> That's correct.
> 
>>> ... kernel returns -EINVAL error value, as it is not able to locate the 
>>> magic number
>>> =0x644d5241, which is expected in the 64-byte header of the decompressed 
>>> kernel image


>>> This seems to have the following pros and cons, which I can think of:
>>>
>>> Pros:
>>>   - Changes can be handled in the user-space (kexec_tools) and no changes 
>>> are required in
>>> kernel space for handling the unsigned/non-secure boot case.
>>>
>>> Cons:
>>>   - One obvious issue is how to handle the signed kernel Image.gz, because 
>>> signature
>>> verification is managed inside the kernel, so handling a signed Image.gz 
>>> would require
>>> kernel intervention eventually.

>> How do you sign an Image.gz? Isn't the signature written into the PE header?

> That's correct, normally in user-land one uses standard signing utilities 
> like the sbsign
> tools (see [1]). For example I use the following method to sign the 
> decompressed kernel
> Image:
> 
> $ sbsign --key ${KEY} --cert ${CERT} Image --output Image.signed

| morse@eglon:~/kernel/linux/build_arm64$ sbsign --key certs/signing_key.pem  
--cert \
|   certs/signing_key.pem arch/arm64/boot/Image.gz
| Invalid DOS header magic

... because that gzip file isn't a PE32+ that can be loaded by UEFI.


> The problem happens when we have a Image.gz instead of a decompressed Image 
> in distro
> environments.
> 
> Now the process becomes a lot more complicated:
> 
> - User uses sbsign tool to sign Image.gz, lets call the resulting file as 
> Image.gz.signed

Secure Boot is the reason for all this signature stuff. Having only one way 
these things
get signed means it will work on any platform that supports signatures. 
Supporting
multiple formats means potentially multiple, conflicting, signatures.


Image.gz isn't a PE32+ file, so wherever sbsign puts the signature, it isn't 
somewhere the
kernel or UEFI know how to validate it.


> - kexec_file_load() is invoked using a command line resembling something like:
> 
> $ kexec -s -l Image.gz.signed --initrd=/boot/initramfs-`uname -r`.img 
> --reuse-cmdline
> 
> - Now since kexec_tools (user land) has no support for parsing the signature 
> appended
> before the Image.gz file (using which it creates a decompressed Image file) 
> and then to
> re-sign the resulting Image file (before it is passed as a fd to the 
> syscall), I am not
> sure how this can be handled in user-land appropriately.

Surely the right thing to do here is unzip the file, sign it, re-compress it.
It has been compressed to save space. The bootloader/grub has to uncompress it
before it can get UEFI to boot it.
I'm pretty sure you can't point the efi boot variables at a gzip file.

(UEFI has its own compression scheme in appendix H, I can't find any mention of 
gzip in
the spec)


>>> 2. Add support in kernel (for which I have a RFC patch ready), which 
>>> handles an 'Image.gz'
>>> being passed via kexec_file_load(), using an approach as follows:
>>>
>>> a). Define a 'arch_kexec_kernel_image_probe' for arm64, which overrides the 
>>> __weak
>>> definition in 'kernel/kexec_file.c'
>>> b). Inside 'arch_kexec_kernel_image_probe' for arm64, check if we have been 
>>> passed a
>>> magic header  0x1f, 0x8b (\037 \213) which indicates a 'gzip format' Image 
>>> file.
>>> b). Decompress the contents inside a buffer using a decompress_kernel() -> 
>>> gunzip() ->
>>> inflate() logic.
>>>
>>> This seems to have the following pros and cons, which I can think of:
>>>
>>> Pros:
>>>   - Handling signed Image.gz becomes easier in the kernel itself.
>>
>> I don't follow: you can't boot this, so why would you sign it?
> 
> 
> Because that's what most distributions do normally (I share some signing
> rules as a reference below) - if the gzipped EFI imag

Re: [Query] arm64: Right approach to support Image.gz file type via kexec_file_load()

2019-06-20 Thread James Morse
Hi Bhupesh,

On 19/06/2019 22:23, Bhupesh Sharma wrote:
> Since most distributions use 'make zinstall' rule inside 
> 'arch/arm64/boot/Makefile' (see
> [1] for details) to install the arm64 Image.gz compressed file inside the 
> boot destination
> directory (for e.g. /boot), currently we cannot use kexec_file_load() to load 
> vmlinuz (or
> Image.gz):

It's not just kexec_file_load(), we don't support booting from compressed or 
elf image
formats either: the bootloader has to decompress any Image.gz before it can run 
it.


> ... kernel returns -EINVAL error value, as it is not able to locate the magic 
> number 
> =0x644d5241, which is expected in the 64-byte header of the decompressed 
> kernel image


> I can figure out two ways to address this:
> 
> 1. Add support in user-space kexec-tools (for which I have a RFC patch 
> ready), which
> handles an 'Image.gz' being passed via kexec_file_load(), using an approach 
> as follows:
> 
> a). Copy the contents of Image.gz to a temporary file.
> b). Decompress (gunzip-decompress) the contents inside the temporary file.
> c). Pass the 'fd' of the temporary file to the kernel space. So basically the 
> kernel space
> still gets a decompressed kernel image to load via kexec_tools

Sounds reasonable.
(I guess you need to decompress it first to know the size to pass to 
kexec_file_load(),
hence the intermediate copy)
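
(A rough userspace sketch of that first option, using zlib; error handling
trimmed, and not the actual kexec-tools change:)

| #include <stdlib.h>
| #include <unistd.h>
| #include <zlib.h>
|
| static int decompress_to_tmpfd(const char *image_gz)
| {
|         char path[] = "/tmp/Image-XXXXXX";
|         gzFile in = gzopen(image_gz, "rb");
|         int out = mkstemp(path);
|         char buf[64 * 1024];
|         int n;
|
|         unlink(path);           /* keep only the fd */
|
|         while ((n = gzread(in, buf, sizeof(buf))) > 0)
|                 write(out, buf, n);
|         gzclose(in);
|
|         lseek(out, 0, SEEK_SET);
|         return out;     /* this fd and its size go to kexec_file_load() */
| }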


> This seems to have the following pros and cons, which I can think of:
> 
> Pros:
>  - Changes can be handled in the user-space (kexec_tools) and no changes are 
> required in
> kernel space for handling the unsigned/non-secure boot case.
> 
> Cons:
>  - One obvious issue is how to handle the signed kernel Image.gz, because 
> signature
> verification is managed inside the kernel, so handling a signed Image.gz 
> would require
> kernel intervention eventually.

How do you sign an Image.gz? Isn't the signature written into the PE header?


>  - Passing decompressed image from user-space requires the kernel to read 
> large amount of
> data from the user-space.

The kernel can't decompress itself, so this large amount of data has to be 
moved at some
point.


> 2. Add support in kernel (for which I have a RFC patch ready), which handles 
> an 'Image.gz'
> being passed via kexec_file_load(), using an approach as follows:
> 
> a). Define a 'arch_kexec_kernel_image_probe' for arm64, which overrides the 
> __weak
> definition in 'kernel/kexec_file.c'
> b). Inside 'arch_kexec_kernel_image_probe' for arm64, check if we have been 
> passed a 
> magic header 0x1f, 0x8b (\037 \213) which indicates a 'gzip format' Image 
> file.
> c). Decompress the contents inside a buffer using a decompress_kernel() -> 
> gunzip() ->
> inflate() logic.
> 
> This seems to have the following pros and cons, which I can think of:
> 
> Pros:
>  - Handling signed Image.gz becomes easier in the kernel itself.

I don't follow: you can't boot this, so why would you sign it?


> Cons:
>  - One needs to add a decompress_kernel() -> gunzip() -> inflate() kind-of 
> logic in kernel
> space to handle gzipp'ed image for arm64.

We support gzipped initramfs so the code already exists. More of a problem is 
kdump (which
we don't yet support), which has to fit in the reserved crashkernel region, and 
we won't
know the size of the compressed image until we've decompressed it. (it's just
fiddly)
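
If someone does want to do it kernel-side, the probe itself is tiny; a sketch
(the function name and wiring are hypothetical, the decompression step is
elided):

--%<--
/* Sketch of a gzip-aware kexec image probe; not current arm64 code. */
static int image_gzip_probe(const char *kernel_buf, unsigned long kernel_len)
{
	if (kernel_len < 2)
		return -EINVAL;

	/* RFC 1952: every gzip stream starts with the bytes 0x1f 0x8b */
	if ((unsigned char)kernel_buf[0] != 0x1f ||
	    (unsigned char)kernel_buf[1] != 0x8b)
		return -EINVAL;

	/* caller would now gunzip into a buffer and re-probe the Image */
	return 0;
}
--%<--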


> So, I was wondering which approach should be more suitable - fixing this in 
> user-space v/s
> fix this in kernel-space.

As user-space can do this, I think it should!


Thanks,

James



Re: [PATCH 2/4] arm64: kdump: support reserving crashkernel above 4G

2019-06-13 Thread James Morse
Hi Chen Zhou,

On 13/06/2019 12:27, Chen Zhou wrote:
> On 2019/6/6 0:29, James Morse wrote:
>> On 07/05/2019 04:50, Chen Zhou wrote:
>>> When crashkernel is reserved above 4G in memory, kernel should
>>> reserve some amount of low memory for swiotlb and some DMA buffers.
>>
>>> Meanwhile, support crashkernel=X,[high,low] in arm64. When use
>>> crashkernel=X parameter, try low memory first and fall back to high
>>> memory unless "crashkernel=X,high" is specified.
>>
>> What is the 'unless crashkernel=...,high' for? I think it would be simpler 
>> to relax the
>> ARCH_LOW_ADDRESS_LIMIT if reserve_crashkernel_low() allocated something.
>>
>> This way "crashkernel=1G" tries to allocate 1G below 4G, but fails if there 
>> isn't enough
>> memory. "crashkernel=1G crashkernel=16M,low" allocates 16M below 4G, which 
>> is more likely
>> to succeed, if it does it can then place the 1G block anywhere.
>>
> Yeah, this is much simpler.

>>> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
>>> index 413d566..82cd9a0 100644
>>> --- a/arch/arm64/kernel/setup.c
>>> +++ b/arch/arm64/kernel/setup.c
>>> @@ -243,6 +243,9 @@ static void __init request_standard_resources(void)
>>> request_resource(res, &kernel_data);
>>>  #ifdef CONFIG_KEXEC_CORE
>>> /* Userspace will find "Crash kernel" region in /proc/iomem. */
>>> +   if (crashk_low_res.end && crashk_low_res.start >= res->start &&
>>> +   crashk_low_res.end <= res->end)
>>> +   request_resource(res, &crashk_low_res);
>>> if (crashk_res.end && crashk_res.start >= res->start &&
>>> crashk_res.end <= res->end)
>>> request_resource(res, &crashk_res);
>>
>> With both crashk_low_res and crashk_res, we end up with two entries in 
>> /proc/iomem called
>> "Crash kernel". Because its sorted by address, and kexec-tools stops 
>> searching when it
>> find "Crash kernel", you are always going to get the kernel placed in the 
>> lower portion.
>>
>> I suspect this isn't what you want, can we rename crashk_low_res for arm64 
>> so that
>> existing kexec-tools doesn't use it?

> In my patchset, in addition to the kernel patches, i also modify the 
> kexec-tools.
>   arm64: support more than one crash kernel 
> regions(http://lists.infradead.org/pipermail/kexec/2019-April/022792.html).
> In kexec-tools patch, we read all the "Crash kernel" entry and load crash 
> kernel high.

But we can't rely on people updating user-space when they update the kernel!

[...]


>> I'm afraid you've missed the ugly bit of the crashkernel reservation...
>>
>> arch/arm64/mm/mmu.c::map_mem() marks the crashkernel as 'nomap' during the 
>> first pass of
>> page-table generation. This means it isn't mapped in the linear map. It then 
>> maps it with
>> page-size mappings, and removes the nomap flag.
>>
>> This is done so that arch_kexec_protect_crashkres() and
>> arch_kexec_unprotect_crashkres() can remove the valid bits of the 
>> crashkernel mapping.
>> This way the old-kernel can't accidentally overwrite the crashkernel. It 
>> also saves us if
>> the old-kernel and the crashkernel use different memory attributes for the 
>> mapping.
>>
>> As your low-memory reservation is intended to be used for devices, having it 
>> mapped by the
>> old-kernel as cacheable memory is going to cause problems if those CPUs 
>> aren't taken
>> offline and go corrupting this memory. (we did crash for a reason after all)
>>
>>
>> I think the simplest thing to do is mark the low region as 'nomap' in
>> reserve_crashkernel() and always leave it unmapped. We can then describe it 
>> via a
>> different string in /proc/iomem, something like "Crash kernel (low)". Older 
>> kexec-tools
>> shouldn't use it, (I assume it's not using strncmp() in a way that would do 
>> this by
>> accident), and newer kexec-tools can know to describe it in the DT, but it 
>> can't write to it.

> I did miss the bit of the crashkernel reservation.
> I will fix this in next version.

I think all that is needed is to make the low-region nomap, and describe it via 
/proc/iomem
with a name where nothing will try and use it by accident.


Thanks,

James



Re: [PATCH 0/4] support reserving crashkernel above 4G on arm64 kdump

2019-06-13 Thread James Morse
Hi Chen Zhou,

On 13/06/2019 12:27, Chen Zhou wrote:
> On 2019/6/6 0:32, James Morse wrote:
>> On 07/05/2019 04:50, Chen Zhou wrote:
>>> We use crashkernel=X to reserve crashkernel below 4G, which will fail
>>> when there is not enough memory. Currently, crashkernel=Y@X can be used
>>> to reserve crashkernel above 4G, in this case, if swiotlb or DMA buffers
>>> are required, the capture kernel will fail to boot because of no low memory.
>>
>>> When crashkernel is reserved above 4G in memory, kernel should reserve
>>> some amount of low memory for swiotlb and some DMA buffers. So there may
>>> be two crash kernel regions, one is below 4G, the other is above 4G.
>>
>> This is a good argument for supporting the 'crashkernel=...,low' version.
>> What is the 'crashkernel=...,high' version for?
>>
>> Wouldn't it be simpler to relax the ARCH_LOW_ADDRESS_LIMIT if we see 
>> 'crashkernel=...,low'
>> in the kernel cmdline?
>>
>> I don't see what the 'crashkernel=...,high' variant is giving us, it just 
>> complicates the
>> flow of reserve_crashkernel().
>>
>> If we called reserve_crashkernel_low() at the beginning of 
>> reserve_crashkernel() we could
>> use crashk_low_res.end to change some limit variable from 
>> ARCH_LOW_ADDRESS_LIMIT to
>> memblock_end_of_DRAM().
>> I think this is a simpler change that gives you what you want.
> 
> According to your suggestions, we should do like this:
> 1. call reserve_crashkernel_low() at the beginning of reserve_crashkernel()
> 2. mark the low region as 'nomap'
> 3. use crashk_low_res.end to change some limit variable from 
> ARCH_LOW_ADDRESS_LIMIT to
> memblock_end_of_DRAM()
> 4. rename crashk_low_res as "Crash kernel (low)" for arm64

> 5. add a 'linux,low-memory-range' node in DT

(This bit would happen in kexec-tools)


> Do I understand correctly?

Yes, I think this is simpler and still gives you what you want.
It also leaves the existing behaviour unchanged, which helps with keeping 
compatibility
with existing user-space and older kdump kernels.


Thanks,

James



Re: [PATCH v3 1/3] arm64, vmcoreinfo : Append 'PTRS_PER_PGD' to vmcoreinfo

2019-06-07 Thread James Morse
Hi Bhupesh,

(sorry for the delay on this)

On 04/05/2019 13:53, Bhupesh Sharma wrote:
> On 04/03/2019 11:24 PM, Bhupesh Sharma wrote:
>> On 04/02/2019 10:56 PM, James Morse wrote:
>>> Yes the kernel code is going to move around, this is why the information we 
>>> expose via
>>> vmcoreinfo needs to be thought through: something we would always need, 
>>> regardless of how
>>> the kernel implements it.
>>>

>>> Pointer-auth changes all this again, as we may prefer to use the bits for 
>>> pointer-auth in
>>> one TTB or the other. PTRS_PER_PGD may show the 52bit value in this case, 
>>> but neither TTBR
>>> is mapping 52bits of VA.
>>>
>>>
>>>> So far, I have generally come across discussions where the following 
>>>> variations of the
>>>> address spaces have been proposed/requested:
>>>> - 48bit kernel VA + 48-bit User VA,
>>>> - 48-bit kernel VA + 52-bit User VA,
>>>
>>> + 52bit kernel, because there is excessive quantities of memory, and the 
>>> kernel maps it
>>> all, but 48-bit user, because it never maps all the memory, and we prefer 
>>> the bits for
>>> pointer-auth.
>>>
>>>> - 52-bit kernel VA + 52-bit User VA.
>>>
>>> And...  all four may happen with the same built image. I don't see how you 
>>> can tell these
>>> cases apart with the one (build-time-constant!) PTRS_PER_PGD value.
>>>
>>> I'm sure some of these cases are hypothetical, but by considering it all 
>>> now, we can avoid
>>> three more kernel:vmcoreinfo updates, and three more 
>>> fix-user-space-to-use-the-new-value.
>>
>> Agree.
>>
>>> I think you probably do need PTRS_PER_PGD, as this is the one value the mm 
>>> is using to
>>> generate page tables. I'm pretty sure you also need T0SZ and T1SZ to know 
>>> if that's
>>> actually in use, or the kernel is bodging round it with an offset.
>>
>> Sure, I am open to suggestions (as I realize that we need an additional 
>> VA_BITS_ACTUAL
>> variable export'ed for 52-bit kernel VA changes).

(stepping back a bit:)

I'm against exposing arch-specific #ifdefs that correspond to how we've 
configured the
arch code's interactions with mm. These are all moving targets, we can't have 
any of it
become ABI.

I have a straw-man for this: What is the value of PTE_FILE_MAX_BITS on your 
system?
I have no idea what this value is or means, an afternoon's archaeology would be 
needed(!).
This is something that made sense for one kernel version, a better idea came 
along, and it
was replaced. If we'd exposed this to user-space, we'd have to generate a 
value, even if
it doesn't mean anything. Exposing VA_BITS_ACTUAL is the same.

(Keep an eye out for when we change the kernel memory map, and any 
second-guessing based
on VA_BITS turns out to be wrong)


What we do have are the hardware properties. The kernel can't change these.


>> Also how do we standardize reading T0SZ and T1SZ in user-space. Do you 
>> propose I make an
>> enhancement in the cpu-feature-registers interface (see [1]) or the HWCAPS 
>> interface
>> (see [2]) for the same?

cpufeature won't help you if you've already panic()d and only have the vmcore 
file. This
stuff needs to go in vmcoreinfo.

As long as there is a description of how userspace uses these values, I think 
adding
key/values for TCR_EL1.TxSZ to the vmcoreinfo is a sensible way out of this. 
You probably
need TTBR1_EL1.BADDR too. (it should be specific fields, to prevent 'new uses' 
becoming ABI)

This tells you how the hardware was configured, and covers any combination of 
TxSZ tricks
we play, and whether those address bits are used for VA, or ptrauth for TTBR0 
or TTBR1.
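
Something along these lines, purely as a sketch (the key names and field
extraction are hypothetical, not an agreed ABI):

--%<--
/* Hypothetical addition to arch/arm64/kernel/crash_core.c */
void arch_crash_save_vmcoreinfo(void)
{
	u64 tcr = read_sysreg(tcr_el1);

	/* T0SZ is TCR_EL1[5:0], T1SZ is TCR_EL1[21:16] */
	vmcoreinfo_append_str("NUMBER(TCR_EL1_T0SZ)=%llu\n",
			      (unsigned long long)(tcr & 0x3f));
	vmcoreinfo_append_str("NUMBER(TCR_EL1_T1SZ)=%llu\n",
			      (unsigned long long)((tcr >> 16) & 0x3f));
}
--%<--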


> Any comments on the above points? At the moment we have to carry these fixes 
> in the
> distribution kernels and I would like to have these fixed in upstream kernel 
> itself.


Thanks,

James



Re: [PATCH 0/4] support reserving crashkernel above 4G on arm64 kdump

2019-06-05 Thread James Morse
Hi!

On 07/05/2019 04:50, Chen Zhou wrote:
> We use crashkernel=X to reserve crashkernel below 4G, which will fail
> when there is not enough memory. Currently, crashkernel=Y@X can be used
> to reserve crashkernel above 4G, in this case, if swiotlb or DMA buffers
> are required, the capture kernel will fail to boot because of no low memory.

> When crashkernel is reserved above 4G in memory, kernel should reserve
> some amount of low memory for swiotlb and some DMA buffers. So there may
> be two crash kernel regions, one is below 4G, the other is above 4G.

This is a good argument for supporting the 'crashkernel=...,low' version.
What is the 'crashkernel=...,high' version for?

Wouldn't it be simpler to relax the ARCH_LOW_ADDRESS_LIMIT if we see 
'crashkernel=...,low'
in the kernel cmdline?

I don't see what the 'crashkernel=...,high' variant is giving us, it just 
complicates the
flow of reserve_crashkernel().

If we called reserve_crashkernel_low() at the beginning of 
reserve_crashkernel() we could
use crashk_low_res.end to change some limit variable from 
ARCH_LOW_ADDRESS_LIMIT to
memblock_end_of_DRAM().
I think this is a simpler change that gives you what you want.
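
Roughly this shape, as a sketch only (helper names as in the patch, the rest
is illustrative, not a real arch/arm64/mm/init.c change):

--%<--
/* Sketch of the suggested reserve_crashkernel() flow, not a patch */
static void __init reserve_crashkernel(void)
{
	phys_addr_t limit = ARCH_LOW_ADDRESS_LIMIT;
	unsigned long long crash_base, crash_size;

	if (parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
			      &crash_size, &crash_base) || !crash_size)
		return;

	/* Optional "crashkernel=...,low". If it succeeded, the main
	 * region no longer needs to live below 4G. */
	if (!reserve_crashkernel_low() && crashk_low_res.end)
		limit = memblock_end_of_DRAM();

	crash_base = memblock_find_in_range(0, limit, crash_size, SZ_2M);
	if (crash_base)
		memblock_reserve(crash_base, crash_size);
}
--%<--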


> Then
> Crash dump kernel reads more than one crash kernel region via a dtb
> property under node /chosen,
> linux,usable-memory-range = <BASE1 SIZE1 [BASE2 SIZE2]>.

Won't this break if your kdump kernel doesn't know what the extra parameters 
are?
Or if it expects two ranges, but only gets one? These DT properties should be 
treated as
ABI between kernel versions, we can't really change it like this.

I think the 'low' region is an optional-extra, that is never mapped by the 
first kernel. I
think the simplest thing to do is to add a 'linux,low-memory-range' that we
memblock_add() after memblock_cap_memory_range() has been called.
If its missing, or the new kernel doesn't know what its for, everything keeps 
working.


> Besides, we need to modify kexec-tools:
>   arm64: support more than one crash kernel regions(see [1])

> I posted this patch series about one month ago. The previous changes and
> discussions can be retrieved from:

Ah, this wasn't obvious as you've stopped numbering the series. Please label 
the next one
'v6' so that we can describe this as 'v5'. (duplicate numbering would be even 
more confusing!)


Thanks,

James


Re: [PATCH 2/4] arm64: kdump: support reserving crashkernel above 4G

2019-06-05 Thread James Morse
Hello,

On 07/05/2019 04:50, Chen Zhou wrote:
> When crashkernel is reserved above 4G in memory, kernel should
> reserve some amount of low memory for swiotlb and some DMA buffers.

> Meanwhile, support crashkernel=X,[high,low] in arm64. When use
> crashkernel=X parameter, try low memory first and fall back to high
> memory unless "crashkernel=X,high" is specified.

What is the 'unless crashkernel=...,high' for? I think it would be simpler to 
relax the
ARCH_LOW_ADDRESS_LIMIT if reserve_crashkernel_low() allocated something.

This way "crashkernel=1G" tries to allocate 1G below 4G, but fails if there 
isn't enough
memory. "crashkernel=1G crashkernel=16M,low" allocates 16M below 4G, which is 
more likely
to succeed, if it does it can then place the 1G block anywhere.


> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 413d566..82cd9a0 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -243,6 +243,9 @@ static void __init request_standard_resources(void)
>   request_resource(res, &kernel_data);
>  #ifdef CONFIG_KEXEC_CORE
>   /* Userspace will find "Crash kernel" region in /proc/iomem. */
> + if (crashk_low_res.end && crashk_low_res.start >= res->start &&
> + crashk_low_res.end <= res->end)
> + request_resource(res, &crashk_low_res);
>   if (crashk_res.end && crashk_res.start >= res->start &&
>   crashk_res.end <= res->end)
>   request_resource(res, &crashk_res);

With both crashk_low_res and crashk_res, we end up with two entries in 
/proc/iomem called
"Crash kernel". Because its sorted by address, and kexec-tools stops searching 
when it
find "Crash kernel", you are always going to get the kernel placed in the lower 
portion.

I suspect this isn't what you want, can we rename crashk_low_res for arm64 so 
that
existing kexec-tools doesn't use it?
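
To make the failure mode concrete, a simplified walker in the style of what
kexec-tools does (the parsing details here are illustrative, not kexec-tools
source):

--%<--
/* First match wins: /proc/iomem is sorted by address, so a plain
 * "Crash kernel" comparison always lands on the low region. */
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	uint64_t start, end;
	char line[128], name[64];
	FILE *f = fopen("/proc/iomem", "r");

	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%" SCNx64 "-%" SCNx64 " : %63[^\n]",
			   &start, &end, name) != 3)
			continue;
		/* An exact strcmp() skips a renamed "Crash kernel (low)";
		 * a careless strncmp() on the prefix would not. */
		if (!strcmp(name, "Crash kernel")) {
			printf("loading into %#" PRIx64 "-%#" PRIx64 "\n",
			       start, end);
			break;
		}
	}
	fclose(f);
	return 0;
}
--%<--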


> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index d2adffb..3fcd739 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -74,20 +74,37 @@ phys_addr_t arm64_dma_phys_limit __ro_after_init;
>  static void __init reserve_crashkernel(void)
>  {
>   unsigned long long crash_base, crash_size;
> + bool high = false;
>   int ret;
>  
>   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> &crash_size, &crash_base);
>   /* no crashkernel= or invalid value specified */
> - if (ret || !crash_size)
> - return;
> + if (ret || !crash_size) {
> + /* crashkernel=X,high */
> + ret = parse_crashkernel_high(boot_command_line,
> + memblock_phys_mem_size(),
> &crash_size, &crash_base);
> + if (ret || !crash_size)
> + return;
> + high = true;
> + }
>  
>   crash_size = PAGE_ALIGN(crash_size);
>  
>   if (crash_base == 0) {
> - /* Current arm64 boot protocol requires 2MB alignment */
> - crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
> - crash_size, SZ_2M);
> + /*
> +  * Try low memory first and fall back to high memory
> +  * unless "crashkernel=size[KMG],high" is specified.
> +  */
> + if (!high)
> + crash_base = memblock_find_in_range(0,
> + ARCH_LOW_ADDRESS_LIMIT,
> + crash_size, CRASH_ALIGN);
> + if (!crash_base)
> + crash_base = memblock_find_in_range(0,
> + memblock_end_of_DRAM(),
> + crash_size, CRASH_ALIGN);
>   if (crash_base == 0) {
>   pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
>   crash_size);
> @@ -105,13 +122,18 @@ static void __init reserve_crashkernel(void)
>   return;
>   }
>  
> - if (!IS_ALIGNED(crash_base, SZ_2M)) {
> + if (!IS_ALIGNED(crash_base, CRASH_ALIGN)) {
>   pr_warn("cannot reserve crashkernel: base address is 
> not 2MB aligned\n");
>   return;
>   }
>   }
>   memblock_reserve(crash_base, crash_size);
>  
> + if (crash_base >= SZ_4G && reserve_crashkernel_low()) {
> + memblock_free(crash_base, crash_size);
> + return;

This is going to be annoying on platforms that don't have, and don't need 
memory below 4G.
A "crashkernel=...,low" on these system will break crashdump. I don't think we 
should
expect users to know the memory layout. (I'm assuming distro's are going to add 
a low
reservation everywhere, just in case)

I think the 'low' region should be a small optional/best-effort extra, that 
kexec-tools
can't touch.


I'm afraid 

Re: [PATCH 1/4] x86: kdump: move reserve_crashkernel_low() into kexec_core.c

2019-06-05 Thread James Morse
Hello,

On 07/05/2019 04:50, Chen Zhou wrote:
> In preparation for supporting reserving crashkernel above 4G
> in arm64 as x86_64 does, move reserve_crashkernel_low() into
> kexec/kexec_core.c.


> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 905dae8..9ee33b6 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -463,59 +460,6 @@ static void __init memblock_x86_reserve_range_setup_data(void)
>  # define CRASH_ADDR_HIGH_MAX MAXMEM
>  #endif
>  
> -static int __init reserve_crashkernel_low(void)
> -{
> -#ifdef CONFIG_X86_64

The behaviour of this #ifdef has disappeared, won't 32bit x86 now try and 
reserve a chunk
of unnecessary 'low' memory?

[...]


> @@ -579,9 +523,13 @@ static void __init reserve_crashkernel(void)
>   return;
>   }
>  
> - if (crash_base >= (1ULL << 32) && reserve_crashkernel_low()) {
> - memblock_free(crash_base, crash_size);
> - return;
> + if (crash_base >= (1ULL << 32)) {
> + if (reserve_crashkernel_low()) {
> + memblock_free(crash_base, crash_size);
> + return;
> + }
> +
> + insert_resource(&iomem_resource, &crashk_low_res);


Previously reserve_crashkernel_low() was #ifdefed to do nothing if 
!CONFIG_X86_64, I don't
see how 32bit is skipping this reservation...


>   }
>  
>   pr_info("Reserving %ldMB of memory at %ldMB for crashkernel (System 
> RAM: %ldMB)\n",
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index b9b1bc5..096ad63 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -63,6 +63,10 @@
>  
>  #define KEXEC_CORE_NOTE_NAME CRASH_CORE_NOTE_NAME
>  
> +#ifndef CRASH_ALIGN
> +#define CRASH_ALIGN SZ_128M
> +#endif

Why 128M? Wouldn't we rather each architecture tells us its minimum alignment?


> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index d714044..3492abd 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -39,6 +39,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #include 
>  #include 
> @@ -96,6 +98,60 @@ int kexec_crash_loaded(void)
>  }
>  EXPORT_SYMBOL_GPL(kexec_crash_loaded);
>  
> +int __init reserve_crashkernel_low(void)
> +{
> + unsigned long long base, low_base = 0, low_size = 0;
> + unsigned long total_low_mem;
> + int ret;
> +
> + total_low_mem = memblock_mem_size(1UL << (32 - PAGE_SHIFT));
> +
> + /* crashkernel=Y,low */
> + ret = parse_crashkernel_low(boot_command_line, total_low_mem,
> + &low_size, &base);
> + if (ret) {
> + /*
> +  * two parts from lib/swiotlb.c:
> +  * -swiotlb size: user-specified with swiotlb= or default.
> +  *
> +  * -swiotlb overflow buffer: now hardcoded to 32k. We round it
> +  * to 8M for other buffers that may need to stay low too. Also
> +  * make sure we allocate enough extra low memory so that we
> +  * don't run out of DMA buffers for 32-bit devices.
> +  */
> + low_size = max(swiotlb_size_or_default() + (8UL << 20),

SZ_8M?

> + 256UL << 20);

SZ_256M?


> + } else {
> + /* passed with crashkernel=0,low ? */
> + if (!low_size)
> + return 0;
> + }
> +
> + low_base = memblock_find_in_range(0, 1ULL << 32, low_size, CRASH_ALIGN);
> + if (!low_base) {
> + pr_err("Cannot reserve %ldMB crashkernel low memory, please try 
> smaller size.\n",
> +(unsigned long)(low_size >> 20));
> + return -ENOMEM;
> + }
> +
> + ret = memblock_reserve(low_base, low_size);
> + if (ret) {
> + pr_err("%s: Error reserving crashkernel low memblock.\n",
> + __func__);
> + return ret;
> + }
> +
> + pr_info("Reserving %ldMB of low memory at %ldMB for crashkernel (System 
> low RAM: %ldMB)\n",
> + (unsigned long)(low_size >> 20),
> + (unsigned long)(low_base >> 20),
> + (unsigned long)(total_low_mem >> 20));
> +
> + crashk_low_res.start = low_base;
> + crashk_low_res.end   = low_base + low_size - 1;
> +
> + return 0;
> +}


Thanks,

James


Re: [PATCH v3 1/3] arm64, vmcoreinfo : Append 'PTRS_PER_PGD' to vmcoreinfo

2019-04-02 Thread James Morse
Hi Kazu,

On 27/03/2019 16:07, Kazuhito Hagio wrote:
> On 3/26/2019 12:36 PM, James Morse wrote:
>> On 20/03/2019 05:09, Bhupesh Sharma wrote:
>>> With ARMv8.2-LVA architecture extension availability, arm64 hardware
>>> which supports this extension can support a virtual address-space up to
>>> 52 bits.
>>>
>>> Since at the moment we enable the support of this extension in kernel
>>> via CONFIG flags, e.g.
>>>  - User-space 52-bit LVA via CONFIG_ARM64_USER_VA_BITS_52
>>>
>>> so, there is no clear mechanism in the user-space right now to
>>> determine these CONFIG flag values and hence determine the maximum
>>> virtual address space supported by the underlying kernel.
>>>
>>> User-space tools like 'makedumpfile' therefore are broken currently
>>> as they have no proper method to calculate the 'PTRS_PER_PGD' value
>>> which is required to perform a page table walk to determine the
>>> physical address of a corresponding virtual address found in
>>> kcore/vmcoreinfo.
>>>
>>> If one appends 'PTRS_PER_PGD' number to vmcoreinfo for arm64,
>>> it can be used in user-space to determine the maximum virtual address
>>> supported by underlying kernel.
>>
>> I don't think this really solves the problem, it feels fragile.
>>
>> I can see how vmcoreinfo tells you VA_BITS==48, PAGE_SIZE==64K and 
>> PTRS_PER_PGD=1024.
>> You can use this to work out that the top level page table size isn't 
>> consistent with a
>> 48bit VA, so 52bit VA must be in use...
>>
>> But wasn't your problem walking the kernel page tables? In particular the 
>> offset that we
>> apply because the tables were based on a 48bit VA shifted up in 
>> swapper_pg_dir.
>>
>> Where does the TTBR1_EL1 offset come from with this property? I assume 
>> makedumpfile
>> hard-codes it when it sees 52bit is in use ... somewhere.

> My understanding is that the TTBR1_EL1 offset comes from a kernel
> virtual address with the exported PTRS_PER_PGD.
> 
> With T1SZ is 48bit and T0SZ is 52bit,

(PTRS_PER_PGD doesn't tell you this, PTRS_PER_PGD lets you spot something odd is
happening, and this just happens to be the only odd combination today.)


> kva = 0xffff000000000000  <--- start of kernel virtual address

Does makedumpfile have this value? If the kernel were using 52bit VA for TTBR1 
this value
would be different.


> pgd_index(kva) = (kva >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)
>= (0xffff000000000000 >> 42) & (1024 - 1)
>= 0x003fffc0 & 0x3ff
>= 0x3c0  <--- the offset (0x3c0) is included
> 
> This is what kernel does now, so makedumpfile also wants to do.

Sure, and it would work today. I'm worried about tomorrow, where we support 
something new,
and need to bundle new information out through vmcoreinfo. This ends up being 
used to
fingerprint the kernel support, instead of as the value it was intended to be.


>> We haven't solved the problem!
>>
>> Today __cpu_setup() sets T0SZ and T1SZ differently for 52bit VA, but in the 
>> future it
>> could set them the same, or different the other-way-round.
>>
>> Will makedumpfile using this value keep working once T1SZ is 52bit VA too? 
>> In this case
>> there would be no ttbr offset.
> 
> If T1SZ is 52bit, probably kernel virtual address starts from 
> 0xfff0000000000000,

I didn't think this 'bottom of the ttbr1 mapping range' value was exposed 
anywhere.
Where can user-space get this from? (I can't see it in the vmcoreinfo list)


> then the offset becomes 0 with the pgd_index() above.
> I think makedumpfile will keep working with that.


Steve mentions a 52/48 combination in his kernel series:
https://lore.kernel.org/linux-arm-kernel/20190218170245.14915-1-steve.cap...@arm.com/


I think vmcoreinfo-users will eventually need to spot 52bit used in TTBR1 
and/or TTBR0,
and possibly: configured, but not enabled in either. (this is because the bits 
are also
used for pointer-auth, the kernel may be built for both pointer-auth and 52-bit 
VA, and
chose which to enabled at boot based on some policy)

I don't see how you can do this with one value.
I'd like to get this right now, so user-space doesn't need updating again!


Thanks,

James



Re: [PATCH v3 1/3] arm64, vmcoreinfo : Append 'PTRS_PER_PGD' to vmcoreinfo

2019-04-02 Thread James Morse
Hi Bhupesh,

On 28/03/2019 11:42, Bhupesh Sharma wrote:
> On 03/26/2019 10:06 PM, James Morse wrote:
>> On 20/03/2019 05:09, Bhupesh Sharma wrote:
>>> With ARMv8.2-LVA architecture extension availability, arm64 hardware
>>> which supports this extension can support a virtual address-space up to
>>> 52 bits.
>>>
>>> Since at the moment we enable the support of this extension in kernel
>>> via CONFIG flags, e.g.
>>>   - User-space 52-bit LVA via CONFIG_ARM64_USER_VA_BITS_52
>>>
>>> so, there is no clear mechanism in the user-space right now to
>>> determine these CONFIG flag values and hence determine the maximum
>>> virtual address space supported by the underlying kernel.
>>>
>>> User-space tools like 'makedumpfile' therefore are broken currently
>>> as they have no proper method to calculate the 'PTRS_PER_PGD' value
>>> which is required to perform a page table walk to determine the
>>> physical address of a corresponding virtual address found in
>>> kcore/vmcoreinfo.
>>>
>>> If one appends 'PTRS_PER_PGD' number to vmcoreinfo for arm64,
>>> it can be used in user-space to determine the maximum virtual address
>>> supported by underlying kernel.
>>
>> I don't think this really solves the problem, it feels fragile.
>>
>> I can see how vmcoreinfo tells you VA_BITS==48, PAGE_SIZE==64K and 
>> PTRS_PER_PGD=1024.
>> You can use this to work out that the top level page table size isn't 
>> consistent with a
>> 48bit VA, so 52bit VA must be in use...
>>
>> But wasn't your problem walking the kernel page tables? In particular the 
>> offset that we
>> apply because the tables were based on a 48bit VA shifted up in 
>> swapper_pg_dir.
>>
>> Where does the TTBR1_EL1 offset come from with this property? I assume 
>> makedumpfile
>> hard-codes it when it sees 52bit is in use ... somewhere.
>> We haven't solved the problem!

> But isn't the TTBR1_EL1 offset already appended by the kernel via 
> e842dfb5a2d3 ("arm64:
> mm: Offset TTBR1 to allow 52-bit PTRS_PER_PGD")
> in case of kernel configuration where 52-bit userspace VAs are possible.

> Accordingly we have the following assembler helper in 
> 'arch/arm64/include/asm/assembler.h':
> 
>    .macro  offset_ttbr1, ttbr
> #ifdef CONFIG_ARM64_52BIT_VA
>    orr \ttbr, \ttbr, #TTBR1_BADDR_4852_OFFSET
> #endif
>    .endm
> 
> where:
> #ifdef CONFIG_ARM64_52BIT_VA
> /* Must be at least 64-byte aligned to prevent corruption of the TTBR */
> #define TTBR1_BADDR_4852_OFFSET    (((UL(1) << (52 - PGDIR_SHIFT)) - \
>     (UL(1) << (48 - PGDIR_SHIFT))) * 8)
> #endif

Sure, and all this would work today, because there is only one weird 
combination. But once
we support another combination of 52bit-va, you'd either need another value, or 
to start
using PTRS_PER_PGD as a flag for v5.1_FUNNY_BEHAVIOUR_ONE.
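
For reference, the quoted offset works out like this (a quick user-space
check; PGDIR_SHIFT is 42 for 64K pages with a 48-bit VA, assumed here purely
for illustration):

--%<--
#include <stdio.h>

#define PGDIR_SHIFT 42	/* 64K pages, 48-bit VA: illustrative value */

int main(void)
{
	/* (1024 - 64) pgd entries of 8 bytes each = 0x1e00 */
	unsigned long off = ((1UL << (52 - PGDIR_SHIFT)) -
			     (1UL << (48 - PGDIR_SHIFT))) * 8;

	printf("TTBR1_BADDR_4852_OFFSET = %#lx\n", off);
	return 0;
}
--%<--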


[...]

> Note that the above computation holds true both for PTRS_PER_PGD = 64 (48-bit 
> kernel with
> 48-bit User VA) and 1024 (48-bit with 52-bit User VA) cases. And these are the
> configurations for which we are trying to fix the user-space regressions 
> reported (on
> arm64) recently.

... and revisit it when there is another combination?


>> Today __cpu_setup() sets T0SZ and T1SZ differently for 52bit VA, but in the 
>> future it
>> could set them the same, or different the other-way-round.
>>
>> Will makedumpfile using this value keep working once T1SZ is 52bit VA too? 
>> In this case
>> there would be no ttbr offset.
>>
>> If you need another vmcoreinfo flag once that happens, we've done something 
>> wrong here.
> 
> I am currently experimenting with Steve's patches for 52-bit kernel VA
> (<https://lwn.net/Articles/780093/>) and will comment more on the same when I 
> am able to
> get the user-space utilities like makedumpfile and kexec-tools to work with 
> the same on
> both ARMv8 Fast Simulator model and older CPUs which don't support ARMv8.2 
> extensions.


> However, I think we should not hold up fixes for regressions already 
> reported, because the
> 52-bit kernel VA changes probably still need some more rework.

Chucking things into vmcoreinfo isn't free: we need to keep them there forever, 
otherwise
yesterday's version of the tools breaks. Can we take the time to get this right 
for the
cases we know about?

Yes the kernel code is going to move around, this is why the information we 
expose via
vmcoreinfo needs to be thought through: something we would always need, 
regardless of how
the kernel implements it.



Re: [PATCH v3 1/3] arm64, vmcoreinfo : Append 'PTRS_PER_PGD' to vmcoreinfo

2019-03-26 Thread James Morse
Hi Bhupesh,

On 20/03/2019 05:09, Bhupesh Sharma wrote:
> With ARMv8.2-LVA architecture extension availability, arm64 hardware
> which supports this extension can support a virtual address-space up to
> 52 bits.
> 
> Since at the moment we enable the support of this extension in kernel
> via CONFIG flags, e.g.
>  - User-space 52-bit LVA via CONFIG_ARM64_USER_VA_BITS_52
> 
> so, there is no clear mechanism in the user-space right now to
> determine these CONFIG flag values and hence determine the maximum
> virtual address space supported by the underlying kernel.
> 
> User-space tools like 'makedumpfile' therefore are broken currently
> as they have no proper method to calculate the 'PTRS_PER_PGD' value
> which is required to perform a page table walk to determine the
> physical address of a corresponding virtual address found in
> kcore/vmcoreinfo.
> 
> If one appends 'PTRS_PER_PGD' number to vmcoreinfo for arm64,
> it can be used in user-space to determine the maximum virtual address
> supported by underlying kernel.

I don't think this really solves the problem, it feels fragile.

I can see how vmcoreinfo tells you VA_BITS==48, PAGE_SIZE==64K and 
PTRS_PER_PGD=1024.
You can use this to work out that the top level page table size isn't 
consistent with a
48bit VA, so 52bit VA must be in use...

But wasn't your problem walking the kernel page tables? In particular the 
offset that we
apply because the tables were based on a 48bit VA shifted up in swapper_pg_dir.

Where does the TTBR1_EL1 offset come from with this property? I assume 
makedumpfile
hard-codes it when it sees 52bit is in use ... somewhere.
We haven't solved the problem!

Today __cpu_setup() sets T0SZ and T1SZ differently for 52bit VA, but in the 
future it
could set them the same, or different the other-way-round.

Will makedumpfile using this value keep working once T1SZ is 52bit VA too? In 
this case
there would be no ttbr offset.

If you need another vmcoreinfo flag once that happens, we've done something 
wrong here.

(Not to mention what happens if the TTBR1_EL1 uses 52bit va, but TTBR0_EL1 
doesn't)


> Suggested-by: Steve Capper 

(CC: +Steve)


Thanks,

James



Re: [PATCH] arm64, vmcoreinfo : Append 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' to vmcoreinfo

2019-02-15 Thread James Morse
Hi guys,

(CC: +Steve, +Kristina) "What's the best way of letting user-space know the MMU
config when 52-bit VA and pointer-auth may be in use?"

On 13/02/2019 19:52, Kazuhito Hagio wrote:
> On 2/13/2019 1:22 PM, James Morse wrote:
>> On 13/02/2019 11:15, Dave Young wrote:
>>> On 02/12/19 at 11:03pm, Kazuhito Hagio wrote:
>>>> On 2/12/2019 2:59 PM, Bhupesh Sharma wrote:
>>>>> BTW, in the makedumpfile enablement patch thread for ARMv8.2 LVA
>>>>> (which I sent out for 52-bit User space VA enablement) (see [0]), Kazu
>>>>> mentioned that the changes look necessary.
>>>>>
>>>>> [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022431.html
>>>>
>>>>>>> The increased 'PTRS_PER_PGD' value for such cases needs to be then
>>>>>>> calculated as is done by the underlying kernel
>>
>> Aha! Nothing to do with which-bits-are-pfn in the tables...
>>
>> You need to know if the top level PGD is 512 bytes or bigger. As we use a
>> kmem-cache the adjacent data could be someone else's page tables.
>>
>> Is this really a problem though? You can't pull the user-space pgd pointers 
>> out
>> of nowhere, you must have walked some task_struct and struct_mm's to find 
>> them.
>> In which case you would have the VMAs on hand to tell you if it's in the 
>> mapped
>> user range.
>>
>> It would be good to avoid putting something arch-specific in here if we can 
>> at
>> all help it.

>>>>>>> (see
>>>>>>> 'arch/arm64/include/asm/pgtable-hwdef.h' for details):
>>>>>>>
>>>>>>> #define PTRS_PER_PGD  (1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))
>>>>
>>>> Yes, this is the reason why makedumpfile needs the MAX_USER_VA_BITS.
>>>> It is used for pgd_index() also in makedumpfile to walk page tables.
>>>>
>>>> /* to find an entry in a page-table-directory */
>>>> #define pgd_index(addr) (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 
>>>> 1))
>>>
>>> Since Dave mentioned crash tool does not need it, but crash should also
>>> travel the pg tables.
> 
> The crash utility is always invoked with vmlinux, so it can read the
> vabits_user variable directly from vmcore, but makedumpfile can not.

(This sounds fragile. That symbol's name may change, it may disappear
completely! ... but I guess crash changes with every kernel release anyway)


>>> If this is really necessary it would be good to describe what will
>>> happen without the patch, eg. some user visible error from an actual test 
>>> etc.
>>
>> Yes please, it would really help if there was a specific example we could 
>> discuss.
> 
> With 52-bit user space and 48-bit kernel space configuration,
> makedumpfile will not be able to convert a virtual kernel address
> to a physical address, and fail to capture a dumpfile, because the
> pgd_index() will return a wrong index.

Got it, thanks!
(all this user stuff had me thinking it was user-space you were trying to walk).

Yes, this is because of commit e842dfb5a2d3 ("arm64: mm: Offset TTBR1 to allow
52-bit PTRS_PER_PGD"). The kernel has offset the ttbr1 value, if you try and
walk it without knowing the offset you get junk.

Ideally we tell you the offset with some 'ttbr1_offset=' in vmcoreinfo, but if
the offsetting code disappears, the kernel would still have to provide
'ttbr1_offset=0' for user-space to keep working.

I'd like to find something future-proof that always has an unambiguous meaning,
and isn't a problem if the kernel variable/symbol/kconfig names change.

With pointer-auth in use too you can't guess which bits are address and which
bits are data.

Taking arch-specific to its extreme, we could expose TCR_EL1, but this is a
problem if we ever switch that per task (some new bits may turn up with a new
feature). Some of those bits vary per cpu too, so we'd have to mask them out in
case user-space tries to conclude something from them.


My current best suggestion is to export:
from core code:
* USER_MMAP_END, the maximum value a user-space can try and mmap().
This would normally be TASK_SIZE, but x86 and powerpc also have support for
larger VA space, and it's plumbed into mm slightly differently. We should have
one arch-independent property that covers all these. On arm64 this would be the
runtime va bits for user-space's TTBR. (This assumes the value isn't per-task)

arch specific:
* ARM64_TCR.T1SZ, the va bits mapped by the kernel's TTBR. (We can assume we'll
never flip user/kernel space). This has to be arch specific, it will always have
a value and its meaning come

Re: [PATCH] arm64, vmcoreinfo : Append 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' to vmcoreinfo

2019-02-13 Thread James Morse
Hi guys,

On 13/02/2019 11:15, Dave Young wrote:
> On 02/12/19 at 11:03pm, Kazuhito Hagio wrote:
>> On 2/12/2019 2:59 PM, Bhupesh Sharma wrote:
>>> BTW, in the makedumpfile enablement patch thread for ARMv8.2 LVA
>>> (which I sent out for 52-bit User space VA enablement) (see [0]), Kazu
>>> mentioned that the changes look necessary.
>>>
>>> [0]. http://lists.infradead.org/pipermail/kexec/2019-February/022431.html
>>
> The increased 'PTRS_PER_PGD' value for such cases needs to be then
> calculated as is done by the underlying kernel

Aha! Nothing to do with which-bits-are-pfn in the tables...

You need to know if the top level PGD is 512 bytes or bigger. As we use a
kmem-cache the adjacent data could be someone else's page tables.
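
Putting numbers on that (64K pages assumed; PGDIR_SHIFT is 42 there, used here
only for illustration):

--%<--
#include <stdio.h>

#define PGDIR_SHIFT 42	/* 64K pages: illustrative value */

int main(void)
{
	/* 48-bit VA:   64 entries ->  512 bytes (kmem-cache allocation)
	 * 52-bit VA: 1024 entries -> 8192 bytes */
	printf("48-bit pgd: %lu bytes\n", (1UL << (48 - PGDIR_SHIFT)) * 8);
	printf("52-bit pgd: %lu bytes\n", (1UL << (52 - PGDIR_SHIFT)) * 8);
	return 0;
}
--%<--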

Is this really a problem though? You can't pull the user-space pgd pointers out
of nowhere, you must have walked some task_struct and struct_mm's to find them.
In which case you would have the VMAs on hand to tell you if it's in the mapped
user range.

It would be good to avoid putting something arch-specific in here if we can at
all help it.


> (see
> 'arch/arm64/include/asm/pgtable-hwdef.h' for details):
>
> #define PTRS_PER_PGD  (1 << (MAX_USER_VA_BITS - PGDIR_SHIFT))
>>
>> Yes, this is the reason why makedumpfile needs the MAX_USER_VA_BITS.
>> It is used for pgd_index() also in makedumpfile to walk page tables.
>>
>> /* to find an entry in a page-table-directory */
>> #define pgd_index(addr) (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 
>> 1))
> 
> Since Dave mentioned crash tool does not need it, but crash should also
> travel the pg tables.
> 
> If this is really necessary it would be good to describe what will
> happen without the patch, eg. some user visible error from an actual test etc.

Yes please, it would really help if there was a specific example we could 
discuss.


Thanks,

James



Re: [PATCH] arm64, vmcoreinfo : Append 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' to vmcoreinfo

2019-02-04 Thread James Morse
Hi Bhupesh,

On 04/02/2019 14:35, Bhupesh Sharma wrote:
> On 01/31/2019 03:09 AM, Bhupesh Sharma wrote:
>> On 01/30/2019 08:51 PM, James Morse wrote:
>>> On 01/30/2019 12:23 PM, Bhupesh Sharma wrote:
>>>> With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
>>>> supports these extensions can support up to 52-bit virtual and 52-bit
>>>> physical addresses respectively.
>>>>
>>>> Since at the moment we enable the support of these extensions via CONFIG
>>>> flags, e.g.
>>>>   - LPA via CONFIG_ARM64_PA_BITS_52
>>>>
>>>> there are no clear mechanisms in user-space right now to
>>>> determine these CONFIG flag values and also determine the PARange and
>>>> VARange address values.
>>>> User-space tools like 'makedumpfile' and 'crash-utility' can instead
>>>> use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
>>>> the maximum virtual address and physical address (respectively)
>>>> supported by underlying kernel.
>>>>
>>>> A reference 'makedumpfile' implementation which uses this approach to
>>>> determining the maximum physical address is available in [0].
>>>
>>> Why does it need to know?
>>>
>>> (Suzuki asked the same question on your earlier version)
>>> https://lore.kernel.org/linux-arm-kernel/cff44754-7fe4-efea-bc8e-4dde2277c...@arm.com/


>> I have shared some details (after discussion with our test teams) in reply to
>> the review comments from Suzuki here:
>> http://lists.infradead.org/pipermail/kexec/2019-January/022389.html, and
>> http://lists.infradead.org/pipermail/kexec/2019-January/022390.html
>>
>> Just to summarize, I mentioned in my replies to the review comments that the
>> makedumpfile implementation (for decoding the PTE) was just as an example,
>> however there can be other user-space applications, for e.g a user-space
>> application running with 48-bit kernel VA and 52-bit user space VA and
>> requesting allocation in 'high' address via a 'hint' to mmap.

But vmcoreinfo is the wrong place to expose this information. (it can be
configured off, and is only accessible to root)


>>>  From your github link it looks like you use this to re-assemble the two 
>>> bits
>>> of the PFN from the pte. Can't you always do this for 64K pages? CPUs with
>>> the feature always do this too, its not something the kernel turns on.
>>
>> Ok, let me try to give some perspective of a common makedumpfile use-case
>> before I jump into the details:


>> Also hardcoding the PTE calculation to use the high address bit mask always
>> will break the backward compatibility with older kernels (which don't support
>> 52-bit address space extensions).

What would go wrong?

The hardware ignores those bits and supplies zero. As far as I can see the
kernel has always generated zero there.
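
Concretely, for the 64K granule the LPA layout puts PA[51:48] in descriptor
bits 15:12, so a decoder can apply the reassembly unconditionally; a sketch
(bit positions per the Arm ARM, the helper name is made up, not makedumpfile
source):

--%<--
#include <stdint.h>

/* 64K-granule descriptor -> physical address (ARMv8.2-LPA layout):
 * PA[47:16] sit in bits 47:16, PA[51:48] sit in bits 15:12.
 * Pre-LPA CPUs and kernels leave bits 15:12 zero, so decoding this
 * way is backwards compatible. */
static uint64_t pte_to_phys_64k(uint64_t pte)
{
	uint64_t pa = pte & 0x0000ffffffff0000ULL;	/* PA[47:16] */

	pa |= (pte & 0xf000ULL) << 36;			/* PA[51:48] */
	return pa;
}
--%<--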


>> (b). Also x86_64 already has a vmcoreinfo export for 'pgtable_l5_enabled':

So? 5-level page tables are a different feature. I agree you need to know the
number of levels to walk the page-tables, but that isn't how arm64's 52bit stuff
works.


> Ping. Since this patch fixes a regression with user-space tools like
> makedumpfile and crash-utility which are broken since arm64 kernels with 
> 52-bit
> VA and PA support are available (and distributions which enable them), would
> request review comments/ack on this simple change.

Broken how? What goes wrong?

I can see how a not-52bit-aware crash/gdb/whatever would be confused by high
bits being set in the physical address, and possibly throw them away.
But once it supports this for 64K pages, I don't see what can go wrong if those
bits aren't set. Why does it need to know?


Thanks,

James



Re: [PATCH 1/2] arm64: Expose address bits (physical/virtual) via cpuinfo

2019-02-04 Thread James Morse
Hi Bhupesh,

On 30/01/2019 19:48, Bhupesh Sharma wrote:
> On 01/29/2019 03:39 PM, Suzuki K Poulose wrote:
>> On 28/01/2019 20:57, Bhupesh Sharma wrote:
>>> With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
>>> supports these extensions can support up to 52-bit virtual and 52-bit
>>> physical addresses respectively.
>>>
>>> Since at the moment we enable the support of these extensions via CONFIG
>>> flags, e.g.
>>>   - LPA via CONFIG_ARM64_PA_BITS_52, and
>>>   - LVA via CONFIG_ARM64_FORCE_52BIT
>>>
>>> The easiest way a user can determine the physical/virtual
>>> addresses supported on the hardware, is via the '/proc/cpuinfo'
>>> interface.
>>
>> Why do we need this information ?
> 
> Sorry for the delay in reply, but I wanted to collect as much information from
> our test teams as possible before replying to this thread.
> 
> So here is brief list of reasons, as to why we need this information in 
> user-space:
> 
> 1. This information is useful for a non-expert user, using Linux distributions
> (like Fedora) on different arm64 platforms. The default configuration 
> (.config)
> will be the same for a distribution flavor and is supposed to work fine on all
> underlying arm64 platforms.
> 
> a). Now some of these underlying platforms may support ARMv8-8.2 extension 
> while
> others don't.
> 
> b). Users performing performance bench-marking on these platforms run 
> benchmarks
> with different page-sizes and address ranges.

> c). Right now they have no way to know, about the underlying VARange and 
> PARange
> values other than reading the config file and search for the flags.

Why do they need to know? What decision can you make with this information that
you can't make without it?


> For e.g. let's consider the 'pg-table_tests' (See -
> ), which is used to test and
> verify 5-level page table behavior on x86_64 Linux. It requires determining if
> 5-level page tables are fully supported, 

... but we don't have 5-level page tables ...


> for which it uses either 'Intel 'la57'
> cpu flag' in:
> 
> $ cat /proc/cpuinfo', or
> 
> $ grep CONFIG_X86_5LEVEL /boot/config-$(uname -r)
> CONFIG_X86_5LEVEL=y
> 
> This test suite is easily modifiable for verifying 52-bit ARMv8.2-LVA support.

This looks like a test to check all kernel page-table walkers have been updated
for a fifth level. We don't need to worry about this.

You should just need to remove the arch-specific test. If you provide the hint
on platforms that support it, the mapping should succeed. On platforms that
don't, it won't.

Why does user space need to know in advance of making the hint?
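
That is, the hint is self-probing; a sketch (the 1UL << 51 hint is just an
example above the 48-bit boundary):

--%<--
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* A hint above the default 48-bit limit opts in to 52-bit VAs.
	 * On a 48-bit-only kernel/CPU the same call still succeeds,
	 * just with an address below 2^48. */
	void *hint = (void *)(1UL << 51);
	void *p = mmap(hint, 1UL << 20, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	printf("mapped at %p\n", p);
	return 0;
}
--%<--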


> d). Now when running the above suite and sharing results, it might be that the
> .config file is not available or even in the case it is available the CONFIG
> flag settings in .config file are not intuitive to a non-expert user for arm64
> (the example below is of 64K page size, 48-bit kernel VA, 52-bit User space VA
> and 52-bit PA):

I agree inspecting the Kconfig is an inappropriate way for user-space 'to know'
what the kernel supports.

I can only see a 'supports 52bit va' flag as being useful to a program that
doesn't actually want to use it, but for some bizarre reason wants to know.

For coredumps the question isn't "was it supported", but "was it in use", which
you can tell from the pagetables.


> Also right now there is an absence of a standard ABI between the user-space 
> and
> kernel for exporting this information to the user-space, with two exceptions:
> 
> 1. For vmcoreinfo specific user-space utilities (like makedumpfile and crash) 
> I
> have proposed a couple of CONFIG flags to be added to the vmcoreinfo, so that
> user-space utilities can use the same (See
>  for 
> details).

vmcoreinfo is for things like crash/gdb/makedumpfile to provide kernel-specific
information that they couldn't possibly work without. Like the page size. 52bit
support doesn't fit here as a 52bit-aware walker works regardless of whether
52bit was in use.


> 2. For other user-space utilities (especially those which make a 'mmap' call 
> and
> pass an address hint to the get the kernel to provide a high address), 

> I can see only two methods to determine the underlying kernel support:
> 
> a). Read the CONFIG flags from .config (as I captured some paragraphs above), 
> or
> 
> b). In absence of .config file on the system, read the system ID registers 
> like
> 'ID_AA64MMFR0_EL1' and 'ID_AA64MMFR2_EL1' (which PATCH 2/2 of this series 
> tries
> to enable from kernel side) and then make a decision on whether to pass a hint
> to 'mmap'.

It seems you're expecting to know whether 52bit-VA is supported without actually
using it. What is this useful for?

The point of the hint is you want to allocate memory, and can work with 52bit-VA
if the platform supports it. If it doesn't, you still want to allocate the
memory. We shouldn't need a 

Re: [PATCH] arm64, vmcoreinfo : Append 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' to vmcoreinfo

2019-01-30 Thread James Morse

Hi Bhupesh,

On 01/30/2019 12:23 PM, Bhupesh Sharma wrote:

With ARMv8.2-LVA and LPA architecture extensions, arm64 hardware which
supports these extensions can support up to 52-bit virtual and 52-bit
physical addresses respectively.

Since at the moment we enable the support of these extensions via CONFIG
flags, e.g.
  - LPA via CONFIG_ARM64_PA_BITS_52

there are no clear mechanisms in user-space right now to
determine these CONFIG flag values and also determine the PARange and
VARange address values.
User-space tools like 'makedumpfile' and 'crash-utility' can instead
use the 'MAX_USER_VA_BITS' and 'MAX_PHYSMEM_BITS' values to determine
the maximum virtual address and physical address (respectively)
supported by underlying kernel.

A reference 'makedumpfile' implementation which uses this approach to
determining the maximum physical address is available in [0].


Why does it need to know?

(Suzuki asked the same question on your earlier version)
https://lore.kernel.org/linux-arm-kernel/cff44754-7fe4-efea-bc8e-4dde2277c...@arm.com/


From your github link it looks like you use this to re-assemble the two 
bits of the PFN from the pte. Can't you always do this for 64K pages? 
CPUs with the feature always do this too, it's not something the kernel 
turns on.



Thanks,

James



diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
index ca4c3e12d8c5..ad231be5c0d8 100644
--- a/arch/arm64/kernel/crash_core.c
+++ b/arch/arm64/kernel/crash_core.c
@@ -10,6 +10,8 @@
  void arch_crash_save_vmcoreinfo(void)
  {
VMCOREINFO_NUMBER(VA_BITS);
+   VMCOREINFO_NUMBER(MAX_USER_VA_BITS);
+   VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
vmcoreinfo_append_str("NUMBER(kimage_voffset)=0x%llx\n",
kimage_voffset);






Re: [PATCH] arm64: kexec_file: return successfully even if kaslr-seed doesn't exist

2019-01-11 Thread James Morse
Hi Akashi,

On 11/01/2019 07:40, AKASHI Takahiro wrote:
> In kexec_file_load, kaslr-seed property of the current dtb will be deleted
> anyway before setting a new value if possible. It doesn't matter whether
> it exists in the current dtb.

Could you describe what goes wrong without this patch?

If the rng isn't ready yet, this stale error causes us to return an error,
causing kexec_file_load() to fail unnecessarily.

> So "ret" should be reset to 0 here.
> 
> Fixes: commit 884143f60c89 ("arm64: kexec_file: add kaslr support")

Reviewed-by: James Morse 


Thanks!

James



Re: [PATCH] arm64: invalidate TLB before turning MMU on

2018-12-13 Thread James Morse
Hi Qian,

On 13/12/2018 05:22, Qian Cai wrote:
> On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash
> dump just hung. It has 4 threads on each core. Each 2 cores share the same
> L1 and L2 caches, so 8 CPUs share those. All CPUs share the same
> L3 cache.
> 
> It turned out that this was due to the TLB contained stale entries (or
> uninitialized junk which just happened to look valid) from the first
> kernel before turning the MMU on in the second kernel which caused this
> instruction hung,

This is a great find, thanks for debugging this!

The kernel should already handle this, as we don't trust the bootloader to clean
up either.

In arch/arm64/mm/proc.S::__cpu_setup()
|/*
| * __cpu_setup
| *
| * Initialise the processor for turning the MMU on.  Return in x0 the
| * value of the SCTLR_EL1 register.
| */
|   .pushsection ".idmap.text", "awx"
| ENTRY(__cpu_setup)
|   tlbi vmalle1 // Invalidate local TLB
|   dsb nsh

This is called from stext, which then branches to __primary_switch(), which
calls __enable_mmu() where you see this problem. It shouldn't be possible to
allocate new tlb entries between these points...

Do you have CONFIG_RANDOMIZE_BASE disabled? This causes enable_mmu() to be
called twice, the extra tlb maintenance is in __primary_switch.
(if it works with this turned off, it points to the extra off/tlbi/on sequence).


> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 4471f570a295..5196f3d729de 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -771,6 +771,10 @@ ENTRY(__enable_mmu)
>   msr ttbr0_el1, x2   // load TTBR0
>   msr ttbr1_el1, x1   // load TTBR1
>   isb
> + dsb nshst
> + tlbi vmalle1 // invalidate TLB
> + dsb nsh
> + isb
>   msr sctlr_el1, x0
>   isb

The overall change here is that we do extra maintenance later.

Can you move this around to bisect where the TLB entries are either coming from, or
failing to be invalidated?
Do your first and kdump kernels have the same VA_BITS/PAGE_SIZE?
As a stab in the dark, (totally untested):
--%<--
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 2c75b0b903ae..a5f3b7314bda 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -406,9 +406,6 @@ ENDPROC(idmap_kpti_install_ng_mappings)
  */
.pushsection ".idmap.text", "awx"
 ENTRY(__cpu_setup)
-   tlbi vmalle1 // Invalidate local TLB
-   dsb nsh
-
mov x0, #3 << 20
msr cpacr_el1, x0   // Enable FP/ASIMD
mov x0, #1 << 12 // Reset mdscr_el1 and disable
@@ -465,5 +462,10 @@ ENTRY(__cpu_setup)
 1:
 #endif /* CONFIG_ARM64_HW_AFDBM */
msr tcr_el1, x10
+   isb
+
+   tlbi vmalle1 // Invalidate local TLB
+   dsb nsh
+
ret // return to head.S
 ENDPROC(__cpu_setup)
--%<--


Thanks,

James


