Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in 820_table_firmware

2024-09-18 Thread Ard Biesheuvel
On Wed, 18 Sept 2024 at 05:14, Eric W. Biederman  wrote:
>
> Ard Biesheuvel  writes:
>
> > On Tue, 17 Sept 2024 at 17:24, Eric W. Biederman wrote:
> >>
> >> Ard Biesheuvel  writes:
> >>
> >> > Hi Eric,
> >> >
> >> > Thanks for chiming in.
> >>
> >> It just looked like after James gave some expert input the
> >> conversation got stuck, so I am just trying to move it along.
> >>
> >> I don't think anyone knows what this whole elephant looks like,
> >> which makes solving the problem tricky.
> >>
> >> > On Mon, 16 Sept 2024 at 22:21, Eric W. Biederman wrote:
> >> >>
> > ...
> >> >>
> >> >> This leaves two practical questions if I have been following everything
> >> >> correctly.
> >> >>
> >> >> 1) How to get kexec to avoid picking that memory for the new kernel to
> >> >>    run in before it initializes itself. (AKA the getting stomped by
> >> >>    relocate kernel problem).
> >> >>
> >> >> 2) How to point the new kernel to preserved tpm_log.
> >> >>
> >> >>
> >> >> This recommendation is from memory so it may be a bit off but
> >> >> the general structure should work.  The idea is as follows.
> >> >>
> >> >> - Pass the information between kernels.
> >> >>
> >> >>   It is probably simplest for the kernel to have a command line option
> >> >>   that tells the kernel the address and size of the tpm_log.
> >> >>
> >> >>   We have a couple of mechanisms here.  Assuming you are loading a
> >> >>   bzImage with kexec_file_load you should be able to have the in kernel
> >> >>   loader to add those arguments to the kernel command line.
> >> >>
> >> >
> >> > This shouldn't be necessary, and I think it is actively harmful to
> >> > keep inventing special ways for the kexec kernel to learn about these
> >> > things that deviate from the methods used by the first kernel. This is
> >> > how we ended up with 5 sources of truth for the physical memory map
> >> > (EFI memory map, memblock and 3 different versions of the e820 memory
> >> > map).
> >> >
> >> > We should try very hard to make kexec idempotent, and reuse the
> >> > existing methods where possible. In this case, the EFI configuration
> >> > table is already being exposed to the kexec kernel, which describes
> >> > the base of the allocation. The size of the allocation can be derived
> >> > from the table header.
> >> >
> >> >> - Ensure that when the loader is finding an address to load the new
> >> >>   kernel it treats the address of the tpm_log as unavailable.
> >> >>
> >> >
> >> > The TPM log is a table created by the EFI stub loader, which is part
> >> > of the kernel. So if we need to tweak this for kexec's benefit, I'd
> >> > prefer changing it in a way that can accommodate the first kernel too.
> >> > However, I think the current method already has that property so I
> >> > don't think we need to do anything (modulo fixing the bug)
> >>
> >> I am fine with not inventing a new mechanism, but I think we need
> >> to reuse whatever mechanism the stub loader uses to pass its
> >> table to the kernel.  Not the EFI table that disappears at
> >> ExitBootServices().
> >>
> >
> > Not sure what you mean here - the EFI table that gets clobbered by
> > kexec *is* the table that is created by the stub loader to pass the
> > TPM log to the kernel. Not sure what alternative you have in mind
> > here.
>
> I was referring to whatever the EFI table that James Bottomley mentioned
> that I presume the stub loader reads from when the stub loader
> constructs the tpm_log that is passed to the kernel.
>

There is no such table. The event log is exposed by the firmware via a
TCG2 protocol interface, which is no longer available after boot. So
the stub loader (which is the last kernel component that has access to
this interface) invokes this protocol and copies the output into a
table in memory which is exposed to the kernel proper as an EFI
configuration table.
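
To make the copy step concrete, here is a minimal userspace sketch of what the stub does with the log. It is a model under stated assumptions, not the libstub code: the field names follow struct linux_efi_tpm_eventlog as far as I recall it, copy_event_log() is a hypothetical helper, and malloc() stands in for the boot services pool allocation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace model of the table the stub hands to the kernel proper.
 * Field names are assumed to mirror struct linux_efi_tpm_eventlog.
 */
struct tpm_eventlog_tbl {
	uint32_t size;                      /* bytes of log[] that follow */
	uint32_t final_events_preboot_size;
	uint8_t  version;                   /* event log format version */
	uint8_t  log[];                     /* copy of the firmware log */
};

/*
 * Hypothetical helper: copy a firmware-provided log into a new table.
 * malloc() stands in for the EFI pool allocation done by the stub.
 */
static struct tpm_eventlog_tbl *copy_event_log(const uint8_t *fw_log,
					       uint32_t log_size,
					       uint8_t version)
{
	struct tpm_eventlog_tbl *tbl;

	assert(fw_log != NULL || log_size == 0);

	tbl = malloc(sizeof(*tbl) + log_size);
	if (!tbl)
		return NULL;

	memset(tbl, 0, sizeof(*tbl));
	tbl->size = log_size;               /* the size lives inside the table */
	tbl->version = version;
	if (log_size)
		memcpy(tbl->log, fw_log, log_size);
	return tbl;
}
```

The point of the sketch is only that the header and the log share one allocation, so whoever preserves the table must preserve sizeof(header) + size bytes starting at the table's base.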

So the main issue here is that EFI configuration tables are passed on
to kexec kernels, and we have to ensure (in the general case) that the
associated memory is not reus

Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in 820_table_firmware

2024-09-17 Thread Ard Biesheuvel
On Tue, 17 Sept 2024 at 17:24, Eric W. Biederman  wrote:
>
> Ard Biesheuvel  writes:
>
> > Hi Eric,
> >
> > Thanks for chiming in.
>
> It just looked like after James gave some expert input the
> conversation got stuck, so I am just trying to move it along.
>
> I don't think anyone knows what this whole elephant looks like,
> which makes solving the problem tricky.
>
> > On Mon, 16 Sept 2024 at 22:21, Eric W. Biederman wrote:
> >>
...
> >>
> >> This leaves two practical questions if I have been following everything
> >> correctly.
> >>
> >> 1) How to get kexec to avoid picking that memory for the new kernel to
> >>    run in before it initializes itself. (AKA the getting stomped by
> >>    relocate kernel problem).
> >>
> >> 2) How to point the new kernel to preserved tpm_log.
> >>
> >>
> >> This recommendation is from memory so it may be a bit off but
> >> the general structure should work.  The idea is as follows.
> >>
> >> - Pass the information between kernels.
> >>
> >>   It is probably simplest for the kernel to have a command line option
> >>   that tells the kernel the address and size of the tpm_log.
> >>
> >>   We have a couple of mechanisms here.  Assuming you are loading a
> >>   bzImage with kexec_file_load you should be able to have the in kernel
> >>   loader to add those arguments to the kernel command line.
> >>
> >
> > This shouldn't be necessary, and I think it is actively harmful to
> > keep inventing special ways for the kexec kernel to learn about these
> > things that deviate from the methods used by the first kernel. This is
> > how we ended up with 5 sources of truth for the physical memory map
> > (EFI memory map, memblock and 3 different versions of the e820 memory
> > map).
> >
> > We should try very hard to make kexec idempotent, and reuse the
> > existing methods where possible. In this case, the EFI configuration
> > table is already being exposed to the kexec kernel, which describes
> > the base of the allocation. The size of the allocation can be derived
> > from the table header.
> >
> >> - Ensure that when the loader is finding an address to load the new
> >>   kernel it treats the address of the tpm_log as unavailable.
> >>
> >
> > The TPM log is a table created by the EFI stub loader, which is part
> > of the kernel. So if we need to tweak this for kexec's benefit, I'd
> > prefer changing it in a way that can accommodate the first kernel too.
> > However, I think the current method already has that property so I
> > don't think we need to do anything (modulo fixing the bug)
>
> I am fine with not inventing a new mechanism, but I think we need
> to reuse whatever mechanism the stub loader uses to pass its
> table to the kernel.  Not the EFI table that disappears at
> ExitBootServices().
>

Not sure what you mean here - the EFI table that gets clobbered by
kexec *is* the table that is created by the stub loader to pass the
TPM log to the kernel. Not sure what alternative you have in mind
here.

> > That said, I am doubtful that the kexec kernel can make meaningful use
> > of the TPM log to begin with, given that the TPM will be out of sync
> > at this point. But it is still better to keep it for symmetry, letting
> > the higher level kexec/kdump logic running in user space reason about
> > whether the TPM log has any value to it.
>
> Someone seems to think so or there would not be a complaint that it is
> getting corrupted.
>

No. The problem is that the size of the table is *in* the table, and
so if it gets corrupted, the code that attempts to memblock_reserve()
it goes off into the weeds. But that does not imply there is a point
to having access to this table from a kexec kernel in the first place.
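
As an illustration of that failure mode, a small sketch of the size computation with a sanity check in front of it. The check, the limit TPM_LOG_MAX_SIZE and the helper name are all hypothetical; this is not what efi_tpm_eventlog_init() does, only a model of why a clobbered header is dangerous.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Header of the log table; the size of the payload is stored in it. */
struct tpm_eventlog_hdr {
	uint32_t size;
	uint32_t final_events_preboot_size;
	uint8_t  version;
};

/* Hypothetical upper bound on a plausible event log (1 MiB). */
#define TPM_LOG_MAX_SIZE (1u << 20)

/*
 * Hypothetical helper: compute how many bytes would have to be reserved,
 * refusing sizes that can only come from a corrupted header.
 */
static bool tpm_log_reserve_size(const struct tpm_eventlog_hdr *hdr,
				 uint64_t *tbl_size)
{
	assert(hdr != NULL && tbl_size != NULL);

	if (hdr->size > TPM_LOG_MAX_SIZE)
		return false;

	*tbl_size = sizeof(*hdr) + (uint64_t)hdr->size;
	return true;
}
```

With an intact header the result is sizeof(header) + size; with a header that kexec has overwritten, size is arbitrary, and without a check like the one above the reservation covers an arbitrary range.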

> This should not be the kexec-on-panic kernel as that runs in memory
> that is reserved solely for its own use.  So we are talking something
> like using kexec as a bootloader.
>

kexec as a bootloader under TPM based measured boot will need to do a
lot more than pass the firmware's event log to the next kernel. I'd
expect a properly engineered kexec to replace this table entirely, and
include the hashes of the assets it has loaded and measured into the
respective PCRs.

But let's stick to solving the actual issue here, rather than
philosophize on how kexec might work in this context.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in 820_table_firmware

2024-09-16 Thread Ard Biesheuvel
Hi Eric,

Thanks for chiming in.

On Mon, 16 Sept 2024 at 22:21, Eric W. Biederman  wrote:
>
> James Bottomley  writes:
>
> > On Fri, 2024-09-13 at 04:57 -0700, Breno Leitao wrote:
> >> Hello James,
> >>
> >> On Thu, Sep 12, 2024 at 12:22:01PM -0400, James Bottomley wrote:
> >> > On Thu, 2024-09-12 at 06:03 -0700, Breno Leitao wrote:
> >> > > Hello Ard,
> >> > >
> >> > > On Thu, Sep 12, 2024 at 12:51:57PM +0200, Ard Biesheuvel wrote:
> >> > > > I don't see how this could be an EFI bug, given that it does
> >> > > > not deal with E820 tables at all.
> >> > >
> >> > > I want to back up a little bit and make sure I am following the
> >> > > discussion.
> >> > >
> >> > > From what I understand from previous discussion, we have an EFI
> >> > > bug as the root cause of this issue.
> >> > >
> >> > > This happens because the EFI does NOT mark the EFI TPM event log
> >> > > memory region as reserved (EFI_RESERVED_TYPE). Without an entry
> >> > > for the event table memory in the EFI memory map, libstub will
> >> > > ignore it completely (the TPM event log memory range) and not
> >> > > populate the e820 table with it.
> >> >
> >> > Wait, that's not correct.  The TPM log is in memory that doesn't
> >> > survive ExitBootServices (by design in case the OS doesn't care
> >> > about it).  So the EFI stub actually copies it over to a new
> >> > configuration table that is in reserved memory before it calls
> >> > ExitBootServices.  This new copy should be in kernel reserved
> >> > memory regardless of its e820 map status.
> >>
> >> First of all, thanks for clarifying some points here.
> >>
> >> How should the TPM log table be passed to the next kernel when
> >> kexecing() since it didn't survive ExitBootServices?
> >
> > I've no idea.  I'm assuming you don't elaborately reconstruct the EFI
> > boot services, so you can't enter the EFI boot stub before
> > ExitBootServices is called?  So I'd guess you want to preserve the EFI
> > table that copied the TPM data in to kernel memory.
>
> This leaves two practical questions if I have been following everything
> correctly.
>
> 1) How to get kexec to avoid picking that memory for the new kernel to
>    run in before it initializes itself. (AKA the getting stomped by
>    relocate kernel problem).
>
> 2) How to point the new kernel to preserved tpm_log.
>
>
> This recommendation is from memory so it may be a bit off but
> the general structure should work.  The idea is as follows.
>
> - Pass the information between kernels.
>
>   It is probably simplest for the kernel to have a command line option
>   that tells the kernel the address and size of the tpm_log.
>
>   We have a couple of mechanisms here.  Assuming you are loading a
>   bzImage with kexec_file_load you should be able to have the in kernel
>   loader to add those arguments to the kernel command line.
>

This shouldn't be necessary, and I think it is actively harmful to
keep inventing special ways for the kexec kernel to learn about these
things that deviate from the methods used by the first kernel. This is
how we ended up with 5 sources of truth for the physical memory map
(EFI memory map, memblock and 3 different versions of the e820 memory
map).

We should try very hard to make kexec idempotent, and reuse the
existing methods where possible. In this case, the EFI configuration
table is already being exposed to the kexec kernel, which describes
the base of the allocation. The size of the allocation can be derived
from the table header.
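
For what it's worth, a rough userspace sketch of that lookup: walk the configuration table array, match the GUID, and derive the extent from the header. All type and function names here are illustrative stand-ins, and the GUID is passed in as a parameter rather than hard-coded, so nothing below should be read as the kernel's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16-byte GUID, compared bytewise. */
typedef struct { uint8_t b[16]; } guid_t;

/* One configuration table entry: a GUID plus the table's address. */
struct config_table_entry {
	guid_t    guid;
	uintptr_t table;   /* a physical address in the real thing */
};

struct tpm_eventlog_hdr {
	uint32_t size;
	uint32_t final_events_preboot_size;
	uint8_t  version;
};

struct mem_range {
	uintptr_t base;
	uint64_t  size;
};

/*
 * Hypothetical helper: find the log table by GUID and report the range
 * that must be preserved. Returns 0 on success, -1 if not present.
 */
static int tpm_log_range(const struct config_table_entry *tables, size_t count,
			 const guid_t *log_guid, struct mem_range *out)
{
	assert(tables != NULL && log_guid != NULL && out != NULL);

	for (size_t i = 0; i < count; i++) {
		const struct tpm_eventlog_hdr *hdr;

		if (memcmp(&tables[i].guid, log_guid, sizeof(guid_t)) != 0)
			continue;

		hdr = (const struct tpm_eventlog_hdr *)tables[i].table;
		out->base = tables[i].table;                      /* base: from the entry */
		out->size = sizeof(*hdr) + (uint64_t)hdr->size;   /* size: from the header */
		return 0;
	}
	return -1;
}
```

Nothing kexec-specific is needed for this: the second kernel can run exactly the same lookup as the first, provided the memory behind the entry is still intact.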

> - Ensure that when the loader is finding an address to load the new
>   kernel it treats the address of the tpm_log as unavailable.
>

The TPM log is a table created by the EFI stub loader, which is part
of the kernel. So if we need to tweak this for kexec's benefit, I'd
prefer changing it in a way that can accommodate the first kernel too.
However, I think the current method already has that property so I
don't think we need to do anything (modulo fixing the bug)

That said, I am doubtful that the kexec kernel can make meaningful use
of the TPM log to begin with, given that the TPM will be out of sync
at this point. But it is still better to keep it for symmetry, letting
the higher level kexec/kdump logic running in user space reason about
whether the TPM log has any value to it.

-- 
Ard.



Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in 820_table_firmware

2024-09-14 Thread Ard Biesheuvel
On Sat, 14 Sept 2024 at 08:46, Dave Young  wrote:
>
> On Fri, 13 Sept 2024 at 18:56, Dave Young  wrote:
> >
> > On Thu, 12 Sept 2024 at 22:15, Ard Biesheuvel  wrote:
> > >
> > > (cc Dave)
> >
> > Thanks for ccing me.
> >
> > >
> > > Full thread here:
> > > https://lore.kernel.org/all/camj1kxg1hbiafkryc5qm1vj5x7x-dmlndqqo2aynhmrxdz-...@mail.gmail.com/T/#u
> > >
> > > On Thu, 12 Sept 2024 at 16:05, Ard Biesheuvel  wrote:
> > > >
> > > > On Thu, 12 Sept 2024 at 15:55, Usama Arif wrote:
> > > > >
> > > > >
> > > > >
> > > > > On 12/09/2024 14:10, Ard Biesheuvel wrote:
> > > > > > Does the below help at all?
> > > > > >
> > > > > > --- a/drivers/firmware/efi/tpm.c
> > > > > > +++ b/drivers/firmware/efi/tpm.c
> > > > > > @@ -60,7 +60,7 @@ int __init efi_tpm_eventlog_init(void)
> > > > > >         }
> > > > > >
> > > > > >         tbl_size = sizeof(*log_tbl) + log_tbl->size;
> > > > > > -       memblock_reserve(efi.tpm_log, tbl_size);
> > > > > > +       efi_mem_reserve(efi.tpm_log, tbl_size);
> > > > > >
> > > > > >         if (efi.tpm_final_log == EFI_INVALID_TABLE_ADDR) {
> > > > > >                 pr_info("TPM Final Events table not present\n");
> > > > >
> > > > > Unfortunately not. efi_mem_reserve updates e820_table, while kexec
> > > > > looks at /sys/firmware/memmap which is e820_table_firmware.
> > > > >
> > > > > arch_update_firmware_area introduced in the RFC patch does the same
> > > > > thing as efi_mem_reserve does at its end, just with
> > > > > e820_table_firmware instead of e820_table.
> > > > > i.e. efi_mem_reserve does:
> > > > >     e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> > > > >     e820__update_table(e820_table);
> > > > >
> > > > > while arch_update_firmware_area does:
> > > > >     e820__range_update_firmware(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> > > > >     e820__update_table(e820_table_firmware);
> > > > >
> > > >
> > > > Shame.
> > > >
> > > > Using efi_mem_reserve() is appropriate here in any case, but I guess
> > > > kexec on x86 needs to be fixed to juggle the EFI memory map, memblock
> > > > table, and 3 (!) versions of the E820 table in the correct way
> > > > (e820_table, e820_table_kexec and e820_table_firmware)
> > > >
> > > > Perhaps we can put this additional logic in x86's implementation of
> > > > efi_arch_mem_reserve()? AFAICT, all callers of efi_mem_reserve() deal
> > > > with configuration tables produced by the firmware that may not be
> > > > reserved correctly if kexec looks at e820_table_firmware[] only.
> > >
> >
> > I have not read the whole conversation yet; let me have a look and respond
> > later.
> >
>
> I'm still confused after reading the code about why this issue can
> still happen with efi_mem_reserve().
> Usama, Breno, could any of you share the exact steps on how to
> reproduce this issue with a kvm guest?
>

The code does not use efi_mem_reserve(), only memblock_reserve().



Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in 820_table_firmware

2024-09-12 Thread Ard Biesheuvel
On Thu, 12 Sept 2024 at 17:35, Usama Arif  wrote:
>
>
>
> On 12/09/2024 16:21, Ard Biesheuvel wrote:
> > On Thu, 12 Sept 2024 at 16:29, Breno Leitao  wrote:
> >>
> >> On Thu, Sep 12, 2024 at 03:10:43PM +0200, Ard Biesheuvel wrote:
> >>> On Thu, 12 Sept 2024 at 15:03, Breno Leitao  wrote:
> >>>> On Thu, Sep 12, 2024 at 12:51:57PM +0200, Ard Biesheuvel wrote:
> >>>>> I don't see how this could be an EFI bug, given that it does not deal
> >>>>> with E820 tables at all.
> >>>>
> >>>> I want to back up a little bit and make sure I am following the
> >>>> discussion.
> >>>>
> >>>> From what I understand from previous discussion, we have an EFI bug as
> >>>> the root cause of this issue.
> >>>>
> >>>> This happens because the EFI does NOT mark the EFI TPM event log memory
> >>>> region as reserved (EFI_RESERVED_TYPE).
> >>>
> >>> Why do you think EFI should use EFI_RESERVED_TYPE in this case?
> >>>
> >>> The EFI spec is very clear that EFI_RESERVED_TYPE really shouldn't be
> >>> used for anything by EFI itself. It is quite common for EFI
> >>> configuration tables to be passed as EfiRuntimeServicesData (SMBIOS),
> >>> EfiBootServicesData (ESRT) or EfiACPIReclaimMemory (ACPI tables).
> >>>
> >>> Reserved memory is mostly for memory that even the firmware does not
> >>> know what it is for, i.e., particular platform specific uses.
> >>>
> >>> In general, it is up to the OS to ensure that EFI configuration tables
> >>> that it cares about should be reserved in the correct way.
> >>
> >> Thanks for the explanation.
> >>
> >> So, if I understand what you meant here, the TPM event log memory range
> >> shouldn't be listed as a memory region in EFI memory map (as passed by
> >> the firmware to the OS).
> >>
> >> Hence, this is not a EFI firmware bug, but a OS/Kernel bug.
> >>
> >> Am I correct with the statements above?
> >>
> >
> > No, not entirely. But I also missed the fact that this table is
> > actually created by the EFI stub in Linux, not the firmware. It is
> > *not* the same memory region that the TPM2 ACPI table describes, and
> > so what the various specs say about that is entirely irrelevant.
> >
> > The TPM event log configuration table is created by the EFI stub,
> > which uses the TCG2::GetEventLog () protocol method to obtain it. This
> > must be done by the EFI stub because these protocols will no longer be
> > available once the OS boots. But the data is not used by the EFI stub,
> > only by the OS, which is why it is passed in memory like this.
> >
> > The memory in question is allocated as EFI_LOADER_DATA, and so we are
> > relying on the OS to know that this memory is special, and needs to be
> > preserved.
> >
> > I think the solution here is to use a different memory type:
> >
> > --- a/drivers/firmware/efi/libstub/tpm.c
> > +++ b/drivers/firmware/efi/libstub/tpm.c
> > @@ -96,7 +96,7 @@ static void efi_retrieve_tcg2_eventlog(int version, efi_physical_addr_t log_loca
> >         }
> >
> >         /* Allocate space for the logs and copy them. */
> > -       status = efi_bs_call(allocate_pool, EFI_LOADER_DATA,
> > +       status = efi_bs_call(allocate_pool, EFI_ACPI_RECLAIM_MEMORY,
> >                              sizeof(*log_tbl) + log_size, (void **)&log_tbl);
> >
> >         if (status != EFI_SUCCESS) {
> >
> > which will be treated appropriately by the existing EFI-to-E820
> > conversion logic.
>
> I have tested above diff, and it works! No memory corruption.
>
> The region comes up as ACPI data:
> [0.00] BIOS-e820: [mem 0x7fb6d000-0x7fb7efff] ACPI data
>
> and kexec doesn't interfere with it.
>

Thanks, I'll take that as a tested-by



Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in 820_table_firmware

2024-09-12 Thread Ard Biesheuvel
On Thu, 12 Sept 2024 at 16:29, Breno Leitao  wrote:
>
> On Thu, Sep 12, 2024 at 03:10:43PM +0200, Ard Biesheuvel wrote:
> > On Thu, 12 Sept 2024 at 15:03, Breno Leitao  wrote:
> > > On Thu, Sep 12, 2024 at 12:51:57PM +0200, Ard Biesheuvel wrote:
> > > > I don't see how this could be an EFI bug, given that it does not deal
> > > > with E820 tables at all.
> > >
> > > I want to back up a little bit and make sure I am following the
> > > discussion.
> > >
> > > From what I understand from previous discussion, we have an EFI bug as
> > > the root cause of this issue.
> > >
> > > This happens because the EFI does NOT mark the EFI TPM event log memory
> > > region as reserved (EFI_RESERVED_TYPE).
> >
> > Why do you think EFI should use EFI_RESERVED_TYPE in this case?
> >
> > The EFI spec is very clear that EFI_RESERVED_TYPE really shouldn't be
> > used for anything by EFI itself. It is quite common for EFI
> > configuration tables to be passed as EfiRuntimeServicesData (SMBIOS),
> > EfiBootServicesData (ESRT) or EfiACPIReclaimMemory (ACPI tables).
> >
> > Reserved memory is mostly for memory that even the firmware does not
> > know what it is for, i.e., particular platform specific uses.
> >
> > In general, it is up to the OS to ensure that EFI configuration tables
> > that it cares about should be reserved in the correct way.
>
> Thanks for the explanation.
>
> So, if I understand what you meant here, the TPM event log memory range
> shouldn't be listed as a memory region in EFI memory map (as passed by
> the firmware to the OS).
>
> Hence, this is not a EFI firmware bug, but a OS/Kernel bug.
>
> Am I correct with the statements above?
>

No, not entirely. But I also missed the fact that this table is
actually created by the EFI stub in Linux, not the firmware. It is
*not* the same memory region that the TPM2 ACPI table describes, and
so what the various specs say about that is entirely irrelevant.

The TPM event log configuration table is created by the EFI stub,
which uses the TCG2::GetEventLog () protocol method to obtain it. This
must be done by the EFI stub because these protocols will no longer be
available once the OS boots. But the data is not used by the EFI stub,
only by the OS, which is why it is passed in memory like this.

The memory in question is allocated as EFI_LOADER_DATA, and so we are
relying on the OS to know that this memory is special, and needs to be
preserved.

I think the solution here is to use a different memory type:

--- a/drivers/firmware/efi/libstub/tpm.c
+++ b/drivers/firmware/efi/libstub/tpm.c
@@ -96,7 +96,7 @@ static void efi_retrieve_tcg2_eventlog(int version, efi_physical_addr_t log_loca
        }

        /* Allocate space for the logs and copy them. */
-       status = efi_bs_call(allocate_pool, EFI_LOADER_DATA,
+       status = efi_bs_call(allocate_pool, EFI_ACPI_RECLAIM_MEMORY,
                             sizeof(*log_tbl) + log_size, (void **)&log_tbl);

        if (status != EFI_SUCCESS) {

which will be treated appropriately by the existing EFI-to-E820
conversion logic.
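
To spell out why the memory type matters, a simplified sketch of the EFI-to-E820 type conversion. The EFI values are the ones from the UEFI spec; the mapping is written from my recollection of the x86 stub's conversion and leaves out details such as attribute checks, so treat it as an approximation. kexec_could_overwrite() is a hypothetical helper that only exists for this illustration.

```c
#include <assert.h>

/* Memory type values as defined by the UEFI specification. */
enum efi_memory_type {
	EFI_RESERVED_TYPE         = 0,
	EFI_LOADER_CODE           = 1,
	EFI_LOADER_DATA           = 2,
	EFI_BOOT_SERVICES_CODE    = 3,
	EFI_BOOT_SERVICES_DATA    = 4,
	EFI_RUNTIME_SERVICES_CODE = 5,
	EFI_RUNTIME_SERVICES_DATA = 6,
	EFI_CONVENTIONAL_MEMORY   = 7,
	EFI_UNUSABLE_MEMORY       = 8,
	EFI_ACPI_RECLAIM_MEMORY   = 9,
	EFI_ACPI_MEMORY_NVS       = 10,
};

enum e820_type {
	E820_TYPE_RAM      = 1,
	E820_TYPE_RESERVED = 2,
	E820_TYPE_ACPI     = 3,
	E820_TYPE_NVS      = 4,
	E820_TYPE_UNUSABLE = 5,
};

/* Simplified conversion; anything not modelled here is treated as reserved. */
static enum e820_type efi_to_e820_type(enum efi_memory_type efi_type)
{
	switch (efi_type) {
	case EFI_LOADER_CODE:
	case EFI_LOADER_DATA:
	case EFI_BOOT_SERVICES_CODE:
	case EFI_BOOT_SERVICES_DATA:
	case EFI_CONVENTIONAL_MEMORY:
		return E820_TYPE_RAM;        /* looks like free RAM in E820 */
	case EFI_ACPI_RECLAIM_MEMORY:
		return E820_TYPE_ACPI;       /* shows up as "ACPI data" */
	case EFI_ACPI_MEMORY_NVS:
		return E820_TYPE_NVS;
	case EFI_UNUSABLE_MEMORY:
		return E820_TYPE_UNUSABLE;
	default:
		return E820_TYPE_RESERVED;
	}
}

/* Hypothetical helper: could a loader that only uses E820 RAM pick this region? */
static int kexec_could_overwrite(enum efi_memory_type efi_type)
{
	enum e820_type t = efi_to_e820_type(efi_type);

	assert(t >= E820_TYPE_RAM && t <= E820_TYPE_UNUSABLE);
	return t == E820_TYPE_RAM;
}
```

Under this model an EFI_LOADER_DATA allocation is indistinguishable from ordinary RAM once it reaches the E820 map, whereas EFI_ACPI_RECLAIM_MEMORY keeps a distinct type that a RAM-only allocator will skip.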



Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in 820_table_firmware

2024-09-12 Thread Ard Biesheuvel
On Thu, 12 Sept 2024 at 15:55, Usama Arif  wrote:
>
>
>
> On 12/09/2024 14:10, Ard Biesheuvel wrote:
> > Does the below help at all?
> >
> > --- a/drivers/firmware/efi/tpm.c
> > +++ b/drivers/firmware/efi/tpm.c
> > @@ -60,7 +60,7 @@ int __init efi_tpm_eventlog_init(void)
> >         }
> >
> >         tbl_size = sizeof(*log_tbl) + log_tbl->size;
> > -       memblock_reserve(efi.tpm_log, tbl_size);
> > +       efi_mem_reserve(efi.tpm_log, tbl_size);
> >
> >         if (efi.tpm_final_log == EFI_INVALID_TABLE_ADDR) {
> >                 pr_info("TPM Final Events table not present\n");
>
> Unfortunately not. efi_mem_reserve updates e820_table, while kexec looks at
> /sys/firmware/memmap which is e820_table_firmware.
>
> arch_update_firmware_area introduced in the RFC patch does the same thing as
> efi_mem_reserve does at its end, just with e820_table_firmware instead of
> e820_table.
> i.e. efi_mem_reserve does:
>     e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
>     e820__update_table(e820_table);
>
> while arch_update_firmware_area does:
>     e820__range_update_firmware(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
>     e820__update_table(e820_table_firmware);
>

Shame.

Using efi_mem_reserve() is appropriate here in any case, but I guess
kexec on x86 needs to be fixed to juggle the EFI memory map, memblock
table, and 3 (!) versions of the E820 table in the correct way
(e820_table, e820_table_kexec and e820_table_firmware)

Perhaps we can put this additional logic in x86's implementation of
efi_arch_mem_reserve()? AFAICT, all callers of efi_mem_reserve() deal
with configuration tables produced by the firmware that may not be
reserved correctly if kexec looks at e820_table_firmware[] only.
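
A toy model of the problem, in case it helps: two copies of the same map, a range update that retypes part of a RAM entry (splitting it where needed), and a lookup. The names and the fixed-size table are made up for this sketch and are much simpler than the real e820 code; the only thing it is meant to show is that updating one copy leaves the other one stale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { E820_TYPE_RAM = 1, E820_TYPE_RESERVED = 2 };

#define MAX_ENTRIES 16

struct entry { uint64_t addr, size; int type; };
struct table { size_t nr; struct entry e[MAX_ENTRIES]; };

/* Append an entry; empty ranges are silently skipped. */
static int table_add(struct table *t, uint64_t addr, uint64_t size, int type)
{
	if (size == 0)
		return 0;
	if (t->nr >= MAX_ENTRIES)
		return -1;
	t->e[t->nr++] = (struct entry){ addr, size, type };
	return 0;
}

/*
 * Retype [addr, addr + size) from old_type to new_type in ONE table,
 * splitting an entry into up to three pieces where it is only partly covered.
 */
static int range_update(struct table *t, uint64_t addr, uint64_t size,
			int old_type, int new_type)
{
	struct table out = { 0 };
	uint64_t end = addr + size;
	int err = 0;

	assert(t != NULL);

	for (size_t i = 0; i < t->nr; i++) {
		const struct entry *en = &t->e[i];
		uint64_t e_end = en->addr + en->size;

		if (en->type != old_type || e_end <= addr || en->addr >= end) {
			err |= table_add(&out, en->addr, en->size, en->type);
			continue;
		}

		uint64_t lo = en->addr > addr ? en->addr : addr;
		uint64_t hi = e_end < end ? e_end : end;

		err |= table_add(&out, en->addr, lo - en->addr, old_type); /* before */
		err |= table_add(&out, lo, hi - lo, new_type);             /* retyped */
		err |= table_add(&out, hi, e_end - hi, old_type);          /* after */
	}
	if (err)
		return -1;
	*t = out;
	return 0;
}

/* Type of the entry covering addr, or 0 if no entry covers it. */
static int type_at(const struct table *t, uint64_t addr)
{
	for (size_t i = 0; i < t->nr; i++)
		if (addr >= t->e[i].addr && addr - t->e[i].addr < t->e[i].size)
			return t->e[i].type;
	return 0;
}
```

If two such tables start out identical and only the first one gets the range_update() call, type_at() on the second still reports RAM for the log's address, which is the situation described above for a reader that consults e820_table_firmware only.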



Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in 820_table_firmware

2024-09-12 Thread Ard Biesheuvel
(cc Dave)

Full thread here:
https://lore.kernel.org/all/camj1kxg1hbiafkryc5qm1vj5x7x-dmlndqqo2aynhmrxdz-...@mail.gmail.com/T/#u

On Thu, 12 Sept 2024 at 16:05, Ard Biesheuvel  wrote:
>
> On Thu, 12 Sept 2024 at 15:55, Usama Arif  wrote:
> >
> >
> >
> > On 12/09/2024 14:10, Ard Biesheuvel wrote:
> > > Does the below help at all?
> > >
> > > --- a/drivers/firmware/efi/tpm.c
> > > +++ b/drivers/firmware/efi/tpm.c
> > > @@ -60,7 +60,7 @@ int __init efi_tpm_eventlog_init(void)
> > >         }
> > >
> > >         tbl_size = sizeof(*log_tbl) + log_tbl->size;
> > > -       memblock_reserve(efi.tpm_log, tbl_size);
> > > +       efi_mem_reserve(efi.tpm_log, tbl_size);
> > >
> > >         if (efi.tpm_final_log == EFI_INVALID_TABLE_ADDR) {
> > >                 pr_info("TPM Final Events table not present\n");
> >
> > Unfortunately not. efi_mem_reserve updates e820_table, while kexec looks at
> > /sys/firmware/memmap which is e820_table_firmware.
> >
> > arch_update_firmware_area introduced in the RFC patch does the same thing
> > as efi_mem_reserve does at its end, just with e820_table_firmware instead
> > of e820_table.
> > i.e. efi_mem_reserve does:
> >     e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> >     e820__update_table(e820_table);
> >
> > while arch_update_firmware_area does:
> >     e820__range_update_firmware(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> >     e820__update_table(e820_table_firmware);
> >
>
> Shame.
>
> Using efi_mem_reserve() is appropriate here in any case, but I guess
> kexec on x86 needs to be fixed to juggle the EFI memory map, memblock
> table, and 3 (!) versions of the E820 table in the correct way
> (e820_table, e820_table_kexec and e820_table_firmware)
>
> Perhaps we can put this additional logic in x86's implementation of
> efi_arch_mem_reserve()? AFAICT, all callers of efi_mem_reserve() deal
> with configuration tables produced by the firmware that may not be
> reserved correctly if kexec looks at e820_table_firmware[] only.



Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in 820_table_firmware

2024-09-12 Thread Ard Biesheuvel
On Thu, 12 Sept 2024 at 15:03, Breno Leitao  wrote:
>
> Hello Ard,
>
> On Thu, Sep 12, 2024 at 12:51:57PM +0200, Ard Biesheuvel wrote:
> > I don't see how this could be an EFI bug, given that it does not deal
> > with E820 tables at all.
>
> I want to back up a little bit and make sure I am following the
> discussion.
>
> From what I understand from previous discussion, we have an EFI bug as
> the root cause of this issue.
>
> This happens because the EFI does NOT mark the EFI TPM event log memory
> region as reserved (EFI_RESERVED_TYPE).

Why do you think EFI should use EFI_RESERVED_TYPE in this case?

The EFI spec is very clear that EFI_RESERVED_TYPE really shouldn't be
used for anything by EFI itself. It is quite common for EFI
configuration tables to be passed as EfiRuntimeServicesData (SMBIOS),
EfiBootServicesData (ESRT) or EfiACPIReclaimMemory (ACPI tables).

Reserved memory is mostly for memory that even the firmware does not
know what it is for, i.e., particular platform specific uses.

In general, it is up to the OS to ensure that EFI configuration tables
that it cares about should be reserved in the correct way.

> Without an entry for the
> event table memory in the EFI memory map, libstub will ignore it
> completely (the TPM event log memory range) and not populate the e820 table
> with it.
>
> Once the e820 table does not have the memory range for TPM event log,
> then the kernel is free to overwrite that memory region, causing
> corruptions all across the board.
>

We shouldn't be relying on the E820 table for this.

> From what I understand from the thread discussion, there are three ways
> to "solve" it:
>
> 1) Fix the EFI to pass the TPM event log memory as reserved.
>
> 2) Work around it in libstub by considering the TPM event log memory
> range when populating the e820 table. (As proposed in
> https://lore.kernel.org/all/2542182d-aa79-4705-91b6-fa593bacf...@gmail.com/)
>
> 3) Work around it later in the kernel, as proposed in
> https://lore.kernel.org/all/20240911104109.1831501-1-usamaarif...@gmail.com/
>

Does the below help at all?

--- a/drivers/firmware/efi/tpm.c
+++ b/drivers/firmware/efi/tpm.c
@@ -60,7 +60,7 @@ int __init efi_tpm_eventlog_init(void)
        }

        tbl_size = sizeof(*log_tbl) + log_tbl->size;
-       memblock_reserve(efi.tpm_log, tbl_size);
+       efi_mem_reserve(efi.tpm_log, tbl_size);

        if (efi.tpm_final_log == EFI_INVALID_TABLE_ADDR) {
                pr_info("TPM Final Events table not present\n");



Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in 820_table_firmware

2024-09-12 Thread Ard Biesheuvel
On Thu, 12 Sept 2024 at 12:23, Usama Arif  wrote:
>
>
>
> On 11/09/2024 12:51, Ard Biesheuvel wrote:
> > On Wed, 11 Sept 2024 at 12:41, Usama Arif  wrote:
> >>
> >> Looking at the TPM spec [1]
> >>
> >> If the ACPI TPM2 table contains the address and size of the Platform
> >> Firmware TCG log, firmware “pins” the memory associated with the
> >> Platform Firmware TCG log, and reports this memory as “Reserved” memory
> >> via the INT 15h/E820 interface.
> >>
> >> It looks like the firmware should pass this as reserved in e820 memory
> >> map. However, it doesn't seem to. The firmware being tested on is:
> >> dmidecode -s bios-version
> >> edk2-20240214-2.el9
> >>
> >> When this area is not reserved, it comes up as usable in
> >> /sys/firmware/memmap. This means that kexec, which uses that memmap
> >> to find usable memory regions, can select the region where efi.tpm_log
> >> is and overwrite it and relocate_kernel.
> >>
> >> Having a fix in firmware can be difficult to get through. As a secondary
> >> fix, this patch marks that region as reserved in e820_table_firmware if it
> >> is currently E820_TYPE_RAM so that kexec doesn't use it for kernel segments.
> >>
> >> [1] https://trustedcomputinggroup.org/wp-content/uploads/PC-ClientPlatform_Profile_for_TPM_2p0_Systems_v49_161114_public-review.pdf
> >>
> >> Signed-off-by: Usama Arif 
> >
> > I would expect the EFI memory map to E820 conversion implemented in
> > the EFI stub to take care of this.
> >
>
> So I have been making a prototype with EFI stub, and the unfinished
> version is looking like a horrible hack.
>
> The only way to do this in libstub is to pass log_tbl all the way from
> efi_retrieve_tcg2_eventlog to efi_stub_entry and from there to
> setup_e820. While going through the efi memory map and converting it to
> e820 table in setup_e820, you have to check if log_tbl falls in any of
> the ranges and if the range is E820_TYPE_RAM. If that condition is
> satisfied, then you have to split that range into 3, i.e. the
> E820_TYPE_RAM range before tpm_log, the tpm_log E820_TYPE_RESERVED
> range, and the E820_TYPE_RAM range after. There are no helper functions,
> so this splitting involves playing with a lot of pointers, and it looks
> quite ugly. I believe doing this way is more likely to introduce bugs.
>
> If we are having to compensate for an EFI bug, would it make sense to do
> it in the way done in RFC and do it in kernel rather than libstub? It is
> simple and very likely to be bug free.
>

I don't see how this could be an EFI bug, given that it does not deal
with E820 tables at all.
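
For illustration only, the three-way split described in the quoted text
can be sketched in plain, standalone C. This is not the libstub code:
struct demo_range, the DEMO_* types and demo_split_range() are
hypothetical names made up for this sketch, and it only handles the
simple case where the log lies entirely inside a single RAM range.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the e820 entry types used in this thread. */
enum demo_e820_type { DEMO_RAM = 1, DEMO_RESERVED = 2 };

struct demo_range {
	uint64_t start;			/* first byte of the range */
	uint64_t size;			/* length in bytes */
	enum demo_e820_type type;
};

/*
 * Split one RAM range around [res_start, res_start + res_size).
 * Writes up to three entries to out[] and returns how many were written.
 * If the input is not RAM, or the reserved region is not fully contained
 * in it, the input entry is copied through unchanged.
 */
static size_t demo_split_range(const struct demo_range *in,
			       uint64_t res_start, uint64_t res_size,
			       struct demo_range out[3])
{
	uint64_t in_end = in->start + in->size;
	uint64_t res_end = res_start + res_size;
	size_t n = 0;

	if (in->type != DEMO_RAM || res_size == 0 || res_end < res_start ||
	    res_start < in->start || res_end > in_end) {
		out[n++] = *in;
		return n;
	}

	/* RAM before the reserved region, if any */
	if (res_start > in->start)
		out[n++] = (struct demo_range){ in->start,
						res_start - in->start,
						DEMO_RAM };

	/* the reserved region itself */
	out[n++] = (struct demo_range){ res_start, res_size, DEMO_RESERVED };

	/* RAM after the reserved region, if any */
	if (res_end < in_end)
		out[n++] = (struct demo_range){ res_end, in_end - res_end,
						DEMO_RAM };

	return n;
}
```

A caller would walk the converted map and pass each entry through a
helper like this one. A log that starts or ends exactly on a range
boundary yields two entries instead of three.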

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in 820_table_firmware

2024-09-11 Thread Ard Biesheuvel
On Wed, 11 Sept 2024 at 12:41, Usama Arif  wrote:
>
> Looking at the TPM spec [1]
>
> If the ACPI TPM2 table contains the address and size of the Platform
> Firmware TCG log, firmware “pins” the memory associated with the
> Platform Firmware TCG log, and reports this memory as “Reserved” memory
> via the INT 15h/E820 interface.
>
> It looks like the firmware should pass this as reserved in e820 memory
> map. However, it doesn't seem to. The firmware being tested on is:
> dmidecode -s bios-version
> edk2-20240214-2.el9
>
> When this area is not reserved, it comes up as usable in
> /sys/firmware/memmap. This means that kexec, which uses that memmap
> to find usable memory regions, can select the region where efi.tpm_log
> is and overwrite it and relocate_kernel.
>
> Having a fix in firmware can be difficult to get through. As a secondary
> fix, this patch marks that region as reserved in e820_table_firmware if it
> is currently E820_TYPE_RAM so that kexec doesn't use it for kernel segments.
>
> [1] 
> https://trustedcomputinggroup.org/wp-content/uploads/PC-ClientPlatform_Profile_for_TPM_2p0_Systems_v49_161114_public-review.pdf
>
> Signed-off-by: Usama Arif 

I would expect the EFI memory map to E820 conversion implemented in
the EFI stub to take care of this.

If you are not booting via the EFI stub, the bootloader is performing
this conversion, and so it should be done there instead.
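
As a rough sketch of the conversion being referred to, the type mapping
looks something like the following. This is a simplified standalone
approximation written from memory of setup_e820() in x86-stub.c, so
treat it as an assumption to check against the tree. The demo_ and DEMO_
names are hypothetical, the numeric values follow the UEFI spec and the
x86 e820 type definitions, and the soft-reserved (EFI_MEMORY_SP)
handling is left out.

```c
/* EFI memory types, numbered as in the UEFI specification. */
enum demo_efi_type {
	DEMO_EFI_RESERVED_TYPE = 0,
	DEMO_EFI_LOADER_CODE,
	DEMO_EFI_LOADER_DATA,
	DEMO_EFI_BOOT_SERVICES_CODE,
	DEMO_EFI_BOOT_SERVICES_DATA,
	DEMO_EFI_RUNTIME_SERVICES_CODE,
	DEMO_EFI_RUNTIME_SERVICES_DATA,
	DEMO_EFI_CONVENTIONAL_MEMORY,
	DEMO_EFI_UNUSABLE_MEMORY,
	DEMO_EFI_ACPI_RECLAIM_MEMORY,
	DEMO_EFI_ACPI_MEMORY_NVS,
	DEMO_EFI_MEMORY_MAPPED_IO,
	DEMO_EFI_MEMORY_MAPPED_IO_PORT_SPACE,
	DEMO_EFI_PAL_CODE,
	DEMO_EFI_PERSISTENT_MEMORY,
};

/* Subset of the x86 e820 types relevant here. */
enum demo_e820_map_type {
	DEMO_E820_TYPE_RAM      = 1,
	DEMO_E820_TYPE_RESERVED = 2,
	DEMO_E820_TYPE_ACPI     = 3,
	DEMO_E820_TYPE_NVS      = 4,
	DEMO_E820_TYPE_UNUSABLE = 5,
	DEMO_E820_TYPE_PMEM     = 7,
};

/* Returns the e820 type for an EFI type, or -1 if the entry is dropped. */
static int demo_efi_to_e820(unsigned int efi_type)
{
	switch (efi_type) {
	case DEMO_EFI_RESERVED_TYPE:
	case DEMO_EFI_RUNTIME_SERVICES_CODE:
	case DEMO_EFI_RUNTIME_SERVICES_DATA:
	case DEMO_EFI_MEMORY_MAPPED_IO:
	case DEMO_EFI_MEMORY_MAPPED_IO_PORT_SPACE:
	case DEMO_EFI_PAL_CODE:
		return DEMO_E820_TYPE_RESERVED;
	case DEMO_EFI_UNUSABLE_MEMORY:
		return DEMO_E820_TYPE_UNUSABLE;
	case DEMO_EFI_ACPI_RECLAIM_MEMORY:
		return DEMO_E820_TYPE_ACPI;
	case DEMO_EFI_ACPI_MEMORY_NVS:
		return DEMO_E820_TYPE_NVS;
	case DEMO_EFI_PERSISTENT_MEMORY:
		return DEMO_E820_TYPE_PMEM;
	case DEMO_EFI_LOADER_CODE:
	case DEMO_EFI_LOADER_DATA:
	case DEMO_EFI_BOOT_SERVICES_CODE:
	case DEMO_EFI_BOOT_SERVICES_DATA:
	case DEMO_EFI_CONVENTIONAL_MEMORY:
		/* everything the OS may reuse after boot becomes RAM */
		return DEMO_E820_TYPE_RAM;
	default:
		return -1;
	}
}
```

Under this mapping, anything the firmware or loader hands over as loader
or boot services memory is reported as RAM, which is why a region that
must survive kexec needs either a different EFI memory type or an
explicit adjustment at conversion time.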


> ---
>  arch/x86/include/asm/e820/api.h | 2 ++
>  arch/x86/kernel/e820.c  | 6 ++
>  arch/x86/platform/efi/efi.c | 9 +
>  drivers/firmware/efi/tpm.c  | 2 +-
>  include/linux/efi.h | 7 +++
>  5 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/e820/api.h b/arch/x86/include/asm/e820/api.h
> index 2e74a7f0e935..4e9aa24f03bd 100644
> --- a/arch/x86/include/asm/e820/api.h
> +++ b/arch/x86/include/asm/e820/api.h
> @@ -16,6 +16,8 @@ extern bool e820__mapped_all(u64 start, u64 end, enum 
> e820_type type);
>
>  extern void e820__range_add   (u64 start, u64 size, enum e820_type type);
>  extern u64  e820__range_update(u64 start, u64 size, enum e820_type old_type, 
> enum e820_type new_type);
> +extern u64  e820__range_update_firmware(u64 start, u64 size, enum e820_type 
> old_type,
> +   enum e820_type new_type);
>  extern u64  e820__range_remove(u64 start, u64 size, enum e820_type old_type, 
> bool check_type);
>  extern u64  e820__range_update_table(struct e820_table *t, u64 start, u64 
> size, enum e820_type old_type, enum e820_type new_type);
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 4893d30ce438..912400161623 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -538,6 +538,12 @@ u64 __init e820__range_update_table(struct e820_table 
> *t, u64 start, u64 size,
> return __e820__range_update(t, start, size, old_type, new_type);
>  }
>
> +u64 __init e820__range_update_firmware(u64 start, u64 size, enum e820_type 
> old_type,
> +  enum e820_type new_type)
> +{
> +   return __e820__range_update(e820_table_firmware, start, size, 
> old_type, new_type);
> +}
> +
>  /* Remove a range of memory from the E820 table: */
>  u64 __init e820__range_remove(u64 start, u64 size, enum e820_type old_type, 
> bool check_type)
>  {
> diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
> index 88a96816de9a..aa95f77d7a30 100644
> --- a/arch/x86/platform/efi/efi.c
> +++ b/arch/x86/platform/efi/efi.c
> @@ -171,6 +171,15 @@ static void __init do_add_efi_memmap(void)
> e820__update_table(e820_table);
>  }
>
> +/* Reserve firmware area if it was marked as RAM */
> +void arch_update_firmware_area(u64 addr, u64 size)
> +{
> +   if (e820__get_entry_type(addr, addr + size) == E820_TYPE_RAM) {
> +   e820__range_update_firmware(addr, size, E820_TYPE_RAM, 
> E820_TYPE_RESERVED);
> +   e820__update_table(e820_table_firmware);
> +   }
> +}
> +
>  /*
>   * Given add_efi_memmap defaults to 0 and there is no alternative
>   * e820 mechanism for soft-reserved memory, import the full EFI memory
> diff --git a/drivers/firmware/efi/tpm.c b/drivers/firmware/efi/tpm.c
> index e8d69bd548f3..8e6e7131d718 100644
> --- a/drivers/firmware/efi/tpm.c
> +++ b/drivers/firmware/efi/tpm.c
> @@ -60,6 +60,7 @@ int __init efi_tpm_eventlog_init(void)
> }
>
> tbl_size = sizeof(*log_tbl) + log_tbl->size;
> +   arch_update_firmware_area(efi.tpm_log, tbl_size);
> memblock_reserve(efi.tpm_log, tbl_size);
>
> if (efi.tpm_final_log == EFI_INVALID_TABLE_ADDR) {
> @@ -107,4 +108,3 @@ int __init efi_tpm_eventlog_init(void)
> early_memunmap(log_tbl, sizeof(*log_tbl));
> return ret;
>  }
> -
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 6bf3c4fe8511..9c239cdff771 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -1371,4 +1371,11 @@ extern struct b

Re: [RFCv2 0/9] UEFI emulator for kexec

2024-09-09 Thread Ard Biesheuvel
On Mon, 9 Sept 2024 at 15:49, Philipp Rudo  wrote:
>
> Hi Lennart,
> Hi Jan,
>
> On Mon, 9 Sep 2024 12:42:45 +0200
> Jan Hendrik Farr  wrote:
>
> > On 09 11:48:30, Lennart Poettering wrote:
> > > On Fr, 06.09.24 12:54, Philipp Rudo (pr...@redhat.com) wrote:
> > >
> > > > I mostly agree with what you have written. But I see a big problem in
> > > > running the EFI emulator in user space when it comes to secure boot.
> > > > The chain of trust ends in the kernel. So it's the kernel that needs to
> > > > verify that the image to be loaded can be trusted. But when the EFI
> > > > runtime is in user space the kernel simply cannot do that. Which means,
> > > > if we want to go this way, we would need to extend the chain of trust
> > > > to user space. Which will be a whole bucket of worms, not just a
> > > > can.
> > >
> > > Maybe it would be nice to have a way to "zap" userspace away, i.e. allow
> > > the kernel to get rid of all processes in some way, reliably. And then
> > > simply start a new userspace, from a trusted definition. Or in other
> > > words: if you don't want to trust the usual userspace, then let's
> > > maybe just terminate it, and create it anew, with a clean, pristine
> > > definition the old userspace cannot get access to.
> >
> > Well, this is an interesting idea!
> >
> > However, I'm sceptical if this could be done in a secure way. How do we
> > ensure that nothing the old userspace did with the various interfaces to
> > the kernel has any impact on the new userspace? Maybe others can chime in
> > on this? Does kernel_lockdown give more guarantees related to this?
> >
> > Even if this is possible in a secure way, there is a problem with doing
> > this for kernels that are to be kexec'd on kernel panic. In this
> > approach we can't pre-run them until EBS(), so we would rely on the old
> > kernel to still be intact when we want to kexec reboot.
>
> I don't believe there's a way to do that on running kernels. As Jan
> pointed out, this cannot be done during reboot, as for kdump that would
> mean to run after a panic. So it would need to run when the new image
> is loaded. But at that time your user space is running. Plus you also
> always have a user space component that triggers kexec. So you cannot
> simply "zap" user space but have to somehow stash it away, run your
> trusted user space and, then restore the old user space again. That
> sounds pretty error prone to me. Plus it will tank your performance
> every time you do a kexec, which for kdump is every boot...
>

kdump has a kexec kernel 'standby' to launch when the kernel panics.
So for the UKI/EFI payload case, this would imply that the load
involves running the payload until EBS() and freezing the state.

Whether execution occurs in true user space or in a deprivileged
kernel context is an implementation detail, imho. We don't want to run
external code in privileged mode inside the kernel in any case, as
this would violate lockdown already. But it should be feasible to have
an EFI-compatible layer in the kernel that invokes the EFI entrypoint
of an image in a way that protects the host kernel. This could be user
mode on the CPU or perhaps a minimal KVM virtual machine.

The advantage of this approach is that the whole concept of purgatory
can be avoided - the EFI boot phase runs in parallel with the previous
kernel, which has full control over authentication and [emulated] PCR
extension, and has ultimate control over whether the kexec reboot is
permitted.

> > You could do a system where you kexec into an intermediate kernel. That
> > kernel gets kexec'd with a signed initrd that can use the normal
> > kexec_load syscall and do any kind of preparation in userspace.
> > Problem: For that intermediate environment we already need a format
> > that combines kernel image, initrd, cmdline all signed in one package
> > aka UKI. Was it the chicken or the egg?
> >
> > But this shows that if we implemented UKIs the easy way (kernel simply
> > checks signature, extracts the pieces, and kexecs them like normal),
> > this approach could always be used to support kexec for other future
> > formats. They could use the kernel's UKI support to boot into an
> > intermediate kernel with UEFI implemented in userspace in the initrd.
> >
> > So basically support UKIs the easy way and use them to be able to
> > securely zap away userspace and start with a fresh kernel and signed
> > userspace as a way to support other UEFI formats that are not UKI.
>
> Well, in theory that should work. But I see several problems:
>
> 1) How does the first kernel tell the intermediate kernel which
> file(s) with which command line to load? In fact, how does the first
> kernel get the information itself? You would need a new system call
> that takes two kernel images, one for the intermediate and one for the
> kernel to load, for that.
>
> Of course you could also build the intermediate UKI during kernel build
> and include it into the image. Similar to what is done with the
> 

Re: [PATCH v10 20/20] x86/efi: EFI stub DRTM launch support for Secure Launch

2024-08-29 Thread Ard Biesheuvel
On Thu, 29 Aug 2024 at 15:24, Jonathan McDowell  wrote:
>
> On Wed, Aug 28, 2024 at 01:19:16PM -0700, ross.philip...@oracle.com wrote:
> > On 8/28/24 10:14 AM, Ard Biesheuvel wrote:
> > > On Wed, 28 Aug 2024 at 19:09, kernel test robot  wrote:
> > > >
> > > > Hi Ross,
> > > >
> > > > kernel test robot noticed the following build warnings:
> > > >
> > > > [auto build test WARNING on tip/x86/core]
> > > > [also build test WARNING on char-misc/char-misc-testing 
> > > > char-misc/char-misc-next char-misc/char-misc-linus 
> > > > herbert-cryptodev-2.6/master efi/next linus/master v6.11-rc5]
> > > > [cannot apply to herbert-crypto-2.6/master next-20240828]
> > > > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > > > And when submitting patch, we suggest to use '--base' as documented in
> > > > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> > > >
> > > > url:
> > > > https://github.com/intel-lab-lkp/linux/commits/Ross-Philipson/Documentation-x86-Secure-Launch-kernel-documentation/20240827-065225
> > > > base:   tip/x86/core
> > > > patch link:
> > > > https://lore.kernel.org/r/20240826223835.3928819-21-ross.philipson%40oracle.com
> > > > patch subject: [PATCH v10 20/20] x86/efi: EFI stub DRTM launch support 
> > > > for Secure Launch
> > > > config: i386-randconfig-062-20240828 
> > > > (https://download.01.org/0day-ci/archive/20240829/202408290030.febuhhbr-...@intel.com/config)
> > >
> > >
> > > This is an i386 32-bit build, which makes no sense: this stuff should
> > > just declare 'depends on 64BIT'
> >
> > Our config entry already has 'depends on X86_64' which in turn depends on
> > 64BIT. I would think that would be enough. Do you think it needs an explicit
> > 'depends on 64BIT' in our entry as well?
>
> The error is in x86-stub.c, which is pre-existing and compiled for 32
> bit as well, so you need more than a "depends" here.
>

Ugh, that means this is my fault - apologies. Replacing the #ifdef
with IS_ENABLED() makes the code visible to the 32-bit compiler, even
though the code is disregarded.

I'd still prefer IS_ENABLED(), but this would require the code in
question to live in a separate compilation unit (which depends on
CONFIG_SECURE_LAUNCH). If that is too fiddly, feel free to bring back
the #ifdef CONFIG_SECURE_LAUNCH here (and retain my R-b)
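
To make the difference concrete, here is a minimal standalone sketch.
DEMO_SECURE_LAUNCH and DEMO_IS_ENABLED() are hypothetical stand-ins for
the Kconfig symbol and the kernel's IS_ENABLED() macro (the real macro
also copes with undefined CONFIG_* symbols, which this one does not).
With the plain C condition, the guarded body is still parsed and
type-checked on every build, so a cast that is only valid on 64-bit gets
diagnosed on 32-bit even though the branch is later discarded. With the
preprocessor guard, the compiler never sees the body at all.

```c
#include <stdint.h>

/* Hypothetical stand-in for a Kconfig symbol: 1 = enabled, 0 = disabled. */
#ifndef DEMO_SECURE_LAUNCH
#define DEMO_SECURE_LAUNCH 0
#endif

/* Simplified stand-in for the kernel's IS_ENABLED(). */
#define DEMO_IS_ENABLED(opt) (opt)

/*
 * Variant 1: ordinary C condition. The body is compiled in every
 * configuration and dropped by the optimizer when the option is off.
 */
static uint64_t demo_addr_if(const void *p)
{
	if (DEMO_IS_ENABLED(DEMO_SECURE_LAUNCH))
		return (uint64_t)(uintptr_t)p;	/* size-preserving cast first */
	return 0;
}

/*
 * Variant 2: preprocessor guard. When the option is off, the guarded
 * statement is removed before the compiler proper runs.
 */
static uint64_t demo_addr_ifdef(const void *p)
{
#if DEMO_SECURE_LAUNCH
	return (uint64_t)(uintptr_t)p;
#else
	(void)p;
	return 0;
#endif
}
```

Going through uintptr_t before widening is one common way to keep such a
pointer-to-integer cast size-preserving on both 32-bit and 64-bit
builds. Whether that is appropriate for the code in question is a
separate decision from the #ifdef versus IS_ENABLED() choice.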



Re: [PATCH v8 01/15] x86/boot: Place kernel_info at a fixed offset

2024-08-28 Thread Ard Biesheuvel
(cc Stuart)

On Thu, 21 Mar 2024 at 15:46, Daniel P. Smith
 wrote:
>
> Hi Ard!
>
> On 2/15/24 02:56, Ard Biesheuvel wrote:
> > On Wed, 14 Feb 2024 at 23:31, Ross Philipson  
> > wrote:
> >>
> >> From: Arvind Sankar 
> >>
> >> There are use cases for storing the offset of a symbol in kernel_info.
> >> For example, the trenchboot series [0] needs to store the offset of the
> >> Measured Launch Environment header in kernel_info.
> >>
> >
> > Why? Is this information consumed by the bootloader?
>
> Yes, the bootloader needs a standardized means to find the offset of the
> MLE header, which communicates a set of meta-data needed by the DCE in
> order to set up for and start the loaded kernel. Arm will also need to
> provide a similar metadata structure and alternative entry point (or a
> complete rewrite of the existing entry point), as the current Arm entry
> point is in direct conflict with Arm DRTM specification.
>

Digging up an old thread here: could you elaborate on this? What do
you mean by 'Arm entry point' and how does it conflict directly with
the Arm DRTM specification? The Linux/arm64 port predates that spec by
about 10 years, so I would expect the latter to take the former into
account. If that failed to happen, we should fix the spec while we
still can.

Thanks,
Ard.



Re: [PATCH v10 20/20] x86/efi: EFI stub DRTM launch support for Secure Launch

2024-08-28 Thread Ard Biesheuvel
On Wed, 28 Aug 2024 at 19:09, kernel test robot  wrote:
>
> Hi Ross,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on tip/x86/core]
> [also build test WARNING on char-misc/char-misc-testing 
> char-misc/char-misc-next char-misc/char-misc-linus 
> herbert-cryptodev-2.6/master efi/next linus/master v6.11-rc5]
> [cannot apply to herbert-crypto-2.6/master next-20240828]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url:
> https://github.com/intel-lab-lkp/linux/commits/Ross-Philipson/Documentation-x86-Secure-Launch-kernel-documentation/20240827-065225
> base:   tip/x86/core
> patch link:
> https://lore.kernel.org/r/20240826223835.3928819-21-ross.philipson%40oracle.com
> patch subject: [PATCH v10 20/20] x86/efi: EFI stub DRTM launch support for 
> Secure Launch
> config: i386-randconfig-062-20240828 
> (https://download.01.org/0day-ci/archive/20240829/202408290030.febuhhbr-...@intel.com/config)


This is an i386 32-bit build, which makes no sense: this stuff should
just declare 'depends on 64BIT'


> compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 
> 617a15a9eac96088ae5e9134248d8236e34b91b1)
> reproduce (this is a W=1 build): 
> (https://download.01.org/0day-ci/archive/20240829/202408290030.febuhhbr-...@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version 
> of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot 
> | Closes: 
> https://lore.kernel.org/oe-kbuild-all/202408290030.febuhhbr-...@intel.com/
>
> sparse warnings: (new ones prefixed by >>)
> >> drivers/firmware/efi/libstub/x86-stub.c:945:41: sparse: sparse: non 
> >> size-preserving pointer to integer cast
>drivers/firmware/efi/libstub/x86-stub.c:953:65: sparse: sparse: non 
> size-preserving pointer to integer cast
> >> drivers/firmware/efi/libstub/x86-stub.c:980:70: sparse: sparse: non 
> >> size-preserving integer to pointer cast
>drivers/firmware/efi/libstub/x86-stub.c:1014:45: sparse: sparse: non 
> size-preserving integer to pointer cast
>
> vim +945 drivers/firmware/efi/libstub/x86-stub.c
>
>927
>928  static bool efi_secure_launch_update_boot_params(struct slr_table 
> *slrt,
>929   struct boot_params 
> *boot_params)
>930  {
>931  struct slr_entry_intel_info *txt_info;
>932  struct slr_entry_policy *policy;
>933  struct txt_os_mle_data *os_mle;
>934  bool updated = false;
>935  int i;
>936
>937  txt_info = slr_next_entry_by_tag(slrt, NULL, 
> SLR_ENTRY_INTEL_INFO);
>938  if (!txt_info)
>939  return false;
>940
>941  os_mle = txt_os_mle_data_start((void *)txt_info->txt_heap);
>942  if (!os_mle)
>943  return false;
>944
>  > 945  os_mle->boot_params_addr = (u64)boot_params;
>946
>947  policy = slr_next_entry_by_tag(slrt, NULL, 
> SLR_ENTRY_ENTRY_POLICY);
>948  if (!policy)
>949  return false;
>950
>951  for (i = 0; i < policy->nr_entries; i++) {
>952  if (policy->policy_entries[i].entity_type == 
> SLR_ET_BOOT_PARAMS) {
>953  policy->policy_entries[i].entity = 
> (u64)boot_params;
>954  updated = true;
>955  break;
>956  }
>957  }
>958
>959  /*
>960   * If this is a PE entry into EFI stub the mocked up boot 
> params will
>961   * be missing some of the setup header data needed for the 
> second stage
>962   * of the Secure Launch boot.
>963   */
>964  if (image) {
>965  struct setup_header *hdr = (struct setup_header 
> *)((u8 *)image->image_base +
>966  offsetof(struct 
> boot_params, hdr));
>967  u64 cmdline_ptr;
>968
>969  boot_params->hdr.setup_sects = hdr->setup_sects;
>970  boot_params->hdr.syssize = hdr->syssize;
>971  boot_params->hdr.version = hdr->version;
>972  boot_params->hdr.loadflags = hdr->loadflags;
>973  boot_params->hdr.kernel_alignment = 
> hdr->kernel_alignment;
>974  boot_params->hdr.min_alignment = hdr->min_alignment;
>975  boot_params->hdr.xloadflags = hdr->xloadflags;
>976  boot_params->hdr.init_size = hdr->init_size;
>977  boot_params->hdr.kernel_info_offset = 
> hdr->kernel_info_offset;
>978  efi_set_u64_fo

Re: [RFCv2 0/9] UEFI emulator for kexec

2024-08-28 Thread Ard Biesheuvel
On Mon, 19 Aug 2024 at 16:55, Pingfan Liu  wrote:
>
> *** Background ***
>
> As more PE format kernel images are introduced, it poses a challenge for
> kexec to cope with the new format.
>
> In my attempt to add support for arm64 zboot image in the kernel [1],
> Ard suggested using an emulator to tackle this issue.  Last year, when
> Jan tried to introduce UKI support in the kernel [2], Ard mentioned the
> emulator approach again [3]
>
> After discussion, Ard's approach seems to be a more promising solution
> to handle PE format kernels once and for all.  This series follows that
> approach and implements an emulator to emulate EFI boot time services,
> allowing the efistub kernel to self-extract and boot.
>
> Another year has passed, and UKI kernels are more and more frequently used
> in products. I think it is time to put in the effort to resolve this issue.
>
>
> *** Overview of the implementation ***
> The whole model consists of three parts:
>
> -1. The emulator
> It is self-relocatable PIC code, which is finally linked into the kernel
> but does not export any internal symbol to the kernel.  It mainly contains
> a PE file parser, which loads the PE format kernel, and a group of
> functions to emulate the EFI boot services.
>
> -2. The in-kernel PE-format loader
> Its main task is to set up two extra kexec_segments, one for the emulator
> and the other for passing information from the first kernel to the
> emulator.
>
> -3. An identity mapping set up only for the memory used by the emulator
> Here it relies on kimage_alloc_control_pages() to get pages, which will not
> be stomped on during the kexec relocation (copy from src to dst). And since
> the mapping only covers a small range of memory, it costs only a small
> amount of memory.
>
>
> *** To do ***
>
> Currently, it only works on the arm64 virt machine. For x86, it needs some
> slight changes. (I plan to do it in the next version.)
>
> Also, this series does not implement a memory allocator, which I plan to
> implement with the help of a bitmap.
>
> As for the console, it is currently hard-coded for the arm64 virt machine;
> later it should extract the information from the ACPI tables.
>
> The kdump code is not implemented yet, but it should share the majority of
> this series.
>
>
> *** Test of this series ***
> I have tested this series on the arm64 virt machine. There I booted
> vmlinuz.efi, loaded a UKI image with kexec_file_load, and then switched to
> the second kernel.
>
> I used a modified kexec-tools [4], which just skips the check of the file
> format and passes the file directly to the kernel.
>
> [1]: 
> https://lore.kernel.org/linux-arm-kernel/zbvksis+dfnqa...@piliu.users.ipa.redhat.com/T/#m42abb0ad3c10126b8b3bfae8a596deb707d6f76e
> [2]: https://lore.kernel.org/lkml/20230918173607.421d2616@rotkaeppchen/T/
> [3]: 
> https://lore.kernel.org/lkml/20230918173607.421d2616@rotkaeppchen/T/#mc60aa591cb7616ceb39e1c98f352383f9ba6e985
> [4]: https://github.com/pfliu/kexec-tools.git branch: kexec_uefi_emulator
>
>
> RFCv1 -> RFCv2:
> -1.Support to run UKI kernel by: add LoadImage() and StartImage(), add
>PE file relocation support, add InstallMultiProtocol()
> -2.Also set up idmap for EFI runtime memory descriptor since UKI's
>systemd-stub calls runtime service
> -3.Move kexec_pe_image.c from arch/arm64/kernel to kernel/, since it
>aims to provide a more general architecture support.
>
> RFCv1: 
> https://lore.kernel.org/linux-efi/20240718085759.13247-1-pi...@redhat.com/
> RFCv2: https://github.com/pfliu/linux.git  branch kexec_uefi_emulator_RFCv2
>
> Cc: Ard Biesheuvel 
> Cc: Jan Hendrik Farr 
> Cc: Philipp Rudo 
> Cc: Lennart Poettering 
> Cc: Jarkko Sakkinen 
> Cc: Eric Biederman 
> Cc: Baoquan He 
> Cc: Dave Young 
> Cc: Mark Rutland 
> Cc: Will Deacon 
> Cc: Catalin Marinas 
> Cc: kexec@lists.infradead.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
>
>
>
> Pingfan Liu (9):
>   efi/libstub: Ask efi_random_alloc() to skip unusable memory
>   efi/libstub: Complete efi_simple_text_output_protocol
>   efi/emulator: Initial rountines to emulate EFI boot time service
>   efi/emulator: Turn on mmu for arm64
>   kexec: Introduce kexec_pe_image to parse and load PE file
>   arm64: kexec: Introduce a new member param_mem to kimage_arch
>   arm64: mm: Change to prototype of
>   arm64: kexec: Prepare page table for emulator
>   arm64: kexec: Enable kexec_pe_image
>

Thanks for putting this RFC together. This is useful work, and gives
us food for thought and discussion.

There are a few problems that become apparent when going through these changes.

1. Implementing UEFI entirely is intractable, and unnecessary.
Implementing the subset of UEFI that is 

Re: [RFCv2 1/9] efi/libstub: Ask efi_random_alloc() to skip unusable memory

2024-08-28 Thread Ard Biesheuvel
On Mon, 19 Aug 2024 at 16:55, Pingfan Liu  wrote:
>
> efi_random_alloc() requests EFI_ALLOCATE_ADDRESS when calling
> allocate_pages(), but the current implementation cannot ensure that the
> selected target lies inside a free area, i.e. one that excludes
> EFI_BOOT_SERVICES_*, EFI_RUNTIME_SERVICES_* etc.
>
> Fix the issue by checking md->type.
>
> Signed-off-by: Pingfan Liu 
> Cc: Ard Biesheuvel 
> To: linux-...@vger.kernel.org
> ---
>  drivers/firmware/efi/libstub/randomalloc.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/firmware/efi/libstub/randomalloc.c 
> b/drivers/firmware/efi/libstub/randomalloc.c
> index c41e7b2091cdd..7304e767688f2 100644
> --- a/drivers/firmware/efi/libstub/randomalloc.c
> +++ b/drivers/firmware/efi/libstub/randomalloc.c
> @@ -79,6 +79,8 @@ efi_status_t efi_random_alloc(unsigned long size,
> efi_memory_desc_t *md = (void *)map->map + map_offset;
> unsigned long slots;
>
> +   if (!(md->type & (EFI_CONVENTIONAL_MEMORY || 
> EFI_PERSISTENT_MEMORY)))
> +   continue;

This is wrong in 3 different ways:
- md->type is not a bitmask
- || is not bitwise but boolean
- get_entry_num_slots() ignores all memory types except
EFI_CONVENTIONAL_MEMORY anyway.

So what exactly are you trying to fix here?
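
To spell out the first two points, here is a small standalone sketch.
The DEMO_ constants carry the UEFI spec values for the two memory types
(7 and 14), and the function names are hypothetical. Because || yields
0 or 1, the posted condition reduces to testing bit 0 of the type, which
has nothing to do with the two types named in it.

```c
#include <stdbool.h>

/* EFI memory type values as assigned by the UEFI specification. */
#define DEMO_EFI_CONVENTIONAL_MEMORY	7
#define DEMO_EFI_PERSISTENT_MEMORY	14

/*
 * The condition as posted; true means "skip this descriptor".
 * (7 || 14) is the boolean value 1, so this is really !(type & 1).
 */
static bool demo_skip_as_posted(unsigned int type)
{
	return !(type & (DEMO_EFI_CONVENTIONAL_MEMORY ||
			 DEMO_EFI_PERSISTENT_MEMORY));
}

/*
 * What the check presumably meant to express. The type field is an
 * enumeration, so it has to be compared for equality.
 */
static bool demo_skip_by_equality(unsigned int type)
{
	return type != DEMO_EFI_CONVENTIONAL_MEMORY &&
	       type != DEMO_EFI_PERSISTENT_MEMORY;
}
```

With the posted form, every odd-numbered type (boot services code is 3,
runtime services code is 5) passes the filter, while persistent memory
(14) is skipped. The equality form is shown only for contrast; given the
third point above, it would still be redundant for the conventional
memory case.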


> slots = get_entry_num_slots(md, size, ilog2(align), alloc_min,
> alloc_max);
> MD_NUM_SLOTS(md) = slots;
> @@ -111,6 +113,9 @@ efi_status_t efi_random_alloc(unsigned long size,
> efi_physical_addr_t target;
> unsigned long pages;
>
> +   if (!(md->type & (EFI_CONVENTIONAL_MEMORY || 
> EFI_PERSISTENT_MEMORY)))
> +   continue;
> +
> if (total_mirrored_slots > 0 &&
> !(md->attribute & EFI_MEMORY_MORE_RELIABLE))
> continue;
> --
> 2.41.0
>



Re: [PATCH v10 08/20] x86/boot: Place TXT MLE header in the kernel_info section

2024-08-27 Thread Ard Biesheuvel
On Tue, 27 Aug 2024 at 00:42, Ross Philipson  wrote:
>
> The MLE (measured launch environment) header must be locatable by the
> boot loader and TXT must be set up to do a launch with this header's
> location. While the offset to the kernel_info structure does not need
> to be at a fixed offset, the offsets in the header must be relative
> offsets from the start of the setup kernel. The support in the linker
> file achieves this.
>
> Signed-off-by: Ross Philipson 
> Suggested-by: Ard Biesheuvel 

Reviewed-by: Ard Biesheuvel 

> ---
>  arch/x86/boot/compressed/kernel_info.S | 50 +++---
>  arch/x86/boot/compressed/vmlinux.lds.S |  7 
>  2 files changed, 53 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/kernel_info.S 
> b/arch/x86/boot/compressed/kernel_info.S
> index f818ee8fba38..a0604a0d1756 100644
> --- a/arch/x86/boot/compressed/kernel_info.S
> +++ b/arch/x86/boot/compressed/kernel_info.S
> @@ -1,12 +1,20 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>
> +#include 
>  #include 
>
> -   .section ".rodata.kernel_info", "a"
> +/*
> + * The kernel_info structure is not placed at a fixed offest in the
> + * kernel image. So this macro and the support in the linker file
> + * allow the relative offsets for the MLE header within the kernel
> + * image to be configured at build time.
> + */
> +#define roffset(X) ((X) - kernel_info)
>
> -   .global kernel_info
> +   .section ".rodata.kernel_info", "a"
>
> -kernel_info:
> +   .balign 16
> +SYM_DATA_START(kernel_info)
> /* Header, Linux top (structure). */
> .ascii  "LToP"
> /* Size. */
> @@ -17,6 +25,40 @@ kernel_info:
> /* Maximal allowed type for setup_data and setup_indirect structs. */
> .long   SETUP_TYPE_MAX
>
> +   /* Offset to the MLE header structure */
> +#if IS_ENABLED(CONFIG_SECURE_LAUNCH)
> +   .long   roffset(mle_header_offset)
> +#else
> +   .long   0
> +#endif
> +
>  kernel_info_var_len_data:
> /* Empty for time being... */
> -kernel_info_end:
> +SYM_DATA_END_LABEL(kernel_info, SYM_L_LOCAL, kernel_info_end)
> +
> +#if IS_ENABLED(CONFIG_SECURE_LAUNCH)
> +   /*
> +* The MLE Header per the TXT Specification, section 2.1
> +* MLE capabilities, see table 4. Capabilities set:
> +* bit 0: Support for GETSEC[WAKEUP] for RLP wakeup
> +* bit 1: Support for RLP wakeup using MONITOR address
> +* bit 2: The ECX register will contain the pointer to the MLE page 
> table
> +* bit 5: TPM 1.2 family: Details/authorities PCR usage support
> +* bit 9: Supported format of TPM 2.0 event log - TCG compliant
> +*/
> +SYM_DATA_START(mle_header)
> +   .long   0x9082ac5a  /* UUID0 */
> +   .long   0x74a7476f  /* UUID1 */
> +   .long   0xa2555c0f  /* UUID2 */
> +   .long   0x42b651cb  /* UUID3 */
> +   .long   0x0034  /* MLE header size */
> +   .long   0x00020002  /* MLE version 2.2 */
> +   .long   roffset(sl_stub_entry_offset) /* Linear entry point of MLE 
> (virt. address) */
> +   .long   0x  /* First valid page of MLE */
> +   .long   0x  /* Offset within binary of first byte of MLE */
> +   .long   roffset(_edata_offset) /* Offset within binary of last byte + 
> 1 of MLE */
> +   .long   0x0227  /* Bit vector of MLE-supported capabilities */
> +   .long   0x  /* Starting linear address of command line 
> (unused) */
> +   .long   0x  /* Ending linear address of command line (unused) 
> */
> +SYM_DATA_END(mle_header)
> +#endif
> diff --git a/arch/x86/boot/compressed/vmlinux.lds.S 
> b/arch/x86/boot/compressed/vmlinux.lds.S
> index 083ec6d7722a..f82184801462 100644
> --- a/arch/x86/boot/compressed/vmlinux.lds.S
> +++ b/arch/x86/boot/compressed/vmlinux.lds.S
> @@ -118,3 +118,10 @@ SECTIONS
> }
> ASSERT(SIZEOF(.rela.dyn) == 0, "Unexpected run-time relocations 
> (.rela) detected!")
>  }
> +
> +#ifdef CONFIG_SECURE_LAUNCH
> +PROVIDE(kernel_info_offset  = ABSOLUTE(kernel_info - startup_32));
> +PROVIDE(mle_header_offset   = kernel_info_offset + ABSOLUTE(mle_header - 
> startup_32));
> +PROVIDE(sl_stub_entry_offset= kernel_info_offset + 
> ABSOLUTE(sl_stub_entry - startup_32));
> +PROVIDE(_edata_offset   = kernel_info_offset + ABSOLUTE(_edata - 
> startup_32));
> +#endif
> --
> 2.39.3
>



Re: [PATCH v10 20/20] x86/efi: EFI stub DRTM launch support for Secure Launch

2024-08-27 Thread Ard Biesheuvel
On Tue, 27 Aug 2024 at 00:44, Ross Philipson  wrote:
>
> This support allows the DRTM launch to be initiated after an EFI stub
> launch of the Linux kernel is done. This is accomplished by providing
> a handler to jump to when a Secure Launch is in progress. This has to be
> called after the EFI stub does Exit Boot Services.
>
> Signed-off-by: Ross Philipson 

Reviewed-by: Ard Biesheuvel 

> ---
>  drivers/firmware/efi/libstub/efistub.h  |  8 ++
>  drivers/firmware/efi/libstub/x86-stub.c | 98 +
>  2 files changed, 106 insertions(+)
>
> diff --git a/drivers/firmware/efi/libstub/efistub.h 
> b/drivers/firmware/efi/libstub/efistub.h
> index d33ccbc4a2c6..baf42d6d0796 100644
> --- a/drivers/firmware/efi/libstub/efistub.h
> +++ b/drivers/firmware/efi/libstub/efistub.h
> @@ -135,6 +135,14 @@ void efi_set_u64_split(u64 data, u32 *lo, u32 *hi)
> *hi = upper_32_bits(data);
>  }
>
> +static inline
> +void efi_set_u64_form(u32 lo, u32 hi, u64 *data)
> +{
> +   u64 upper = hi;
> +
> +   *data = lo | upper << 32;
> +}
> +
>  /*
>   * Allocation types for calls to boottime->allocate_pages.
>   */
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c 
> b/drivers/firmware/efi/libstub/x86-stub.c
> index f8e465da344d..04786c1b3b5d 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -9,6 +9,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  #include 
>  #include 
> @@ -923,6 +925,99 @@ static efi_status_t efi_decompress_kernel(unsigned long 
> *kernel_entry)
> return efi_adjust_memory_range_protection(addr, kernel_text_size);
>  }
>
> +static bool efi_secure_launch_update_boot_params(struct slr_table *slrt,
> +struct boot_params 
> *boot_params)
> +{
> +   struct slr_entry_intel_info *txt_info;
> +   struct slr_entry_policy *policy;
> +   struct txt_os_mle_data *os_mle;
> +   bool updated = false;
> +   int i;
> +
> +   txt_info = slr_next_entry_by_tag(slrt, NULL, SLR_ENTRY_INTEL_INFO);
> +   if (!txt_info)
> +   return false;
> +
> +   os_mle = txt_os_mle_data_start((void *)txt_info->txt_heap);
> +   if (!os_mle)
> +   return false;
> +
> +   os_mle->boot_params_addr = (u64)boot_params;
> +
> +   policy = slr_next_entry_by_tag(slrt, NULL, SLR_ENTRY_ENTRY_POLICY);
> +   if (!policy)
> +   return false;
> +
> +   for (i = 0; i < policy->nr_entries; i++) {
> +   if (policy->policy_entries[i].entity_type == 
> SLR_ET_BOOT_PARAMS) {
> +   policy->policy_entries[i].entity = (u64)boot_params;
> +   updated = true;
> +   break;
> +   }
> +   }
> +
> +   /*
> +* If this is a PE entry into EFI stub the mocked up boot params will
> +* be missing some of the setup header data needed for the second 
> stage
> +* of the Secure Launch boot.
> +*/
> +   if (image) {
> +   struct setup_header *hdr = (struct setup_header *)((u8 
> *)image->image_base +
> +   offsetof(struct boot_params, 
> hdr));
> +   u64 cmdline_ptr;
> +
> +   boot_params->hdr.setup_sects = hdr->setup_sects;
> +   boot_params->hdr.syssize = hdr->syssize;
> +   boot_params->hdr.version = hdr->version;
> +   boot_params->hdr.loadflags = hdr->loadflags;
> +   boot_params->hdr.kernel_alignment = hdr->kernel_alignment;
> +   boot_params->hdr.min_alignment = hdr->min_alignment;
> +   boot_params->hdr.xloadflags = hdr->xloadflags;
> +   boot_params->hdr.init_size = hdr->init_size;
> +   boot_params->hdr.kernel_info_offset = hdr->kernel_info_offset;
> +   efi_set_u64_form(boot_params->hdr.cmd_line_ptr, 
> boot_params->ext_cmd_line_ptr,
> +&cmdline_ptr);
> +   boot_params->hdr.cmdline_size = strlen((const char 
> *)cmdline_ptr);
> +   }
> +
> +   return updated;
> +}
> +
> +static void efi_secure_launch(struct boot_params *boot_params)
> +{
> +   struct slr_entry_dl_info *dlinfo;
> +   efi_guid_t guid = SLR_TABLE_GUID;
> +   dl_handler_func handler_callback;
> +   struct slr_table *slrt;
> +
> +   if (!IS_ENABLED(CONFIG_SECURE_LAUNCH))
> +   return;

Re: [PATCH v3 1/2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2024-07-17 Thread Ard Biesheuvel
On Wed, 17 Jul 2024 at 14:32, Steve Wahl  wrote:
>
> From: Tao Liu 
>
> A kexec kernel boot failure is sometimes observed on AMD CPUs due to
> an unmapped EFI config table array.  This can be seen when "nogbpages"
> is on the kernel command line, and has been observed as a full BIOS
> reboot rather than a successful kexec.
>
> This was also the cause of reported regressions attributed to Commit
> 7143c5f4cf20 ("x86/mm/ident_map: Use gbpages only where full GB page
> should be mapped.") which was subsequently reverted.
>
> To avoid this page fault, explicitly include the EFI config table
> array in the kexec identity map.
>
> Further explanation:
>
> The following 2 commits caused the EFI config table array to be
> accessed when enabling sev at kernel startup.
>
> commit ec1c66af3a30 ("x86/compressed/64: Detect/setup SEV/SME features
>   earlier during boot")
> commit c01fce9cef84 ("x86/compressed: Add SEV-SNP feature
>   detection/setup")
>
> This is in the code that examines whether SEV should be enabled or
> not, so it can even affect systems that are not SEV capable.
>
> This may result in a page fault if the EFI config table array's
> address is unmapped. Since the page fault occurs before the new kernel
> establishes its own identity map and page fault routines, it is
> unrecoverable and kexec fails.
>
> Most often, this problem is not seen because the EFI config table
> array gets included in the map by the luck of being placed at a memory
> address close enough to other memory areas that *are* included in the
> map created by kexec.
>
> Both the "nogbpages" command line option and the "use gbpages only
> where full GB page should be mapped" patch greatly reduce the chance
> of being included in the map by luck, which is why the problem
> appears.
>
> Signed-off-by: Tao Liu 
> Signed-off-by: Steve Wahl 
> Tested-by: Pavin Joseph 
> Tested-by: Sarah Brofeldt 
> Tested-by: Eric Hagberg 
> ---
>
> Version 3: Do not rename map_efi_systab to map_efi_tables, and don't add
> 'config table' to the comments, per Ard Biesheuvel request.
>
>  arch/x86/kernel/machine_kexec_64.c | 27 +++
>  1 file changed, 27 insertions(+)
>

Reviewed-by: Ard Biesheuvel 



Re: [PATCH v2 1/2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2024-07-16 Thread Ard Biesheuvel
On Mon, 15 Jul 2024 at 11:53, Steve Wahl  wrote:
>
> From: Tao Liu 
>
> A kexec kernel boot failure is sometimes observed on AMD CPUs due to
> an unmapped EFI config table array.  This can be seen when "nogbpages"
> is on the kernel command line, and has been observed as a full BIOS
> reboot rather than a successful kexec.
>
> This was also the cause of reported regressions attributed to Commit
> 7143c5f4cf20 ("x86/mm/ident_map: Use gbpages only where full GB page
> should be mapped.") which was subsequently reverted.
>
> To avoid this page fault, explicitly include the EFI config table
> array in the kexec identity map.
>
> Further explanation:
>
> The following 2 commits caused the EFI config table array to be
> accessed when enabling sev at kernel startup.
>
> commit ec1c66af3a30 ("x86/compressed/64: Detect/setup SEV/SME features
>   earlier during boot")
> commit c01fce9cef84 ("x86/compressed: Add SEV-SNP feature
>   detection/setup")
>
> This is in the code that examines whether SEV should be enabled or
> not, so it can even affect systems that are not SEV capable.
>
> This may result in a page fault if the EFI config table array's
> address is unmapped. Since the page fault occurs before the new kernel
> establishes its own identity map and page fault routines, it is
> unrecoverable and kexec fails.
>
> Most often, this problem is not seen because the EFI config table
> array gets included in the map by the luck of being placed at a memory
> address close enough to other memory areas that *are* included in the
> map created by kexec.
>
> Both the "nogbpages" command line option and the "use gbpages only
> where full GB page should be mapped" patch greatly reduce the chance
> of being included in the map by luck, which is why the problem
> appears.
>
> Signed-off-by: Tao Liu 
> Signed-off-by: Steve Wahl 
> Tested-by: Pavin Joseph 
> Tested-by: Sarah Brofeldt 
> Tested-by: Eric Hagberg 
> ---
>  arch/x86/kernel/machine_kexec_64.c | 35 ++
>  1 file changed, 31 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/machine_kexec_64.c 
> b/arch/x86/kernel/machine_kexec_64.c
> index cc0f7f70b17b..563d119f9f29 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #ifdef CONFIG_ACPI
>  /*
> @@ -83,10 +84,12 @@ const struct kexec_file_ops * const kexec_file_loaders[] 
> = {
>  #endif
>
>  static int
> -map_efi_systab(struct x86_mapping_info *info, pgd_t *level4p)

I think we can keep the name - the array of EFI config table
references could be considered part of the system table, even though
it may live in a separate allocation.

> +map_efi_tables(struct x86_mapping_info *info, pgd_t *level4p)
>  {
>  #ifdef CONFIG_EFI
> unsigned long mstart, mend;
> +   void *kaddr;
> +   int ret;
>
> if (!efi_enabled(EFI_BOOT))
> return 0;
> @@ -102,6 +105,30 @@ map_efi_systab(struct x86_mapping_info *info, pgd_t 
> *level4p)
> if (!mstart)
> return 0;
>
> +   ret = kernel_ident_mapping_init(info, level4p, mstart, mend);
> +   if (ret)
> +   return ret;
> +
> +   kaddr = memremap(mstart, mend - mstart, MEMREMAP_WB);
> +   if (!kaddr) {
> +   pr_err("Could not map UEFI system table\n");
> +   return -ENOMEM;
> +   }
> +
> +   mstart = efi_config_table;
> +
> +   if (efi_enabled(EFI_64BIT)) {
> +   efi_system_table_64_t *stbl = (efi_system_table_64_t *)kaddr;
> +
> +   mend = mstart + sizeof(efi_config_table_64_t) * 
> stbl->nr_tables;
> +   } else {
> +   efi_system_table_32_t *stbl = (efi_system_table_32_t *)kaddr;
> +
> +   mend = mstart + sizeof(efi_config_table_32_t) * 
> stbl->nr_tables;
> +   }
> +
> +   memunmap(kaddr);
> +
> return kernel_ident_mapping_init(info, level4p, mstart, mend);
>  #endif
> return 0;
> @@ -241,10 +268,10 @@ static int init_pgtable(struct kimage *image, unsigned 
> long start_pgtable)
> }
>
> /*
> -* Prepare EFI systab and ACPI tables for kexec kernel since they are
> -* not covered by pfn_mapped.
> +* Prepare EFI systab, config table and ACPI tables for kexec kernel

Please avoid 'config table' here, as it is ambiguous. IMO you can just
drop this hunk (and the one below).

> +* since they are not covered by pfn_mapped.
>  */
> -   result = map_efi_systab(&info, level4p);
> +   result = map_efi_tables(&info, level4p);
> if (result)
> return result;
>
> --
> 2.26.2
>
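
For reference, the range the patch adds to the identity map is just the
array start plus entry size times entry count. Below is a standalone
sketch of that arithmetic only. The struct and function names are
hypothetical stand-ins for the kernel's efi_config_table_{32,64}_t, and
the layout (a 16-byte GUID followed by a pointer-sized table address) is
an assumption stated here rather than taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the 64-bit and 32-bit config table entries. */
struct cfg_entry_64 {
	uint8_t  guid[16];
	uint64_t table;
};

struct cfg_entry_32 {
	uint8_t  guid[16];
	uint32_t table;
};

/*
 * Return the exclusive end address of a config table array that starts
 * at 'start' and holds 'nr_tables' entries.
 */
static uint64_t cfg_array_end(uint64_t start, uint64_t nr_tables, int is_64bit)
{
	size_t entry_size = is_64bit ? sizeof(struct cfg_entry_64)
				     : sizeof(struct cfg_entry_32);

	return start + entry_size * nr_tables;
}
```

The [start, end) pair is what would then be handed to the identity
mapping helper, in the same way the system table range already is.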



Re: [PATCH v9 08/19] x86: Secure Launch kernel early boot stub

2024-06-06 Thread Ard Biesheuvel
Hello Ross,

On Fri, 31 May 2024 at 03:32, Ross Philipson  wrote:
>
> The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> later AMD SKINIT) to vector to during the late launch. The symbol
> sl_stub_entry is that entry point and its offset into the kernel is
> conveyed to the launching code using the MLE (Measured Launch
> Environment) header in the structure named mle_header. The offset of the
> MLE header is set in the kernel_info. The routine sl_stub contains the
> very early late launch setup code responsible for setting up the basic
> environment to allow the normal kernel startup_32 code to proceed. It is
> also responsible for properly waking and handling the APs on Intel
> platforms. The routine sl_main which runs after entering 64b mode is
> responsible for measuring configuration and module information before
> it is used like the boot params, the kernel command line, the TXT heap,
> an external initramfs, etc.
>
> Signed-off-by: Ross Philipson 
> ---
>  Documentation/arch/x86/boot.rst|  21 +
>  arch/x86/boot/compressed/Makefile  |   3 +-
>  arch/x86/boot/compressed/head_64.S |  30 +
>  arch/x86/boot/compressed/kernel_info.S |  34 ++
>  arch/x86/boot/compressed/sl_main.c | 577 
>  arch/x86/boot/compressed/sl_stub.S | 725 +
>  arch/x86/include/asm/msr-index.h   |   5 +
>  arch/x86/include/uapi/asm/bootparam.h  |   1 +
>  arch/x86/kernel/asm-offsets.c  |  20 +
>  9 files changed, 1415 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/boot/compressed/sl_main.c
>  create mode 100644 arch/x86/boot/compressed/sl_stub.S
>
> diff --git a/Documentation/arch/x86/boot.rst b/Documentation/arch/x86/boot.rst
> index 4fd492cb4970..295cdf9bcbdb 100644
> --- a/Documentation/arch/x86/boot.rst
> +++ b/Documentation/arch/x86/boot.rst
> @@ -482,6 +482,14 @@ Protocol:  2.00+
> - If 1, KASLR enabled.
> - If 0, KASLR disabled.
>
> +  Bit 2 (kernel internal): SLAUNCH_FLAG
> +
> +   - Used internally by the setup kernel to communicate
> + Secure Launch status to kernel proper.
> +
> +   - If 1, Secure Launch enabled.
> +   - If 0, Secure Launch disabled.
> +
>Bit 5 (write): QUIET_FLAG
>
> - If 0, print early messages.
> @@ -1028,6 +1036,19 @@ Offset/size: 0x000c/4
>
>This field contains maximal allowed type for setup_data and setup_indirect 
> structs.
>
> +   =
> +Field name:mle_header_offset
> +Offset/size:   0x0010/4
> +   =
> +
> +  This field contains the offset to the Secure Launch Measured Launch 
> Environment
> +  (MLE) header. This offset is used to locate information needed during a 
> secure
> +  late launch using Intel TXT. If the offset is zero, the kernel does not 
> have
> +  Secure Launch capabilities. The MLE entry point is called from TXT on the 
> BSP
> +  following a success measured launch. The specific state of the processors 
> is
> +  outlined in the TXT Software Development Guide, the latest can be found 
> here:
> +  
> https://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-txt-software-development-guide.pdf
> +
>

Could we just repaint this field as the offset relative to the start
of kernel_info rather than relative to the start of the image? That
way, there is no need for patch #1, and given that the consumer of
this field accesses it via kernel_info, I wouldn't expect any issues
in applying this offset to obtain the actual address.
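
To make that concrete, here is a minimal sketch of what the consumer
would do under this proposal. It is not code from the series:
mle_header_addr() and its parameter names are hypothetical, and it
assumes the consumer already holds the address of kernel_info.

```c
#include <stdint.h>

/*
 * Hypothetical helper: resolve the MLE header address when
 * mle_header_offset is relative to the start of kernel_info.
 * Returns 0 when the offset is zero, which the documentation above
 * defines as "no Secure Launch capabilities".
 */
static inline uintptr_t mle_header_addr(uintptr_t kernel_info_base,
					uint32_t mle_header_offset)
{
	if (!mle_header_offset)
		return 0;

	return kernel_info_base + mle_header_offset;
}
```

With this convention the consumer never needs the image base, only the
kernel_info address it is already using to read the field.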


>  The Image Checksum
>  ==
> diff --git a/arch/x86/boot/compressed/Makefile 
> b/arch/x86/boot/compressed/Makefile
> index 9189a0e28686..9076a248d4b4 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -118,7 +118,8 @@ vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
>  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
>  vmlinux-objs-$(CONFIG_EFI_STUB) += 
> $(objtree)/drivers/firmware/efi/libstub/lib.a
>
> -vmlinux-objs-$(CONFIG_SECURE_LAUNCH) += $(obj)/early_sha1.o 
> $(obj)/early_sha256.o
> +vmlinux-objs-$(CONFIG_SECURE_LAUNCH) += $(obj)/early_sha1.o 
> $(obj)/early_sha256.o \
> +   $(obj)/sl_main.o $(obj)/sl_stub.o
>
>  $(obj)/vmlinux: $(vmlinux-objs-y) FORCE
> $(call if_changed,ld)
> diff --git a/arch/x86/boot/compressed/head_64.S 
> b/arch/x86/boot/compressed/head_64.S
> index 1dcb794c5479..803c9e2e6d85 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -420,6 +420,13 @@ SYM_CODE_START(startup_64)
> pushq   $0
> popfq
>
> +#ifdef CONFIG_SECURE_LAUNCH
> +   /* Ensure the relocation region is coverd by a PMR */

covered

> +   movq%rbx, %rdi
> +   movl$(_bss - startup_32), %esi
> +   callq   sl_check_region
> +#endif
> +
>  /*
>   * Copy the compressed kernel to the end of our buffer
>   * where decompression in place becomes safe.
> @@ -462,6 

Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

2024-06-05 Thread Ard Biesheuvel
On Wed, 5 Jun 2024 at 09:43, Borislav Petkov  wrote:
>
> On Wed, Jun 05, 2024 at 10:53:44AM +0800, Dave Young wrote:
> > It's something good to have but not must for the time being,  also no
> > idea how to save the status across boot, for EFI boot case probably a
> > EFI var can be used;
>
> Yes.
>
> > but how can it be cleared in case of physical boot.  Otherwise
> > probably injecting some kernel parameters, anyway this needs more
> > thinking.
>
> Yeah, this'll need proper analysis whether we can even do that reliably.
>
> We need to increment it only on the kexec reboot paths and clear it on
> the normal reboot paths.
>

I'd argue for the opposite: ideally, the difference between the first
boot and not-the-first-boot should be abstracted away by the
'bootloader' side of kexec as much as possible, so that the tricky
early startup code doesn't have to be riddled with different code
paths depending on !kexec vs kexec.

TDX is a good case in point here: rather than adding more conditionals,
I'd urge removing them so the TDX startup code doesn't have to care
about the difference at all. If there is anything special that needs
to be done, it belongs in the kexec implementation of the previous
kernel.



Re: [PATCH v9 08/19] x86: Secure Launch kernel early boot stub

2024-06-04 Thread Ard Biesheuvel
On Tue, 4 Jun 2024 at 19:34,  wrote:
>
> On 6/4/24 10:27 AM, Ard Biesheuvel wrote:
> > On Tue, 4 Jun 2024 at 19:24,  wrote:
> >>
> >> On 5/31/24 6:33 AM, Ard Biesheuvel wrote:
> >>> On Fri, 31 May 2024 at 13:00, Ard Biesheuvel  wrote:
> >>>>
> >>>> Hello Ross,
> >>>>
> >>>> On Fri, 31 May 2024 at 03:32, Ross Philipson  
> >>>> wrote:
> >>>>>
> >>>>> The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> >>>>> later AMD SKINIT) to vector to during the late launch. The symbol
> >>>>> sl_stub_entry is that entry point and its offset into the kernel is
> >>>>> conveyed to the launching code using the MLE (Measured Launch
> >>>>> Environment) header in the structure named mle_header. The offset of the
> >>>>> MLE header is set in the kernel_info. The routine sl_stub contains the
> >>>>> very early late launch setup code responsible for setting up the basic
> >>>>> environment to allow the normal kernel startup_32 code to proceed. It is
> >>>>> also responsible for properly waking and handling the APs on Intel
> >>>>> platforms. The routine sl_main which runs after entering 64b mode is
> >>>>> responsible for measuring configuration and module information before
> >>>>> it is used like the boot params, the kernel command line, the TXT heap,
> >>>>> an external initramfs, etc.
> >>>>>
> >>>>> Signed-off-by: Ross Philipson 
> >>>>> ---
> >>>>>Documentation/arch/x86/boot.rst|  21 +
> >>>>>arch/x86/boot/compressed/Makefile  |   3 +-
> >>>>>arch/x86/boot/compressed/head_64.S |  30 +
> >>>>>arch/x86/boot/compressed/kernel_info.S |  34 ++
> >>>>>arch/x86/boot/compressed/sl_main.c | 577 
> > >>>>>arch/x86/boot/compressed/sl_stub.S | 725 +
> >>>>>arch/x86/include/asm/msr-index.h   |   5 +
> >>>>>arch/x86/include/uapi/asm/bootparam.h  |   1 +
> >>>>>arch/x86/kernel/asm-offsets.c  |  20 +
> >>>>>9 files changed, 1415 insertions(+), 1 deletion(-)
> >>>>>create mode 100644 arch/x86/boot/compressed/sl_main.c
> >>>>>create mode 100644 arch/x86/boot/compressed/sl_stub.S
> >>>>>
> >>>>> diff --git a/Documentation/arch/x86/boot.rst 
> >>>>> b/Documentation/arch/x86/boot.rst
> >>>>> index 4fd492cb4970..295cdf9bcbdb 100644
> >>>>> --- a/Documentation/arch/x86/boot.rst
> >>>>> +++ b/Documentation/arch/x86/boot.rst
> >>>>> @@ -482,6 +482,14 @@ Protocol:  2.00+
> >>>>>   - If 1, KASLR enabled.
> >>>>>   - If 0, KASLR disabled.
> >>>>>
> >>>>> +  Bit 2 (kernel internal): SLAUNCH_FLAG
> >>>>> +
> >>>>> +   - Used internally by the setup kernel to communicate
> >>>>> + Secure Launch status to kernel proper.
> >>>>> +
> >>>>> +   - If 1, Secure Launch enabled.
> >>>>> +   - If 0, Secure Launch disabled.
> >>>>> +
> >>>>>  Bit 5 (write): QUIET_FLAG
> >>>>>
> >>>>>   - If 0, print early messages.
> >>>>> @@ -1028,6 +1036,19 @@ Offset/size: 0x000c/4
> >>>>>
> >>>>>  This field contains maximal allowed type for setup_data and 
> >>>>> setup_indirect structs.
> >>>>>
> >>>>> +   =
> >>>>> +Field name:mle_header_offset
> >>>>> +Offset/size:   0x0010/4
> >>>>> +   =
> >>>>> +
> >>>>> +  This field contains the offset to the Secure Launch Measured Launch 
> >>>>> Environment
> >>>>> +  (MLE) header. This offset is used to locate information needed 
> >>>>> during a secure
> >>>>> +  late launch using Intel TXT. If the offset is zero, the kernel does 
> >>>>> not have
> >>>>> +  Secure Launch capabilities. The MLE entry point is called from TX

Re: [PATCH v9 08/19] x86: Secure Launch kernel early boot stub

2024-06-04 Thread Ard Biesheuvel
On Tue, 4 Jun 2024 at 19:24,  wrote:
>
> On 5/31/24 6:33 AM, Ard Biesheuvel wrote:
> > On Fri, 31 May 2024 at 13:00, Ard Biesheuvel  wrote:
> >>
> >> Hello Ross,
> >>
> >> On Fri, 31 May 2024 at 03:32, Ross Philipson  
> >> wrote:
> >>>
> >>> The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> >>> later AMD SKINIT) to vector to during the late launch. The symbol
> >>> sl_stub_entry is that entry point and its offset into the kernel is
> >>> conveyed to the launching code using the MLE (Measured Launch
> >>> Environment) header in the structure named mle_header. The offset of the
> >>> MLE header is set in the kernel_info. The routine sl_stub contains the
> >>> very early late launch setup code responsible for setting up the basic
> >>> environment to allow the normal kernel startup_32 code to proceed. It is
> >>> also responsible for properly waking and handling the APs on Intel
> >>> platforms. The routine sl_main which runs after entering 64b mode is
> >>> responsible for measuring configuration and module information before
> >>> it is used like the boot params, the kernel command line, the TXT heap,
> >>> an external initramfs, etc.
> >>>
> >>> Signed-off-by: Ross Philipson 
> >>> ---
> >>>   Documentation/arch/x86/boot.rst|  21 +
> >>>   arch/x86/boot/compressed/Makefile  |   3 +-
> >>>   arch/x86/boot/compressed/head_64.S |  30 +
> >>>   arch/x86/boot/compressed/kernel_info.S |  34 ++
> >>>   arch/x86/boot/compressed/sl_main.c | 577 
> >>>   arch/x86/boot/compressed/sl_stub.S | 725 +
> >>>   arch/x86/include/asm/msr-index.h   |   5 +
> >>>   arch/x86/include/uapi/asm/bootparam.h  |   1 +
> >>>   arch/x86/kernel/asm-offsets.c  |  20 +
> >>>   9 files changed, 1415 insertions(+), 1 deletion(-)
> >>>   create mode 100644 arch/x86/boot/compressed/sl_main.c
> >>>   create mode 100644 arch/x86/boot/compressed/sl_stub.S
> >>>
> >>> diff --git a/Documentation/arch/x86/boot.rst 
> >>> b/Documentation/arch/x86/boot.rst
> >>> index 4fd492cb4970..295cdf9bcbdb 100644
> >>> --- a/Documentation/arch/x86/boot.rst
> >>> +++ b/Documentation/arch/x86/boot.rst
> >>> @@ -482,6 +482,14 @@ Protocol:  2.00+
> >>>  - If 1, KASLR enabled.
> >>>  - If 0, KASLR disabled.
> >>>
> >>> +  Bit 2 (kernel internal): SLAUNCH_FLAG
> >>> +
> >>> +   - Used internally by the setup kernel to communicate
> >>> + Secure Launch status to kernel proper.
> >>> +
> >>> +   - If 1, Secure Launch enabled.
> >>> +   - If 0, Secure Launch disabled.
> >>> +
> >>> Bit 5 (write): QUIET_FLAG
> >>>
> >>>  - If 0, print early messages.
> >>> @@ -1028,6 +1036,19 @@ Offset/size: 0x000c/4
> >>>
> >>> This field contains maximal allowed type for setup_data and 
> >>> setup_indirect structs.
> >>>
> >>> +   =
> >>> +Field name:mle_header_offset
> >>> +Offset/size:   0x0010/4
> >>> +   =
> >>> +
> >>> +  This field contains the offset to the Secure Launch Measured Launch 
> >>> Environment
> >>> +  (MLE) header. This offset is used to locate information needed during 
> >>> a secure
> >>> +  late launch using Intel TXT. If the offset is zero, the kernel does 
> >>> not have
> >>> +  Secure Launch capabilities. The MLE entry point is called from TXT on 
> >>> the BSP
> >>> +  following a success measured launch. The specific state of the 
> >>> processors is
> >>> +  outlined in the TXT Software Development Guide, the latest can be 
> >>> found here:
> >>> +  
> > >>> https://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-txt-software-development-guide.pdf
> >>> +
> >>>
> >>
> >> Could we just repaint this field as the offset relative to the start
> >> of kernel_info rather than relative to the start o

Re: [PATCH v9 08/19] x86: Secure Launch kernel early boot stub

2024-05-31 Thread Ard Biesheuvel
On Fri, 31 May 2024 at 16:04, Ard Biesheuvel  wrote:
>
> On Fri, 31 May 2024 at 15:33, Ard Biesheuvel  wrote:
> >
> > On Fri, 31 May 2024 at 13:00, Ard Biesheuvel  wrote:
> > >
> > > Hello Ross,
> > >
> > > On Fri, 31 May 2024 at 03:32, Ross Philipson  
> > > wrote:
> > > >
> > > > The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> > > > later AMD SKINIT) to vector to during the late launch. The symbol
> > > > sl_stub_entry is that entry point and its offset into the kernel is
> > > > conveyed to the launching code using the MLE (Measured Launch
> > > > Environment) header in the structure named mle_header. The offset of the
> > > > MLE header is set in the kernel_info. The routine sl_stub contains the
> > > > very early late launch setup code responsible for setting up the basic
> > > > environment to allow the normal kernel startup_32 code to proceed. It is
> > > > also responsible for properly waking and handling the APs on Intel
> > > > platforms. The routine sl_main which runs after entering 64b mode is
> > > > responsible for measuring configuration and module information before
> > > > it is used like the boot params, the kernel command line, the TXT heap,
> > > > an external initramfs, etc.
> > > >
> > > > Signed-off-by: Ross Philipson 
> > > > ---
> > > >  Documentation/arch/x86/boot.rst|  21 +
> > > >  arch/x86/boot/compressed/Makefile  |   3 +-
> > > >  arch/x86/boot/compressed/head_64.S |  30 +
> > > >  arch/x86/boot/compressed/kernel_info.S |  34 ++
> > > >  arch/x86/boot/compressed/sl_main.c | 577 
> > > >  arch/x86/boot/compressed/sl_stub.S | 725 +
> > > >  arch/x86/include/asm/msr-index.h   |   5 +
> > > >  arch/x86/include/uapi/asm/bootparam.h  |   1 +
> > > >  arch/x86/kernel/asm-offsets.c  |  20 +
> > > >  9 files changed, 1415 insertions(+), 1 deletion(-)
> > > >  create mode 100644 arch/x86/boot/compressed/sl_main.c
> > > >  create mode 100644 arch/x86/boot/compressed/sl_stub.S
> > > >
> > > > diff --git a/Documentation/arch/x86/boot.rst 
> > > > b/Documentation/arch/x86/boot.rst
> > > > index 4fd492cb4970..295cdf9bcbdb 100644
> > > > --- a/Documentation/arch/x86/boot.rst
> > > > +++ b/Documentation/arch/x86/boot.rst
> > > > @@ -482,6 +482,14 @@ Protocol:  2.00+
> > > > - If 1, KASLR enabled.
> > > > - If 0, KASLR disabled.
> > > >
> > > > +  Bit 2 (kernel internal): SLAUNCH_FLAG
> > > > +
> > > > +   - Used internally by the setup kernel to communicate
> > > > + Secure Launch status to kernel proper.
> > > > +
> > > > +   - If 1, Secure Launch enabled.
> > > > +   - If 0, Secure Launch disabled.
> > > > +
> > > >Bit 5 (write): QUIET_FLAG
> > > >
> > > > - If 0, print early messages.
> > > > @@ -1028,6 +1036,19 @@ Offset/size: 0x000c/4
> > > >
> > > >This field contains maximal allowed type for setup_data and 
> > > > setup_indirect structs.
> > > >
> > > > +   =
> > > > +Field name:mle_header_offset
> > > > +Offset/size:   0x0010/4
> > > > +   =
> > > > +
> > > > +  This field contains the offset to the Secure Launch Measured Launch 
> > > > Environment
> > > > +  (MLE) header. This offset is used to locate information needed 
> > > > during a secure
> > > > +  late launch using Intel TXT. If the offset is zero, the kernel does 
> > > > not have
> > > > +  Secure Launch capabilities. The MLE entry point is called from TXT 
> > > > on the BSP
> > > > +  following a success measured launch. The specific state of the 
> > > > processors is
> > > > +  outlined in the TXT Software Development Guide, the latest can be 
> > > > found here:
> > > > +  
> > > > https://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-txt-software-development-guide.pdf
> > > > +
> > > >
> > >
> > > Could we just repaint this field as the offset relative to the start
> > &

Re: [PATCH v9 08/19] x86: Secure Launch kernel early boot stub

2024-05-31 Thread Ard Biesheuvel
On Fri, 31 May 2024 at 15:33, Ard Biesheuvel  wrote:
>
> On Fri, 31 May 2024 at 13:00, Ard Biesheuvel  wrote:
> >
> > Hello Ross,
> >
> > On Fri, 31 May 2024 at 03:32, Ross Philipson  
> > wrote:
> > >
> > > The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> > > later AMD SKINIT) to vector to during the late launch. The symbol
> > > sl_stub_entry is that entry point and its offset into the kernel is
> > > conveyed to the launching code using the MLE (Measured Launch
> > > Environment) header in the structure named mle_header. The offset of the
> > > MLE header is set in the kernel_info. The routine sl_stub contains the
> > > very early late launch setup code responsible for setting up the basic
> > > environment to allow the normal kernel startup_32 code to proceed. It is
> > > also responsible for properly waking and handling the APs on Intel
> > > platforms. The routine sl_main which runs after entering 64b mode is
> > > responsible for measuring configuration and module information before
> > > it is used like the boot params, the kernel command line, the TXT heap,
> > > an external initramfs, etc.
> > >
> > > Signed-off-by: Ross Philipson 
> > > ---
> > >  Documentation/arch/x86/boot.rst|  21 +
> > >  arch/x86/boot/compressed/Makefile  |   3 +-
> > >  arch/x86/boot/compressed/head_64.S |  30 +
> > >  arch/x86/boot/compressed/kernel_info.S |  34 ++
> > >  arch/x86/boot/compressed/sl_main.c | 577 
> > >  arch/x86/boot/compressed/sl_stub.S | 725 +
> > >  arch/x86/include/asm/msr-index.h   |   5 +
> > >  arch/x86/include/uapi/asm/bootparam.h  |   1 +
> > >  arch/x86/kernel/asm-offsets.c  |  20 +
> > >  9 files changed, 1415 insertions(+), 1 deletion(-)
> > >  create mode 100644 arch/x86/boot/compressed/sl_main.c
> > >  create mode 100644 arch/x86/boot/compressed/sl_stub.S
> > >
> > > diff --git a/Documentation/arch/x86/boot.rst 
> > > b/Documentation/arch/x86/boot.rst
> > > index 4fd492cb4970..295cdf9bcbdb 100644
> > > --- a/Documentation/arch/x86/boot.rst
> > > +++ b/Documentation/arch/x86/boot.rst
> > > @@ -482,6 +482,14 @@ Protocol:  2.00+
> > > - If 1, KASLR enabled.
> > > - If 0, KASLR disabled.
> > >
> > > +  Bit 2 (kernel internal): SLAUNCH_FLAG
> > > +
> > > +   - Used internally by the setup kernel to communicate
> > > + Secure Launch status to kernel proper.
> > > +
> > > +   - If 1, Secure Launch enabled.
> > > +   - If 0, Secure Launch disabled.
> > > +
> > >Bit 5 (write): QUIET_FLAG
> > >
> > > - If 0, print early messages.
> > > @@ -1028,6 +1036,19 @@ Offset/size: 0x000c/4
> > >
> > >This field contains maximal allowed type for setup_data and 
> > > setup_indirect structs.
> > >
> > > +   =
> > > +Field name:mle_header_offset
> > > +Offset/size:   0x0010/4
> > > +   =
> > > +
> > > +  This field contains the offset to the Secure Launch Measured Launch 
> > > Environment
> > > +  (MLE) header. This offset is used to locate information needed during 
> > > a secure
> > > +  late launch using Intel TXT. If the offset is zero, the kernel does 
> > > not have
> > > +  Secure Launch capabilities. The MLE entry point is called from TXT on 
> > > the BSP
> > > +  following a success measured launch. The specific state of the 
> > > processors is
> > > +  outlined in the TXT Software Development Guide, the latest can be 
> > > found here:
> > > +  
> > > https://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-txt-software-development-guide.pdf
> > > +
> > >
> >
> > Could we just repaint this field as the offset relative to the start
> > of kernel_info rather than relative to the start of the image? That
> > way, there is no need for patch #1, and given that the consumer of
> > this field accesses it via kernel_info, I wouldn't expect any issues
> > in applying this offset to obtain the actual address.
> >
> >
> > >  The Image Checksum
> > >  ==
> > > diff --git a/arch/x86/boot/compressed/Makefile 
> > > b/arch/x86/boot/compressed/Makefile

Re: [PATCH v9 08/19] x86: Secure Launch kernel early boot stub

2024-05-31 Thread Ard Biesheuvel
On Fri, 31 May 2024 at 13:00, Ard Biesheuvel  wrote:
>
> Hello Ross,
>
> On Fri, 31 May 2024 at 03:32, Ross Philipson  
> wrote:
> >
> > The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> > later AMD SKINIT) to vector to during the late launch. The symbol
> > sl_stub_entry is that entry point and its offset into the kernel is
> > conveyed to the launching code using the MLE (Measured Launch
> > Environment) header in the structure named mle_header. The offset of the
> > MLE header is set in the kernel_info. The routine sl_stub contains the
> > very early late launch setup code responsible for setting up the basic
> > environment to allow the normal kernel startup_32 code to proceed. It is
> > also responsible for properly waking and handling the APs on Intel
> > platforms. The routine sl_main which runs after entering 64b mode is
> > responsible for measuring configuration and module information before
> > it is used like the boot params, the kernel command line, the TXT heap,
> > an external initramfs, etc.
> >
> > Signed-off-by: Ross Philipson 
> > ---
> >  Documentation/arch/x86/boot.rst|  21 +
> >  arch/x86/boot/compressed/Makefile  |   3 +-
> >  arch/x86/boot/compressed/head_64.S |  30 +
> >  arch/x86/boot/compressed/kernel_info.S |  34 ++
> >  arch/x86/boot/compressed/sl_main.c | 577 
> >  arch/x86/boot/compressed/sl_stub.S | 725 +
> >  arch/x86/include/asm/msr-index.h   |   5 +
> >  arch/x86/include/uapi/asm/bootparam.h  |   1 +
> >  arch/x86/kernel/asm-offsets.c  |  20 +
> >  9 files changed, 1415 insertions(+), 1 deletion(-)
> >  create mode 100644 arch/x86/boot/compressed/sl_main.c
> >  create mode 100644 arch/x86/boot/compressed/sl_stub.S
> >
> > diff --git a/Documentation/arch/x86/boot.rst 
> > b/Documentation/arch/x86/boot.rst
> > index 4fd492cb4970..295cdf9bcbdb 100644
> > --- a/Documentation/arch/x86/boot.rst
> > +++ b/Documentation/arch/x86/boot.rst
> > @@ -482,6 +482,14 @@ Protocol:  2.00+
> > - If 1, KASLR enabled.
> > - If 0, KASLR disabled.
> >
> > +  Bit 2 (kernel internal): SLAUNCH_FLAG
> > +
> > +   - Used internally by the setup kernel to communicate
> > + Secure Launch status to kernel proper.
> > +
> > +   - If 1, Secure Launch enabled.
> > +   - If 0, Secure Launch disabled.
> > +
> >Bit 5 (write): QUIET_FLAG
> >
> > - If 0, print early messages.
> > @@ -1028,6 +1036,19 @@ Offset/size: 0x000c/4
> >
> >This field contains maximal allowed type for setup_data and 
> > setup_indirect structs.
> >
> > +   =
> > +Field name:mle_header_offset
> > +Offset/size:   0x0010/4
> > +   =
> > +
> > +  This field contains the offset to the Secure Launch Measured Launch 
> > Environment
> > +  (MLE) header. This offset is used to locate information needed during a 
> > secure
> > +  late launch using Intel TXT. If the offset is zero, the kernel does not 
> > have
> > +  Secure Launch capabilities. The MLE entry point is called from TXT on 
> > the BSP
> > +  following a success measured launch. The specific state of the 
> > processors is
> > +  outlined in the TXT Software Development Guide, the latest can be found 
> > here:
> > +  
> > https://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-txt-software-development-guide.pdf
> > +
> >
>
> Could we just repaint this field as the offset relative to the start
> of kernel_info rather than relative to the start of the image? That
> way, there is no need for patch #1, and given that the consumer of
> this field accesses it via kernel_info, I wouldn't expect any issues
> in applying this offset to obtain the actual address.
>
>
> >  The Image Checksum
> >  ==
> > diff --git a/arch/x86/boot/compressed/Makefile 
> > b/arch/x86/boot/compressed/Makefile
> > index 9189a0e28686..9076a248d4b4 100644
> > --- a/arch/x86/boot/compressed/Makefile
> > +++ b/arch/x86/boot/compressed/Makefile
> > @@ -118,7 +118,8 @@ vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
> >  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
> >  vmlinux-objs-$(CONFIG_EFI_STUB) += 
> > $(objtree)/drivers/firmware/efi/libstub/lib.a
> >
> > -vmlinux-objs-$(CONFIG_SECURE_LAUNCH) += $(obj)/early_sha1.o 
> > $(obj)/early_sha256.o
> > +vmlinux-obj

Re: [PATCH v9 19/19] x86: EFI stub DRTM launch support for Secure Launch

2024-05-31 Thread Ard Biesheuvel
On Fri, 31 May 2024 at 03:32, Ross Philipson  wrote:
>
> This support allows the DRTM launch to be initiated after an EFI stub
> launch of the Linux kernel is done. This is accomplished by providing
> a handler to jump to when a Secure Launch is in progress. This has to be
> called after the EFI stub does Exit Boot Services.
>
> Signed-off-by: Ross Philipson 

Just some minor remarks below. The overall approach in this patch
looks fine now.


> ---
>  drivers/firmware/efi/libstub/x86-stub.c | 98 +
>  1 file changed, 98 insertions(+)
>
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c 
> b/drivers/firmware/efi/libstub/x86-stub.c
> index d5a8182cf2e1..a1143d006202 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -9,6 +9,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  #include 
>  #include 
> @@ -830,6 +832,97 @@ static efi_status_t efi_decompress_kernel(unsigned long 
> *kernel_entry)
> return efi_adjust_memory_range_protection(addr, kernel_text_size);
>  }
>
> +#if (IS_ENABLED(CONFIG_SECURE_LAUNCH))

IS_ENABLED() is mostly used for C conditionals, not CPP ones.

It would be nice if this #if could be dropped, and replaced with ... (see below)


> +static bool efi_secure_launch_update_boot_params(struct slr_table *slrt,
> +struct boot_params 
> *boot_params)
> +{
> +   struct slr_entry_intel_info *txt_info;
> +   struct slr_entry_policy *policy;
> +   struct txt_os_mle_data *os_mle;
> +   bool updated = false;
> +   int i;
> +
> +   txt_info = slr_next_entry_by_tag(slrt, NULL, SLR_ENTRY_INTEL_INFO);
> +   if (!txt_info)
> +   return false;
> +
> +   os_mle = txt_os_mle_data_start((void *)txt_info->txt_heap);
> +   if (!os_mle)
> +   return false;
> +
> +   os_mle->boot_params_addr = (u32)(u64)boot_params;
> +

Why is this safe?

> +   policy = slr_next_entry_by_tag(slrt, NULL, SLR_ENTRY_ENTRY_POLICY);
> +   if (!policy)
> +   return false;
> +
> +   for (i = 0; i < policy->nr_entries; i++) {
> +   if (policy->policy_entries[i].entity_type == 
> SLR_ET_BOOT_PARAMS) {
> +   policy->policy_entries[i].entity = (u64)boot_params;
> +   updated = true;
> +   break;
> +   }
> +   }
> +
> +   /*
> +* If this is a PE entry into EFI stub the mocked up boot params will
> +* be missing some of the setup header data needed for the second 
> stage
> +* of the Secure Launch boot.
> +*/
> +   if (image) {
> +   struct setup_header *hdr = (struct setup_header *)((u8 
> *)image->image_base + 0x1f1);

Could we use something other than a bare 0x1f1 constant here? struct
boot_params has a struct setup_header at the correct offset, so with
some casting and use of offsetof(), we can make this look a lot more
self-explanatory.
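
A minimal sketch of the offsetof() idea, assuming simplified stand-in
structures rather than the real kernel headers. The names
boot_params_sketch, setup_header_sketch and image_setup_header are
hypothetical; only the 0x1f1 offset comes from the patch under review.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct setup_header; only the first field is modelled. */
struct setup_header_sketch {
	uint8_t setup_sects;
};

/*
 * Stand-in for struct boot_params: the setup header lives 0x1f1 bytes in.
 * The padding array replaces all the real fields that precede it.
 */
struct boot_params_sketch {
	uint8_t pad[0x1f1];
	struct setup_header_sketch hdr;
};

/* Locate the setup header inside an image without a bare 0x1f1 constant. */
static const struct setup_header_sketch *image_setup_header(const void *image_base)
{
	return (const struct setup_header_sketch *)
		((const uint8_t *)image_base +
		 offsetof(struct boot_params_sketch, hdr));
}
```

With the real structures, the same expression would read
offsetof(struct boot_params, hdr), which documents where the number
comes from.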


> +   u64 cmdline_ptr, hi_val;
> +
> +   boot_params->hdr.setup_sects = hdr->setup_sects;
> +   boot_params->hdr.syssize = hdr->syssize;
> +   boot_params->hdr.version = hdr->version;
> +   boot_params->hdr.loadflags = hdr->loadflags;
> +   boot_params->hdr.kernel_alignment = hdr->kernel_alignment;
> +   boot_params->hdr.min_alignment = hdr->min_alignment;
> +   boot_params->hdr.xloadflags = hdr->xloadflags;
> +   boot_params->hdr.init_size = hdr->init_size;
> +   boot_params->hdr.kernel_info_offset = hdr->kernel_info_offset;
> +   hi_val = boot_params->ext_cmd_line_ptr;

We have efi_set_u64_split() for this.
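
For reference, a rough sketch of what splitting and rejoining a 64-bit
value across two 32-bit fields looks like. These are stand-in functions
with hypothetical names, written from my understanding of the helper's
purpose; they are not copied from the EFI stub sources.

```c
#include <stdint.h>

/* Stand-in: store a 64-bit value as separate low and high 32-bit halves. */
static void u64_split_sketch(uint64_t data, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)data;
	*hi = (uint32_t)(data >> 32);
}

/* Stand-in for the inverse: rebuild the 64-bit value from its halves. */
static uint64_t u64_join_sketch(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```

The explicit cast before the shift matters in the join direction:
shifting a 32-bit value left by 32 without widening it first is
undefined behaviour in C.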

> +   cmdline_ptr = boot_params->hdr.cmd_line_ptr | hi_val << 32;
> +   boot_params->hdr.cmdline_size = strlen((const char 
> *)cmdline_ptr);;
> +   }
> +
> +   return updated;
> +}
> +
> +static void efi_secure_launch(struct boot_params *boot_params)
> +{
> +   struct slr_entry_dl_info *dlinfo;
> +   efi_guid_t guid = SLR_TABLE_GUID;
> +   dl_handler_func handler_callback;
> +   struct slr_table *slrt;
> +

... a C conditional here, e.g.,

    if (!IS_ENABLED(CONFIG_SECURE_LAUNCH))
            return;

The difference is that all the code will get compile test coverage
every time, instead of only in configs that enable
CONFIG_SECURE_LAUNCH.

This significantly reduces the risk that your stuff will get broken
inadvertently.
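
A small standalone sketch of the pattern, assuming a plain 0/1 constant
as a stand-in for IS_ENABLED(CONFIG_SECURE_LAUNCH). The macro
SKETCH_SECURE_LAUNCH and the function name are hypothetical; the real
macro is derived from Kconfig.

```c
#include <stdbool.h>

/*
 * Stand-in for IS_ENABLED(CONFIG_SECURE_LAUNCH): a constant that is 0 or 1.
 * The real kernel macro is derived from Kconfig; this is only a model.
 */
#ifndef SKETCH_SECURE_LAUNCH
#define SKETCH_SECURE_LAUNCH 0
#endif

static int launch_attempts;

/* Always parsed and type-checked, even when the feature is disabled. */
static bool secure_launch_sketch(void)
{
	if (!SKETCH_SECURE_LAUNCH)
		return false;	/* the compiler can discard the rest as dead code */

	launch_attempts++;
	return true;
}
```

Because the body is ordinary C rather than preprocessor-guarded text, a
typo or a stale struct member in it breaks the build in every
configuration, not only in the ones that enable the option.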

> +   /*
> +* The presence of this table indicated a Secure Launch
> +* is being requested.
> +*/
> +   slrt = (struct slr_table *)get_efi_config_table(guid);
> +   if (!slrt || slrt->magic != SLR_TABLE_MAGIC)
> +   return;
> +
> +   /*
> +* Since the EFI stub library creates its own boot_params on entry, 
> the
> +* SLRT and TXT heap have to be updated with this ve

Re: [RFC PATCH 0/9] kexec x86 purgatory cleanup

2024-04-24 Thread Ard Biesheuvel
On Wed, 24 Apr 2024 at 22:04, Eric W. Biederman  wrote:
>
> Ard Biesheuvel  writes:
>
> > From: Ard Biesheuvel 
> >
> > The kexec purgatory is built like a kernel module, i.e., a partially
> > linked ELF object where each section is allocated and placed
> > individually, and all relocations need to be fixed up, even place
> > relative ones.
> >
> > This makes sense for kernel modules, which share the address space with
> > the core kernel, and contain unresolved references that need to be wired
> > up to symbols in other modules or the kernel itself.
> >
> > The purgatory, however, is a fully linked binary without any external
> > references, or any overlap with the kernel's virtual address space. So
> > it makes much more sense to create a fully linked ELF executable that
> > can just be loaded and run anywhere in memory.
>
> It does have external references that are resolved when it is loaded.
>

It doesn't today, and it hasn't for a while, at least since commit

e4160b2e4b02377c67f8ecd05786811598f39acd
x86/purgatory: Fail the build if purgatory.ro has missing symbols

which forces a build failure on unresolved external references, by
doing a full link of the purgatory.

> Further it is at least my impression that non-PIC code is more
> efficient.  PIC typically requires silly things like Global Offset
> Tables that non-PIC code does not.  At first glance this looks like a
> code passivization.
>

Given that the 64-bit purgatory can be loaded in memory that is not
32-bit addressable, the PIC code is essentially a given, since the
large code model is much worse (it uses 64-bit immediates for all
function and variable symbols, and therefore always uses indirect
calls).

Please refer to

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/build&id=cba786af84a0f9716204e09f518ce3b7ada8555e

for more details. (Getting pulled into that discussion is how I ended
up looking into the purgatory in more detail)

> Now at lot of functionality has been stripped out of purgatory so maybe
> in it's stripped down this make sense, but I want to challenge the
> notion that this is the obvious thing to do.
>

The diffstat speaks for itself - on x86, much of the allocation and
relocation logic can simply be dropped when building the purgatory in
this manner.

> > The purgatory build on x86 has already switched over to position
> > independent codegen, which only leaves a handful of absolute references,
> > which can either be dropped (patch #3) or converted into a RIP-relative
> > one (patch #4). That leaves a purgatory executable that can run at any
> > offset in memory with applying any relocations whatsoever.
>
> I missed that conversation.  Do you happen to have a pointer?  I would
> think the 32bit code is where the PIC would be most costly as the 32bit
> x86 instruction set predates PIC being a common compilation target.
>

See link above. Note that none of this is about 32-bit code - the
purgatory as it exists today never drops out of long mode (and no
32-bit version appears to exist).

> > Some tweaks are needed to deal with the difference between partially
> > (ET_REL) and fully (ET_DYN/ET_EXEC) linked ELF objects, but with those
> > in place, a substantial amount of complicated ELF allocation, placement
> > and patching/relocation code can simply be dropped.
>
> Really?  As I recall it only needed to handle a single allocation type,
> and there were good reasons (at least when I wrote it) to patch symbols.
>
> Again maybe the fact that people have removed 90% of the functionality
> makes this make sense, but that is not obvious at first glance.
>

Again, the patches and the diffstat speak for themselves - the linker
applies all the relocations at build time, and emits all the sections
into a single ELF segment that can be copied into memory and executed
directly (modulo poking values into the global variables for the
sha256 digest and the segment list).

The last patch in the series shows which code we could drop from the
generic kexec_file_load() implementation once other architectures
adopt this scheme.
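
To give an idea of how little the loader has to do in this scheme, here
is a userspace sketch using the definitions from <elf.h>. The function
name single_segment_entry is hypothetical, and the PT_LOAD check is an
extra sanity test added for illustration; it is not claimed to match
the kernel code in the series line for line.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: accept only a fully linked (non-ET_REL) ELF64 image
 * with exactly one program header of type PT_LOAD, and report where its
 * entry point ends up once the segment is copied to load_addr.
 * Returns 0 on success, -1 if the image does not have the expected shape.
 */
static int single_segment_entry(const void *image, uint64_t load_addr,
				uint64_t *entry)
{
	const Elf64_Ehdr *ehdr = image;
	const Elf64_Phdr *phdr;

	if (ehdr->e_type == ET_REL || ehdr->e_phnum != 1)
		return -1;

	phdr = (const Elf64_Phdr *)((const uint8_t *)image + ehdr->e_phoff);
	if (phdr->p_type != PT_LOAD)
		return -1;

	/* No relocations to apply: the entry point just moves with the image. */
	*entry = load_addr + ehdr->e_entry;
	return 0;
}
```

Everything else (section placement, symbol lookup for relocation
targets, applying RELA entries) has no counterpart here, which is the
point of the series.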

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC PATCH 4/9] x86/purgatory: Avoid absolute reference to GDT

2024-04-24 Thread Ard Biesheuvel
Hi Brian,

Thanks for taking a look.

On Wed, 24 Apr 2024 at 19:39, Brian Gerst  wrote:
>
> On Wed, Apr 24, 2024 at 12:06 PM Ard Biesheuvel  wrote:
> >
> > From: Ard Biesheuvel 
> >
> > The purgatory is almost entirely position independent, without any need
> > for any relocation processing at load time except for the reference to
> > the GDT in the entry code. Generate this reference at runtime instead,
> > to remove the last R_X86_64_64 relocation from this code.
> >
> > While the GDT itself needs to be preserved in memory as long as it is
> > live, the GDT descriptor that is used to program the GDT can be
> > discarded so it can be allocated on the stack.
> >
> > Signed-off-by: Ard Biesheuvel 
> > ---
> >  arch/x86/purgatory/entry64.S | 10 +++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
> > index 9913877b0dbe..888661d9db9c 100644
> > --- a/arch/x86/purgatory/entry64.S
> > +++ b/arch/x86/purgatory/entry64.S
> > @@ -16,7 +16,11 @@
> >
> >  SYM_CODE_START(entry64)
> > /* Setup a gdt that should be preserved */
> > -   lgdt gdt(%rip)
> > > +   leaq    gdt(%rip), %rax
> > > +   pushq   %rax
> > > +   pushw   $gdt_end - gdt - 1
> > > +   lgdt    (%rsp)
> > > +   addq    $10, %rsp
>
> This misaligns the stack, pushing 16 bytes on the stack but only
> removing 10 (decimal).
>

pushw subtracts 2 from RSP and stores a word. So the total size stored
is 10 bytes (decimal), not 16.
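
The arithmetic can be checked with a C model of the descriptor that
lgdt reads in 64-bit mode. The struct name is a stand-in, and the
packed attribute assumes a GCC or Clang compatible compiler.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * In-memory operand of lgdt in 64-bit mode: a 16-bit limit followed by a
 * 64-bit base. Packed so that no padding is inserted between the two.
 */
struct gdt_desc_sketch {
	uint16_t limit;	/* stored by the 2-byte pushw */
	uint64_t base;	/* stored by the 8-byte pushq */
} __attribute__((packed));

_Static_assert(sizeof(struct gdt_desc_sketch) == 10,
	       "descriptor must be 10 bytes");
```

The pushq lands first at the higher address and the pushw lands 2 bytes
below it, so RSP points at the limit with the base immediately after
it, and adding 10 restores RSP to its original value.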

> >
> > /* load the data segments */
> > > movl    $0x18, %eax /* data segment */
> > @@ -83,8 +87,8 @@ SYM_DATA_START_LOCAL(gdt)
> >  * 0x08 unused
> >  * so use them as gdt ptr
>
> obsolete comment
>
> >  */
> > -   .word gdt_end - gdt - 1
> > -   .quad gdt
> > +   .word 0
> > +   .quad 0
> > .word 0, 0, 0
>
> This can be condensed down to:
> .quad 0, 0
>

This code and the comment are removed in the next patch.



[RFC PATCH 9/9] kexec: Drop support for partially linked purgatory executables

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

Remove the handling of purgatories that are allocated, loaded and
relocated as individual ELF sections, which requires a lot of
post-processing on the part of the kexec loader. This has been
superseded by the use of fully linked PIE executables, which do not
require such post-processing.

Signed-off-by: Ard Biesheuvel 
---
 kernel/kexec_file.c | 271 +---
 1 file changed, 14 insertions(+), 257 deletions(-)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 6379f8dfc29f..782a1247558c 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -808,228 +808,31 @@ static int kexec_calculate_store_digests(struct kimage 
*image)
 
 #ifdef CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY
 /*
- * kexec_purgatory_setup_kbuf - prepare buffer to load purgatory.
- * @pi:Purgatory to be loaded.
- * @kbuf:  Buffer to setup.
- *
- * Allocates the memory needed for the buffer. Caller is responsible to free
- * the memory after use.
- *
- * Return: 0 on success, negative errno on error.
- */
-static int kexec_purgatory_setup_kbuf(struct purgatory_info *pi,
- struct kexec_buf *kbuf)
-{
-   const Elf_Shdr *sechdrs;
-   unsigned long bss_align;
-   unsigned long bss_sz;
-   unsigned long align;
-   int i, ret;
-
-   sechdrs = (void *)pi->ehdr + pi->ehdr->e_shoff;
-   kbuf->buf_align = bss_align = 1;
-   kbuf->bufsz = bss_sz = 0;
-
-   for (i = 0; i < pi->ehdr->e_shnum; i++) {
-   if (!(sechdrs[i].sh_flags & SHF_ALLOC))
-   continue;
-
-   align = sechdrs[i].sh_addralign;
-   if (sechdrs[i].sh_type != SHT_NOBITS) {
-   if (kbuf->buf_align < align)
-   kbuf->buf_align = align;
-   kbuf->bufsz = ALIGN(kbuf->bufsz, align);
-   kbuf->bufsz += sechdrs[i].sh_size;
-   } else {
-   if (bss_align < align)
-   bss_align = align;
-   bss_sz = ALIGN(bss_sz, align);
-   bss_sz += sechdrs[i].sh_size;
-   }
-   }
-   kbuf->bufsz = ALIGN(kbuf->bufsz, bss_align);
-   kbuf->memsz = kbuf->bufsz + bss_sz;
-   if (kbuf->buf_align < bss_align)
-   kbuf->buf_align = bss_align;
-
-   kbuf->buffer = vzalloc(kbuf->bufsz);
-   if (!kbuf->buffer)
-   return -ENOMEM;
-   pi->purgatory_buf = kbuf->buffer;
-
-   ret = kexec_add_buffer(kbuf);
-   if (ret)
-   goto out;
-
-   return 0;
-out:
-   vfree(pi->purgatory_buf);
-   pi->purgatory_buf = NULL;
-   return ret;
-}
-
-/*
- * kexec_purgatory_setup_sechdrs - prepares the pi->sechdrs buffer.
- * @pi:Purgatory to be loaded.
- * @kbuf:  Buffer prepared to store purgatory.
- *
- * Allocates the memory needed for the buffer. Caller is responsible to free
- * the memory after use.
- *
- * Return: 0 on success, negative errno on error.
- */
-static int kexec_purgatory_setup_sechdrs(struct purgatory_info *pi,
-struct kexec_buf *kbuf)
-{
-   unsigned long bss_addr;
-   unsigned long offset;
-   size_t sechdrs_size;
-   Elf_Shdr *sechdrs;
-   int i;
-
-   /*
-* The section headers in kexec_purgatory are read-only. In order to
-* have them modifiable make a temporary copy.
-*/
-   sechdrs_size = array_size(sizeof(Elf_Shdr), pi->ehdr->e_shnum);
-   sechdrs = vzalloc(sechdrs_size);
-   if (!sechdrs)
-   return -ENOMEM;
-   memcpy(sechdrs, (void *)pi->ehdr + pi->ehdr->e_shoff, sechdrs_size);
-   pi->sechdrs = sechdrs;
-
-   offset = 0;
-   bss_addr = kbuf->mem + kbuf->bufsz;
-   kbuf->image->start = pi->ehdr->e_entry;
-
-   for (i = 0; i < pi->ehdr->e_shnum; i++) {
-   unsigned long align;
-   void *src, *dst;
-
-   if (!(sechdrs[i].sh_flags & SHF_ALLOC))
-   continue;
-
-   align = sechdrs[i].sh_addralign;
-   if (sechdrs[i].sh_type == SHT_NOBITS) {
-   bss_addr = ALIGN(bss_addr, align);
-   sechdrs[i].sh_addr = bss_addr;
-   bss_addr += sechdrs[i].sh_size;
-   continue;
-   }
-
-   offset = ALIGN(offset, align);
-
-   /*
-* Check if the segment contains the entry point, if so,
-* calculate the value of image->start based on it.
-* If the compiler has produced more than one .text section
-* (Eg: .text.hot), they are generally after the main .text
-* section, and they shall not be us

[RFC PATCH 8/9] x86/purgatory: Simplify references to regs array

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

Use a single symbol reference and offset addressing to load the contents
of the register file from memory, instead of using a symbol reference
for each, which results in larger code and more ELF overhead. While at
it, rename the individual labels with an .L prefix so they are omitted
from the ELF symbol table.
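
As an illustration of the offsets involved, the register file can be
modelled in C and the displacements checked at compile time. The struct
name is a stand-in chosen for this sketch; the field order follows the
.quad slots in entry64_regs as they appear in the diff below.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the order of the .quad slots in entry64_regs after this change. */
struct entry64_regs_sketch {
	uint64_t rax, rcx, rdx, rbx;
	uint64_t rbp, rsi, rdi;
	uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
	uint64_t rip;
};

/* The displacements used with %r15 must match the slot positions. */
_Static_assert(offsetof(struct entry64_regs_sketch, rbx) == 0x18, "rbx slot");
_Static_assert(offsetof(struct entry64_regs_sketch, r15) == 0x70, "r15 slot");
_Static_assert(offsetof(struct entry64_regs_sketch, rip) == 0x78, "rip slot");
```

Since %r15 serves as the base register and is itself overwritten by the
final load, the jump still has to reach the rip slot through a
RIP-relative reference to its label.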

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/purgatory/entry64.S | 67 ++--
 1 file changed, 34 insertions(+), 33 deletions(-)

diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
index 3d09781d4f9a..56487fb7fa1d 100644
--- a/arch/x86/purgatory/entry64.S
+++ b/arch/x86/purgatory/entry64.S
@@ -37,45 +37,46 @@ SYM_CODE_START(entry64)
 new_cs_exit:
 
/* Load the registers */
-   movq    rax(%rip), %rax
-   movq    rbx(%rip), %rbx
-   movq    rcx(%rip), %rcx
-   movq    rdx(%rip), %rdx
-   movq    rsi(%rip), %rsi
-   movq    rdi(%rip), %rdi
-   movq    rbp(%rip), %rbp
-   movq    r8(%rip), %r8
-   movq    r9(%rip), %r9
-   movq    r10(%rip), %r10
-   movq    r11(%rip), %r11
-   movq    r12(%rip), %r12
-   movq    r13(%rip), %r13
-   movq    r14(%rip), %r14
-   movq    r15(%rip), %r15
+   leaq    entry64_regs(%rip), %r15
+   movq    0x00(%r15), %rax
+   movq    0x08(%r15), %rcx
+   movq    0x10(%r15), %rdx
+   movq    0x18(%r15), %rbx
+   movq    0x20(%r15), %rbp
+   movq    0x28(%r15), %rsi
+   movq    0x30(%r15), %rdi
+   movq    0x38(%r15), %r8
+   movq    0x40(%r15), %r9
+   movq    0x48(%r15), %r10
+   movq    0x50(%r15), %r11
+   movq    0x58(%r15), %r12
+   movq    0x60(%r15), %r13
+   movq    0x68(%r15), %r14
+   movq    0x70(%r15), %r15
 
/* Jump to the new code... */
-   jmpq    *rip(%rip)
+   jmpq    *.Lrip(%rip)
 SYM_CODE_END(entry64)
 
.section ".rodata"
-   .balign 4
+   .balign 8
 SYM_DATA_START(entry64_regs)
-rax:   .quad 0x0
-rcx:   .quad 0x0
-rdx:   .quad 0x0
-rbx:   .quad 0x0
-rbp:   .quad 0x0
-rsi:   .quad 0x0
-rdi:   .quad 0x0
-r8:.quad 0x0
-r9:.quad 0x0
-r10:   .quad 0x0
-r11:   .quad 0x0
-r12:   .quad 0x0
-r13:   .quad 0x0
-r14:   .quad 0x0
-r15:   .quad 0x0
-rip:   .quad 0x0
+.Lrax: .quad   0x0
+.Lrcx: .quad   0x0
+.Lrdx: .quad   0x0
+.Lrbx: .quad   0x0
+.Lrbp: .quad   0x0
+.Lrsi: .quad   0x0
+.Lrdi: .quad   0x0
+.Lr8:  .quad   0x0
+.Lr9:  .quad   0x0
+.Lr10: .quad   0x0
+.Lr11: .quad   0x0
+.Lr12: .quad   0x0
+.Lr13: .quad   0x0
+.Lr14: .quad   0x0
+.Lr15: .quad   0x0
+.Lrip: .quad   0x0
 SYM_DATA_END(entry64_regs)
 
/* GDT */
-- 
2.44.0.769.g3c40516874-goog




[RFC PATCH 7/9] x86/purgatory: Use fully linked PIE ELF executable

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

Now that the generic support is in place, switch to a fully linked PIE
ELF executable for the purgatory, so that it can be loaded as a single,
fully relocated image. This allows a lot of ugly post-processing logic
to simply be dropped.

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/include/asm/kexec.h   |   7 --
 arch/x86/kernel/machine_kexec_64.c | 127 
 arch/x86/purgatory/Makefile|  14 +--
 3 files changed, 5 insertions(+), 143 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index ee7b32565e5f..c7cacc2e9dfb 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -191,13 +191,6 @@ void arch_kexec_unprotect_crashkres(void);
 #define arch_kexec_unprotect_crashkres arch_kexec_unprotect_crashkres
 
 #ifdef CONFIG_KEXEC_FILE
-struct purgatory_info;
-int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
-Elf_Shdr *section,
-const Elf_Shdr *relsec,
-const Elf_Shdr *symtab);
-#define arch_kexec_apply_relocations_add arch_kexec_apply_relocations_add
-
 int arch_kimage_file_post_load_cleanup(struct kimage *image);
 #define arch_kimage_file_post_load_cleanup arch_kimage_file_post_load_cleanup
 #endif
diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index bc0a5348b4a6..ded924423e50 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -371,133 +371,6 @@ void machine_kexec(struct kimage *image)
 /* arch-dependent functionality related to kexec file-based syscall */
 
 #ifdef CONFIG_KEXEC_FILE
-/*
- * Apply purgatory relocations.
- *
- * @pi:Purgatory to be relocated.
- * @section:   Section relocations applying to.
- * @relsec:Section containing RELAs.
- * @symtabsec: Corresponding symtab.
- *
- * TODO: Some of the code belongs to generic code. Move that in kexec.c.
- */
-int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
-Elf_Shdr *section, const Elf_Shdr *relsec,
-const Elf_Shdr *symtabsec)
-{
-   unsigned int i;
-   Elf64_Rela *rel;
-   Elf64_Sym *sym;
-   void *location;
-   unsigned long address, sec_base, value;
-   const char *strtab, *name, *shstrtab;
-   const Elf_Shdr *sechdrs;
-
-   /* String & section header string table */
-   sechdrs = (void *)pi->ehdr + pi->ehdr->e_shoff;
-   strtab = (char *)pi->ehdr + sechdrs[symtabsec->sh_link].sh_offset;
-   shstrtab = (char *)pi->ehdr + sechdrs[pi->ehdr->e_shstrndx].sh_offset;
-
-   rel = (void *)pi->ehdr + relsec->sh_offset;
-
-   pr_debug("Applying relocate section %s to %u\n",
-shstrtab + relsec->sh_name, relsec->sh_info);
-
-   for (i = 0; i < relsec->sh_size / sizeof(*rel); i++) {
-
-   /*
-* rel[i].r_offset contains byte offset from beginning
-* of section to the storage unit affected.
-*
-* This is location to update. This is temporary buffer
-* where section is currently loaded. This will finally be
-* loaded to a different address later, pointed to by
-* ->sh_addr. kexec takes care of moving it
-*  (kexec_load_segment()).
-*/
-   location = pi->purgatory_buf;
-   location += section->sh_offset;
-   location += rel[i].r_offset;
-
-   /* Final address of the location */
-   address = section->sh_addr + rel[i].r_offset;
-
-   /*
-* rel[i].r_info contains information about symbol table index
-* w.r.t which relocation must be made and type of relocation
-* to apply. ELF64_R_SYM() and ELF64_R_TYPE() macros get
-* these respectively.
-*/
-   sym = (void *)pi->ehdr + symtabsec->sh_offset;
-   sym += ELF64_R_SYM(rel[i].r_info);
-
-   if (sym->st_name)
-   name = strtab + sym->st_name;
-   else
-   name = shstrtab + sechdrs[sym->st_shndx].sh_name;
-
-   pr_debug("Symbol: %s info: %02x shndx: %02x value=%llx size: 
%llx\n",
-name, sym->st_info, sym->st_shndx, sym->st_value,
-sym->st_size);
-
-   if (sym->st_shndx == SHN_UNDEF) {
-   pr_err("Undefined symbol: %s\n", name);
-   return -ENOEXEC;
-   }
-
-   if (sym->st_shndx == SHN_COMMON) {
-   pr_err("symbol '%s' in common section\n", name);
- 

[RFC PATCH 6/9] kexec: Add support for fully linked purgatory executables

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

The purgatory ELF object is typically a partially linked object, which
puts the burden on the kexec loader to lay out the executable in memory,
and this involves (among other things) deciding the placement of the
sections in memory, and fixing up all relocations (relative and absolute
ones).

All of this can be greatly simplified by using a fully linked PIE ELF
executable instead, constructed in a way that removes the need for any
relocation processing or layout and allocation of individual sections.

By gathering all allocatable sections into a single PT_LOAD segment, and
relying on RIP-relative references, all relocations will be applied by
the linker, and the segment simply needs to be copied into memory.

So add a linker script and some minimal handling in generic code, which
can be used by architectures to opt into this approach. This will be
wired up for x86 in a subsequent patch.

Signed-off-by: Ard Biesheuvel 
---
 include/asm-generic/purgatory.lds | 34 ++
 kernel/kexec_file.c   | 68 +++-
 2 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/purgatory.lds 
b/include/asm-generic/purgatory.lds
new file mode 100644
index 000000000000..260c457f7608
--- /dev/null
+++ b/include/asm-generic/purgatory.lds
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+PHDRS
+{
+   text PT_LOAD FLAGS(7) FILEHDR PHDRS;
+}
+
+SECTIONS
+{
+   . = SIZEOF_HEADERS;
+
+   .text : {
+   *(.text .rodata* .kexec-purgatory .data*)
+   } :text
+
+   .bss : {
+   *(.bss .dynbss)
+   } :text
+
+   .rela.dyn : {
+   *(.rela.*)
+   }
+
+   .symtab 0 : { *(.symtab) }
+   .strtab 0 : { *(.strtab) }
+   .shstrtab 0 : { *(.shstrtab) }
+
+   /DISCARD/ : {
+   *(.interp .modinfo .dynsym .dynstr .hash .gnu.* .dynamic 
.comment)
+   *(.got .plt .got.plt .plt.got .note.* .eh_frame .sframe)
+   }
+}
+
+ASSERT(SIZEOF(.rela.dyn) == 0, "Absolute relocations detected");
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index bef2f6f2571b..6379f8dfc29f 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -1010,6 +1010,62 @@ static int kexec_apply_relocations(struct kimage *image)
return 0;
 }
 
+/*
+ * kexec_load_purgatory_pie - Load the position independent purgatory object.
+ * @pi:Purgatory info struct.
+ * @kbuf:  Memory parameters to use.
+ *
+ * Load a purgatory PIE executable. This is a fully linked executable
+ * consisting of a single PT_LOAD segment that does not require any relocation
+ * processing.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+static int kexec_load_purgatory_pie(struct purgatory_info *pi,
+   struct kexec_buf *kbuf)
+{
+   const Elf_Phdr *phdr = (void *)pi->ehdr + pi->ehdr->e_phoff;
+   int ret;
+
+   if (pi->ehdr->e_phnum != 1)
+   return -EINVAL;
+
+   kbuf->bufsz = phdr->p_filesz;
+   kbuf->memsz = phdr->p_memsz;
+   kbuf->buf_align = phdr->p_align;
+
+   kbuf->buffer = vzalloc(kbuf->bufsz);
+   if (!kbuf->buffer)
+   return -ENOMEM;
+
+   ret = kexec_add_buffer(kbuf);
+   if (ret)
+   goto out_free_kbuf;
+
+   kbuf->image->start = kbuf->mem + pi->ehdr->e_entry;
+
+   pi->sechdrs = vcalloc(pi->ehdr->e_shnum, pi->ehdr->e_shentsize);
+   if (!pi->sechdrs)
+   goto out_free_kbuf;
+
+   pi->purgatory_buf = memcpy(kbuf->buffer,
+  (void *)pi->ehdr + phdr->p_offset,
+  kbuf->bufsz);
+
+   memcpy(pi->sechdrs, (void *)pi->ehdr + pi->ehdr->e_shoff,
+  pi->ehdr->e_shnum * pi->ehdr->e_shentsize);
+
+   for (int i = 0; i < pi->ehdr->e_shnum; i++)
+   if (pi->sechdrs[i].sh_flags & SHF_ALLOC)
+   pi->sechdrs[i].sh_addr += kbuf->mem;
+
+   return 0;
+
+out_free_kbuf:
+   vfree(kbuf->buffer);
+   return ret;
+}
+
 /*
  * kexec_load_purgatory - Load and relocate the purgatory object.
  * @image: Image to add the purgatory to.
@@ -1031,6 +1087,9 @@ int kexec_load_purgatory(struct kimage *image, struct 
kexec_buf *kbuf)
 
pi->ehdr = (const Elf_Ehdr *)kexec_purgatory;
 
+   if (pi->ehdr->e_type != ET_REL)
+   return kexec_load_purgatory_pie(pi, kbuf);
+
ret = kexec_purgatory_setup_kbuf(pi, kbuf);
if (ret)
return ret;
@@ -1087,7 +1146,8 @@ static const Elf_Sym *kexec_purgatory_find_symbol(struct 
purgatory_info *pi,
 
/* Go through symbols for a match */
for (k = 0; k < sechdrs[i].sh_size/sizeof(Elf_Sym); k++) {
-   if (ELF_ST_BIND

[RFC PATCH 2/9] x86/purgatory: Simplify stack handling

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

The x86 purgatory, which does little more than verify a SHA-256 hash of
the loaded segments, currently uses three different stacks:
- one in .bss that is used to call the purgatory C code
- one in .rodata that is only used to switch to an updated code segment
  descriptor in the GDT
- one in .data, which allows it to be prepopulated from the kexec loader
  in theory, but this is not actually being taken advantage of.

Simplify this, by dropping the latter two stacks, as well as the loader
logic that programs RSP.

Both the stacks in .bss and .data are 4k aligned, but 16 byte alignment
is more than sufficient.

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/include/asm/kexec.h  |  1 -
 arch/x86/kernel/kexec-bzimage64.c |  8 
 arch/x86/purgatory/entry64.S  |  8 
 arch/x86/purgatory/setup-x86_64.S |  2 +-
 arch/x86/purgatory/stack.S| 18 --
 5 files changed, 1 insertion(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 91ca9a9ee3a2..ee7b32565e5f 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -163,7 +163,6 @@ struct kexec_entry64_regs {
uint64_t rcx;
uint64_t rdx;
uint64_t rbx;
-   uint64_t rsp;
uint64_t rbp;
uint64_t rsi;
uint64_t rdi;
diff --git a/arch/x86/kernel/kexec-bzimage64.c 
b/arch/x86/kernel/kexec-bzimage64.c
index cde167b0ea92..f5bf1b7d01a6 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -400,7 +400,6 @@ static void *bzImage64_load(struct kimage *image, char 
*kernel,
unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
struct bzimage64_data *ldata;
struct kexec_entry64_regs regs64;
-   void *stack;
unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;
struct kexec_buf kbuf = { .image = image, .buf_max = ULONG_MAX,
@@ -550,14 +549,7 @@ static void *bzImage64_load(struct kimage *image, char 
*kernel,
regs64.rbx = 0; /* Bootstrap Processor */
regs64.rsi = bootparam_load_addr;
regs64.rip = kernel_load_addr + 0x200;
-   stack = kexec_purgatory_get_symbol_addr(image, "stack_end");
-   if (IS_ERR(stack)) {
-   pr_err("Could not find address of symbol stack_end\n");
-   ret = -EINVAL;
-   goto out_free_params;
-   }
 
-   regs64.rsp = (unsigned long)stack;
ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
 sizeof(regs64), 0);
if (ret)
diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
index 0b4390ce586b..9913877b0dbe 100644
--- a/arch/x86/purgatory/entry64.S
+++ b/arch/x86/purgatory/entry64.S
@@ -26,8 +26,6 @@ SYM_CODE_START(entry64)
movl%eax, %fs
movl%eax, %gs
 
-   /* Setup new stack */
-   leaqstack_init(%rip), %rsp
pushq   $0x10 /* CS */
leaqnew_cs_exit(%rip), %rax
pushq   %rax
@@ -41,7 +39,6 @@ new_cs_exit:
movqrdx(%rip), %rdx
movqrsi(%rip), %rsi
movqrdi(%rip), %rdi
-   movqrsp(%rip), %rsp
movqrbp(%rip), %rbp
movqr8(%rip), %r8
movqr9(%rip), %r9
@@ -63,7 +60,6 @@ rax:  .quad 0x0
 rcx:   .quad 0x0
 rdx:   .quad 0x0
 rbx:   .quad 0x0
-rsp:   .quad 0x0
 rbp:   .quad 0x0
 rsi:   .quad 0x0
 rdi:   .quad 0x0
@@ -97,7 +93,3 @@ SYM_DATA_START_LOCAL(gdt)
/* 0x18 4GB flat data segment */
.word 0xFFFF, 0x0000, 0x9200, 0x00CF
 SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
-
-SYM_DATA_START_LOCAL(stack)
-   .quad   0, 0
-SYM_DATA_END_LABEL(stack, SYM_L_LOCAL, stack_init)
diff --git a/arch/x86/purgatory/setup-x86_64.S b/arch/x86/purgatory/setup-x86_64.S
index 89d9e9e53fcd..2d10ff88851d 100644
--- a/arch/x86/purgatory/setup-x86_64.S
+++ b/arch/x86/purgatory/setup-x86_64.S
@@ -53,7 +53,7 @@ SYM_DATA_START_LOCAL(gdt)
 SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
 
.bss
-   .balign 4096
+   .balign 16
 SYM_DATA_START_LOCAL(lstack)
.skip 4096
 SYM_DATA_END_LABEL(lstack, SYM_L_LOCAL, lstack_end)
diff --git a/arch/x86/purgatory/stack.S b/arch/x86/purgatory/stack.S
deleted file mode 100644
index 1ef507ca50a5..
--- a/arch/x86/purgatory/stack.S
+++ /dev/null
@@ -1,18 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * purgatory:  stack
- *
- * Copyright (C) 2014 Red Hat Inc.
- */
-
-#include 
-
-   /* A stack for the loaded kernel.
-* Separate and in the data section so it can be prepopulated.
-*/
-   .data
-   .balign 4096
-
-SYM_DATA_START(stack)
-   .skip 4096
-SYM_DATA_END_LABEL(stack, SYM_L_GLOBAL, stack_end)
-- 
2.44.0.769.g3c40516874-goog



[RFC PATCH 1/9] x86/purgatory: Drop function entry padding from purgatory

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

The purgatory is a completely separate ELF executable carried inside the
kernel as an opaque binary blob. This means that function entry padding
and the associated ELF metadata are not exposed to the branch tracking
and code patching machinery, and can therefore be dropped from the purgatory
binary.

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/purgatory/Makefile | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index a18591f6e6d9..2df4a4b70ff5 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -23,6 +23,9 @@ KBUILD_CFLAGS := $(filter-out -fprofile-sample-use=% 
-fprofile-use=%,$(KBUILD_CF
 # by kexec. Remove -flto=* flags.
 KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO),$(KBUILD_CFLAGS))
 
+# Drop the function entry padding, which is not needed here
+KBUILD_CFLAGS := $(filter-out $(PADDING_CFLAGS),$(KBUILD_CFLAGS))
+
 # When linking purgatory.ro with -r unresolved symbols are not checked,
 # also link a purgatory.chk binary without -r to check for unresolved symbols.
 PURGATORY_LDFLAGS := -e purgatory_start -z nodefaultlib
-- 
2.44.0.769.g3c40516874-goog


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[RFC PATCH 5/9] x86/purgatory: Simplify GDT and drop data segment

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

Data segment selectors are ignored in long mode so there is no point in
programming them. So clear them instead. This only leaves the code
segment entry in the GDT, which can be moved up a slot now that the
second slot is no longer used as the GDT descriptor.

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/purgatory/entry64.S | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
index 888661d9db9c..3d09781d4f9a 100644
--- a/arch/x86/purgatory/entry64.S
+++ b/arch/x86/purgatory/entry64.S
@@ -23,14 +23,14 @@ SYM_CODE_START(entry64)
addq$10, %rsp
 
/* load the data segments */
-   movl$0x18, %eax /* data segment */
+   xorl%eax, %eax /* data segment */
movl%eax, %ds
movl%eax, %es
movl%eax, %ss
movl%eax, %fs
movl%eax, %gs
 
-   pushq   $0x10 /* CS */
+   pushq   $0x8 /* CS */
leaqnew_cs_exit(%rip), %rax
pushq   %rax
lretq
@@ -84,16 +84,9 @@ SYM_DATA_END(entry64_regs)
 SYM_DATA_START_LOCAL(gdt)
/*
 * 0x00 unusable segment
-* 0x08 unused
-* so use them as gdt ptr
 */
-   .word 0
.quad 0
-   .word 0, 0, 0
 
-   /* 0x10 4GB flat code segment */
+   /* 0x8 4GB flat code segment */
.word 0xFFFF, 0x0000, 0x9A00, 0x00AF
-
-   /* 0x18 4GB flat data segment */
-   .word 0xFFFF, 0x0000, 0x9200, 0x00CF
 SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
-- 
2.44.0.769.g3c40516874-goog
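
As a rough illustration of why the code selector moves from 0x10 to 0x8 and why the remaining entry still works in long mode, the descriptor and selector can be decoded in plain C. This is a userspace sketch, not part of the patch: the helper names are made up, and it assumes the usual flat descriptor encoding, i.e. the words 0xFFFF, 0x0000, 0x9A00, 0x00AF for code and 0xFFFF, 0x0000, 0x9200, 0x00CF for data.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four little-endian 16-bit words into one 8-byte GDT descriptor. */
static uint64_t desc_from_words(uint16_t w0, uint16_t w1,
				uint16_t w2, uint16_t w3)
{
	return (uint64_t)w0 | (uint64_t)w1 << 16 |
	       (uint64_t)w2 << 32 | (uint64_t)w3 << 48;
}

/*
 * Bit 43 marks an executable (code) segment and bit 53 is the L flag
 * that selects 64-bit mode for a code segment.
 */
static int desc_is_long_mode_code(uint64_t desc)
{
	int executable = (desc >> 43) & 1;
	int long_mode = (desc >> 53) & 1;

	return executable && long_mode;
}

/* A selector is (index << 3) | (TI << 2) | RPL, so drop the low 3 bits. */
static unsigned int selector_index(uint16_t sel)
{
	return sel >> 3;
}
```

With these helpers, selector 0x8 resolves to GDT slot 1, the slot directly after the null descriptor, while the old selectors 0x10 and 0x18 resolve to slots 2 and 3.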




[RFC PATCH 0/9] kexec x86 purgatory cleanup

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

The kexec purgatory is built like a kernel module, i.e., a partially
linked ELF object where each section is allocated and placed
individually, and all relocations need to be fixed up, even place
relative ones.

This makes sense for kernel modules, which share the address space with
the core kernel, and contain unresolved references that need to be wired
up to symbols in other modules or the kernel itself.

The purgatory, however, is a fully linked binary without any external
references, or any overlap with the kernel's virtual address space. So
it makes much more sense to create a fully linked ELF executable that
can just be loaded and run anywhere in memory.

The purgatory build on x86 has already switched over to position
independent codegen, which only leaves a handful of absolute references,
which can either be dropped (patch #3) or converted into a RIP-relative
one (patch #4). That leaves a purgatory executable that can run at any
offset in memory without applying any relocations whatsoever.

Some tweaks are needed to deal with the difference between partially
(ET_REL) and fully (ET_DYN/ET_EXEC) linked ELF objects, but with those
in place, a substantial amount of complicated ELF allocation, placement
and patching/relocation code can simply be dropped.

The last patch in the series removes this code from the generic kexec
implementation, but this can only be done once other architectures apply
the same changes proposed here for x86 (powerpc, s390 and riscv all
implement the purgatory using the shared logic).

Link: 
https://lore.kernel.org/all/CAKwvOd=3Jrzju++=Ve61=ZdeshxUM=K3-bGMNREnGOQgNw=a...@mail.gmail.com/
Link: https://lore.kernel.org/all/20240418201705.3673200-2-ardb+...@google.com/

Cc: Arnd Bergmann 
Cc: Eric Biederman 
Cc: kexec@lists.infradead.org
Cc: Nathan Chancellor 
Cc: Nick Desaulniers 
Cc: Kees Cook 
Cc: Bill Wendling 
Cc: Justin Stitt 
Cc: Masahiro Yamada 

Ard Biesheuvel (9):
  x86/purgatory: Drop function entry padding from purgatory
  x86/purgatory: Simplify stack handling
  x86/purgatory: Drop pointless GDT switch
  x86/purgatory: Avoid absolute reference to GDT
  x86/purgatory: Simplify GDT and drop data segment
  kexec: Add support for fully linked purgatory executables
  x86/purgatory: Use fully linked PIE ELF executable
  x86/purgatory: Simplify references to regs array
  kexec: Drop support for partially linked purgatory executables

 arch/x86/include/asm/kexec.h   |   8 -
 arch/x86/kernel/kexec-bzimage64.c  |   8 -
 arch/x86/kernel/machine_kexec_64.c | 127 --
 arch/x86/purgatory/Makefile|  17 +-
 arch/x86/purgatory/entry64.S   |  96 
 arch/x86/purgatory/setup-x86_64.S  |  31 +--
 arch/x86/purgatory/stack.S |  18 --
 include/asm-generic/purgatory.lds  |  34 +++
 kernel/kexec_file.c| 255 +++-
 9 files changed, 125 insertions(+), 469 deletions(-)
 delete mode 100644 arch/x86/purgatory/stack.S
 create mode 100644 include/asm-generic/purgatory.lds

-- 
2.44.0.769.g3c40516874-goog
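
For readers less familiar with the ELF object types mentioned above, the difference a loader sees between a partially linked and a fully linked purgatory comes down to the e_type field of the ELF header. The following is a minimal userspace sketch using the glibc <elf.h> definitions. The enum and function names are hypothetical and this is not the code from the series.

```c
#include <assert.h>
#include <elf.h>
#include <string.h>

/* Hypothetical classification, for illustration only. */
enum purgatory_kind {
	PURGATORY_INVALID,
	PURGATORY_PARTIAL,	/* ET_REL: sections placed and relocated one by one */
	PURGATORY_FULL,		/* ET_DYN/ET_EXEC: loadable as a single image */
};

static enum purgatory_kind classify_elf(const Elf64_Ehdr *ehdr)
{
	/* Reject anything that does not start with the ELF magic. */
	if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0)
		return PURGATORY_INVALID;

	switch (ehdr->e_type) {
	case ET_REL:
		return PURGATORY_PARTIAL;
	case ET_DYN:
	case ET_EXEC:
		return PURGATORY_FULL;
	default:
		return PURGATORY_INVALID;
	}
}
```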




[RFC PATCH 4/9] x86/purgatory: Avoid absolute reference to GDT

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

The purgatory is almost entirely position independent, without any need
for any relocation processing at load time except for the reference to
the GDT in the entry code. Generate this reference at runtime instead,
to remove the last R_X86_64_64 relocation from this code.

While the GDT itself needs to be preserved in memory as long as it is
live, the GDT descriptor that is used to program the GDT can be
discarded so it can be allocated on the stack.

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/purgatory/entry64.S | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
index 9913877b0dbe..888661d9db9c 100644
--- a/arch/x86/purgatory/entry64.S
+++ b/arch/x86/purgatory/entry64.S
@@ -16,7 +16,11 @@
 
 SYM_CODE_START(entry64)
/* Setup a gdt that should be preserved */
-   lgdt gdt(%rip)
+   leaqgdt(%rip), %rax
+   pushq   %rax
+   pushw   $gdt_end - gdt - 1
+   lgdt(%rsp)
+   addq$10, %rsp
 
/* load the data segments */
movl$0x18, %eax /* data segment */
@@ -83,8 +87,8 @@ SYM_DATA_START_LOCAL(gdt)
 * 0x08 unused
 * so use them as gdt ptr
 */
-   .word gdt_end - gdt - 1
-   .quad gdt
+   .word 0
+   .quad 0
.word 0, 0, 0
 
/* 0x10 4GB flat code segment */
-- 
2.44.0.769.g3c40516874-goog
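
The pushq/pushw pair in this patch builds the operand that LGDT expects: a 16-bit limit immediately followed by a 64-bit base, 10 bytes in total, which is why the stack pointer is adjusted by 10 afterwards. A small C sketch of that layout follows. The struct and helper names are made up, and the packed attribute is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Memory operand of LGDT in 64-bit mode. */
struct gdt_ptr {
	uint16_t limit;		/* size of the GDT in bytes, minus one */
	uint64_t base;		/* linear address of the first descriptor */
} __attribute__((packed));

/* Describe a GDT consisting of nr_entries 8-byte descriptors. */
static struct gdt_ptr make_gdt_ptr(const uint64_t *gdt, size_t nr_entries)
{
	struct gdt_ptr p = {
		.limit = (uint16_t)(nr_entries * sizeof(uint64_t) - 1),
		.base = (uint64_t)(uintptr_t)gdt,
	};

	return p;
}
```

Because the stack grows down, pushing the base first and the limit second leaves the stack pointer at the limit with the base 2 bytes above it, which matches the field order of the struct.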




[RFC PATCH 3/9] x86/purgatory: Drop pointless GDT switch

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

The x86 purgatory switches to a new GDT twice, and the first time, it
doesn't even bother to switch to the new code segment. Given that data
segment selectors are ignored in long mode, and the fact that the GDT is
reprogrammed again after returning from purgatory(), the first switch is
entirely pointless and can just be dropped altogether.

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/purgatory/setup-x86_64.S | 29 
 1 file changed, 29 deletions(-)

diff --git a/arch/x86/purgatory/setup-x86_64.S b/arch/x86/purgatory/setup-x86_64.S
index 2d10ff88851d..f160fc729cbe 100644
--- a/arch/x86/purgatory/setup-x86_64.S
+++ b/arch/x86/purgatory/setup-x86_64.S
@@ -15,17 +15,6 @@
.code64
 
 SYM_CODE_START(purgatory_start)
-   /* Load a gdt so I know what the segment registers are */
-   lgdtgdt(%rip)
-
-   /* load the data segments */
-   movl$0x18, %eax /* data segment */
-   movl%eax, %ds
-   movl%eax, %es
-   movl%eax, %ss
-   movl%eax, %fs
-   movl%eax, %gs
-
/* Setup a stack */
leaqlstack_end(%rip), %rsp
 
@@ -34,24 +23,6 @@ SYM_CODE_START(purgatory_start)
jmp entry64
 SYM_CODE_END(purgatory_start)
 
-   .section ".rodata"
-   .balign 16
-SYM_DATA_START_LOCAL(gdt)
-   /* 0x00 unusable segment
-* 0x08 unused
-* so use them as the gdt ptr
-*/
-   .word   gdt_end - gdt - 1
-   .quad   gdt
-   .word   0, 0, 0
-
-   /* 0x10 4GB flat code segment */
-   .word   0xFFFF, 0x0000, 0x9A00, 0x00AF
-
-   /* 0x18 4GB flat data segment */
-   .word   0xFFFF, 0x0000, 0x9200, 0x00CF
-SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
-
.bss
.balign 16
 SYM_DATA_START_LOCAL(lstack)
-- 
2.44.0.769.g3c40516874-goog




Re: [PATCH v8 14/15] x86: Secure Launch late initcall platform module

2024-02-23 Thread Ard Biesheuvel
On Thu, 22 Feb 2024 at 14:58, Daniel P. Smith
 wrote:
>
> On 2/15/24 03:40, Ard Biesheuvel wrote:
> > On Wed, 14 Feb 2024 at 23:32, Ross Philipson  
> > wrote:
> >>
> >> From: "Daniel P. Smith" 
> >>
> >> The Secure Launch platform module is a late init module. During the
> >> init call, the TPM event log is read and measurements taken in the
> >> early boot stub code are located. These measurements are extended
> >> into the TPM PCRs using the mainline TPM kernel driver.
> >>
> >> The platform module also registers the securityfs nodes to allow
> >> access to TXT register fields on Intel along with the fetching of
> >> and writing events to the late launch TPM log.
> >>
> >> Signed-off-by: Daniel P. Smith 
> >> Signed-off-by: garnetgrimm 
> >> Signed-off-by: Ross Philipson 
> >
> > There is an awful amount of code that executes between the point where
> > the measurements are taken and the point where they are loaded into
> > the PCRs. All of this code could subvert the boot flow and hide this
> > fact, by replacing the actual taken measurement values with the known
> > 'blessed' ones that will unseal the keys and/or phone home to do a
> > successful remote attestation.
>
> To set context, in general the motivation to employ an RTM, Static or
> Dynamic, integrity solution is to enable external platform validation,
> aka attestation. These trust chains are constructed from the principle
> of measure and execute that rely on the presence of a RoT for Storage
> (RTS) and a RoT for Reporting (RTR). Under the TCG architecture adopted
> by x86 vendors and now recently by Arm, those roles are fulfilled by the
> TPM. With this context, let's lay out the assumptive trusts being made here,
>    1. The CPU GETSEC instruction functions correctly
>    2. The IOMMU, and by extension the PMRs, functions correctly
>    3. The ACM authentication process functions correctly
>    4. The ACM functions correctly
>    5. The TPM interactions function correctly
>    6. The TPM functions correctly
>
> With this basis, let's explore your assertion here. The assertion breaks
> down into two scenarios. The first is that the at-rest kernel binary is
> corrupt, unintentionally (bug) or maliciously, either of which does not
> matter for the situation. For the sake of simplicity, corruption of the
> Linux kernel during loading or before the DRTM Event is considered an
> equivalent to corruption of the kernel at-rest. The second is that the
> kernel binary was corrupted in memory at some point after the DRTM event
> occurs.
>
> For both scenarios, the ACM will correctly configure the IOMMU PMRs to
> ensure the kernel can no longer be tampered with in memory. After which,
> the ACM will then accurately measure the kernel (bzImage) and safely
> store the measurement in the TPM.
>
> In the first scenario, the TPM will accurately report the kernel
> measurement in the attestation. The attestation authority will be able
> to detect if an invalid kernel was started and can take whatever
> remediation actions it may employ.
>
> In the second scenario, any attempt to corrupt the binary after the ACM
> has configured the IOMMU PMR will fail.
>
>

This protects the memory image from external masters after the
measurement has been taken.

So any external influences in the time window between taking the
measurements and loading them into the PCRs are out of scope here, I
guess?

Maybe it would help (or if I missed it - apologies) to include a
threat model here. I suppose physical tampering is out of scope?

> > At the very least, this should be documented somewhere. And if at all
> > possible, it should also be documented why this is ok, and to what
> > extent it limits the provided guarantees compared to a true D-RTM boot
> > where the early boot code measures straight into the TPMs before
> > proceeding.
>
> I can add a rendition of the above into the existing section of the
> documentation patch that already discusses separation of the measurement
> from the TPM recording code. As to the limits it incurs on the DRTM
> integrity, as explained above, I submit there are none.
>

Thanks for the elaborate explanation. And yes, please document this
with the changes.



Re: [PATCH v8 06/15] x86: Add early SHA support for Secure Launch early measurements

2024-02-23 Thread Ard Biesheuvel
On Thu, 22 Feb 2024 at 13:30, Andrew Cooper  wrote:
>
> On 22/02/2024 9:34 am, Ard Biesheuvel wrote:
> > On Thu, 22 Feb 2024 at 04:05, Andrew Cooper  
> > wrote:
> >> On 15/02/2024 8:17 am, Ard Biesheuvel wrote:
> >>> On Wed, 14 Feb 2024 at 23:31, Ross Philipson  
> >>> wrote:
> >>>> From: "Daniel P. Smith" 
> >>>>
> >>>> The SHA algorithms are necessary to measure configuration information 
> >>>> into
> >>>> the TPM as early as possible before using the values. This implementation
> >>>> uses the established approach of #including the SHA libraries directly in
> >>>> the code since the compressed kernel is not uncompressed at this point.
> >>>>
> >>>> The SHA code here has its origins in the code from the main kernel:
> >>>>
> >>>> commit c4d5b9ffa31f ("crypto: sha1 - implement base layer for SHA-1")
> >>>>
> >>>> A modified version of this code was introduced to the lib/crypto/sha1.c
> >>>> to bring it in line with the sha256 code and allow it to be pulled into 
> >>>> the
> >>>> setup kernel in the same manner as sha256 is.
> >>>>
> >>>> Signed-off-by: Daniel P. Smith 
> >>>> Signed-off-by: Ross Philipson 
> >>> We have had some discussions about this, and you really need to
> >>> capture the justification in the commit log for introducing new code
> >>> that implements an obsolete and broken hashing algorithm.
> >>>
> >>> SHA-1 is broken and should no longer be used for anything. Introducing
> >>> new support for a highly complex boot security feature, and then
> >>> relying on SHA-1 in the implementation makes this whole effort seem
> >>> almost futile, *unless* you provide some rock solid reasons here why
> >>> this is still safe.
> >>>
> >>> If the upshot would be that some people are stuck with SHA-1 so they
> >>> won't be able to use this feature, then I'm not convinced we should
> >>> obsess over that.
> >> To be absolutely crystal clear here.
> >>
> >> The choice of hash algorithm(s) are determined by the OEM and the
> >> platform, not by Linux.
> >>
> >> Failing to (at least) cap a PCR in a bank which the OEM/platform left
> >> active is a security vulnerability.  It permits the unsealing of secrets
> >> if an attacker can replay a good set of measurements into an unused bank.
> >>
> >> The only way to get rid of the requirement for SHA-1 here is to lobby
> >> the IHVs/OEMs, or perhaps the TCG, to produce/spec a platform where the
> >> SHA-1 banks can be disabled.  There are no known such platforms in the
> >> market today, to the best of our knowledge.
> >>
> > OK, so mainline Linux does not support secure launch at all today. At
> > this point, we need to decide whether or not tomorrow's mainline Linux
> > will support secure launch with SHA1 or without, right?
>
> I'd argue that's a slightly unfair characterisation.
>

Fair enough. I'm genuinely trying to have a precise understanding of
this, not trying to be dismissive.

> We want tomorrow's mainline to support Secure Launch.  What that entails
> under the hood is largely outside of the control of the end user.
>

So the debate is really whether it makes sense at all to support
Secure Launch on systems that are stuck on an obsolete and broken hash
algorithm. This is not hyperbole: SHA-1 is broken today and once these
changes hit production 1-2 years down the line, the situation will
only have deteriorated. And another 2-3 years later, we will be the
ones chasing obscure bugs on systems that were already obsolete when
this support was added.

So what is the value proposition here? An end user today, who is
mindful enough of security to actively invest the effort to migrate
their system from ordinary measured boot to secure launch, is really
going to do so on a system that only implements SHA-1 support?

> > And the point you are making here is that we need SHA-1 not only to a)
> > support systems that are on TPM 1.2 and support nothing else, but also
> > to b) ensure that crypto agile TPM 2.0 with both SHA-1 and SHA-256
> > enabled can be supported in a safe manner, which would involve
> > measuring some terminating event into the SHA-1 PCRs to ensure they
> > are not left in a dangling state that might allow an adversary to
> > trick the TPM into unsealing a secret that it shouldn't.
>

Re: [PATCH v8 06/15] x86: Add early SHA support for Secure Launch early measurements

2024-02-22 Thread Ard Biesheuvel
On Thu, 22 Feb 2024 at 04:05, Andrew Cooper  wrote:
>
> On 15/02/2024 8:17 am, Ard Biesheuvel wrote:
> > On Wed, 14 Feb 2024 at 23:31, Ross Philipson  
> > wrote:
> >> From: "Daniel P. Smith" 
> >>
> >> The SHA algorithms are necessary to measure configuration information into
> >> the TPM as early as possible before using the values. This implementation
> >> uses the established approach of #including the SHA libraries directly in
> >> the code since the compressed kernel is not uncompressed at this point.
> >>
> >> The SHA code here has its origins in the code from the main kernel:
> >>
> >> commit c4d5b9ffa31f ("crypto: sha1 - implement base layer for SHA-1")
> >>
> >> A modified version of this code was introduced to the lib/crypto/sha1.c
> >> to bring it in line with the sha256 code and allow it to be pulled into the
> >> setup kernel in the same manner as sha256 is.
> >>
> >> Signed-off-by: Daniel P. Smith 
> >> Signed-off-by: Ross Philipson 
> > We have had some discussions about this, and you really need to
> > capture the justification in the commit log for introducing new code
> > that implements an obsolete and broken hashing algorithm.
> >
> > SHA-1 is broken and should no longer be used for anything. Introducing
> > new support for a highly complex boot security feature, and then
> > relying on SHA-1 in the implementation makes this whole effort seem
> > almost futile, *unless* you provide some rock solid reasons here why
> > this is still safe.
> >
> > If the upshot would be that some people are stuck with SHA-1 so they
> > won't be able to use this feature, then I'm not convinced we should
> > obsess over that.
>
> To be absolutely crystal clear here.
>
> The choice of hash algorithm(s) are determined by the OEM and the
> platform, not by Linux.
>
> Failing to (at least) cap a PCR in a bank which the OEM/platform left
> active is a security vulnerability.  It permits the unsealing of secrets
> if an attacker can replay a good set of measurements into an unused bank.
>
> The only way to get rid of the requirement for SHA-1 here is to lobby
> the IHVs/OEMs, or perhaps the TCG, to produce/spec a platform where the
> SHA-1 banks can be disabled.  There are no known such platforms in the
> market today, to the best of our knowledge.
>

OK, so mainline Linux does not support secure launch at all today. At
this point, we need to decide whether or not tomorrow's mainline Linux
will support secure launch with SHA1 or without, right?

And the point you are making here is that we need SHA-1 not only to a)
support systems that are on TPM 1.2 and support nothing else, but also
to b) ensure that crypto agile TPM 2.0 with both SHA-1 and SHA-256
enabled can be supported in a safe manner, which would involve
measuring some terminating event into the SHA-1 PCRs to ensure they
are not left in a dangling state that might allow an adversary to
trick the TPM into unsealing a secret that it shouldn't.

So can we support b) without a), and if so, does measuring an
arbitrary dummy event into a PCR that is only meant to keep sealed
forever really require a SHA-1 implementation, or could we just use an
arbitrary (not even random) sequence of 160 bits and use that instead?



Re: [PATCH v8 15/15] x86: EFI stub DRTM launch support for Secure Launch

2024-02-21 Thread Ard Biesheuvel
On Wed, 21 Feb 2024 at 21:37, H. Peter Anvin  wrote:
>
> On February 21, 2024 12:17:30 PM PST, ross.philip...@oracle.com wrote:
> >On 2/15/24 1:01 AM, Ard Biesheuvel wrote:
> >> On Wed, 14 Feb 2024 at 23:32, Ross Philipson  
> >> wrote:
> >>>
> >>> This support allows the DRTM launch to be initiated after an EFI stub
> >>> launch of the Linux kernel is done. This is accomplished by providing
> >>> a handler to jump to when a Secure Launch is in progress. This has to be
> >>> called after the EFI stub does Exit Boot Services.
> >>>
> >>> Signed-off-by: Ross Philipson 
> >>> ---
> >>>   drivers/firmware/efi/libstub/x86-stub.c | 55 +
> >>>   1 file changed, 55 insertions(+)
> >>>
> >>> diff --git a/drivers/firmware/efi/libstub/x86-stub.c 
> >>> b/drivers/firmware/efi/libstub/x86-stub.c
> >>> index 0d510c9a06a4..4df2cf539194 100644
> >>> --- a/drivers/firmware/efi/libstub/x86-stub.c
> >>> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> >>> @@ -9,6 +9,7 @@
> >>>   #include 
> >>>   #include 
> >>>   #include 
> >>> +#include 
> >>>
> >>>   #include 
> >>>   #include 
> >>> @@ -810,6 +811,57 @@ static efi_status_t efi_decompress_kernel(unsigned 
> >>> long *kernel_entry)
> >>>  return EFI_SUCCESS;
> >>>   }
> >>>
> >>> +static void efi_secure_launch(struct boot_params *boot_params)
> >>> +{
> >>> +   struct slr_entry_uefi_config *uefi_config;
> >>> +   struct slr_uefi_cfg_entry *uefi_entry;
> >>> +   struct slr_entry_dl_info *dlinfo;
> >>> +   efi_guid_t guid = SLR_TABLE_GUID;
> >>> +   struct slr_table *slrt;
> >>> +   u64 memmap_hi;
> >>> +   void *table;
> >>> +   u8 buf[64] = {0};
> >>> +
> >>
> >> If you add a flex array to slr_entry_uefi_config as I suggested in
> >> response to the other patch, we could simplify this substantially
> >
> >I feel like there is some reason why we did not use flex arrays. We were 
> >talking and we seem to remember we used to use them and someone asked us to 
> >remove them. We are still looking into it. But if we can go back to them, I 
> >will take all the changes you recommended here.
> >
>
> Linux kernel code doesn't use VLAs because of the limited stack size, and 
> VLAs or alloca() makes stack size tracking impossible. Although this 
> technically speaking runs in a different environment, it is easier to enforce 
> the constraint globally.

Flex array != VLA

VLAs were phased out for this reason (and VLAISs [VLAs in
structs] were phased out before that because they are a GNU extension
and not supported by Clang).

Today, VLAs are not supported anywhere in the kernel.

Flex arrays are widely used in the kernel. A flex array is a trailing
array of unspecified size in a struct that makes the entire *type*
have a variable size. But that does not make them VLAs (or VLAISs) - a
VLA is a stack allocated *variable* whose size is based on a function
parameter.

Instances of types containing flex arrays can be allocated statically,
or dynamically on the heap. This is common practice in the kernel, and
even supported by instrumentation to help the compiler track the
runtime size and flag overruns. We are even in the process of adding
compiler support to annotate struct members as carrying the number of
elements in an associated flex array, to improve the coverage of the
instrumentation.

I am not asking for a VLA here, only a flex array.
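
To make the distinction concrete, here is a minimal userspace sketch. It is not the Secure Launch code: struct cfg_entry, struct cfg_table and cfg_table_alloc() are hypothetical names, and calloc() stands in for whatever allocator the real environment provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical entry type, for illustration only. */
struct cfg_entry {
	uint16_t pcr;
	uint64_t addr;
};

/*
 * A type with a flex array: the trailing member has no fixed length,
 * so sizeof(struct cfg_table) does not include any elements of it.
 */
struct cfg_table {
	uint16_t nr_entries;
	struct cfg_entry entries[];	/* flexible array member (C99) */
};

/*
 * Allocate a table with room for n entries on the heap. The size is
 * computed explicitly, so nothing about the stack frame varies.
 * A VLA, by contrast, would be a local such as
 *     struct cfg_entry tmp[n];
 * whose stack footprint depends on a runtime value.
 */
static struct cfg_table *cfg_table_alloc(uint16_t n)
{
	struct cfg_table *t;

	t = calloc(1, sizeof(*t) + (size_t)n * sizeof(t->entries[0]));
	if (t)
		t->nr_entries = n;
	return t;
}
```

In the kernel, the size computation would normally go through the struct_size() helper rather than open-coded arithmetic, so that overflow is checked.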



Re: [PATCH v8 15/15] x86: EFI stub DRTM launch support for Secure Launch

2024-02-15 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:32, Ross Philipson  wrote:
>
> This support allows the DRTM launch to be initiated after an EFI stub
> launch of the Linux kernel is done. This is accomplished by providing
> a handler to jump to when a Secure Launch is in progress. This has to be
> called after the EFI stub does Exit Boot Services.
>
> Signed-off-by: Ross Philipson 
> ---
>  drivers/firmware/efi/libstub/x86-stub.c | 55 +
>  1 file changed, 55 insertions(+)
>
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c 
> b/drivers/firmware/efi/libstub/x86-stub.c
> index 0d510c9a06a4..4df2cf539194 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -810,6 +811,57 @@ static efi_status_t efi_decompress_kernel(unsigned long 
> *kernel_entry)
> return EFI_SUCCESS;
>  }
>
> +static void efi_secure_launch(struct boot_params *boot_params)
> +{
> +   struct slr_entry_uefi_config *uefi_config;
> +   struct slr_uefi_cfg_entry *uefi_entry;
> +   struct slr_entry_dl_info *dlinfo;
> +   efi_guid_t guid = SLR_TABLE_GUID;
> +   struct slr_table *slrt;
> +   u64 memmap_hi;
> +   void *table;
> +   u8 buf[64] = {0};
> +

If you add a flex array to slr_entry_uefi_config as I suggested in
response to the other patch, we could simplify this substantially

static struct slr_entry_uefi_config cfg = {
.hdr.tag= SLR_ENTRY_UEFI_CONFIG,
.hdr.size   = sizeof(cfg),
.revision   = SLR_UEFI_CONFIG_REVISION,
.nr_entries = 1,
.entries[0] = {
.pcr= 18,
.evt_info = "Measured UEFI memory map",
},
};

cfg.entries[0].cfg  = boot_params->efi_info.efi_memmap |
  (u64)boot_params->efi_info.efi_memmap_hi << 32;
cfg.entries[0].size = boot_params->efi_info.efi_memmap_size;
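
A note on the expression that joins the two halves of the memory map address: the cast has to be applied to the high half before the shift, because shifting a 32-bit value by 32 bits is undefined behaviour in C. The same pattern as a standalone sketch, where join_u32_halves() is a made-up name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Join the low and high 32-bit halves of a 64-bit address.
 * The cast binds tighter than the shift, and the shift binds
 * tighter than the bitwise OR, so this evaluates as
 * lo | (((uint64_t)hi) << 32).
 */
static uint64_t join_u32_halves(uint32_t lo, uint32_t hi)
{
	return lo | (uint64_t)hi << 32;
}
```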



> +   table = get_efi_config_table(guid);
> +
> +   /*
> +* The presence of this table indicated a Secure Launch
> +* is being requested.
> +*/
> +   if (!table)
> +   return;
> +
> +   slrt = (struct slr_table *)table;
> +
> +   if (slrt->magic != SLR_TABLE_MAGIC)
> +   return;
> +

slrt = (struct slr_table *)get_efi_config_table(guid);
if (!slrt || slrt->magic != SLR_TABLE_MAGIC)
return;

> +   /* Add config information to measure the UEFI memory map */
> +   uefi_config = (struct slr_entry_uefi_config *)buf;
> +   uefi_config->hdr.tag = SLR_ENTRY_UEFI_CONFIG;
> +   uefi_config->hdr.size = sizeof(*uefi_config) + sizeof(*uefi_entry);
> +   uefi_config->revision = SLR_UEFI_CONFIG_REVISION;
> +   uefi_config->nr_entries = 1;
> +   uefi_entry = (struct slr_uefi_cfg_entry *)(buf + 
> sizeof(*uefi_config));
> +   uefi_entry->pcr = 18;
> +   uefi_entry->cfg = boot_params->efi_info.efi_memmap;
> +   memmap_hi = boot_params->efi_info.efi_memmap_hi;
> +   uefi_entry->cfg |= memmap_hi << 32;
> +   uefi_entry->size = boot_params->efi_info.efi_memmap_size;
> +   memcpy(&uefi_entry->evt_info[0], "Measured UEFI memory map",
> +   strlen("Measured UEFI memory map"));
> +

Drop all of this

> +   if (slr_add_entry(slrt, (struct slr_entry_hdr *)uefi_config))

if (slr_add_entry(slrt, &cfg.hdr))


> +   return;
> +
> +   /* Jump through DL stub to initiate Secure Launch */
> +   dlinfo = (struct slr_entry_dl_info *)
> +   slr_next_entry_by_tag(slrt, NULL, SLR_ENTRY_DL_INFO);
> +
> +   asm volatile ("jmp *%%rax"
> + : : "a" (dlinfo->dl_handler), "D" 
> (&dlinfo->bl_context));

Fix the prototype and just do

dlinfo->dl_handler(&dlinfo->bl_context);
unreachable();


So in summary, this becomes

static void efi_secure_launch(struct boot_params *boot_params)
{
static struct slr_entry_uefi_config cfg = {
.hdr.tag= SLR_ENTRY_UEFI_CONFIG,
.hdr.size   = sizeof(cfg),
.revision   = SLR_UEFI_CONFIG_REVISION,
.nr_entries = 1,
.entries[0] = {
.pcr= 18,
.evt_info = "Measured UEFI memory map",
},
};
struct slr_entry_dl_info *dlinfo;
efi_guid_t guid = SLR_TABLE_GUID;
struct slr_table *slrt;

/*
 * The presence of this table indicates a Secure Launch
 * is being requested.
 */
slrt = (struct slr_table *)get_efi_config_table(guid);
if (!slrt || slrt->magic != SLR_TABLE_MAGIC)
return;

cfg.entries[0].cfg  = boot_params->efi_info.efi_memmap |
  (u64)boot_params->efi_info.efi_memmap_hi << 32;
cfg.entries[0].size = boot_params->efi_info.efi_memmap_size;

if (slr_add_

Re: [PATCH v8 14/15] x86: Secure Launch late initcall platform module

2024-02-15 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:32, Ross Philipson  wrote:
>
> From: "Daniel P. Smith" 
>
> The Secure Launch platform module is a late init module. During the
> init call, the TPM event log is read and measurements taken in the
> early boot stub code are located. These measurements are extended
> into the TPM PCRs using the mainline TPM kernel driver.
>
> The platform module also registers the securityfs nodes to allow
> access to TXT register fields on Intel along with the fetching of
> and writing events to the late launch TPM log.
>
> Signed-off-by: Daniel P. Smith 
> Signed-off-by: garnetgrimm 
> Signed-off-by: Ross Philipson 

There is an awful lot of code that executes between the point where
the measurements are taken and the point where they are loaded into
the PCRs. All of this code could subvert the boot flow and hide this
fact, by replacing the actual taken measurement values with the known
'blessed' ones that will unseal the keys and/or phone home to do a
successful remote attestation.

At the very least, this should be documented somewhere. And if at all
possible, it should also be documented why this is ok, and to what
extent it limits the provided guarantees compared to a true D-RTM boot
where the early boot code measures straight into the TPMs before
proceeding.


> ---
>  arch/x86/kernel/Makefile   |   1 +
>  arch/x86/kernel/slmodule.c | 511 +
>  2 files changed, 512 insertions(+)
>  create mode 100644 arch/x86/kernel/slmodule.c
>
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 5848ea310175..948346ff4595 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -75,6 +75,7 @@ obj-$(CONFIG_IA32_EMULATION)  += tls.o
>  obj-y  += step.o
>  obj-$(CONFIG_INTEL_TXT)+= tboot.o
>  obj-$(CONFIG_SECURE_LAUNCH)+= slaunch.o
> +obj-$(CONFIG_SECURE_LAUNCH)+= slmodule.o
>  obj-$(CONFIG_ISA_DMA_API)  += i8237.o
>  obj-y  += stacktrace.o
>  obj-y  += cpu/
> diff --git a/arch/x86/kernel/slmodule.c b/arch/x86/kernel/slmodule.c
> new file mode 100644
> index ..52269f24902e
> --- /dev/null
> +++ b/arch/x86/kernel/slmodule.c
> @@ -0,0 +1,511 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Secure Launch late validation/setup, securityfs exposure and finalization.
> + *
> + * Copyright (c) 2022 Apertus Solutions, LLC
> + * Copyright (c) 2021 Assured Information Security, Inc.
> + * Copyright (c) 2022, Oracle and/or its affiliates.
> + *
> + * Co-developed-by: Garnet T. Grimm 
> + * Signed-off-by: Garnet T. Grimm 
> + * Signed-off-by: Daniel P. Smith 
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * The macro DECLARE_TXT_PUB_READ_U is used to read values from the TXT
> + * public registers as unsigned values.
> + */
> +#define DECLARE_TXT_PUB_READ_U(size, fmt, msg_size)\
> +static ssize_t txt_pub_read_u##size(unsigned int offset,   \
> +   loff_t *read_offset,\
> +   size_t read_len,\
> +   char __user *buf)   \
> +{  \
> +   char msg_buffer[msg_size];  \
> +   u##size reg_value = 0;  \
> +   void __iomem *txt;  \
> +   \
> +   txt = ioremap(TXT_PUB_CONFIG_REGS_BASE, \
> +   TXT_NR_CONFIG_PAGES * PAGE_SIZE);   \
> +   if (!txt)   \
> +   return -EFAULT; \
> +   memcpy_fromio(&reg_value, txt + offset, sizeof(u##size));   \
> +   iounmap(txt);   \
> +   snprintf(msg_buffer, msg_size, fmt, reg_value); \
> +   return simple_read_from_buffer(buf, read_len, read_offset,  \
> +   &msg_buffer, msg_size); \
> +}
> +
> +DECLARE_TXT_PUB_READ_U(8, "%#04x\n", 6);
> +DECLARE_TXT_PUB_READ_U(32, "%#010x\n", 12);
> +DECLARE_TXT_PUB_READ_U(64, "%#018llx\n", 20);
> +
> +#define DECLARE_TXT_FOPS(reg_name, reg_offset, reg_size)   \
> +static ssize_t txt_##reg_name##_read(struct file *flip,    \
> +   char __user *buf, size_t read_len, loff_t *read_offset) \
> +{  \
> +   return txt_pub_read_u##reg_

Re: [PATCH v8 07/15] x86: Secure Launch kernel early boot stub

2024-02-15 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:32, Ross Philipson  wrote:
>
> The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> later AMD SKINIT) to vector to during the late launch. The symbol
> sl_stub_entry is that entry point and its offset into the kernel is
> conveyed to the launching code using the MLE (Measured Launch
> Environment) header in the structure named mle_header. The offset of the
> MLE header is set in the kernel_info. The routine sl_stub contains the
> very early late launch setup code responsible for setting up the basic
> environment to allow the normal kernel startup_32 code to proceed. It is
> also responsible for properly waking and handling the APs on Intel
> platforms. The routine sl_main which runs after entering 64b mode is
> responsible for measuring configuration and module information before
> it is used like the boot params, the kernel command line, the TXT heap,
> an external initramfs, etc.
>
> Signed-off-by: Ross Philipson 
> ---
>  Documentation/arch/x86/boot.rst|  21 +
>  arch/x86/boot/compressed/Makefile  |   3 +-
>  arch/x86/boot/compressed/head_64.S |  34 ++
>  arch/x86/boot/compressed/kernel_info.S |  34 ++
>  arch/x86/boot/compressed/sl_main.c | 582 
>  arch/x86/boot/compressed/sl_stub.S | 705 +
>  arch/x86/include/asm/msr-index.h   |   5 +
>  arch/x86/include/uapi/asm/bootparam.h  |   1 +
>  arch/x86/kernel/asm-offsets.c  |  20 +
>  9 files changed, 1404 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/boot/compressed/sl_main.c
>  create mode 100644 arch/x86/boot/compressed/sl_stub.S
>
> diff --git a/Documentation/arch/x86/boot.rst b/Documentation/arch/x86/boot.rst
> index c513855a54bb..ce6a51c6d4e7 100644
> --- a/Documentation/arch/x86/boot.rst
> +++ b/Documentation/arch/x86/boot.rst
> @@ -482,6 +482,14 @@ Protocol:  2.00+
> - If 1, KASLR enabled.
> - If 0, KASLR disabled.
>
> +  Bit 2 (kernel internal): SLAUNCH_FLAG
> +
> +   - Used internally by the compressed kernel to communicate

s/compressed kernel/decompressor/

> + Secure Launch status to kernel proper.
> +
> +   - If 1, Secure Launch enabled.
> +   - If 0, Secure Launch disabled.
> +
>Bit 5 (write): QUIET_FLAG
>
> - If 0, print early messages.
> @@ -1027,6 +1035,19 @@ Offset/size: 0x000c/4
>
>This field contains maximal allowed type for setup_data and setup_indirect structs.
>
> +   =
> +Field name:mle_header_offset
> +Offset/size:   0x0010/4
> +   =
> +
> +  This field contains the offset to the Secure Launch Measured Launch Environment
> +  (MLE) header. This offset is used to locate information needed during a secure
> +  late launch using Intel TXT. If the offset is zero, the kernel does not have
> +  Secure Launch capabilities. The MLE entry point is called from TXT on the BSP
> +  following a success measured launch. The specific state of the processors is
> +  outlined in the TXT Software Development Guide, the latest can be found here:
> +  https://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-txt-software-development-guide.pdf
> +
>
>  The Image Checksum
>  ==
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index a1b018eb9801..012f7ca780c3 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -118,7 +118,8 @@ vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
>  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
>  vmlinux-objs-$(CONFIG_EFI_STUB) += $(objtree)/drivers/firmware/efi/libstub/lib.a
>
> -vmlinux-objs-$(CONFIG_SECURE_LAUNCH) += $(obj)/early_sha1.o $(obj)/early_sha256.o
> +vmlinux-objs-$(CONFIG_SECURE_LAUNCH) += $(obj)/early_sha1.o $(obj)/early_sha256.o \
> +   $(obj)/sl_main.o $(obj)/sl_stub.o
>
>  $(obj)/vmlinux: $(vmlinux-objs-y) FORCE
> $(call if_changed,ld)
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index bf4a10a5794f..6fa5bb87195b 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -415,6 +415,17 @@ SYM_CODE_START(startup_64)
> pushq   $0
> popfq
>
> +#ifdef CONFIG_SECURE_LAUNCH
> +   pushq   %rsi
> +

This push and the associated pop are no longer needed.

> +   /* Ensure the relocation region coverd by a PMR */

'is covered'

> +   movq%rbx, %rdi
> +   movl$(_bss - startup_32), %esi
> +   callq   sl_check_region
> +
> +   popq%rsi
> +#endif
> +
>  /*
>   * Copy the compressed kernel to the end of our buffer
>   * where decompression in place becomes safe.
> @@ -457,6 +468,29 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
> shrq$3, %rcx
> rep stosq
>
> +#ifdef CONFIG_SECURE_LAUNCH
> +   /*
> +* Have to do the final early sl stub work in 64b

Re: [PATCH v8 06/15] x86: Add early SHA support for Secure Launch early measurements

2024-02-15 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:31, Ross Philipson  wrote:
>
> From: "Daniel P. Smith" 
>
> The SHA algorithms are necessary to measure configuration information into
> the TPM as early as possible before using the values. This implementation
> uses the established approach of #including the SHA libraries directly in
> the code since the compressed kernel is not uncompressed at this point.
>
> The SHA code here has its origins in the code from the main kernel:
>
> commit c4d5b9ffa31f ("crypto: sha1 - implement base layer for SHA-1")
>
> A modified version of this code was introduced to the lib/crypto/sha1.c
> to bring it in line with the sha256 code and allow it to be pulled into the
> setup kernel in the same manner as sha256 is.
>
> Signed-off-by: Daniel P. Smith 
> Signed-off-by: Ross Philipson 

We have had some discussions about this, and you really need to
capture the justification in the commit log for introducing new code
that implements an obsolete and broken hashing algorithm.

SHA-1 is broken and should no longer be used for anything. Introducing
new support for a highly complex boot security feature, and then
relying on SHA-1 in the implementation makes this whole effort seem
almost futile, *unless* you provide some rock solid reasons here why
this is still safe.

If the upshot would be that some people are stuck with SHA-1 so they
won't be able to use this feature, then I'm not convinced we should
obsess over that.

> ---
>  arch/x86/boot/compressed/Makefile   |  2 +
>  arch/x86/boot/compressed/early_sha1.c   | 12 
>  arch/x86/boot/compressed/early_sha256.c |  6 ++



>  include/crypto/sha1.h   |  1 +
>  lib/crypto/sha1.c   | 81 +

This needs to be a separate patch in any case.


>  5 files changed, 102 insertions(+)
>  create mode 100644 arch/x86/boot/compressed/early_sha1.c
>  create mode 100644 arch/x86/boot/compressed/early_sha256.c
>
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index f19c038409aa..a1b018eb9801 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -118,6 +118,8 @@ vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
>  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
>  vmlinux-objs-$(CONFIG_EFI_STUB) += $(objtree)/drivers/firmware/efi/libstub/lib.a
>
> +vmlinux-objs-$(CONFIG_SECURE_LAUNCH) += $(obj)/early_sha1.o $(obj)/early_sha256.o
> +
>  $(obj)/vmlinux: $(vmlinux-objs-y) FORCE
> $(call if_changed,ld)
>
> diff --git a/arch/x86/boot/compressed/early_sha1.c b/arch/x86/boot/compressed/early_sha1.c
> new file mode 100644
> index ..0c7cf6f8157a
> --- /dev/null
> +++ b/arch/x86/boot/compressed/early_sha1.c
> @@ -0,0 +1,12 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022 Apertus Solutions, LLC.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "../../../../lib/crypto/sha1.c"
> diff --git a/arch/x86/boot/compressed/early_sha256.c b/arch/x86/boot/compressed/early_sha256.c
> new file mode 100644
> index ..54930166ffee
> --- /dev/null
> +++ b/arch/x86/boot/compressed/early_sha256.c
> @@ -0,0 +1,6 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022 Apertus Solutions, LLC
> + */
> +
> +#include "../../../../lib/crypto/sha256.c"
> diff --git a/include/crypto/sha1.h b/include/crypto/sha1.h
> index 044ecea60ac8..d715dd5332e1 100644
> --- a/include/crypto/sha1.h
> +++ b/include/crypto/sha1.h
> @@ -42,5 +42,6 @@ extern int crypto_sha1_finup(struct shash_desc *desc, const u8 *data,
>  #define SHA1_WORKSPACE_WORDS   16
>  void sha1_init(__u32 *buf);
>  void sha1_transform(__u32 *digest, const char *data, __u32 *W);
> +void sha1(const u8 *data, unsigned int len, u8 *out);
>
>  #endif /* _CRYPTO_SHA1_H */
> diff --git a/lib/crypto/sha1.c b/lib/crypto/sha1.c
> index 1aebe7be9401..10152125b338 100644
> --- a/lib/crypto/sha1.c
> +++ b/lib/crypto/sha1.c
> @@ -137,4 +137,85 @@ void sha1_init(__u32 *buf)
>  }
>  EXPORT_SYMBOL(sha1_init);
>
> +static void __sha1_transform(u32 *digest, const char *data)
> +{
> +   u32 ws[SHA1_WORKSPACE_WORDS];
> +
> +   sha1_transform(digest, data, ws);
> +
> +   memzero_explicit(ws, sizeof(ws));
> +}
> +
> +static void sha1_update(struct sha1_state *sctx, const u8 *data, unsigned int len)
> +{
> +   unsigned int partial = sctx->count % SHA1_BLOCK_SIZE;
> +
> +   sctx->count += len;
> +
> +   if (likely((partial + len) >= SHA1_BLOCK_SIZE)) {
> +   int blocks;
> +
> +   if (partial) {
> +   int p = SHA1_BLOCK_SIZE - partial;
> +
> +   memcpy(sctx->buffer + partial, data, p);
> +   data += p;
> +   len -= p;
> +
> +   __sha1_transform(sctx->state, sctx->buffer);
> +   }
> +
> +   blocks = len / SHA1_BLOCK_SIZE;
> +

Re: [PATCH v8 04/15] x86: Secure Launch Resource Table header file

2024-02-15 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:31, Ross Philipson  wrote:
>
> Introduce the Secure Launch Resource Table which forms the formal
> interface between the pre and post launch code.
>
> Signed-off-by: Ross Philipson 
> ---
>  include/linux/slr_table.h | 270 ++
>  1 file changed, 270 insertions(+)
>  create mode 100644 include/linux/slr_table.h
>
> diff --git a/include/linux/slr_table.h b/include/linux/slr_table.h
> new file mode 100644
> index ..42020988233a
> --- /dev/null
> +++ b/include/linux/slr_table.h
> @@ -0,0 +1,270 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Secure Launch Resource Table
> + *
> + * Copyright (c) 2023, Oracle and/or its affiliates.
> + */
> +
> +#ifndef _LINUX_SLR_TABLE_H
> +#define _LINUX_SLR_TABLE_H
> +
> +/* Put this in efi.h if it becomes a standard */
> +#define SLR_TABLE_GUID EFI_GUID(0x877a9b2a, 0x0385, 0x45d1, 0xa0, 0x34, 0x9d, 0xac, 0x9c, 0x9e, 0x56, 0x5f)
> +
> +/* SLR table header values */
> +#define SLR_TABLE_MAGIC0x4452544d
> +#define SLR_TABLE_REVISION 1
> +
> +/* Current revisions for the policy and UEFI config */
> +#define SLR_POLICY_REVISION1
> +#define SLR_UEFI_CONFIG_REVISION   1
> +
> +/* SLR defined architectures */
> +#define SLR_INTEL_TXT  1
> +#define SLR_AMD_SKINIT 2
> +
> +/* SLR defined bootloaders */
> +#define SLR_BOOTLOADER_INVALID 0
> +#define SLR_BOOTLOADER_GRUB1
> +
> +/* Log formats */
> +#define SLR_DRTM_TPM12_LOG 1
> +#define SLR_DRTM_TPM20_LOG 2
> +
> +/* DRTM Policy Entry Flags */
> +#define SLR_POLICY_FLAG_MEASURED   0x1
> +#define SLR_POLICY_IMPLICIT_SIZE   0x2
> +
> +/* Array Lengths */
> +#define TPM_EVENT_INFO_LENGTH  32
> +#define TXT_VARIABLE_MTRRS_LENGTH  32
> +
> +/* Tags */
> +#define SLR_ENTRY_INVALID  0x
> +#define SLR_ENTRY_DL_INFO  0x0001
> +#define SLR_ENTRY_LOG_INFO 0x0002
> +#define SLR_ENTRY_ENTRY_POLICY 0x0003
> +#define SLR_ENTRY_INTEL_INFO   0x0004
> +#define SLR_ENTRY_AMD_INFO 0x0005
> +#define SLR_ENTRY_ARM_INFO 0x0006
> +#define SLR_ENTRY_UEFI_INFO0x0007
> +#define SLR_ENTRY_UEFI_CONFIG  0x0008
> +#define SLR_ENTRY_END  0x
> +
> +/* Entity Types */
> +#define SLR_ET_UNSPECIFIED 0x
> +#define SLR_ET_SLRT0x0001
> +#define SLR_ET_BOOT_PARAMS 0x0002
> +#define SLR_ET_SETUP_DATA  0x0003
> +#define SLR_ET_CMDLINE 0x0004
> +#define SLR_ET_UEFI_MEMMAP 0x0005
> +#define SLR_ET_RAMDISK 0x0006
> +#define SLR_ET_TXT_OS2MLE  0x0010
> +#define SLR_ET_UNUSED  0x
> +
> +#ifndef __ASSEMBLY__
> +
> +/*
> + * Primary SLR Table Header
> + */
> +struct slr_table {
> +   u32 magic;
> +   u16 revision;
> +   u16 architecture;
> +   u32 size;
> +   u32 max_size;
> +   /* entries[] */
> +} __packed;

Packing this struct has no effect on the layout so better drop the
__packed here. If this table is part of a structure that can appear
misaligned in memory, better to pack the outer struct or deal with it
there in another way.
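
To illustrate the point, here is a userspace sketch (not kernel code) that
compares the two layouts. The _plain/_packed struct names are made up for the
comparison, the fixed-width typedefs stand in for the kernel's u16/u32, and
the attribute syntax assumes GCC or Clang:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's fixed-width types. */
typedef uint16_t u16;
typedef uint32_t u32;

/* Members as in the patch, without the attribute. */
struct slr_table_plain {
	u32 magic;
	u16 revision;
	u16 architecture;
	u32 size;
	u32 max_size;
};

/* Same members, packed (GCC/Clang attribute syntax). */
struct slr_table_packed {
	u32 magic;
	u16 revision;
	u16 architecture;
	u32 size;
	u32 max_size;
} __attribute__((packed));

/* Every member is already naturally aligned, so no padding is removed. */
static_assert(sizeof(struct slr_table_plain) ==
	      sizeof(struct slr_table_packed), "same size");
static_assert(offsetof(struct slr_table_plain, architecture) ==
	      offsetof(struct slr_table_packed, architecture), "same offset");
static_assert(offsetof(struct slr_table_plain, max_size) ==
	      offsetof(struct slr_table_packed, max_size), "same offset");

/* What the attribute does change is the alignment the compiler may assume. */
static_assert(_Alignof(struct slr_table_packed) == 1, "packed drops alignment");
```

So the attribute only lowers the assumed alignment, which is why it belongs on
the containing structure if misaligned placement is the actual concern.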

> +
> +/*
> + * Common SLRT Table Header
> + */
> +struct slr_entry_hdr {
> +   u16 tag;
> +   u16 size;
> +} __packed;

Same here

> +
> +/*
> + * Boot loader context
> + */
> +struct slr_bl_context {
> +   u16 bootloader;
> +   u16 reserved;
> +   u64 context;
> +} __packed;
> +
> +/*
> + * DRTM Dynamic Launch Configuration
> + */
> +struct slr_entry_dl_info {
> +   struct slr_entry_hdr hdr;
> +   struct slr_bl_context bl_context;
> +   u64 dl_handler;

I noticed in the EFI patch that this is actually

void (*dl_handler)(struct slr_bl_context *bl_context);

so better declare it as such.
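
Roughly like this, as a userspace sketch only: dl_info_sketch is a cut-down
stand-in for slr_entry_dl_info, and demo_handler and launch are hypothetical
names used just to show the call. Note that a function pointer only has the
same size as the u64 it replaces on 64-bit builds.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's fixed-width types. */
typedef uint16_t u16;
typedef uint64_t u64;

struct slr_bl_context {
	u16 bootloader;
	u16 reserved;
	u64 context;
};

/* Cut-down stand-in for slr_entry_dl_info: only the members used here. */
struct dl_info_sketch {
	struct slr_bl_context bl_context;
	void (*dl_handler)(struct slr_bl_context *bl_context);
};

/* Hypothetical handler; the real one would not return to its caller. */
static void demo_handler(struct slr_bl_context *bl_context)
{
	bl_context->context = 0x1234;
}

/* With a typed member the call is plain C, no inline asm or casts needed. */
static void launch(struct dl_info_sketch *dlinfo)
{
	assert(dlinfo->dl_handler);
	dlinfo->dl_handler(&dlinfo->bl_context);
}
```

In this sketch the handler returns so that the effect can be observed; in the
real flow the call is followed by unreachable().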

> +   u64 dce_base;
> +   u32 dce_size;
> +   u64 dlme_entry;
> +} __packed;
> +
> +/*
> + * TPM Log Information
> + */
> +struct slr_entry_log_info {
> +   struct slr_entry_hdr hdr;
> +   u16 format;
> +   u16 reserved;
> +   u64 addr;
> +   u32 size;
> +} __packed;
> +
> +/*
> + * DRTM Measurement Policy
> + */
> +struct slr_entry_policy {
> +   struct slr_entry_hdr hdr;
> +   u16 revision;
> +   u16 nr_entries;
> +   /* policy_entries[] */

Please use a flex array here:

  struct slr_policy_entry policy_entries[];
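
As a userspace sketch of what that buys you (policy_alloc is a hypothetical
helper, calloc stands in for whatever allocator applies, and __packed is left
out for brevity). Note that struct slr_policy_entry has to be defined before
struct slr_entry_policy for this to compile, so the two would swap places in
the header:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's fixed-width types. */
typedef uint16_t u16;
typedef uint64_t u64;

#define TPM_EVENT_INFO_LENGTH	32

struct slr_entry_hdr {
	u16 tag;
	u16 size;
};

struct slr_policy_entry {
	u16 pcr;
	u16 entity_type;
	u16 flags;
	u16 reserved;
	u64 entity;
	u64 size;
	char evt_info[TPM_EVENT_INFO_LENGTH];
};

struct slr_entry_policy {
	struct slr_entry_hdr hdr;
	u16 revision;
	u16 nr_entries;
	struct slr_policy_entry policy_entries[];	/* flexible array member */
};

static_assert(sizeof(struct slr_entry_hdr) == 4, "header is two u16 fields");

/* Hypothetical helper: allocate a zeroed policy with room for nr entries. */
static struct slr_entry_policy *policy_alloc(u16 nr)
{
	size_t bytes = sizeof(struct slr_entry_policy) +
		       (size_t)nr * sizeof(struct slr_policy_entry);
	struct slr_entry_policy *p;

	if (bytes > UINT16_MAX)		/* hdr.size is only 16 bits wide */
		return NULL;

	p = calloc(1, bytes);
	if (!p)
		return NULL;

	p->hdr.size = (u16)bytes;
	p->nr_entries = nr;
	return p;
}
```

Entries are then reachable as p->policy_entries[i] without pointer arithmetic
past the end of the struct.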

> +} __packed;
> +
> +/*
> + * DRTM Measurement Entry
> + */
> +struct slr_policy_entry {
> +   u16 pcr;
> +   u16 entity_type;
> +   u16 flags;
> +   u16 reserved;
> +   u64 entity;
> +   u64 size;
> +   char evt_info[TPM_EVENT_INFO_LENGTH];
> +} __packed;
> +
> +/*
> + * Secure Launch defined MTRR saving structures
> + */
> +struct slr_txt_mtrr_pair {
> +   u64 mtrr_physbase;
> +   u64 mtrr_physmask;
> +} __packed;
> +
> +struct slr_txt_mtrr_state {
> +   u64 default_mem_type;
> +   u64 mtrr_vcnt;
> +   struct slr_txt_mtrr_pair mtrr_pair[TXT_VARIABLE_MTRRS_LENGTH];
> +} __packed;
> +
> +/*
> + * Intel TXT Info t

Re: [PATCH v8 03/15] x86: Secure Launch Kconfig

2024-02-14 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:31, Ross Philipson  wrote:
>
> Initial bits to bring in Secure Launch functionality. Add Kconfig
> options for compiling in/out the Secure Launch code.
>
> Signed-off-by: Ross Philipson 
> ---
>  arch/x86/Kconfig | 12 
>  1 file changed, 12 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5edec175b9bf..d96d75f6f1a9 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2071,6 +2071,18 @@ config EFI_RUNTIME_MAP
>
>   See also Documentation/ABI/testing/sysfs-firmware-efi-runtime-map.
>
> +config SECURE_LAUNCH
> +   bool "Secure Launch support"
> +   default n

'n' is already the default, so you can drop this line.

> +   depends on X86_64 && X86_X2APIC

This depends on CONFIG_TCG_TPM as well (I got build failures without it).

> +   help
> +  The Secure Launch feature allows a kernel to be loaded
> +  directly through an Intel TXT measured launch. Intel TXT
> +  establishes a Dynamic Root of Trust for Measurement (DRTM)
> +  where the CPU measures the kernel image. This feature then
> +  continues the measurement chain over kernel configuration
> +  information and init images.
> +
>  source "kernel/Kconfig.hz"
>
>  config ARCH_SUPPORTS_KEXEC
> --
> 2.39.3
>



Re: [PATCH v8 01/15] x86/boot: Place kernel_info at a fixed offset

2024-02-14 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:31, Ross Philipson  wrote:
>
> From: Arvind Sankar 
>
> There are use cases for storing the offset of a symbol in kernel_info.
> For example, the trenchboot series [0] needs to store the offset of the
> Measured Launch Environment header in kernel_info.
>

Why? Is this information consumed by the bootloader?

I'd like to get away from x86 specific hacks for boot code and boot
images, so I would like to explore if we can avoid kernel_info, or at
least expose it in a generic way. We might just add a 32-bit offset
somewhere in the first 64 bytes of the bootable image: this could
co-exist with EFI bootable images, and can be implemented on arm64,
RISC-V and LoongArch as well.
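
A rough sketch of what a loader-side lookup could look like under that idea.
Everything here is an assumption for illustration: the field position
BOOT_INFO_OFFSET_POS, the function name, and the convention that zero means
"not present" (borrowed from how mle_header_offset is described in this
series) are not an existing ABI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL position of the 32-bit field inside the first 64 bytes. */
#define BOOT_INFO_OFFSET_POS	0x30

static_assert(BOOT_INFO_OFFSET_POS + 4 <= 64, "field must fit in 64 bytes");

/* Read a little-endian 32-bit value regardless of host byte order. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Returns 0 and stores the offset in *out if the image advertises one,
 * or -1 if the image is too short, the field is zero (assumed to mean
 * "absent"), or the offset points outside the image.
 */
static int read_boot_info_offset(const uint8_t *image, size_t len,
				 uint32_t *out)
{
	uint32_t off;

	if (len < 64)
		return -1;

	off = get_le32(image + BOOT_INFO_OFFSET_POS);
	if (off == 0 || off >= len)
		return -1;

	*out = off;
	return 0;
}
```

Nothing in this lookup is x86 specific, which is the point of the suggestion.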

> Since commit (note: commit ID from tip/master)
>
> commit 527afc212231 ("x86/boot: Check that there are no run-time relocations")
>
> run-time relocations are not allowed in the compressed kernel, so simply
> using the symbol in kernel_info, as
>
> .long   symbol
>
> will cause a linker error because this is not position-independent.
>
> With kernel_info being a separate object file and in a different section
> from startup_32, there is no way to calculate the offset of a symbol
> from the start of the image in a position-independent way.
>
> To enable such use cases, put kernel_info into its own section which is
> placed at a predetermined offset (KERNEL_INFO_OFFSET) via the linker
> script. This will allow calculating the symbol offset in a
> position-independent way, by adding the offset from the start of
> kernel_info to KERNEL_INFO_OFFSET.
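
For readers following along, the arithmetic described above can be modelled in
userspace C as below. This is only a sketch: the pointers stand in for
link-time symbol addresses, the function mirrors the rva() macro from the
patch, and 0x500 is the 64-bit value of KERNEL_INFO_OFFSET from kernel_info.h:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit value from the patch; kernel_info is placed here by the linker script. */
#define KERNEL_INFO_OFFSET	0x500

static_assert(KERNEL_INFO_OFFSET % 16 == 0, "kernel_info is .balign 16");

/*
 * Offset of sym from the start of the image, derived only from its
 * distance to kernel_info. The difference of two symbols is a constant
 * the linker can resolve, so no run-time relocation is needed.
 */
static uint32_t rva(const uint8_t *sym, const uint8_t *kernel_info)
{
	return (uint32_t)((sym - kernel_info) + KERNEL_INFO_OFFSET);
}
```

The result is the same wherever the image is loaded, and it also works for
symbols placed before kernel_info, where the difference is negative.
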
>
> Ensure that kernel_info is aligned, and use the SYM_DATA.* macros
> instead of bare labels. This stores the size of the kernel_info
> structure in the ELF symbol table.
>
> Signed-off-by: Arvind Sankar 
> Cc: Ross Philipson 
> Signed-off-by: Ross Philipson 
> ---
>  arch/x86/boot/compressed/kernel_info.S | 19 +++
>  arch/x86/boot/compressed/kernel_info.h | 12 
>  arch/x86/boot/compressed/vmlinux.lds.S |  6 ++
>  3 files changed, 33 insertions(+), 4 deletions(-)
>  create mode 100644 arch/x86/boot/compressed/kernel_info.h
>
> diff --git a/arch/x86/boot/compressed/kernel_info.S b/arch/x86/boot/compressed/kernel_info.S
> index f818ee8fba38..c18f07181dd5 100644
> --- a/arch/x86/boot/compressed/kernel_info.S
> +++ b/arch/x86/boot/compressed/kernel_info.S
> @@ -1,12 +1,23 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>
> +#include 
>  #include 
> +#include "kernel_info.h"
>
> -   .section ".rodata.kernel_info", "a"
> +/*
> + * If a field needs to hold the offset of a symbol from the start
> + * of the image, use the macro below, eg
> + * .long   rva(symbol)
> + * This will avoid creating run-time relocations, which are not
> + * allowed in the compressed kernel.
> + */
> +
> +#define rva(X) (((X) - kernel_info) + KERNEL_INFO_OFFSET)
>
> -   .global kernel_info
> +   .section ".rodata.kernel_info", "a"
>
> -kernel_info:
> +   .balign 16
> +SYM_DATA_START(kernel_info)
> /* Header, Linux top (structure). */
> .ascii  "LToP"
> /* Size. */
> @@ -19,4 +30,4 @@ kernel_info:
>
>  kernel_info_var_len_data:
> /* Empty for time being... */
> -kernel_info_end:
> +SYM_DATA_END_LABEL(kernel_info, SYM_L_LOCAL, kernel_info_end)
> diff --git a/arch/x86/boot/compressed/kernel_info.h b/arch/x86/boot/compressed/kernel_info.h
> new file mode 100644
> index ..c127f84aec63
> --- /dev/null
> +++ b/arch/x86/boot/compressed/kernel_info.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef BOOT_COMPRESSED_KERNEL_INFO_H
> +#define BOOT_COMPRESSED_KERNEL_INFO_H
> +
> +#ifdef CONFIG_X86_64
> +#define KERNEL_INFO_OFFSET 0x500
> +#else /* 32-bit */
> +#define KERNEL_INFO_OFFSET 0x100
> +#endif
> +
> +#endif /* BOOT_COMPRESSED_KERNEL_INFO_H */
> diff --git a/arch/x86/boot/compressed/vmlinux.lds.S b/arch/x86/boot/compressed/vmlinux.lds.S
> index 083ec6d7722a..718c52f3f1e6 100644
> --- a/arch/x86/boot/compressed/vmlinux.lds.S
> +++ b/arch/x86/boot/compressed/vmlinux.lds.S
> @@ -7,6 +7,7 @@ OUTPUT_FORMAT(CONFIG_OUTPUT_FORMAT)
>
>  #include 
>  #include 
> +#include "kernel_info.h"
>
>  #ifdef CONFIG_X86_64
>  OUTPUT_ARCH(i386:x86-64)
> @@ -27,6 +28,11 @@ SECTIONS
> HEAD_TEXT
> _ehead = . ;
> }
> +   .rodata.kernel_info KERNEL_INFO_OFFSET : {
> +   *(.rodata.kernel_info)
> +   }
> +   ASSERT(ABSOLUTE(kernel_info) == KERNEL_INFO_OFFSET, "kernel_info at bad address!")
> +
> .rodata..compressed : {
> *(.rodata..compressed)
> }
> --
> 2.39.3
>



Re: [PATCH 0/2] Sign the Image which is zboot's payload

2023-09-25 Thread Ard Biesheuvel
On Mon, 25 Sept 2023 at 03:01, Pingfan Liu  wrote:
>
> On Fri, Sep 22, 2023 at 1:19 PM Jan Hendrik Farr  wrote:
> >
...
> > I missed some of the earlier discussion about this zboot kexec support.
> > So just let me know if I'm missing something here. You were exploring
> > these two options in getting this supported:
> >
> > 1. Making kexec_file_load do all the work.
> >
> > This option makes the signature verification easy. kexec_file_load
> > checks the signature on the pe file and then extracts it and does the
> > kexec.
> >
> > This is similar to how I'm approaching UKI support in [1].
> >
>
> Yes, that is my original try.
>
> > 2. Extract in userspace and pass decompressed kernel to kexec_file_load
> >
> > This option requires the decompressed kernel to have a valid signature on
> > it. That's why this patch adds the ability to add that signature to the
> > kernel contained inside the zboot image.
> >
>
> You got it.
>
> > This option would not make sense for UKI support as it would not
> > validate the signature with respect to the initrd and cmdline that it
> > contains. Am I correct in thinking that there is no similar issue with
> > zboot images? They don't contain any more information besides the kernel
> > that is intended to be securely signed, right? Do you have a reference
>
> If using my second method, it means to unpack the UKI image in user
> space, and pass the kernel image, initrd and cmdline through
> kexec_file_load interface. If the UKI can have signature on the initrd
> and cmdline, we extend the capability of that interface to check those
> verification.
>
> > for the zboot image layout somewhere?
> >
>
> Sorry that maybe there is no document. I understand them through the code.
> The zboot image, aka, vmlinuz.efi looks like:
> PE header, which is formed manually in arch/arm64/kernel/head.S
> EFI decompressor, which consists of
> drivers/firmware/efi/libstub/zboot.c and libstub
> Image.gz, which is formed by compressing Image as instructed in Makefile.zboot
>
>

Indeed, this is currently only documented in code. zboot is a PE
executable that decompresses the kernel and boots it, but it also
carries the base and size of the compressed payload in its header,
along with the compression type so non-EFI loaders can run it as well
(QEMU implements this for gzip on arm64)
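
As a sketch of how a non-EFI loader could locate the payload: the struct below
is my recollection of the header layout and should be treated as an
assumption, so check it against zboot-header.S before relying on any offset.
The _sketch suffix and the function name are made up, and the memcpy-based
decoding assumes a little-endian host:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMED layout of the first 64 bytes of a zboot image (little-endian). */
struct zboot_header_sketch {
	uint8_t  msdos_magic[2];	/* "MZ" */
	uint8_t  reserved0[2];
	uint8_t  zimg[4];		/* "zimg" image type */
	uint32_t payload_offset;	/* start of the compressed payload */
	uint32_t payload_size;		/* size of the compressed payload */
	uint8_t  reserved1[8];
	char     compression_type[32];	/* NUL-terminated, e.g. "gzip" */
	uint8_t  linux_pe_magic[4];
	uint32_t pe_header_offset;
};

static_assert(sizeof(struct zboot_header_sketch) == 64, "no padding expected");

/*
 * Returns 0 and fills *offset / *size if the image looks like zboot and
 * the payload lies entirely within the image, -1 otherwise.
 */
static int zboot_find_payload(const uint8_t *image, size_t len,
			      uint32_t *offset, uint32_t *size)
{
	struct zboot_header_sketch hdr;

	if (len < sizeof(hdr))
		return -1;

	/* Copy out to avoid unaligned access; assumes a little-endian host. */
	memcpy(&hdr, image, sizeof(hdr));

	if (memcmp(hdr.msdos_magic, "MZ", 2) || memcmp(hdr.zimg, "zimg", 4))
		return -1;

	if (hdr.payload_offset > len ||
	    hdr.payload_size > len - hdr.payload_offset)
		return -1;

	*offset = hdr.payload_offset;
	*size = hdr.payload_size;
	return 0;
}
```

The caller would then pick a decompressor based on compression_type and hand
it the bytes at image + offset.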

> > > I hesitate to post this series,
> >
> > I appreciate you sending it, it's helping the discussion along.
> >

Absolutely. RFCs are important because nobody knows how exactly the
code will look until someone takes the time to implement it. So your
work on this is much appreciated, even if we may decide to take
another approach down the road.

> > > [...] since Ard has recommended using an
> > > emulated UEFI boot service to resolve the UKI kexec load problem [1].
> > > since on aarch64, vmlinuz.efi has faced the similar issue at present.
> > > But anyway, I have a crude outline of it and am sending it out for
> > > discussion.
> >
> > The more I'm thinking about it, the more I like Ard's idea. There's now
> > already two different formats trying to be added to kexec that are
> > pretty different from each other, yet they both have the UEFI interface
> > in common. I think if the kernel supported kexec'ing EFI applications
> > that would be a more flexible and forward-looking approach. It's a
>
> Yes, I agree. That method is attractive, originally I had a try when
> Ard suggested it but there was no clear boundary on which boot service
> should be implemented for zboot, so I did not move on along that
> direction.
>
> Now, UKI poses another challenge to kexec_file_load, and seems to
> require more than zboot. And it appears that Ard's approach is a
> silver bullet for that issue.
>

Yes, it looks appealing but it will take some time to iterate on ideas
and converge on an implementation.

> > standard that both zboot and UKI as well as all future formats for UEFI
> > platforms will support anyways. So while it's more work right now to
> > implement, I think it'll likely pay off.
> >
> > It is significantly more work than the other options though. So I think
> > before work is started on it, it would be nice to get some type of
> > consensus on these things (not an exhaustive list, please feel free to
> > add to it):
> >
>
> I try to answer part of the questions.
>
> > 1. Is it the right approach? It adds a significant amount of userspace
> > API.
>
> My crude assumption: this new stub will replace the purgatory, and I
> am not sure whether kexec-tools source tree will accommodate it. It
> can be signed and checked during the kexec_file_load.
>
> > 2. What subset of the UEFI spec needs/should to be supported?
> > 3. Can we let runtime services still be handled by the firmware after
> > exiting boot services?
>
> I think the runtime services survive through the kexec process. It is
> derived from the real firmware, not related with this stub
>

Yes, this should be possible.

> > 4. How can we debug the stubs that are bei

Re: [PATCH v2 0/2] x86/kexec: UKI Support

2023-09-20 Thread Ard Biesheuvel
On Wed, 20 Sept 2023 at 08:40, Dave Young  wrote:
>
> On Wed, 20 Sept 2023 at 15:43, Dave Young  wrote:
> >
> > > > In the end the only benefit this series brings is to extend the
> > > > signature checking on the whole UKI except of just the kernel image.
> > > > Everything else can also be done in user space. Compared to the
> > > > problems described above this is a very small gain for me.
> > >
> > > Correct. That is the benefit of pulling the UKI apart in the
> > > kernel. However having to sign the kernel inside the UKI defeats
> > > the whole point.
> >
> >
> > Pingfan added the zboot load support in kexec-tools, I know that he is
> > trying to sign the zboot image and the inside kernel twice. So
> > probably there are some common areas which can be discussed.
> > Added Ard and Pingfan in cc.
> > http://lists.infradead.org/pipermail/kexec/2023-August/027674.html
> >
>
> Here is another thread of the initial try in kernel with a few more
> options eg. some fake efi service helpers.
> https://lore.kernel.org/linux-arm-kernel/zbvksis+dfnqa...@piliu.users.ipa.redhat.com/T/#m42abb0ad3c10126b8b3bfae8a596deb707d6f76e
>

Currently, UKI's external interface is defined in terms of EFI
services, i.e., it is an executable PE/COFF binary that encapsulates
all the logic that performs the unpacking of the individual sections,
and loads the kernel as a PE/COFF binary as well (i.e., via
LoadImage/StartImage).

As soon as we add support to Linux to unpack a UKI and boot the
encapsulated kernel using a boot protocol other than EFI, we are
painting ourselves into a corner, severely limiting the freedom of the
UKI effort to make changes to the interfaces that were implementation
details up to this point.

It also means that UKI handling in kexec will need to be taught about
every individual architecture again, which is something we are trying
to avoid with EFI support in general. Breaking the abstraction like
this lets the cat out of the bag, and will add yet another variation
of kexec that we will need to support and maintain forever.

So the only way to do this properly and portably is to implement the
minimal set of EFI boot services [0] that Linux actually needs to run
its EFI stub (which is mostly identical to the set that UKI relies on
afaict), and expose them to the kexec image as it is being loaded.
This is not as bad as it sounds - I have some Rust code that could be
used as an inspiration [1] and which could be reused and shared
between architectures.

This would also reduce/remove the need for a purgatory: loading an EFI
binary in this way would run it up to the point where it calls
ExitBootServices(), and the actual kexec would invoke the image as if
it was returning from ExitBootServices().

The only fundamental problem here is the need to allocate large chunks
of physical memory, which would need some kind of CMA support, I
imagine?

Maybe we should do a BoF at LPC to discuss this further?

[0] this is not as bad as it sounds: beyond a protocol database, a
heap allocator and a memory map, there is actually very little needed
to boot Linux via the EFI stub (although UKI needs
LoadImage/StartImage as well)

[1] https://github.com/ardbiesheuvel/efilite

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-08-06 Thread Ard Biesheuvel
On Sat, 5 Aug 2023 at 11:18, Borislav Petkov  wrote:
>
> On Thu, Aug 03, 2023 at 01:11:54PM +0200, Ard Biesheuvel wrote:
> > Sadly, not only 'old' grubs - GRUB mainline only recently added
> > support for booting Linux/x86 via the EFI stub (because I wrote the
> > code for them),
>
> haha.
>
> > but it will still fall back to the previous mode for kernels that are
> > built without EFI stub support, or which are older than ~v5.8 (because
> > their EFI stub does not implement the generic EFI initrd loading
> > mechanism)
>
> The thing is, those SNP kernels pretty much use the EFI boot mechanism.
> I mean, don't take my word for it as I run SNP guests only from time to
> time but that's what everyone uses AFAIK.
>
> > Yeah. what seems to be saving our ass here is that startup_32 maps the
> > first 1G of physical address space 4 times, and x86_64 EFI usually
> > puts firmware tables below 4G. This means the cc blob check doesn't
> > fault, but it may dereference bogus memory traversing the config table
> > array looking for the cc blob GUID. However, the system table field
> > holding the size of the array may also appear as bogus so this may
> > still break in weird ways.
>
> Oh fun.
>

This is not actually true, I misread the code.

The initial mapping is 1:1 for the lower 4G of system memory, so
anything that lives there is accessible before the demand paging stuff
is up and running.

IOW, your change should be sufficient to fix this even when entering
via the 32-bit entry point.
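
To make the corrected claim concrete: if the initial page tables map the
low 4G 1:1, an early access is safe only when the whole object sits below
that boundary. A minimal sketch of such a check follows; the helper
early_range_is_mapped() and the EARLY_IDENT_MAP_END constant are
hypothetical and do not exist in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed end of the region that is identity mapped before demand paging. */
#define EARLY_IDENT_MAP_END (1ULL << 32)

/* True if [phys, phys + size) lies entirely below the 4G boundary. */
static bool early_range_is_mapped(uint64_t phys, uint64_t size)
{
	/* Comparing against the remaining room avoids overflow in phys + size. */
	return phys < EARLY_IDENT_MAP_END &&
	       size <= EARLY_IDENT_MAP_END - phys;
}
```

A firmware table that straddles or sits above 4G would fail this check and
would still need the demand paging code, or an explicit mapping.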



Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-08-03 Thread Ard Biesheuvel
On Thu, 3 Aug 2023 at 13:11, Ard Biesheuvel  wrote:
>
> On Wed, 2 Aug 2023 at 17:52, Borislav Petkov  wrote:
> >
> > On Wed, Aug 02, 2023 at 04:55:27PM +0200, Ard Biesheuvel wrote:
> > > ... because now, entering via startup_32 is broken, given that it only
> > > maps the kernel image itself and relies on the #PF handling for
> > > everything else it accesses, including firmware tables.
> > >
> > > AFAICT this also means that entering via startup_32 is broken entirely
> > > for any configuration that enables the cc blob config table check,
> > > regardless of the platform.
> >
> > Lemme brain-dump what Tom and I just talked on IRC.
> >
> > That startup_32 entry path for SNP guests was used with old grubs which
> > used to enter through there and not anymore, reportedly. Which means,
> > that must've worked at some point but Joerg would know. CCed.
> >
>
> Sadly, not only 'old' grubs - GRUB mainline only recently added
> support for booting Linux/x86 via the EFI stub (because I wrote the
> code for them), but it will still fall back to the previous mode for
> kernels that are built without EFI stub support, or which are older
> than ~v5.8 (because their EFI stub does not implement the generic EFI
> initrd loading mechanism)
>
> This fallback still appears to enter via startup_32, even when GRUB
> itself runs in long mode in the context of EFI.
>
> > Newer grubs enter through the 64-bit entry point and thus are fine
> > - otherwise we would be seeing explosions left and right.
> >
>
> Yeah. what seems to be saving our ass here is that startup_32 maps the
> first 1G of physical address space 4 times, and x86_64 EFI usually
> puts firmware tables below 4G. This means the cc blob check doesn't
> fault, but it may dereference bogus memory traversing the config table
> array looking for the cc blob GUID. However, the system table field
> holding the size of the array may also appear as bogus so this may
> still break in weird ways.
>
> > So dependent on what we wanna do, if we kill the 32-bit path, we can
> > kill the 32-bit C-bit verif code. But that's for later and an item on my
> > TODO list.
> >
>
> I don't think we can kill it yet, but it would be nice if we could
> avoid the need to support SNP boot when entering that way.

https://lists.gnu.org/archive/html/grub-devel/2023-08/msg00005.html

Coming to your distro any decade now!



Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-08-03 Thread Ard Biesheuvel
On Wed, 2 Aug 2023 at 17:52, Borislav Petkov  wrote:
>
> On Wed, Aug 02, 2023 at 04:55:27PM +0200, Ard Biesheuvel wrote:
> > ... because now, entering via startup_32 is broken, given that it only
> > maps the kernel image itself and relies on the #PF handling for
> > everything else it accesses, including firmware tables.
> >
> > AFAICT this also means that entering via startup_32 is broken entirely
> > for any configuration that enables the cc blob config table check,
> > regardless of the platform.
>
> Lemme brain-dump what Tom and I just talked on IRC.
>
> That startup_32 entry path for SNP guests was used with old grubs which
> used to enter through there and not anymore, reportedly. Which means,
> that must've worked at some point but Joerg would know. CCed.
>

Sadly, not only 'old' grubs - GRUB mainline only recently added
support for booting Linux/x86 via the EFI stub (because I wrote the
code for them), but it will still fall back to the previous mode for
kernels that are built without EFI stub support, or which are older
than ~v5.8 (because their EFI stub does not implement the generic EFI
initrd loading mechanism)

This fallback still appears to enter via startup_32, even when GRUB
itself runs in long mode in the context of EFI.

> Newer grubs enter through the 64-bit entry point and thus are fine
> - otherwise we would be seeing explosions left and right.
>

Yeah. what seems to be saving our ass here is that startup_32 maps the
first 1G of physical address space 4 times, and x86_64 EFI usually
puts firmware tables below 4G. This means the cc blob check doesn't
fault, but it may dereference bogus memory traversing the config table
array looking for the cc blob GUID. However, the system table field
holding the size of the array may also appear as bogus so this may
still break in weird ways.

> So dependent on what we wanna do, if we kill the 32-bit path, we can
> kill the 32-bit C-bit verif code. But that's for later and an item on my
> TODO list.
>

I don't think we can kill it yet, but it would be nice if we could
avoid the need to support SNP boot when entering that way.



Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-08-02 Thread Ard Biesheuvel
On Wed, 2 Aug 2023 at 15:59, Borislav Petkov  wrote:
>
> On Wed, Aug 02, 2023 at 08:40:36AM -0500, Tom Lendacky wrote:
> > Short of figuring out how to map page accesses earlier through the
> > boot_page_fault IDT routine
>
> And you want to do that because?
>

... because now, entering via startup_32 is broken, given that it only
maps the kernel image itself and relies on the #PF handling for
everything else it accesses, including firmware tables.

AFAICT this also means that entering via startup_32 is broken entirely
for any configuration that enables the cc blob config table check,
regardless of the platform.



Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-07-17 Thread Ard Biesheuvel
On Mon, 17 Jul 2023 at 15:53, Tao Liu  wrote:
>
> Hi Borislav,
>
> On Thu, Jul 13, 2023 at 6:05 PM Borislav Petkov  wrote:
> >
> > On Thu, Jun 01, 2023 at 03:20:44PM +0800, Tao Liu wrote:
> > >  arch/x86/kernel/machine_kexec_64.c | 35 ++
> > >  1 file changed, 31 insertions(+), 4 deletions(-)
> >
> > Ok, pls try this totally untested thing.
> >
> > Thx.
> >
> > ---
> > diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
> > index 09dc8c187b3c..fefe27b2af85 100644
> > --- a/arch/x86/boot/compressed/sev.c
> > +++ b/arch/x86/boot/compressed/sev.c
> > @@ -404,13 +404,20 @@ void sev_enable(struct boot_params *bp)
> > if (bp)
> > bp->cc_blob_address = 0;
> >
> > +   /* Check for the SME/SEV support leaf */
> > > +   eax = 0x80000000;
> > > +   ecx = 0;
> > > +   native_cpuid(&eax, &ebx, &ecx, &edx);
> > > +   if (eax < 0x8000001f)
> > +   return;
> > +
> > /*
> >  * Setup/preliminary detection of SNP. This will be sanity-checked
> >  * against CPUID/MSR values later.
> >  */
> > snp = snp_init(bp);
> >
> > -   /* Check for the SME/SEV support leaf */
> > +   /* Recheck the SME/SEV support leaf */
> > > eax = 0x80000000;
> > ecx = 0;
> > native_cpuid(&eax, &ebx, &ecx, &edx);
> >
> Thanks a lot for the patch above! Sorry for the late response. I have
> compiled and tested it locally against 6.5.0-rc1, though it can pass
> the early stage of kexec kernel bootup,

OK, so that proves that the cc_blob table access is the culprit here.
That still means that kexec on SEV is likely to explode in the exact
same way should anyone attempt that.
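
The ordering that the test patch establishes can be sketched in isolation:
query the highest extended CPUID leaf first, and only go looking for SNP
data if the SME/SEV leaf exists at all. The function below is a
hypothetical userspace model that takes the EAX value returned by CPUID
leaf 0x80000000 as an argument instead of executing CPUID itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf that reports the highest supported extended leaf in EAX. */
#define CPUID_EXT_MAX_LEAF	0x80000000u
/* AMD leaf describing SME/SEV capabilities. */
#define CPUID_AMD_SEV_LEAF	0x8000001fu

/*
 * Decide whether it is worth probing for SEV/SNP structures such as the
 * cc blob. 'max_ext_leaf' is what CPUID(CPUID_EXT_MAX_LEAF) returned in EAX.
 */
static bool sev_probe_allowed(uint32_t max_ext_leaf)
{
	return max_ext_leaf >= CPUID_AMD_SEV_LEAF;
}
```

On a CPU whose highest extended leaf is below 0x8000001f the probe is
skipped, so the config table is never touched on that path.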


> however the kernel will panic
> occasionally later. The test machine is the one with Intel Atom
> x6425RE cpu which encountered the page fault issue of missing efi
> config table.
>

Agree with Boris that this seems entirely unrelated.

> ...snip...
> [   21.360763]  nvme0n1: p1 p2 p3
> [   21.364207] igc 0000:03:00.0: PTM enabled, 4ns granularity
> [   21.421097] pps pps1: new PPS source ptp1
> [   21.425396] igc 0000:03:00.0 (unnamed net_device) (uninitialized): PHC added
> [   21.457005] igc 0000:03:00.0: 4.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x1 link)
> [   21.465210] igc 0000:03:00.0 eth1: MAC: ...snip...
> [   21.473424] igc 0000:03:00.0 enp3s0: renamed from eth1
> [   21.479446] BUG: kernel NULL pointer dereference, address: 0000000000000008
> [   21.486405] #PF: supervisor read access in kernel mode
> [   21.491519] mmc1: Failed to initialize a non-removable card
> [   21.491538] #PF: error_code(0x0000) - not-present page
> [   21.502229] PGD 0 P4D 0
> [   21.504773] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [   21.509133] CPU: 3 PID: 402 Comm: systemd-udevd Not tainted 6.5.0-rc1+ #1
> [   21.515905] Hardware name: ...snip...


Why are you snipping the hardware name?



Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-07-13 Thread Ard Biesheuvel
On Fri, 7 Jul 2023 at 19:12, Borislav Petkov  wrote:
>
> On Fri, Jul 07, 2023 at 10:25:15AM -0500, Michael Roth wrote:
> > ...
> > It would be unfortunate if we finally abandoned this path because of the
> > issue being hit here though. I think the patch posted here is the proper
> > resolution to the issue being hit, and I'm hoping at this point we've
> > identified all the similar cases where EFI/setup_data-related structures
> > were missing explicit mappings. But if we still think it's too much of a
> > liability to access the EFI config table outside of SEV-enabled guests,
> > then I can work on re-implementing things based on the above logic.
>
> Replying here to Tom's note too...
>
> So, I like the idea of rechecking CPUID. Yes, let's do the sev_status
> check. As a result, we either fail the guest - no problem - or we boot
> and we recheck. Thus, we don't run AMD code on !AMD machines, if the HV
> is not a lying bastard.
>
> Now, if we've gotten a valid setup_data SETUP_EFI entry with a valid
> pointer to an EFI config table, then that should happen in the generic
> path - initialize_identity_maps(), for example - like you've done in
> b57feed2cc26 - not in the kexec code because kexec *happens* to need it.
>
> We want to access the EFI config table? Sure, by all means, but make
> that generic for all code.
>

OK, so in summary, what seems to be happening here is that the SEV
init code in the decompressor looks for the cc blob table before the
on-demand mapping code is up, which normally ensures that any RAM
address is accessible even if it hasn't been mapped explicitly.

This is why the fix happens to work: the code only maps the array of
(guid, phys_addr) tuples that describes the list of configuration
tables that have been provided by the firmware. The actual
configuration tables themselves could be anywhere in physical memory,
and without prior knowledge of a particular GUID value, there is no
way to know the size of the table, and so they cannot be mapped
upfront like this. However, the cc blob table does not exist on this
machine, and so whether the EFI config tables themselves are mapped or
not is irrelevant.

But it does mean the fix is incomplete, and certainly does not belong
in generic kexec code. If anything, we should be fixing the
decompressor code to defer the cc blob table check until after the
demand mapping code is up.

If this is problematic, we might instead disable SEV for kexec, and
rely on the fact that SEV firmware enters with a complete 1:1 map (as
we seem to be doing currently). If kexec for SEV is needed at some
point, we can re-enable it by having it provide a mapping for the
config table array and the cc blob table explicitly.
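
The distinction between the array of (guid, phys_addr) tuples and the
tables it points to can be shown with a small sketch. The types and the
function name below are hypothetical stand-ins, not the kernel's
definitions, and the lookup is a plain linear scan.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte identifier, sized like an EFI GUID. */
struct cfg_guid {
	uint8_t b[16];
};

/* One (guid, phys_addr) tuple, modelled on a 64-bit config table entry. */
struct cfg_table_entry {
	struct cfg_guid guid;
	uint64_t table;		/* physical address of the table itself */
};

/*
 * Scan the array for 'want' and return the recorded physical address,
 * or 0 if no entry matches. Only the array is read here; the table that
 * the address points to is never dereferenced.
 */
static uint64_t find_config_table(const struct cfg_table_entry *tables,
				  size_t nr, const struct cfg_guid *want)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!memcmp(&tables[i].guid, want, sizeof(*want)))
			return tables[i].table;
	return 0;
}
```

When the GUID is absent, as with the cc blob on the machine in question,
the scan returns without ever needing the tables themselves to be mapped.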



Re: [PATCH v6 06/14] x86: Add early SHA support for Secure Launch early measurements

2023-05-12 Thread Ard Biesheuvel
On Fri, 12 May 2023 at 13:28, Matthew Garrett  wrote:
>
> On Fri, May 12, 2023 at 01:18:45PM +0200, Ard Biesheuvel wrote:
> > On Fri, 12 May 2023 at 13:04, Matthew Garrett  wrote:
> > >
> > > On Tue, May 09, 2023 at 06:21:44PM -0700, Eric Biggers wrote:
> > >
> > > > > SHA-1 is insecure.  Why are you still using SHA-1?  Don't TPMs support SHA-2
> > > > > now?
> > >
> > > TXT is supported on some TPM 1.2 systems as well. TPM 2 systems are also
> > > at the whim of the firmware in terms of whether the SHA-2 banks are
> > > enabled. But even if the SHA-2 banks are enabled, if you suddenly stop
> > > extending the SHA-1 banks, a malicious actor can later turn up and
> > > extend whatever they want into them and present a SHA-1-only
> > > attestation. Ideally whatever is handling that attestation should know
> > > whether or not to expect an attestation with SHA-2, but the easiest way
> > > to maintain security is to always extend all banks.
> > >
> >
> > Wouldn't it make more sense to measure some terminating event into the
> > SHA-1 banks instead?
>
> Unless we assert that SHA-1 events are unsupported, it seems a bit odd
> to force a policy on people who have both banks enabled. People with
> mixed fleets are potentially going to be dealing with SHA-1 measurements
> for a while yet, and while there's obviously a security benefit in using
> SHA-2 instead it'd be irritating to have to maintain two attestation
> policies.

I understand why that matters from an operational perspective.

However, we are dealing with brand new code being proposed for Linux
mainline, and so this is our only chance to push back on this, as
otherwise, we will have to maintain it for a very long time.

IOW, D-RTM does not exist today in Linux, and it is up to us to define
what it will look like. From that perspective, it is downright
preposterous to even consider supporting SHA-1, given that SHA-1 by
itself gives none of the guarantees that D-RTM aims to provide. If
reducing your TCB is important enough to warrant switching to this
implementation of D-RTM, surely you can upgrade your attestation
policies as well.



Re: [PATCH v6 06/14] x86: Add early SHA support for Secure Launch early measurements

2023-05-12 Thread Ard Biesheuvel
On Fri, 12 May 2023 at 13:04, Matthew Garrett  wrote:
>
> On Tue, May 09, 2023 at 06:21:44PM -0700, Eric Biggers wrote:
>
> > SHA-1 is insecure.  Why are you still using SHA-1?  Don't TPMs support SHA-2
> > now?
>
> TXT is supported on some TPM 1.2 systems as well. TPM 2 systems are also
> at the whim of the firmware in terms of whether the SHA-2 banks are
> enabled. But even if the SHA-2 banks are enabled, if you suddenly stop
> extending the SHA-1 banks, a malicious actor can later turn up and
> extend whatever they want into them and present a SHA-1-only
> attestation. Ideally whatever is handling that attestation should know
> whether or not to expect an attestation with SHA-2, but the easiest way
> to maintain security is to always extend all banks.
>

Wouldn't it make more sense to measure some terminating event into the
SHA-1 banks instead?
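
The idea of a terminating event rests on how a PCR bank is extended: the
new value is a hash of the old value concatenated with the event, so a
bank that has already been extended cannot be steered to the value an
untouched bank would reach. The toy below only models that chaining. It
uses 64-bit FNV-1a as a stand-in digest, which is NOT cryptographic and
is an assumption made purely to keep the sketch short; the function names
are hypothetical as well.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in digest: 64-bit FNV-1a. Not cryptographic, illustration only. */
static uint64_t toy_hash(uint64_t h, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;	/* FNV-1a 64-bit prime */
	}
	return h;
}

/* Extend: new_pcr = H(old_pcr || event), mirroring the TPM operation. */
static uint64_t toy_pcr_extend(uint64_t pcr, const void *event, size_t len)
{
	/* Start from the FNV-1a offset basis and absorb the old PCR value. */
	uint64_t h = toy_hash(0xcbf29ce484222325ULL, &pcr, sizeof(pcr));

	return toy_hash(h, event, len);
}
```

Measuring one well-known event into an otherwise unused bank moves it away
from its reset value, so whatever is extended later produces a different
result than the same event would on a pristine bank.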



Re: [PATCH 0/4] Support kexec'ing PEs containing compressed kernels

2023-05-04 Thread Ard Biesheuvel
On Thu, 4 May 2023 at 18:41, Jeremy Linton  wrote:
>
> The linux ZBOOT option creates PEs that contain compressed kernel images
> which are self decompressed on execution by UEFI.
>
> This set adds support for this image format to kexec by decompressing the
> contained kernel image to a temp file, then handing the resulting image
> off to the existing "Image" load routine to pass to the kexec syscall.
>
> There is also an additional patch which cleans up some errors noticed
> in the existing zImage support as well.
>
> Jeremy Linton (4):
>   arm64: Cleanup _probe() return values
>   arm64: Add ZBOOT PE containing compressed image support
>   arm64: Hook up the ZBOOT support as vmlinuz
>   arm64: Fix some issues with zImage _probe()
>

Thanks a lot for taking care of this!

This all looks good to me. The only comment I have is that EFI zboot
itself is generic, even though arm64 is the only arch that distros are
building it for at the moment. So it is not unlikely that some of this
code will end up needing to be shared.

Acked-by: Ard Biesheuvel 


>  kexec/arch/arm64/Makefile  |   3 +-
>  kexec/arch/arm64/image-header.h|  11 ++
>  kexec/arch/arm64/kexec-arm64.c |   7 +
>  kexec/arch/arm64/kexec-arm64.h |   3 +
>  kexec/arch/arm64/kexec-elf-arm64.c |   1 +
>  kexec/arch/arm64/kexec-vmlinuz-arm64.c | 172 +
>  kexec/arch/arm64/kexec-zImage-arm64.c  |  13 +-
>  kexec/kexec.c  |  11 +-
>  8 files changed, 201 insertions(+), 20 deletions(-)
>  create mode 100644 kexec/arch/arm64/kexec-vmlinuz-arm64.c
>
> --
> 2.40.0
>



Re: [PATCH 0/6] arm64: make kexec_file able to load zboot image

2023-03-06 Thread Ard Biesheuvel
(cc Mark)

Hello Pingfan,

Thanks for working on this.

On Mon, 6 Mar 2023 at 04:03, Pingfan Liu  wrote:
>
> After introducing zboot image, kexec_file can not load and jump to the
> new style image. Hence it demands a method to load the new kernel.
>
> The crux of the problem lies in when and how to decompress the Image.gz.
> There are three possible courses to take: -1. in user space, but hard to
> achieve due to the signature verification inside the kernel.

That depends. The EFI zboot image encapsulates another PE/COFF image,
which could be signed as well.

So there are at least three other options here:
- sign the encapsulated image with the same key as the zboot image
- sign the encapsulated image with a key that is only valid for kexec boot
- sign the encapsulated image with an ephemeral key that is only valid
for kexec'ing an image that was produced by the same kernel build

>  -2. at the
> boot time, let the efi_zboot_entry() handles it, which means a simulated
> EFI service should be provided to that entry, especially about how to be
> aware of the memory layout.

This is actually an idea I intend to explore: with the EFI runtime
services regions mapped 1:1, it wouldn't be too hard to implement a
minimal environment that can run the zboot image under the previous
kernel up to the point where it calls ExitBootServices(), after which
kexec() would take over.

>  -3. in kernel space, during the file load
> of the zboot image. At that point, the kernel masters the whole memory
> information, and easily allocates a suitable memory for the decompressed
> kernel image. (I think this is similar to what grub does today).
>

GRUB just calls LoadImage(), and the decompression code runs in the EFI context.

> The core of this series is [5/6].  [3,6/6] handles the config option.
> The assumption of [3/6] is kexec_file_load is independent of zboot,
> especially it can load kernel images compressed with different
> compression method.  [6/6] is if EFI_ZBOOT, the corresponding
> decompression method should be included.
>
>
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Andrew Morton 
> Cc: Ard Biesheuvel 
> Cc: kexec@lists.infradead.org
> To: linux-arm-ker...@lists.infradead.org
> To: linux-ker...@vger.kernel.org
>
> Pingfan Liu (6):
>   arm64: kexec: Rename kexec_image.c to kexec_raw_image.c
>   lib/decompress: Introduce decompress_method_by_name()
>   arm64: Kconfig: Pick decompressing method for kexec file load
>   lib/decompress: Keep decompress routines based on selection
>   arm64: kexec: Introduce zboot image loader
>   init/Kconfig: Select decompressing method if compressing kernel
>
>  arch/arm64/Kconfig|  59 ++
>  arch/arm64/include/asm/kexec.h|   4 +-
>  arch/arm64/kernel/Makefile|   2 +-
>  .../{kexec_image.c => kexec_raw_image.c}  |   2 +-
>  arch/arm64/kernel/kexec_zboot_image.c | 186 ++
>  arch/arm64/kernel/machine_kexec.c |   1 +
>  arch/arm64/kernel/machine_kexec_file.c|   3 +-
>  include/linux/decompress/generic.h|   2 +
>  include/linux/decompress/mm.h |   9 +-
>  include/linux/zboot.h |  26 +++
>  init/Kconfig  |   7 +
>  lib/Kconfig   |   3 +
>  lib/decompress.c  |  17 +-
>  13 files changed, 314 insertions(+), 7 deletions(-)
>  rename arch/arm64/kernel/{kexec_image.c => kexec_raw_image.c} (98%)
>  create mode 100644 arch/arm64/kernel/kexec_zboot_image.c
>  create mode 100644 include/linux/zboot.h
>
> --
> 2.31.1
>



Re: Bug: kexec on Lenovo ThinkPad T480 disables EFI mode

2022-11-06 Thread Ard Biesheuvel
On Mon, 7 Nov 2022 at 08:40, Dave Young  wrote:
>
> On Mon, 7 Nov 2022 at 15:36, Dave Young  wrote:
> >
> > Hi Ard,
> >
> > On Mon, 7 Nov 2022 at 15:30, Ard Biesheuvel  wrote:
> > >
> > > On Mon, 7 Nov 2022 at 07:55, Dave Young  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Sat, 5 Nov 2022 at 22:16,  wrote:
> > > > >
> > > > > On 2022-11-05 05:49, Dave Young wrote:
> > > > > > Baoquan, thanks for cc me.
> > > > > >
> > > > > > On Sat, 5 Nov 2022 at 11:10, Baoquan He  wrote:
> > > > > >>
> > > > > >> Add Dave to CC
> > > > > >>
> > > > > >> On 10/28/22 at 01:02pm, n...@tfwno.gf wrote:
> > > > > >> > Greetings,
> > > > > >> >
> > > > > > >> > I've been hitting a bug on my Lenovo ThinkPad T480 where kexecing will
> > > > > > >> > cause EFI mode (if that's the right term for it) to be unconditionally
> > > > > > >> > disabled, even when not using the --noefi option to kexec.
> > > > > > >> >
> > > > > > >> > What I mean by "EFI mode" being disabled, more than just EFI runtime
> > > > > > >> > services, is that basically nothing about the system's EFI is visible
> > > > > > >> > post-kexec. Normally you have a message like this in dmesg when the
> > > > > > >> > system is booted in EFI mode:
> > > > > > >> >
> > > > > > >> > [    0.000000] efi: EFI v2.70 by EDK II
> > > > > > >> > [    0.000000] efi: SMBIOS=0x7f98a000 ACPI=0x7fb7e000 ACPI 2.0=0x7fb7e014 MEMATTR=0x7ec63018
> > > > > > >> > (obviously not the real firmware of the machine I'm talking about, but I
> > > > > > >> > can also send that if it would be of any help)
> > > > > > >> >
> > > > > > >> > No such message pops up in my dmesg as a result of this bug, & this
> > > > > > >> > causes some fallout like being unable to find the system's DMI
> > > > > > >> > information:
> > > > > > >> >
> > > > > > >> > <6>[    0.000000] DMI not present or invalid.
> > > > > > >> >
> > > > > > >> > The efivarfs module also fails to load with -ENODEV.
> > > > > > >> >
> > > > > > >> > I've tried also booting with efi=runtime explicitly but it doesn't
> > > > > > >> > change anything. The kernel still does not print the name of the EFI
> > > > > > >> > firmware, DMI is still missing, & efivarfs still fails to load.
> > > > > > >> >
> > > > > > >> > I've been using the kexec_load syscall for all these tests, if it's
> > > > > > >> > important.
> > > > > > >> >
> > > > > > >> > Also, to make it very clear, all this only ever happens post-kexec. When
> > > > > > >> > booting straight from UEFI (with the EFI stub), all the aforementioned
> > > > > > >> > stuff that fails works perfectly fine (i.e. name of firmware is printed,
> > > > > > >> > DMI is properly found, & efivarfs loads & mounts just fine).
> > > > > > >> >
> > > > > > >> > This is reproducible with a vanilla 6.1-rc2 kernel. I've been trying to
> > > > > > >> > bisect it, but it seems like it goes pretty far back. I've got vanilla
> > > > > > >> > mainline kernel builds dating back to 5.17 that have the exact same
> > > > > > >> > issue. It might be worth noting that during this

Re: Bug: kexec on Lenovo ThinkPad T480 disables EFI mode

2022-11-06 Thread Ard Biesheuvel
On Mon, 7 Nov 2022 at 07:55, Dave Young  wrote:
>
> Hi,
>
> On Sat, 5 Nov 2022 at 22:16,  wrote:
> >
> > On 2022-11-05 05:49, Dave Young wrote:
> > > Baoquan, thanks for cc me.
> > >
> > > On Sat, 5 Nov 2022 at 11:10, Baoquan He  wrote:
> > >>
> > >> Add Dave to CC
> > >>
> > >> On 10/28/22 at 01:02pm, n...@tfwno.gf wrote:
> > >> > Greetings,
> > >> >
> > >> > I've been hitting a bug on my Lenovo ThinkPad T480 where kexecing will
> > >> > cause EFI mode (if that's the right term for it) to be unconditionally
> > >> > disabled, even when not using the --noefi option to kexec.
> > >> >
> > >> > What I mean by "EFI mode" being disabled, more than just EFI runtime
> > >> > services, is that basically nothing about the system's EFI is visible
> > >> > post-kexec. Normally you have a message like this in dmesg when the
> > >> > system is booted in EFI mode:
> > >> >
> > >> > [    0.000000] efi: EFI v2.70 by EDK II
> > >> > [    0.000000] efi: SMBIOS=0x7f98a000 ACPI=0x7fb7e000 ACPI 2.0=0x7fb7e014 MEMATTR=0x7ec63018
> > >> > (obviously not the real firmware of the machine I'm talking about, but I
> > >> > can also send that if it would be of any help)
> > >> >
> > >> > No such message pops up in my dmesg as a result of this bug, & this
> > >> > causes some fallout like being unable to find the system's DMI
> > >> > information:
> > >> >
> > >> > <6>[    0.000000] DMI not present or invalid.
> > >> >
> > >> > The efivarfs module also fails to load with -ENODEV.
> > >> >
> > >> > I've tried also booting with efi=runtime explicitly but it doesn't
> > >> > change anything. The kernel still does not print the name of the EFI
> > >> > firmware, DMI is still missing, & efivarfs still fails to load.
> > >> >
> > >> > I've been using the kexec_load syscall for all these tests, if it's
> > >> > important.
> > >> >
> > >> > Also, to make it very clear, all this only ever happens post-kexec. When
> > >> > booting straight from UEFI (with the EFI stub), all the aforementioned
> > >> > stuff that fails works perfectly fine (i.e. name of firmware is printed,
> > >> > DMI is properly found, & efivarfs loads & mounts just fine).
> > >> >
> > >> > This is reproducible with a vanilla 6.1-rc2 kernel. I've been trying to
> > >> > bisect it, but it seems like it goes pretty far back. I've got vanilla
> > >> > mainline kernel builds dating back to 5.17 that have the exact same
> > >> > issue. It might be worth noting that during this testing, I made sure
> > >> > the version of the kernel being kexeced & the kernel kexecing were the
> > >> > same version. It may not have been a problem in older kernels, but that
> > >> > would be difficult to test for me (a pretty important driver for this
> > >> > machine was only merged during v5.17-rc4). So it may not have been a
> > >> > regression & just a hidden problem since time immemorial.
> > >> >
> > >> > I am willing to test any patches I may get to further debug or fix
> > >> > this issue, preferably based on the current state of torvalds/linux.git.
> > >> > I can build & test kernels quite a few times per day.
> > >> >
> > >> > I can also send any important materials (kernel config, dmesg, firmware
> > >> > information, so on & so forth) on request. I'll also just mention I'm
> > >> > using kexec-tools 2.0.24 upfront, if it matters.
> > >
> > > Can you check the efi runtime in sysfs:
> > > ls /sys/firmware/efi/runtime-map/
> > >
> > > If nothing then maybe you did not enable CONFIG_EFI_RUNTIME_MAP=y, it
> > > is needed for kexec UEFI boot on x86_64.
> >
> > Oh my, it really is that simple.
> >
> > Indeed, enabling this in the pre-kexec kernel fixes it all up. I had
> > blindly disabled it in my quest to downsize the pre-kexec kernel to
> > reduce boot time (it only runs a bootloader). In hindsight, the firmware
> > drivers section is not really a good section to tweak on a whim.
> >
> > I'm terribly sorry to have taken your time to "fix" this "bug". But I
> > must ask, is there any reason why this is a visible config option, or at
> > least not gated behind CONFIG_EXPERT? drivers/firmware/efi/runtime-map.c
> > is pretty tiny, & considering it depends on CONFIG_KEXEC_CORE, one
> > probably wants to have kexec work properly if they can even enable it.
>
> Glad to know it works with the .config tweaking. I can not recall any
> reason for that though.
>
> Since it sits in the efi code path, let's see how Ard thinks about
> your proposal.
>

I don't understand why EFI_RUNTIME_MAP should depend on KEXEC_CORE at
all: it is documented as a feature that can be enabled for debugging
as well, and kexec does not work as expected without it.

Should we just change it like this perhaps?

--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -28,8 +28,8 @@ config EFI_VARS_PSTORE_DEFAULT_DISABLE

 config EFI_RUNTIME_MAP
bool "Export efi runtime maps to sysfs"
-   depends on X86 && EFI && KEXEC_CORE
-   default y
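
The sysfs check suggested earlier in this thread (listing
/sys/firmware/efi/runtime-map/) can also be done programmatically before
attempting a kexec. The helper below is a hypothetical userspace sketch,
not part of kexec-tools; it only counts directory entries and takes the
path as a parameter.

```c
#include <dirent.h>
#include <string.h>

/*
 * Count the entries below a runtime-map directory such as
 * /sys/firmware/efi/runtime-map. Returns -1 if the directory cannot be
 * opened, which is what a kernel without EFI_RUNTIME_MAP would produce.
 */
static int count_runtime_map_entries(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *ent;
	int count = 0;

	if (!dir)
		return -1;

	while ((ent = readdir(dir)) != NULL) {
		/* Skip the "." and ".." pseudo entries. */
		if (!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, ".."))
			continue;
		count++;
	}
	closedir(dir);
	return count;
}
```

A result of -1 or 0 for the runtime-map path is a hint that the kexec'ed
kernel will not be able to reuse the EFI runtime services mappings.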

Re: [PATCH 1/2] arm64, kdump: enforce to take 4G as the crashkernel low memory end

2022-09-06 Thread Ard Biesheuvel
On Mon, 5 Sept 2022 at 14:08, Baoquan He  wrote:
>
> On 09/05/22 at 01:28pm, Mike Rapoport wrote:
> > On Thu, Sep 01, 2022 at 08:25:54PM +0800, Baoquan He wrote:
> > > On 09/01/22 at 10:24am, Mike Rapoport wrote:
> > >
> > > > max_zone_phys() only handles cases when CONFIG_ZONE_DMA/DMA32 enabled,
> > > > the disabled CONFIG_ZONE_DMA/DMA32 case is not included. I can change
> > > > it like:
> > >
> > > > static phys_addr_t __init crash_addr_low_max(void)
> > > > {
> > > > 	phys_addr_t low_mem_mask = U32_MAX;
> > > > 	phys_addr_t phys_start = memblock_start_of_DRAM();
> > > >
> > > > 	if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
> > > > 	    (phys_start > U32_MAX))
> > > > 		low_mem_mask = PHYS_ADDR_MAX;
> > > >
> > > > 	return low_mem_mask + 1;
> > > > }
> > >
> > > or add the disabled CONFIG_ZONE_DMA/DMA32 case into crash_addr_low_max()
> > > as you suggested. Which one do you like better?
> > >
> > > > static phys_addr_t __init crash_addr_low_max(void)
> > > > {
> > > > 	if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
> > > > 		return PHYS_ADDR_MAX + 1;
> > > >
> > > > 	return max_zone_phys(32);
> > > > }
> >
> > I like the second variant better.
>
> Sure, will change to use the 2nd one . Thanks.
>

While I appreciate the effort that has gone into solving this problem,
I don't think there is any consensus that an elaborate fix is required
to ensure that the crash kernel can be unmapped from the linear map at
all costs. In fact, I personally think we shouldn't bother, and IIRC,
Will made a remark along the same lines back when the Huawei engineers
were still driving this effort.

So perhaps we could align on that before doing yet another version of this?



Re: [PATCH 1/3] memblock: define functions to set the usable memory range

2022-01-29 Thread Ard Biesheuvel
On Mon, 24 Jan 2022 at 22:05, Frank van der Linden  wrote:
>
> Meanwhile, it seems that this issue was already addressed in:
>
> https://lore.kernel.org/all/20211215021348.8766-1-kernelf...@gmail.com/
>
> ..which has now been pulled in, and sent to stable@ for 5.15. I
> somehow missed that message, and sent my change in a few weeks
> later.
>
> The fix to just reserve the ranges does seem a bit cleaner overall,
> but this will do fine.
>

Works for me.



Re: [PATCH v1 0/2] firmware: dmi_scan: Make it work in kexec'ed kernel

2021-10-17 Thread Ard Biesheuvel
On Thu, 7 Oct 2021 at 09:23, Andy Shevchenko  wrote:
>
> On Thu, Oct 7, 2021 at 10:20 AM Ard Biesheuvel  wrote:
> > On Wed, 6 Oct 2021 at 18:28, Andy Shevchenko wrote:
> > > On Mon, Jun 14, 2021 at 08:27:36PM +0300, Andy Shevchenko wrote:
> > > > On Mon, Jun 14, 2021 at 08:07:33PM +0300, Andy Shevchenko wrote:
>
> ...
>
> > > > Double checked, confirmed that it's NOT working.
> > >
> > > Any news here?
> > >
> > > Shall I resend my series?
> >
> > As I said before:
> >
> > """
> > I would still prefer to get to the bottom of this before papering over
> > it with command line options. If the memory gets corrupted by the
> > first kernel, maybe we are not preserving it correctly in the first
> > kernel.
> > """
>
> And I couldn't agree more, but above I asked about news, meaning: is
> there anything to test?
> The issue is still there and it is becoming a bit annoying to see my hack
> patches in every tree I have been using.
>

If nobody can be bothered to properly diagnose this, how important is
it, really?



Re: [PATCH v1 0/2] firmware: dmi_scan: Make it work in kexec'ed kernel

2021-10-07 Thread Ard Biesheuvel
On Wed, 6 Oct 2021 at 18:28, Andy Shevchenko  wrote:
>
> On Mon, Jun 14, 2021 at 08:27:36PM +0300, Andy Shevchenko wrote:
> > On Mon, Jun 14, 2021 at 08:07:33PM +0300, Andy Shevchenko wrote:
> > > On Mon, Jun 14, 2021 at 06:38:30PM +0300, Andy Shevchenko wrote:
> > > > On Sat, Jun 12, 2021 at 12:40:57PM +0800, Dave Young wrote:
> > > > > > It is probably doable to have kexec working on 32-bit EFI
> > > > > > without runtime service support, which means there is no need for
> > > > > > the fixed mapping trick.
> > > > > >
> > > > > > If I can restore my VM to boot 32-bit EFI this weekend, then I may
> > > > > > provide some draft patches for testing.
> > > > >
> > > > > Unfortunately I failed to set up a 32-bit EFI guest. Here are some
> > > > > untested draft patches; please give them a try.
> > > >
> > > > Thanks for the patches.
> > > >
> > > > As before, I have reverted my hacks and applied your patches (I also
> > > > dropped the patches from the previous mail against the kernel and
> > > > kexec-tools) to both kernel and user space in the first and second
> > > > environments.
> > > >
> > > > It does NOT solve the issue.
> > > >
> > > > If no idea pops up soon, I'm going to resend my series that works
> > > > around the issue.
> > >
> > > Hold on, I may have made a mistake during testing. Let me retest this.
> >
> > Double checked, confirmed that it's NOT working.
>
> Any news here?
>
> Shall I resend my series?
>

As I said before:

"""
I would still prefer to get to the bottom of this before papering over
it with command line options. If the memory gets corrupted by the
first kernel, maybe we are not preserving it correctly in the first
kernel.
"""



Re: [PATCH v6] ARM: uncompress: Parse "linux, usable-memory-range" DT property

2021-09-22 Thread Ard Biesheuvel
On Wed, 15 Sept 2021 at 15:20, Geert Uytterhoeven wrote:
>
> Add support for parsing the "linux,usable-memory-range" DT property.
> This property is used to describe the usable memory reserved for the
> crash dump kernel, and thus makes the memory reservation explicit.
> If present, Linux no longer needs to mask the program counter, and rely
> on the "mem=" kernel parameter to obtain the start and size of usable
> memory.
>
> For backwards compatibility, the traditional method to derive the start
> of memory is still used if "linux,usable-memory-range" is absent.
>
> Signed-off-by: Geert Uytterhoeven 

Acked-by: Ard Biesheuvel 
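
As an illustration of the parsing the commit message describes, here is a stand-alone sketch of decoding a "linux,usable-memory-range" value from big-endian cells. It is not the decompressor code: be32_load(), cells_to_u64() and parse_usable_range() are hypothetical names, and the lower bound on the cell counts is an extra check added for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Load one big-endian 32-bit devicetree cell. */
static uint32_t be32_load(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Combine one or two consecutive cells into a 64-bit value. */
static uint64_t cells_to_u64(const uint8_t *p, uint32_t ncells)
{
    uint64_t v = 0;

    for (uint32_t i = 0; i < ncells; i++)
        v = (v << 32) | be32_load(p + 4 * i);
    return v;
}

/*
 * Decode <base size> from the raw property bytes.
 * Returns 0 and fills *base / *end on success, -1 if the range is
 * empty, uses unsupported cell counts, or starts above 4G.
 */
static int parse_usable_range(const uint8_t *prop, uint32_t addr_cells,
                              uint32_t size_cells, uint32_t *base,
                              uint64_t *end)
{
    uint64_t addr, size;

    if (addr_cells < 1 || addr_cells > 2 ||
        size_cells < 1 || size_cells > 2)
        return -1;

    addr = cells_to_u64(prop, addr_cells);
    size = cells_to_u64(prop + 4 * addr_cells, size_cells);

    if (!size)
        return -1;              /* empty range: nothing usable */
    if (addr > UINT32_MAX)
        return -1;              /* outside the 32-bit address space */

    *base = (uint32_t)addr;
    *end = addr + size;         /* exclusive end, may exceed 32 bits */
    return 0;
}
```

A caller that gets -1 would fall back to the traditional method of deriving the start of memory, which is what the quoted patch does by returning mem_start.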

> ---
> KernelVersion: v5.15-rc1
> ---
> The corresponding patch for kexec-tools is "[PATCH] arm: kdump: Add DT
> properties to crash dump kernel's DTB", which is still valid:
> https://lore.kernel.org/r/20200902154129.6358-1-geert+rene...@glider.be/
>
> v6:
>   - All dependencies are in v5.15-rc1,
>
> v5:
>   - Remove the addition of "linux,elfcorehdr" and
> "linux,usable-memory-range" handling to arch/arm/mm/init.c,
>
> v4:
>   - Remove references to architectures in chosen.txt, to avoid having to
> change this again when more architectures copy kdump support,
>   - Remove the architecture-specific code for parsing
> "linux,usable-memory-range" and "linux,elfcorehdr", as the FDT core
> code now takes care of this,
>   - Move chosen.txt change to patch changing the FDT core,
>   - Use IS_ENABLED(CONFIG_CRASH_DUMP) instead of #ifdef,
>
> v3:
>   - Rebase on top of accepted solution for DTB memory information
> handling, which is part of v5.12-rc1,
>
> v2:
>   - Rebase on top of reworked DTB memory information handling.
> ---
>  .../arm/boot/compressed/fdt_check_mem_start.c | 48 ---
>  1 file changed, 42 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm/boot/compressed/fdt_check_mem_start.c b/arch/arm/boot/compressed/fdt_check_mem_start.c
> index 62450d824c3ca180..9291a2661bdfe57f 100644
> --- a/arch/arm/boot/compressed/fdt_check_mem_start.c
> +++ b/arch/arm/boot/compressed/fdt_check_mem_start.c
> @@ -55,16 +55,17 @@ static uint64_t get_val(const fdt32_t *cells, uint32_t ncells)
>   * DTB, and, if out-of-range, replace it by the real start address.
>   * To preserve backwards compatibility (systems reserving a block of memory
>   * at the start of physical memory, kdump, ...), the traditional method is
> - * always used if it yields a valid address.
> + * used if it yields a valid address, unless the "linux,usable-memory-range"
> + * property is present.
>   *
>   * Return value: start address of physical memory to use
>   */
>  uint32_t fdt_check_mem_start(uint32_t mem_start, const void *fdt)
>  {
> -   uint32_t addr_cells, size_cells, base;
> +   uint32_t addr_cells, size_cells, usable_base, base;
> uint32_t fdt_mem_start = 0xffffffff;
> -   const fdt32_t *reg, *endp;
> -   uint64_t size, end;
> +   const fdt32_t *usable, *reg, *endp;
> +   uint64_t size, usable_end, end;
> const char *type;
> int offset, len;
>
> @@ -80,6 +81,27 @@ uint32_t fdt_check_mem_start(uint32_t mem_start, const void *fdt)
> if (addr_cells > 2 || size_cells > 2)
> return mem_start;
>
> +   /*
> +* Usable memory in case of a crash dump kernel
> +* This property describes a limitation: memory within this range is
> +* only valid when also described through another mechanism
> +*/
> +   usable = get_prop(fdt, "/chosen", "linux,usable-memory-range",
> + (addr_cells + size_cells) * sizeof(fdt32_t));
> +   if (usable) {
> +   size = get_val(usable + addr_cells, size_cells);
> +   if (!size)
> +   return mem_start;
> +
> +   if (addr_cells > 1 && fdt32_ld(usable)) {
> +   /* Outside 32-bit address space */
> +   return mem_start;
> +   }
> +
> +   usable_base = fdt32_ld(usable + addr_cells - 1);
> +   usable_end = usable_base + size;
> +   }
> +
> /* Walk all memory nodes and regions */
> for (offset = fdt_next_node(fdt, -1, NULL); offset >= 0;
>  offset = fdt_next_node(fdt, offset, NULL)) {
> @@ -107,7 +129,20 @@ uint32_t fdt_check_mem_start(uint32_t mem_start, const void *fdt)
>
> base = fdt32_ld(reg + addr_cells - 1);
> end

Re: [PATCH v1 0/2] firmware: dmi_scan: Make it work in kexec'ed kernel

2021-07-19 Thread Ard Biesheuvel
On Mon, 14 Jun 2021 at 19:27, Andy Shevchenko  wrote:
>
> On Mon, Jun 14, 2021 at 08:07:33PM +0300, Andy Shevchenko wrote:
> > On Mon, Jun 14, 2021 at 06:38:30PM +0300, Andy Shevchenko wrote:
> > > On Sat, Jun 12, 2021 at 12:40:57PM +0800, Dave Young wrote:
> > > > > It is probably doable to have kexec working on 32-bit EFI
> > > > > without runtime service support, which means there is no need for
> > > > > the fixed mapping trick.
> > > > >
> > > > > If I can restore my VM to boot 32-bit EFI this weekend, then I may
> > > > > provide some draft patches for testing.
> > > >
> > > > Unfortunately I failed to set up a 32-bit EFI guest. Here are some
> > > > untested draft patches; please give them a try.
> > >
> > > Thanks for the patches.
> > >
> > > As before, I have reverted my hacks and applied your patches (I also
> > > dropped the patches from the previous mail against the kernel and
> > > kexec-tools) to both kernel and user space in the first and second
> > > environments.
> > >
> > > It does NOT solve the issue.
> > >
> > > If no idea pops up soon, I'm going to resend my series that works
> > > around the issue.
> >
> > Hold on, I may have made a mistake during testing. Let me retest this.
>
> Double checked, confirmed that it's NOT working.
>

Apologies for chiming in so late - in my defence, I was on vacation :-)

So if I understand the thread correctly, the Surface 3 provides a
SMBIOS entry point (not SMBIOS3), and it does not get picked up by the
second kernel, right?

I would still prefer to get to the bottom of this before papering over
it with command line options. If the memory gets corrupted by the
first kernel, maybe we are not preserving it correctly in the first
kernel.
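
To make the SMBIOS versus SMBIOS3 distinction concrete, here is a rough sketch of a legacy-style anchor scan over a memory window, stepping on 16-byte boundaries. The names (find_smbios_anchor(), enum smbios_kind) are hypothetical and this is not dmi_scan.c: a real scanner also validates the entry point's length and checksum, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum smbios_kind { SMBIOS_NONE = 0, SMBIOS_LEGACY = 1, SMBIOS_V3 = 2 };

/*
 * Scan buf for an entry point anchor string on 16-byte boundaries.
 * "_SM_" marks the 32-bit (legacy) entry point, "_SM3_" the 64-bit one.
 * On a match, the offset of the anchor is stored in *off.
 */
static enum smbios_kind find_smbios_anchor(const unsigned char *buf,
                                           size_t len, size_t *off)
{
    for (size_t i = 0; i + 16 <= len; i += 16) {
        if (memcmp(buf + i, "_SM3_", 5) == 0) {
            *off = i;
            return SMBIOS_V3;
        }
        if (memcmp(buf + i, "_SM_", 4) == 0) {
            *off = i;
            return SMBIOS_LEGACY;
        }
    }
    return SMBIOS_NONE;
}
```

If the first kernel does not preserve the memory holding the entry point, a scan like this finds nothing in the second kernel, which matches the symptom reported in this thread.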



Re: [PATCH v2 1/5] arm64: kexec_file: Forbid non-crash kernels

2021-05-31 Thread Ard Biesheuvel
On Mon, 31 May 2021 at 11:57, Marc Zyngier  wrote:
>
> It has been reported that kexec_file doesn't really work on arm64.
> It completely ignores any of the existing reservations, which results
> in the secondary kernel being loaded where the GICv3 LPI tables live,
> or even corrupting the ACPI tables.
>
> Since only crash kernels are immune to this as they use a reserved
> memory region, disable the non-crash kernel use case. Further
> patches will try and restore the functionality.
>
> Reported-by: Moritz Fischer 
> Signed-off-by: Marc Zyngier 
> Cc: sta...@vger.kernel.org # 5.10

Acked-by: Ard Biesheuvel 

... but do we really only need this in 5.10 and not earlier?

> ---
>  arch/arm64/kernel/kexec_image.c | 20 
>  1 file changed, 20 insertions(+)
>
> diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c
> index 9ec34690e255..acf9cd251307 100644
> --- a/arch/arm64/kernel/kexec_image.c
> +++ b/arch/arm64/kernel/kexec_image.c
> @@ -145,3 +145,23 @@ const struct kexec_file_ops kexec_image_ops = {
> .verify_sig = image_verify_sig,
>  #endif
>  };
> +
> +/**
> + * arch_kexec_locate_mem_hole - Find free memory to place the segments.
> + * @kbuf:   Parameters for the memory search.
> + *
> + * On success, kbuf->mem will have the start address of the memory region found.
> + *
> + * Return: 0 on success, negative errno on error.
> + */
> +int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
> +{
> +   /*
> +* For the time being, kexec_file_load isn't reliable except
> +* for crash kernel. Say sorry to the user.
> +*/
> +   if (kbuf->image->type != KEXEC_TYPE_CRASH)
> +   return -EADDRNOTAVAIL;
> +
> +   return kexec_locate_mem_hole(kbuf);
> +}
> --
> 2.30.2
>



Re: [PATCH v2 0/5] arm64: Make kexec_file_load honor iomem reservations

2021-05-31 Thread Ard Biesheuvel
On Mon, 31 May 2021 at 11:57, Marc Zyngier  wrote:
>
> This series is a complete departure from the approach I initially sent
> almost a month ago[0]. Instead of trying to teach EFI, ACPI and other
> subsystems to use memblock, I've decided to stick with the iomem
> resource tree and use that exclusively for arm64.
>
> This means that my current approach is (despite what I initially
> replied to both Dave and Catalin) to provide an arm64-specific
> implementation of arch_kexec_locate_mem_hole() which walks the
> resource tree and excludes ranges of RAM that have been registered for
> any odd purpose. This is exactly what the userspace implementation
> does, and I don't really see a good reason to diverge from it.
>
> Again, this allows my Synquacer board to reliably use kexec_file_load
> with as little as 256M, something that would always fail before as it
> would overwrite most of the reserved tables.
>
> Although this series still targets 5.14, the initial patch is a
> -stable candidate, and disables non-kdump uses of kexec_file_load. I
> have limited it to 5.10, as earlier kernels will require a different,
> probably more invasive approach.
>
> Catalin, Ard: although this series has changed a bit compared to v1,
> I've kept your AB/RB tags. Should anything seem odd, please let me
> know and I'll drop them.
>

Fine with me.

> Thanks,
>
> M.
>
> * From v1 [1]:
>   - Move the overlap exclusion into find_next_iomem_res()
>   - Handle child resource not overlapping with parent
>   - Provide walk_system_ram_excluding_child_res() as a top level
> walker
>   - Simplify arch-specific code
>   - Add initial patch disabling non-crash kernels
>
> [0] https://lore.kernel.org/r/20210429133533.1750721-1-...@kernel.org
> [1] https://lore.kernel.org/r/20210526190531.62751-1-...@kernel.org
>
> Marc Zyngier (5):
>   arm64: kexec_file: Forbid non-crash kernels
>   kexec_file: Make locate_mem_hole_callback global
>   kernel/resource: Allow find_next_iomem_res() to exclude overlapping
> child resources
>   kernel/resource: Introduce walk_system_ram_excluding_child_res()
>   arm64: kexec_image: Restore full kexec functionnality
>
>  arch/arm64/kernel/kexec_image.c | 39 
>  include/linux/ioport.h  |  3 ++
>  include/linux/kexec.h   |  1 +
>  kernel/kexec_file.c |  6 +--
>  kernel/resource.c   | 82 +
>  5 files changed, 119 insertions(+), 12 deletions(-)
>
> --
> 2.30.2
>
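
The idea in the cover letter, walking a System RAM range while excluding child resources registered inside it, can be sketched outside the kernel as below. This is an assumption-laden model, not kernel/resource.c: struct range and free_ranges() are hypothetical, and the children are assumed to be sorted by start address and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address range with an inclusive end, like struct resource. */
struct range {
    uint64_t start;
    uint64_t end;
};

/*
 * Report the parts of 'ram' not covered by any child range.
 * Children must be sorted by start and must not overlap each other;
 * they may lie partly or fully outside 'ram'.
 * Stores at most 'max' gaps in 'out' and returns the number found.
 */
static size_t free_ranges(struct range ram, const struct range *child,
                          size_t nchild, struct range *out, size_t max)
{
    uint64_t cur = ram.start;   /* first address not yet classified */
    size_t n = 0;
    int tail_covered = 0;

    for (size_t i = 0; i < nchild; i++) {
        if (child[i].end < cur)
            continue;           /* entirely below the cursor */
        if (child[i].start > ram.end)
            break;              /* entirely above the parent */

        if (child[i].start > cur) {
            if (n < max)
                out[n] = (struct range){ cur, child[i].start - 1 };
            n++;
        }

        if (child[i].end >= ram.end) {
            tail_covered = 1;   /* child reaches the end of the parent */
            break;
        }
        cur = child[i].end + 1;
    }

    if (!tail_covered && cur <= ram.end) {
        if (n < max)
            out[n] = (struct range){ cur, ram.end };
        n++;
    }
    return n;
}
```

A kexec segment would then only be placed inside one of the reported gaps, so reserved tables living in RAM are not overwritten.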



Re: [PATCH 0/4] arm64: Make kexec_file_load honor iomem reservations

2021-05-30 Thread Ard Biesheuvel
On Thu, 27 May 2021 at 19:39, Catalin Marinas  wrote:
>
> On Wed, May 26, 2021 at 08:05:27PM +0100, Marc Zyngier wrote:
> > This series is a complete departure from the approach I initially sent
> > almost a month ago[1]. Instead of trying to teach EFI, ACPI and other
> > subsystems to use memblock, I've decided to stick with the iomem
> > resource tree and use that exclusively for arm64.
> >
> > This means that my current approach is (despite what I initially
> > replied to both Dave and Catalin) to provide an arm64-specific
> > implementation of arch_kexec_locate_mem_hole() which walks the
> > resource tree and excludes ranges of RAM that have been registered for
> > any odd purpose. This is exactly what the userspace implementation
> > does, and I don't really see a good reason to diverge from it.
> >
> > Again, this allows my Synquacer board to reliably use kexec_file_load
> > with as little as 256M, something that would always fail before as it
> > would overwrite most of the reserved tables.
> >
> > Obviously, this is now at least 5.14 material. Given how broken
> > kexec_file_load is for non-crash kernels on arm64 at the moment,
> > should we at least disable it in 5.13 and all previous stable kernels?
>
> I think it makes sense to disable it in the current and earlier kernels.
>

Ack to that

> For this series:
>
> Acked-by: Catalin Marinas 

and likewise for the series

Reviewed-by: Ard Biesheuvel 



Re: [PATCH] efi/x86: Revert struct layout change to fix kexec boot regression

2020-04-10 Thread Ard Biesheuvel
On Fri, 10 Apr 2020 at 16:34, Borislav Petkov  wrote:
>
> On Fri, Apr 10, 2020 at 04:22:49PM +0200, Ard Biesheuvel wrote:
> > > BTW, a fixes tag is good to have..
> >
> > I usually omit those for patches that fix bugs that were introduced in
> > the current cycle.
>
> A valid use case for having the Fixes: tag anyway are the backporting
> kernels gangs which might pick up the first patch for whatever reason
> and would probably be thankful if they find the second one, i.e., the
> fix for the first one, through grepping or other, automated means.
>

Fair point.



Re: [PATCH] efi/x86: Revert struct layout change to fix kexec boot regression

2020-04-10 Thread Ard Biesheuvel
On Fri, 10 Apr 2020 at 16:02, Dave Young  wrote:
>
> On 04/10/20 at 09:56pm, Dave Young wrote:
> > On 04/10/20 at 09:43am, Ard Biesheuvel wrote:
> > > Commit
> > >
> > >   0a67361dcdaa29 ("efi/x86: Remove runtime table address from kexec EFI setup data")
> > >
> > > removed the code that retrieves the non-remapped UEFI runtime services
> > > pointer from the data structure provided by kexec, as it was never really
> > > needed on the kexec boot path: mapping the runtime services table at its
> > > non-remapped address is only needed when calling SetVirtualAddressMap(),
> > > which never happens during a kexec boot in the first place.
> > >
> > > However, dropping the 'runtime' member from struct efi_setup_data was a
> > > mistake. That struct is shared ABI between the kernel and the kexec tooling
> > > for x86, and so we cannot simply change its layout. So let's put back the
> > > removed field, but call it 'unused' to reflect the fact that we never look
> > > at its contents. While at it, add a comment to remind our future selves
> > > that the layout is external ABI.
> > >
> > > Reported-by: Theodore Ts'o 
> > > Tested-by: Theodore Ts'o 
> > > Signed-off-by: Ard Biesheuvel 
> > > ---
> > >
> > > Ingo, Thomas, Boris: I sent out my efi-urgent pull request just yesterday,
> > > so please take this directly into tip:efi/urgent - no need to wait for the
> > > next batch.
> > >
> > >  arch/x86/include/asm/efi.h | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
> > > index 781170d36f50..96044c8d8600 100644
> > > --- a/arch/x86/include/asm/efi.h
> > > +++ b/arch/x86/include/asm/efi.h
> > > @@ -178,8 +178,10 @@ extern void efi_free_boot_services(void);
> > >  extern pgd_t * __init efi_uv1_memmap_phys_prolog(void);
> > >  extern void __init efi_uv1_memmap_phys_epilog(pgd_t *save_pgd);
> > >
> > > +/* kexec external ABI */
> > >  struct efi_setup_data {
> > > u64 fw_vendor;
> > > +   u64 unused;
> > > u64 tables;
> > > u64 smbios;
> > > u64 reserved[8];
> > > --
> > > 2.17.1
> > >
> >
> > Ah, I replied too quickly in another mail. I just Cc'ed the kexec list again.
> >
> > Thanks for the fix:
> >
> > Reviewed-by: Dave Young 
> >
>

Thanks Dave

> BTW, a fixes tag is good to have..
>

I usually omit those for patches that fix bugs that were introduced in
the current cycle.
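
For anyone wondering why dropping one member broke kexec boot, the effect on the layout can be shown with a small stand-alone sketch. The two struct names and read_smbios_with_broken_layout() are hypothetical; only the field names and order come from the patch quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout that the kexec tooling fills in (as restored by the patch). */
struct efi_setup_data_abi {
    uint64_t fw_vendor;
    uint64_t unused;        /* formerly 'runtime'; keeps later offsets stable */
    uint64_t tables;
    uint64_t smbios;
    uint64_t reserved[8];
};

/* Layout after the member was dropped: later fields move up by 8 bytes. */
struct efi_setup_data_broken {
    uint64_t fw_vendor;
    uint64_t tables;
    uint64_t smbios;
    uint64_t reserved[8];
};

_Static_assert(offsetof(struct efi_setup_data_abi, tables) == 16,
               "tooling writes 'tables' at offset 16");
_Static_assert(offsetof(struct efi_setup_data_broken, tables) == 8,
               "broken layout reads 'tables' from offset 8");

/*
 * Interpret data written with the ABI layout through the broken layout
 * and return what the kernel would have taken for 'smbios'.
 */
static uint64_t read_smbios_with_broken_layout(const struct efi_setup_data_abi *from_tool)
{
    struct efi_setup_data_broken seen;

    memcpy(&seen, from_tool, sizeof(seen));
    return seen.smbios;     /* actually holds the tool's 'tables' value */
}
```

With the member removed, every field after fw_vendor is read from the wrong offset, so the second kernel picks up the wrong table addresses.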



Re: [PATCH v1 2/2] firmware: dmi_scan: Pass dmi_entry_point to kexec'ed kernel

2020-01-20 Thread Ard Biesheuvel
On Mon, 20 Jan 2020 at 23:31, Andy Shevchenko  wrote:
>
> On Mon, Jan 20, 2020 at 9:28 PM Eric W. Biederman wrote:
> > Andy Shevchenko  writes:
> > > On Sat, Dec 17, 2016 at 06:57:21PM +0800, Dave Young wrote:
> > >> Ccing efi people.
> > >>
> > >> On 12/16/16 at 02:33pm, Jean Delvare wrote:
> > >> > On Fri, 16 Dec 2016 14:18:58 +0200, Andy Shevchenko wrote:
> > >> > > On Fri, 2016-12-16 at 10:32 +0800, Dave Young wrote:
> > >> > > > On 12/15/16 at 12:28pm, Jean Delvare wrote:
> > >> > > > > I am no kexec expert but this confuses me. Shouldn't the second
> > >> > > > > kernel have access to the EFI systab as the first kernel does? It
> > >> > > > > includes many more pointers than just ACPI and DMI tables, and it
> > >> > > > > would seem inconvenient to have to pass all these addresses
> > >> > > > > individually explicitly.
> > >> > > >
> > >> > > > Yes, in the modern Linux kernel, kexec has support for EFI. I think
> > >> > > > it should work naturally, at least on x86_64.
> > >> > >
> > >> > > Thanks for this good news!
> > >> > >
> > >> > > Unfortunately Intel Galileo is 32-bit platform.
> > >> >
> > >> > If it was done for X86_64 then maybe it can be generalized to X86?
> > >>
> > >> For X86_64, we have a new way of mapping EFI runtime memory; the i386
> > >> code still uses the old ioremap way. It is impossible to use the same
> > >> way as X86_64 since the virtual address space is limited.
> > >>
> > >> But maybe for 32-bit the kexec kernel can run in physical mode. I'm not
> > >> sure, so I would suggest that Andy first do a test with efi=noruntime
> > >> for the kexec 2nd kernel.
> > >
> > > > Guys, it has been quite a long time with no word from you. As I told
> > > > you, the proposed workaround didn't help. Today I found that Microsoft
> > > > Surface 3 is also affected by this.
> > > >
> > > > Can we apply these patches for now, until you find a better
> > > > solution?
> >
> > Not a chance.  The patches don't apply to any kernel in the git history.
> >
> > Which may be part of your problem.  You are or at least were running
> > with code that has not been merged upstream.
>
> It's done against linux-next.
> It applied cleanly. (Not the version in this more than a year old series,
> of course; that's why I said I can resend.)
>
> > > P.S. I may resend them rebased on recent vanilla.
> >
> > Second.  I looked at your test results and they don't directly make
> > sense.  dmidecode bypasses the kernel completely or it did last time
> > I looked so I don't know why you would be using that to test if
> > something in the kernel is working.
> >
> > However dmidecode failing suggests that the actual problem is something
> > in the first kernel is stomping the dmi tables.
>
> See below.
>
> > Adding a command line option won't fix stomped tables.
>
> It provides a mechanism, which seems to be absent, to the second
> kernel to know where to look for SMBIOS tables.
>
> > So what I would suggest is:
> > a) Verify that dmidecode works before kexec.
>
> Yes, it does.
>
> > b) Test to see if dmidecode works after kexec.
>
> No, it doesn't.
>
> > c) Once (a) shows that dmidecode works and (b) shows that dmidecode
> >fails figure out what is stomping your dmi tables during or before
> >kexec and that is what should get fixed.
>
> The problem here, as I see it, is that the EFI and kexec protocols are
> not friendly to each other.
> I'm not an expert in either. That's why I'm asking for possible
> solutions. And this needs to be done in kernel to allow drivers to
> work.
>
> Does the
>
> commit 4996c02306a25def1d352ec8e8f48895bbc7dea9
> Author: Takao Indoh 
> Date:   Thu Jul 14 18:05:21 2011 -0400
>
> ACPI: introduce "acpi_rsdp=" parameter for kdump
>
> description shed a light on this?
>
> > Now using a non-efi method of dmi detection relies on the
> > tables being between 0xF0000 and 0x100000. AKA the last 64K
> > of the first 1MiB of memory.  You might check to see if your
> > dmi tables are in that address range.
>
> # dmidecode --no-sysfs
> # dmidecode 3.2
> Scanning /dev/mem for entry point.
> # No SMBIOS nor DMI entry point found, sorry.
>
> === with patch applied ===
> # dmidecode
> ...
> Release Date: 03/10/2015
> ...
>
> >
> > Otherwise I suspect the good solution is to give efi its own page
> > tables in the kernel and switch to them whenever efi functions are called.
> >
>
> > But on 32bit the Linux kernel has historically been just fine directly
> > accessing the hardware, and ignoring efi and all of the other BIOS's.
>
> It seems not only for 32-bit Linux kernel anymore. MS Surface 3 runs
> 64-bit code.
>
> > So if that doesn't work on Intel Galileo that is probably a firmware
> > problem.
>
> It's not only about Galileo anymore.
>

Looking at the x86 kexec EFI code, it seems that it has special
handling for the legacy SMBIOS table address, but not for the SMBIOS3
table address, which was introduced to accommodate SMBIOS tables
living in memory that is not 32-bit addressable.

Could anyone check whethe

Re: [PATCH v4 4/4] efi: Fix handling of multiple efi_fake_mem= entries

2020-01-09 Thread Ard Biesheuvel
On Wed, 8 Jan 2020 at 22:53, Dan Williams  wrote:
>
> On Tue, Jan 7, 2020 at 9:52 AM Ard Biesheuvel wrote:
> >
> > On Tue, 7 Jan 2020 at 06:19, Dave Young  wrote:
> > >
> > > On 01/06/20 at 08:16pm, Dan Williams wrote:
> > > > On Mon, Jan 6, 2020 at 8:04 PM Dave Young  wrote:
> > > > >
> > > > > On 01/06/20 at 04:40pm, Dan Williams wrote:
> > > > > > > Dave noticed that when specifying multiple efi_fake_mem= entries only
> > > > > > > the last entry was successfully being reflected in the efi memory map.
> > > > > > > This is due to the fact that the efi_memmap_insert() is being called
> > > > > > > multiple times, but on successive invocations the insertion should be
> > > > > > > applied to the last new memmap rather than the original map at
> > > > > > > efi_fake_memmap() entry.
> > > > > > >
> > > > > > > Rework efi_fake_memmap() to install the new memory map after each
> > > > > > > efi_fake_mem= entry is parsed.
> > > > > > >
> > > > > > > This also fixes an issue in efi_fake_memmap() that caused it to litter
> > > > > > > empty entries into the end of the efi memory map. An empty entry causes
> > > > > > > efi_memmap_insert() to attempt more memmap splits / copies than
> > > > > > > efi_memmap_split_count() accounted for when sizing the new map. When
> > > > > > > that happens efi_memmap_insert() may overrun its allocation, and if you
> > > > > > > are lucky will spill over to an unmapped page leading to crash
> > > > > > > signature like the following rather than silent corruption:
> > > > > >
> > > > > > BUG: unable to handle page fault for address: ff281000
> > > > > > [..]
> > > > > > RIP: 0010:efi_memmap_insert+0x11d/0x191
> > > > > > [..]
> > > > > > Call Trace:
> > > > > >  ? bgrt_init+0xbe/0xbe
> > > > > >  ? efi_arch_mem_reserve+0x1cb/0x228
> > > > > >  ? acpi_parse_bgrt+0xa/0xd
> > > > > >  ? acpi_table_parse+0x86/0xb8
> > > > > >  ? acpi_boot_init+0x494/0x4e3
> > > > > >  ? acpi_parse_x2apic+0x87/0x87
> > > > > >  ? setup_acpi_sci+0xa2/0xa2
> > > > > >  ? setup_arch+0x8db/0x9e1
> > > > > >  ? start_kernel+0x6a/0x547
> > > > > >  ? secondary_startup_64+0xb6/0xc0
> > > > > >
> > > > > > > Commit af1648984828 "x86/efi: Update e820 with reserved EFI boot
> > > > > > > services data to fix kexec breakage" is listed in Fixes: since it
> > > > > > > introduces more occurrences where efi_memmap_insert() is invoked after
> > > > > > > an efi_fake_mem= configuration has been parsed. Previously the side
> > > > > > > effects of vestigial empty entries were benign, but with commit
> > > > > > > af1648984828 that follow-on efi_memmap_insert() invocation triggers
> > > > > > > efi_memmap_insert() overruns.
> > > > > > >
> > > > > > > Fixes: 0f96a99dab36 ("efi: Add 'efi_fake_mem' boot option")
> > > > > > > Fixes: af1648984828 ("x86/efi: Update e820 with reserved EFI boot services...")
> > > > >
> > > > > A nitpick for the Fixes flags, as I replied in the thread below:
> > > > > https://lore.kernel.org/linux-efi/CAPcyv4jLxqPaB22Ao9oV31Gm=b0+phty+uz33snex4qchou...@mail.gmail.com/T/#m2bb2dd00f7715c9c19ccc48efef0fcd5fdb626e7
> > > > >
> > > > > > I reproduced two other panics without the patches applied, so this issue
> > > > > > is not caused by either of the commits, maybe just drop the Fixes.
> > > >
> > > > Just the "Fixes: af1648984828", right? No objection from me. I'll let
> > > > Ingo say if he needs a resend for that.
> > > >
> > > > The "Fixes: 0f96a99dab36" is valid because the original implementation
> > > > failed to handle the multiple argument case from the beginning.
> > >
> > > Agreed, thanks!
> > >

Re: [PATCH v4 4/4] efi: Fix handling of multiple efi_fake_mem= entries

2020-01-07 Thread Ard Biesheuvel
On Tue, 7 Jan 2020 at 06:19, Dave Young  wrote:
>
> On 01/06/20 at 08:16pm, Dan Williams wrote:
> > On Mon, Jan 6, 2020 at 8:04 PM Dave Young  wrote:
> > >
> > > On 01/06/20 at 04:40pm, Dan Williams wrote:
> > > > Dave noticed that when specifying multiple efi_fake_mem= entries only
> > > > the last entry was successfully being reflected in the efi memory map.
> > > > This is due to the fact that the efi_memmap_insert() is being called
> > > > multiple times, but on successive invocations the insertion should be
> > > > applied to the last new memmap rather than the original map at
> > > > efi_fake_memmap() entry.
> > > >
> > > > Rework efi_fake_memmap() to install the new memory map after each
> > > > efi_fake_mem= entry is parsed.
> > > >
> > > > This also fixes an issue in efi_fake_memmap() that caused it to litter
> > > > > empty entries into the end of the efi memory map. An empty entry causes
> > > > efi_memmap_insert() to attempt more memmap splits / copies than
> > > > efi_memmap_split_count() accounted for when sizing the new map. When
> > > > that happens efi_memmap_insert() may overrun its allocation, and if you
> > > > are lucky will spill over to an unmapped page leading to crash
> > > > signature like the following rather than silent corruption:
> > > >
> > > > BUG: unable to handle page fault for address: ff281000
> > > > [..]
> > > > RIP: 0010:efi_memmap_insert+0x11d/0x191
> > > > [..]
> > > > Call Trace:
> > > >  ? bgrt_init+0xbe/0xbe
> > > >  ? efi_arch_mem_reserve+0x1cb/0x228
> > > >  ? acpi_parse_bgrt+0xa/0xd
> > > >  ? acpi_table_parse+0x86/0xb8
> > > >  ? acpi_boot_init+0x494/0x4e3
> > > >  ? acpi_parse_x2apic+0x87/0x87
> > > >  ? setup_acpi_sci+0xa2/0xa2
> > > >  ? setup_arch+0x8db/0x9e1
> > > >  ? start_kernel+0x6a/0x547
> > > >  ? secondary_startup_64+0xb6/0xc0
> > > >
> > > > Commit af1648984828 "x86/efi: Update e820 with reserved EFI boot
> > > > services data to fix kexec breakage" is listed in Fixes: since it
> > > > introduces more occurrences where efi_memmap_insert() is invoked after
> > > > an efi_fake_mem= configuration has been parsed. Previously the side
> > > > effects of vestigial empty entries were benign, but with commit
> > > > af1648984828 that follow-on efi_memmap_insert() invocation triggers
> > > > efi_memmap_insert() overruns.
> > > >
> > > > Fixes: 0f96a99dab36 ("efi: Add 'efi_fake_mem' boot option")
> > > > > Fixes: af1648984828 ("x86/efi: Update e820 with reserved EFI boot services...")
> > >
> > > A nitpick for the Fixes flags, as I replied in the thread below:
> > > https://lore.kernel.org/linux-efi/CAPcyv4jLxqPaB22Ao9oV31Gm=b0+phty+uz33snex4qchou...@mail.gmail.com/T/#m2bb2dd00f7715c9c19ccc48efef0fcd5fdb626e7
> > >
> > > I reproduced two other panics without the patches applied, so this issue
> > > is not caused by either of the commits, maybe just drop the Fixes.
> >
> > Just the "Fixes: af1648984828", right? No objection from me. I'll let
> > Ingo say if he needs a resend for that.
> >
> > The "Fixes: 0f96a99dab36" is valid because the original implementation
> > failed to handle the multiple argument case from the beginning.
>
> Agreed, thanks!
>

I'll queue this but without the fixes tags. The -stable maintainers
are far too trigger happy IMHO, and this really needs careful review
before being backported. efi_fake_mem is a debug feature anyway, so I
don't see an urgent need to get this fixed retroactively in older
kernels.
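
The "only the last entry is reflected" behaviour described in the quoted commit message can be reproduced with a toy model. Everything here is hypothetical (struct memmap, struct fake_entry, the two apply_* helpers); a real EFI memory map is a list of descriptors that get split, not a fixed array of attribute words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES 8

/* Toy memory map: one attribute word per page. */
struct memmap {
    uint64_t attr[NPAGES];
};

/* Toy efi_fake_mem= entry: OR 'attr' into one page. */
struct fake_entry {
    size_t page;
    uint64_t attr;
};

/* Return a copy of 'base' with one entry applied. */
static struct memmap insert_entry(struct memmap base, struct fake_entry e)
{
    if (e.page < NPAGES)
        base.attr[e.page] |= e.attr;
    return base;
}

/* Buggy pattern: every insertion starts again from the original map. */
static struct memmap apply_from_original(struct memmap orig,
                                         const struct fake_entry *e, size_t n)
{
    struct memmap installed = orig;

    for (size_t i = 0; i < n; i++)
        installed = insert_entry(orig, e[i]);
    return installed;
}

/* Fixed pattern: install the new map after each entry and build on it. */
static struct memmap apply_chained(struct memmap orig,
                                   const struct fake_entry *e, size_t n)
{
    struct memmap installed = orig;

    for (size_t i = 0; i < n; i++)
        installed = insert_entry(installed, e[i]);
    return installed;
}
```

With two entries, apply_from_original() ends up with only the second one in the installed map, while apply_chained() keeps both.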



Re: [PATCH v4 3/4] efi: Fix efi_memmap_alloc() leaks

2020-01-07 Thread Ard Biesheuvel
On Tue, 7 Jan 2020 at 06:18, Dave Young  wrote:
>
> On 01/06/20 at 08:24pm, Dan Williams wrote:
> > On Mon, Jan 6, 2020 at 7:58 PM Dave Young  wrote:
> > >
> > > On 01/06/20 at 04:40pm, Dan Williams wrote:
> > > > With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
> > > > updated and replaced multiple times. When that happens a previous
> > > > dynamically allocated efi memory map can be garbage collected. Use the
> > > > new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
> > > > allocated memory map is being replaced.
> > > >
> > > > Debug statements in efi_memmap_free() reveal:
> > > >
> > > >  efi: __efi_memmap_free:37: phys: 0x23ffdd580 size: 2688 flags: 0x2
> > > >  efi: __efi_memmap_free:37: phys: 0x9db00 size: 2640 flags: 0x2
> > > >  efi: __efi_memmap_free:37: phys: 0x9e580 size: 2640 flags: 0x2
> > > >
> > > > ...a savings of 7968 bytes on a qemu boot with 2 entries specified to
> > > > efi_fake_mem=.
> > > >
> > > > Cc: Taku Izumi 
> > > > Cc: Ard Biesheuvel 
> > > > Signed-off-by: Dan Williams 
> > > > ---
> > > >  drivers/firmware/efi/memmap.c |   24 
> > > >  1 file changed, 24 insertions(+)
> > > >
> > > > diff --git a/drivers/firmware/efi/memmap.c 
> > > > b/drivers/firmware/efi/memmap.c
> > > > index 04dfa56b994b..bffa320d2f9a 100644
> > > > --- a/drivers/firmware/efi/memmap.c
> > > > +++ b/drivers/firmware/efi/memmap.c
> > > > @@ -29,6 +29,28 @@ static phys_addr_t __init 
> > > > __efi_memmap_alloc_late(unsigned long size)
> > > >   return PFN_PHYS(page_to_pfn(p));
> > > >  }
> > > >
> > > > +static void __init __efi_memmap_free(u64 phys, unsigned long size, 
> > > > unsigned long flags)
> > > > +{
> > > > + if (flags & EFI_MEMMAP_MEMBLOCK) {
> > > > + if (slab_is_available())
> > > > + memblock_free_late(phys, size);
> > > > + else
> > > > + memblock_free(phys, size);
> > > > + } else if (flags & EFI_MEMMAP_SLAB) {
> > > > + struct page *p = pfn_to_page(PHYS_PFN(phys));
> > > > + unsigned int order = get_order(size);
> > > > +
> > > > + free_pages((unsigned long) page_address(p), order);
> > > > + }
> > > > +}
> > > > +
> > > > +static void __init efi_memmap_free(void)
> > > > +{
> > > > + __efi_memmap_free(efi.memmap.phys_map,
> > > > + efi.memmap.desc_size * efi.memmap.nr_map,
> > > > + efi.memmap.flags);
> > > > +}
> > > > +
> > > >  /**
> > > >   * efi_memmap_alloc - Allocate memory for the EFI memory map
> > > >   * @num_entries: Number of entries in the allocated map.
> > > > @@ -100,6 +122,8 @@ static int __init __efi_memmap_init(struct 
> > > > efi_memory_map_data *data)
> > > >   return -ENOMEM;
> > > >   }
> > > >
> > > > + efi_memmap_free();
> > > > +
> > >
> > > This seems still not safe,  see below function:
> > > arch/x86/platform/efi/efi.c:
> > > static void __init efi_clean_memmap(void)
> > > It use same memmap for both old and new, and filter out those invalid
> > > ranges in place, if the memory is freed then ..
> >
> > In the efi_clean_memmap() case flags are 0, so efi_memmap_free() is a nop.
> >
> > Would you feel better with an explicit?
> >
> > > WARN_ON(efi.memmap.phys_map == data->phys_map && (data->flags &
> > > (EFI_MEMMAP_SLAB | EFI_MEMMAP_MEMBLOCK)));
> >
> > ...not sure it's worth it.
>
> Ah, yes, sorry I did not see the flags, although it is not very obvious.
> Maybe add some code comment for efi_mem_alloc and efi_mem_init.
>
> Let's defer the suggestion to Ard.
>

A one line comment to remind our future selves of this discussion
would probably be helpful, but beyond that, I don't think we need to
do much here.
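
For illustration, the flag check discussed above can be modelled in plain C
outside the kernel. This is a minimal sketch under stated assumptions, not
the patch itself: the flag bit values and the helper names
memmap_free_action() and replaces_self_while_owned() are made up for the
example. It shows why a memmap installed with flags == 0 (as in the
efi_clean_memmap() case) is never freed, and what the proposed WARN_ON()
condition would test.

```c
/* Assumed bit values, for illustration only. */
#define EFI_MEMMAP_MEMBLOCK (1UL << 1)
#define EFI_MEMMAP_SLAB     (1UL << 2)

enum free_action { FREE_NONE, FREE_MEMBLOCK, FREE_SLAB };

/*
 * Hypothetical helper: decide how a memory map carrying the given flags
 * would be released when it is replaced. A map with neither flag set
 * (firmware-provided, or filtered in place) is left alone.
 */
enum free_action memmap_free_action(unsigned long flags)
{
	if (flags & EFI_MEMMAP_MEMBLOCK)
		return FREE_MEMBLOCK;
	if (flags & EFI_MEMMAP_SLAB)
		return FREE_SLAB;
	return FREE_NONE;
}

/*
 * Hypothetical helper: the condition the suggested WARN_ON() would check,
 * i.e. a dynamically allocated map being "replaced" by itself, which
 * would free memory that is still in use.
 */
int replaces_self_while_owned(unsigned long long old_phys,
			      unsigned long long new_phys,
			      unsigned long new_flags)
{
	return old_phys == new_phys &&
	       (new_flags & (EFI_MEMMAP_SLAB | EFI_MEMMAP_MEMBLOCK)) != 0;
}
```

With flags of 0 the first helper returns FREE_NONE, which is the "nop"
behaviour described above for the in-place filtering case.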



Re: [PATCH v3 2/4] efi: Add tracking for dynamically allocated memmaps

2020-01-02 Thread Ard Biesheuvel
Hi Dan,

Thanks for taking the time to really fix this properly.

Comments/questions below.

On Thu, 2 Jan 2020 at 05:29, Dan Williams  wrote:
>
> In preparation for fixing efi_memmap_alloc() leaks, add support for
> recording whether the memmap was dynamically allocated from slab,
> memblock, or is the original physical memmap provided by the platform.
>
> Cc: Taku Izumi 
> Cc: Ard Biesheuvel 
> Signed-off-by: Dan Williams 
> ---
>  arch/x86/platform/efi/efi.c |2 +-
>  arch/x86/platform/efi/quirks.c  |   11 ++-
>  drivers/firmware/efi/fake_mem.c |5 +++--
>  drivers/firmware/efi/memmap.c   |   16 ++--
>  include/linux/efi.h |8 ++--
>  5 files changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
> index 38d44f36d5ed..7086afbb84fd 100644
> --- a/arch/x86/platform/efi/efi.c
> +++ b/arch/x86/platform/efi/efi.c
> @@ -333,7 +333,7 @@ static void __init efi_clean_memmap(void)
> u64 size = efi.memmap.nr_map - n_removal;
>
> pr_warn("Removing %d invalid memory map entries.\n", 
> n_removal);
> -   efi_memmap_install(efi.memmap.phys_map, size);
> +   efi_memmap_install(efi.memmap.phys_map, size, 0);
> }
>  }
>
> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index f8f0220b6a66..4a71c790f9c3 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -244,6 +244,7 @@ EXPORT_SYMBOL_GPL(efi_query_variable_store);
>  void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
>  {
> phys_addr_t new_phys, new_size;
> +   unsigned long flags = 0;
> struct efi_mem_range mr;
> efi_memory_desc_t md;
> int num_entries;
> @@ -272,8 +273,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> size)
> num_entries += efi.memmap.nr_map;
>
> new_size = efi.memmap.desc_size * num_entries;
> -
> -   new_phys = efi_memmap_alloc(num_entries);
> +   new_phys = efi_memmap_alloc(num_entries, &flags);
> if (!new_phys) {
> pr_err("Could not allocate boot services memmap\n");
> return;
> @@ -288,7 +288,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> size)
> efi_memmap_insert(&efi.memmap, new, &mr);
> early_memunmap(new, new_size);
>
> -   efi_memmap_install(new_phys, num_entries);
> +   efi_memmap_install(new_phys, num_entries, flags);
> e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> e820__update_table(e820_table);
>  }
> @@ -408,6 +408,7 @@ static void __init efi_unmap_pages(efi_memory_desc_t *md)
>  void __init efi_free_boot_services(void)
>  {
> phys_addr_t new_phys, new_size;
> +   unsigned long flags = 0;
> efi_memory_desc_t *md;
> int num_entries = 0;
> void *new, *new_md;
> @@ -463,7 +464,7 @@ void __init efi_free_boot_services(void)
> return;
>
> new_size = efi.memmap.desc_size * num_entries;
> -   new_phys = efi_memmap_alloc(num_entries);
> +   new_phys = efi_memmap_alloc(num_entries, &flags);
> if (!new_phys) {
> pr_err("Failed to allocate new EFI memmap\n");
> return;
> @@ -493,7 +494,7 @@ void __init efi_free_boot_services(void)
>
> memunmap(new);
>
> -   if (efi_memmap_install(new_phys, num_entries)) {
> +   if (efi_memmap_install(new_phys, num_entries, flags)) {
> pr_err("Could not install new EFI memmap\n");
> return;
> }
> diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
> index bb9fc70d0cfa..7e53e5520548 100644
> --- a/drivers/firmware/efi/fake_mem.c
> +++ b/drivers/firmware/efi/fake_mem.c
> @@ -39,6 +39,7 @@ void __init efi_fake_memmap(void)
> int new_nr_map = efi.memmap.nr_map;
> efi_memory_desc_t *md;
> phys_addr_t new_memmap_phy;
> +   unsigned long flags = 0;
> void *new_memmap;
> int i;
>
> @@ -55,7 +56,7 @@ void __init efi_fake_memmap(void)
> }
>
> /* allocate memory for new EFI memmap */
> -   new_memmap_phy = efi_memmap_alloc(new_nr_map);
> +   new_memmap_phy = efi_memmap_alloc(new_nr_map, &flags);
> if (!new_memmap_phy)
> return;
>
> @@ -73,7 +74,7 @@ void __init efi_fake_memmap(void)
> /* swap into new EFI memmap */
> early_memunmap(new_memmap, efi.memmap.desc_size * new_nr_map);
>
> -   efi_memm

Re: [PATCH] efi/memreserve: register reservations as 'reserved' in /proc/iomem

2019-12-05 Thread Ard Biesheuvel
On Wed, 4 Dec 2019 at 20:13, Bhupesh SHARMA  wrote:
>
> Hello Masa,
>
> (+Cc Simon)
>
> On Thu, Dec 5, 2019 at 12:27 AM Masayoshi Mizuma  
> wrote:
> >
> > On Wed, Dec 04, 2019 at 06:17:59PM +, James Morse wrote:
> > > Hi Masa,
> > >
> > > On 04/12/2019 17:17, Masayoshi Mizuma wrote:
> > > > Thank you for sending the patch, but unfortunately it doesn't work for 
> > > > the issue...
> > > >
> > > > After applied your patch, the LPI tables are marked as reserved in
> > > > /proc/iomem like as:
> > > >
> > > > 8030-a1fd : System RAM
> > > >   8048-8134 : Kernel code
> > > >   8135-817b : reserved
> > > >   817c-82ac : Kernel data
> > > >   830f-830f : reserved # Property table
> > > >   8348-83480fff : reserved # Pending table
> > > >   8349-8349 : reserved # Pending table
> > > >
> > > > However, kexec tries to allocate memory from System RAM, it doesn't care
> > > > the reserved in System RAM.
> > >
> > > > I'm not sure why kexec doesn't care the reserved in System RAM, however,
> > >
> > > Hmm, we added these to fix a problem with the UEFI memory map, and more 
> > > recently ACPI
> > > tables being overwritten by kexec.
> > >
> > > Which version of kexec-tools are you using? Could you try:
> > > https://git.linaro.org/people/takahiro.akashi/kexec-tools.git/commit/?h=arm64/resv_mem
> >
> > Thanks a lot! It worked and the issue is gone with Ard's patch and
> > the linaro kexec (arm64/resv_mem branch).
> >
> > Ard, please feel free to add:
> >
> > Tested-by: Masayoshi Mizuma 
>
> Same results at my side, so:
> Tested-and-Reviewed-by: Bhupesh Sharma 
>

Thank you all. I'll get this queued as a fix with cc:stable for v5.4


> > >
> > > > > if the kexec behavior is right, the LPI tables should not belong to
> > > > System RAM.
> > >
> > > > Like as:
> > > >
> > > > 8030-830e : System RAM
> > > >   8048-8134 : Kernel code
> > > >   8135-817b : reserved
> > > >   817c-82ac : Kernel data
> > > > 830f-830f : reserved # Property table
> > > > 8348-83480fff : reserved # Pending table
> > > > 8349-8349 : reserved # Pending table
> > > > 834a-a1fd : System RAM
> > > >
> > > > > I don't have ideas to separate LPI tables from System RAM... so I tried
> > > > to add a new file to inform the LPI tables to userspace.
> > >
> > > This is how 'nomap' memory appears, we carve it out of System RAM. A side 
> > > effect of this
> > > is kdump can't touch it, as you've told it this isn't memory.
> > >
> > > As these tables are memory, mapped by the linear map, I think Ard's patch 
> > > is the right
> > > thing to do ... I suspect your kexec-tools doesn't have those patches 
> > > from Akashi to make
> > > it honour all second level entries.
> >
> > I used the kexec on the top of master branch:
> > git://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
> >
> > Should we use the linaro kexec for aarch64 machine?
> > Or will the arm64/resv_mem branch be merged to the kexec on
> > git.kernel.org...?
>
> Glad that Ard's patch fixes the issue for you.
> Regarding Akashi's patch, I think it was sent to upstream kexec-tools
> some time ago (see [0]) but seems not integrated in upstream
> kexec-tools (now I noticed my Tested-by email for the same got bounced
> off due to some gmail msmtp setting issues at my end - sorry for
> that). I have added Simon in Cc list.
>
> Hi Simon,
>
> Can you please help pick [0] in upstream kexec-tools with Tested-by
> from Masa and myself? Thanks a lot for your help.
>
> [0]. http://lists.infradead.org/pipermail/kexec/2019-January/022201.html
>
> Thanks,
> Bhupesh



[PATCH] efi/memreserve: register reservations as 'reserved' in /proc/iomem

2019-12-04 Thread Ard Biesheuvel
Memory regions that are reserved using efi_mem_reserve_persistent()
are recorded in a special EFI config table which survives kexec,
allowing the incoming kernel to honour them as well. However,
such reservations are not visible in /proc/iomem, and so the kexec
tools that load the incoming kernel and its initrd into memory may
overwrite these reserved regions before the incoming kernel has a
chance to reserve them from further use.

So add these reservations to /proc/iomem as they are created. Note
that reservations that are inherited from a previous kernel are
memblock_reserve()'d early on, so they are already visible in
/proc/iomem.

Cc: Masayoshi Mizuma 
Cc: d.hatay...@fujitsu.com
Cc: kexec@lists.infradead.org
Signed-off-by: Ard Biesheuvel 
---
 drivers/firmware/efi/efi.c | 29 ++--
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index d101f072c8f8..fcd82dde23c8 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -979,6 +979,24 @@ static int __init efi_memreserve_map_root(void)
return 0;
 }
 
+static int efi_mem_reserve_iomem(phys_addr_t addr, u64 size)
+{
+   struct resource *res, *parent;
+
+   res = kzalloc(sizeof(struct resource), GFP_ATOMIC);
+   if (!res)
+   return -ENOMEM;
+
+   res->name   = "reserved";
+   res->flags  = IORESOURCE_MEM;
+   res->start  = addr;
+   res->end= addr + size - 1;
+
+   /* we expect a conflict with a 'System RAM' region */
+   parent = request_resource_conflict(&iomem_resource, res);
+   return parent ? request_resource(parent, res) : 0;
+}
+
 int __ref efi_mem_reserve_persistent(phys_addr_t addr, u64 size)
 {
struct linux_efi_memreserve *rsv;
@@ -1001,9 +1019,8 @@ int __ref efi_mem_reserve_persistent(phys_addr_t addr, 
u64 size)
if (index < rsv->size) {
rsv->entry[index].base = addr;
rsv->entry[index].size = size;
-
memunmap(rsv);
-   return 0;
+   return efi_mem_reserve_iomem(addr, size);
}
memunmap(rsv);
}
@@ -1013,6 +1030,12 @@ int __ref efi_mem_reserve_persistent(phys_addr_t addr, 
u64 size)
if (!rsv)
return -ENOMEM;
 
+   rc = efi_mem_reserve_iomem(__pa(rsv), SZ_4K);
+   if (rc) {
+   free_page(rsv);
+   return rc;
+   }
+
/*
 * The memremap() call above assumes that a linux_efi_memreserve entry
 * never crosses a page boundary, so let's ensure that this remains true
@@ -1029,7 +1052,7 @@ int __ref efi_mem_reserve_persistent(phys_addr_t addr, 
u64 size)
efi_memreserve_root->next = __pa(rsv);
spin_unlock(&efi_mem_reserve_persistent_lock);
 
-   return 0;
+   return efi_mem_reserve_iomem(addr, size);
 }
 
 static int __init efi_memreserve_root_init(void)
-- 
2.17.1




Re: [PATCH] x86/efi: update e820 about reserved EFI boot services data to fix kexec breakage

2019-12-04 Thread Ard Biesheuvel
On Wed, 4 Dec 2019 at 10:14, Ingo Molnar  wrote:
>
>
> * Dave Young  wrote:
>
> > On 12/04/19 at 03:52pm, Dave Young wrote:
> > > Michael Weiser reported he got below error during a kexec rebooting:
> > > esrt: Unsupported ESRT version 2904149718861218184.
> > >
> > > The ESRT memory stays in EFI boot services data, and it was reserved
> > > in kernel via efi_mem_reserve().  The initial purpose of the reservation
> > > is to reuse the EFI boot services data across kexec reboot. For example
> > > the BGRT image data and some ESRT memory like Michael reported.
> > >
> > > But although the memory is reserved it is not updated in X86 e820 table.
> > > And kexec_file_load iterate system ram in io resource list to find places
> > > for kernel, initramfs and other stuff. In Michael's case the kexec loaded
> > > initramfs overwritten the ESRT memory and then the failure happened.
> >
> > s/overwritten/overwrote :)  If need a repost please let me know..
> >
> > >
> > > Since kexec_file_load depends on the e820 to be updated, just fix this
> > > by updating the reserved EFI boot services memory as reserved type in 
> > > e820.
> > >
> > > Originally any memory descriptors with EFI_MEMORY_RUNTIME attribute are
> > > bypassed in the reservation code path because they are assumed as 
> > > reserved.
> > > But the reservation is still needed for multiple kexec reboot.
> > > And it is the only possible case we come here thus just drop the code
> > > chunk then everything works without side effects.
> > >
> > > On my machine the ESRT memory sits in an EFI runtime data range, it does
> > > not trigger the problem, but I successfully tested with BGRT instead.
> > > both kexec_load and kexec_file_load work and kdump works as well.
> > >
> > > Signed-off-by: Dave Young 
>
>
> So I edited this to:
>
>  From: Dave Young 
>
>  Michael Weiser reported he got this error during a kexec rebooting:
>
>esrt: Unsupported ESRT version 2904149718861218184.
>
>  The ESRT memory stays in EFI boot services data, and it was reserved
>  in kernel via efi_mem_reserve().  The initial purpose of the reservation
>  is to reuse the EFI boot services data across kexec reboot. For example
>  the BGRT image data and some ESRT memory like Michael reported.
>
>  But although the memory is reserved it is not updated in the X86 E820 table,
>  and kexec_file_load() iterates system RAM in the IO resource list to find 
> places
>  for kernel, initramfs and other stuff. In Michael's case the kexec loaded
>  initramfs overwrote the ESRT memory and then the failure happened.
>
>  Since kexec_file_load() depends on the E820 table being updated, just fix 
> this
>  by updating the reserved EFI boot services memory as reserved type in E820.
>
>  Originally any memory descriptors with EFI_MEMORY_RUNTIME attribute are
>  bypassed in the reservation code path because they are assumed as reserved.
>
>  But the reservation is still needed for multiple kexec reboots,
>  and it is the only possible case we come here thus just drop the code
>  chunk, then everything works without side effects.
>
>  On my machine the ESRT memory sits in an EFI runtime data range, it does
>  not trigger the problem, but I successfully tested with BGRT instead.
>  both kexec_load() and kexec_file_load() work and kdump works as well.
>

Acked-by: Ard Biesheuvel 



Re: [PATCH v2 0/2] efi: arm64: Introduce /proc/efi/memreserve to tell the persistent pages

2019-12-04 Thread Ard Biesheuvel
On Tue, 3 Dec 2019 at 20:14, Masayoshi Mizuma  wrote:
>
> From: Masayoshi Mizuma 
>
> kexec reboot sometimes fails in early boot sequence on aarch64 machine.
> That is because kexec overwrites the LPI property tables and pending
> tables with the initrd.
>
> To avoid the overwrite, introduce /proc/efi/memreserve to tell the
> tables region to kexec so that kexec can avoid the memory region to
> locate initrd.
>
> kexec also needs a patch to handle /proc/efi/memreserve. I'm preparing
> the patch for kexec.
>
> Changelog
> v2: - Change memreserve file location from sysfs to procfs.
>   memreserve may exceed the PAGE_SIZE in case efi_memreserve_root
>   has a lot of entries. So we cannot use sysfs_kf_seq_show().
>   Use seq_printf() in procfs instead.
>
> Masayoshi Mizuma (2):
>   efi: add /proc/efi directory
>   efi: arm64: Introduce /proc/efi/memreserve to tell the persistent
> pages
>

Apologies for the tardy response.

Adding /proc/efi is really out of the question. *If* we add any
special files to expose this information, it should be under sysfs.

However, this is still only a partial solution, since it only solves
the problem for userspace based kexec, and we need something for
kexec_file_load() as well.

The fundamental issue here is that /proc/iomem apparently lacks the
entries that describe these regions as 'reserved', so we should try to
address that instead.



Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-03 Thread Ard Biesheuvel
On Mon, 2 Dec 2019 at 09:05, Dave Young  wrote:
>
> Add more cc
> On 12/02/19 at 04:58pm, Dave Young wrote:
> > On 11/29/19 at 04:27pm, Michael Weiser wrote:
> > > Hello Dave,
> > >
> > > On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> > >
> > > > > > Fundamentally when deciding where to place a new kernel kexec 
> > > > > > (either
> > > > > > user space or the in kernel kexec_file implementation) needs to be 
> > > > > > able
> > > > > > > to ask the question which memory areas are reserved.
> > > [...]
> > > > > > So my question is why doesn't the ESRT reservation wind up in
> > > > > > /proc/iomem?
> > > > >
> > > > > My guess is that the focus was that some EFI structures need to be 
> > > > > kept
> > > > > > around across the life cycle of *one* running kernel and
> > > > > memblock_reserve() was enough for that. Marking them so they survive
> > > > > kexecing another kernel might just never have cropped up thus far. Ard
> > > > > or Matt would know.
> > > > Can you check your un-reserved memory, if your memory falls into EFI
> > > > BOOT* then in X86 you can use something like below if it is not covered:
> > >
> > > > void __init efi_esrt_init(void)
> > > > {
> > > > ...
> > > >   pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
> > > >   if (md.type == EFI_BOOT_SERVICES_DATA)
> > > >   efi_mem_reserve(esrt_data, esrt_data_size);
> > > > ...
> > > > }
> > >
> > > Please bear with me if I'm a bit slow on the uptake here: On my machine,
> > > the esrt module reports at boot:
> > >
> > > [0.001244] esrt: Reserving ESRT space from 0x74dd2f98 to 
> > > 0x74dd2fd0.
> > >
> > > This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
> > > code you quote reserve it using memblock_reserve() shown by
> > > memblock=debug:
> > >
> > > [0.001246] memblock_reserve: [0x74dd2f98-0x74dd2fcf] 
> > > efi_mem_reserve+0x1d/0x2b
> > >
> > > It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
> > > which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
> > > as shown by efi=debug:
> > >
> > > [0.178111] efi: mem10: [Boot Data  |   |  |  |  |  |  |  |  | 
> > >   |WB|WT|WC|UC] range=[0x74dd3000-0x75becfff] (14MB)
> > > [0.178113] efi: mem11: [Boot Data  |RUN|  |  |  |  |  |  |  | 
> > >   |WB|WT|WC|UC] range=[0x74dd2000-0x74dd2fff] (0MB)
> > > [0.178114] efi: mem12: [Boot Data  |   |  |  |  |  |  |  |  | 
> > >   |WB|WT|WC|UC] range=[0x6d635000-0x74dd1fff] (119MB)
> > >
> > > This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
> > > from calling __memblock_free_late() on it. And indeed, memblock=debug does
> > > not report this area as being free'd while the surrounding ones are:
> > >
> > > [0.178369] __memblock_free_late: 
> > > [0x74dd3000-0x75becfff] efi_free_boot_services+0x126/0x1f8
> > > [0.178658] __memblock_free_late: 
> > > [0x6d635000-0x74dd1fff] efi_free_boot_services+0x126/0x1f8
> > >
> > > The esrt area does not show up in /proc/iomem though:
> > >
> > > 0010-763f5fff : System RAM
> > >   6200-62a00d80 : Kernel code
> > >   62c0-62f15fff : Kernel rodata
> > >   6300-630ea8bf : Kernel data
> > >   63fed000-641f : Kernel bss
> > >   6500-6aff : Crash kernel
> > >
> > > > And thus kexec loads the new kernel right over that area as shown when
> > > > enabling -DDEBUG on kexec_file.c (0x74dd3000 being in between 0x73000000
> > > > and 0x73000000+0x24be000 = 0x754be000):
> > >
> > > [  650.007695] kexec_file: Loading segment 0: buf=0x3a9c84d6 
> > > bufsz=0x5000 mem=0x98000 memsz=0x6000
> > > [  650.007699] kexec_file: Loading segment 1: buf=0x17b2b9e6 
> > > bufsz=0x1240 mem=0x96000 memsz=0x2000
> > > [  650.007703] kexec_file: Loading segment 2: buf=0xfdf72ba2 
> > > bufsz=0x1150888 mem=0x7300 memsz=0x24be000
> > >
> > > ... because it looks for any memory hole large enough in iomem resources
> > > tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
> > > excluded from on my system.
> > >
> > > Looking some more at efi_arch_mem_reserve() I see that it also registers
> > > the area with efi.memmap and installs it using efi_memmap_install().
> > > which seems to call memremap(MEMREMAP_WB) on it. From my understanding
> > > of the comments in the source of memremap(), MEMREMAP_WB does specifically
> > > *not* reserve that memory in any way.
> > >
> > > > Unfortunately I noticed there are different requirements/ways for
> > > > different types of "reserved" memory.  But that is another topic..
> > >
> > > I tried to reserve the area with something like this:
> > >
> > > > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > > index 4de244683a7e..b86a5df027a2 100644
> > > --- a/arch/x86/platform/efi/quirks.c
> > > +++ b/arch/x86/platform/efi/quirks.c
> > > @@ -24

Re: [PATCH] do not clean dummy variable in kexec path

2019-09-25 Thread Ard Biesheuvel
On Tue, 17 Sep 2019 at 19:52, Matthew Garrett  wrote:
>
> On Fri, Sep 13, 2019 at 2:18 AM Ard Biesheuvel
>  wrote:
>
> > > > - Remove the cleanup from the kexec path -- the cleanup logic from [4],
> > > >   even if justified for the cold boot path, should have never modified
> > > >   the kexec path.
> > >
> > > I agree that there's no benefit in it being called in the kexec path.
> >
> > Can I take that as an ack?
>
> An ack of this hunk.

Given that the patch in question has only one hunk, I'll take this as
an ack of the entire patch, and queue it as a fix.



Re: [PATCH] do not clean dummy variable in kexec path

2019-09-13 Thread Ard Biesheuvel
On Tue, 13 Aug 2019 at 22:14, Matthew Garrett  wrote:
>
> On Tue, Aug 13, 2019 at 4:28 AM Laszlo Ersek  wrote:
> > (I verified yesterday, using the edk2 source code, that there is no
> > varstore reclaim after ExitBootServices(), indeed.)
>
> Some implementations do reclaim at runtime, in which case the
> create/delete dance will permit variable creation.
>
> > (a) Attempting to delete the dummy variable in efi_enter_virtual_mode().
>
> To be clear, the dummy variable should never actually come into
> existence - we explicitly attempt to create a variable that's bigger
> than the available space, so the expectation is that it will always
> fail. However, should it somehow end up being created, there's a race
> between the creation and the deletion and so there's a (small) risk
> that the variable actually ends up there. The cleanup in
> enter_virtual_mode() is just there to ensure that anything that did
> end up being created on a previous boot is deleted - the expectation
> is that it'll be a noop.
>
> > (b) The following part, in efi_query_variable_store():
> >
> > +   /*
> > +* The runtime code may now have triggered a garbage 
> > collection
> > +* run, so check the variable info again
> > +*/
> >
> > Let me start with (b). That code is essentially dead, I would say, based
> > on the information that had already been captured in the commit message
> > of [1]. Reclaim would never happen after ExitBootServices(). (I assume
> > efi_query_variable_store() is only invoked after ExitBootServices(),
> > i.e., from kernel space proper -- sorry if that's a wrong assumption.)
>
> It's dead code on Tiano, but not on at least one vendor implementation.
>
> > Considering (a): what justified the attempt to delete the dummy variable
> > in efi_enter_virtual_mode(), in commit [4]? Was that meant as a
> > fail-safe just so we don't leave a dummy variable lying around?
>
> Yes.
>
> > So even if we consider the "clean DUMMY object" hunk from [4] a
> > justified fail-safe for the normal boot path, it doesn't apply to the
> > kexec path -- the cold-booted primary kernel will have gone through
> > those motions already, will it not?
> >
> > Therefore, we should do two things:
> >
> > - Remove the cleanup from the kexec path -- the cleanup logic from [4],
> >   even if justified for the cold boot path, should have never modified
> >   the kexec path.
>
> I agree that there's no benefit in it being called in the kexec path.

Can I take that as an ack?



Re: [PATCH] do not clean dummy variable in kexec path

2019-08-05 Thread Ard Biesheuvel
On Mon, 5 Aug 2019 at 11:36, Dave Young  wrote:
>
> kexec reboot fails randomly in UEFI based kvm guest.  The firmware
> just reset while calling efi_delete_dummy_variable();  Unfortunately
> I don't know how to debug the firmware, it is also possible a potential
> problem on real hardware as well although nobody reproduced it.
>
> The intention of efi_delete_dummy_variable is to trigger garbage collection
> when entering virtual mode.  But SetVirtualAddressMap can only run once
> for each physical reboot, thus kexec_enter_virtual_mode is not necessarily
> a good place to clean dummy object.
>

I would argue that this means it is not a good place to *create* the
dummy variable, and if we don't create it, we don't have to delete it
either.

> Drop efi_delete_dummy_variable so that kexec reboot can work.
>

Creating it and not deleting it is bad, so please try and see if we
can omit the creation on this code path instead.


> Signed-off-by: Dave Young 
> ---
>  arch/x86/platform/efi/efi.c |3 ---
>  1 file changed, 3 deletions(-)
>
> --- linux-x86.orig/arch/x86/platform/efi/efi.c
> +++ linux-x86/arch/x86/platform/efi/efi.c
> @@ -894,9 +894,6 @@ static void __init kexec_enter_virtual_m
>
> if (efi_enabled(EFI_OLD_MEMMAP) && (__supported_pte_mask & _PAGE_NX))
> runtime_code_page_mkexec();
> -
> -   /* clean DUMMY object */
> -   efi_delete_dummy_variable();
>  #endif
>  }
>


Re: [PATCH v2] arm64: invalidate TLB just before turning MMU on

2018-12-13 Thread Ard Biesheuvel
On Fri, 14 Dec 2018 at 05:08, Qian Cai  wrote:
> Also tried to move the local TLB flush part around a bit inside
> __cpu_setup(), although it did complete kdump some times, it did trigger
> "Synchronous Exception" in EFI after a cold-reboot fairly often that
> seems no way to recover remotely without reinstalling the OS.

This doesn't make any sense to me. If the system gets into a weird
state out of cold reboot, how could this code be the culprit? Please
check your firmware, and try to reproduce the issue on a system that
doesn't have such defects.



Re: [Bug Report] kdump crashes after latest EFI memblock changes on arm64 machines with large number of CPUs

2018-11-06 Thread Ard Biesheuvel
On 6 November 2018 at 02:30, Will Deacon  wrote:
> On Fri, Nov 02, 2018 at 02:44:10AM +0530, Bhupesh Sharma wrote:
>> With the latest EFI changes for memblock reservation across kdump
>> kernel from Ard (Commit 71e0940d52e107748b270213a01d3b1546657d74
>> ["efi: honour memory reservations passed via a linux specific config
>> table"]), we hit a panic while trying to boot the kdump kernel on
>> machines which have large number of CPUs.
>>
>> I have a arm64 board which has 224 CPUS:
>> # lscpu
>> <..snip..>
>> CPU(s):  224
>> On-line CPU(s) list: 0-223
>> <..snip..>
>>
>> Here are the crash logs in the kdump kernel on this machine:
>>
>> [0.00] Unable to handle kernel paging request at virtual
>> address 80003ffe
>> val)nt EL), IL ata abort info:
>> [0.or: Oops: 96inted 4.18.0+ #3
>> [0.00] pstate: 20400089 (nzCv daIf +PAN -UAO)
>> [0.00] pc : __memcpy+0x110/0x180
>> [0.00] lr : memblock_double_array+0x240/0x348
>> [0.00] sp : 092efc80 x28: bffe
>> [0.00] x27: 1800 x26: 09d59000
>> [0.00] x25: 80003ffe x24: 
>> [0.00] x23: 0001 x22: 09d594e8
>> [0.00] x21: 09d594f4 x20: 093c7268
>> [0.00] x19: 0c00 x18: 0010
>> [0.00] x17:  x16: 
>> [0.00] x15: 3: 000fc18d x12: 0008
>> [0.00] x11: 0018 x10: ddab9e18
>> [0.00] x9 : 0008 x8 : 02c1
>> [0.00] x7 : 91b9 x6 : 80003ffe
>> [0.00] x5 : 0001 x4 : 
>> [0.00] x3 :  x2 : 0b80
>> [0.00] x1 : 09d59540 x0 : 80003ffe
>> [0.00] Process swapper)
>> [0.00] Call trace:
>> [0.00]  __memcpy+0x110/0x180
>> [0.00]  memblock_add_range+0x134/0x2e8
>> [0.00]  memblock_reserve+0x70/0xb8
>> [0.00]  memblock_alloc_base_nid+0x6c/0x88
>> [0.00]  __memblock_alloc_base+0x3c/0x4c
>> [0.00]  memblock_alloc_base+0x28/0x4c
>> [0.00]  memblock_alloc+0x2c/0x38
>> [0.00]  early_pgtable_alloc+0x20/0xb0
>
> Hmm, so this seems to be the crux of the issue: early_pgtable_alloc() relies
> on memblock to allocate page-table memory, but this can be called before the
> linear mapping is up and running (or even as part of creating the linear
> mapping itself!) so the use of __va in memblock_double_array() actually
> returns an unmapped address.
>

OK, so this means we are calling memblock_allow_resize() too early in any case.

> So I guess we either need to implement early_pgtable_alloc() some other way
> (how?) or get memblock_double_array() to use a fixmap if it's called too
> early (yuck). Alternatively, would it be possible to postpone processing of
> the EFI mem_reserve entries until after we've created the linear mapping?
>

We could move this until after paging_init(), I suppose. I'll cook something up.

Bhupesh: any comments?
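
As an illustration of the ordering problem discussed above, here is a minimal user-space sketch. It is not kernel code: every `toy_` name is hypothetical, and `realloc()` stands in for the array doubling that the real memblock code does through the linear mapping. The point it shows is only that growth of the region array has to be gated on a flag that is set once resizing is actually safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for one reserved region (hypothetical, not the kernel type). */
struct toy_region {
	unsigned long base;
	unsigned long size;
};

/* Toy stand-in for the region array plus the "resize allowed" gate. */
struct toy_memblock {
	struct toy_region *regions;
	size_t cnt;
	size_t max;
	bool can_resize;
};

/* Start with a fixed-size array; resizing is not allowed yet. */
static int toy_init(struct toy_memblock *mb, size_t initial_max)
{
	assert(initial_max > 0);
	mb->regions = calloc(initial_max, sizeof(*mb->regions));
	if (!mb->regions)
		return -1;
	mb->cnt = 0;
	mb->max = initial_max;
	mb->can_resize = false;
	return 0;
}

/* Call this only once growing the array is safe (in the thread's terms:
 * once the memory backing the new array is reachable). */
static void toy_allow_resize(struct toy_memblock *mb)
{
	mb->can_resize = true;
}

/* Append a region; double the array only if resizing has been allowed. */
static int toy_reserve(struct toy_memblock *mb,
		       unsigned long base, unsigned long size)
{
	if (mb->cnt == mb->max) {
		struct toy_region *bigger;

		if (!mb->can_resize)
			return -1;	/* too early to grow: refuse */
		bigger = realloc(mb->regions, 2 * mb->max * sizeof(*bigger));
		if (!bigger)
			return -1;
		mb->regions = bigger;
		mb->max *= 2;
	}
	assert(mb->cnt < mb->max);
	mb->regions[mb->cnt].base = base;
	mb->regions[mb->cnt].size = size;
	mb->cnt++;
	return 0;
}
```

In this model, a burst of early reservations that overflows the initial array fails cleanly until `toy_allow_resize()` has been called, which is the user-space analogue of postponing the extra reservations until after the linear mapping exists.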



Re: [Bug Report] kdump crashes after latest EFI memblock changes on arm64 machines with large number of CPUs

2018-11-05 Thread Ard Biesheuvel
(+ Marc)

On 1 November 2018 at 22:14, Bhupesh Sharma  wrote:
> Hi,
>
> With the latest EFI changes for memblock reservation across kdump
> kernel from Ard (Commit 71e0940d52e107748b270213a01d3b1546657d74
> ["efi: honour memory reservations passed via a linux specific config
> table"]), we hit a panic while trying to boot the kdump kernel on
> machines which have large number of CPUs.
>

Just for my understanding: why do you boot all 224 CPUs when running
the crash kernel?

I'm not saying we shouldn't fix the underlying issue, I'm just curious.

> I have a arm64 board which has 224 CPUS:
> # lscpu
> <..snip..>
> CPU(s):  224
> On-line CPU(s) list: 0-223
> <..snip..>
>
> Here are the crash logs in the kdump kernel on this machine:
>
> [0.00] Unable to handle kernel paging request at virtual
> address 80003ffe
> val)nt EL), IL ata abort info:
> [0.or: Oops: 96inted 4.18.0+ #3
> [0.00] pstate: 20400089 (nzCv daIf +PAN -UAO)
> [0.00] pc : __memcpy+0x110/0x180
> [0.00] lr : memblock_double_array+0x240/0x348
> [0.00] sp : 092efc80 x28: bffe
> [0.00] x27: 1800 x26: 09d59000
> [0.00] x25: 80003ffe x24: 
> [0.00] x23: 0001 x22: 09d594e8
> [0.00] x21: 09d594f4 x20: 093c7268
> [0.00] x19: 0c00 x18: 0010
> [0.00] x17:  x16: 
> [0.00] x15: 3: 000fc18d x12: 0008
> [0.00] x11: 0018 x10: ddab9e18
> [0.00] x9 : 0008 x8 : 02c1
> [0.00] x7 : 91b9 x6 : 80003ffe
> [0.00] x5 : 0001 x4 : 
> [0.00] x3 :  x2 : 0b80
> [0.00] x1 : 09d59540 x0 : 80003ffe
> [0.00] Process swapper)
> [0.00] Call trace:
> [0.00]  __memcpy+0x110/0x180
> [0.00]  memblock_add_range+0x134/0x2e8
> [0.00]  memblock_reserve+0x70/0xb8
> [0.00]  memblock_alloc_base_nid+0x6c/0x88
> [0.00]  __memblock_alloc_base+0x3c/0x4c
> [0.00]  memblock_alloc_base+0x28/0x4c
> [0.00]  memblock_alloc+0x2c/0x38
> [0.00]  early_pgtable_alloc+0x20/0xb0
> [0.00]  paging_init+0x28/0x7f8
> [   0.00]  start_kernel+0x78/0x4cc
> [0.00] Code: a8c12027 a8c12829 a8c1302b a8c1382d (a88120c7)
> [0.00] random: get_random_bytes called from
> print_oops_end_marker+0x30/0x58 with crng_init=0
> [0.00] ---[ end trace  ]---
> [0.00] Kernel panic - not syncing: Fatal exception
> [0.00] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> Adding more debug logs via 'memblock=debug' being passed to the kdump
> kernel, (and adding a few more prints to 'mm/memblock.c'), I can see
> that the panic happens while trying to resize array inside
> 'memblock_double_array' (which doubles the size of the memblock
> regions array):
>
> [0.00] Reserving 13KB of memory at 0xbfff for elfcorehdr
> [0.00] memblock_reserve:
> [0xbfff-0xbfff]
> memblock_alloc_base_nid+0x6c/0x88
> [0.00] memblock: use_slab is 0, new_area_start=bfff,
> new_area_size=1
> [0.00] memblock: use_slab is 0, addr=0, new_area_size=1
> [0.00] memblock: addr=bffe, __va(addr)=80003ffe
> [0.0 [0xbffe-0xbffe17ff]
> [0.00] Unable to handle kernel paging request at virtual
> address 80003ffe
>
> which indicates that after Ard's patch the memblocks being reserved
> across kdump swell up on systems which have large number of CPUs and
> hence 'memblock_double_array' is called up in early kdump boot code to
> double the size of the memblock regions array.
>
> To confirm the above, I reduced the number of SMP CPUs available to
> the kernel on this system, by specifying 'nr_cpus=46' in the kernel
> bootargs for the primary kernel. As expected this makes the kdump
> kernel boot successfully and also save the crash dump properly.
>
> I saw another arm64 kdump user report this issue to me privately, so I
> am sending this to a wider audience, so that kdump users are aware
> that this is a known issue.
>
> I am working on a RFC patch which seems to fix the issue on my board
> and will try to send it out for wider review in coming days after some
> more checks at my end.
>
> Any advices on the same are also welcome :)
>
> Thanks,
> Bhupesh



Re: [PATCH] x86, kdump: Fix efi=noruntime NULL pointer dereference

2018-08-21 Thread Ard Biesheuvel
On 9 August 2018 at 11:13, Dave Young  wrote:
> On 08/09/18 at 09:33am, Mike Galbraith wrote:
>> On Thu, 2018-08-09 at 12:21 +0800, Dave Young wrote:
>> > Hi Mike,
>> >
>> > Thanks for the patch!
>> > On 08/08/18 at 04:03pm, Mike Galbraith wrote:
>> > > When booting with efi=noruntime, we call efi_runtime_map_copy() while
>> > > loading the kdump kernel, and trip over a NULL efi.memmap.map.  Avoid
>> > > that and a useless allocation when the only mapping we can use (1:1)
>> > > is not available.
>> >
>> > At first glance, efi_get_runtime_map_size should return 0 in case
>> > noruntime.
>>
>> What efi does internally at unmap time is to leave everything except
>> efi.mmap.map untouched, setting it to NULL and turning off EFI_MEMMAP,
>> rendering efi.mmap.map accessors useless/unsafe without first checking
>> EFI_MEMMAP.
>
> Probably the x86 efi_init should reset nr_map to zero in case runtime is
> disabled.  But let's see how Ard thinks about this and cc linux-efi.
>
> As for efi_get_runtime_map_size, it was introduced for x86 kexec use.
> for copying runtime maps,  so I think it is reasonable this function
> return zero in case no runtime.
>

I don't see the patch in the context so I cannot comment in great detail.

In any case, it is better to decouple EFI_MEMMAP from EFI_RUNTIME
dependencies. On x86, one may imply the other, but this is not
generally the case.

That means that efi_get_runtime_map_size() should probably check the
EFI_RUNTIME flag, and return 0 if it is cleared. Perhaps there are
other places where EFI_MEMMAP flag checks are missing, but I consider
that a separate issue.
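
A minimal sketch of the suggested check, written as a stand-alone user-space model rather than as a kernel patch. The `toy_` names, the flag bit positions, and the size calculation (entry count times descriptor size) are assumptions made for illustration; only the idea of returning 0 when the runtime flag is cleared comes from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the real kernel uses its own bit numbers. */
#define TOY_EFI_BOOT	(1u << 0)
#define TOY_EFI_MEMMAP	(1u << 1)
#define TOY_EFI_RUNTIME	(1u << 2)

/* Toy stand-in for the global EFI state. */
struct toy_efi {
	unsigned int flags;
	size_t nr_map;		/* number of memory map entries */
	size_t desc_size;	/* size of one descriptor in bytes */
};

static int toy_efi_enabled(const struct toy_efi *efi, unsigned int feature)
{
	return (efi->flags & feature) != 0;
}

/* Report 0 bytes when runtime services are disabled, so that callers
 * never try to copy a runtime map that cannot be used. */
static size_t toy_runtime_map_size(const struct toy_efi *efi)
{
	assert(efi != NULL);
	if (!toy_efi_enabled(efi, TOY_EFI_RUNTIME))
		return 0;
	return efi->nr_map * efi->desc_size;
}
```

Note that the check is on the runtime flag alone, independent of the memmap flag, which matches the point that the two should not be coupled.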



Re: [PATCH v12 16/16] arm64: kexec_file: add kaslr support

2018-07-27 Thread Ard Biesheuvel
On 27 July 2018 at 11:22, James Morse  wrote:
> Hi Akashi,
>
>
> On 07/27/2018 09:31 AM, AKASHI Takahiro wrote:
>
> On Thu, Jul 26, 2018 at 02:40:49PM +0100, James Morse wrote:
>
> On 24/07/18 07:57, AKASHI Takahiro wrote:
>
> Adding "kaslr-seed" to dtb enables triggering kaslr, or kernel virtual
> address randomization, at secondary kernel boot.
>
> Hmm, there are three things that get moved by CONFIG_RANDOMIZE_BASE. The
> kernel
> physical placement when booted via the EFIstub, the kernel-text VAs and the
> location of memory in the linear-map region. Adding the kaslr-seed only does
> the
> last two.
>
> Yes, but I think that I and Mark has agreed that "kaslr" meant
> "virtual" randomisation, not including "physical" randomisation.
>
> Okay, I'll update my terminology!
>
>
> This means the physical placement of the new kernel is predictable from
> /proc/iomem ... but this also tells you the physical placement of the
> current
> kernel, so I don't think this is a problem.
>
>
> We always do this as it will have no harm on kaslr-incapable kernel.
>
> We don't have any "switch" to turn off this feature directly, but still
> can suppress it by passing "nokaslr" as a kernel boot argument.
>
> diff --git a/arch/arm64/kernel/machine_kexec_file.c
> b/arch/arm64/kernel/machine_kexec_file.c
> index 7356da5a53d5..47a4fbd0dc34 100644
> --- a/arch/arm64/kernel/machine_kexec_file.c
> +++ b/arch/arm64/kernel/machine_kexec_file.c
> @@ -158,6 +160,12 @@ static int setup_dtb(struct kimage *image,
>
> Don't you need to reserve some space in the area you vmalloc()d for the DT?
>
> No, I don't think so.
> All the data to be loaded are temporarily saved in kexec buffers,
> which will eventually be copied to target locations in machine_kexec
> (arm64_relocate_new_kernel, which, unlike its name, will handle
> not only kernel but also other data as well).
>
>
> I think we're speaking at cross purposes. Don't you need:
>
> | buf_size += fdt_prop_len("kaslr-seed", sizeof(u64));
>
>
> You can't assume the existing DTB had a kaslr-seed property, and the
> difference may take us over a PAGE_SIZE boundary.
>
>
>
>
> + /* add kaslr-seed */
> + get_random_bytes(&value, sizeof(value));
>
> What happens if the crng isn't ready?
>
> It looks like this will print a warning that these random-bytes aren't
> really up
> to standard, but the new kernel doesn't know this happened.
>
> crng_ready() isn't exposed, all we could do now is
> wait_for_random_bytes(), but that may wait forever because we do this
> unconditionally.
>
> I'd prefer to leave this feature until we can check crng_ready(), and skip
> adding a dodgy-seed if its not-ready. This avoids polluting the
> next-kernel's
> entropy pool.
>
> OK. I would try to follow the same way as Bhupesh's userspace patch
> does for kaslr-seed:
> http://lists.infradead.org/pipermail/kexec/2018-April/020564.html
>
>
> (I really don't understand this 'copying code from user-space' that happens
> with kexec_file_load)
>
>
>   if (not found kaslr-seed in 1st kernel's dtb)
>  don't care; go ahead
>
>
> Don't bother. As you say in the commit-message its harmless if the new
> kernel doesn't support it.
> Always having this would let you use kexec_file_load as a bootloader that
> can get the crng to
> provide decent entropy even if the platform bootloader can't.
>
>
>   else
>  if (current kaslr-seed != 0)
> error
>
>
> Don't bother. If this happens its a bug in another part of the kernel that
> doesn't affect this one. We aren't second-guessing the file-system when we
> read the kernel-fd, lets keep this simple.
>
>  if (crng_ready()) ; FIXME, it's a local macro
> get_random_bytes(non-blocking)
> set new kaslr-seed
>  else
> error
>
> error? Something like pr_warn_once().
>
> I thought the kaslr-seed was added to the entropy pool, but now I look again
> I see its a separate EFI table. So the new kernel will add the same entropy
> ... that doesn't sound clever. (I can't see where its zero'd or
> re-initialised)
>

We do have a hook for that: grep for update_efi_random_seed()
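
The seed handling proposed in the quoted pseudo-code can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: the `toy_` names are hypothetical, the "DTB" is a two-field struct rather than a real flattened device tree, and the RNG callbacks are fixed stand-ins for a "ready" and a "not ready" entropy source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the part of the DTB that carries the seed. */
struct toy_dtb {
	bool has_seed;
	uint64_t seed;
};

/* An RNG callback returns false when it cannot provide good entropy yet. */
typedef bool (*toy_rng_fn)(uint64_t *out);

/* Stand-in for a ready RNG; the constant is a placeholder, not entropy. */
static bool toy_rng_ready(uint64_t *out)
{
	*out = 0x1234abcdULL;
	return true;
}

/* Stand-in for an RNG that is not initialised yet. */
static bool toy_rng_not_ready(uint64_t *out)
{
	(void)out;
	return false;
}

/* Set a fresh seed only if the RNG is ready; otherwise leave the DTB
 * without a seed instead of handing a weak one to the next kernel. */
static bool toy_set_kaslr_seed(struct toy_dtb *dtb, toy_rng_fn rng)
{
	uint64_t value;

	assert(dtb != NULL && rng != NULL);
	if (!rng(&value)) {
		dtb->has_seed = false;
		dtb->seed = 0;
		return false;	/* caller may warn once and carry on */
	}
	dtb->has_seed = true;
	dtb->seed = value;
	return true;
}
```

Returning false rather than failing the whole load reflects the suggestion in the thread that a missing seed should produce a warning, not an error.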



Re: [PATCH v3.1 0/4] arm64: kexec,kdump: fix boot failures on acpi-only system

2018-07-12 Thread Ard Biesheuvel
On 13 July 2018 at 02:34, AKASHI Takahiro  wrote:
> On Thu, Jul 12, 2018 at 05:49:19PM +0100, Will Deacon wrote:
>> Hi Akashi,
>>
>> On Tue, Jul 10, 2018 at 08:42:25AM +0900, AKASHI Takahiro wrote:
>> > This patch series is a set of bug fixes to address kexec/kdump
>> > failures which are sometimes observed on ACPI-only system and reported
>> > in LAK-ML before.
>>
>> I tried picking this up, along with Ard's fixup, but I'm seeing a build
>> failure for allmodconfig:
>>
>> arch/arm64/kernel/acpi.o: In function `__acpi_get_mem_attribute':
>> acpi.c:(.text+0x60): undefined reference to `efi_mem_attributes'
>>
>> I didn't investigate further. Please can you fix this?
>
> Because CONFIG_ACPI is on and CONFIG_EFI is off.
>
> This can happen in allmodconfig as CONFIG_EFI depends on
> !CONFIG_CPU_BIG_ENDIAN, which is actually on in this case.
>

Allowing both CONFIG_ACPI and CONFIG_CPU_BIG_ENDIAN to be configured
makes no sense at all. Things will surely break if you start using BE
memory accesses while parsing ACPI tables.

Allowing CONFIG_ACPI without CONFIG_EFI makes no sense either, since
on arm64, the only way to find the ACPI tables is through a UEFI
configuration table.

> Looking at __acpi_get_mem_attributes(), since there is no information
> available on memory attributes, what we can do at best is
>   * return PAGE_KERNEL (= cacheable) for mapped memory,
>   * return DEVICE_nGnRnE (= non-cacheable) otherwise
> (See a hunk to be applied on top of my patch#4.)
>
> I think that, after applying, acpi_os_ioremap() would work almost
> in the same way as the original before my patchset given that
> MAP memblock attribute is used only under CONFIG_EFI for now.
>
> Make sense?
>

Let's keep your code as is but fix the Kconfig dependencies instead.



Re: [PATCH v3.1 2/4] efi/arm: preserve early mapping of UEFI memory map longer for BGRT

2018-07-12 Thread Ard Biesheuvel
On 12 July 2018 at 15:32, Will Deacon  wrote:
> On Tue, Jul 10, 2018 at 08:39:16PM +0200, Ard Biesheuvel wrote:
>> On 10 July 2018 at 19:57, James Morse  wrote:
>> > Hi Ard,
>> >
>> > On 10/07/18 00:42, AKASHI Takahiro wrote:
>> >> From: Ard Biesheuvel 
>> >>
>> >> The BGRT code validates the contents of the table against the UEFI
>> >> memory map, and so it expects it to be mapped when the code runs.
>> >>
>> >> On ARM, this is currently not the case, since we tear down the early
>> >> mapping after efi_init() completes, and only create the permanent
>> >> mapping in arm_enable_runtime_services(), which executes as an early
>> >> initcall, but still leaves a window where the UEFI memory map is not
>> >> mapped.
>> >>
>> >> So move the call to efi_memmap_unmap() from efi_init() to
>> >> arm_enable_runtime_services().
>> >
>> > I don't have a machine that generates a BGRT, but I can see that 
>> > efi_mem_type()
>> > call in efi_bgrt_init() would cause the same problems we have with kexec 
>> > and acpi.
>> >
>>
>> I'm not sure I follow. The BGRT table only contains natively aligned
>> fields, so the alignment faults should not occur when accessing this
>> table after kexec. The issue addressed by this patch is that
>> efi_mem_type() bails when called while EFI_MEMMAP is cleared.
>>
>> >
>> >> diff --git a/drivers/firmware/efi/arm-init.c 
>> >> b/drivers/firmware/efi/arm-init.c
>> >> index b5214c143fee..388a929baf95 100644
>> >> --- a/drivers/firmware/efi/arm-init.c
>> >> +++ b/drivers/firmware/efi/arm-init.c
>> >> @@ -259,7 +259,6 @@ void __init efi_init(void)
>> >>
>> >>   reserve_regions();
>> >>   efi_esrt_init();
>> >> - efi_memmap_unmap();
>> >>
>> >>   memblock_reserve(params.mmap & PAGE_MASK,
>> >>PAGE_ALIGN(params.mmap_size +
>> >> diff --git a/drivers/firmware/efi/arm-runtime.c 
>> >> b/drivers/firmware/efi/arm-runtime.c
>> >> index 5889cbea60b8..59a8c0ec94d5 100644
>> >> --- a/drivers/firmware/efi/arm-runtime.c
>> >> +++ b/drivers/firmware/efi/arm-runtime.c
>> >> @@ -115,6 +115,8 @@ static int __init arm_enable_runtime_services(void)
>> >>   return 0;
>> >>   }
>> >>
>> >> + efi_memmap_unmap();
>> >
>> > This can get called twice if uefi_init() fails after setting the EFI_BOOT 
>> > flag,
>> > but this can only happen if the system table signature is wrong, (or we're 
>> > out
>> > of memory really early).
>> >
>>
>> I guess we should check the EFI_MEMMAP attribute here as well then.
>
> Do you plan to spin a new version of this patch?
>

Either that or fold in the hunk below.


--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/efi/arm-runtime.c
@@ -110,7 +110,7 @@ static int __init arm_enable_runtime_services(void)
 {
u64 mapsize;

-   if (!efi_enabled(EFI_BOOT)) {
+   if (!efi_enabled(EFI_BOOT) || !efi_enabled(EFI_MEMMAP)) {
pr_info("EFI services will not be available.\n");
return 0;
}
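
The effect of the extra EFI_MEMMAP check in the hunk above can be seen in a small user-space model. Everything here is a sketch: the `toy_` names, the flag bit positions and the return value of 1 are assumptions for illustration, and the only behaviour taken from the thread is that unmapping clears the memmap flag and that the guard then prevents a second unmap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the real kernel uses its own bit numbers. */
#define TOY_EFI_BOOT	(1u << 0)
#define TOY_EFI_MEMMAP	(1u << 1)

/* Toy stand-in for the EFI state touched by the unmap path. */
struct toy_efi_state {
	unsigned int flags;
	int mapped;		/* 1 while the early mapping exists */
	int unmap_calls;	/* counts how often unmap really ran */
};

/* Tear down the early mapping and clear the memmap flag. */
static void toy_memmap_unmap(struct toy_efi_state *s)
{
	s->unmap_calls++;
	s->mapped = 0;
	s->flags &= ~TOY_EFI_MEMMAP;
}

/* Returns 1 if it proceeded past the guard, 0 if it bailed out early.
 * (In this toy the return value only signals which path was taken.) */
static int toy_enable_runtime_services(struct toy_efi_state *s)
{
	assert(s != NULL);
	if (!(s->flags & TOY_EFI_BOOT) || !(s->flags & TOY_EFI_MEMMAP))
		return 0;	/* nothing mapped: do not unmap again */
	toy_memmap_unmap(s);
	return 1;
}
```

Calling `toy_enable_runtime_services()` twice on the same state unmaps only once, because the first call clears the flag that the guard tests.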



Re: [PATCH v5 7/8] ima: based on policy warn about loading firmware (pre-allocated buffer)

2018-07-10 Thread Ard Biesheuvel
On 10 July 2018 at 21:19, Bjorn Andersson  wrote:
> On Mon 09 Jul 23:56 PDT 2018, Ard Biesheuvel wrote:
>
>> On 10 July 2018 at 08:51, Ard Biesheuvel  wrote:
>> > On 9 July 2018 at 21:41, Mimi Zohar  wrote:
>> >> On Mon, 2018-07-02 at 17:30 +0200, Ard Biesheuvel wrote:
>> >>> On 2 July 2018 at 16:38, Mimi Zohar  wrote:
> [..]
>> > So to summarize again: in my opinion, using a single buffer is not a
>> > problem as long as the validation completes before the DMA map is
>> > performed. This will provide the expected guarantees on systems with
>> > IOMMUs, and will not complicate matters on systems where there is no
>> > point in obsessing about this anyway given that devices can access all
>> > of memory whenever they want to.
>> >
>> > As for the Qualcomm case: dma_alloc_coherent() is not needed here but
>> > simply ends up being used because it was already wired up in the
>> > qualcomm specific secure world API, which amounts to doing syscalls
>> > into a higher privilege level than the one the kernel itself runs at.
>
> As I said before, the dma_alloc_coherent() referred to in this
> discussion holds parameters for the Trustzone call, i.e. it will hold
> the address to the buffer that the firmware was loaded into - it won't
> hold any data that comes from the actual firmware.
>

Ah yes, I forgot that detail. Thanks for reminding me.

>> > So again, reasoning about whether the secure world will look at your
>> > data before you checked the sig is rather pointless, and adding
>> > special cases to the IMA api to cater for this use case seems like a
>> > waste of engineering and review effort to me.
>
> Forgive me if I'm missing something in the implementation here, but
> aren't the IMA checks done before request_firmware*() returns?
>

The issue under discussion is whether calling request_firmware() to
load firmware into a buffer that may be readable by the device while
the IMA checks are in progress constitutes a security hazard.

>> > If we have to do
>> > something to tie up this loose end, let's try switching it to the
>> > streaming DMA api instead.
>> >
>>
>> Forgot to mention: the Qualcomm case is about passing data to the CPU
>> running at another privilege level, so IOMMU vs !IOMMU is not a factor
>> here.
>
> Furthermore, all scenarios we've looked at so far are completely
> sequential, so if the firmware request fails we won't invoke the
> Trustzone operation that would consume the memory or we won't turn on
> the power to the CPU that would execute the firmware.
>
>
> Tbh the only case I can think of where there would be a "race condition"
> here is if we have a device that is polling the last byte of a
> predefined firmware memory area for the firmware loader to read some
> specific data into it. In cases where the firmware request is followed
> by some explicit signalling to the device (or a power on sequence) I'm
> unable to see the issue discussed here.
>

I agree. But the latter part is platform specific, and so it requires
some degree of trust in the driver author on the part of the IMA
routines that request_firmware() is called at an appropriate time.

The point I am trying to make in this thread is that there are cases
where it makes no sense for the kernel to reason about these things,
given that higher privilege levels such as the TrustZone secure world
own the kernel's execution context entirely already, and given that
masters that are not behind an IOMMU can read and write all of memory
all of the time anyway.

The bottom line is that reality does not respect the layering that IMA
assumes, and so the only meaningful way to treat some of the use cases
is simply to ignore them entirely. So we should still perform all the
checks, but we will have to live with the limited utility of doing so
in some scenarios (and not print nasty warnings to the kernel log for
such cases)
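
For the IOMMU case, the ordering argued for earlier in the thread (validate first, and only then make the buffer known to the device) can be sketched as follows. This is a user-space toy under loud assumptions: the `toy_` names are hypothetical, a byte-sum checksum stands in for IMA's signature verification, and setting `published` stands in for handing the DMA address to the device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a firmware buffer and whether the device knows it. */
struct toy_fw {
	const unsigned char *data;
	size_t len;
	bool published;	/* true once the "device" has the address */
};

/* Placeholder for real signature verification: a plain byte sum. */
static uint32_t toy_checksum(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/* Verify the image first; publish the buffer only if the check passes. */
static bool toy_load_firmware(struct toy_fw *fw, const unsigned char *buf,
			      size_t len, uint32_t expected)
{
	assert(fw != NULL && buf != NULL);
	fw->data = NULL;
	fw->len = 0;
	fw->published = false;

	if (toy_checksum(buf, len) != expected)
		return false;	/* device never learns the address */

	fw->data = buf;
	fw->len = len;
	fw->published = true;
	return true;
}
```

As the thread points out, this ordering only means something where the device cannot reach memory on its own; without an IOMMU the `published` step changes nothing about what a bus master can read.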

-- 
Ard.



Re: [PATCH v3.1 2/4] efi/arm: preserve early mapping of UEFI memory map longer for BGRT

2018-07-10 Thread Ard Biesheuvel
On 10 July 2018 at 19:57, James Morse  wrote:
> Hi Ard,
>
> On 10/07/18 00:42, AKASHI Takahiro wrote:
>> From: Ard Biesheuvel 
>>
>> The BGRT code validates the contents of the table against the UEFI
>> memory map, and so it expects it to be mapped when the code runs.
>>
>> On ARM, this is currently not the case, since we tear down the early
>> mapping after efi_init() completes, and only create the permanent
>> mapping in arm_enable_runtime_services(), which executes as an early
>> initcall, but still leaves a window where the UEFI memory map is not
>> mapped.
>>
>> So move the call to efi_memmap_unmap() from efi_init() to
>> arm_enable_runtime_services().
>
> I don't have a machine that generates a BGRT, but I can see that 
> efi_mem_type()
> call in efi_bgrt_init() would cause the same problems we have with kexec and 
> acpi.
>

I'm not sure I follow. The BGRT table only contains natively aligned
fields, so the alignment faults should not occur when accessing this
table after kexec. The issue addressed by this patch is that
efi_mem_type() bails when called while EFI_MEMMAP is cleared.

>
>> diff --git a/drivers/firmware/efi/arm-init.c 
>> b/drivers/firmware/efi/arm-init.c
>> index b5214c143fee..388a929baf95 100644
>> --- a/drivers/firmware/efi/arm-init.c
>> +++ b/drivers/firmware/efi/arm-init.c
>> @@ -259,7 +259,6 @@ void __init efi_init(void)
>>
>>   reserve_regions();
>>   efi_esrt_init();
>> - efi_memmap_unmap();
>>
>>   memblock_reserve(params.mmap & PAGE_MASK,
>>PAGE_ALIGN(params.mmap_size +
>> diff --git a/drivers/firmware/efi/arm-runtime.c 
>> b/drivers/firmware/efi/arm-runtime.c
>> index 5889cbea60b8..59a8c0ec94d5 100644
>> --- a/drivers/firmware/efi/arm-runtime.c
>> +++ b/drivers/firmware/efi/arm-runtime.c
>> @@ -115,6 +115,8 @@ static int __init arm_enable_runtime_services(void)
>>   return 0;
>>   }
>>
>> + efi_memmap_unmap();
>
> This can get called twice if uefi_init() fails after setting the EFI_BOOT 
> flag,
> but this can only happen if the system table signature is wrong, (or we're out
> of memory really early).
>

I guess we should check the EFI_MEMMAP attribute here as well then.

> I think this is harmless as we end up passing NULL to early_memunmap() which
> WARN()s and returns as its outside the fixmap range. Its just more noise on
> systems with a corrupt efi system table.
>
> Acked-by: James Morse 
>

Thanks James



Re: [PATCH v5 7/8] ima: based on policy warn about loading firmware (pre-allocated buffer)

2018-07-09 Thread Ard Biesheuvel
On 10 July 2018 at 08:51, Ard Biesheuvel  wrote:
> On 9 July 2018 at 21:41, Mimi Zohar  wrote:
>> On Mon, 2018-07-02 at 17:30 +0200, Ard Biesheuvel wrote:
>>> On 2 July 2018 at 16:38, Mimi Zohar  wrote:
>>> > Some systems are memory constrained but they need to load very large
>>> > firmwares.  The firmware subsystem allows drivers to request this
>>> > firmware be loaded from the filesystem, but this requires that the
>>> > entire firmware be loaded into kernel memory first before it's provided
>>> > to the driver.  This can lead to a situation where we map the firmware
>>> > twice, once to load the firmware into kernel memory and once to copy the
>>> > firmware into the final resting place.
>>> >
>>> > To resolve this problem, commit a098ecd2fa7d ("firmware: support loading
>>> > into a pre-allocated buffer") introduced request_firmware_into_buf() API
>>> > that allows drivers to request firmware be loaded directly into a
>>> > pre-allocated buffer. (Based on the mailing list discussions, calling
>>> > dma_alloc_coherent() is unnecessary and confusing.)
>>> >
>>> > (Very broken/buggy) devices using pre-allocated memory run the risk of
>>> > the firmware being accessible to the device prior to the completion of
>>> > IMA's signature verification.  For the time being, this patch emits a
>>> > warning, but does not prevent the loading of the firmware.
>>> >
>>>
>>> As I attempted to explain in the exchange with Luis, this has nothing
>>> to do with broken or buggy devices, but is simply the reality we have
>>> to deal with on platforms that lack IOMMUs.
>>
>>> Even if you load into one buffer, carry out the signature verification
>>> and *only then* copy it to another buffer, a bus master could
>>> potentially read it from the first buffer as well. Mapping for DMA
>>> does *not* mean 'making the memory readable by the device' unless
>>> IOMMUs are being used. Otherwise, a bus master can read it from the
>>> first buffer, or even patch the code that performs the security check
>>> in the first place. For such platforms, copying the data around to
>>> prevent the device from reading it is simply pointless, as well as any
>>> other mitigation in software to protect yourself from misbehaving bus
>>> masters.
>>
>> Thank you for taking the time to explain this again.
>>
>>> So issuing a warning in this particular case is rather arbitrary. On
>>> these platforms, all bus masters can read (and modify) all of your
>>> memory all of the time, and as long as the firmware loader code takes
>>> care not to provide the DMA address to the device until after the
>>> verification is complete, it really has done all it reasonably can in
>>> the environment that it is expected to operate in.
>>
>> So for the non-IOMMU system case, differentiating between pre-
>> allocated buffers vs. using two buffers doesn't make sense.
>>
>>>
>>> (The use of dma_alloc_coherent() is a bit of a red herring here, as it
>>> incorporates the DMA map operation. However, DMA map is a no-op on
>>> systems with cache coherent 1:1 DMA [iow, all PCs and most arm64
>>> platforms unless they have IOMMUs], and so there is not much
>>> difference between memory allocated with kmalloc() or with
>>> dma_alloc_coherent() in terms of whether the device can access it
>>> freely)
>>
>> What about systems with an IOMMU?
>>
>
> On systems with an IOMMU, performing the DMA map will create an entry
> in the IOMMU page tables for the physical region associated with the
> buffer, making the region accessible to the device. For platforms in
> this category, using dma_alloc_coherent() for allocating a buffer to
> pass firmware to the device does open a window where the device could
> theoretically access this data while the validation is still in
> progress.
>
> Note that the device still needs to be informed about the address of
> the buffer: just calling dma_alloc_coherent() will not allow the
> device to find the firmware image in its memory space, and arbitrary
> DMA accesses performed by the device will trigger faults that are
> reported to the OS. So the window between DMA map (or
> dma_alloc_coherent()) and the device specific command to pass the DMA
> buffer address to the device is not inherently unsafe IMO, but I do
> understand the need to cover this scenario.
>
> As I pointed out before, using coherent DMA buf
