Re: [PATCH 12/13] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case

2023-10-20 Thread kirill.shute...@linux.intel.com
On Fri, Oct 20, 2023 at 03:29:24AM +, Huang, Kai wrote:
> On Thu, 2023-10-05 at 16:14 +0300, Kirill A. Shutemov wrote:
> > ACPI MADT doesn't allow a CPU to be offlined after it has been woken up.
> > This limits kexec: the target kernel won't be able to use more than one CPU.
> > 
> > Zero out mailbox address in the ACPI MADT wakeup structure to indicate
> > that the mailbox is not usable.
> > 
> > This is a Linux-specific protocol and is not reflected in the ACPI spec.
> > 
> > Booting the target kernel with a single CPU is enough to cover the most
> > common case for kexec -- kdump.
> > 
> > Signed-off-by: Kirill A. Shutemov 
> > ---
> >  arch/x86/kernel/acpi/madt_wakeup.c | 17 +
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/acpi/madt_wakeup.c 
> > b/arch/x86/kernel/acpi/madt_wakeup.c
> > index 15bdf10b1393..4e92d1d4a5fa 100644
> > --- a/arch/x86/kernel/acpi/madt_wakeup.c
> > +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> > @@ -9,6 +9,11 @@ static struct acpi_madt_multiproc_wakeup_mailbox 
> > *acpi_mp_wake_mailbox;
> >  
> >  static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
> >  {
> > +   if (!acpi_mp_wake_mailbox_paddr) {
> > +   pr_warn_once("No MADT mailbox: cannot bringup secondary CPUs. 
> > Booting with kexec?\n");
> > +   return -EOPNOTSUPP;
> > +   }
> > +
> > /*
> >  * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
> >  *
> > @@ -78,6 +83,18 @@ int __init acpi_parse_mp_wake(union 
> > acpi_subtable_headers *header,
> > /* Disable CPU onlining/offlining */
> > cpu_hotplug_not_supported();
> >  
> > +   /*
> > +* ACPI MADT doesn't allow a CPU to be offlined after it has been
> > +* woken up. This limits kexec: the target kernel won't be able to
> > +* use more than one CPU.
> > +*
> > +* Zero out mailbox address in the ACPI MADT wakeup structure to
> > +* indicate that the mailbox is not usable.
> 
> Nit:
> 
> It is better to explicitly say that this will only impact the second kernel,
> because the current kernel has already detected the mailbox address?
> 
>   Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
>   The acpi_wakeup_cpu() will use it to bring up secondary cpus.
> 
>   Zero out mailbox address in the ACPI MADT wakeup structure to
>   indicate that the mailbox is not usable.  This prevents the
>   kexec()-ed kernel from reading a valid mailbox, which in turn
>   makes the kexec()-ed kernel only able to use the boot CPU.

Okay. Looks good.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec

2023-10-20 Thread Kirill A. Shutemov
On Fri, Oct 20, 2023 at 12:21:11PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 06, 2023 at 02:24:11PM -0500, Kalra, Ashish wrote:
> > 
> > On 10/5/2023 5:28 PM, Kirill A. Shutemov wrote:
> > > On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
> > > > On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> > > > > On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > > > > > > +static void unshare_all_memory(bool unmap)
> > > > > > > +{
> > > > > > > + unsigned long addr, end;
> > > > > > > + long found = 0, shared;
> > > > > > > +
> > > > > > > + /*
> > > > > > > +  * Walk direct mapping and convert all shared memory back to 
> > > > > > > private,
> > > > > > > +  */
> > > > > > > +
> > > > > > > + addr = PAGE_OFFSET;
> > > > > > > + end  = PAGE_OFFSET + get_max_mapped();
> > > > > > > +
> > > > > > > + while (addr < end) {
> > > > > > > + unsigned long size;
> > > > > > > + unsigned int level;
> > > > > > > + pte_t *pte;
> > > > > > > +
> > > > > > > + pte = lookup_address(addr, &level);
> > > > > > 
> > > > > > IIRC, you were earlier walking the direct mapping using
> > > > > > walk_page_range_novma(), any particular reason to use 
> > > > > > lookup_address()
> > > > > > instead ?
> > > > > 
> > > > > walk_page_range_novma() wants mmap lock to be taken, but it is tricky 
> > > > > as
> > > > > we run here from atomic context in case of crash.
> > > > > 
> > > > > I considered using trylock to bypass the limitation, but it is a hack.
> > > > > 
> > > > > > 
> > > > > > > + size = page_level_size(level);
> > > > > > > +
> > > > > > > + if (pte && pte_decrypted(*pte)) {
> > > > > > 
> > > > > > Additionally need to add check for pte_none() here to handle 
> > > > > > physical memory
> > > > > > holes in direct mapping.
> > > > > 
> > > > > lookup_address() returns NULL for none entries.
> > > > > 
> > > > 
> > > > Looking at lookup_address_in_pgd(), at pte level it is simply returning
> > > > pte_offset_kernel() and there does not seem to be a check for returning 
> > > > NULL
> > > > if pte_none() ?
> > > 
> > > Hm. You are right.
> > > 
> > > I think it is yet another quirk in how lookup_address() is implemented. We
> > > need to straighten it out too.
> > > 
> > > There are two options: either make lookup_address() return a pointer to the
> > > entry even if it is none, or add a pte_none() check after pte_offset_kernel()
> > > and return NULL if it is true.
> > > 
> > > I like the first option more as it allows the caller to populate the entry if
> > > it wants.
> > 
> > Yes, I like the first option.
> 
> I tried to do this, but lookup_address() has too many callers. It goes beyond
> the scope of the patchset. I will add a pte_none() check on the unshare side
> for now.

Ah. pte_none() is not needed for the TDX implementation, as the pte_decrypted()
check will fail for such an entry. The SEV implementation would need an
additional check.
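
For illustration, here is a rough sketch of why that holds (pte_decrypted() is
the helper from this patchset; the SEV-side behaviour is my reading of
cc_mkdec(), not code from this series):

	/*
	 * For TDX, cc_mkdec() sets the shared bit, so a none (all-zero) PTE
	 * never compares equal to its "decrypted" form and the unshare loop
	 * skips the hole without an explicit pte_none() check. For SEV,
	 * cc_mkdec() clears the encryption bit, so a zero entry would compare
	 * equal and an extra pte_none() check is needed there.
	 */
	static inline bool pte_decrypted(pte_t pte)
	{
		return cc_mkdec(pte_val(pte)) == pte_val(pte);
	}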

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

2023-10-20 Thread kirill.shute...@linux.intel.com
On Fri, Oct 20, 2023 at 09:49:59AM +, Huang, Kai wrote:
> On Thu, 2023-10-05 at 16:14 +0300, Kirill A. Shutemov wrote:
> >  struct acpi_madt_multiproc_wakeup {
> >     struct acpi_subtable_header header;
> > -   u16 mailbox_version;
> > +   u16 version;
> >     u32 reserved;   /* reserved - must be zero */
> > -   u64 base_address;
> > +   u64 mailbox_address;
> > +   u64 reset_vector;
> >  };
> 
> I don't quite understand the connection between the renaming and what this 
> patch
> wants to achieve?  What's the reason to rename?

The names are bad: the version field gives the version of the structure, not
the version of the mailbox. And it is not clear what base address the
base_address field specifies.

> If needed, perhaps put into a separate patch with proper justification?

Hm. Okay...

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec

2023-10-20 Thread Kirill A. Shutemov
On Fri, Oct 06, 2023 at 02:24:11PM -0500, Kalra, Ashish wrote:
> 
> On 10/5/2023 5:28 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
> > > On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> > > > On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > > > > > +static void unshare_all_memory(bool unmap)
> > > > > > +{
> > > > > > +   unsigned long addr, end;
> > > > > > +   long found = 0, shared;
> > > > > > +
> > > > > > +   /*
> > > > > > +* Walk direct mapping and convert all shared memory back to 
> > > > > > private,
> > > > > > +*/
> > > > > > +
> > > > > > +   addr = PAGE_OFFSET;
> > > > > > +   end  = PAGE_OFFSET + get_max_mapped();
> > > > > > +
> > > > > > +   while (addr < end) {
> > > > > > +   unsigned long size;
> > > > > > +   unsigned int level;
> > > > > > +   pte_t *pte;
> > > > > > +
> > > > > > +   pte = lookup_address(addr, &level);
> > > > > 
> > > > > IIRC, you were earlier walking the direct mapping using
> > > > > walk_page_range_novma(), any particular reason to use lookup_address()
> > > > > instead ?
> > > > 
> > > > walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
> > > > we run here from atomic context in case of crash.
> > > > 
> > > > I considered using trylock to bypass the limitation, but it is a hack.
> > > > 
> > > > > 
> > > > > > +   size = page_level_size(level);
> > > > > > +
> > > > > > +   if (pte && pte_decrypted(*pte)) {
> > > > > 
> > > > > Additionally need to add check for pte_none() here to handle physical 
> > > > > memory
> > > > > holes in direct mapping.
> > > > 
> > > > lookup_address() returns NULL for none entries.
> > > > 
> > > 
> > > Looking at lookup_address_in_pgd(), at pte level it is simply returning
> > > pte_offset_kernel() and there does not seem to be a check for returning 
> > > NULL
> > > if pte_none() ?
> > 
> > Hm. You are right.
> > 
> > I think it is yet another quirk in how lookup_address() is implemented. We
> > need to straighten it out too.
> > 
> > There are two options: either make lookup_address() return a pointer to the
> > entry even if it is none, or add a pte_none() check after pte_offset_kernel()
> > and return NULL if it is true.
> > 
> > I like the first option more as it allows the caller to populate the entry if
> > it wants.
> 
> Yes, I like the first option.

I tried to do this, but lookup_address() has too many callers. It goes beyond
the scope of the patchset. I will add a pte_none() check on the unshare side
for now.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

2023-10-20 Thread Huang, Kai
On Thu, 2023-10-05 at 16:14 +0300, Kirill A. Shutemov wrote:
>  struct acpi_madt_multiproc_wakeup {
>   struct acpi_subtable_header header;
> - u16 mailbox_version;
> + u16 version;
>   u32 reserved;   /* reserved - must be zero */
> - u64 base_address;
> + u64 mailbox_address;
> + u64 reset_vector;
>  };

I don't quite understand the connection between the renaming and what this patch
wants to achieve?  What's the reason to rename?  If needed, perhaps put into a
separate patch with proper justification?



Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

2023-10-20 Thread Huang, Kai

> --- /dev/null
> +++ b/arch/x86/kernel/acpi/madt.S
> @@ -0,0 +1,28 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +
> + .text
> + .align PAGE_SIZE
> +SYM_FUNC_START(asm_acpi_mp_play_dead)
> + /* Load address of reset vector into RCX to jump when kernel is ready */
> + movq	acpi_mp_reset_vector_paddr(%rip), %rcx
> +
> + /* zero out flags, and disable interrupts */
> + pushq   $0
> + popfq
> +
> + /* Turn off global entries. Following CR3 write will flush them. */
> + movq	%cr4, %rdx
> + andq	$~(X86_CR4_PGE), %rdx
> + movq	%rdx, %cr4
> +
> + /* Switch to identity mapping */
> + movq	acpi_mp_pgd(%rip), %rax
> + movq	%rax, %cr3
> +
> + /* Jump to reset vector */
> + ANNOTATE_RETPOLINE_SAFE
> + jmp *%rcx
> +SYM_FUNC_END(asm_acpi_mp_play_dead)
> diff --git a/arch/x86/kernel/acpi/madt_wakeup.c 
> b/arch/x86/kernel/acpi/madt_wakeup.c
> index 4e92d1d4a5fa..2cc8590ec7a5 100644
> --- a/arch/x86/kernel/acpi/madt_wakeup.c
> +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> @@ -1,12 +1,162 @@
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
> +#include 
>  #include 
> +#include 
>  
>  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
>  static u64 acpi_mp_wake_mailbox_paddr;
>  /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
>  static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
>  
> +unsigned long acpi_mp_pgd;
> +u64 acpi_mp_reset_vector_paddr;

Nit: 

They are both physical addresses.  It's a little bit odd for one to use
'unsigned long' and the other to use 'u64'.

> +
> +void asm_acpi_mp_play_dead(void);
> +
> +static void __init *alloc_pgt_page(void *context)
> +{
> + return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> +}

If I am reading correctly, @context is never used.  It's not used inside this
function, and all the callers call this with @context = NULL.


[...]

> +
> +static void acpi_mp_play_dead(void)
> +{
> + idle_task_exit();
> + cpuhp_ap_report_dead();
> + asm_acpi_mp_play_dead();
> +}
> +

Looks like you can use play_dead_common() here if you take the IRQ-disable part
out of the assembly, because play_dead_common() does:

void play_dead_common(void)
{
	idle_task_exit();
	cpuhp_ap_report_dead();

	local_irq_disable();
}


Re: [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup

2023-10-20 Thread Huang, Kai
On Tue, 2023-10-10 at 10:24 +, Huang, Kai wrote:
> >  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> > @@ -74,6 +75,9 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers 
> > *header,
> >  
> > 
> >     acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
> >  
> > 
> > +   /* Disable CPU onlining/offlining */
> > +   cpu_hotplug_not_supported();
> > +
> 
> Both onlining/offlining are prevented, or just offlining?
> 
> The previous patch says:
> 
>   It does not prevent the initial bring up of the CPU, but it stops 
>   subsequent offlining.
> 
> And ...
> 
> [...]
> 
> 
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -1522,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, 
> > enum cpuhp_state target)
> >  * If the platform does not support hotplug, report it explicitly to
> >  * differentiate it from a transient offlining failure.
> >  */
> > -   if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
> > +   if (!cpu_hotplug_supported)
> >     return -EOPNOTSUPP;
> >     if (cpu_hotplug_disabled)
> >     return -EBUSY;
> 
> ... here cpu_down_maps_locked() only prevents offlining if I am reading
> correctly.
> 
> Also, can we rename cpu_hotplug_supported to cpu_offline_supported to match 
> the
> behaviour better?
> 
> Anyway, isn't it a little bit odd to have:
> 
>   if (!cpu_hotplug_supported)
>   return -EOPNOTSUPP;
>   if (cpu_hotplug_disabled)
>   return -EBUSY;
> 
> ?

I have probably missed something important, but I don't quite understand the
reason to have CC_ATTR_HOTPLUG_DISABLED in the first place, and to now replace
it with cpu_hotplug_not_supported()?

From the changelog:

Currently, hotplug is prevented based on the confidential computing
attribute which is set for Intel TDX. But TDX is not the only possible
user of the wakeup method.

CC_ATTR_HOTPLUG_DISABLED is only used by TDX guests, but MADT can be used by
non-TDX guests too.

Anyway, if the purpose is just to prevent CPUs from going offline, can this be
done by registering a cpuhp callback?

static int madt_wakeup_offline_cpu(unsigned int cpu)
{
	return -EOPNOTSUPP;
}

...

	err = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "madt-wakeup",
					NULL, madt_wakeup_offline_cpu);
	if (err) {
		pr_err("Register CPU hotplug callback failed.\n");
		/* BUG() ??? */
	}

This doesn't pollute the common CPU hotplug code, so to me it looks cleaner?




Re: [PATCHv9 2/2] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt

2023-10-20 Thread Hari Bathini



On 18/10/23 1:51 pm, Pingfan Liu wrote:

On Tue, Oct 17, 2023 at 6:39 PM Hari Bathini  wrote:




On 17/10/23 7:58 am, Pingfan Liu wrote:

*** Idea ***
For kexec -p, the boot CPU may not be cpu0, and this causes a problem when
allocating memory for paca_ptrs[]. However, in theory, there is no
requirement to assign a CPU's logical id according to its position in the
device tree. But there are helpers like cpu_first_thread_sibling() which
make assumptions about the mapping inside a core. Hence the mapping is
partially loosened, i.e. the mapping of cores is unbound while the mapping
inside a core is kept.

*** Implement ***
At this early stage, there is plenty of memory to utilize. Hence, this
patch allocates interim memory to link the CPU info on a list, then
reorders CPUs by changing the list head. As a result, there is a rotating
shift between the sequence number in the device tree and the CPU logical
number.

*** Result ***
After this patch, a boot CPU's logical id will always be mapped into the
range [0, threads_per_core).

Besides this, at this phase, all threads in the boot core are forced to
be brought online. This restriction will be lifted in a later patch with
extra effort.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: Sourabh Jain 
Cc: Hari Bathini 
Cc: kexec@lists.infradead.org
To: linuxppc-...@lists.ozlabs.org


Thanks for working on this, Pingfan.
Looks good to me.

Acked-by: Hari Bathini 



Thank you for kindly reviewing. I hope that after all these years, we
have accomplished the objective.



I hope so too.
Thanks!



Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

2023-10-20 Thread kirill.shute...@linux.intel.com
On Fri, Oct 20, 2023 at 11:21:34AM +, Huang, Kai wrote:
> 
> > --- /dev/null
> > +++ b/arch/x86/kernel/acpi/madt.S
> > @@ -0,0 +1,28 @@
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +   .text
> > +   .align PAGE_SIZE
> > +SYM_FUNC_START(asm_acpi_mp_play_dead)
> > +   /* Load address of reset vector into RCX to jump when kernel is ready */
> > +   movq	acpi_mp_reset_vector_paddr(%rip), %rcx
> > +
> > +   /* zero out flags, and disable interrupts */
> > +   pushq   $0
> > +   popfq
> > +
> > +   /* Turn off global entries. Following CR3 write will flush them. */
> > +   movq	%cr4, %rdx
> > +   andq	$~(X86_CR4_PGE), %rdx
> > +   movq	%rdx, %cr4
> > +
> > +   /* Switch to identity mapping */
> > +   movq	acpi_mp_pgd(%rip), %rax
> > +   movq	%rax, %cr3
> > +
> > +   /* Jump to reset vector */
> > +   ANNOTATE_RETPOLINE_SAFE
> > +   jmp *%rcx
> > +SYM_FUNC_END(asm_acpi_mp_play_dead)
> > diff --git a/arch/x86/kernel/acpi/madt_wakeup.c 
> > b/arch/x86/kernel/acpi/madt_wakeup.c
> > index 4e92d1d4a5fa..2cc8590ec7a5 100644
> > --- a/arch/x86/kernel/acpi/madt_wakeup.c
> > +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> > @@ -1,12 +1,162 @@
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> >  #include 
> > +#include 
> >  
> >  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> >  static u64 acpi_mp_wake_mailbox_paddr;
> >  /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> >  static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> >  
> > +unsigned long acpi_mp_pgd;
> > +u64 acpi_mp_reset_vector_paddr;
> 
> Nit: 
> 
> They are both physical addresses.  It's a little bit odd for one to use
> 'unsigned long' and the other to use 'u64'.

Okay, I will make them both u64.

> > +
> > +void asm_acpi_mp_play_dead(void);
> > +
> > +static void __init *alloc_pgt_page(void *context)
> > +{
> > +   return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> > +}
> 
> If I am reading correctly, @context is never used.  It's not used inside this
> function, and all the callers call this with @context = NULL.

Yes. We need the argument to conform to the x86_mapping_info::alloc_pgt_page
interface.
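
For context, the callback just has to match this interface (reproduced from
arch/x86/include/asm/init.h from memory, so treat the field details as
approximate):

	struct x86_mapping_info {
		void *(*alloc_pgt_page)(void *); /* allocate buf for page table */
		void *context;			 /* context for alloc_pgt_page */
		unsigned long page_flag;	 /* page flag for PMD or PUD entry */
		unsigned long offset;		 /* ident mapping offset */
		bool direct_gbpages;		 /* PUD level 1GB page support */
		unsigned long kernpg_flag;	 /* kernel pagetable flag override */
	};

kernel_ident_mapping_init() calls info->alloc_pgt_page(info->context) for every
page table page it needs, which is why the unused argument still has to be
there.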

> > +
> > +static void acpi_mp_play_dead(void)
> > +{
> > +   idle_task_exit();
> > +   cpuhp_ap_report_dead();
> > +   asm_acpi_mp_play_dead();
> > +}
> > +
> 
> Looks like you can use play_dead_common() here if you take the IRQ-disable
> part out of the assembly, because play_dead_common() does:
> 
> void play_dead_common(void)
> {
> 	idle_task_exit();
> 	cpuhp_ap_report_dead();
> 
> 	local_irq_disable();
> }

Okay, fair enough.
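
With that, the helper would end up looking roughly like this (a sketch,
assuming local_irq_disable() is indeed dropped from the assembly stub):

	static void acpi_mp_play_dead(void)
	{
		/* idle_task_exit() + cpuhp_ap_report_dead() + local_irq_disable() */
		play_dead_common();
		asm_acpi_mp_play_dead();
	}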

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



[PATCHv2 12/13] x86/acpi: Rename fields in acpi_madt_multiproc_wakeup structure

2023-10-20 Thread Kirill A. Shutemov
To prepare for the addition of support for MADT wakeup structure version
1, it is necessary to provide more appropriate names for the fields in
the structure.

The field 'mailbox_version' is renamed to 'version'. This field signifies
the version of the structure and the related protocols, rather than the
version of the mailbox. This field has not been utilized in the code
thus far.

The field 'base_address' is renamed to 'mailbox_address' to clarify the
kind of address it represents. In version 1, the structure also includes the
reset vector address. Clear and distinct naming helps to prevent any
confusion.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/kernel/acpi/madt_wakeup.c | 4 ++--
 include/acpi/actbl2.h  | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/acpi/madt_wakeup.c 
b/arch/x86/kernel/acpi/madt_wakeup.c
index 9bbe829737e7..ad170def2367 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -79,7 +79,7 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers 
*header,
 
	acpi_table_print_madt_entry(&header->common);
 
-   acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+   acpi_mp_wake_mailbox_paddr = mp_wake->mailbox_address;
 
cpu_hotplug_disable_offlining();
 
@@ -98,7 +98,7 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers 
*header,
 *
 * This is Linux-specific protocol and not reflected in ACPI spec.
 */
-   mp_wake->base_address = 0;
+   mp_wake->mailbox_address = 0;
 
apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
 
diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index 3751ae69432f..23b4cfb640fc 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -1109,9 +1109,9 @@ struct acpi_madt_generic_translator {
 
 struct acpi_madt_multiproc_wakeup {
struct acpi_subtable_header header;
-   u16 mailbox_version;
+   u16 version;
u32 reserved;   /* reserved - must be zero */
-   u64 base_address;
+   u64 mailbox_address;
 };
 
 #define ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE2032
-- 
2.41.0




[PATCHv2 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

2023-10-20 Thread Kirill A. Shutemov
MADT Multiprocessor Wakeup structure version 1 brings support for CPU
offlining: the BIOS provides a reset vector that the CPU has to jump to in
order to offline itself. The new TEST mailbox command can be used to check
that the CPU offlined itself successfully and that the BIOS has control
over it.

Add CPU offlining support for the ACPI MADT wakeup method by implementing
custom cpu_die, play_dead and stop_other_cpus SMP operations.

CPU offlining makes it possible to hand over secondary CPUs across kexec,
so the second kernel is not limited to a single CPU.

The change conforms to the approved ACPI spec change proposal. See the
Link.

Signed-off-by: Kirill A. Shutemov 
Link: https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher
---
 arch/x86/kernel/acpi/Makefile  |   2 +-
 arch/x86/kernel/acpi/boot.c|   2 +
 arch/x86/kernel/acpi/madt.S|  24 
 arch/x86/kernel/acpi/madt_wakeup.c | 197 ++---
 include/acpi/actbl2.h  |  15 ++-
 5 files changed, 220 insertions(+), 20 deletions(-)
 create mode 100644 arch/x86/kernel/acpi/madt.S

diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
index 8c7329c88a75..ccb8198dd8d1 100644
--- a/arch/x86/kernel/acpi/Makefile
+++ b/arch/x86/kernel/acpi/Makefile
@@ -4,7 +4,7 @@ obj-$(CONFIG_ACPI)  += boot.o
 obj-$(CONFIG_ACPI_SLEEP)   += sleep.o wakeup_$(BITS).o
 obj-$(CONFIG_ACPI_APEI)+= apei.o
 obj-$(CONFIG_ACPI_CPPC_LIB)+= cppc.o
-obj-$(CONFIG_X86_ACPI_MADT_WAKEUP) += madt_wakeup.o
+obj-$(CONFIG_X86_ACPI_MADT_WAKEUP) += madt_wakeup.o madt.o
 
 ifneq ($(CONFIG_ACPI_PROCESSOR),)
 obj-y  += cstate.o
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 111bd226ad99..d537dbffa697 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -33,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "sleep.h" /* To include x86_acpi_suspend_lowlevel */
 static int __initdata acpi_force = 0;
diff --git a/arch/x86/kernel/acpi/madt.S b/arch/x86/kernel/acpi/madt.S
new file mode 100644
index ..a60435cf4a77
--- /dev/null
+++ b/arch/x86/kernel/acpi/madt.S
@@ -0,0 +1,24 @@
+#include 
+#include 
+#include 
+#include 
+
+   .text
+   .align PAGE_SIZE
+SYM_FUNC_START(asm_acpi_mp_play_dead)
+   /* Load address of reset vector into RCX to jump when kernel is ready */
+   movq	acpi_mp_reset_vector_paddr(%rip), %rcx
+
+   /* Turn off global entries. Following CR3 write will flush them. */
+   movq	%cr4, %rdx
+   andq	$~(X86_CR4_PGE), %rdx
+   movq	%rdx, %cr4
+
+   /* Switch to identity mapping */
+   movq	acpi_mp_pgd(%rip), %rax
+   movq	%rax, %cr3
+
+   /* Jump to reset vector */
+   ANNOTATE_RETPOLINE_SAFE
+   jmp *%rcx
+SYM_FUNC_END(asm_acpi_mp_play_dead)
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c 
b/arch/x86/kernel/acpi/madt_wakeup.c
index ad170def2367..f9ff14ee2892 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -1,8 +1,13 @@
 #include 
 #include 
+#include 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
+#include 
 #include 
 
 /* Physical address of the Multiprocessor Wakeup Structure mailbox */
@@ -11,6 +16,150 @@ static u64 acpi_mp_wake_mailbox_paddr;
 /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
 static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
 
+u64 acpi_mp_pgd;
+u64 acpi_mp_reset_vector_paddr;
+
+void asm_acpi_mp_play_dead(void);
+
+static void __init *alloc_pgt_page(void *context)
+{
+   return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+}
+
+/*
+ * Make sure asm_acpi_mp_play_dead() is present in the identity mapping at
+ * the same place as in the kernel page tables. The function switches to
+ * the identity mapping and has to be present at the same spot before and
+ * after the transition.
+ */
+static int __init init_transition_pgtable(pgd_t *pgd)
+{
+   pgprot_t prot = PAGE_KERNEL_EXEC_NOENC;
+   unsigned long vaddr, paddr;
+   int result = -ENOMEM;
+   p4d_t *p4d;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+
+   vaddr = (unsigned long)asm_acpi_mp_play_dead;
+   pgd += pgd_index(vaddr);
+   if (!pgd_present(*pgd)) {
+   p4d = (p4d_t *)alloc_pgt_page(NULL);
+   if (!p4d)
+   goto err;
+   set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+   }
+   p4d = p4d_offset(pgd, vaddr);
+   if (!p4d_present(*p4d)) {
+   pud = (pud_t *)alloc_pgt_page(NULL);
+   if (!pud)
+   goto err;
+   set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
+   }
+   pud = pud_offset(p4d, vaddr);
+   if (!pud_present(*pud)) {
+   pmd = (pmd_t 

[PATCHv2 07/13] x86/mm: Return correct level from lookup_address() if pte is none

2023-10-20 Thread Kirill A. Shutemov
lookup_address() only returns the correct page table level for the entry if
the entry is not none.

Make the helper always return the correct 'level'. This allows implementing
an iterator over kernel page tables using lookup_address().

Add one more entry to enum pg_level to indicate the size of the VA covered
by one PGD entry in 5-level paging mode.
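
As an illustration, the iterator pattern this enables looks roughly like the
following sketch (the real users are the kexec unshare loop and the debugfs
check later in the series):

	unsigned long addr = PAGE_OFFSET;
	unsigned long end  = PAGE_OFFSET + get_max_mapped();

	while (addr < end) {
		unsigned int level;
		pte_t *pte = lookup_address(addr, &level);
		unsigned long size = page_level_size(level);

		if (pte && !pte_none(*pte)) {
			/* process the mapping of 'size' bytes at 'addr' */
		}

		/* 'level' is correct even for a hole, so skip it in one step */
		addr += size;
	}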

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Rick Edgecombe 
---
 arch/x86/include/asm/pgtable_types.h | 1 +
 arch/x86/mm/pat/set_memory.c | 8 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index 0b748ee16b3d..3f648ffdfbe5 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -548,6 +548,7 @@ enum pg_level {
PG_LEVEL_2M,
PG_LEVEL_1G,
PG_LEVEL_512G,
+   PG_LEVEL_256T,
PG_LEVEL_NUM
 };
 
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 6fbf22d5fa56..01f827eb8e80 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -666,32 +666,32 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long 
address,
pud_t *pud;
pmd_t *pmd;
 
-   *level = PG_LEVEL_NONE;
+   *level = PG_LEVEL_256T;
 
if (pgd_none(*pgd))
return NULL;
 
+   *level = PG_LEVEL_512G;
p4d = p4d_offset(pgd, address);
if (p4d_none(*p4d))
return NULL;
 
-   *level = PG_LEVEL_512G;
if (p4d_large(*p4d) || !p4d_present(*p4d))
return (pte_t *)p4d;
 
+   *level = PG_LEVEL_1G;
pud = pud_offset(p4d, address);
if (pud_none(*pud))
return NULL;
 
-   *level = PG_LEVEL_1G;
if (pud_large(*pud) || !pud_present(*pud))
return (pte_t *)pud;
 
+   *level = PG_LEVEL_2M;
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd))
return NULL;
 
-   *level = PG_LEVEL_2M;
if (pmd_large(*pmd) || !pmd_present(*pmd))
return (pte_t *)pmd;
 
-- 
2.41.0




[PATCHv2 10/13] x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges

2023-10-20 Thread Kirill A. Shutemov
e820__end_of_ram_pfn() is used to calculate max_pfn which, among other
things, guides where direct mapping ends. Any memory above max_pfn is
not going to be present in the direct mapping.

e820__end_of_ram_pfn() finds the end of the RAM based on the highest
E820_TYPE_RAM range. But it doesn't include E820_TYPE_ACPI ranges in the
calculation.

Despite the name, E820_TYPE_ACPI covers not only ACPI data, but also EFI
tables and might be required by the kernel to function properly.

Usually the problem is hidden because there is some E820_TYPE_RAM memory
above E820_TYPE_ACPI. But crashkernel only presents pre-allocated crash
memory as E820_TYPE_RAM on boot. If the preallocated range is small, it
can fit under the last E820_TYPE_ACPI range.

Modify e820__end_of_ram_pfn() and e820__end_of_low_ram_pfn() to cover
E820_TYPE_ACPI memory.

The problem was discovered while debugging kexec for a TDX guest. A TDX
guest uses E820_TYPE_ACPI to store the unaccepted memory bitmap and pass
it between the kernels on kexec.
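
As a hypothetical illustration (addresses made up): if the crash kernel only
sees a 256M crashkernel region at 1G as E820_TYPE_RAM while the firmware's
E820_TYPE_ACPI range sits at 4G, max_pfn stops just above the crashkernel
region and the ACPI/EFI data at 4G ends up outside the direct mapping.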

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/kernel/e820.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index fb8cf953380d..99c80680dc9e 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -827,7 +827,7 @@ u64 __init e820__memblock_alloc_reserved(u64 size, u64 
align)
 /*
  * Find the highest page frame number we have available
  */
-static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum 
e820_type type)
+static unsigned long __init e820_end_ram_pfn(unsigned long limit_pfn)
 {
int i;
unsigned long last_pfn = 0;
@@ -838,7 +838,8 @@ static unsigned long __init e820_end_pfn(unsigned long 
limit_pfn, enum e820_type
unsigned long start_pfn;
unsigned long end_pfn;
 
-   if (entry->type != type)
+   if (entry->type != E820_TYPE_RAM &&
+   entry->type != E820_TYPE_ACPI)
continue;
 
start_pfn = entry->addr >> PAGE_SHIFT;
@@ -864,12 +865,12 @@ static unsigned long __init e820_end_pfn(unsigned long 
limit_pfn, enum e820_type
 
 unsigned long __init e820__end_of_ram_pfn(void)
 {
-   return e820_end_pfn(MAX_ARCH_PFN, E820_TYPE_RAM);
+   return e820_end_ram_pfn(MAX_ARCH_PFN);
 }
 
 unsigned long __init e820__end_of_low_ram_pfn(void)
 {
-   return e820_end_pfn(1UL << (32 - PAGE_SHIFT), E820_TYPE_RAM);
+   return e820_end_ram_pfn(1UL << (32 - PAGE_SHIFT));
 }
 
 static void __init early_panic(char *msg)
-- 
2.41.0




[PATCHv2 09/13] x86/tdx: Convert shared memory back to private on kexec

2023-10-20 Thread Kirill A. Shutemov
TDX guests allocate shared buffers to perform I/O. It is done by
allocating pages normally from the buddy allocator and converting them
to shared with set_memory_decrypted().

The second kernel has no idea what memory is converted this way. It only
sees E820_TYPE_RAM.

Accessing shared memory via a private mapping is fatal. It leads to an
unrecoverable TD exit.

On kexec, walk the direct mapping and convert all shared memory back to
private. This makes all RAM private again and the second kernel may use it
normally.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/coco/tdx/kexec.c   |   0
 arch/x86/coco/tdx/tdx.c | 120 +++-
 arch/x86/include/asm/x86_init.h |   1 +
 arch/x86/kernel/crash.c |   4 ++
 arch/x86/kernel/reboot.c|   5 ++
 5 files changed, 128 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/coco/tdx/kexec.c

diff --git a/arch/x86/coco/tdx/kexec.c b/arch/x86/coco/tdx/kexec.c
new file mode 100644
index ..e69de29bb2d1
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 587bdeb88afa..2be23fe8cb3d 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -6,14 +6,17 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 /* MMIO direction */
 #define EPT_READ   0
@@ -40,6 +43,9 @@
 
 static atomic_long_t nr_shared;
 
+static atomic_t conversions_in_progress;
+static bool conversion_allowed = true;
+
 static inline bool pte_decrypted(pte_t pte)
 {
return cc_mkdec(pte_val(pte)) == pte_val(pte);
@@ -704,6 +710,14 @@ static bool tdx_tlb_flush_required(bool private)
 
 static bool tdx_cache_flush_required(void)
 {
+   /*
+* Avoid issuing CLFLUSH on set_memory_decrypted() if conversions
+* stopped. Otherwise it can race with unshare_all_memory() and trigger
+* implicit conversion to shared.
+*/
+   if (!conversion_allowed)
+   return false;
+
/*
 * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
 * TDX doesn't have such capability.
@@ -787,12 +801,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, 
int numpages, bool enc)
 static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
 bool enc)
 {
+   atomic_inc(&conversions_in_progress);
+
+   /*
+* Check after bumping conversions_in_progress to serialize
+* against tdx_shutdown().
+*/
+   if (!conversion_allowed) {
+   atomic_dec(&conversions_in_progress);
+   return -EBUSY;
+   }
+
/*
 * Only handle shared->private conversion here.
 * See the comment in tdx_early_init().
 */
-   if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+   if (enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
+   atomic_dec(&conversions_in_progress);
return -EIO;
+   }
 
return 0;
 }
@@ -804,17 +831,104 @@ static int tdx_enc_status_change_finish(unsigned long 
vaddr, int numpages,
 * Only handle private->shared conversion here.
 * See the comment in tdx_early_init().
 */
-   if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+   if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
+   atomic_dec(&conversions_in_progress);
return -EIO;
+   }
 
if (enc)
		atomic_long_sub(numpages, &nr_shared);
	else
		atomic_long_add(numpages, &nr_shared);
 
+   atomic_dec(&conversions_in_progress);
+
return 0;
 }
 
+void tdx_kexec_unshare_mem(bool crash)
+{
+   unsigned long addr, end;
+   long found = 0, shared;
+
+   /* Stop new private<->shared conversions */
+   conversion_allowed = false;
+
+   /*
+* Crash kernel reaches here with interrupts disabled: can't wait for
+* conversions to finish.
+*
+* If race happened, just report and proceed.
+*/
+   if (!crash) {
+   unsigned long timeout;
+
+   /*
+* Wait for in-flight conversions to complete.
+*
+* Do not wait more than 30 seconds.
+*/
+   timeout = 30 * USEC_PER_SEC;
+   while (atomic_read(&conversions_in_progress) && timeout--)
+   udelay(1);
+   }
+
+   if (atomic_read(&conversions_in_progress))
+   pr_warn("Failed to finish shared<->private conversions\n");
+
+   /*
+* Walk direct mapping and convert all shared memory back to private,
+*/
+
+   addr = PAGE_OFFSET;
+   end  = PAGE_OFFSET + get_max_mapped();
+
+   while (addr < end) {
+   unsigned long size;
+   unsigned int level;
+   pte_t *pte;
+
+   pte = lookup_address(addr, &level);
+   size = page_level_size(level);
+
+  

[PATCHv2 02/13] kernel/cpu: Add support for declaring CPU offlining not supported

2023-10-20 Thread Kirill A. Shutemov
ACPI MADT doesn't allow a CPU to be offlined after it has been woken up.

Currently, offlining is prevented based on the confidential computing
attribute which is set for Intel TDX. But TDX is not the only possible
user of the wakeup method.

Introduce cpu_hotplug_disable_offlining() that can be called to indicate
that CPU offlining should be disabled.

This function is going to replace CC_ATTR_HOTPLUG_DISABLED for ACPI
MADT.

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/cpu.h |  2 ++
 kernel/cpu.c| 13 -
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index f19f56501809..97832ced939d 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -132,6 +132,7 @@ extern void cpus_read_lock(void);
 extern void cpus_read_unlock(void);
 extern int  cpus_read_trylock(void);
 extern void lockdep_assert_cpus_held(void);
+extern void cpu_hotplug_disable_offlining(void);
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 void clear_tasks_mm_cpumask(int cpu);
@@ -147,6 +148,7 @@ static inline void cpus_read_lock(void) { }
 static inline void cpus_read_unlock(void) { }
 static inline int  cpus_read_trylock(void) { return true; }
 static inline void lockdep_assert_cpus_held(void) { }
+static inline void cpu_hotplug_disable_offlining(void) { }
 static inline void cpu_hotplug_disable(void) { }
 static inline void cpu_hotplug_enable(void) { }
 static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6de7c6bb74ee..faebee0050a2 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -484,6 +484,8 @@ static int cpu_hotplug_disabled;
 
 DEFINE_STATIC_PERCPU_RWSEM(cpu_hotplug_lock);
 
+static bool cpu_hotplug_offline_disabled;
+
 void cpus_read_lock(void)
 {
	percpu_down_read(&cpu_hotplug_lock);
@@ -543,6 +545,14 @@ static void lockdep_release_cpus_lock(void)
	rwsem_release(&cpu_hotplug_lock.dep_map, _THIS_IP_);
 }
 
+/* Declare CPU offlining not supported */
+void cpu_hotplug_disable_offlining(void)
+{
+   cpu_maps_update_begin();
+   cpu_hotplug_offline_disabled = true;
+   cpu_maps_update_done();
+}
+
 /*
  * Wait for currently running CPU hotplug operations to complete (if any) and
  * disable future CPU hotplug (from sysfs). The 'cpu_add_remove_lock' protects
@@ -1507,7 +1517,8 @@ static int cpu_down_maps_locked(unsigned int cpu, enum 
cpuhp_state target)
 * If the platform does not support hotplug, report it explicitly to
 * differentiate it from a transient offlining failure.
 */
-   if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
+   if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) ||
+   cpu_hotplug_offline_disabled)
return -EOPNOTSUPP;
if (cpu_hotplug_disabled)
return -EBUSY;
-- 
2.41.0




[PATCHv2 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file

2023-10-20 Thread Kirill A. Shutemov
In order to prepare for the expansion of support for the ACPI MADT
wakeup method, move the relevant code into a separate file.

Introduce a new configuration option to clearly indicate dependencies
without the use of ifdefs.

There have been no functional changes.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/Kconfig   |  7 +++
 arch/x86/include/asm/acpi.h|  5 ++
 arch/x86/kernel/acpi/Makefile  | 11 ++--
 arch/x86/kernel/acpi/boot.c| 86 +-
 arch/x86/kernel/acpi/madt_wakeup.c | 81 
 5 files changed, 100 insertions(+), 90 deletions(-)
 create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 799102f4d909..9957a73bb386 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || 
PCI_MSI
select IRQ_DOMAIN_HIERARCHY
 
+config X86_ACPI_MADT_WAKEUP
+   def_bool y
+   depends on X86_64
+   depends on ACPI
+   depends on SMP
+   depends on X86_LOCAL_APIC
+
 config X86_IO_APIC
def_bool y
depends on X86_LOCAL_APIC || X86_UP_IOAPIC
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index c8a7fc23f63c..b536b5a6a57b 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
 
 #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
 
+union acpi_subtable_headers;
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+ const unsigned long end);
+
 /*
  * Check if the CPU can handle C2 and deeper
  */
diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
index fc17b3f136fe..8c7329c88a75 100644
--- a/arch/x86/kernel/acpi/Makefile
+++ b/arch/x86/kernel/acpi/Makefile
@@ -1,11 +1,12 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-$(CONFIG_ACPI) += boot.o
-obj-$(CONFIG_ACPI_SLEEP)   += sleep.o wakeup_$(BITS).o
-obj-$(CONFIG_ACPI_APEI)+= apei.o
-obj-$(CONFIG_ACPI_CPPC_LIB)+= cppc.o
+obj-$(CONFIG_ACPI) += boot.o
+obj-$(CONFIG_ACPI_SLEEP)   += sleep.o wakeup_$(BITS).o
+obj-$(CONFIG_ACPI_APEI)+= apei.o
+obj-$(CONFIG_ACPI_CPPC_LIB)+= cppc.o
+obj-$(CONFIG_X86_ACPI_MADT_WAKEUP) += madt_wakeup.o
 
 ifneq ($(CONFIG_ACPI_PROCESSOR),)
-obj-y  += cstate.o
+obj-y  += cstate.o
 endif
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 2a0ea38955df..111bd226ad99 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -66,13 +66,6 @@ static u64 acpi_lapic_addr __initdata = 
APIC_DEFAULT_PHYS_BASE;
 static bool acpi_support_online_capable;
 #endif
 
-#ifdef CONFIG_X86_64
-/* Physical address of the Multiprocessor Wakeup Structure mailbox */
-static u64 acpi_mp_wake_mailbox_paddr;
-/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
-static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
-#endif
-
 #ifdef CONFIG_X86_IO_APIC
 /*
  * Locks related to IOAPIC hotplug
@@ -357,60 +350,6 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, 
const unsigned long e
 
return 0;
 }
-
-#ifdef CONFIG_X86_64
-static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
-{
-   /*
-* Remap mailbox memory only for the first call to acpi_wakeup_cpu().
-*
-* Wakeup of secondary CPUs is fully serialized in the core code.
-* No need to protect acpi_mp_wake_mailbox from concurrent accesses.
-*/
-   if (!acpi_mp_wake_mailbox) {
-   acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
-   sizeof(*acpi_mp_wake_mailbox),
-   MEMREMAP_WB);
-   }
-
-   /*
-* Mailbox memory is shared between the firmware and OS. Firmware will
-* listen on mailbox command address, and once it receives the wakeup
-* command, the CPU associated with the given apicid will be booted.
-*
-* The value of 'apic_id' and 'wakeup_vector' must be visible to the
-* firmware before the wakeup command is visible.  smp_store_release()
-* ensures ordering and visibility.
-*/
-   acpi_mp_wake_mailbox->apic_id   = apicid;
-   acpi_mp_wake_mailbox->wakeup_vector = start_ip;
-   smp_store_release(&acpi_mp_wake_mailbox->command,
- ACPI_MP_WAKE_COMMAND_WAKEUP);
-
-   /*
-* Wait for the CPU to wake up.
-*
-* The CPU being woken up is essentially in a spin loop waiting to be
-* woken up. It should not take long for it wake up and acknowledge by
-* zeroing out ->command.
- 

[PATCHv2 11/13] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case

2023-10-20 Thread Kirill A. Shutemov
ACPI MADT doesn't allow a CPU to be offlined after it has been woken up. This
limits kexec: the second kernel won't be able to use more than one CPU.

Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
acpi_wakeup_cpu() will use it to bring up secondary CPUs.

Zero out the mailbox address in the ACPI MADT wakeup structure to indicate
that the mailbox is not usable.  This prevents the kexec()-ed kernel
from reading a valid mailbox, which in turn makes the kexec()-ed kernel
only able to use the boot CPU.

This is a Linux-specific protocol and is not reflected in the ACPI spec.

Booting the second kernel with a single CPU is enough to cover the most
common case for kexec -- kdump.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/kernel/acpi/madt_wakeup.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/arch/x86/kernel/acpi/madt_wakeup.c 
b/arch/x86/kernel/acpi/madt_wakeup.c
index 4bc1d5106afd..9bbe829737e7 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -13,6 +13,11 @@ static struct acpi_madt_multiproc_wakeup_mailbox 
*acpi_mp_wake_mailbox;
 
 static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
 {
+   if (!acpi_mp_wake_mailbox_paddr) {
+   pr_warn_once("No MADT mailbox: cannot bringup secondary CPUs. 
Booting with kexec?\n");
+   return -EOPNOTSUPP;
+   }
+
/*
 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
 *
@@ -78,6 +83,23 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers 
*header,
 
cpu_hotplug_disable_offlining();
 
+   /*
+* ACPI MADT doesn't allow a CPU to be offlined after it has been woken
+* up. This limits kexec: the second kernel won't be able to use more
+* than one CPU.
+*
+* Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
+* acpi_wakeup_cpu() will use it to bring up secondary CPUs.
+*
+* Zero out the mailbox address in the ACPI MADT wakeup structure to
+* indicate that the mailbox is not usable.  This prevents the
+* kexec()-ed kernel from reading a valid mailbox, which in turn
+* makes the kexec()-ed kernel only able to use the boot CPU.
+*
+* This is a Linux-specific protocol and is not reflected in the ACPI
+* spec.
+*/
+   mp_wake->base_address = 0;
+
apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
 
return 0;
-- 
2.41.0




[PATCHv2 05/13] x86/kexec: Keep CR4.MCE set during kexec for TDX guest

2023-10-20 Thread Kirill A. Shutemov
TDX guests are not allowed to clear CR4.MCE. An attempt to clear it leads
to #VE.

Use alternatives to keep the flag during kexec for TDX guests.

The change doesn't affect non-TDX-guest environments.
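
Concretely, in the hunk below the default 'jmp 1f' skips the OR of
X86_CR4_MCE; on TDX guests the ALTERNATIVE patches that jump out, so the bit
is kept set when CR4 is reloaded.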

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Kai Huang 
---
 arch/x86/kernel/relocate_kernel_64.S | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/relocate_kernel_64.S 
b/arch/x86/kernel/relocate_kernel_64.S
index 56cab1bb25f5..bea89814b48e 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -145,11 +145,16 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 * Set cr4 to a known state:
 *  - physical address extension enabled
 *  - 5-level paging, if it was enabled before
+*  - Machine check exception on TDX guest. Clearing MCE is not allowed
+*in TDX guests.
 */
movl$X86_CR4_PAE, %eax
testq   $X86_CR4_LA57, %r13
jz  1f
orl $X86_CR4_LA57, %eax
+1:
+   ALTERNATIVE "jmp 1f", "", X86_FEATURE_TDX_GUEST
+   orl $X86_CR4_MCE, %eax
 1:
movq%rax, %cr4
 
-- 
2.41.0




[PATCHv2 06/13] x86/mm: Make x86_platform.guest.enc_status_change_*() return errno

2023-10-20 Thread Kirill A. Shutemov
TDX is going to have more than one reason to fail
enc_status_change_prepare().

Change the callback to return errno instead of assuming -EIO;
enc_status_change_finish() is changed too to keep the interface symmetric.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/coco/tdx/tdx.c | 20 +++-
 arch/x86/hyperv/ivm.c   |  9 +++--
 arch/x86/include/asm/x86_init.h |  4 ++--
 arch/x86/kernel/x86_init.c  |  4 ++--
 arch/x86/mm/mem_encrypt_amd.c   |  8 
 arch/x86/mm/pat/set_memory.c|  9 +
 6 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index d11206ceff3b..8897d3cc8a15 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -776,28 +776,30 @@ static bool tdx_enc_status_changed(unsigned long vaddr, 
int numpages, bool enc)
return true;
 }
 
-static bool tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
- bool enc)
+static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
+bool enc)
 {
/*
 * Only handle shared->private conversion here.
 * See the comment in tdx_early_init().
 */
-   if (enc)
-   return tdx_enc_status_changed(vaddr, numpages, enc);
-   return true;
+   if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+   return -EIO;
+
+   return 0;
 }
 
-static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
+static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 bool enc)
 {
/*
 * Only handle private->shared conversion here.
 * See the comment in tdx_early_init().
 */
-   if (!enc)
-   return tdx_enc_status_changed(vaddr, numpages, enc);
-   return true;
+   if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+   return -EIO;
+
+   return 0;
 }
 
 void __init tdx_early_init(void)
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index c1088d3661d5..7d2241059d49 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -510,13 +510,12 @@ static int hv_mark_gpa_visibility(u16 count, const u64 
pfn[],
  * with host. This function works as wrap of hv_mark_gpa_visibility()
  * with memory base and size.
  */
-static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, 
bool enc)
+static int hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, 
bool enc)
 {
enum hv_mem_host_visibility visibility = enc ?
VMBUS_PAGE_NOT_VISIBLE : VMBUS_PAGE_VISIBLE_READ_WRITE;
u64 *pfn_array;
int ret = 0;
-   bool result = true;
int i, pfn;
 
pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
@@ -530,17 +529,15 @@ static bool hv_vtom_set_host_visibility(unsigned long 
kbuffer, int pagecount, bo
if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
ret = hv_mark_gpa_visibility(pfn, pfn_array,
 visibility);
-   if (ret) {
-   result = false;
+   if (ret)
goto err_free_pfn_array;
-   }
pfn = 0;
}
}
 
  err_free_pfn_array:
kfree(pfn_array);
-   return result;
+   return ret;
 }
 
 static bool hv_vtom_tlb_flush_required(bool private)
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 5240d88db52a..5031cbc6e211 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -150,8 +150,8 @@ struct x86_init_acpi {
  * @enc_cache_flush_required   Returns true if a cache flush is needed before 
changing page encryption status
  */
 struct x86_guest {
-   bool (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool 
enc);
-   bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool 
enc);
+   int (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool 
enc);
+   int (*enc_status_change_finish)(unsigned long vaddr, int npages, bool 
enc);
bool (*enc_tlb_flush_required)(bool enc);
bool (*enc_cache_flush_required)(void);
 };
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index a37ebd3b4773..f0f54e109eb9 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -131,8 +131,8 @@ struct x86_cpuinit_ops x86_cpuinit = {
 
 static void default_nmi_init(void) { };
 
-static bool enc_status_change_prepare_noop(unsigned long vaddr, int npages, 
bool enc) { return true; }
-static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, 
bool enc) { return true; }
+static int 

[PATCHv2 00/13] x86/tdx: Add kexec support

2023-10-20 Thread Kirill A. Shutemov
The patchset adds bits and pieces to get kexec (and crashkernel) working on
TDX guests.

The last patch implements CPU offlining according to the approved ACPI
spec change proposal[1]. It unlocks kexec with all CPUs visible in the target
kernel. It requires BIOS-side enabling. If it is missing, we fall back to
booting the second kernel with a single CPU.

Please review. I would be glad for any feedback.

v2:
  - Rework how unsharing hook ups into kexec codepath;
  - Rework kvmclock_disable() fix based on Sean's;
  - s/cpu_hotplug_not_supported()/cpu_hotplug_disable_offlining()/;
  - use play_dead_common() to implement acpi_mp_play_dead();
  - cond_resched() in tdx_shared_memory_show();
  - s/target kernel/second kernel/;
  - Update commit messages and comments;

[1] https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher

Kirill A. Shutemov (13):
  x86/acpi: Extract ACPI MADT wakeup code into a separate file
  kernel/cpu: Add support for declaring CPU offlining not supported
  cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup
  x86/kvm: Do not try to disable kvmclock if it was not enabled
  x86/kexec: Keep CR4.MCE set during kexec for TDX guest
  x86/mm: Make x86_platform.guest.enc_status_change_*() return errno
  x86/mm: Return correct level from lookup_address() if pte is none
  x86/tdx: Account shared memory
  x86/tdx: Convert shared memory back to private on kexec
  x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges
  x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
  x86/acpi: Rename fields in acpi_madt_multiproc_wakeup structure
  x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

 arch/x86/Kconfig |   7 +
 arch/x86/coco/core.c |   1 -
 arch/x86/coco/tdx/kexec.c|   0
 arch/x86/coco/tdx/tdx.c  | 205 +++-
 arch/x86/hyperv/ivm.c|   9 +-
 arch/x86/include/asm/acpi.h  |   5 +
 arch/x86/include/asm/pgtable_types.h |   1 +
 arch/x86/include/asm/x86_init.h  |   5 +-
 arch/x86/kernel/acpi/Makefile|  11 +-
 arch/x86/kernel/acpi/boot.c  |  88 +
 arch/x86/kernel/acpi/madt.S  |  24 +++
 arch/x86/kernel/acpi/madt_wakeup.c   | 267 +++
 arch/x86/kernel/crash.c  |   4 +
 arch/x86/kernel/e820.c   |   9 +-
 arch/x86/kernel/kvmclock.c   |  12 +-
 arch/x86/kernel/reboot.c |   5 +
 arch/x86/kernel/relocate_kernel_64.S |   5 +
 arch/x86/kernel/x86_init.c   |   4 +-
 arch/x86/mm/mem_encrypt_amd.c|   8 +-
 arch/x86/mm/pat/set_memory.c |  17 +-
 include/acpi/actbl2.h|  19 +-
 include/linux/cc_platform.h  |  10 -
 include/linux/cpu.h  |   2 +
 kernel/cpu.c |  12 +-
 24 files changed, 586 insertions(+), 144 deletions(-)
 create mode 100644 arch/x86/coco/tdx/kexec.c
 create mode 100644 arch/x86/kernel/acpi/madt.S
 create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

-- 
2.41.0




[PATCHv2 08/13] x86/tdx: Account shared memory

2023-10-20 Thread Kirill A. Shutemov
The kernel will convert all shared memory back to private during kexec.
The direct mapping page tables will provide information on which memory
is shared.

It is extremely important to convert all shared memory. If a page is
missed, it will cause the second kernel to crash when it accesses it.

Keep track of the number of shared pages. This will allow for
cross-checking against the shared information in the direct mapping and
reporting if the shared bit is lost.

Include a debugfs interface that allows for the check to be performed at
any point.
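
With debugfs mounted, that means the check can be triggered at any time by
reading the new file, e.g. /sys/kernel/debug/x86/tdx_shared_memory (assuming
the usual debugfs mount point; arch_debugfs_dir maps to the x86 directory
there).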

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/coco/tdx/tdx.c | 69 +
 1 file changed, 69 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 8897d3cc8a15..587bdeb88afa 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -5,6 +5,7 @@
 #define pr_fmt(fmt) "tdx: " fmt
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -37,6 +38,13 @@
 
 #define TDREPORT_SUBTYPE_0 0
 
+static atomic_long_t nr_shared;
+
+static inline bool pte_decrypted(pte_t pte)
+{
+   return cc_mkdec(pte_val(pte)) == pte_val(pte);
+}
+
 /* Called from __tdx_hypercall() for unrecoverable failure */
 noinstr void __noreturn __tdx_hypercall_failed(void)
 {
@@ -799,6 +807,11 @@ static int tdx_enc_status_change_finish(unsigned long 
vaddr, int numpages,
if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
return -EIO;
 
+   if (enc)
+   atomic_long_sub(numpages, &nr_shared);
+   else
+   atomic_long_add(numpages, &nr_shared);
+
return 0;
 }
 
@@ -874,3 +887,59 @@ void __init tdx_early_init(void)
 
pr_info("Guest detected\n");
 }
+
+#ifdef CONFIG_DEBUG_FS
+static int tdx_shared_memory_show(struct seq_file *m, void *p)
+{
+   unsigned long addr, end;
+   unsigned long found = 0;
+
+   addr = PAGE_OFFSET;
+   end  = PAGE_OFFSET + get_max_mapped();
+
+   while (addr < end) {
+   unsigned long size;
+   unsigned int level;
+   pte_t *pte;
+
+   pte = lookup_address(addr, &level);
+   size = page_level_size(level);
+
+   if (pte && pte_decrypted(*pte))
+   found += size / PAGE_SIZE;
+
+   addr += size;
+
+   cond_resched();
+   }
+
+   seq_printf(m, "Number of unshared pages in kernel page tables:  %16lu\n",
+  found);
+   seq_printf(m, "Number of pages accounted as unshared:           %16ld\n",
+  atomic_long_read(&nr_shared));
+   return 0;
+}
+
+static int tdx_shared_memory_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, tdx_shared_memory_show, NULL);
+}
+
+static const struct file_operations tdx_shared_memory_fops = {
+   .open   = tdx_shared_memory_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
+static __init int debug_tdx_shared_memory(void)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+   return 0;
+
+   debugfs_create_file("tdx_shared_memory", S_IRUSR, arch_debugfs_dir,
+   NULL, &tdx_shared_memory_fops);
+   return 0;
+}
+fs_initcall(debug_tdx_shared_memory);
+#endif
-- 
2.41.0




Re: [PATCHv2 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled

2023-10-20 Thread Sean Christopherson
On Fri, Oct 20, 2023, Kirill A. Shutemov wrote:
> kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> present in the VM. It leads to write to a MSR that doesn't exist on some
> configurations, namely in TDX guest:
> 
>   unchecked MSR access error: WRMSR to 0x12 (tried to write 
> 0x)
>   at rIP: 0x8110687c (kvmclock_disable+0x1c/0x30)
> 
> kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> features.
> 
> Do not disable kvmclock if it was not enabled.
> 
> Signed-off-by: Kirill A. Shutemov 
> Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> Cc: Paolo Bonzini 
> Cc: Wanpeng Li 
> Cc: Vitaly Kuznetsov 
> Cc: Sean Christopherson 
> ---

Reviewed-by: Sean Christopherson 



[PATCHv2 03/13] cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup

2023-10-20 Thread Kirill A. Shutemov
ACPI MADT doesn't allow a CPU to be offlined after it has been woken up.

Currently, hotplug is prevented based on the confidential computing
attribute, which is set for Intel TDX. But TDX is not the only possible
user of the wakeup method.

Disable CPU offlining on ACPI MADT wakeup enumeration.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/coco/core.c   |  1 -
 arch/x86/kernel/acpi/madt_wakeup.c |  3 +++
 include/linux/cc_platform.h| 10 --
 kernel/cpu.c   |  3 +--
 4 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index eeec9986570e..f07c3bb7deab 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -20,7 +20,6 @@ static bool noinstr intel_cc_platform_has(enum cc_attr attr)
 {
switch (attr) {
case CC_ATTR_GUEST_UNROLL_STRING_IO:
-   case CC_ATTR_HOTPLUG_DISABLED:
case CC_ATTR_GUEST_MEM_ENCRYPT:
case CC_ATTR_MEM_ENCRYPT:
return true;
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c 
b/arch/x86/kernel/acpi/madt_wakeup.c
index 58cdfc0b6c0a..4bc1d5106afd 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -1,4 +1,5 @@
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -75,6 +76,8 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers 
*header,
 
acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
 
+   cpu_hotplug_disable_offlining();
+
apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
 
return 0;
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index cb0d6cd1c12f..d08dd65b5c43 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -80,16 +80,6 @@ enum cc_attr {
 * using AMD SEV-SNP features.
 */
CC_ATTR_GUEST_SEV_SNP,
-
-   /**
-* @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
-*
-* The platform/OS is running as a guest/virtual machine does not
-* support CPU hotplug feature.
-*
-* Examples include TDX Guest.
-*/
-   CC_ATTR_HOTPLUG_DISABLED,
 };
 
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
diff --git a/kernel/cpu.c b/kernel/cpu.c
index faebee0050a2..51c258f289ac 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1517,8 +1517,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum 
cpuhp_state target)
 * If the platform does not support hotplug, report it explicitly to
 * differentiate it from a transient offlining failure.
 */
-   if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) ||
-   cpu_hotplug_offline_disabled)
+   if (cpu_hotplug_offline_disabled)
return -EOPNOTSUPP;
if (cpu_hotplug_disabled)
return -EBUSY;
-- 
2.41.0




[PATCHv2 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled

2023-10-20 Thread Kirill A. Shutemov
kvm_guest_cpu_offline() tries to disable kvmclock regardless of whether it
is present in the VM. This leads to a write to an MSR that doesn't exist in
some configurations, namely in a TDX guest:

unchecked MSR access error: WRMSR to 0x12 (tried to write 
0x)
at rIP: 0x8110687c (kvmclock_disable+0x1c/0x30)

kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
features.

Do not disable kvmclock if it was not enabled.

Signed-off-by: Kirill A. Shutemov 
Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
Cc: Paolo Bonzini 
Cc: Wanpeng Li 
Cc: Vitaly Kuznetsov 
Cc: Sean Christopherson 
---
 arch/x86/kernel/kvmclock.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f52149be9..f2fff625576d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -24,8 +24,8 @@
 
 static int kvmclock __initdata = 1;
 static int kvmclock_vsyscall __initdata = 1;
-static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
-static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
+static int msr_kvm_system_time __ro_after_init;
+static int msr_kvm_wall_clock __ro_after_init;
 static u64 kvm_sched_clock_offset __ro_after_init;
 
 static int __init parse_no_kvmclock(char *arg)
@@ -195,7 +195,8 @@ static void kvm_setup_secondary_clock(void)
 
 void kvmclock_disable(void)
 {
-   native_write_msr(msr_kvm_system_time, 0, 0);
+   if (msr_kvm_system_time)
+   native_write_msr(msr_kvm_system_time, 0, 0);
 }
 
 static void __init kvmclock_init_mem(void)
@@ -294,7 +295,10 @@ void __init kvmclock_init(void)
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
-   } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
+   } else if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
+   msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+   msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+   } else {
return;
}
 
-- 
2.41.0




Re: [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup

2023-10-20 Thread kirill.shute...@linux.intel.com
On Fri, Oct 20, 2023 at 11:58:58AM +, Huang, Kai wrote:
> On Tue, 2023-10-10 at 10:24 +, Huang, Kai wrote:
> > >  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> > > @@ -74,6 +75,9 @@ int __init acpi_parse_mp_wake(union 
> > > acpi_subtable_headers *header,
> > >  
> > > 
> > >   acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
> > >  
> > > 
> > > + /* Disable CPU onlining/offlining */
> > > + cpu_hotplug_not_supported();
> > > +
> > 
> > Both onlining/offlining are prevented, or just offlining?
> > 
> > The previous patch says:
> > 
> > It does not prevent the initial bring up of the CPU, but it stops 
> > subsequent offlining.
> > 
> > And ...
> > 
> > [...]
> > 
> > 
> > > --- a/kernel/cpu.c
> > > +++ b/kernel/cpu.c
> > > @@ -1522,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, 
> > > enum cpuhp_state target)
> > >    * If the platform does not support hotplug, report it explicitly to
> > >    * differentiate it from a transient offlining failure.
> > >    */
> > > - if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
> > > + if (!cpu_hotplug_supported)
> > >   return -EOPNOTSUPP;
> > >   if (cpu_hotplug_disabled)
> > >   return -EBUSY;
> > 
> > ... here cpu_down_maps_locked() only prevents offlining if I am reading
> > correctly.
> > 
> > Also, can we rename cpu_hotplug_supported to cpu_offline_supported to match 
> > the
> > behaviour better?
> > 
> > Anyway, isn't it a little bit odd to have:
> > 
> > if (!cpu_hotplug_supported)
> >     return -EOPNOTSUPP;
> >     if (cpu_hotplug_disabled)
> >     return -EBUSY;
> > 
> > ?
> 
> I probably have missed something important, but I don't quite understand 
> what's
> the reason to have the CC_ATTR_HOTPLUG_DISABLED at the beginning, and now
> replace it with cpu_hotplug_not_supported()?

CC_ATTR_HOTPLUG_DISABLED was a mistake. It is now obvious that we only need
to disable offlining dynamically, based on the supported MADT MP WP version.

> From the changelog:
> 
>   Currently hotplug prevented based on the confidential computing
>   attribute which is set for Intel TDX. But TDX is not the only possible
>   user of the wake up method.
> 
> CC_ATTR_HOTPLUG_DISABLED is only used by TDX guest, but MADT can be used by 
> non-
> TDX guest too.
> 
> Anyway, if the purpose is just to prevent CPU from going offline, can this be
> done by registering a cpuhp callback?
> 
>   static int madt_wakeup_offline_cpu(unsigned int cpu)
>   {
>   return -EOPNOTSUPP;
>   }
> 
>   ...
> 
>   err = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "madt-wakeup",
>   NULL, madt_wakeup_offline_cpu);
>   if (err) {
>   pr_err("Register CPU hotplug callback failed.\n");
>   /* BUG() ??? */
>   }
> 
> This doesn't pollute the common CPU hotplug code, thus to me looks more clear?

Thomas seems fine with cpu_hotplug_disable_offlining().

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCHv2 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled

2023-10-20 Thread Vitaly Kuznetsov
"Kirill A. Shutemov"  writes:

> kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> present in the VM. It leads to write to a MSR that doesn't exist on some
> configurations, namely in TDX guest:
>
>   unchecked MSR access error: WRMSR to 0x12 (tried to write 
> 0x)
>   at rIP: 0x8110687c (kvmclock_disable+0x1c/0x30)
>
> kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> features.
>
> Do not disable kvmclock if it was not enabled.
>
> Signed-off-by: Kirill A. Shutemov 
> Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> Cc: Paolo Bonzini 
> Cc: Wanpeng Li 
> Cc: Vitaly Kuznetsov 
> Cc: Sean Christopherson 
> ---
>  arch/x86/kernel/kvmclock.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index fb8f52149be9..f2fff625576d 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -24,8 +24,8 @@
>  
>  static int kvmclock __initdata = 1;
>  static int kvmclock_vsyscall __initdata = 1;
> -static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
> -static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
> +static int msr_kvm_system_time __ro_after_init;
> +static int msr_kvm_wall_clock __ro_after_init;
>  static u64 kvm_sched_clock_offset __ro_after_init;
>  
>  static int __init parse_no_kvmclock(char *arg)
> @@ -195,7 +195,8 @@ static void kvm_setup_secondary_clock(void)
>  
>  void kvmclock_disable(void)
>  {
> - native_write_msr(msr_kvm_system_time, 0, 0);
> + if (msr_kvm_system_time)
> + native_write_msr(msr_kvm_system_time, 0, 0);
>  }
>  
>  static void __init kvmclock_init_mem(void)
> @@ -294,7 +295,10 @@ void __init kvmclock_init(void)
>   if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
>   msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
>   msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
> - } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
> + } else if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
> + msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
> + msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
> + } else {
>   return;
>   }

This should work, so

Reviewed-by: Vitaly Kuznetsov 

but my personal preference would be to change kvm_guest_cpu_offline()
to check KVM features explicitly instead of checking MSRs against '0'
at least because it already does so for other features. Completely
untested:

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b8ab9ee5896c..1ee49c98e70a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -454,7 +454,9 @@ static void kvm_guest_cpu_offline(bool shutdown)
kvm_pv_disable_apf();
if (!shutdown)
apf_task_wake_all();
-   kvmclock_disable();
+   if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2) ||
+   kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE))
+   kvmclock_disable();
 }

-- 
Vitaly




Re: [PATCHv2 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file

2023-10-20 Thread Kuppuswamy Sathyanarayanan



On 10/20/2023 8:12 AM, Kirill A. Shutemov wrote:
> In order to prepare for the expansion of support for the ACPI MADT
> wakeup method, move the relevant code into a separate file.
> 
> Introduce a new configuration option to clearly indicate dependencies
> without the use of ifdefs.
> 
> There have been no functional changes.
> 
> Signed-off-by: Kirill A. Shutemov 
> ---

Looks good to me.

Reviewed-by: Kuppuswamy Sathyanarayanan 



>  arch/x86/Kconfig   |  7 +++
>  arch/x86/include/asm/acpi.h|  5 ++
>  arch/x86/kernel/acpi/Makefile  | 11 ++--
>  arch/x86/kernel/acpi/boot.c| 86 +-
>  arch/x86/kernel/acpi/madt_wakeup.c | 81 
>  5 files changed, 100 insertions(+), 90 deletions(-)
>  create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 799102f4d909..9957a73bb386 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
>   depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || 
> PCI_MSI
>   select IRQ_DOMAIN_HIERARCHY
>  
> +config X86_ACPI_MADT_WAKEUP
> + def_bool y
> + depends on X86_64
> + depends on ACPI
> + depends on SMP
> + depends on X86_LOCAL_APIC
> +
>  config X86_IO_APIC
>   def_bool y
>   depends on X86_LOCAL_APIC || X86_UP_IOAPIC
> diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> index c8a7fc23f63c..b536b5a6a57b 100644
> --- a/arch/x86/include/asm/acpi.h
> +++ b/arch/x86/include/asm/acpi.h
> @@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
>  
>  #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
>  
> +union acpi_subtable_headers;
> +
> +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> +   const unsigned long end);
> +
>  /*
>   * Check if the CPU can handle C2 and deeper
>   */
> diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
> index fc17b3f136fe..8c7329c88a75 100644
> --- a/arch/x86/kernel/acpi/Makefile
> +++ b/arch/x86/kernel/acpi/Makefile
> @@ -1,11 +1,12 @@
>  # SPDX-License-Identifier: GPL-2.0
>  
> -obj-$(CONFIG_ACPI)   += boot.o
> -obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o
> -obj-$(CONFIG_ACPI_APEI)  += apei.o
> -obj-$(CONFIG_ACPI_CPPC_LIB)  += cppc.o
> +obj-$(CONFIG_ACPI)   += boot.o
> +obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o
> +obj-$(CONFIG_ACPI_APEI)  += apei.o
> +obj-$(CONFIG_ACPI_CPPC_LIB)  += cppc.o
> +obj-$(CONFIG_X86_ACPI_MADT_WAKEUP)   += madt_wakeup.o
>  
>  ifneq ($(CONFIG_ACPI_PROCESSOR),)
> -obj-y+= cstate.o
> +obj-y+= cstate.o
>  endif
>  
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 2a0ea38955df..111bd226ad99 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -66,13 +66,6 @@ static u64 acpi_lapic_addr __initdata = 
> APIC_DEFAULT_PHYS_BASE;
>  static bool acpi_support_online_capable;
>  #endif
>  
> -#ifdef CONFIG_X86_64
> -/* Physical address of the Multiprocessor Wakeup Structure mailbox */
> -static u64 acpi_mp_wake_mailbox_paddr;
> -/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> -static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> -#endif
> -
>  #ifdef CONFIG_X86_IO_APIC
>  /*
>   * Locks related to IOAPIC hotplug
> @@ -357,60 +350,6 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * 
> header, const unsigned long e
>  
>   return 0;
>  }
> -
> -#ifdef CONFIG_X86_64
> -static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
> -{
> - /*
> -  * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
> -  *
> -  * Wakeup of secondary CPUs is fully serialized in the core code.
> -  * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
> -  */
> - if (!acpi_mp_wake_mailbox) {
> - acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
> - sizeof(*acpi_mp_wake_mailbox),
> - MEMREMAP_WB);
> - }
> -
> - /*
> -  * Mailbox memory is shared between the firmware and OS. Firmware will
> -  * listen on mailbox command address, and once it receives the wakeup
> -  * command, the CPU associated with the given apicid will be booted.
> -  *
> -  * The value of 'apic_id' and 'wakeup_vector' must be visible to the
> -  * firmware before the wakeup command is visible.  smp_store_release()
> -  * ensures ordering and visibility.
> -  */
> - acpi_mp_wake_mailbox->apic_id   = apicid;
> - acpi_mp_wake_mailbox->wakeup_vector = start_ip;
> - smp_store_release(&acpi_mp_wake_mailbox->command,
> -  

Re: [PATCHv2 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled

2023-10-20 Thread Sean Christopherson
On Fri, Oct 20, 2023, Vitaly Kuznetsov wrote:
> > ---
> >  arch/x86/kernel/kvmclock.c | 12 
> >  1 file changed, 8 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index fb8f52149be9..f2fff625576d 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> > @@ -24,8 +24,8 @@
> >  
> >  static int kvmclock __initdata = 1;
> >  static int kvmclock_vsyscall __initdata = 1;
> > -static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
> > -static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
> > +static int msr_kvm_system_time __ro_after_init;
> > +static int msr_kvm_wall_clock __ro_after_init;
> >  static u64 kvm_sched_clock_offset __ro_after_init;
> >  
> >  static int __init parse_no_kvmclock(char *arg)
> > @@ -195,7 +195,8 @@ static void kvm_setup_secondary_clock(void)
> >  
> >  void kvmclock_disable(void)
> >  {
> > -   native_write_msr(msr_kvm_system_time, 0, 0);
> > +   if (msr_kvm_system_time)
> > +   native_write_msr(msr_kvm_system_time, 0, 0);
> >  }
> >  
> >  static void __init kvmclock_init_mem(void)
> > @@ -294,7 +295,10 @@ void __init kvmclock_init(void)
> > if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
> > msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
> > msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
> > -   } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
> > +   } else if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
> > +   msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
> > +   msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
> > +   } else {
> > return;
> > }
> 
> This should work, so
> 
> Reviewed-by: Vitaly Kuznetsov 
> 
> but my personal preference would be to change kvm_guest_cpu_offline()
> to check KVM features explicitly instead of checking MSRs against '0'
> at least because it already does so for other features. Completely
> untested:
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index b8ab9ee5896c..1ee49c98e70a 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -454,7 +454,9 @@ static void kvm_guest_cpu_offline(bool shutdown)
> kvm_pv_disable_apf();
> if (!shutdown)
> apf_task_wake_all();
> -   kvmclock_disable();
> +   if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2) ||
> +   kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE))
> +   kvmclock_disable();
>  }

That would result in an unnecessary WRMSR in the case where kvmclock is disabled
on the command line.  It _should_ be benign given how the code is written, but
it's not impossible to imagine a scenario where someone disabled kvmclock in the
guest because of a hypervisor bug.  And the WRMSR would become a bogus write to
MSR 0x0 if someone made a "cleanup" to set msr_kvm_system_time if and only if
kvmclock is actually used, e.g. if someone made Kirill's change sans the check 
in
kvmclock_disable().



Re: [PATCH v2 7/7] ima: record log size at kexec load and execute

2023-10-20 Thread Tushar Sugandhi




On 10/12/23 17:27, Stefan Berger wrote:


On 10/5/23 14:26, Tushar Sugandhi wrote:

The window between kexec 'load' and 'execute' could be arbitrarily long.
Even with the large chunk of memory allocated at kexec 'load', it may
run out which would result in missing events in IMA log after the system
soft reboots to the new Kernel.  This would result in IMA measurements
getting out of sync with the TPM PCR quotes which would result in remote
attestation failing for that system.  There is currently no way for the
new Kernel to know if the IMA log TPM PCR quote out of sync problem is
because of the missing measurements during kexec.

Define two new IMA events, 'kexec_load' and 'kexec_execute', to be
measured at kexec 'load' and 'execute' respectively.

IMA measures 'boot_aggregate' as the first event when the system boots -
either cold boot or kexec soft boot.  In case when the system goes
through multiple soft reboots, the number of 'boot_aggregate' events in
IMA log corresponds to total number of boots (cold boot plus multiple
kexec soft reboots).  With this change, there would be additional
'kexec_load' and 'kexec_execute' events in between the two
'boot_aggregate' events.  In rare cases, when the system runs out of
memory during kexec soft reboot, 'kexec_execute' won't be copied since
its one of the very last event measured just before kexec soft reboot.
The absence of the event 'kexec_execute' in between the two
boot_aggregate' events would signal the attestation service that the IMA
log on the system is out of sync with TPM PCR quotes and the system needs
to be cold booted for the remote attestation to succeed again.


@@ -198,6 +208,7 @@ void ima_add_kexec_buffer(struct kimage *image)
  static int ima_update_kexec_buffer(struct notifier_block *self,
 unsigned long action, void *data)
  {
+    char ima_kexec_event[IMA_KEXEC_EVENT_LEN];
  void *buf = NULL;
  size_t buf_size;
  bool resume = false;
@@ -213,9 +224,31 @@ static int ima_update_kexec_buffer(struct 
notifier_block *self,

  return NOTIFY_OK;
  }
+    buf_size = ima_get_binary_runtime_size();
+    scnprintf(ima_kexec_event, IMA_KEXEC_EVENT_LEN,
+  "kexec_segment_size=%lu;ima_binary_runtime_size=%lu;",
+  kexec_segment_size, buf_size);
+
+    /*
+ * This is one of the very last events measured by IMA before kexec
+ * soft rebooting into the new Kernel.
+ * This event can be used as a marker after the system soft reboots
+ * to the new Kernel to check if there was sufficient memory allocated
+ * at kexec 'load' to capture the events measured between the window
+ * of kexec 'load' and 'execute'.
+ * This event needs to be present in the IMA log, in between the two
+ * 'boot_aggregate' events that are logged for the previous boot and
+ * the current soft reboot. If it is not present after the system soft
+ * reboots into the new Kernel, it would mean the IMA log is not
+ * consistent with the TPM PCR quotes, and the system needs to be
+ * cold-booted for the attestation to succeed again.
+ */
+    ima_measure_critical_data("ima_kexec", "kexec_execute",
+  ima_kexec_event, strlen(ima_kexec_event),
+  false, NULL, 0);
+
  ima_measurements_suspend();
-    buf_size = ima_get_binary_runtime_size();


This should be removed earlier, in 2/7.




Agreed. Will do.

~Tushar


   ret = ima_dump_measurement_list(&buf_size, &buf,
  kexec_segment_size);


Re: [PATCH v2 6/7] ima: make the memory for events between kexec load and exec configurable

2023-10-20 Thread Stefan Berger

On 10/20/23 16:39, Tushar Sugandhi wrote:




On 10/12/23 17:27, Stefan Berger wrote:


On 10/5/23 14:26, Tushar Sugandhi wrote:
IMA currently allocates half a PAGE_SIZE for the extra events that would
be measured between kexec 'load' and 'execute'.  Depending on the IMA
policy and the system state, that memory may not be sufficient to hold
the extra IMA events measured after kexec 'load'.  The memory
requirements vary from system to system and they should be 
configurable.


Define a Kconfig option, IMA_KEXEC_EXTRA_PAGES, to configure the number
of extra pages to be allocated for IMA measurements added in the window
from kexec 'load' to kexec 'execute'.

Update ima_add_kexec_buffer() function to allocate memory based on the
Kconfig option value, rather than the currently hardcoded one.

Signed-off-by: Tushar Sugandhi
---
  security/integrity/ima/Kconfig |  9 +
  security/integrity/ima/ima_kexec.c | 13 -
  2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/security/integrity/ima/Kconfig 
b/security/integrity/ima/Kconfig

index 60a511c6b583..1b55cd2bcb36 100644
--- a/security/integrity/ima/Kconfig
+++ b/security/integrity/ima/Kconfig
@@ -338,3 +338,12 @@ config IMA_DISABLE_HTABLE
  default n
  help
 This option disables htable to allow measurement of 
duplicate records.

+
+config IMA_KEXEC_EXTRA_PAGES
+    int
+    depends on IMA && IMA_KEXEC
+    default 16
+    help
+  IMA_KEXEC_EXTRA_PAGES determines the number of extra
+  pages to be allocated for IMA measurements added in the
+  window from kexec 'load' to kexec 'execute'.



On ppc64 a page is 64kb. I would ask for additional kb here.



Not sure I understand.  Do you mean I should make the default value of
the config IMA_KEXEC_EXTRA_PAGES 64 or something?



No, what I mean is you should ask the user for how many extra kilobytes 
(kb) to allocate - not ask for pages.



   Stefan



In code, I multiply the config value by PAGE_SIZE.  So more memory
would be allocated on ppc64 for a given default config value. Could you
please clarify what change you are suggesting here?


+    binary_runtime_size = ima_get_binary_runtime_size() +
+   sizeof(struct ima_kexec_hdr) +
+   (CONFIG_IMA_KEXEC_EXTRA_PAGES *
PAGE_SIZE);

~Tushar

diff --git a/security/integrity/ima/ima_kexec.c 
b/security/integrity/ima/ima_kexec.c

index 13fbbb90319b..6cd5f46a7208 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -150,15 +150,18 @@ void ima_add_kexec_buffer(struct kimage *image)
  int ret;
  /*
- * Reserve an extra half page of memory for additional 
measurements

- * added during the kexec load.
+ * Reserve extra memory for measurements added in the window from
+ * kexec 'load' to kexec 'execute'.
   */
-    binary_runtime_size = ima_get_binary_runtime_size();
+    binary_runtime_size = ima_get_binary_runtime_size() +
+  sizeof(struct ima_kexec_hdr) +
+  (CONFIG_IMA_KEXEC_EXTRA_PAGES * PAGE_SIZE);
+
  if (binary_runtime_size >= ULONG_MAX - PAGE_SIZE)
  kexec_segment_size = ULONG_MAX;
  else
-    kexec_segment_size = ALIGN(ima_get_binary_runtime_size() +
-   PAGE_SIZE / 2, PAGE_SIZE);
+    kexec_segment_size = ALIGN(binary_runtime_size, PAGE_SIZE);
+
  if ((kexec_segment_size == ULONG_MAX) ||
  ((kexec_segment_size >> PAGE_SHIFT) > totalram_pages() / 
2)) {

  pr_err("Binary measurement list too large.\n");


Re: [PATCH v2 1/7] ima: refactor ima_dump_measurement_list to move memory allocation to a separate function

2023-10-20 Thread Tushar Sugandhi

Thanks a lot Stefan for reviewing this series.
Really appreciate it.

On 10/12/23 17:28, Stefan Berger wrote:


On 10/5/23 14:25, Tushar Sugandhi wrote:

IMA allocates memory and dumps the measurement during kexec soft reboot
as a single function call ima_dump_measurement_list().  It gets called
during kexec 'load' operation.  It results in the IMA measurements
between the window of kexec 'load' and 'execute' getting dropped when the
system boots into the new Kernel.  One of the kexec requirements is the
segment size cannot change between the 'load' and the 'execute'.
Therefore, to address this problem, ima_dump_measurement_list() needs
to be refactored to allocate the memory at kexec 'load', and dump the
measurements at kexec 'execute'.  The function that allocates the memory
should handle the scenario where the kexec load is called multiple times.

Refactor ima_dump_measurement_list() to move the memory allocation part
to a separate function ima_alloc_kexec_buf() to allocate buffer of size
'kexec_segment_size' at kexec 'load'.  Make the local variables in
function ima_dump_measurement_list() global, so that they can be accessed
from ima_alloc_kexec_buf().  Make necessary changes to the function
ima_add_kexec_buffer() to call the above two functions.

Signed-off-by: Tushar Sugandhi
---
  security/integrity/ima/ima_kexec.c | 126 +
  1 file changed, 93 insertions(+), 33 deletions(-)

diff --git a/security/integrity/ima/ima_kexec.c 
b/security/integrity/ima/ima_kexec.c

index 419dc405c831..307e07991865 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -15,61 +15,114 @@
  #include "ima.h"
  #ifdef CONFIG_IMA_KEXEC
+struct seq_file ima_kexec_file;
+struct ima_kexec_hdr ima_khdr;


Since you are only populating the buffer at kexec 'execute' time, you 
should be able to move this header into the function where it is needed.



Yup, ima_khdr doesn't need to be static. Will fix.



+
+void ima_clear_kexec_file(void)
+{
+    vfree(ima_kexec_file.buf);
I would pass the ima_kexec_file onto this function here as a parameter 
rather than accessing the file-static variable. I find this better to 
read once you look at ima_clear_kexec_file() further below and wonder 
why it doesn't take a parameter.

Agreed. This will make the code more readable.


+    ima_kexec_file.buf = NULL;
+    ima_kexec_file.size = 0;
+    ima_kexec_file.read_pos = 0;
+    ima_kexec_file.count = 0;
+}
+
+static int ima_alloc_kexec_buf(size_t kexec_segment_size)


Call it segment size to avoid the later kexec_segment_size static 
variable in this file.




Will do.

+{
+    if ((kexec_segment_size == 0) ||
+    (kexec_segment_size == ULONG_MAX) ||
+    ((kexec_segment_size >> PAGE_SHIFT) > totalram_pages() / 2)) {


These tests are already done before ima_alloca_kexec_buf() is called. 
Also, kexec_segment_size cannot be 0.




I was being extra cautious here. If ima_alloc_kexec_buf() gets called
from some other place, then this check would be handy. That being said,
maybe this function is a better place to have this check. I will see
what I can do here to simplify things. Thanks for the feedback.


+    pr_err("%s: Invalid segment size for kexec: %zu\n",
+    __func__, kexec_segment_size);
+    return -EINVAL;
+    }
+
+    /*
+ * If kexec load was called before, clear the existing buffer
+ * before allocating a new one
+ */
+    if (ima_kexec_file.buf)
+    ima_clear_kexec_file();
+


ima_clear_file(&ima_kexec_file);



Agreed. Will do.


+    /* segment size can't change between kexec load and execute */
+    ima_kexec_file.buf = vmalloc(kexec_segment_size);
+    if (!ima_kexec_file.buf) {
+    pr_err("%s: No memory for ima kexec measurement buffer\n",
+    __func__);
+    return -ENOMEM;
+    }
+
+    ima_kexec_file.size = kexec_segment_size;
+    ima_kexec_file.read_pos = 0;
+    ima_kexec_file.count = sizeof(ima_khdr);    /* reserved space */
+
+    memset(&ima_khdr, 0, sizeof(ima_khdr));
+    ima_khdr.version = 1;


Move this into ima_dump_measurement_list() since it's only used there 
once and getting rid of this file-static variable is a plus.




Yes. This will get rid of an extra static variable.


+
+    return 0;
+}
+
  static int ima_dump_measurement_list(unsigned long *buffer_size, 
void **buffer,

   unsigned long segment_size)
  {
  struct ima_queue_entry *qe;
-    struct seq_file file;
-    struct ima_kexec_hdr khdr;

Don't remove it from here.

Agreed. khdr doesn't need to be static.


  int ret = 0;
-    /* segment size can't change between kexec load and execute */
-    file.buf = vmalloc(segment_size);
-    if (!file.buf) {
-    ret = -ENOMEM;
-    goto out;
+    if (!ima_kexec_file.buf) {
+    pr_err("%s: Kexec file buf not allocated\n",
+    __func__);
+    return -EINVAL;
  }
-    file.size = segment_size;
-    file.read_pos = 0;
-    file.count = 

Re: [PATCH v2 6/7] ima: make the memory for events between kexec load and exec configurable

2023-10-20 Thread Tushar Sugandhi




On 10/20/23 14:16, Stefan Berger wrote:
No, what I mean is you should ask the user for how many extra kilobytes 
(kb) to allocate - not ask for pages.



    Stefan

Ok. Will do.
I will align the input config value to the PAGE_SIZE as well.
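For illustration, here is a rough sketch of what I have in mind, assuming a
hypothetical CONFIG_IMA_KEXEC_EXTRA_MEMORY_KB option (final name and default
still to be decided), with the value given in kilobytes and the segment size
still rounded up to a page:

	/*
	 * Sketch only: CONFIG_IMA_KEXEC_EXTRA_MEMORY_KB is a placeholder
	 * Kconfig name; its value is in kilobytes.
	 */
	binary_runtime_size = ima_get_binary_runtime_size() +
			      sizeof(struct ima_kexec_hdr) +
			      CONFIG_IMA_KEXEC_EXTRA_MEMORY_KB * 1024;

	/* kexec segments are page sized, so round the total up to a page. */
	kexec_segment_size = ALIGN(binary_runtime_size, PAGE_SIZE);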


Re: [PATCH v2 2/7] ima: move ima_dump_measurement_list call from kexec load to execute

2023-10-20 Thread Tushar Sugandhi




On 10/12/23 17:28, Stefan Berger wrote:


On 10/5/23 14:25, Tushar Sugandhi wrote:

In the current IMA implementation, ima_dump_measurement_list() is called
during the kexec 'load' operation.  This can result in loss of IMA
measurements taken between the 'load' and 'execute' phases when the
system goes through Kexec soft reboot to a new Kernel.  The call to the
function ima_dump_measurement_list() needs to be moved out of the
function ima_add_kexec_buffer() and needs to be called during the kexec
'execute' operation.

Implement a function ima_update_kexec_buffer() that is called during
kexec 'execute', allowing IMA to update the measurement list with the
events between kexec 'load' and 'execute'.  Move the
ima_dump_measurement_list() call from ima_add_kexec_buffer() to
ima_update_kexec_buffer().  Make ima_kexec_buffer and kexec_segment_size
variables global, so that they can be accessed during both kexec 'load'
and 'execute'.  Add functions ima_measurements_suspend() and
ima_measurements_resume() to set and reset the 'suspend_ima_measurements'
variable respectively, to suspend/resume IMA measurements.  Use
the existing 'ima_extend_list_mutex' to ensure that the operations are
thread-safe.  These function calls will help maintain the integrity
of the IMA log while it is being copied to the new Kernel's buffer.
Add a reboot notifier_block 'update_buffer_nb' to ensure
the function ima_update_kexec_buffer() gets called during kexec
soft-reboot.

Signed-off-by: Tushar Sugandhi
---
  security/integrity/ima/ima.h   |  2 ++
  security/integrity/ima/ima_kexec.c | 58 +-
  security/integrity/ima/ima_queue.c | 18 ++
  3 files changed, 69 insertions(+), 9 deletions(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index c29db699c996..49a6047dd8eb 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -161,6 +161,8 @@ bool ima_template_has_modsig(const struct 
ima_template_desc *ima_template);

  int ima_restore_measurement_entry(struct ima_template_entry *entry);
  int ima_restore_measurement_list(loff_t bufsize, void *buf);
  int ima_measurements_show(struct seq_file *m, void *v);
+void ima_measurements_suspend(void);
+void ima_measurements_resume(void);
  unsigned long ima_get_binary_runtime_size(void);
  int ima_init_template(void);
  void ima_init_template_list(void);
diff --git a/security/integrity/ima/ima_kexec.c 
b/security/integrity/ima/ima_kexec.c

index 307e07991865..2c11bbe6efef 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -17,6 +17,8 @@
  #ifdef CONFIG_IMA_KEXEC
  struct seq_file ima_kexec_file;
  struct ima_kexec_hdr ima_khdr;
+static void *ima_kexec_buffer;
+static size_t kexec_segment_size;
  void ima_clear_kexec_file(void)
  {
@@ -142,7 +144,6 @@ void ima_add_kexec_buffer(struct kimage *image)
  /* use more understandable variable names than defined in kbuf */
  void *kexec_buffer = NULL;
  size_t kexec_buffer_size;
-    size_t kexec_segment_size;
  int ret;
  /*
@@ -167,14 +168,6 @@ void ima_add_kexec_buffer(struct kimage *image)
  return;
  }
-    ret = ima_dump_measurement_list(&kexec_buffer_size, &kexec_buffer,
-    kexec_segment_size);
-    if (ret < 0) {
-    pr_err("%s: Failed to dump IMA measurements. Error:%d.\n",
-   __func__, ret);
-    return;
-    }
-
  kbuf.buffer = kexec_buffer;
  kbuf.bufsz = kexec_buffer_size;
  kbuf.memsz = kexec_segment_size;
@@ -192,6 +185,53 @@ void ima_add_kexec_buffer(struct kimage *image)
  pr_debug("kexec measurement buffer for the loaded kernel at 
0x%lx.\n",

   kbuf.mem);
  }
+
+/*
+ * Called during kexec execute so that IMA can update the measurement 
list.

+ */
+static int ima_update_kexec_buffer(struct notifier_block *self,
+   unsigned long action, void *data)
+{
+    void *buf = NULL;
+    size_t buf_size;
+    bool resume = false;
+    int ret;
+
+    if (!kexec_in_progress) {
+    pr_info("%s: No kexec in progress.\n", __func__);
+    return NOTIFY_OK;
+    }
+
+    if (!ima_kexec_buffer) {
+    pr_err("%s: Kexec buffer not set.\n", __func__);
+    return NOTIFY_OK;
+    }
+
+    ima_measurements_suspend();
+
+    buf_size = ima_get_binary_runtime_size();
There doesn't seem to be a need to call this function and pass the
binary runtime size into the dump function. You should be able to
remove it.

Oh. This was an oversight.
This line is removed in patch 7/7.  It should have been removed here
itself.
Thanks for catching it. Will fix.

+    ret = ima_dump_measurement_list(&buf_size, &buf,
+    kexec_segment_size);
+
+    if (!buf || ret < 0) {
I don't think this function can (or should) ever return ret >= 0 with 
buf == NULL.

I will simplify this check. Thanks.

~Tushar

+    pr_err("%s: Dump measurements failed. Error:%d\n",
+   __func__, ret);
+    resume = true;

Re: [PATCH v2 1/7] ima: refactor ima_dump_measurement_list to move memory allocation to a separate function

2023-10-20 Thread Stefan Berger



On 10/20/23 16:33, Tushar Sugandhi wrote:

Thanks a lot Stefan for reviewing this series.
Really appreciate it.



You are welcome.

What may be a bit problematic is the fact that between the time the
buffer for the flattened IMA log is allocated (kexec 'load') and the
time it is filled (kexec 'exec'), the log may grow quite a bit. I
would say that the size of the growth depends a lot on how people
are using their system, so the additional kilobytes may or may not be
enough. A distro would probably have to specify enough additional bytes
to cover the worst case. But how many kilobytes is the worst case?


   Stefan



Re: [PATCH v2 3/7] ima: kexec: map source pages containing IMA buffer to image post kexec load

2023-10-20 Thread Tushar Sugandhi




On 10/12/23 17:29, Stefan Berger wrote:


On 10/5/23 14:25, Tushar Sugandhi wrote:

Currently, the mechanism to map and unmap segments to the kimage
structure is not available to the subsystems outside of kexec.  This
functionality is needed when IMA is allocating the memory segments
during kexec 'load' operation.

Implement kimage_map_segment() which takes a kimage pointer, an address,
and a size.  Ensure that the entire segment is being mapped by comparing
the given address and size to each segment in the kimage's segment array.
Collect the source pages that correspond to the given address range,
allocate an array of pointers to these pages, and map them to a
contiguous range of virtual addresses.  If the mapping operation is
successful, the function returns the start of this range.  Otherwise, it
frees the page pointer array and returns NULL.

Implement kimage_unmap_segment() that takes a pointer to a segment buffer
and unmaps it using vunmap().

Implement function ima_kexec_post_load(), to be called by IMA after kexec
loads the new Kernel image.  ima_kexec_post_load() would map the IMA
buffer allocated during kexec 'load' to a segment in the loaded image.

Finally, move for_each_kimage_entry() macro from kexec_core.c to kexec.h.

Signed-off-by: Tushar Sugandhi
---
  include/linux/ima.h    |  3 ++
  include/linux/kexec.h  | 13 ++
  kernel/kexec_core.c    | 73 --
  security/integrity/ima/ima_kexec.c | 32 +
  4 files changed, 116 insertions(+), 5 deletions(-)

diff --git a/include/linux/ima.h b/include/linux/ima.h
index 86b57757c7b1..006db20f852d 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -49,6 +49,9 @@ static inline void ima_appraise_parse_cmdline(void) {}
  #ifdef CONFIG_IMA_KEXEC
  extern void ima_add_kexec_buffer(struct kimage *image);
+extern void ima_kexec_post_load(struct kimage *image);
+#else
+static inline void ima_kexec_post_load(struct kimage *image) {}
  #endif
  #else
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 22b5cd24f581..e00b8101b53b 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -490,6 +490,15 @@ static inline int 
arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g
  static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned 
int pages) { }

  #endif
+#define for_each_kimage_entry(image, ptr, entry) \
+    for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+    ptr = (entry & IND_INDIRECTION) ? \
+    boot_phys_to_virt((entry & PAGE_MASK)) : ptr + 1)
+
+extern void *kimage_map_segment(struct kimage *image,
+    unsigned long addr, unsigned long size);
+extern void kimage_unmap_segment(void *buffer);
+
  #else /* !CONFIG_KEXEC_CORE */
  struct pt_regs;
  struct task_struct;
@@ -497,6 +506,10 @@ static inline void __crash_kexec(struct pt_regs 
*regs) { }

  static inline void crash_kexec(struct pt_regs *regs) { }
  static inline int kexec_should_crash(struct task_struct *p) { return 
0; }

  static inline int kexec_crash_loaded(void) { return 0; }
+static inline void *kimage_map_segment(struct kimage *image,
+   unsigned long addr, unsigned long size)
+{ return NULL; }
+static inline void kimage_unmap_segment(void *buffer) { }
  #define kexec_in_progress false
  #endif /* CONFIG_KEXEC_CORE */
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 3d578c6fefee..e01156f3c404 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -594,11 +594,6 @@ void kimage_terminate(struct kimage *image)
  *image->entry = IND_DONE;
  }
-#define for_each_kimage_entry(image, ptr, entry) \
-    for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
-    ptr = (entry & IND_INDIRECTION) ? \
-    boot_phys_to_virt((entry & PAGE_MASK)) : ptr + 1)
-
  static void kimage_free_entry(kimage_entry_t entry)
  {
  struct page *page;
@@ -921,6 +916,74 @@ int kimage_load_segment(struct kimage *image,
  return result;
  }
+void *kimage_map_segment(struct kimage *image,
+ unsigned long addr, unsigned long size)
+{
+    unsigned long eaddr = addr + size;
+    unsigned long src_page_addr, dest_page_addr;
+    struct page **src_pages;
+    int i, npages;
+    kimage_entry_t *ptr, entry;
+    void *vaddr = NULL;
+
+    /*
+ * Make sure that we are mapping a whole segment.
+ */
+    for (i = 0; i < image->nr_segments; i++) {
+    if (addr == image->segment[i].mem &&
+    size == image->segment[i].memsz) {
+    break;
+    }
+    }


I wonder whether this distrust in the segments or in the passed
pointer is justifiable, since you call this function like this:


     ima_kexec_buffer = kimage_map_segment(image,
image->ima_buffer_addr,
image->ima_buffer_size);

and previously kexec_add_buffer() was also called:

     kbuf.buffer = kexec_buffer;
     kbuf.bufsz = kexec_buffer_size;
     kbuf.memsz = kexec_segment_size;
 

Re: [PATCH v2 6/7] ima: make the memory for events between kexec load and exec configurable

2023-10-20 Thread Tushar Sugandhi




On 10/12/23 17:27, Stefan Berger wrote:


On 10/5/23 14:26, Tushar Sugandhi wrote:

IMA currently allocates half a PAGE_SIZE for the extra events that would
be measured between kexec 'load' and 'execute'.  Depending on the IMA
policy and the system state, that memory may not be sufficient to hold
the extra IMA events measured after kexec 'load'.  The memory
requirements vary from system to system and they should be configurable.

Define a Kconfig option, IMA_KEXEC_EXTRA_PAGES, to configure the number
of extra pages to be allocated for IMA measurements added in the window
from kexec 'load' to kexec 'execute'.

Update ima_add_kexec_buffer() function to allocate memory based on the
Kconfig option value, rather than the currently hardcoded one.

Signed-off-by: Tushar Sugandhi
---
  security/integrity/ima/Kconfig |  9 +
  security/integrity/ima/ima_kexec.c | 13 -
  2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/security/integrity/ima/Kconfig 
b/security/integrity/ima/Kconfig

index 60a511c6b583..1b55cd2bcb36 100644
--- a/security/integrity/ima/Kconfig
+++ b/security/integrity/ima/Kconfig
@@ -338,3 +338,12 @@ config IMA_DISABLE_HTABLE
  default n
  help
 This option disables htable to allow measurement of duplicate 
records.

+
+config IMA_KEXEC_EXTRA_PAGES
+    int
+    depends on IMA && IMA_KEXEC
+    default 16
+    help
+  IMA_KEXEC_EXTRA_PAGES determines the number of extra
+  pages to be allocated for IMA measurements added in the
+  window from kexec 'load' to kexec 'execute'.



On ppc64 a page is 64kb. I would ask for additional kb here.



Not sure I understand.  Do you mean I should make the default value of
the config IMA_KEXEC_EXTRA_PAGES 64 or something?

In code, I multiply the config value by PAGE_SIZE.  So more memory
would be allocated on ppc64 for a given default config value. Could you
please clarify what change you are suggesting here?


+binary_runtime_size = ima_get_binary_runtime_size() +
+   sizeof(struct ima_kexec_hdr) +
+   (CONFIG_IMA_KEXEC_EXTRA_PAGES *
PAGE_SIZE);

~Tushar

diff --git a/security/integrity/ima/ima_kexec.c 
b/security/integrity/ima/ima_kexec.c

index 13fbbb90319b..6cd5f46a7208 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -150,15 +150,18 @@ void ima_add_kexec_buffer(struct kimage *image)
  int ret;
  /*
- * Reserve an extra half page of memory for additional measurements
- * added during the kexec load.
+ * Reserve extra memory for measurements added in the window from
+ * kexec 'load' to kexec 'execute'.
   */
-    binary_runtime_size = ima_get_binary_runtime_size();
+    binary_runtime_size = ima_get_binary_runtime_size() +
+  sizeof(struct ima_kexec_hdr) +
+  (CONFIG_IMA_KEXEC_EXTRA_PAGES * PAGE_SIZE);
+
  if (binary_runtime_size >= ULONG_MAX - PAGE_SIZE)
  kexec_segment_size = ULONG_MAX;
  else
-    kexec_segment_size = ALIGN(ima_get_binary_runtime_size() +
-   PAGE_SIZE / 2, PAGE_SIZE);
+    kexec_segment_size = ALIGN(binary_runtime_size, PAGE_SIZE);
+
  if ((kexec_segment_size == ULONG_MAX) ||
  ((kexec_segment_size >> PAGE_SHIFT) > totalram_pages() / 2)) {
  pr_err("Binary measurement list too large.\n");


Re: [PATCH v2 1/7] ima: refactor ima_dump_measurement_list to move memory allocation to a separate function

2023-10-20 Thread Tushar Sugandhi




On 10/20/23 14:21, Stefan Berger wrote:


On 10/20/23 16:33, Tushar Sugandhi wrote:

Thanks a lot Stefan for reviewing this series.
Really appreciate it.



You are welcome.

What may be a bit problematic is the fact that between the time the
buffer for the flattened IMA log is allocated (kexec 'load') and the
time it is filled (kexec 'exec'), the log may grow quite a bit. I
would say that the size of the growth depends a lot on how people
are using their system, so the additional kilobytes may or may not be
enough. A distro would probably have to specify enough additional bytes
to cover the worst case. But how many kilobytes is the worst case?


    Stefan

Yes. It's a genuine concern.

The window between kexec 'load' and 'execute' could be arbitrarily long
(hours, days, months), and the log can grow quite a bit.  There is always
a possibility that it will run out of the extra allocated memory, no
matter how much we allocate at load.

We can never know with certainty how many kilobytes the worst case is,
so I used another approach to address this issue.

I addressed this issue in patch 7/7 of this series[1] by measuring
a marker event ("kexec_execute") just before kexec 'execute'.
Also pasting the code from 7/7 below[1] for reference.

If IMA runs out of the extra allocated memory while copying the events,
this marker event ("kexec_execute") will not be present in the IMA log
when the system boots into the new Kernel.

So the event sequence in IMA log would be as follows:

IMA log

'boot_aggregate' # clean boot
...
... # events before new kexec 'load'
...
'kexec_load'
...
...# arbitrary many more events
...
...
...
'kexec_execute'
#if this 'kexec_execute' event is missing after the
#system kexec soft boots into the new Kernel,
#i.e. between the two boot_aggregate events,
#it can be safely concluded that the IMA log
#ran out of memory in during kexec reboot,
#and now it is out of sync with PCR quotes
#and thus the system needs to be hard rebooted.

'boot_aggregate' # clean boot
...
... # events after kexec soft boot
...

This logic can effectively determine whether the IMA log is out of
sync with the PCR quotes.  If it is, then the remote
attestation service/client can take appropriate action
on the system (a clean boot) to recover.
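To make the attestation-side check concrete, below is a rough, untested
sketch (not part of this series) of how a verifier could walk an already
parsed list of event names and flag the missing marker:

	#include <stdbool.h>
	#include <string.h>

	/*
	 * Verifier-side sketch, not kernel code: 'names' is assumed to be the
	 * ordered list of IMA event names parsed from the measurement log.
	 */
	static bool ima_log_missed_kexec_window(const char **names, int count)
	{
		bool pending_kexec = false;

		for (int i = 0; i < count; i++) {
			if (!strcmp(names[i], "kexec_load"))
				pending_kexec = true;
			else if (!strcmp(names[i], "kexec_execute"))
				pending_kexec = false;	/* marker made it into the log */
			else if (!strcmp(names[i], "boot_aggregate") && pending_kexec)
				return true;	/* soft reboot happened, marker is missing */
		}
		return false;
	}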


Hope this approach makes sense.

~Tushar


[1] [v2,7/7] ima: record log size at kexec load and execute
https://patchwork.kernel.org/project/linux-integrity/patch/20231005182602.634615-8-tusha...@linux.microsoft.com/ 



diff --git a/security/integrity/ima/ima_kexec.c 
b/security/integrity/ima/ima_kexec.c

index 6cd5f46a7208..0f9c424fe808 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -17,6 +17,8 @@
 #include "ima.h"

 #ifdef CONFIG_IMA_KEXEC
+#define IMA_KEXEC_EVENT_LEN 128
+
 struct seq_file ima_kexec_file;
 struct ima_kexec_hdr ima_khdr;
 static void *ima_kexec_buffer;
@@ -34,6 +36,8 @@  void ima_clear_kexec_file(void)

 static int ima_alloc_kexec_buf(size_t kexec_segment_size)
 {
+   char ima_kexec_event[IMA_KEXEC_EVENT_LEN];
+
if ((kexec_segment_size == 0) ||
(kexec_segment_size == ULONG_MAX) ||
((kexec_segment_size >> PAGE_SHIFT) > totalram_pages() / 2)) {
@@ -64,6 +68,12 @@  static int ima_alloc_kexec_buf(size_t 
kexec_segment_size)

	memset(&ima_khdr, 0, sizeof(ima_khdr));
ima_khdr.version = 1;

+   scnprintf(ima_kexec_event, IMA_KEXEC_EVENT_LEN,
+ "kexec_segment_size=%lu;", kexec_segment_size);
+
+   ima_measure_critical_data("ima_kexec", "kexec_load", ima_kexec_event,
+ strlen(ima_kexec_event), false, NULL, 0);
+
return 0;
 }

@@ -198,6 +208,7 @@  void ima_add_kexec_buffer(struct kimage *image)
 static int ima_update_kexec_buffer(struct notifier_block *self,
   unsigned long action, void *data)
 {
+   char ima_kexec_event[IMA_KEXEC_EVENT_LEN];
void *buf = NULL;
size_t buf_size;
bool resume = false;
@@ -213,9 +224,31 @@  static int ima_update_kexec_buffer(struct 
notifier_block *self,

return NOTIFY_OK;
}

+   buf_size = ima_get_binary_runtime_size();
+   scnprintf(ima_kexec_event, IMA_KEXEC_EVENT_LEN,
+ "kexec_segment_size=%lu;ima_binary_runtime_size=%lu;",
+ kexec_segment_size, buf_size);
+
+   /*
+* This is one of the very last events measured by IMA before kexec
+* soft rebooting into the new Kernel.
+* This event can be used as a marker after the system soft reboots
+* to the new Kernel to check if there was sufficient memory allocated
+* at kexec 'load' to capture the events measured between the window
+* of kexec 'load' and 'execute'.
+* This event needs to be present in the IMA log, in between the two
+* 'boot_aggregate' events that are logged for the previous boot and
+* the