Re: [PATCH v9 06/19] x86: Add early SHA-1 support for Secure Launch early measurements

2024-08-15 Thread Thomas Gleixner
On Thu, Aug 15 2024 at 13:38, Daniel P. Smith wrote:
> On 5/31/24 09:54, Eric W. Biederman wrote:
>> Eric Biggers  writes:
>>> That paragraph is also phrased as a hypothetical, "Even if we'd prefer
>>> to use SHA-256-only". That implies that you do not, in fact, prefer
>>> SHA-256 only. Is that the case? Sure, maybe there are situations where
>>> you *have* to use SHA-1, but why would you not at least *prefer* SHA-256?
>> 
>> Yes.  Please prefer to use SHA-256.
>> 
>> Have you considered implementing I think it is SHA1-DC (as git has) that
>> is compatible with SHA1 but blocks the known class of attacks where
>> sha1 is actively broken at this point?
>
> We are using the kernel's implementation; addressing what the kernel
> provides is beyond the scope of our effort. Perhaps someone who is
> interested in improving the kernel's SHA1 could submit a patch
> implementing/replacing it with SHA1-DC, as I am sure the maintainers
> would welcome the help.

Well, someone who is interested in getting his "secure" code merged should
have a vested interest in a non-broken SHA1 implementation if there is a
sensible requirement to use SHA1 in that new "secure" code, no?

Just for the record. The related maintainers can rightfully decide to
reject known broken "secure" code on a purely technical argument.

Thanks,

tglx




Re: [PATCHv7 02/16] x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init

2024-02-23 Thread Thomas Gleixner
On Mon, Feb 12 2024 at 12:44, Kirill A. Shutemov wrote:
> acpi_mp_wake_mailbox_paddr and acpi_mp_wake_mailbox are initialized once
> during ACPI MADT init and never changed.
>
> Signed-off-by: Kirill A. Shutemov 
> Acked-by: Kai Huang 

Reviewed-by: Thomas Gleixner 



Re: [PATCHv7 01/16] x86/acpi: Extract ACPI MADT wakeup code into a separate file

2024-02-23 Thread Thomas Gleixner
On Mon, Feb 12 2024 at 12:44, Kirill A. Shutemov wrote:
> In order to prepare for the expansion of support for the ACPI MADT
> wakeup method, move the relevant code into a separate file.
>
> Introduce a new configuration option to clearly indicate dependencies
> without the use of ifdefs.
>
> There have been no functional changes.
>
> Signed-off-by: Kirill A. Shutemov 
> Reviewed-by: Kuppuswamy Sathyanarayanan 
> 
> Acked-by: Kai Huang 

Reviewed-by: Thomas Gleixner 



Re: [PATCHv7 16/16] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

2024-02-23 Thread Thomas Gleixner
On Mon, Feb 12 2024 at 12:44, Kirill A. Shutemov wrote:
> MADT Multiprocessor Wakeup structure version 1 brings support of CPU
> offlining: BIOS provides a reset vector where the CPU has to jump to
> for offlining itself. The new TEST mailbox command can be used to test
> whether the CPU offlined itself which means the BIOS has control over
> the CPU and can online it again via the ACPI MADT wakeup method.
>
> Add CPU offlining support for the ACPI MADT wakeup method by implementing
> custom cpu_die(), play_dead() and stop_this_cpu() SMP operations.
>
> CPU offlining makes it possible to hand over secondary CPUs over kexec,
> not limiting the second kernel to a single CPU.
>
> The change conforms to the approved ACPI spec change proposal. See the
> Link.
>
> Signed-off-by: Kirill A. Shutemov 
> Link: https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher
> Acked-by: Kai Huang 
> Reviewed-by: Kuppuswamy Sathyanarayanan 
> 

Reviewed-by: Thomas Gleixner 



Re: [PATCHv7 13/16] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case

2024-02-23 Thread Thomas Gleixner
On Mon, Feb 12 2024 at 12:44, Kirill A. Shutemov wrote:
> ACPI MADT doesn't allow to offline a CPU after it was onlined. This
> limits kexec: the second kernel won't be able to use more than one CPU.
>
> To prevent a kexec kernel from onlining secondary CPUs, invalidate the
> mailbox address in the ACPI MADT wakeup structure, which prevents a
> kexec kernel from using it.
>
> This is safe as the booting kernel has the mailbox address cached
> already and acpi_wakeup_cpu() uses the cached value to bring up the
> secondary CPUs.
>
> Note: This is a Linux specific convention and not covered by the
>   ACPI specification.
>
> Signed-off-by: Kirill A. Shutemov 
> Reviewed-by: Kai Huang 
> Reviewed-by: Kuppuswamy Sathyanarayanan 
> 

Reviewed-by: Thomas Gleixner 



Re: [PATCHv7 14/16] x86/smp: Add smp_ops.stop_this_cpu() callback

2024-02-23 Thread Thomas Gleixner
On Mon, Feb 12 2024 at 12:44, Kirill A. Shutemov wrote:

> If the helper is defined, it is called instead of halt() to stop the CPU
> at the end of stop_this_cpu() and on crash CPU shutdown.
>
> ACPI MADT will use it to hand over the CPU to BIOS in order to be able
> to wake it up again after kexec.
>
> Signed-off-by: Kirill A. Shutemov 
> Acked-by: Kai Huang 

Reviewed-by: Thomas Gleixner 



Re: [PATCHv7 12/16] x86/acpi: Rename fields in acpi_madt_multiproc_wakeup structure

2024-02-23 Thread Thomas Gleixner
On Mon, Feb 12 2024 at 12:44, Kirill A. Shutemov wrote:
> To prepare for the addition of support for MADT wakeup structure version
> 1, it is necessary to provide more appropriate names for the fields in
> the structure.
>
> The field 'mailbox_version' is renamed to 'version'. This field signifies
> the version of the structure and the related protocols, rather than the
> version of the mailbox. This field has not been utilized in the code
> thus far.
>
> The field 'base_address' is renamed to 'mailbox_address' to clarify the
> kind of address it represents. In version 1, the structure includes the
> reset vector address. Clear and distinct naming helps to prevent any
> confusion.
>
> Signed-off-by: Kirill A. Shutemov 
> Reviewed-by: Kai Huang 
> Reviewed-by: Kuppuswamy Sathyanarayanan 
> 

Reviewed-by: Thomas Gleixner 



Re: [PATCHv7 05/16] x86/kexec: Keep CR4.MCE set during kexec for TDX guest

2024-02-22 Thread Thomas Gleixner
On Mon, Feb 12 2024 at 12:44, Kirill A. Shutemov wrote:

> TDX guests are not allowed to clear CR4.MCE. An attempt to clear it leads
> to #VE.
>
> Use alternatives to keep the flag during kexec for TDX guests.
>
> The change doesn't affect non-TDX-guest environments.
>
> Signed-off-by: Kirill A. Shutemov 
> Reviewed-by: Kai Huang 

Reviewed-by: Thomas Gleixner 



Re: [PATCHv4 14/14] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

2023-12-15 Thread Thomas Gleixner
On Tue, Dec 05 2023 at 03:45, Kirill A. Shutemov wrote:

> MADT Multiprocessor Wakeup structure version 1 brings support of CPU
> offlining: BIOS provides a reset vector where the CPU has to jump to
> offline itself.

CPU has to jump to for offlining itself.

> The new TEST mailbox command can be used to test the CPU offlined
> successfully and BIOS has control over it.

test whether the CPU offlined itself which means the BIOS has control
over the CPU and can online it again via the ACPI MADT wakeup method.

> Add CPU offling support for ACPI MADT wakeup method by implementing

for the ACPI

> custom cpu_die, play_dead and stop_other_cpus SMP operations.

cpu_die(), play_dead() ...

> CPU offlining makes is possible to hand over secondary CPUs over kexec,
> not limiting the second kernel to single CPU.

to a single CPU.

> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> index 4fab2ed454f3..3c8efba86d5c 100644
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -38,6 +38,7 @@ struct smp_ops {
>   int (*cpu_disable)(void);
>   void (*cpu_die)(unsigned int cpu);
>   void (*play_dead)(void);
> + void (*crash_play_dead)(void);

This new callback and the callsite change wants to be introduced in a
preparatory patch. This one is doing too many things at once, really.
  
> diff --git a/arch/x86/kernel/acpi/madt_playdead.S 
> b/arch/x86/kernel/acpi/madt_playdead.S
> new file mode 100644
> index ..68f83865a1e3
> --- /dev/null
> +++ b/arch/x86/kernel/acpi/madt_playdead.S
> @@ -0,0 +1,21 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +
> + .text
> + .align PAGE_SIZE

Newline please

Please document what the register arguments to this function are.

> +SYM_FUNC_START(asm_acpi_mp_play_dead)
> +	/* Turn off global entries. Following CR3 write will flush them. */
> +	movq	%cr4, %rdx
> +	andq	$~(X86_CR4_PGE), %rdx
> +	movq	%rdx, %cr4
> +
> +	/* Switch to identity mapping */
> +	movq	%rsi, %rax
> +	movq	%rax, %cr3
> +
> +	/* Jump to reset vector */
> +	ANNOTATE_RETPOLINE_SAFE
> +	jmp	*%rdi
> +SYM_FUNC_END(asm_acpi_mp_play_dead)
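Per the x86-64 calling convention the two arguments arrive in %rdi and
%rsi, so the requested documentation could look roughly like this (a
sketch only, not the wording of any later revision):

	/*
	 * asm_acpi_mp_play_dead() - Park the CPU via the BIOS reset vector.
	 * %rdi: physical address of the reset vector to jump to
	 * %rsi: physical address of the identity-mapped page tables (CR3 value)
	 */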

> +static u64 acpi_mp_pgd __ro_after_init;
> +static u64 acpi_mp_reset_vector_paddr __ro_after_init;
> +
> +void asm_acpi_mp_play_dead(u64 reset_vector, u64 pgd_pa);

Declarations want to be in a header file.

> +static void crash_acpi_mp_play_dead(void)
> +{
> + asm_acpi_mp_play_dead(acpi_mp_reset_vector_paddr,
> +   acpi_mp_pgd);

Pointless line break.

> +}
> +
> +static void acpi_mp_play_dead(void)
> +{
> + play_dead_common();
> + asm_acpi_mp_play_dead(acpi_mp_reset_vector_paddr,
> +   acpi_mp_pgd);

Ditto.

> +}
> +
> +static void acpi_mp_cpu_die(unsigned int cpu)
> +{
> + u32 apicid = per_cpu(x86_cpu_to_apicid, cpu);
> + unsigned long timeout;
> +
> + /*
> +  * Use TEST mailbox command to prove that BIOS got control over
> +  * the CPU before declaring it dead.
> +  *
> +  * BIOS has to clear 'command' field of the mailbox.
> +  */
> + acpi_mp_wake_mailbox->apic_id = apicid;
> + smp_store_release(&acpi_mp_wake_mailbox->command,
> +   ACPI_MP_WAKE_COMMAND_TEST);
> +
> + /* Don't wait longer than a second. */
> + timeout = USEC_PER_SEC;
> + while (READ_ONCE(acpi_mp_wake_mailbox->command) && timeout--)
> + udelay(1);

So this waits and then does nothing if the wait fails. What's the point?
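Presumably the loop should at least report the failure. A minimal sketch
of that (not necessarily the eventual fix; 'cpu' is the function argument
from the quoted code above):

	/* Sketch: warn when the BIOS never acknowledges the TEST command */
	if (READ_ONCE(acpi_mp_wake_mailbox->command))
		pr_err("Failed to hand over CPU %u to BIOS\n", cpu);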

...


Do we really need this specific hackery or is there some similar
identity mapping muck which can be generalized?

> + smp_ops.play_dead = acpi_mp_play_dead;
> + smp_ops.crash_play_dead = crash_acpi_mp_play_dead;
> + smp_ops.cpu_die = acpi_mp_cpu_die;
> + smp_ops.stop_other_cpus = acpi_mp_stop_other_cpus;
> +
> + acpi_mp_reset_vector_paddr = reset_vector;
> + acpi_mp_pgd = __pa(pgd);
> +
> + return 0;
> +}
> +
>  static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
>  {
>   if (!acpi_mp_wake_mailbox_paddr) {
> @@ -68,37 +299,63 @@ static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
>   return 0;
>  }
>  
> +static void acpi_mp_disable_offlining(struct acpi_madt_multiproc_wakeup *mp_wake)
> +{
> + cpu_hotplug_disable_offlining();
> +
> + /*
> +  * Zero out mailbox address in the ACPI MADT wakeup structure
> +  * to indicate that the mailbox is not usable.  This prevents
> +  * the kexec()-ed kernel from reading a valid mailbox, which in
> +  * turn makes the kexec()-ed kernel only be able to use the boot
> +  * CPU.
> +  *
> +  * This is Linux-specific protocol and not reflected in ACPI spec.
> +  *
> +  * acpi_mp_wake_mailbox_paddr already has the mailbox address.
> +  * The acpi_wakeup_cpu() will use it to bring up secondary cpus for
> +  * the current kernel.
> +  */
> +	mp_wake->mailbox_address = 0;

Re: [PATCHv4 13/14] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case

2023-12-15 Thread Thomas Gleixner
On Tue, Dec 05 2023 at 03:45, Kirill A. Shutemov wrote:
> ACPI MADT doesn't allow to offline CPU after it got woke up. It limits

to offline a CPU after it was onlined. This limits kexec: ...

> kexec: the second kernel won't be able to use more than one CPU.

... one CPU, which is enough to cover the kdump case.


> Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
> The acpi_wakeup_cpu() will use it to bring up secondary cpus.
>
> Zero out mailbox address in the ACPI MADT wakeup structure to indicate
> that the mailbox is not usable.  This prevents the kexec()-ed kernel
> from reading a valid mailbox, which in turn makes the kexec()-ed kernel
> only be able to use the boot CPU.
>
> This is Linux-specific protocol and not reflected in ACPI spec.
>
> Booting the second kernel with a single CPU is enough to cover the most
> common case for kexec -- kdump.

This is confusing at best and I doubt that kdump is the most common case
for every one.

  To prevent a kexec kernel from onlining secondary CPUs, invalidate the
  mailbox address in the ACPI MADT wakeup structure, which prevents a
  kexec kernel from using it.

  This is safe as the booting kernel has the mailbox address cached
  already and acpi_wakeup_cpu() uses the cached value to bring up the
  secondary CPUs.

  Note: This is a Linux specific convention and not covered by the
ACPI specification.

Hmm?

> + /*
> +  * ACPI MADT doesn't allow to offline CPU after it got woke up.

to offline a CPU after it was onlined.

> +  * It limits kexec: the second kernel won't be able to use more than

   This limits kexec: ...

> +  * one CPU.
> +  *
> +  * Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
> +  * The acpi_wakeup_cpu() will use it to bring up secondary cpus.
> +  *
> +  * Zero out mailbox address in the ACPI MADT wakeup structure to
> +  * indicate that the mailbox is not usable.  This prevents the
> +  * kexec()-ed kernel from reading a valid mailbox, which in turn
> +  * makes the kexec()-ed kernel only be able to use the boot CPU.
> +  *
> +  * This is Linux-specific protocol and not reflected in ACPI spec.

See changelog comment...

> +  */
> + mp_wake->mailbox_address = 0;
> +
>   apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
>  
>   return 0;



Re: [PATCHv4 04/14] cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup

2023-12-15 Thread Thomas Gleixner
On Tue, Dec 05 2023 at 03:45, Kirill A. Shutemov wrote:

> ACPI MADT doesn't allow to offline CPU after it got woke up.
>
> Currently hotplug prevented based on the confidential computing

Currently CPU hotplug is prevented...

Other than that:

Reviewed-by: Thomas Gleixner 



Re: [PATCHv4 03/14] cpu/hotplug: Add support for declaring CPU offlining not supported

2023-12-15 Thread Thomas Gleixner
On Tue, Dec 05 2023 at 03:44, Kirill A. Shutemov wrote:
>  
> +static bool cpu_hotplug_offline_disabled;

__ro_after_init?

Other than that:

Reviewed-by: Thomas Gleixner 



Re: [PATCHv2 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

2023-10-29 Thread Thomas Gleixner
On Fri, Oct 20 2023 at 18:12, Kirill A. Shutemov wrote:

> MADT Multiprocessor Wakeup structure version 1 brings support of CPU
> offlining: BIOS provides a reset vector where the CPU has to jump to
> offline itself. The new TEST mailbox command can be used to test the CPU
> offlined successfully and BIOS has control over it.
>
> Add CPU offling support for ACPI MADT wakeup method by implementing
> custom cpu_die, play_dead and stop_other_cpus SMP operations.
>
> CPU offlining makes possible to hand over secondary CPUs over kexec, not

makes it possible

> limiting the second kernel with single CPU.

s/with/to/

> The change conforms to the approved ACPI spec change proposal. See the
> +SYM_FUNC_START(asm_acpi_mp_play_dead)
> +	/* Load address of reset vector into RCX to jump when kernel is ready */
> +	movq	acpi_mp_reset_vector_paddr(%rip), %rcx
> +
> +	/* Turn off global entries. Following CR3 write will flush them. */
> +	movq	%cr4, %rdx
> +	andq	$~(X86_CR4_PGE), %rdx
> +	movq	%rdx, %cr4
> +
> +	/* Switch to identity mapping */
> +	movq	acpi_mp_pgd(%rip), %rax
> +	movq	%rax, %cr3

You can just make this function:

asm_acpi_mp_play_dead(u64 reset_vector, u64 pgd_pa);

then you have the reset vector in RDI and the pgd in RSI and spare the
global variables.

>  
>  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> @@ -11,6 +16,150 @@ static u64 acpi_mp_wake_mailbox_paddr;
>  /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
>  static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
>  
> +u64 acpi_mp_pgd;
> +u64 acpi_mp_reset_vector_paddr;

See above (static) and __ro_after_init please

> +
> +void asm_acpi_mp_play_dead(void);
> +
> +static void __init *alloc_pgt_page(void *context)
> +{

What's the purpose of the context argument?

> + return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> +}
> +
> +/*
> + * Make sure asm_acpi_mp_play_dead() is present in the identity mapping at
> + * the same place as in the kernel page tables. The function switches to
> + * the identity mapping and has to be present at the same spot before and
> + * after the transition.

Why does it need to be there after the CPU jumped to the reset vector?

> + */
> +static int __init init_transition_pgtable(pgd_t *pgd)
> +{
> + pgprot_t prot = PAGE_KERNEL_EXEC_NOENC;
> + unsigned long vaddr, paddr;
> + int result = -ENOMEM;
> + p4d_t *p4d;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> +
> + vaddr = (unsigned long)asm_acpi_mp_play_dead;
> + pgd += pgd_index(vaddr);
> + if (!pgd_present(*pgd)) {
> + p4d = (p4d_t *)alloc_pgt_page(NULL);
> + if (!p4d)
> + goto err;

return -ENOMEM?

The error label is pretty silly without an actual cleanup, right?

> + set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
> + }
> + p4d = p4d_offset(pgd, vaddr);
> + if (!p4d_present(*p4d)) {
> + pud = (pud_t *)alloc_pgt_page(NULL);
> + if (!pud)
> + goto err;

Ditto. But what mops up the already allocated page above?
> + set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
> +
> + return 0;
> +err:
> + return result;
> +}

Can you please move that function to the place where it is used?

> +
> +static void acpi_mp_play_dead(void)
> +{
> + play_dead_common();
> + asm_acpi_mp_play_dead();
> +}
> +
> +static void acpi_mp_cpu_die(unsigned int cpu)
> +{
> + int apicid = per_cpu(x86_cpu_to_apicid, cpu);

u32 apicid

> + unsigned long timeout;
> +
> + /*
> +  * Use TEST mailbox command to prove that BIOS got control over
> +  * the CPU before declaring it dead.
> +  *
> +  * BIOS has to clear 'command' field of the mailbox.
> +  */
> + acpi_mp_wake_mailbox->apic_id = apicid;
> + smp_store_release(&acpi_mp_wake_mailbox->command,
> +   ACPI_MP_WAKE_COMMAND_TEST);
> +
> + /* Don't wait longer than a second. */
> + timeout = USEC_PER_SEC;
> + while (READ_ONCE(acpi_mp_wake_mailbox->command) && timeout--)
> + udelay(1);
> +}
> +
> +static void acpi_mp_stop_other_cpus(int wait)
> +{
> + smp_shutdown_nonboot_cpus(smp_processor_id());

This clearly was never tested with lockdep. At the point where
stop_other_cpus() is invoked the invoking CPU has interrupts disabled...

> +}
> +
> +static void acpi_mp_crash_stop_other_cpus(void)
> +{
> + smp_shutdown_nonboot_cpus(smp_processor_id());

Yuck. Crash can happen at arbitrary places. So you really cannot invoke
the whole CPU hotplug state machine from here.

There is a reason why the other implementations just kick CPUs into some
"safe" state.

> + /* The kernel is broken so disable interrupts */
> + local_irq_disable();
> +}
> +
> +static int __init acpi_mp_setup_reset(u64 reset_vector)
> +{
> + pgd_t *pgd;
> + struct x86_mapping_info info = {

Re: [PATCHv2 02/13] kernel/cpu: Add support for declaring CPU offlining not supported

2023-10-28 Thread Thomas Gleixner
On Mon, Oct 23 2023 at 22:07, Kai Huang wrote:
> On Mon, 2023-10-23 at 18:31 +0300, kirill.shute...@linux.intel.com wrote:
>> On Mon, Oct 23, 2023 at 09:30:59AM +, Huang, Kai wrote:
>> > IMHO it's a little bit odd to have two mechanisms in place, even in this
>> > middle state patch.  Is it better to completely replace
>> > CC_ATTR_HOTPLUG_DISABLED with the new cpu_hotplug_offline_disabled in
>> > this patch? You can explicitly call cpu_hotplug_disable_offlining() in
>> > tdx_early_init() so no functional change is done.
>> 
>> I can. But I don't see how it makes a difference.
>
> Personally I think this is better because it is odd to have two mechanisms in
> place even temporarily especially when we can avoid it.  But no hard opinion. 
> Up to you.

It's not at all better because having this clear split makes it simpler
to review and ponder the individual changes.

Thanks,

tglx





Re: [PATCHv2 02/13] kernel/cpu: Add support for declaring CPU offlining not supported

2023-10-28 Thread Thomas Gleixner
On Fri, Oct 20 2023 at 18:12, Kirill A. Shutemov wrote:

Subject: cpu/hotplug: ...

> ACPI MADT doesn't allow to offline CPU after it got woke up.

ACPI MADT is a table ...

This is about the MADT mailbox wakeup method, right?

> Currently offlining hotplug prevented based on the confidential

is prevented

> computing attribute which is set for Intel TDX. But TDX is not
> the only possible user of the wake up method.
>
> Introduce cpu_hotplug_not_supported() that can be called to indicate
> that CPU offlining should be disabled.

Otherwise the changes look good!



Re: [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup

2023-10-11 Thread Thomas Gleixner
On Thu, Oct 05 2023 at 16:13, Kirill A. Shutemov wrote:
>  
> + /* Disable CPU onlining/offlining */

That's not what the function does.

> + cpu_hotplug_not_supported();

Thanks,

tglx



Re: [PATCH 02/13] kernel/cpu: Add support for declaring CPU hotplug not supported

2023-10-11 Thread Thomas Gleixner
On Thu, Oct 05 2023 at 16:13, Kirill A. Shutemov wrote:
> The function cpu_hotplug_not_supported() can be called to indicate that
> CPU hotplug should be disabled. It does not prevent the initial bring up
> of the CPU, but it stops subsequent offlining.

This tells me what the patch is doing, but not the why.

> This function is intended to replace CC_ATTR_HOTPLUG_DISABLED.

> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -132,6 +132,7 @@ extern void cpus_read_lock(void);
>  extern void cpus_read_unlock(void);
>  extern int  cpus_read_trylock(void);
>  extern void lockdep_assert_cpus_held(void);
> +extern void cpu_hotplug_not_supported(void);

This function name sucks.

The point is as you explained to prevent offlining, but not onlining. So
can we please make this very clear? Something like:

cpu_hotplug_disable_offlining()

> +/* Cleared if platform declares CPU hotplug not supported */
> +static bool cpu_hotplug_supported = true;

Again. This is not about disabling hotplug all together. Something like:

static bool cpu_hotplug_offline_disabled;

Which expresses clearly what this is about and does not require this
awkward negation.
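Put together, the suggested interface would look roughly like this (a
sketch under the naming above, not necessarily the code as merged):

	static bool cpu_hotplug_offline_disabled __ro_after_init;

	void __init cpu_hotplug_disable_offlining(void)
	{
		cpu_maps_update_begin();
		cpu_hotplug_offline_disabled = true;
		cpu_maps_update_done();
	}

with the offline path then rejecting requests while the flag is set.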

Thanks,

tglx



Re: [BUG REPORT] Triggering a panic in an x86 virtual machine does not wait

2023-07-07 Thread Thomas Gleixner
On Thu, Jul 06 2023 at 14:44, Baokun Li wrote:
> On 2023/7/5 16:59, Thomas Gleixner wrote:
>> +/*
>> + * If this is a crash stop which does not execute on the boot CPU,
>> + * then this cannot use the INIT mechanism because INIT to the boot
>> + * CPU will reset the machine.
>> + */
>> +	if (this_cpu)
>> +		return false;

> This patch does fix the problem of rebooting at panic, but the
> exported stack stays at stop_this_cpu() like below, instead of showing
> what the corresponding process is doing as before.
>
> PID: 681  TASK: 9ac2429d3080  CPU: 2    COMMAND: "fsstress"
>   #0 [b00200184fd0] stop_this_cpu at 89a4ffd8
>   #1 [b00200184fe8] __sysvec_reboot at 89a94213
>   #2 [b00200184ff0] sysvec_reboot at 8aee7491
> ---  ---
>      RIP: 0010  RSP: 0018  RFLAGS: b00200f8bd08
>      RAX: 9ac256fda9d8  RBX: 09973a85  RCX: 9ac256fda078
>      RDX: 9ac24416e300  RSI: 9ac256fda9e0  RDI: 
>      RBP: 9ac2443a5f88   R8:    R9: 9ac2422eeea0
>      R10: 9ac256fda9d8  R11: 00549921  R12: 9ac2422eeea0
>      R13: 9ac251cd23c8  R14: 9ac24269a800  R15: 9ac251cd2150
>      ORIG_RAX: 8a1719e4  CS: 0206  SS: 8a1719c8
> bt: WARNING: possibly bogus exception frame
>
> Do you know how this happened? I would be grateful if you could fix it.

No, I don't. But there is clearly a hint:

> bt: WARNING: possibly bogus exception frame

So the exception frame seems to be corrupted. I have no idea why.

The question is, whether this goes away when you revert that commit or not.
I can't oracle that out from your report.

Can you please revert 45e34c8af58f on top of Linus tree and verify that
it makes the issue go away?

Thanks,

tglx



Re: [BUG REPORT] Triggering a panic in an x86 virtual machine does not wait

2023-07-05 Thread Thomas Gleixner
On Mon, Jul 03 2023 at 11:44, Baokun Li wrote:

> When I manually trigger a panic in a qemu x86 VM with
>
>     `echo c > /proc/sysrq-trigger`,
>
> I find that the VM will often reboot directly, even though PANIC_TIMEOUT
> is 0. This prevents us from exporting the vmcore via panic, and even when
> we do succeed in exporting the vmcore, the processes in it are mostly in
> stop_this_cpu(). By bisecting we found the patch that introduced the
> behavior change
>
>     45e34c8af58f ("x86/smp: Put CPUs into INIT on shutdown if possible"),

Bah, I missed that this is used by crash too. So if this happens to be
invoked on an AP, i.e. not on CPU 0, then the INIT will reset the
machine. Fix below.

Thanks,

tglx
---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ed2d51960a7d..e1aa2cd7734b 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1348,6 +1348,14 @@ bool smp_park_other_cpus_in_init(void)
 	if (apic->wakeup_secondary_cpu_64 || apic->wakeup_secondary_cpu)
 		return false;
 
+	/*
+	 * If this is a crash stop which does not execute on the boot CPU,
+	 * then this cannot use the INIT mechanism because INIT to the boot
+	 * CPU will reset the machine.
+	 */
+	if (this_cpu)
+		return false;
+
 	for_each_present_cpu(cpu) {
 		if (cpu == this_cpu)
 			continue;



Re: [PATCH v23 4/8] crash: memory and CPU hotplug sysfs attributes

2023-06-13 Thread Thomas Gleixner
On Tue, Jun 13 2023 at 14:58, Eric DeVolder wrote:
> On 6/13/23 10:24, Eric DeVolder wrote:
>> On 6/13/23 03:03, Greg KH wrote:
>>> All of these #ifdefs should all be removed and instead use the
>>> is_visible() callback to determine if the attribute is shown or not,
>>> using the IS_ENABLED() test in the function.
>> 
>> ok, I'll correct this.
>
> I've been examining drivers/base/cacheinfo.c as a template for how to
> remove the #ifdefs and use the is_visible() callback for the
> drivers/base/cpu|memory.c files.
>
> I'm attempting to apply this technique to drivers/base/cpu.c. In this
> file, there are features that are compiled in/out based on the CONFIG
> settings, for example CONFIG_ARCH_CPU_PROBE_RELEASE. My attempts at
> applying the technique thus far have resulted in link-time errors for
> missing symbols, ie. arch_cpu_probe() and arch_cpu_release().
>
> As I understand it, to use IS_ENABLED(XYZ) to guard-band conditional
> code, the contents of that code still needs to be compile-able (eg. no
> references to struct members with surrounding #ifdef CONFIG_XYZ) and
> link-able (eg. any called functions must also be compiled).

Obviously you can't reference anything which is #ifdeffed out in a data
structure. But functions are a different story. All you need is a
declaration:

	void foo(void);

	if (IS_ENABLED(CONFIG_FOO))
		foo();

Builds correctly if CONFIG_FOO=n and foo() is not built in. The wonders
of dead code elimination.
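For completeness, the same pattern in context (CONFIG_FOO, foo() and
bar() are made-up names for illustration):

	/* Declared unconditionally; defined only when CONFIG_FOO=y */
	void foo(void);

	static void bar(void)
	{
		/*
		 * With CONFIG_FOO=n, IS_ENABLED(CONFIG_FOO) evaluates to
		 * constant 0, the compiler eliminates the dead branch, and
		 * no reference to foo() survives to link time.
		 */
		if (IS_ENABLED(CONFIG_FOO))
			foo();
	}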

Thanks,

tglx



Re: [PATCH v6 06/14] x86: Add early SHA support for Secure Launch early measurements

2023-05-12 Thread Thomas Gleixner
On Fri, May 12 2023 at 17:13, Matthew Garrett wrote:
> On Fri, May 12, 2023 at 03:24:04PM +0200, Thomas Gleixner wrote:
>> On Fri, May 12 2023 at 12:28, Matthew Garrett wrote:
>> > Unless we assert that SHA-1 events are unsupported, it seems a bit odd 
>> > to force a policy on people who have both banks enabled. People with 
>> > mixed fleets are potentially going to be dealing with SHA-1 measurements 
>> > for a while yet, and while there's obviously a security benefit in using 
>> > SHA-2 instead it'd be irritating to have to maintain two attestation 
>> > policies.
>> 
>> Why?
>> 
>> If you have a mixed fleet then it's not too much asked to provide two
>> data sets. On a TPM2 system you can enforce SHA-2 and only fallback to
>> SHA-1 on TPM 1.2 hardware. No?
>
> No, because having TPM2 hardware doesn't guarantee that your firmware
> enables SHA-2 (which also means this is something that could change with
> firmware updates), which means that refusing to support SHA-1 if the
> SHA-2 banks are enabled could result in an entirely different policy
> being required (and plausibly one that isn't implemented in their
> existing tooling).

It's not rocket science to have both variants supported in tooling,
really.

What a mess.




Re: [PATCH v6 09/14] x86: Secure Launch SMP bringup support

2023-05-12 Thread Thomas Gleixner
On Thu, May 04 2023 at 14:50, Ross Philipson wrote:
>  
> +#ifdef CONFIG_SECURE_LAUNCH
> +
> +static atomic_t first_ap_only = {1};

ATOMIC_INIT(1) if at all.

> +
> +/*
> + * Called to fix the long jump address for the waiting APs to vector to
> + * the correct startup location in the Secure Launch stub in the rmpiggy.
> + */
> +static int
> +slaunch_fixup_jump_vector(void)

One line please.

> +{
> + struct sl_ap_wake_info *ap_wake_info;
> + u32 *ap_jmp_ptr = NULL;
> +
> + if (!atomic_dec_and_test(&first_ap_only))
> + return 0;

Why does this need an atomic? CPU bringup is fully serialized and even
with the upcoming parallel bootup work, there is no concurrency on this
function.

Aside of that. Why isn't this initialized during boot in a __init function?

> + ap_wake_info = slaunch_get_ap_wake_info();
> +
> + ap_jmp_ptr = (u32 *)__va(ap_wake_info->ap_wake_block +
> +  ap_wake_info->ap_jmp_offset);
> +
> + *ap_jmp_ptr = real_mode_header->sl_trampoline_start32;
> +
> + pr_debug("TXT AP long jump address updated\n");
> +
> + return 0;

Why does this need a return code if all return paths return 0?

> +}
> +
> +/*
> + * TXT AP startup is quite different than normal. The APs cannot have #INIT
> + * asserted on them or receive SIPIs. The early Secure Launch code has parked
> + * the APs in a pause loop waiting to receive an NMI. This will wake the APs
> + * and have them jump to the protected mode code in the rmpiggy where the 
> rest
> + * of the SMP boot of the AP will proceed normally.
> + */
> +static int
> +slaunch_wakeup_cpu_from_txt(int cpu, int apicid)
> +{
> + unsigned long send_status = 0, accept_status = 0;
> +
> + /* Only done once */

Yes. But not here.

> + if (slaunch_fixup_jump_vector())
> + return -1;
> +
> + /* Send NMI IPI to idling AP and wake it up */
> + apic_icr_write(APIC_DM_NMI, apicid);
> +
> + if (init_udelay == 0)
> + udelay(10);
> + else
> + udelay(300);

The wonders of copy & pasta. This condition is pointless because this
code only runs on systems which force init_udelay to 0.

> + send_status = safe_apic_wait_icr_idle();

Moar copy & pasta. As this is guaranteed to be X2APIC mode, this
function is a nop and returns 0 unconditionally.

> + if (init_udelay == 0)
> + udelay(10);
> + else
> + udelay(300);
> +
> + accept_status = (apic_read(APIC_ESR) & 0xEF);

The point of this is? Bits 0-3 are Pentium and P6 only.

Bit 4 Tried to send low prio IPI but not supported
Bit 5 Illegal Vector sent
Bit 6 Illegal Vector received
Bit 7 X2APIC illegal register access

IOW, there is no accept error here. That would be bit 2, which is never
set on anything modern.

But aside of that the read is moot anyway because the CPU has the APIC
error vector enabled so if this would happen the APIC error interrupt
would have swallowed and cleared the error condition.

IOW. Everything except the apic_icr_write() here is completely useless.

> +#else
> +
> +#define slaunch_wakeup_cpu_from_txt(cpu, apicid) 0

Inline stub, please.
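I.e. something along these lines (sketch):

	static inline int slaunch_wakeup_cpu_from_txt(int cpu, int apicid)
	{
		return 0;
	}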

> +
> +#endif  /* !CONFIG_SECURE_LAUNCH */
> +
>  /*
>   * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
>   * (ie clustered apic addressing mode), this is a LOGICAL apic ID.
> @@ -1132,6 +1210,13 @@ static int do_boot_cpu(int apicid, int cpu, struct 
> task_struct *idle,
>   cpumask_clear_cpu(cpu, cpu_initialized_mask);
>   smp_mb();
>  
> + /* With Intel TXT, the AP startup is totally different */
> + if ((slaunch_get_flags() & (SL_FLAG_ACTIVE|SL_FLAG_ARCH_TXT)) ==
> +(SL_FLAG_ACTIVE|SL_FLAG_ARCH_TXT)) {

Stick this condition into a helper function please

> + boot_error = slaunch_wakeup_cpu_from_txt(cpu, apicid);
> + goto txt_wake;
> + }
> +
>   /*
>* Wake up a CPU in difference cases:
>* - Use a method from the APIC driver if one defined, with wakeup
> @@ -1147,6 +1232,7 @@ static int do_boot_cpu(int apicid, int cpu, struct 
> task_struct *idle,
>   boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
>cpu0_nmi_registered);
>  
> +txt_wake:

Sorry, but what has this to do with TXT ? And why can't the above just
be yet another if clause in the existing if/else if maze?

Now that brings me to another question. How is this supposed to work
with CPU hotplug post boot?

It will simply not work at all because once a CPU is offlined it is
going to sit in an endless loop and wait for INIT/SIPI/SIPI. So it will
get that NMI and go back to wait.

So you need a TXT specific cpu_play_dead() implementation, which should
preferably use monitor/mwait where each "offline" CPU sits and waits
until a condition becomes true. Then you don't need an NMI for wakeup at
all. Just writing the condition into that per CPU cache line should be
enough.
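A rough sketch of that idea, modeled on the existing mwait_play_dead()
(the per-CPU wakeup flag and whoever writes it are assumptions here):

	static DEFINE_PER_CPU(u8, sl_wakeup_flag);

	static void slaunch_txt_play_dead(void)
	{
		u8 *flag = this_cpu_ptr(&sl_wakeup_flag);

		/* Sit in MWAIT until somebody writes the per-CPU wakeup flag */
		while (!READ_ONCE(*flag)) {
			__monitor(flag, 0, 0);
			/* Recheck to close the race between test and MWAIT arm */
			if (READ_ONCE(*flag))
				break;
			__mwait(0, 0);
		}
	}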

Thanks,

tglx

Re: [PATCH v6 07/14] x86: Secure Launch kernel early boot stub

2023-05-12 Thread Thomas Gleixner


On Thu, May 04 2023 at 14:50, Ross Philipson wrote:
> +
> +/* CPUID: leaf 1, ECX, SMX feature bit */
> +#define X86_FEATURE_BIT_SMX  (1 << 6)
> +
> +/* Can't include apicdef.h in asm */

Why not? All it needs is a #ifndef __ASSEMBLY__ guard around the C parts.

> +#define XAPIC_ENABLE (1 << 11)
> +#define X2APIC_ENABLE(1 << 10)
> +
> +/* Can't include traps.h in asm */

NMI_VECTOR is defined in irq_vectors.h which just has a include
 for no real good reason.

> +#define X86_TRAP_NMI 2



> +/*
> + * See the comment in head_64.S for detailed information on what this macro
> + * is used for.
> + */
> +#define rva(X) ((X) - sl_stub_entry)

I'm having a hard time to find that comment in head_64.S. At least it's
not in this patch.

> +.Lsl_ap_cs:
> +	/* Load the relocated AP IDT */
> +	lidt	(sl_ap_idt_desc - sl_txt_ap_wake_begin)(%ecx)
> +
> +	/* Fixup MTRRs and misc enable MSR on APs too */
> +	call	sl_txt_load_regs
> +
> +	/* Enable SMI with GETSEC[SMCTRL] */
> +	GETSEC	$(SMX_X86_GETSEC_SMCTRL)
> +
> +	/* IRET-to-self can be used to enable NMIs which SENTER disabled */
> +	leal	rva(.Lnmi_enabled_ap)(%ebx), %eax
> +	pushfl
> +	pushl	$(__SL32_CS)
> +	pushl	%eax
> +	iret

So from here on any NMI which hits the AP before it can reach the wait
loop will corrupt EDX...

> +/* This is the beginning of the relocated AP wake code block */
> +	.global sl_txt_ap_wake_begin
> +sl_txt_ap_wake_begin:
> +
> +	/*
> +	 * Wait for NMI IPI in the relocated AP wake block which was provided
> +	 * and protected in the memory map by the prelaunch code. Leave all
> +	 * other interrupts masked since we do not expect anything but an NMI.
> +	 */
> +	xorl	%edx, %edx
> +
> +1:
> +	hlt
> +	testl	%edx, %edx
> +	jz	1b

This really makes me nervous. A stray NMI and the AP starts going.

Can't this NMI just bring the AP out of HLT w/o changing any state and
the AP evaluates a memory location which indicates whether it should
start up or not.

> + /*
> +  * This is the long absolute jump to the 32b Secure Launch protected
> +  * mode stub code in the rmpiggy. The jump address will be fixed in

Providing an actual name for the stub might spare the reader rummaging
through code to figure out where this is supposed to jump to.

> +  * the SMP boot code when the first AP is brought up. This whole area
> +  * is provided and protected in the memory map by the prelaunch code.
> +  */
> + .byte   0xea
> +sl_ap_jmp_offset:
> + .long   0x
> + .word   __SL32_CS

Thanks,

tglx



Re: [PATCH v6 08/14] x86: Secure Launch kernel late boot stub

2023-05-12 Thread Thomas Gleixner
On Thu, May 04 2023 at 14:50, Ross Philipson wrote:
> The routine slaunch_setup is called out of the x86 specific setup_arch

Can you please make functions visible in changelogs by appending (),
i.e. setup_arch() ?

See https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
for further hints.

> +static u32 sl_flags;
> +static struct sl_ap_wake_info ap_wake_info;
> +static u64 evtlog_addr;
> +static u32 evtlog_size;
> +static u64 vtd_pmr_lo_size;

Is any of this modifiable after boot? If not then this wants to be
annotated with __ro_after_init.

> +/* This should be plenty of room */
> +static u8 txt_dmar[PAGE_SIZE] __aligned(16);
> +
> +u32 slaunch_get_flags(void)
> +{
> + return sl_flags;
> +}
> +EXPORT_SYMBOL(slaunch_get_flags);

What needs this export? If there is a reason then please EXPORT_SYMBOL_GPL()

> +struct sl_ap_wake_info *slaunch_get_ap_wake_info(void)
> +{
> + return &ap_wake_info;
> +}

If you return a pointer, then there is not much of a point in encapsulating.

> +struct acpi_table_header *slaunch_get_dmar_table(struct acpi_table_header *dmar)

Some explanation on public visible functions would be really useful.

> +{
> + /* The DMAR is only stashed and provided via TXT on Intel systems */

-ENOPARSE.

> + if (memcmp(txt_dmar, "DMAR", 4))
> + return dmar;
> +
> + return (struct acpi_table_header *)(&txt_dmar[0]);

s/&txt_dmar[0]/txt_dmar/ No?

> +}

> +void __noreturn slaunch_txt_reset(void __iomem *txt,
> +   const char *msg, u64 error)

Please avoid these line breaks. We lifted the 80 character limit quite
some time ago.

> +
> + /* Iterate over heap tables looking for table of "type" */
> + for (i = 0; i < type; i++) {
> + base += offset;
> + heap = early_memremap(base, sizeof(u64));
> + if (!heap)
> + slaunch_txt_reset(txt,
> + "Error early_memremap of heap for heap walk\n",
> + SL_ERROR_HEAP_MAP);

This is horrible to read.

	if (!heap) {
		slaunch_txt_reset(txt, "Error early_memremap of heap for heap walk\n",
				  SL_ERROR_HEAP_MAP);
	}

See documentation about bracket rules.

> +
> +/*
> + * TXT stashes a safe copy of the DMAR ACPI table to prevent tampering.
> + * It is stored in the TXT heap. Fetch it from there and make it available
> + * to the IOMMU driver.
> + */
> +static void __init slaunch_copy_dmar_table(void __iomem *txt)
> +{
> + struct txt_sinit_mle_data *sinit_mle_data;
> + u32 field_offset, dmar_size, dmar_offset;
> + void *dmar;
> +
> + memset(&txt_dmar, 0, PAGE_SIZE);

txt_dmar is statically allocated so it's already zero, no?

> +/*
> + * Intel TXT specific late stub setup and validation.
> + */
> +void __init slaunch_setup_txt(void)
> +{
> + u64 one = TXT_REGVALUE_ONE, val;
> + void __iomem *txt;
> +
> + if (!boot_cpu_has(X86_FEATURE_SMX))
> + return;
> +
> + /*
> +  * If booted through secure launch entry point, the loadflags
> +  * option will be set.
> +  */
> + if (!(boot_params.hdr.loadflags & SLAUNCH_FLAG))
> + return;
> +
> + /*
> +  * See if SENTER was done by reading the status register in the
> +  * public space. If the public register space cannot be read, TXT may
> +  * be disabled.
> +  */
> + txt = early_ioremap(TXT_PUB_CONFIG_REGS_BASE,
> + TXT_NR_CONFIG_PAGES * PAGE_SIZE);
> + if (!txt)
> + return;

Wait. You have established above that SMX is available and the boot has
set the SLAUNCH flag.

So if that ioremap() fails then there is an issue with the fixmaps.

How is returning here sensible? The system will just die later on in the
worst case with some undecodable issue.

Thanks,

tglx



Re: [PATCH v6 06/14] x86: Add early SHA support for Secure Launch early measurements

2023-05-12 Thread Thomas Gleixner
On Fri, May 12 2023 at 12:28, Matthew Garrett wrote:
> On Fri, May 12, 2023 at 01:18:45PM +0200, Ard Biesheuvel wrote:
>> On Fri, 12 May 2023 at 13:04, Matthew Garrett  wrote:
>> >
>> > On Tue, May 09, 2023 at 06:21:44PM -0700, Eric Biggers wrote:
>> >
>> > > SHA-1 is insecure.  Why are you still using SHA-1?  Don't TPMs support 
>> > > SHA-2
>> > > now?
>> >
>> > TXT is supported on some TPM 1.2 systems as well. TPM 2 systems are also
>> > at the whim of the firmware in terms of whether the SHA-2 banks are
>> > enabled. But even if the SHA-2 banks are enabled, if you suddenly stop
>> > extending the SHA-1 banks, a malicious actor can later turn up and
>> > extend whatever they want into them and present a SHA-1-only
>> > attestation. Ideally whatever is handling that attestation should know
>> > whether or not to expect an attestation with SHA-2, but the easiest way
>> > to maintain security is to always extend all banks.
>> >
>> 
>> Wouldn't it make more sense to measure some terminating event into the
>> SHA-1 banks instead?
>
> Unless we assert that SHA-1 events are unsupported, it seems a bit odd 
> to force a policy on people who have both banks enabled. People with 
> mixed fleets are potentially going to be dealing with SHA-1 measurements 
> for a while yet, and while there's obviously a security benefit in using 
> SHA-2 instead it'd be irritating to have to maintain two attestation 
> policies.

Why?

If you have a mixed fleet then it's not too much asked to provide two
data sets. On a TPM2 system you can enforce SHA-2 and only fallback to
SHA-1 on TPM 1.2 hardware. No?

Thanks,

tglx



Re: [PATCH v6 02/14] Documentation/x86: Secure Launch kernel documentation

2023-05-12 Thread Thomas Gleixner
On Thu, May 04 2023 at 14:50, Ross Philipson wrote:
> +KASLR Configuration
> +-------------------
> +
> +Secure Launch does not interoperate with KASLR. If possible, the MLE should 
> be
> +built with KASLR disabled::

Why?

> +  "Processor type and features" -->
> +  "Build a relocatable kernel" -->
> +  "Randomize the address of the kernel image (KASLR) [ ]"
> +
> +This unsets the Kconfig value CONFIG_RANDOMIZE_BASE.
> +
> +If not possible, KASLR must be disabled on the kernel command line when doing
> +a Secure Launch as follows::
> +
> +  nokaslr

So what happens if KASLR is enabled in Kconfig and not disabled on the
command line?

> +IOMMU Configuration
> +-------------------
> +
> +When doing a Secure Launch, the IOMMU should always be enabled and the
> +drivers loaded. However, IOMMU passthrough mode should never be used. This
> +leaves the MLE completely exposed to DMA after the PMR's [2]_ are disabled.
> +The current default mode is to use IOMMU in lazy translated mode but strict
> +translated mode is the preferred IOMMU mode and this should be selected in
> +the build configuration::
> +
> +  "Device Drivers" -->
> +  "IOMMU Hardware Support" -->
> +  "IOMMU default domain type" -->
> +  "(X) Translated - Strict"
> +
> +In addition, the Intel IOMMU should be on by default. The following sets
> +this as the default in the build configuration::
> +
> +  "Device Drivers" -->
> +  "IOMMU Hardware Support" -->
> +  "Support for Intel IOMMU using DMA Remapping Devices [*]"
> +
> +and::
> +
> +  "Device Drivers" -->
> +  "IOMMU Hardware Support" -->
> +  "Support for Intel IOMMU using DMA Remapping Devices [*]" -->
> +  "Enable Intel DMA Remapping Devices by default  [*]"
> +
> +It is recommended that no other command line options should be set to
> +override the defaults above.

Is any of this validated and are proper warnings emitted or is it just
recommended and left to the user to do the right thing?

Thanks,

tglx



Re: [PATCH v22 5/8] x86/crash: add x86 crash hotplug support

2023-05-09 Thread Thomas Gleixner
On Wed, May 03 2023 at 18:41, Eric DeVolder wrote:
> In the patch 'kexec: exclude elfcorehdr from the segment digest'

See reply to 8/8
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 53bab123a8ee..80538524c494 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2119,6 +2119,19 @@ config CRASH_DUMP
> (CONFIG_RELOCATABLE=y).
> For more details see Documentation/admin-guide/kdump/kdump.rst
>  
> +config CRASH_HOTPLUG
> + bool "Update the crash elfcorehdr on system configuration changes"
> + default y
> + depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG)
> + help
> +   Enable direct update to the crash elfcorehdr (which contains
> +   the list of CPUs and memory regions to be dumped upon a crash)
> +   in response to hot plug/unplug or online/offline of CPUs or
> +   memory. This is a much more advanced approach than userspace
> +   attempting that.
> +
> +   If unsure, say Y.

Why is this config an X86 specific thing?

Neither CRASH_DUMP nor HOTPLUG_CPU nor MEMORY_HOTPLUG are in any way X86
specific at all. So why can't you stick that into a place where it can
be reused by other architectures?

It's not rocket science to do 

+	depends on WANTS_CRASH_HOTPLUG && CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG)

or something like that. It's so tiring to have x86 Kconfig be the dumping
ground for the initial implementation, then having the sh*t copied to
every other architecture and the cleanup left to the maintainers.

It's not rocket science to differentiate between a real architecture
specific option and a generally useful option in the first place, right?


> +#ifdef CONFIG_CRASH_HOTPLUG
> + /*
> +  * Ensure the elfcorehdr segment large enough for hotplug changes.
> +  * Account for VMCOREINFO and kernel_map and maximum CPUs.

Neither the first line nor the second one qualifies as a parseable sentence.

> +/**
> + * arch_crash_handle_hotplug_event() - Handle hotplug elfcorehdr changes
> + * @image: the active struct kimage

What is an active struct kimage?

> + *
> + * The new elfcorehdr is prepared in a kernel buffer, and then it is
> + * written on top of the existing/old elfcorehdr.

-ENOPARSE

> + */
> +void arch_crash_handle_hotplug_event(struct kimage *image)
> +{
> + void *elfbuf = NULL, *old_elfcorehdr;
> + unsigned long nr_mem_ranges;
> + unsigned long mem, memsz;
> + unsigned long elfsz = 0;
> +
> + /*
> +  * Create the new elfcorehdr reflecting the changes to CPU and/or
> +  * memory resources.
> +  */
> + if (prepare_elf_headers(image, &elfbuf, &elfsz, &nr_mem_ranges)) {
> + pr_err("unable to prepare elfcore headers");
> + goto out;

So this can fail. Why is there just a pr_err() and no return value which
tells the caller that this failed?

> + /*
> +  * Copy new elfcorehdr over the old elfcorehdr at destination.
> +  */
> + old_elfcorehdr = kmap_local_page(pfn_to_page(mem >> PAGE_SHIFT));
> + if (!old_elfcorehdr) {
> + pr_err("updating elfcorehdr failed\n");

How hard is it to write an error message which is clearly describing the
problem?

Thanks,

tglx



Re: [PATCH v22 8/8] x86/crash: optimize CPU changes

2023-05-09 Thread Thomas Gleixner
On Wed, May 03 2023 at 18:41, Eric DeVolder wrote:
> This patch is dependent upon the patch 'crash: change

Seriously? You send a patch series which is ordered in itself and then
tell in the changelog of patch 8/8 that it depends on patch 7/8?

This information is complete garbage once the patches are applied and
ends up in the git logs and even for the submission it's useless
information.

Patch series are usually ordered by dependency, no?

Aside of that please do:

# git grep 'This patch' Documentation/process/

> crash_prepare_elf64_headers() to for_each_possible_cpu()'. With that
> patch, crash_prepare_elf64_headers() writes out an ELF CPU PT_NOTE
> for all possible CPUs, thus further CPU changes to the elfcorehdr
> are not needed.

I'm having a hard time to decode this word salad.

  crash_prepare_elf64_headers() is writing out an ELF CPU PT_NOTE for
  all possible CPUs, thus further changes to the ELF core header are
  not required.

Makes some sense to me.

> This change works for kexec_file_load() and kexec_load() syscalls.
> For kexec_file_load(), crash_prepare_elf64_headers() is utilized
> directly and thus all ELF CPU PT_NOTEs are in the elfcorehdr already.
> This is the kimage->file_mode term.
> For kexec_load() syscall, one CPU or memory change will cause the
> elfcorehdr to be updated via crash_prepare_elf64_headers() and at
> that point all ELF CPU PT_NOTEs are in the elfcorehdr. This is the
> kimage->elfcorehdr_updated term.

Sorry. I tried hard, but this is completely incomprehensible.

> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index 8064e65de6c0..3157e6068747 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -483,6 +483,16 @@ void arch_crash_handle_hotplug_event(struct kimage 
> *image)
>   unsigned long mem, memsz;
>   unsigned long elfsz = 0;
>  
> + /* As crash_prepare_elf64_headers() has already described all

This is not a proper multiline comment. Please read and follow the tip
tree documentation along with all other things which are documented
there:

  https://www.kernel.org/doc/html/latest/process/maintainer-tip.html

This documentation is not there for entertainment value or exists just
because we are bored to death.

> +  * possible CPUs, there is no need to update the elfcorehdr
> +  * for additional CPU changes. This works for both kexec_load()
> +  * and kexec_file_load() syscalls.

And it does not work for what?

You cannot expect that anyone who reads this code is an kexec/crash*
wizard who might be able to deduce the meaning of this.

Thanks,

tglx



Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes

2023-02-13 Thread Thomas Gleixner
On Mon, Feb 13 2023 at 10:10, Sourabh Jain wrote:
> The sysconf document says _SC_NPROCESSORS_CONF is processors configured, 
> isn't that equivalent to possible CPUs?

glibc tries to evaluate that in the following order:

  1) /sys/devices/system/cpu/cpu*

 That's present CPUs not possible CPUs

  2) /proc/stat

 That's online CPUs

  3) sched_getaffinity()

 That's online CPUs at best. In the worst case it's an affinity mask
 which is set on a process group

Thanks,

tglx



Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes

2023-02-08 Thread Thomas Gleixner
Eric!

On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
> On 2/1/23 05:33, Thomas Gleixner wrote:
>
> So my latest solution is to introduce two new CPUHP states,
> CPUHP_AP_ELFCOREHDR_ONLINE for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE
> for offlining. I'm open to better names.
>
> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU.
> My attempts at locating this state failed when inside the STARTING section,
> so I located this just inside the ONLINE section. The crash hotplug handler
> is registered on this state as the callback for the .startup method.
>
> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU,
> and I placed it at the end of the PREPARE section. This crash hotplug
> handler is also registered on this state as the callback for the .teardown
> method.

TBH, that's still overengineered. Something like this:

bool cpu_is_alive(unsigned int cpu)
{
	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);

	return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
}

and use this to query the actual state at crash time. That spares all
those callback heuristics.
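The elfcorehdr preparation could then simply do (sketch):

	for_each_present_cpu(cpu) {
		/* Skip CPUs which were not alive when the kernel crashed */
		if (!cpu_is_alive(cpu))
			continue;
		/* ... emit the PT_NOTE for this CPU ... */
	}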

> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo,
> makedumpfile and (the consumer of it all) the userspace crash utility,
> in order to understand the impact of moving from for_each_present_cpu()
> to for_each_online_cpu().

Is the packing actually worth the trouble? What's the actual win?

Thanks,

tglx





Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes

2023-02-06 Thread Thomas Gleixner
On Mon, Feb 06 2023 at 13:42, Sourabh Jain wrote:
> On 01/02/23 17:03, Thomas Gleixner wrote:
>> Also in case of loading the crash kernel in the situation where not all
>> present CPUs are online (think boot time SMT disable) then your
>> resulting crash image will contain all present CPUs and none of the
>> offline CPUs are excluded.
>>
>> How does that make any sense at all?
>>
>> This image->hp_action and image->offlinecpu dance is engineering
>> voodoo. You just can do:
>>
>> 	for_each_present_cpu(cpu) {
>> 		if (!cpu_online(cpu))
>> 			continue;
>> 		do_stuff(cpu);
>> 	}
>>
>> which does the right thing in all situations and can be further
>> simplified to:
>>
>> 	for_each_online_cpu(cpu) {
>> 		do_stuff(cpu);
>> 	}
>
> What will be the implication on x86 if we pack PT_NOTE for possible
> CPUs?

I don't know.

> IIUC, on boot the crash notes are created for possible CPUs using
> pcpu_alloc and when the system is on the crash path the crash notes for
> online CPUs are populated with the required data and the rest of the
> crash notes are untouched.

Which should be fine. That's a problem of postprocessing and it's
unclear to me from the changelogs what the actual problem is that is
being solved here.

Thanks,

tglx





Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes

2023-02-01 Thread Thomas Gleixner
Eric!

On Tue, Jan 31 2023 at 17:42, Eric DeVolder wrote:
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -366,6 +366,14 @@ int crash_prepare_elf64_headers(struct kimage *image, 
> struct crash_mem *mem,
>  
>   /* Prepare one phdr of type PT_NOTE for each present CPU */
>   for_each_present_cpu(cpu) {
> +#ifdef CONFIG_CRASH_HOTPLUG
> + if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) {
> + /* Skip the soon-to-be offlined cpu */
> + if ((image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU) &&
> + (cpu == image->offlinecpu))
> + continue;
> + }
> +#endif

I'm failing to see how the above is correct in any way. Look at the
following sequence of events:

 1) Offline CPU$N

-> Prepare elf headers with CPU$N excluded

 2) Another hotplug operation != 'Online CPU$N'

-> Prepare elf headers with CPU$N included

Also in case of loading the crash kernel in the situation where not all
present CPUs are online (think boot time SMT disable) then your
resulting crash image will contain all present CPUs and none of the
offline CPUs are excluded.

How does that make any sense at all?

This image->hp_action and image->offlinecpu dance is engineering
voodoo. You can just do:

	for_each_present_cpu(cpu) {
		if (!cpu_online(cpu))
			continue;
		do_stuff(cpu);
	}

which does the right thing in all situations and can be further
simplified to:

	for_each_online_cpu(cpu) {
		do_stuff(cpu);
	}
without the need for ifdefs or whatever.

No?

Thanks,

tglx



Re: [PATCH v17 3/6] crash: add generic infrastructure for crash hotplug support

2023-01-19 Thread Thomas Gleixner
Eric!

On Wed, Jan 18 2023 at 16:35, Eric DeVolder wrote:
> CPU and memory change notifications are received in order to
> regenerate the elfcorehdr.
>
> To support cpu hotplug, a callback is registered to capture the
> CPUHP_AP_ONLINE_DYN online and offline events via
> cpuhp_setup_state_nocalls().

This sentence does not make sense. The callback is not registered to
capture CPUHP_AP_ONLINE_DYN events.

What this does is: It installs a dynamic CPU hotplug state with
callbacks for online and offline. These callbacks store information
about a CPU coming up and going down. Right?
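
For illustration, that pattern looks roughly like this; the handler
names are placeholders, not the ones from the patch:

    static int crash_cpu_online(unsigned int cpu)
    {
            /* record "CPU came up", regenerate the elfcorehdr */
            return 0;
    }

    static int crash_cpu_offline(unsigned int cpu)
    {
            /* record "CPU went down", regenerate the elfcorehdr */
            return 0;
    }

    static int __init crash_cpuhp_init(void)
    {
            int ret;

            /* Dynamic state: callbacks run late on online, early on offline */
            ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "crash/cpuhp",
                                            crash_cpu_online, crash_cpu_offline);
            return ret < 0 ? ret : 0;
    }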

But why are they required and what's the value?

This changelog tells WHAT it does and not WHY. I can see the WHAT from
the patch itself. 

Don't tell me the WHY is in the cover letter. The cover letter is not
part of the commits and changelogs have to be self contained.

Now let me cite from your cover letter:

> When the kdump service is loaded, if a CPU or memory is hot
> un/plugged, the crash elfcorehdr, which describes the CPUs
> and memory in the system, must also be updated, else the resulting
> vmcore is inaccurate (eg. missing either CPU context or memory
> regions).

The CPU hotplug state you are using for this is patently inaccurate
too. With your approach the CPU is tracked as online very late in the
hotplug process and tracked as offline very early on unplug.

So if the kernel crashes before/after the plug/unplug tracking event
then your recorded state is bogus and given the amount of callbacks
between the real online/offline and the recording point there is a
pretty large window.

You can argue that this is better than the current state and considered
good enough for whatever reason, but such information wants to be in the
changelog, no?

Thanks,

tglx

Hint: The requirements for changelogs are well documented in 
Documentation/process/





Re: [PATCH] Revert "x86/apic/x2apic: Implement IPI shorthands support"

2023-01-09 Thread Thomas Gleixner
On Tue, Dec 20 2022 at 13:34, Baoquan He wrote:
> This reverts commit 43931d350f30c6cd8c2f498d54ef7d65750abc92.
>
> On kvm guest with 4 cpus deployed, when adding 'nr_cpus=2' to normal
> kernel's cmdline, and triggering crash to jump to kdump kernel, kdump
> kernel will stably hang. Reverting commit 43931d350f30 ("x86/apic/x2apic:
> Implement IPI shorthands support") can fix it.

Is there any output on the early console or does it hang silently?

If the latter, can you attach GDB to the guest and figure out where it
is stuck?

Thanks,

tglx



Re: kdump kernel randomly hang with tick_periodic call trace on bare metal system

2023-01-09 Thread Thomas Gleixner
On Tue, Dec 20 2022 at 13:41, Baoquan He wrote:
> On one intel bare metal system, I can randomly reproduce the kdump hang
> as below with tick_periodic call trace. Attach the kernel config for
> reference.

This has absolutely nothing to do with x2apic IPI shorthands

> [0.045980] Spurious LAPIC timer interrupt on cpu 0

So here the CPU receives a spurious Local APIC timer interrupt, but
that's a red herring.

> [1.152690] BUG: kernel NULL pointer dereference, address: 0000000000000088
> [1.159634] #PF: supervisor read access in kernel mode
> [1.164757] #PF: error_code(0x) - not-present page
> [1.169882] PGD 0 P4D 0 
> [1.176578] Oops: 0000 [#1] PREEMPT SMP PTI
> [1.176578] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> 6.0.11-300.fc37.x86_64 #1
> [1.183870] Hardware name: Dell Inc. PowerEdge R410/0N051F, BIOS 1.11.0 
> 07/20/2012
> [1.191420] RIP: 0010:tick_periodic+0x23/0x80

I'm willing to bet that this is caused by the following line in tick_periodic():

update_process_times(user_mode(get_irq_regs()));

   user_mode() is invoked with a NULL pointer. user_mode() accesses
   regs->cs. CS is at offset 0x88
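
   As a cross-check against the x86-64 pt_regs layout, where 17
   eight-byte register slots precede cs, a hypothetical assertion:

	BUILD_BUG_ON(offsetof(struct pt_regs, cs) != 0x88);

   which is why user_mode(NULL) faults on address 0x88 above.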

The reason for this is here:

> [1.280648]  tick_handle_periodic+0x1f/0x70
> [1.284821]  timer_interrupt+0x14/0x20
> [1.288561]  __handle_irq_event_percpu+0x46/0x190
> [1.293253]  handle_irq_event+0x34/0x70
> [1.297080]  handle_level_irq+0xa8/0x180
> [1.300993]  resend_irqs+0x5d/0x70
> [1.304386]  tasklet_action_common.constprop.0+0xab/0xe0
> [1.309686]  __do_softirq+0xfb/0x319
> [1.313254]  __irq_exit_rcu+0xd7/0x140
> [1.316993]  common_interrupt+0xb9/0xd0

For some reason the timer interrupt is resent in software. I assume it is
the HPET interrupt because that's what just got initialized.

> [1.143537] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, 
> max_idle_ns: 133484882848 ns

and the callchain below just confirms that:

> [1.393315]  _raw_spin_unlock_irqrestore+0x19/0x40
> [1.398093]  __setup_irq+0x443/0x6d0
> [1.401659]  request_threaded_irq+0x109/0x170
> [1.406005]  hpet_time_init+0x2d/0x4b
> [1.409661]  x86_late_time_init+0x17/0x34
> [1.413658]  start_kernel+0x8cf/0x97f

The software resend code does not go through the regular interrupt entry
path which explains why get_irq_regs() returns NULL.

That software resend is bogus especially since the timer interrupt is
a level interrupt. As dmesg does not say anything about the APIC
delivery mode I assume this goes through i8259, which fails to set the
IRQ_LEVEL flag on all interrupt lines.

The below should fix this.

Thanks,

tglx
---
--- a/arch/x86/kernel/i8259.c
+++ b/arch/x86/kernel/i8259.c
@@ -114,6 +114,7 @@ static void make_8259A_irq(unsigned int
disable_irq_nosync(irq);
io_apic_irqs &= ~(1 << irq);
irq_set_chip_and_handler(irq, &i8259A_chip, handle_level_irq);
+	irq_set_status_flags(irq, IRQ_LEVEL);
enable_irq(irq);
lapic_assign_legacy_vector(irq, true);
}

Re: [PATCHv3 1/2] cpu/hotplug: Keep cpu hotplug disabled until the rebooting cpu is stable

2022-05-10 Thread Thomas Gleixner
On Tue, May 10 2022 at 11:38, Pingfan Liu wrote:
> On Mon, May 09, 2022 at 12:55:21PM +0200, Thomas Gleixner wrote:
>> On Mon, May 09 2022 at 12:13, Pingfan Liu wrote:
>> > The following code chunk repeats in both
>> > migrate_to_reboot_cpu() and smp_shutdown_nonboot_cpus():
>> > This is due to a breakage like the following:
>> 
>> I don't see what's broken here.
>> 
>
> No, no broken. Could it be better to replace 'breakage' with
> 'breakin'?

There is no break-in. There is a phase where CPU hotplug is reenabled,
which might be avoided.

>> > +/* primary_cpu keeps unchanged after migrate_to_reboot_cpu() */
>> 
>> This comment makes no sense.
>> 
>
> Since migrate_to_reboot_cpu() disables cpu hotplug, so the selected
> valid online cpu -- primary_cpu keeps unchange.

So what is that parameter for then? If migrate_to_reboot_cpu() ensured
that the current task is on the reboot CPU then this parameter is
useless, no?

>> >  void smp_shutdown_nonboot_cpus(unsigned int primary_cpu)
>> >  {
>> >unsigned int cpu;
>> >int error;
>> >  
>> > +  /*
>> > +   * Block other cpu hotplug event, so primary_cpu is always online if
>> > +   * it is not touched by us
>> > +   */
>> >cpu_maps_update_begin();
>> > -
>> >/*
>> > -   * Make certain the cpu I'm about to reboot on is online.
>> > -   *
>> > -   * This is inline to what migrate_to_reboot_cpu() already do.
>> > +   * migrate_to_reboot_cpu() disables CPU hotplug assuming that
>> > +   * no further code needs to use CPU hotplug (which is true in
>> > +   * the reboot case). However, the kexec path depends on using
>> > +   * CPU hotplug again; so re-enable it here.
>> 
>> You want to reduce confusion, but in reality this is even more confusing
>> than before.
>> 
>
> This __cpu_hotplug_enable() can be considered to defer from kernel_kexec() to
> arch-dependent code chunk (here), which is a more proper point.
>
> Could it make things better by rephrasing the words as the following?
>migrate_to_reboot_cpu() disables CPU hotplug to prevent the selected
>reboot cpu from disappearing. But arches need cpu_down to hot remove
>cpus except rebooting-cpu, so re-enabling cpu hotplug again.

Can you please use proper words. arches is not a word and it's closer to
the plural of arch, than to the word architecture. This is not twitter.

And no, the architectures do not need cpu_down() at all. This very
function smp_shutdown_nonboot_cpus() invokes cpu_down_maps_locked() to
shut down the non boot CPUs. That fails when cpu_hotplug_disabled != 0.
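
For reference, the check in question looks like this (simplified from
kernel/cpu.c of that time):

    static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
    {
            /* Refuse the teardown when hotplug is administratively disabled */
            if (cpu_hotplug_disabled)
                    return -EBUSY;
            return _cpu_down(cpu, 0, target);
    }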

>> > */
>> > -  if (!cpu_online(primary_cpu))
>> > -  primary_cpu = cpumask_first(cpu_online_mask);
>> > +  __cpu_hotplug_enable();
>> 
>> How is this decrement solving anything? At the end of this function, the
>> counter is incremented again. So what's the point of this exercise?
>> 
> This decrement enables the cpu hot-removing.  Since
> smp_shutdown_nonboot_cpus()->cpu_down_maps_locked(), if
> cpu_hotplug_disabled, it returns -EBUSY.

Correct, so why can't you spell that out in concise words in the first
place right at that comment which reenables hotplug?

>> What does that for arch/powerpc/kernel/kexec_machine64.c now?
>> 
>> Nothing, as far as I can tell. Which means you basically reverted
>> 011e4b02f1da ("powerpc, kexec: Fix "Processor X is stuck" issue during
>> kexec from ST mode") unless I'm completely confused.
>> 
>
> Oops. Forget about powerpc. Considering the cpu hotplug is an
> arch-dependent feature in machine_shutdown(), as x86 does not need it.

It's not a feature, it's an architecture-specific requirement. x86 is
irrelevant here because this is a powerpc requirement.

>> This is tinkering at best. Can we please sit down and rethink this whole
>> machinery instead of applying random duct tape to it?
>> 
> I try to make code look consistent.

Emphasis on try. So far the attempt failed and resulted in a regression.

Thanks,

tglx



Re: [PATCHv3 1/2] cpu/hotplug: Keep cpu hotplug disabled until the rebooting cpu is stable

2022-05-09 Thread Thomas Gleixner
On Mon, May 09 2022 at 12:55, Thomas Gleixner wrote:
> On Mon, May 09 2022 at 12:13, Pingfan Liu wrote:
>> -cpu_hotplug_enable();
>
> This is tinkering at best. Can we please sit down and rethink this whole
> machinery instead of applying random duct tape to it?

That said, I still have not figured out which real world problem you are
actually trying to solve.

Thanks,

tglx



Re: [PATCHv3 1/2] cpu/hotplug: Keep cpu hotplug disabled until the rebooting cpu is stable

2022-05-09 Thread Thomas Gleixner
On Mon, May 09 2022 at 12:13, Pingfan Liu wrote:
> The following code chunk repeats in both
> migrate_to_reboot_cpu() and smp_shutdown_nonboot_cpus():
>
>   if (!cpu_online(primary_cpu))
>   primary_cpu = cpumask_first(cpu_online_mask);
>
> This is due to a breakage like the following:

I don't see what's broken here.

> kernel_kexec()
>migrate_to_reboot_cpu();
>cpu_hotplug_enable();
> ---> comes a cpu_down(this_cpu) on other cpu
>machine_shutdown();
>  smp_shutdown_nonboot_cpus(); // re-check "if (!cpu_online(primary_cpu))" 
> to protect against the former breakin
>
> Although the kexec-reboot task can get through a cpu_down() on its cpu,
> this code looks a little confusing.

Confusing != broken.

> +/* primary_cpu keeps unchanged after migrate_to_reboot_cpu() */

This comment makes no sense.

>  void smp_shutdown_nonboot_cpus(unsigned int primary_cpu)
>  {
>   unsigned int cpu;
>   int error;
>  
> + /*
> +  * Block other cpu hotplug event, so primary_cpu is always online if
> +  * it is not touched by us
> +  */
>   cpu_maps_update_begin();
> -
>   /*
> -  * Make certain the cpu I'm about to reboot on is online.
> -  *
> -  * This is inline to what migrate_to_reboot_cpu() already do.
> +  * migrate_to_reboot_cpu() disables CPU hotplug assuming that
> +  * no further code needs to use CPU hotplug (which is true in
> +  * the reboot case). However, the kexec path depends on using
> +  * CPU hotplug again; so re-enable it here.

You want to reduce confusion, but in reality this is even more confusing
than before.

>*/
> - if (!cpu_online(primary_cpu))
> - primary_cpu = cpumask_first(cpu_online_mask);
> + __cpu_hotplug_enable();

How is this decrement solving anything? At the end of this function, the
counter is incremented again. So what's the point of this exercise?

>   for_each_online_cpu(cpu) {
>   if (cpu == primary_cpu)
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 68480f731192..db4fa6b174e3 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -1168,14 +1168,12 @@ int kernel_kexec(void)
>   kexec_in_progress = true;
>   kernel_restart_prepare("kexec reboot");
>   migrate_to_reboot_cpu();
> -
>   /*
> -  * migrate_to_reboot_cpu() disables CPU hotplug assuming that
> -  * no further code needs to use CPU hotplug (which is true in
> -  * the reboot case). However, the kexec path depends on using
> -  * CPU hotplug again; so re-enable it here.
> +  * migrate_to_reboot_cpu() disables CPU hotplug. If an arch
> +  * relies on the cpu teardown to achieve reboot, it needs to
> +  * re-enable CPU hotplug there.

What does that for arch/powerpc/kernel/kexec_machine64.c now?

Nothing, as far as I can tell. Which means you basically reverted
011e4b02f1da ("powerpc, kexec: Fix "Processor X is stuck" issue during
kexec from ST mode") unless I'm completely confused.

>*/
> - cpu_hotplug_enable();

This is tinkering at best. Can we please sit down and rethink this whole
machinery instead of applying random duct tape to it?

Thanks,

tglx



Re: [PATCH v3] lockdown, selinux: fix wrong subject in some SELinux lockdown checks

2021-06-19 Thread Thomas Gleixner
On Wed, Jun 16 2021 at 10:51, Ondrej Mosnacek wrote:
> diff --git a/arch/x86/mm/testmmiotrace.c b/arch/x86/mm/testmmiotrace.c
> index bda73cb7a044..c43a13241ae8 100644
> --- a/arch/x86/mm/testmmiotrace.c
> +++ b/arch/x86/mm/testmmiotrace.c
> @@ -116,7 +116,7 @@ static void do_test_bulk_ioremapping(void)
>  static int __init init(void)
>  {
>   unsigned long size = (read_far) ? (8 << 20) : (16 << 10);
> - int ret = security_locked_down(LOCKDOWN_MMIOTRACE);
> + int ret = security_locked_down(current_cred(), LOCKDOWN_MMIOTRACE);

I have no real objection to those patches, but it strikes me odd that
out of the 62 changed places 58 have 'current_cred()' and 4 have NULL as
argument.

I can't see why this would ever end up with anything else than
current_cred() or NULL and NULL being the 'special' case. So why not
having security_locked_down_no_cred() and make current_cred() implicit
for security_locked_down() which avoids most of the churn and just makes
the special cases special. I might be missing something though.
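
A sketch of that suggestion; __security_locked_down() is a hypothetical
internal helper, not an existing function:

    /* Common case: the subject is implicitly the current task */
    static inline int security_locked_down(enum lockdown_reason what)
    {
            return __security_locked_down(current_cred(), what);
    }

    /* Special case: no meaningful subject credentials available */
    static inline int security_locked_down_no_cred(enum lockdown_reason what)
    {
            return __security_locked_down(NULL, what);
    }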

Thanks,

tglx



Re: [PATCH v3 4/4] compat: remove some compat entry points

2021-05-19 Thread Thomas Gleixner
On Mon, May 17 2021 at 22:33, Arnd Bergmann wrote:
> From: Arnd Bergmann 
>
> These are all handled correctly when calling the native
> system call entry point, so remove the special cases.
>  arch/x86/entry/syscall_x32.c  |  2 ++
>  arch/x86/entry/syscalls/syscall_32.tbl|  6 ++--
>  arch/x86/entry/syscalls/syscall_64.tbl|  4 +--

That conflicts with

  https://lore.kernel.org/lkml/20210517073815.97426-1-masahi...@kernel.org/

which I'm picking up. We have more changes in that area coming in.

Thanks,

tglx





Re: [patch] x86/crash: fix crash_setup_memmap_entries() out-of-bounds access

2021-04-16 Thread Thomas Gleixner
On Fri, Apr 16 2021 at 17:13, Mike Galbraith wrote:
> On Fri, 2021-04-16 at 16:44 +0200, Borislav Petkov wrote:
>> On Fri, Apr 16, 2021 at 03:16:07PM +0200, Mike Galbraith wrote:
>> > On Fri, 2021-04-16 at 14:16 +0200, Borislav Petkov wrote:
>> > >
>> > > Please be more verbose and structure your commit message like this:
>> >
>> > Hrmph, I thought it was too verbose for dinky one-liner if anything.
>>
>> Please look at how other commit messages in tip have free text - not
>> only tools output.
>>
>> Also, this looks like a fix for some previous commit. Please dig out
>> which commit introduced the issue and put its commit ID in a Fixes: tag
>> above your S-o-B.
>>
>> If you don't have time or desire to do that, you can say so and I'll do
>> it myself when I get a chance.
>
> Ok, bin it for the nonce.

Can all of you involved stop this sandpit fight and do something useful
to fix that obvious bug already?

OMG





Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-17 Thread Thomas Gleixner
On Tue, Nov 17 2020 at 12:19, David Woodhouse wrote:
> On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote:
>> But that does not solve the problem either simply because then the IOMMU
>> will catch the rogue MSIs and you get an interrupt storm on the IOMMU
>> error interrupt.
>
> Not if you can tell the IOMMU to stop reporting those errors.
>
> We can easily do it per-device for DMA errors; not quite sure what
> granularity we have for interrupts. Perhaps the Intel IOMMU only lets
> you set the Fault Processing Disable bit per IRTE entry, and you still
> get faults for Compatibility Format interrupts? Not sure about AMD...

It looks like the fault (DMAR) and event (AMD) interrupts can be
disabled in the IOMMU. That might help to bridge the gap until the PCI
bus is scanned in full glory and the devices can be shut up for real.

If we make this conditional for a crash dump kernel that might do the
trick.

Lots of _might_ there :)

Thanks

tglx








Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-17 Thread Thomas Gleixner
On Mon, Nov 16 2020 at 19:06, Eric W. Biederman wrote:
> Bjorn Helgaas  writes:
> My two top candidates are poking the IOMMUs early to shut things off,
> and figuring out if we can delay enabling interrupts until we have
> initialized pci.

Keeping interrupts disabled post PCI initialization would be nice, but
that requires feeding the whole init machinery through a shredder and
collecting the bits and pieces.

> Poking at IOMMUs early should work for most systems with ``enterprise''
> hardware.  Systems where people care about kdump the most.

The IOMMU IRQ remapping part _is_ initialized early and _before_
interrupts are enabled.

But that does not solve the problem either simply because then the IOMMU
will catch the rogue MSIs and you get an interrupt storm on the IOMMU
error interrupt.

This one is not going to be turned off because the IOMMU error interrupt
handler will handle each of them and tell the core code that everything
is under control.

As I explained several times now, the only way to shut this up reliably
is at the source. Curing the symptom is almost never a good solution.

Thanks,

tglx



Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-15 Thread Thomas Gleixner
On Sun, Nov 15 2020 at 18:01, Lukas Wunner wrote:
> On Sun, Nov 15, 2020 at 04:11:43PM +0100, Thomas Gleixner wrote:
>> Unfortunately there is no way to tell the APIC "Mask vector X" and the
>> dump kernel does neither know which device it comes from nor does it
>> have enumerated PCI completely which would reset the device and shutup
>> the spew. Due to the interrupt storm it does not get that far.
>
> Can't we just set DisINTx, clear MSI Enable and clear MSI-X Enable
> on all active PCI devices in the crashing kernel before starting the
> crash kernel?  So that the crash kernel starts with a clean slate?
>
> Guilherme's original patches from 2018 iterate over all 256 PCI buses.
> That might impact boot time negatively.  The reason he has to do that
> is because the crashing kernel doesn't know which devices exist and
> which have interrupts enabled.  However the crashing kernel has that
> information.  It should either disable interrupts itself or pass the
> necessary information to the crashing kernel as setup_data or whatever.

As I explained before: The problem with doing anything between crashing
and starting the crash kernel is reducing the chance of actually
starting it. The kernel crashed for whatever reason, so it's in a
completely undefined state.

Ergo there is no 'just do something'. You really have to think hard
about what can be done safely with the least probability of running into
another problem.

Thanks,

tglx





Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-15 Thread Thomas Gleixner
On Sun, Nov 15 2020 at 08:29, Eric W. Biederman wrote:
> ebied...@xmission.com (Eric W. Biederman) writes:
> For ordinary irqs you can have this with level triggered irqs
> and the kernel has code that will shutdown the irq at the ioapic
> level.  Then the kernel continues by polling the irq source.
>
> I am still missing details but my first question is can our general
> solution to screaming level triggered irqs apply?

No.

> How can edge triggered MSI irqs be screaming?
>
> Is there something we can do in enabling the APICs or IOAPICs that
> would allow this to be handled better.  My memory when we enable
> the APICs and IOAPICs we completely clear the APIC entries and so
> should be disabling sources.

Yes, but MSI has nothing to do with APIC/IOAPIC

> Is the problem perhaps that we wind up using an APIC entry that was
> previously used for the MSI interrupt as something else when we
> reprogram them?  Even with this why doesn't the generic code
> to stop screaming irqs apply here?

Again. No. The problem cannot be solved at the APIC level. The APIC is
the receiving end of MSI and has absolutely no control over it.

An MSI interrupt is a (DMA) write to the local APIC base address
0xFEExxxxx which has the target CPU and control bits encoded in the
lower bits. The written data is the vector and more control bits.

The only way to stop that is to shut it up at the PCI device level.

Assume the following situation:

  - PCI device has MSI enabled and a valid target vector assigned

  - Kernel crashes

  - Kdump kernel starts

  - PCI device raises interrupts which result in the MSI write

  - Kdump kernel enables interrupts and the pending vector is raised in
the CPU.

  - The CPU has no interrupt descriptor assigned to the vector
and does not even know where the interrupt originated from. So it
treats it like any other spurious interrupt to an unassigned vector,
emits a ratelimited message and ACKs the interrupt at the APIC.

  - PCI device behaves stupid and reraises the interrupt for whatever
reason.

  - Lather, rinse and repeat.

Unfortunately there is no way to tell the APIC "Mask vector X" and the
dump kernel does neither know which device it comes from nor does it
have enumerated PCI completely which would reset the device and shutup
the spew. Due to the interrupt storm it does not get that far.
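
Shutting it up at the source means clearing the enable bit in the
device's MSI capability, roughly like the sketch below; quiesce_msi()
is hypothetical and uses the regular config space accessors, which
presumes the device is already enumerated, exactly what the crash
kernel has not managed to do at that point:

    static void quiesce_msi(struct pci_dev *pdev)
    {
            u16 ctrl;

            if (!pdev->msi_cap)
                    return;
            pci_read_config_word(pdev, pdev->msi_cap + PCI_MSI_FLAGS, &ctrl);
            ctrl &= ~PCI_MSI_FLAGS_ENABLE;
            pci_write_config_word(pdev, pdev->msi_cap + PCI_MSI_FLAGS, ctrl);
    }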

Thanks,

tglx



Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-14 Thread Thomas Gleixner
Bjorn,

On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
> On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
>> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
>> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>> >> doing a kexec and the device is in D0-D3hot, which should also disable
>> >> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
>> >> device causing the storm was in PCI_UNKNOWN state?
>> >
>> > That's indeed a really good question.
>> 
>> So we do that on kexec, but is that true when starting a kdump kernel
>> from a kernel crash? I doubt it.
>
> Ah, right, I bet that's it, thanks.  The kdump path is basically this:
>
>   crash_kexec
> machine_kexec
>
> while the usual kexec path is:
>
>   kernel_kexec
> kernel_restart_prepare
>   device_shutdown
> while (!list_empty(&devices_kset->list))
>   dev->bus->shutdown
> pci_device_shutdown# pci_bus_type.shutdown
> machine_kexec
>
> So maybe we need to explore doing some or all of device_shutdown() in
> the crash_kexec() path as well as in the kernel_kexec() path.

The problem is that if the machine crashed anything you try to attempt
before starting the crash kernel is reducing the chance that the crash
kernel actually starts.

Is there something at the root bridge level which allows to tell the
underlying busses to shut up, reset or go into a defined state? That
might avoid chasing lists which might be already unreliable.

Thanks,

tglx



Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-13 Thread Thomas Gleixner
Bjorn,

On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
> On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>> doing a kexec and the device is in D0-D3hot, which should also disable
>> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
>> device causing the storm was in PCI_UNKNOWN state?
>
> That's indeed a really good question.

So we do that on kexec, but is that true when starting a kdump kernel
from a kernel crash? I doubt it.

Thanks,

tglx




Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-13 Thread Thomas Gleixner
Bjorn,

On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
> On Fri, Nov 06, 2020 at 10:14:14AM -0300, Guilherme G. Piccoli wrote:
>> On 23/10/2018 14:03, Bjorn Helgaas wrote:
> I guess Thomas' patch [2] (from thread [1]) doesn't solve this
> problem?

No. As I explained in [1] patch from [2] cannot solve it because the
patch from [2] which is what Liu was trying to solve requires that there
is a registered interrupt handler which knows how to shut up the
interrupt.

> I think [0] proposes using early_quirks() to disable MSIs at
> boot-time.  That doesn't seem like a robust solution because (a) the
> problem affects all arches but early_quirks() is x86-specific and (b)
> even on x86 early_quirks() only works for PCI segment 0 because it
> relies on the 0xCF8/0xCFC I/O ports.
>
> If I understand Thomas' email correctly, the IRQ storm occurs here:
>
>   start_kernel
> setup_arch
>   early_quirks   # x86-only
> ...
>   read_pci_config_16(num, slot, func, PCI_VENDOR_ID)
> outl(..., 0xcf8) # PCI segment 0 only
> inw(0xcfc)
> local_irq_enable
>   ...
> native_irq_enable
>   asm("sti") # <-- enable IRQ, storm occurs
>
> native_irq_enable() happens long before we discover PCI host bridges
> and run the normal PCI quirks, so those would be too late to disable
> MSIs.

Correct.
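
For reference, the type-1 accessor behind those early reads is roughly
this (cf. arch/x86/pci/early.c); the address encoding has no room for a
PCI segment number, hence segment 0 only:

    u16 read_pci_config_16(u8 bus, u8 slot, u8 func, u8 offset)
    {
            /* Address port 0xCF8: enable bit, bus, device, function, register */
            outl(0x80000000 | (bus << 16) | (slot << 11) |
                 (func << 8) | offset, 0xcf8);
            /* Data port 0xCFC: 16bit read from the right half of the dword */
            return inw(0xcfc + (offset & 2));
    }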

> It doesn't seem practical to disable MSIs in the kdump kernel at the
> PCI level.  I was hoping we could disable them somewhere in the IRQ
> code, e.g., at IOAPICs, but I think Thomas is saying that's not
> feasible.

MSIs are not even going near the IOAPIC and as long as the interrupt
core does not have an interrupt set up for the device is has no idea
where to look at to shut it down. Actually it does not even reach the
interrupt core. The raised vector arrives at the CPU and the low level
code sees: No handler associated, ignore it. We cannot do anything from
the low level code because all we know is that the vector was raised,
but we have absolutely zero clue where that came from. At that point the
IO-APIC interrupts are definitely not the problem because they are all
disabled.

> It seems like the only option left is to disable MSIs before the
> kexec.  We used to clear the MSI/MSI-X Enable bits in
> pci_device_shutdown(), but that broke console devices that relied on
> MSI and caused "nobody cared" warnings when the devices fell back to
> using INTx, so fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in
> pci_device_shutdown()") left them unchanged.

That might be solvable because INTx arrives at the IO-APIC and we could
mask all the INTx related IO-APIC lines, but that's icky because of
this:

> pci_device_shutdown() still clears the Bus Master Enable bit if we're
> doing a kexec and the device is in D0-D3hot, which should also disable
> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
> device causing the storm was in PCI_UNKNOWN state?

That's indeed a really good question.

Thanks,

tglx



Re: [PATCH RFC PKS/PMEM 05/58] kmap: Introduce k[un]map_thread

2020-11-11 Thread Thomas Gleixner
Ira,

On Fri, Oct 09 2020 at 12:49, ira weiny wrote:
> From: Ira Weiny 
>
> To correctly support the semantics of kmap() with Kernel protection keys
> (PKS), kmap() may be required to set the protections on multiple
> processors (globally).  Enabling PKS globally can be very expensive
> depending on the requested operation.  Furthermore, enabling a domain
> globally reduces the protection afforded by PKS.
>
> Most kmap() (Aprox 209 of 229) callers use the map within a single thread and
> have no need for the protection domain to be enabled globally.  However, the
> remaining callers do not follow this pattern and, as best I can tell, expect
> the mapping to be 'global' and available to any thread who may access the
> mapping.[1]
>
> We don't anticipate global mappings to pmem, however in general there is a
> danger in changing the semantics of kmap().  Effectively, this would cause an
> unresolved page fault with little to no information about why the failure
> occurred.
>
> To resolve this a number of options were considered.
>
> 1) Attempt to change all the thread local kmap() calls to kmap_atomic()[2]
> 2) Introduce a flags parameter to kmap() to indicate if the mapping should be
>global or not
> 3) Change ~20 call sites to 'kmap_global()' to indicate that they require a
>global enablement of the pages.
> 4) Change ~209 call sites to 'kmap_thread()' to indicate that the mapping is 
> to
>be used within that thread of execution only
>
> Option 1 is simply not feasible.  Option 2 would require all of the call sites
> of kmap() to change.  Option 3 seems like a good minimal change but there is a
> danger that new code may miss the semantic change of kmap() and not get the
> behavior the developer intended.  Therefore, #4 was chosen.

There is Option #5:

Convert the thread local kmap() invocations to the proposed kmap_local()
interface which is coming along [1].

That solves a couple of issues:

 1) It relieves the current kmap_atomic() usage sites from the implict
pagefault/preempt disable semantics which apply even when
CONFIG_HIGHMEM is disabled. kmap_local() still can be invoked from
atomic context.

 2) Due to #1 it allows to replace the conditional usage of kmap() and
kmap_atomic() for purely thread local mappings.

 3) It puts the burden on the HIGHMEM inflicted systems

 4) It is actually more efficient for most of the pure thread local use
cases on HIGHMEM inflicted systems because it avoids the overhead of
the global lock and the potential kmap slot exhaustion. A potential
preemption will be more expensive, but that's not really the case we
want to optimize for.

 5) It solves the RT issue vs. kmap_atomic()

So instead of creating yet another variety of kmap() which is just
scratching the particular PKRS itch, can we please consolidate all of
that on the wider reaching kmap_local() approach?
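
The usage pattern of that interface is trivial; a sketch against the
API which later landed as kmap_local_page()/kunmap_local(), with a
hypothetical helper name:

    static void copy_from_page(struct page *page, void *buffer, size_t len)
    {
            void *addr = kmap_local_page(page);

            /* Mapping is CPU/thread local, nestable, fine in atomic context */
            memcpy(buffer, addr, len);
            kunmap_local(addr);
    }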

Thanks,

tglx
 
[1] https://lore.kernel.org/lkml/20201103092712.714480...@linutronix.de/




Re: [PATCH RFC PKS/PMEM 05/58] kmap: Introduce k[un]map_thread

2020-11-11 Thread Thomas Gleixner
On Mon, Nov 09 2020 at 20:59, Ira Weiny wrote:
> On Tue, Nov 10, 2020 at 02:13:56AM +0100, Thomas Gleixner wrote:
> Also, we can convert the new memcpy_*_page() calls to kmap_local() as well.
> [For now my patch just uses kmap_atomic().]
>
> I've not looked at all of the patches in your latest version.  Have you
> included converting any of the kmap() call sites?  I thought you were more
> focused on converting the kmap_atomic() to kmap_local()?

I did not touch any of those yet, but it's a logical consequence to
convert all kmap() instances which are _not_ creating a global mapping
over to it.

Thanks,

tglx




Re: [PATCH 0/3] warn and suppress irqflood

2020-10-28 Thread Thomas Gleixner
On Wed, Oct 28 2020 at 14:02, Pingfan Liu wrote:
> On Tue, Oct 27, 2020 at 3:59 AM Thomas Gleixner  wrote:
>> Also Liu's patch only works if:
>>
>>   1) CONFIG_IRQ_TIME_ACCOUNTING is enabled
>
> I wonder whether it can not be a default option or not by the following 
> method:
>   DEFINE_STATIC_KEY_FALSE(irqtime_account), and enable it according to
> a boot param.

How so?

config IRQ_TIME_ACCOUNTING
	depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE

> This will have no impact on performance with the disabled branch.
> Meanwhile users can easily turn on the option to detect an irq flood
> without  recompiling the kernel.
>
> If it is doable, I will rework only on [1/2].

See above :)

>>   2) the runaway interrupt has been requested by the relevant driver in
>>  the dump kernel.
>
> Yes, it raises a big challenge to my method. Kdump kernel miss the
> whole picture of the first kernel's irq routing.

Correct. If there is anything stale then you get what Guilherme
observed. But the irq core can do nothing about that.

Something like the completely untested below should work independent of
config options.

Thanks,

tglx
---
 include/linux/irqdesc.h |4 ++
 kernel/irq/manage.c |3 +
 kernel/irq/spurious.c   |   74 +++-
 3 files changed, 61 insertions(+), 20 deletions(-)

--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -30,6 +30,8 @@ struct pt_regs;
  * @tot_count: stats field for non-percpu irqs
  * @irq_count: stats field to detect stalled irqs
  * @last_unhandled:aging timer for unhandled count
+ * @storm_count:   Counter for irq storm detection
+ * @storm_checked: Timestamp for irq storm detection
  * @irqs_unhandled:stats field for spurious unhandled interrupts
  * @threads_handled:   stats field for deferred spurious detection of threaded 
handlers
  * @threads_handled_last: comparator field for deferred spurious detection of 
theraded handlers
@@ -65,6 +67,8 @@ struct irq_desc {
unsigned inttot_count;
unsigned intirq_count;  /* For detecting broken IRQs */
unsigned long   last_unhandled; /* Aging timer for unhandled 
count */
+   unsigned long   storm_count;
+   unsigned long   storm_checked;
unsigned intirqs_unhandled;
atomic_tthreads_handled;
int threads_handled_last;
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1581,6 +1581,9 @@ static int
if (!shared) {
init_waitqueue_head(&desc->wait_for_threads);
 
+   /* Take a timestamp for interrupt storm detection */
+   desc->storm_checked = jiffies;
+
/* Setup the type (level, edge polarity) if configured: */
if (new->flags & IRQF_TRIGGER_MASK) {
ret = __irq_set_trigger(desc,
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -21,6 +21,7 @@ static void poll_spurious_irqs(struct ti
 static DEFINE_TIMER(poll_spurious_irq_timer, poll_spurious_irqs);
 static int irq_poll_cpu;
 static atomic_t irq_poll_active;
+static unsigned long irqstorm_limit __ro_after_init;
 
 /*
  * We wait here for a poller to finish.
@@ -189,18 +190,21 @@ static inline int bad_action_ret(irqretu
  * (The other 100-of-100,000 interrupts may have been a correctly
  *  functioning device sharing an IRQ with the failing one)
  */
-static void __report_bad_irq(struct irq_desc *desc, irqreturn_t action_ret)
+static void __report_bad_irq(struct irq_desc *desc, irqreturn_t action_ret,
+bool storm)
 {
unsigned int irq = irq_desc_get_irq(desc);
struct irqaction *action;
unsigned long flags;
 
-   if (bad_action_ret(action_ret)) {
-   printk(KERN_ERR "irq event %d: bogus return value %x\n",
-   irq, action_ret);
-   } else {
-   printk(KERN_ERR "irq %d: nobody cared (try booting with "
+   if (!storm) {
+   if (bad_action_ret(action_ret)) {
+   pr_err("irq event %d: bogus return value %x\n",
+  irq, action_ret);
+   } else {
+   pr_err("irq %d: nobody cared (try booting with "
"the \"irqpoll\" option)\n", irq);
+   }
}
dump_stack();
printk(KERN_ERR "handlers:\n");
@@ -228,7 +232,7 @@ static void report_bad_irq(struct irq_de
 
if (count > 0) {
count--;
-   __report_bad_irq(desc, action_ret);
+   __report_bad_irq(desc, action_ret, false);
}
 }
 
@@ -267,6 +271,33 @@

Re: [PATCH 0/3] warn and suppress irqflood

2020-10-26 Thread Thomas Gleixner
On Mon, Oct 26 2020 at 17:28, Guilherme Piccoli wrote:
> On Mon, Oct 26, 2020 at 4:59 PM Thomas Gleixner  wrote:
>> It gets flooded right at the point where the crash kernel enables
>> interrupts in start_kernel(). At that point there is no device driver
>> and no interrupt requested. All you can see on the console for this is
>>
>>  "common_interrupt: $VECTOR.$CPU No irq handler for vector"
>>
>> And contrary to Liu's patches which try to disable a requested interrupt
>> if too many of them arrive, the kernel cannot do anything because there
>> is nothing to disable in your case. That's why you needed to do the MSI
>> disable magic in the early PCI quirks which run before interrupts get
>> enabled.
>
> Wow, thank you very much for this great explanation (without a
> reproducer) - it's nice to hear somebody that deeply understands the
> code! And double thanks for CCing Bjorn.

Understanding the code is only half of the picture. You need to
understand how the hardware works or not :)

> So, I don't want to hijack Liu's thread, but do you think it makes
> sense to have my approach as a (debug) parameter to prevent such a
> degenerate case?

At least it makes sense to some extent even if it's incomplete. What
bothers me is that it'd be x86 specific while the issue is pretty much
architecture independent. I don't think that the APIC is special in that
regard. Rogue MSIs should be able to bring down pretty much all
architectures.

> Or could we have something in core IRQ code to prevent irq flooding in
> such scenarios, something "stronger" than disabling MSIs (APIC-level,
> likely)?

For your case? No. The APIC cannot be protected against rogue MSIs. The
only cure is to disable interrupts or disable MSIs on all PCI[E] devices
early on. Disabling interrupts is not so much of an option obviously :)

Thanks,

tglx







Re: [PATCH 0/3] warn and suppress irqflood

2020-10-26 Thread Thomas Gleixner
On Mon, Oct 26 2020 at 12:06, Guilherme Piccoli wrote:
> On Sun, Oct 25, 2020 at 8:12 AM Pingfan Liu  wrote:
>
> Some time ago (2 years) we faced a similar issue in x86-64, a hard to
> debug problem in kdump, that eventually was narrowed to a buggy NIC FW
> flooding IRQs in kdump kernel, and no messages showed (although kernel
> changed a lot since that time, today we might have better IRQ
> handling/warning). We tried an early-boot fix, by disabling MSIs (as
> per PCI spec) early in x86 boot, but it wasn't accepted - Bjorn asked
> pertinent questions that I couldn't respond (I lost the reproducer)
> [0].
...
> [0] lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com

With that broken firmware the NIC continued to send MSI messages to the
vector/CPU which was assigned to it before the crash. But the crash
kernel has no interrupt descriptor for this vector installed. So Liu's
patches wont print anything simply because the interrupt core cannot
detect it.

To answer Bjorns still open question about when the point X is:

  
https://lore.kernel.org/linux-pci/20181023170343.ga4...@bhelgaas-glaptop.roam.corp.google.com/

It gets flooded right at the point where the crash kernel enables
interrupts in start_kernel(). At that point there is no device driver
and no interrupt requested. All you can see on the console for this is

 "common_interrupt: $VECTOR.$CPU No irq handler for vector"

And contrary to Liu's patches which try to disable a requested interrupt
if too many of them arrive, the kernel cannot do anything because there
is nothing to disable in your case. That's why you needed to do the MSI
disable magic in the early PCI quirks which run before interrupts get
enabled.

Also Liu's patch only works if:

  1) CONFIG_IRQ_TIME_ACCOUNTING is enabled

  2) the runaway interrupt has been requested by the relevant driver in
 the dump kernel.

Especially #1 is not a sensible restriction.

Thanks,

tglx





Re: [PATCH 0/3] warn and suppress irqflood

2020-10-22 Thread Thomas Gleixner
On Thu, Oct 22 2020 at 13:56, Pingfan Liu wrote:
> I hit an irqflood bug on a powerpc platform, and two years ago, on an
> x86 platform.
> When the bug happens, the kernel is totally occupied by irq.  Currently, there
> may be nothing or just soft lockup warning showed in console. It is better
> to warn users with irq flood info.
>
> In the kdump case, the kernel can move on by suppressing the irq flood.

You're curing the symptom not the cause and the cure is just magic and
can't work reliably.

Where is that irq flood originated from and why is none of the
mechanisms we have in place to shut it up working?

Thanks,

tglx







Re: [patch 0/2] timekeeping: NMI safe timekeeper enhancements

2020-08-23 Thread Thomas Gleixner
Petr,

On Thu, Aug 20 2020 at 12:43, Petr Mladek wrote:
> On Thu 2020-08-20 12:30:55, Thomas Gleixner wrote:
>> Good. So I suggest that I apply that on top of rc1 somewhere in tip and
>> tag the top commit. So you can pull that tag into your printk branch and
>> go wild.
>
> Sounds good to me.

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
timekeeping-for-printk-2020-08-23

Have fun!

Thanks,

tglx




Re: [patch 0/2] timekeeping: NMI safe timekeeper enhancements

2020-08-20 Thread Thomas Gleixner
Petr,

On Thu, Aug 20 2020 at 10:47, Petr Mladek wrote:
> The interface is perfectly fine for printk() needs.

Good. So I suggest that I apply that on top of rc1 somewhere in tip and
tag the top commit. So you can pull that tag into your printk branch and
go wild.

Thanks,

tglx





[patch 0/2] timekeeping: NMI safe timekeeper enhancements

2020-08-14 Thread Thomas Gleixner
printk intends to store various timestamps (MONOTONIC, REALTIME, BOOTTIME)
to make correlation of dmesg accross different machines easier.

The NMI safe timekeeper allows to retrieve these timestamps from any
context, but it lacks a few things:

  1) The nmi safe accessors are not providing time stamps until timekeeping
 is initialized during early boot.

 This can be mitigated by using sched clock up to the point where time-
 keeping becomes available. This has no side effects because clock
 monotonic takes sched clock into account at initialization time
 anyway. So no random time jumps are possible.

 If early sched clock is not available then there is no difference
 either, obviously. Both return 0.

  2) It requires a new accessor which allows to retrieve all three clock
 timestamps in one go.

 Trivial exercise. But there are a few twists:

 A) Access to boot time can be racy if the sleep time offset on resume
is injected after timekeeping resume. That's the case when the RTC
or whatever is used to calculate sleep time is not available when
the timekeeping core is resumed.

 B) Timestamps are frozen across the very inner low level
suspend/resume path. Not a big problem, but might affect the
developer debug printks.

 A detailed description of these two points is in the changelog of
 patch 2.
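
Consumption on the printk side then boils down to this sketch, using
the interface added in patch 2:

	struct ktime_timestamps ts;

	ktime_get_fast_timestamps(&ts);
	/* ts.mono, ts.boot, ts.real: nanosecond timestamps, NMI safe */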

Thanks,

tglx



[patch 1/2] timekeeping: Utilize local_clock() for NMI safe timekeeper during early boot

2020-08-14 Thread Thomas Gleixner
During early boot the NMI safe timekeeper returns 0 until the first
clocksource becomes available.

This prevents it from being used for printk or other facilities which today
use sched clock. sched clock can be available way before timekeeping is
initialized.

The obvious workaround for this is to utilize the early sched clock in the
default dummy clock read function until a clocksource becomes available.

After switching to the clocksource clock MONOTONIC and BOOTTIME will not
jump because the timekeeping_init() bases clock MONOTONIC on sched clock
and the offset between clock MONOTONIC and BOOTTIME is zero during boot.

Clock REALTIME cannot provide useful timestamps during early boot up to
the point where a persistent clock becomes available, which is either in
timekeeping_init() or later when the RTC driver which might depend on I2C
or other subsystems is initialized.

There is a minor difference to sched_clock() vs. suspend/resume. As the
timekeeper clock source might not be accessible during suspend, after
timekeeping_suspend() timestamps freeze up to the point where
timekeeping_resume() is invoked. OTOH this is true for some sched clock
implementations as well.

Signed-off-by: Thomas Gleixner 
---
 kernel/time/timekeeping.c |   33 +
 1 file changed, 25 insertions(+), 8 deletions(-)

--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -53,6 +53,9 @@ static struct {
 static DEFINE_RAW_SPINLOCK(timekeeper_lock);
 static struct timekeeper shadow_timekeeper;
 
+/* flag for if timekeeping is suspended */
+int __read_mostly timekeeping_suspended;
+
 /**
  * struct tk_fast - NMI safe timekeeper
  * @seq:   Sequence counter for protecting updates. The lowest bit
@@ -72,26 +75,40 @@ static u64 cycles_at_suspend;
 
 static u64 dummy_clock_read(struct clocksource *cs)
 {
-   return cycles_at_suspend;
+   if (timekeeping_suspended)
+   return cycles_at_suspend;
+   return local_clock();
 }
 
 static struct clocksource dummy_clock = {
.read = dummy_clock_read,
 };
 
+/*
+ * Boot time initialization which allows local_clock() to be utilized
+ * during early boot when clocksources are not available. local_clock()
+ * returns nanoseconds already so no conversion is required, hence mult=1
+ * and shift=0. When the first proper clocksource is installed then
+ * the fast time keepers are updated with the correct values.
+ */
+#define FAST_TK_INIT   \
+   {   \
+   .clock  = &dummy_clock, \
+   .mask   = CLOCKSOURCE_MASK(64), \
+   .mult   = 1,\
+   .shift  = 0,\
+   }
+
 static struct tk_fast tk_fast_mono cacheline_aligned = {
-   .base[0] = { .clock = &dummy_clock, },
-   .base[1] = { .clock = &dummy_clock, },
+   .base[0] = FAST_TK_INIT,
+   .base[1] = FAST_TK_INIT,
 };
 
 static struct tk_fast tk_fast_raw  cacheline_aligned = {
-   .base[0] = { .clock = &dummy_clock, },
-   .base[1] = { .clock = &dummy_clock, },
+   .base[0] = FAST_TK_INIT,
+   .base[1] = FAST_TK_INIT,
 };
 
-/* flag for if timekeeping is suspended */
-int __read_mostly timekeeping_suspended;
-
 static inline void tk_normalize_xtime(struct timekeeper *tk)
 {
while (tk->tkr_mono.xtime_nsec >= ((u64)NSEC_PER_SEC << 
tk->tkr_mono.shift)) {




[patch 2/2] timekeeping: Provide multi-timestamp accessor to NMI safe timekeeper

2020-08-14 Thread Thomas Gleixner
printk wants to store various timestamps (MONOTONIC, REALTIME, BOOTTIME) to
make correlation of dmesg from several systems easier.

Provide an interface to retrieve all three timestamps in one go.

There are some caveats:

1) Boot time and late sleep time injection

  Boot time is a racy access on 32bit systems if the sleep time injection
  happens late during resume and not in timekeeping_resume(). That could be
  avoided by expanding struct tk_read_base with boot offset for 32bit and
  adding more overhead to the update. As this is a hard to observe once per
  resume event which can be filtered with reasonable effort using the
  accurate mono/real timestamps, it's probably not worth the trouble.

  Aside of that it might be possible on 32 and 64 bit to observe the
  following when the sleep time injection happens late:

  CPU 0  CPU 1
  timekeeping_resume()
  ktime_get_fast_timestamps()
mono, real = __ktime_get_real_fast()
 inject_sleep_time()
   update boot offset
boot = mono + bootoffset;
  
  That means that boot time already has the sleep time adjustment, but
  real time does not. On the next readout both are in sync again.
  
  Preventing this for 64bit is not really feasible without destroying the
  careful cache layout of the timekeeper because the sequence count and
  struct tk_read_base would then need two cache lines instead of one.

2) Suspend/resume timestamps

   Access to the time keeper clock source is disabled across the innermost
   steps of suspend/resume. The accessors still work, but the timestamps
   are frozen until time keeping is resumed which happens very early.

   For regular suspend/resume there is no observable difference vs. sched
   clock, but it might affect some of the nasty low level debug printks.

   OTOH, access to sched clock is not guaranteed across suspend/resume on
   all systems either so it depends on the hardware in use.

   If that turns out to be a real problem then this could be mitigated by
   using sched clock in a similar way as during early boot. But it's not as
   trivial as on early boot because it needs some careful protection
   against the clock monotonic timestamp jumping backwards on resume.

Signed-off-by: Thomas Gleixner 
---
 include/linux/timekeeping.h |   15 
 kernel/time/timekeeping.c   |   76 +---
 2 files changed, 80 insertions(+), 11 deletions(-)

--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -222,6 +222,18 @@ extern bool timekeeping_rtc_skipresume(v
 
 extern void timekeeping_inject_sleeptime64(const struct timespec64 *delta);
 
+/*
+ * struct ktime_timestamps - Simultaneous mono/boot/real timestamps
+ * @mono:  Monotonic timestamp
+ * @boot:  Boottime timestamp
+ * @real:  Realtime timestamp
+ */
+struct ktime_timestamps {
+   u64 mono;
+   u64 boot;
+   u64 real;
+};
+
 /**
  * struct system_time_snapshot - simultaneous raw/real time capture with
  *  counter value
@@ -280,6 +292,9 @@ extern int get_device_system_crosststamp
  */
 extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
 
+/* NMI safe mono/boot/realtime timestamps */
+extern void ktime_get_fast_timestamps(struct ktime_timestamps *snap);
+
 /*
  * Persistent clock related interfaces
  */
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -527,29 +527,29 @@ u64 notrace ktime_get_boot_fast_ns(void)
 }
 EXPORT_SYMBOL_GPL(ktime_get_boot_fast_ns);
 
-
 /*
  * See comment for __ktime_get_fast_ns() vs. timestamp ordering
  */
-static __always_inline u64 __ktime_get_real_fast_ns(struct tk_fast *tkf)
+static __always_inline u64 __ktime_get_real_fast(struct tk_fast *tkf, u64 
*mono)
 {
struct tk_read_base *tkr;
+   u64 basem, baser, delta;
unsigned int seq;
-   u64 now;
 
do {
seq = raw_read_seqcount_latch(&tkf->seq);
tkr = tkf->base + (seq & 0x01);
-   now = ktime_to_ns(tkr->base_real);
+   basem = ktime_to_ns(tkr->base);
+   baser = ktime_to_ns(tkr->base_real);
 
-   now += timekeeping_delta_to_ns(tkr,
-   clocksource_delta(
-   tk_clock_read(tkr),
-   tkr->cycle_last,
-   tkr->mask));
+   delta = timekeeping_delta_to_ns(tkr,
+   clocksource_delta(tk_clock_read(tkr),
+   tkr->cycle_last, tkr->mask));
} while (read_seqcount_retry(&tkf->seq, seq));
 
-   return now;
+   if (mono)
+   *mono = basem + delta;
+   return baser + delta;
 }
 
 /**
@@ -557,11 +557,65 @

Re: [RFC PATCH] printk: Change timestamp to triplet as mono, boot and real

2020-08-13 Thread Thomas Gleixner
Sergey Senozhatsky  writes:
> On (20/08/11 15:02), Petr Mladek wrote:
>> There is still the alternative to print all three timestamps regularly
>> for those interested. It is less user convenient but much easier
>> to maintain.
>
> Yes, that's a nice alternative.

It's trivial on the kernel side, annoying for all people who do not care
about them because they show up in syslog and it's a fricking nightmare
to reconstruct over a large cluster of machines.

Thanks,

tglx



Re: [RFC PATCH] printk: Change timestamp to triplet as mono, boot and real

2020-08-11 Thread Thomas Gleixner
Petr Mladek  writes:
> On Tue 2020-08-11 12:40:22, Orson Zhai wrote:
>> This is an updated version which comes from patch [1] written by Thomas
>> and suggestion [2] about VMCORE_INFO given by Linus.

All of that wants to be properly disentangled into separate patches.
 
>> This patch has been tested in qemu-x86-system. One problem is the timestamp
>> in kernel log will be printed [0.00] for longer time than before. 
>
> This would be a regression. People put huge effort into having early boot
> timestamps, see
> https://lore.kernel.org/lkml/20180719205545.16512-1-pasha.tatas...@oracle.com/
> Adding some active people from this patchset into CC.
>
> I wonder if we could have these early timestamps also in the mono
> clock.

Not really. timekeeping init happens way after the early TSC (or
whatever clock) is registered as sched_clock(). And there is no
realistic way to move timekeeping init earlier.

What we could do instead is to utilize sched_clock() up to the point
where timekeeping becomes available and ensure that monotonic time is
not jumping backwards vs. sched_clock() when switching over. For this
early boot phase, clock realtime timestamps would be invalid of course
and they can stay invalid even after timekeeping is initialized on
systems where the RTC is not available in the early boot process.

> At least "crash" tool would need an update anyway. AFAIK, it checks
> the size of struct printk_log and refuses to read it when it changes.
>
> It means that the hack with VMCOREINFO_FIELD_OFFSET probably is not
> needed because we would need to update the crashdump-related tools anyway.
>
> Well, the timing is good. We are about to switch the printk ring
> buffer into a lockless one. It requires updating the crashdump tools
> as well. We could do this at the same time. The lockless ring buffer
> already is in linux-next. It is aimed for 5.10 or 5.11.
...
> It would be great to synchronize all these changes changes of the
> printk log buffer structures.

I agree that having one update is a good thing, but pretty please can we
finally make progress with this and not create yet another dependency?

Thanks,

tglx



Re: [PATCH 0/4] x86/mce: protect nr_cpus from rebooting by broadcast mce

2019-08-07 Thread Thomas Gleixner
On Thu, 8 Aug 2019, Pingfan Liu wrote:
> On Wed, Aug 7, 2019 at 9:07 PM Thomas Gleixner  wrote:
> >
> [...]
> > > > >
> > > > > *** Back ground ***
> > > > >
> > > > > On x86 it's required to have all logical CPUs set CR4.MCE=1. 
> > > > > Otherwise, a
> > > > > broadcast MCE observing CR4.MCE=0b on any core will shutdown the 
> > > > > machine.
> >
> > Pingfan, please trim your replies and remove the useless gunk after your 
> > answer.
> OK. I will.
> 
> Thanks,
> Pingfan
> >
> > Thanks,
> >
> > tglx
> 

First attempt to do so failed :)



Re: [PATCH 0/4] x86/mce: protect nr_cpus from rebooting by broadcast mce

2019-08-07 Thread Thomas Gleixner
On Wed, 7 Aug 2019, Pingfan Liu wrote:
> On Wed, Aug 07, 2019 at 11:00:41AM +0800, Dave Young wrote:
> > Add Tony and Xunlei in cc.
> > On 08/05/19 at 04:58pm, Pingfan Liu wrote:
> > > This series include two related groups:
> > > [1-3/4]: protect nr_cpus from rebooting by broadcast mce
> > > [4/4]: improve "kexec -l" robustness against broadcast mce
> > > 
> > > When I tried to fix [1], Thomas raised concern about the nr_cpus'
> > > vulnerability to unexpected rebooting by broadcast mce. After analysis,
> > > I think only the following first case suffers from the rebooting by
> > > broadcast mce. [1-3/4] aims to fix that issue.
> > 
> > I did not understand and read the MCE details, but we previously had a
> > MCE problem, Xunlei fixed in below commit:
> > commit 5bc329503e8191c91c4c40836f062ef771d8ba83
> > Author: Xunlei Pang 
> > Date:   Mon Mar 13 10:50:19 2017 +0100
> > 
> > x86/mce: Handle broadcasted MCE gracefully with kexec
> > 
> > I wonder if this is same issue or not. Also the old discussion is in
> > below thread:
> > https://lore.kernel.org/patchwork/patch/753530/
> > 
> > Tony raised similar questions, but I'm not sure if it is still a problem
> > or it has been fixed.
> > 
>
> Xunlei's patch is the precondition of stability for case 2: booting
> with "kexec -p nr_cpus="

Correct. The only dangerous issue which is then left is that an MCE hits
_before_ all CPUs have CR4.MCE=1 set. That's a general issue also for cold
boot. Thanks to the brilliant hardware design, all we can do is pray.
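
For context, the window closes only once each CPU has executed something
equivalent to the following during bringup (sketch; the helper name is
made up, cr4_set_bits() and X86_CR4_MCE are real):

#include <asm/processor-flags.h>	/* X86_CR4_MCE */
#include <asm/tlbflush.h>		/* cr4_set_bits() */

static void mce_cpu_enable(void)
{
	/*
	 * Until this runs, a broadcast MCE observing CR4.MCE=0 on
	 * this CPU shuts the whole machine down.
	 */
	cr4_set_bits(X86_CR4_MCE);
}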

> For case1/3, extra effort is needed.
> 
> Thanks,
>   Pingfan
> > > 
> > > *** Back ground ***
> > > 
> > > On x86 it's required to have all logical CPUs set CR4.MCE=1. Otherwise, a
> > > broadcast MCE observing CR4.MCE=0b on any core will shutdown the machine.

Pingfan, please trim your replies and remove the useless gunk after your answer.

Thanks,

tglx


Re: [PATCH RESEND 1/3] x86/boot: Add bit fields into xloadflags for 5-level kernel checking

2019-01-29 Thread Thomas Gleixner
On Fri, 25 Jan 2019, Baoquan He wrote:

> Add two bit fields XLF_5LEVEL and XLF_5LEVEL_ENABLED for 5-level kernel.

These are not bit fields. These are simple bits.

> Bit XLF_5LEVEL indicates if 5-level related code is contained
> in this kernel.
> Bit XLF_5LEVEL_ENABLED indicates if CONFIG_X86_5LEVEL=y is set.

I'm confused. 

> - .word XLF0 | XLF1 | XLF23 | XLF4
> +#ifdef CONFIG_X86_64
> +#ifdef CONFIG_X86_5LEVEL
> +#define XLF56 (XLF_5LEVEL|XLF_5LEVEL_ENABLED)
> +#else
> +#define XLF56 XLF_5LEVEL
> +#endif
> +#else
> +#define XLF56 0
> +#endif
> +
> + .word XLF0 | XLF1 | XLF23 | XLF4 | XLF56

So this actually stores the bits, but looking at the following patch which
fixes the real issue:

> +	if (!(header->xloadflags & XLF_5LEVEL) && pgtable_l5_enabled()) {
> +		pr_err("Can not jump to old 4-level kernel from 5-level kernel.\n");
> +		return ret;
> +	}

So what is XLF_5LEVEL_ENABLED used for and why does it exist at all?

Thanks,

tglx




Re: [PATCH 1/3] x86/boot: Add bit fields into xloadflags for 5-level kernel checking

2018-09-05 Thread Thomas Gleixner
On Tue, 4 Sep 2018, H. Peter Anvin wrote:

> On 09/04/18 01:42, Kirill A. Shutemov wrote:
> > 
> > Switching between 4- and 5-level paging modes (in either direction)
> > requires paing disabling. It means the code that does the switching has to
> > be under 4G otherwise we would lose control.
> > 
> > We handle the switching correctly in kernel decompression code, but not on
> > kexec caller side.
> > 
> > XLF_5LEVEL indicates that kernel decompression code can deal with
> > switching between paging modes and it's safe to jump there in 5-level
> > paging mode.
> > 
> > As an alternative we can change kexec to switch to 4-level paging mode
> > before starting the new kernel. Not sure how hard it will be.
> > 
> 
> Have a flag saying entering in 5-level mode is fine.  However, you really
> should support returning to 4-level mode in kexec.  It is *much* easier to do
> on the caller side as you have total control of memory allocation there.

Works for a regular kexec, but not for starting a crash kernel.

Thanks,

tglx



Re: [PATCH v10 00/38] x86: Secure Memory Encryption (AMD)

2017-07-18 Thread Thomas Gleixner
On Mon, 17 Jul 2017, Tom Lendacky wrote:
> This patch series provides support for AMD's new Secure Memory
> Encryption (SME) feature.
> 
> SME can be used to mark individual pages of memory as encrypted through the
> page tables. A page of memory that is marked encrypted will be automatically
> decrypted when read from DRAM and will be automatically encrypted when
> written to DRAM. Details on SME can be found in the links below.
> 
> The SME feature is identified through a CPUID function and enabled through
> the SYSCFG MSR. Once enabled, page table entries will determine how the
> memory is accessed. If a page table entry has the memory encryption mask set,
> then that memory will be accessed as encrypted memory. The memory encryption
> mask (as well as other related information) is determined from settings
> returned through the same CPUID function that identifies the presence of the
> feature.
> 
> The approach that this patch series takes is to encrypt everything possible
> starting early in the boot where the kernel is encrypted. Using the page
> table macros the encryption mask can be incorporated into all page table
> entries and page allocations. By updating the protection map, userspace
> allocations are also marked encrypted. Certain data must be accounted for
> as having been placed in memory before SME was enabled (EFI, initrd, etc.)
> and accessed accordingly.
> 
> This patch series is a pre-cursor to another AMD processor feature called
> Secure Encrypted Virtualization (SEV). The support for SEV will build upon
> the SME support and will be submitted later. Details on SEV can be found
> in the links below.

Well done series. Thanks to all people involved, especially Tom and Boris!
It was a pleasure to review that.

Reviewed-by: Thomas Gleixner 



Re: [PATCH v7 08/36] x86/mm: Add support to enable SME in early boot processing

2017-06-21 Thread Thomas Gleixner
On Wed, 21 Jun 2017, Tom Lendacky wrote:
> On 6/21/2017 10:38 AM, Thomas Gleixner wrote:
> > /*
> >  * Sanitize CPU configuration and retrieve the modifier
> >  * for the initial pgdir entry which will be programmed
> >  * into CR3. Depends on enabled SME encryption, normally 0.
> >  */
> > 	call	__startup_secondary_64
> > 
> > 	addq	$(init_top_pgt - __START_KERNEL_map), %rax
> > 
> > You can hide that stuff in C-code nicely without adding any cruft to the
> > ASM code.
> > 
> 
> Moving the call to verify_cpu into the C-code might be quite a bit of
> change.  Currently, the verify_cpu code is included code and not a
> global function.

Ah. Ok. I missed that.

> I can still do the __startup_secondary_64() function and then look to
> incorporate verify_cpu into both __startup_64() and
> __startup_secondary_64() as a post-patch to this series.

Yes, just having __startup_secondary_64() for now and there the extra bits
for that encryption stuff is fine.

> At least the secondary path will have a base C routine to which
> modifications can be made in the future if needed.  How does that sound?

Sounds like a plan.



Re: [PATCH v7 08/36] x86/mm: Add support to enable SME in early boot processing

2017-06-21 Thread Thomas Gleixner
On Wed, 21 Jun 2017, Tom Lendacky wrote:
> On 6/21/2017 2:16 AM, Thomas Gleixner wrote:
> > Why is this an unconditional function? Isn't the mask simply 0 when the MEM
> > ENCRYPT support is disabled?
> 
> I made it unconditional because of the call from head_64.S. I can't make
> use of the C level static inline function and since the mask is not a
> variable if CONFIG_AMD_MEM_ENCRYPT is not configured (#defined to 0) I
> can't reference the variable directly.
> 
> I could create a #define in head_64.S that changes this to load rax with
> the variable if CONFIG_AMD_MEM_ENCRYPT is configured or a zero if it's
> not or add a #ifdef at that point in the code directly. Thoughts on
> that?

See below.

> > That does not make any sense. Neither the call to sme_encrypt_kernel() nor
> > the following call to sme_get_me_mask().
> > 
> > __startup_64() is already C code, so why can't you simply call that from
> > __startup_64() in C and return the mask from there?
> 
> I was trying to keep it explicit as to what was happening, but I can
> move those calls into __startup_64().

That's much preferred. And the return value wants to be documented in both
C and ASM code.

> I'll still need the call to sme_get_me_mask() in the secondary_startup_64
> path, though (depending on your thoughts to the above response).

	call	verify_cpu

	movq	$(init_top_pgt - __START_KERNEL_map), %rax

So if you make that:

/*
 * Sanitize CPU configuration and retrieve the modifier
 * for the initial pgdir entry which will be programmed
 * into CR3. Depends on enabled SME encryption, normally 0.
 */
	call	__startup_secondary_64

	addq	$(init_top_pgt - __START_KERNEL_map), %rax

You can hide that stuff in C-code nicely without adding any cruft to the
ASM code.
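
A minimal C sketch of the suggested helper (the body is illustrative, not
the actual patch; sme_get_me_mask() is the accessor from the quoted series):

#include <asm/mem_encrypt.h>	/* sme_get_me_mask() */

/*
 * Return the SME encryption mask (0 when SME is inactive) so the ASM
 * code can simply add it to the initial pgdir entry for CR3.
 */
unsigned long __init __startup_secondary_64(void)
{
	return sme_get_me_mask();
}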

Thanks,

tglx



Re: [PATCH v7 07/36] x86/mm: Don't use phys_to_virt in ioremap() if SME is active

2017-06-21 Thread Thomas Gleixner
On Fri, 16 Jun 2017, Tom Lendacky wrote:
> Currently there is a check if the address being mapped is in the ISA
> range (is_ISA_range()), and if it is then phys_to_virt() is used to
> perform the mapping.  When SME is active, however, this will result
> in the mapping having the encryption bit set when it is expected that
> an ioremap() should not have the encryption bit set. So only use the
> phys_to_virt() function if SME is not active.
> 
> Reviewed-by: Borislav Petkov 
> Signed-off-by: Tom Lendacky 
> ---
>  arch/x86/mm/ioremap.c |7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index 4c1b5fd..a382ba9 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -106,9 +107,11 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
>   }
>  
>   /*
> -  * Don't remap the low PCI/ISA area, it's always mapped..
> +  * Don't remap the low PCI/ISA area, it's always mapped.
> +  *   But if SME is active, skip this so that the encryption bit
> +  *   doesn't get set.
>*/
> - if (is_ISA_range(phys_addr, last_addr))
> + if (is_ISA_range(phys_addr, last_addr) && !sme_active())
>   return (__force void __iomem *)phys_to_virt(phys_addr);

More thoughts about that.

Making this conditional on !sme_active() is not the best idea. I'd rather
remove that whole thing and make it unconditional so the code paths are
always exercised and any subtle wreckage is detected on a broader base and
not only on that hard to access and debug SME-capable machine owned by Joe
User.
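
As a diff, the suggested removal would look roughly like this (sketch, not
an actual follow-up patch):

--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ ... @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
-	/*
-	 * Don't remap the low PCI/ISA area, it's always mapped.
-	 */
-	if (is_ISA_range(phys_addr, last_addr))
-		return (__force void __iomem *)phys_to_virt(phys_addr);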

Thanks,

tglx



Re: [PATCH v7 10/36] x86/mm: Provide general kernel support for memory encryption

2017-06-21 Thread Thomas Gleixner
On Fri, 16 Jun 2017, Tom Lendacky wrote:
>  
> +#ifndef pgprot_encrypted
> +#define pgprot_encrypted(prot)   (prot)
> +#endif
> +
> +#ifndef pgprot_decrypted

That looks wrong. It's not decrypted, it's rather unencrypted, right?

Thanks,

tglx



Re: [PATCH v7 08/36] x86/mm: Add support to enable SME in early boot processing

2017-06-21 Thread Thomas Gleixner
On Fri, 16 Jun 2017, Tom Lendacky wrote:
> diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
> index a105796..988b336 100644
> --- a/arch/x86/include/asm/mem_encrypt.h
> +++ b/arch/x86/include/asm/mem_encrypt.h
> @@ -15,16 +15,24 @@
>  
>  #ifndef __ASSEMBLY__
>  
> +#include 
> +
>  #ifdef CONFIG_AMD_MEM_ENCRYPT
>  
>  extern unsigned long sme_me_mask;
>  
> +void __init sme_enable(void);
> +
>  #else	/* !CONFIG_AMD_MEM_ENCRYPT */
>  
>  #define sme_me_mask  0UL
>  
> +static inline void __init sme_enable(void) { }
> +
>  #endif   /* CONFIG_AMD_MEM_ENCRYPT */
>  
> +unsigned long sme_get_me_mask(void);

Why is this an unconditional function? Isn't the mask simply 0 when the MEM
ENCRYPT support is disabled?

> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 6225550..ef12729 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -78,7 +78,29 @@ startup_64:
>  	call	__startup_64
>  	popq	%rsi
>  
> -	movq	$(early_top_pgt - __START_KERNEL_map), %rax
> +	/*
> +	 * Encrypt the kernel if SME is active.
> +	 * The real_mode_data address is in %rsi and that register can be
> +	 * clobbered by the called function so be sure to save it.
> +	 */
> +	push	%rsi
> +	call	sme_encrypt_kernel
> +	pop	%rsi

That does not make any sense. Neither the call to sme_encrypt_kernel() nor
the following call to sme_get_me_mask().

__startup_64() is already C code, so why can't you simply call that from
__startup_64() in C and return the mask from there?
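
A hedged sketch of that suggestion (signature and body illustrative only;
the names follow the quoted patch):

unsigned long __init __startup_64(unsigned long physaddr)
{
	/* ... existing early page table fixups ... */

	sme_encrypt_kernel();		/* encrypt the kernel in place if SME is on */
	return sme_get_me_mask();	/* 0 when SME is inactive */
}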

> @@ -98,7 +120,20 @@ ENTRY(secondary_startup_64)
>   /* Sanitize CPU configuration */
>   call verify_cpu
>  
> -	movq	$(init_top_pgt - __START_KERNEL_map), %rax
> +	/*
> +	 * Get the SME encryption mask.
> +	 *  The encryption mask will be returned in %rax so we do an ADD
> +	 *  below to be sure that the encryption mask is part of the
> +	 *  value that will be stored in %cr3.
> +	 *
> +	 * The real_mode_data address is in %rsi and that register can be
> +	 * clobbered by the called function so be sure to save it.
> +	 */
> +	push	%rsi
> +	call	sme_get_me_mask
> +	pop	%rsi

Do we really need a call here? The mask is established at this point, so
it's either 0 when the encryption stuff is not compiled in or it can be
retrieved from a variable which is accessible at this point.

> +
> +	addq	$(init_top_pgt - __START_KERNEL_map), %rax
>  1:
>  
>   /* Enable PAE mode, PGE and LA57 */

Thanks,

tglx



Re: [PATCH v7 07/36] x86/mm: Don't use phys_to_virt in ioremap() if SME is active

2017-06-20 Thread Thomas Gleixner
On Fri, 16 Jun 2017, Tom Lendacky wrote:

> Currently there is a check if the address being mapped is in the ISA
> range (is_ISA_range()), and if it is then phys_to_virt() is used to
> perform the mapping.  When SME is active, however, this will result
> in the mapping having the encryption bit set when it is expected that
> an ioremap() should not have the encryption bit set. So only use the
> phys_to_virt() function if SME is not active.

This does not make sense to me. What the heck has phys_to_virt() to do with
the encryption bit. Especially why would the encryption bit be set on that
mapping in the first place?

I'm probably missing something, but this wants some coherent explanation
understandable by mere mortals both in the changelog and the code comment.

Thanks,

tglx



Re: [PATCH v7 06/36] x86/mm: Add Secure Memory Encryption (SME) support

2017-06-20 Thread Thomas Gleixner
On Fri, 16 Jun 2017, Tom Lendacky wrote:
>  
> +config ARCH_HAS_MEM_ENCRYPT
> + def_bool y
> + depends on X86

That one is silly. The config switch is in the x86 KConfig file, so X86 is
on. If you intended to move this to some generic place outside of
x86/Kconfig then this should be

config ARCH_HAS_MEM_ENCRYPT
bool

and x86/Kconfig should have

select ARCH_HAS_MEM_ENCRYPT

and that should be selected by AMD_MEM_ENCRYPT

> +config AMD_MEM_ENCRYPT
> + bool "AMD Secure Memory Encryption (SME) support"
> + depends on X86_64 && CPU_SUP_AMD
> + ---help---
> +   Say yes to enable support for the encryption of system memory.
> +   This requires an AMD processor that supports Secure Memory
> +   Encryption (SME).

Thanks,

tglx



Re: [PATCH] arch/x86: Fix kdump on x86 with physically hotadded CPUs

2016-10-07 Thread Thomas Gleixner
On Fri, 7 Oct 2016, Jiri Olsa wrote:
> On Thu, Oct 06, 2016 at 11:25:43AM -0400, Prarit Bhargava wrote:
> > I thought about doing this but it seems like every time some driver uses
> > topology_logical_package_id() the driver would have to replicate the error
> > checking code.
> 
> hm, unless we guarantee topology_logical_package_id always
> returns sane values I don't see another way

I did some deeper investigation. So with Prarit's patch there is only one
possibility left to end up with an empty package id:

SMP Kernel on UP Machine, which has neither ACPI nor MADT. But those
machines are not a problem because they don't use any drivers which would
make use of it. Aside of that, for UP we can just set the node of the cpu to
0 and be done with it. I'll pick up Prarit's patch and amend it.

Thanks,

tglx



Re: [PATCH] arch/x86: Fix kdump on x86 with physically hotadded CPUs

2016-10-04 Thread Thomas Gleixner
On Tue, 4 Oct 2016, Prarit Bhargava wrote:
> On 10/04/2016 06:58 AM, Thomas Gleixner wrote:
> > While it is the right thing to initialize the package map in that case, it
> > still papers over a robustness issue in the uncore code, which needs to be
> > fixed first.
> 
> I will include a separate patch with an error check for pkg == 0x in the
> uncore code.

0x? That won't help. The id returned is -1 if the entry is not
initialized. And aside of that just patching that particular place is not
helping as the uncore code and also rapl is relying on the package map
being populated.

So we need a sanity check in the initialization code which prevents any of
this being executed.
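
Such a check could look like this (hypothetical helper name, sketch only):

#include <linux/errno.h>
#include <linux/smp.h>		/* smp_processor_id() */
#include <linux/topology.h>	/* topology_logical_package_id() */

static int __init uncore_pkg_sanity_check(void)
{
	int pkg = topology_logical_package_id(smp_processor_id());

	/*
	 * Refuse to set up the uncore PMUs when the package map was
	 * never populated, instead of dereferencing pmu->boxes[] with
	 * a garbage index later.
	 */
	if (pkg < 0 || pkg >= topology_max_packages())
		return -EINVAL;
	return 0;
}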
 
> >> +	if (!num_processors) {
> >> +		pr_warn("CPU 0 not enumerated in mptable or ACPI MADT\n");
> >> +		num_processors = 1;
> > 
> > And in this case we end up with the same problem, right?
> 
> It occurs to me that I overthought this: I was thinking that there might
> exist a pre-ACPI (or at least a system without an MADT) x86 system that
> would boot such that num_processors = 0.  But in that case, the cpu should
> be listed in the mptables so the above should not happen.  I'll change
> that to a BUG().

No. That's the wrong thing to do. Think SMP kernel on UP machines ...

Thanks,

tglx





Re: [PATCH] arch/x86: Fix kdump on x86 with physically hotadded CPUs

2016-10-04 Thread Thomas Gleixner
On Tue, 4 Oct 2016, Jiri Olsa wrote:
> On Tue, Oct 04, 2016 at 12:58:04PM +0200, Thomas Gleixner wrote:
> > > pmu->boxes[pkg] is garbage because pkg was returned as 0x.
> > 
> > And that's what needs to be fixed in the first place.
> 
> right, I'll check on that.. but I think we need this fix as well

That's right.
 
> I was wondering if acpi_boot_init was a better place for that, but then
> Prarit suggested in our discussion that the prefill_possible_map() call
> seems to be a hotplug cleanup.. so it seemed to fit

It's the right place for it.

Thanks,

tglx



Re: [PATCH] arch/x86: Fix kdump on x86 with physically hotadded CPUs

2016-10-04 Thread Thomas Gleixner
On Mon, 3 Oct 2016, Prarit Bhargava wrote:
> BUG: unable to handle kernel paging request at 00841f1f
> IP: [] uncore_change_context+0xd4/0x180
...
>  [] ? uncore_cpu_starting+0x130/0x130
>  [] uncore_event_cpu_online+0x6c/0x80
>  [] cpuhp_invoke_callback+0x49/0x100
>  [] cpuhp_thread_fun+0x41/0x100
>  [] smpboot_thread_fn+0x10f/0x160
>  [] ? sort_range+0x30/0x30
>  [] kthread+0xd8/0xf0
>  [] ret_from_fork+0x1f/0x40
>  [] ? kthread_park+0x60/0x60

> arch/x86/events/intel/uncore.c:
> 1137 static void uncore_change_type_ctx(struct intel_uncore_type *type, int old_cpu,
> 1138                                    int new_cpu)
> 1139 {
> 1140         struct intel_uncore_pmu *pmu = type->pmus;
> 1141         struct intel_uncore_box *box;
> 1142         int i, pkg;
> 1143
> 1144         pkg = topology_logical_package_id(old_cpu < 0 ? new_cpu : old_cpu);
> 1145         for (i = 0; i < type->num_boxes; i++, pmu++) {
> 1146                 box = pmu->boxes[pkg];
> 
> pmu->boxes[pkg] is garbage because pkg was returned as 0x.

And that's what needs to be fixed in the first place.

> This patch adds the missing generic_processor_info() to
> prefill_possible_map() to ensure the initialization of the boot cpu is
> correct. 

> This results in smp_init_package_map() having correct data and
> properly setting the package map for the hotplugged boot cpu, which in
> turn resolves the kdump kernel panic on physically hotplugged cpus.

While it is the right thing to initialize the package map in that case, it
still papers over a robustness issue in the uncore code, which needs to be
fixed first.

> [2] prefill_possible_map() is called before smp_store_boot_cpu_info().
> The comment beside the call to smp_store_boot_cpu_info() states that the
> completed call results in "Final full version of the data".

I'm not sure what that [2] here means and I cannot figure out the meaning
of this sentence either.

This changelog is incomprehensible in general and more a "oh look how I
decoded this problem" report than something which clearly describes the
problem at hand, the root cause and the fix. The latter wants an
understandable explanation why prefill_possible_map() is the right place to
do this.

> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 4296beb8fdd3..d1272febc13b 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1406,9 +1406,18 @@ __init void prefill_possible_map(void)
>  {
>   int i, possible;
>  
> -	/* no processor from mptable or madt */
> -	if (!num_processors)
> -		num_processors = 1;
> +	/* No boot processor was found in mptable or ACPI MADT */
> +	if (!num_processors) {
> +		/* Make sure boot cpu is enumerated */
> +		if (apic->cpu_present_to_apicid(0) == BAD_APICID &&
> +		    apic->apic_id_valid(boot_cpu_physical_apicid))
> +			generic_processor_info(boot_cpu_physical_apicid,
> +				apic_version[boot_cpu_physical_apicid]);
> +		if (!num_processors) {
> +			pr_warn("CPU 0 not enumerated in mptable or ACPI MADT\n");
> +			num_processors = 1;

And in this case we end up with the same problem, right?

Thanks,

tglx



Re: [PATCH 0/3] Enable legacy irq mode before jump to kexec/kdump kernel

2016-07-19 Thread Thomas Gleixner
On Wed, 20 Jul 2016, b...@redhat.com wrote:
> On 07/20/16 at 03:54am, Wei, Jiangang wrote:
>
> >  In fact, Eric and Ingo suggested that "it should be fixed in the bootup
> > path of the dump kernel, not the crash kernel reboot path", which is
> > convincing and reasonable.
> 
> Well, this patch doesn't do anything differently from Eric's original
> implementation in the kexec/kdump path.
> By taking clear_IO_APIC out of disable_IO_APIC, the remaining code of
> disable_IO_APIC will only do the virtual wire setting. So for the
> kexec/kdump path, the code is basically the same as Eric's method. But for
> poweroff/halt/reboot, it's enough to call clear_IO_APIC to disable the
> IO-APIC.

You're completely ignoring what Jiangang said: 

 "it should be fixed in the bootup path of the dump kernel, not the crash
  kernel reboot path"

and that's the right way to do it. End of story.

Thanks,

tglx



RE: [V4 PATCH 4/4] x86/apic: Introduce noextnmi boot option

2015-10-14 Thread Thomas Gleixner
On Wed, 14 Oct 2015, 河合英宏 / KAWAI,HIDEHIRO wrote:
> > On Fri, 25 Sep 2015, Hidehiro Kawai wrote:
> > 
> > > This patch introduces new boot option "noextnmi" which disables
> > > external NMI.  This option is useful for the dump capture kernel
> > > so that an HA application or administrator wouldn't mistakenly
> > > shoot down the kernel by NMI.
> > >
> > > Currently, only x86 supports this option.
> > 
> > You might add that it can be used for debugging purposes as
> > well. External NMIs can be their own source of trouble. :)
> 
> Thanks for your comments!  I'll do that.
> 
> By the way, I have a pending patch which expands this option like
> this:
> 
>   apic_extnmi={ bsp | all | none }
> 
> If apic_extnmi=all is specified, external NMIs are broadcast to
> all CPUs.  This raises the success rate of kernel panic in the case
> where an external NMI to CPU 0 is swallowed by other NMI handlers or
> blocked due to hang-up in NMI context.  The patch works without any
> problems, but I'm going to drop the feature if it will cause long
> discussion.  I'd like to settle this patch set down once.  At least,
> I'm going to change this option to apic_extnmi={bsp|none} style for
> the future expansion.
> 
> How do you think about this?

Do it right away with all three variants. They make a lot of sense to
me.
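
For reference, a three-way early_param parser could look like this sketch
(the APIC_EXTNMI_* values and the default are assumptions; the merged
version may differ):

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/string.h>

#define APIC_EXTNMI_BSP		0	/* hypothetical values */
#define APIC_EXTNMI_ALL		1
#define APIC_EXTNMI_NONE	2

static int apic_extnmi = APIC_EXTNMI_BSP;

static int __init apic_set_extnmi(char *arg)
{
	if (!arg)
		return -EINVAL;

	if (!strncmp("all", arg, 3))
		apic_extnmi = APIC_EXTNMI_ALL;
	else if (!strncmp("none", arg, 4))
		apic_extnmi = APIC_EXTNMI_NONE;
	else if (!strncmp("bsp", arg, 3))
		apic_extnmi = APIC_EXTNMI_BSP;
	else
		return -EINVAL;

	return 0;
}
early_param("apic_extnmi", apic_set_extnmi);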
 
Thanks,

tglx


Re: [V4 PATCH 4/4] x86/apic: Introduce noextnmi boot option

2015-10-13 Thread Thomas Gleixner
Borislav,

On Mon, 5 Oct 2015, Borislav Petkov wrote:
> On Mon, Oct 05, 2015 at 02:03:58AM +, 河合英宏 / KAWAI,HIDEHIRO wrote:
> > That's different from my point of view.  I'm not going to pass
> > some data from the first kernel to the second kernel. I'm just going to
> > provide a configurable option for the second kernel to users.
> 
> Dude, WTF?! You're adding a kernel command line which is supposed to
> be used *only* by the kdump kernel. But nooo, it is there in the open
> and visible to people. And anyone can type it in during boot. AND THAT
> SHOULDN'T BE POSSIBLE IN THE FIRST PLACE!
> 
> This information is strictly for the kdump kernel - it shouldn't be a
> generic command line option. How hard it is to understand that simple
> fact?!

Calm down!

Disabling that NMI in the first kernel is not going to make the world
explode. We have enough command line options a user can type in, which
are way worse than that one. If some "expert" types nonsense on the
first kernel command line, then it's none of our problems, really.

If Kawai would have marked that option as a debug feature, this
discussion would have probably never happened.

Aside of that, the best way to hand in options for the kdump kernel is
the command line. We have an existing interface for that.

Let's move on. Nothing to see here.

Thanks,

tglx



Re: [V4 PATCH 4/4] x86/apic: Introduce noextnmi boot option

2015-10-13 Thread Thomas Gleixner
On Fri, 25 Sep 2015, Hidehiro Kawai wrote:

> This patch introduces new boot option "noextnmi" which disables
> external NMI.  This option is useful for the dump capture kernel
> so that an HA application or administrator wouldn't mistakenly
> shoot down the kernel by NMI.
> 
> Currently, only x86 supports this option.

You might add that it can be used for debugging purposes as
well. External NMIs can be their own source of trouble. :)
 
> Signed-off-by: Hidehiro Kawai 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: Jonathan Corbet 
> ---
>  Documentation/kernel-parameters.txt |4 
>  arch/x86/kernel/apic/apic.c |   17 -
>  2 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 22a4b68..8bcaccd 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2379,6 +2379,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>   noexec=on: enable non-executable mappings (default)
>   noexec=off: disable non-executable mappings
>  
> + noextnmi[X86]
> + Mask external NMI.  This option is useful for a
> + dump capture kernel to be shot down by NMI.

That should read: "...not to be shot down", right?

Other than that.

Acked-by: Thomas Gleixner 



Re: [PATCH] Revert x86: add lapic_shutdown for x86_64

2007-10-29 Thread Thomas Gleixner
On Mon, 29 Oct 2007, Hiroshi Shimamoto wrote:

> Arjan van de Ven wrote:
> > On Mon, 29 Oct 2007 15:39:46 -0700
> > Hiroshi Shimamoto <[EMAIL PROTECTED]> wrote:
> > 
> >> lapic_shutdown is useless on x86_64.
> >>
> > 
> >  but since the goal is to get apic_32.c and apic_64.c to be more
> > converging (to the point of becoming the same file)... isn't your patch
> > going in the opposite direction?
> > 
> Hmm, I'm not sure that this revert affects x86 unification.
> Vivek said that probably we don't have to introduce lapic_shutdown() for
> 64bit. So I submitted this patch which reverts my previous post; it was
> applied before the comment.

Don't worry, we sort this out. There is some inevitable friction loss
for now, but we are getting closer. I gave the apic code some care to
get a readable diff out of it. Result is in:

git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git cleanup

Thanks,

tglx



Re: [PATCH 0/3] x86: unify crash_32/64.c

2007-10-26 Thread Thomas Gleixner
On Fri, 26 Oct 2007, Hiroshi Shimamoto wrote:

Added Venki to CC

> > I'm now testing crash on 32bit, but there is an issue before
> > applying the patches. My machine stopped at checking 'hlt'
> > after kexec, showing the message below.
> > 
> > CPU: Intel(R) Xeon(TM) CPU 3.80GHz stepping 0a
> > Checking 'hlt' instruction...
> > 
> v2.6.23.1 works fine for the 1st kernel.
> > I'm investigating it..
> 
> I found that the following patch makes my machine stop.
> bfe0c1cc6456bba1f4e3cc1fe29c0ea578ac763a
> x86: HPET force enable for ICH5
> 
> It means that after applying this patch, HPET is enabled
> automatically on the 1st kernel, and after crash/kexec the 2nd
> kernel stops at checking 'hlt'.
> 
> I also tested the latest kernel (2.6.24-rc1-gec3b67c1).
> Boot parameter "nohpet" resolves this issue and kdump
> works well on 32bit.
> So I guess HPET affects this.
> But I don't know why 64bit kernel with HPET is OK.

Hmm. Does the 64 bit code shut down the HPET and restore the IRQ routing to
the PIT, while 32 bit is missing this?
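
For reference, a sketch of what such a shutdown path could do on the way
into kexec (the wrapper name is made up; hpet_disable() is real):

#include <asm/hpet.h>	/* hpet_disable() */

static void timer_shutdown_for_kexec(void)
{
	/*
	 * Turn the HPET legacy replacement route off so that IRQ0 is
	 * handled by the PIT again before the new kernel starts.
	 */
	hpet_disable();
}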

tglx



Re: [PATCH 0/3] x86: unify crash_32/64.c

2007-10-20 Thread Thomas Gleixner
On Fri, 19 Oct 2007, Hiroshi Shimamoto wrote:

> Hi,
> 
> I made patches to unify crash_32/64.c.
> There are three patches:
> 1. add lapic_shutdown for x86_64
> 2. add safe_smp_processor_id for x86_64
> 3. unify crash_32/64.c
> 
> I'm not sure that it's good to split it into these patches.

It's fine. So the steps leading to unification are clear.

Applied. Thanks,

 tglx



