[PATCH v1] KVM: PPC: Book3S HV: Prevent POWER7/8 TLB flush flushing SLB

2021-11-18 Thread Nicholas Piggin
The POWER9 ERAT flush instruction is a SLBIA with IH=7, which is a
reserved value on POWER7/8. On POWER8 this invalidates the SLB entries
above index 0, similarly to SLBIA IH=0.

If the SLB entries are invalidated, and then the guest is bypassed, the
host SLB does not get re-loaded, so the bolted entries above 0 will be
lost. This can result in kernel stack access causing a SLB fault.

Kernel stack access causing a SLB fault was responsible for the infamous
mega bug (search "Fix SLB reload bug"). Although since commit
48e7b7695745 ("powerpc/64s/hash: Convert SLB miss handlers to C"), which
started using the kernel stack in the SLB miss handler, it might now only
result in an infinite loop of SLB faults. In any case it's a bug.

Fix this by only executing the instruction on >= POWER9 where IH=7 is
defined not to invalidate the SLB. POWER7/8 don't require this ERAT
flush.

Fixes: 5008711259201 ("KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv_builtin.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index fcf4760a3a0e..70b7a8f97153 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -695,6 +695,7 @@ static void flush_guest_tlb(struct kvm *kvm)
   "r" (0) : "memory");
}
asm volatile("ptesync": : :"memory");
+   // POWER9 congruence-class TLBIEL leaves ERAT. Flush it now.
asm volatile(PPC_RADIX_INVALIDATE_ERAT_GUEST : : :"memory");
} else {
for (set = 0; set < kvm->arch.tlb_sets; ++set) {
@@ -705,7 +706,9 @@ static void flush_guest_tlb(struct kvm *kvm)
rb += PPC_BIT(51);  /* increment set number */
}
asm volatile("ptesync": : :"memory");
-   asm volatile(PPC_ISA_3_0_INVALIDATE_ERAT : : :"memory");
+   // POWER9 congruence-class TLBIEL leaves ERAT. Flush it now.
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   asm volatile(PPC_ISA_3_0_INVALIDATE_ERAT : : :"memory");
}
 }
 
-- 
2.23.0



Re: [PATCH 0/3] KEXEC_SIG with appended signature

2021-11-18 Thread Nayna



On 11/16/21 04:53, Michal Suchánek wrote:

On Mon, Nov 15, 2021 at 06:53:53PM -0500, Nayna wrote:

On 11/12/21 03:30, Michal Suchánek wrote:

Hello,

On Thu, Nov 11, 2021 at 05:26:41PM -0500, Nayna wrote:

On 11/8/21 07:05, Michal Suchánek wrote:

Hello,

The other part is that distributions apply 'lockdown' patches that change
the security policy depending on secure boot status which were rejected
by upstream which only hook into the _SIG options, and not into the IMA_
options. Of course, I expect this to change when the IMA options are
universally available across architectures and the support picked up by
distributions.

Which brings the third point: IMA features vary across architectures,
and KEXEC_SIG is more common than IMA_KEXEC.

config/arm64/default:CONFIG_HAVE_IMA_KEXEC=y
config/ppc64le/default:CONFIG_HAVE_IMA_KEXEC=y

config/arm64/default:CONFIG_KEXEC_SIG=y
config/s390x/default:CONFIG_KEXEC_SIG=y
config/x86_64/default:CONFIG_KEXEC_SIG=y

KEXEC_SIG makes it much easier to get uniform features across
architectures.

Architectures use KEXEC_SIG vs IMA_KEXEC based on their requirement.
IMA_KEXEC is for the kernel images signed using sign-file (appended
signatures, not PECOFF), provides measurement along with verification, and

That's certainly not the case. S390 uses appended signatures with
KEXEC_SIG, arm64 uses PECOFF with both KEXEC_SIG and IMA_KEXEC.

Yes, S390 uses appended signature, but they also do not support
measurements.

On the other hand for arm64/x86, PECOFF works only with KEXEC_SIG. Look at
the KEXEC_IMAGE_VERIFY_SIG config dependencies in arch/arm64/Kconfig and
KEXEC_BZIMAGE_VERIFY_SIG config dependencies in arch/x86/Kconfig. Now, if
KEXEC_SIG is not enabled, then IMA appraisal policies are enforced if secure
boot is enabled; refer to security/integrity/ima_efi.c. IMA would fail
verification if the kernel is not signed with a module-sig appended signature
or if signature verification fails.

In short, IMA is used to enforce the existence of a policy if secure boot is
enabled. If they don't support module sig appended signatures, by definition
it fails. Thus PECOFF doesn't work with both KEXEC_SIG and IMA_KEXEC, but
only with KEXEC_SIG.

Then IMA_KEXEC is a no-go. It is not supported on all architectures and
in principle it cannot be supported because it does not support PECOFF,
which is needed to boot the kernel on EFI platforms. To get feature
parity across architectures KEXEC_SIG is required.


I would not say "a no-go", it is based on user requirements.

The key takeaway from this discussion is that KEXEC_SIG and IMA_KEXEC
provide similar functionality with some small degree of overlap, and
that documenting the differences is needed. This will help kernel
consumers understand the differences and enable the appropriate
functionality for their environment.


As per my understanding:

KEXEC_SIG:
* Supports kernel image verification
* Linked with secureboot state using downstream patch
* Supports PECOFF and module sig appended signature format
* Supports blocklisting of keys

IMA_KEXEC:
* Supports kernel image verification
* Linked with secureboot state in upstream
* Supports module sig appended signature format and signatures in
extended attributes
* Supports blocklisting of keys
* Supports blocklisting single kernel binary
* Supports measurements for attestation
* Supports audit log

Users can enable the option based on their requirements.

Thanks for the good discussion and enabling KEXEC_SIG for POWER as well. 
It would be good to have updated kernel documentation to go along with 
KEXEC_SIG support in the patchset.


Thanks & Regards,
    - Nayna



Re: [PATCH v4 00/25] Unify PCI error response checking

2021-11-18 Thread Bjorn Helgaas
On Thu, Nov 18, 2021 at 07:33:10PM +0530, Naveen Naidu wrote:
> An MMIO read from a PCI device that doesn't exist or doesn't respond
> causes a PCI error.  There's no real data to return to satisfy the 
> CPU read, so most hardware fabricates ~0 data.
> 
> This patch series adds a PCI_ERROR_RESPONSE definition and the helper
> definitions PCI_SET_ERROR_RESPONSE and PCI_POSSIBLE_ERROR, and uses them
> where appropriate to make these checks consistent and easier to find.
> 
> This helps unify PCI error response checking and makes the error checks
> consistent and easier to find.
> 
> This series also ensures that the error response fabrication now happens
> in the PCI_OP_READ and PCI_USER_READ_CONFIG. This removes the
> responsibility from controller drivers to do the error response setting. 

Applied to pci/error for v5.17.  Thanks, this is really nice work.
Somehow small changes like these add up to something much greater than
one would expect.

This touches many native controller drivers but in trivial ways, so I
plan to merge this branch after the usual native controller stuff from
Lorenzo.

I tweaked the commit logs to clarify that this series is all about
*config* reads, not MMIO reads.  MMIO reads have similar issues, and
we can use PCI_ERROR_RESPONSE, etc., there, too, but that's not what
this series does.

Bjorn


[PATCH] powerpc/signal32: Use struct_group() to zero spe regs

2021-11-18 Thread Kees Cook
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.

Add a struct_group() for the spe registers so that memset() can correctly reason
about the size:

   In function 'fortify_memset_chk',
   inlined from 'restore_user_regs.part.0' at arch/powerpc/kernel/signal_32.c:539:3:
   >> include/linux/fortify-string.h:195:4: error: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning]
 195 |    __write_overflow_field();
     |    ^~~~~~~~~~~~~~~~~~~~~~

Reported-by: kernel test robot 
Signed-off-by: Kees Cook 
---
 arch/powerpc/include/asm/processor.h |  6 --
 arch/powerpc/kernel/signal_32.c  | 14 +-
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index e39bd0ff69f3..978a80308466 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -191,8 +191,10 @@ struct thread_struct {
int used_vsr;   /* set if process has used VSX */
 #endif /* CONFIG_VSX */
 #ifdef CONFIG_SPE
-   unsigned long   evr[32];/* upper 32-bits of SPE regs */
-   u64 acc;/* Accumulator */
+   struct_group(spe,
+   unsigned long   evr[32];/* upper 32-bits of SPE regs */
+   u64 acc;/* Accumulator */
+   );
unsigned long   spefscr;/* SPE & eFP status */
unsigned long   spefscr_last;   /* SPEFSCR value on last prctl
   call or trap return */
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 00a9c9cd6d42..5e1664b501e4 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -527,16 +527,20 @@ static long restore_user_regs(struct pt_regs *regs,
regs_set_return_msr(regs, regs->msr & ~(MSR_FP | MSR_FE0 | MSR_FE1));
 
 #ifdef CONFIG_SPE
-   /* force the process to reload the spe registers from
-  current->thread when it next does spe instructions */
+   /*
+* Force the process to reload the spe registers from
+* current->thread when it next does spe instructions.
+* Since this is user ABI, we must enforce the sizing.
+*/
+   BUILD_BUG_ON(sizeof(current->thread.spe) != ELF_NEVRREG * sizeof(u32));
regs_set_return_msr(regs, regs->msr & ~MSR_SPE);
if (msr & MSR_SPE) {
/* restore spe registers from the stack */
-   unsafe_copy_from_user(current->thread.evr, &sr->mc_vregs,
- ELF_NEVRREG * sizeof(u32), failed);
+   unsafe_copy_from_user(&current->thread.spe, &sr->mc_vregs,
+ sizeof(current->thread.spe), failed);
current->thread.used_spe = true;
} else if (current->thread.used_spe)
-   memset(current->thread.evr, 0, ELF_NEVRREG * sizeof(u32));
+   memset(&current->thread.spe, 0, sizeof(current->thread.spe));
 
/* Always get SPEFSCR back */
unsafe_get_user(current->thread.spefscr, (u32 __user *)&sr->mc_vregs +
ELF_NEVRREG, failed);
-- 
2.30.2



Re: [PATCH v3 08/12] KVM: Propagate vcpu explicitly to mark_page_dirty_in_slot()

2021-11-18 Thread David Woodhouse



On 18 November 2021 18:50:55 GMT, Sean Christopherson  wrote:
>On Thu, Nov 18, 2021, Sean Christopherson wrote:
>> On Thu, Nov 18, 2021, David Woodhouse wrote:
>> > That leaves the one in TDP MMU handle_changed_spte_dirty_log() which
>> > AFAICT can trigger the same crash seen by butt3rflyh4ck — can't that
>> > happen from a thread where kvm_get_running_vcpu() is NULL too? For that
>> > one I'm not sure.
>> 
>> I think it could be triggered in the TDP MMU via kvm_mmu_notifier_release()
>> -> kvm_mmu_zap_all(), e.g. if the userspace VMM exits while dirty logging is
>> enabled.  That should be easy to (dis)prove via a selftest.
>
>Scratch that, the dirty log update is guarded by the new_spte being present, so
>zapping of any kind won't trigger it.
>
>Currently, I believe the only path that would create a present SPTE without an
>active vCPU is mmu_notifier.change_pte, but that squeaks by because it's required
>to be wrapped with invalidate_range_{start,end}(MMU_NOTIFY_CLEAR), and KVM zaps
>in that situation.

Is it sufficient to have *an* active vCPU?  What if a VMM has threads for 
active vCPUs but is doing mmap/munmap on a *different* thread? Does that not 
suffer the same crash?


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


[PATCH 3/3] of/fdt: Rework early_init_dt_scan_memory() to call directly

2021-11-18 Thread Rob Herring
Use of the of_scan_flat_dt() function predates libfdt and is discouraged
as libfdt provides a nicer set of APIs. Rework
early_init_dt_scan_memory() to be called directly and use libfdt.

Cc: John Crispin 
Cc: Thomas Bogendoerfer 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Frank Rowand 
Cc: linux-m...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Rob Herring 
---
 arch/mips/ralink/of.c  | 16 ++---
 arch/powerpc/kernel/prom.c | 16 -
 drivers/of/fdt.c   | 68 --
 include/linux/of_fdt.h |  3 +-
 4 files changed, 47 insertions(+), 56 deletions(-)

diff --git a/arch/mips/ralink/of.c b/arch/mips/ralink/of.c
index 0135376c5de5..e1d79523343a 100644
--- a/arch/mips/ralink/of.c
+++ b/arch/mips/ralink/of.c
@@ -53,17 +53,6 @@ void __init device_tree_init(void)
unflatten_and_copy_device_tree();
 }
 
-static int memory_dtb;
-
-static int __init early_init_dt_find_memory(unsigned long node,
-   const char *uname, int depth, void *data)
-{
-   if (depth == 1 && !strcmp(uname, "memory@0"))
-   memory_dtb = 1;
-
-   return 0;
-}
-
 void __init plat_mem_setup(void)
 {
void *dtb;
@@ -77,9 +66,8 @@ void __init plat_mem_setup(void)
dtb = get_fdt();
__dt_setup_arch(dtb);
 
-   of_scan_flat_dt(early_init_dt_find_memory, NULL);
-   if (memory_dtb)
-   of_scan_flat_dt(early_init_dt_scan_memory, NULL);
+   if (!early_init_dt_scan_memory())
+   return;
else if (soc_info.mem_detect)
soc_info.mem_detect();
else if (soc_info.mem_size)
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 6e1a106f02eb..63762a3b75e8 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -532,19 +532,19 @@ static int  __init early_init_drmem_lmb(struct drmem_lmb 
*lmb,
 }
 #endif /* CONFIG_PPC_PSERIES */
 
-static int __init early_init_dt_scan_memory_ppc(unsigned long node,
-   const char *uname,
-   int depth, void *data)
+static int __init early_init_dt_scan_memory_ppc(void)
 {
 #ifdef CONFIG_PPC_PSERIES
-   if (depth == 1 &&
-   strcmp(uname, "ibm,dynamic-reconfiguration-memory") == 0) {
+   const void *fdt = initial_boot_params;
+   int node = fdt_path_offset(fdt, "/ibm,dynamic-reconfiguration-memory");
+
+   if (node > 0) {
walk_drmem_lmbs_early(node, NULL, early_init_drmem_lmb);
return 0;
}
 #endif

-   return early_init_dt_scan_memory(node, uname, depth, data);
+   return early_init_dt_scan_memory();
 }
 
 /*
@@ -749,7 +749,7 @@ void __init early_init_devtree(void *params)
 
/* Scan memory nodes and rebuild MEMBLOCKs */
early_init_dt_scan_root();
-   of_scan_flat_dt(early_init_dt_scan_memory_ppc, NULL);
+   early_init_dt_scan_memory_ppc();
 
parse_early_param();
 
@@ -858,7 +858,7 @@ void __init early_get_first_memblock_info(void *params, 
phys_addr_t *size)
 */
add_mem_to_memblock = 0;
early_init_dt_scan_root();
-   of_scan_flat_dt(early_init_dt_scan_memory_ppc, NULL);
+   early_init_dt_scan_memory_ppc();
add_mem_to_memblock = 1;
 
if (size)
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index 5e216555fe4f..a799117886f4 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -1078,49 +1078,53 @@ u64 __init dt_mem_next_cell(int s, const __be32 **cellp)
 /*
  * early_init_dt_scan_memory - Look for and parse memory nodes
  */
-int __init early_init_dt_scan_memory(unsigned long node, const char *uname,
-int depth, void *data)
+int __init early_init_dt_scan_memory(void)
 {
-   const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
-   const __be32 *reg, *endp;
-   int l;
-   bool hotpluggable;
+   int node;
+   const void *fdt = initial_boot_params;
 
-   /* We are scanning "memory" nodes only */
-   if (type == NULL || strcmp(type, "memory") != 0)
-   return 0;
+   fdt_for_each_subnode(node, fdt, 0) {
+   const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
+   const __be32 *reg, *endp;
+   int l;
+   bool hotpluggable;
 
-   reg = of_get_flat_dt_prop(node, "linux,usable-memory", &l);
-   if (reg == NULL)
-   reg = of_get_flat_dt_prop(node, "reg", &l);
-   if (reg == NULL)
-   return 0;
+   /* We are scanning "memory" nodes only */
+   if (type == NULL || strcmp(type, "memory") != 0)
+   continue;
 
-   endp = reg + (l / sizeof(__be32));
-   hotpluggable = of_get_flat_dt_prop(node, "hotpluggable", NULL);
+   reg = of_get_flat_dt_prop(node, 

[PATCH 2/3] of/fdt: Rework early_init_dt_scan_root() to call directly

2021-11-18 Thread Rob Herring
Use of the of_scan_flat_dt() function predates libfdt and is discouraged
as libfdt provides a nicer set of APIs. Rework early_init_dt_scan_root()
to be called directly and use libfdt.

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Frank Rowand 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Rob Herring 
---
 arch/powerpc/kernel/prom.c |  4 ++--
 drivers/of/fdt.c   | 14 +++---
 include/linux/of_fdt.h |  3 +--
 3 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index c6c398ccd98a..6e1a106f02eb 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -748,7 +748,7 @@ void __init early_init_devtree(void *params)
of_scan_flat_dt(early_init_dt_scan_chosen_ppc, boot_command_line);
 
/* Scan memory nodes and rebuild MEMBLOCKs */
-   of_scan_flat_dt(early_init_dt_scan_root, NULL);
+   early_init_dt_scan_root();
of_scan_flat_dt(early_init_dt_scan_memory_ppc, NULL);
 
parse_early_param();
@@ -857,7 +857,7 @@ void __init early_get_first_memblock_info(void *params, 
phys_addr_t *size)
 * mess the memblock.
 */
add_mem_to_memblock = 0;
-   of_scan_flat_dt(early_init_dt_scan_root, NULL);
+   early_init_dt_scan_root();
of_scan_flat_dt(early_init_dt_scan_memory_ppc, NULL);
add_mem_to_memblock = 1;
 
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index 1f1705f76263..5e216555fe4f 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -1042,13 +1042,14 @@ int __init early_init_dt_scan_chosen_stdout(void)
 /*
  * early_init_dt_scan_root - fetch the top level address and size cells
  */
-int __init early_init_dt_scan_root(unsigned long node, const char *uname,
-  int depth, void *data)
+int __init early_init_dt_scan_root(void)
 {
const __be32 *prop;
+   const void *fdt = initial_boot_params;
+   int node = fdt_path_offset(fdt, "/");
 
-   if (depth != 0)
-   return 0;
+   if (node < 0)
+   return -ENODEV;
 
dt_root_size_cells = OF_ROOT_NODE_SIZE_CELLS_DEFAULT;
dt_root_addr_cells = OF_ROOT_NODE_ADDR_CELLS_DEFAULT;
@@ -1063,8 +1064,7 @@ int __init early_init_dt_scan_root(unsigned long node, 
const char *uname,
dt_root_addr_cells = be32_to_cpup(prop);
pr_debug("dt_root_addr_cells = %x\n", dt_root_addr_cells);
 
-   /* break now */
-   return 1;
+   return 0;
 }
 
 u64 __init dt_mem_next_cell(int s, const __be32 **cellp)
@@ -1263,7 +1263,7 @@ void __init early_init_dt_scan_nodes(void)
int rc;
 
/* Initialize {size,address}-cells info */
-   of_scan_flat_dt(early_init_dt_scan_root, NULL);
+   early_init_dt_scan_root();
 
/* Retrieve various information from the /chosen node */
rc = early_init_dt_scan_chosen(boot_command_line);
diff --git a/include/linux/of_fdt.h b/include/linux/of_fdt.h
index 654722235df6..df3d31926c3c 100644
--- a/include/linux/of_fdt.h
+++ b/include/linux/of_fdt.h
@@ -68,8 +68,7 @@ extern void early_init_dt_add_memory_arch(u64 base, u64 size);
 extern u64 dt_mem_next_cell(int s, const __be32 **cellp);
 
 /* Early flat tree scan hooks */
-extern int early_init_dt_scan_root(unsigned long node, const char *uname,
-  int depth, void *data);
+extern int early_init_dt_scan_root(void);
 
 extern bool early_init_dt_scan(void *params);
 extern bool early_init_dt_verify(void *params);
-- 
2.32.0



[PATCH 1/3] of/fdt: Rework early_init_dt_scan_chosen() to call directly

2021-11-18 Thread Rob Herring
Use of the of_scan_flat_dt() function predates libfdt and is discouraged
as libfdt provides a nicer set of APIs. Rework
early_init_dt_scan_chosen() to be called directly and use libfdt.

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Frank Rowand 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Rob Herring 
---
 arch/powerpc/kernel/prom.c   |  2 +-
 arch/powerpc/mm/nohash/kaslr_booke.c |  4 +--
 drivers/of/fdt.c | 39 ++--
 include/linux/of_fdt.h   |  3 +--
 4 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 2e67588f6f6e..c6c398ccd98a 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -402,7 +402,7 @@ static int __init early_init_dt_scan_chosen_ppc(unsigned 
long node,
const unsigned long *lprop; /* All these set by kernel, so no need to 
convert endian */
 
/* Use common scan routine to determine if this is the chosen node */
-   if (early_init_dt_scan_chosen(node, uname, depth, data) == 0)
+   if (early_init_dt_scan_chosen(data) < 0)
return 0;
 
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/mm/nohash/kaslr_booke.c 
b/arch/powerpc/mm/nohash/kaslr_booke.c
index 8fc49b1b4a91..90debe19ab4c 100644
--- a/arch/powerpc/mm/nohash/kaslr_booke.c
+++ b/arch/powerpc/mm/nohash/kaslr_booke.c
@@ -44,9 +44,7 @@ struct regions __initdata regions;
 
 static __init void kaslr_get_cmdline(void *fdt)
 {
-   int node = fdt_path_offset(fdt, "/chosen");
-
-   early_init_dt_scan_chosen(node, "chosen", 1, boot_command_line);
+   early_init_dt_scan_chosen(boot_command_line);
 }
 
 static unsigned long __init rotate_xor(unsigned long hash, const void *area,
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index bdca35284ceb..1f1705f76263 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -1124,18 +1124,18 @@ int __init early_init_dt_scan_memory(unsigned long 
node, const char *uname,
return 0;
 }
 
-int __init early_init_dt_scan_chosen(unsigned long node, const char *uname,
-int depth, void *data)
+int __init early_init_dt_scan_chosen(char *cmdline)
 {
-   int l;
+   int l, node;
const char *p;
const void *rng_seed;
+   const void *fdt = initial_boot_params;
 
-   pr_debug("search \"chosen\", depth: %d, uname: %s\n", depth, uname);
-
-   if (depth != 1 || !data ||
-   (strcmp(uname, "chosen") != 0 && strcmp(uname, "chosen@0") != 0))
-   return 0;
+   node = fdt_path_offset(fdt, "/chosen");
+   if (node < 0)
+   node = fdt_path_offset(fdt, "/chosen@0");
+   if (node < 0)
+   return -ENOENT;
 
early_init_dt_check_for_initrd(node);
early_init_dt_check_for_elfcorehdr(node);
@@ -1144,7 +1144,7 @@ int __init early_init_dt_scan_chosen(unsigned long node, 
const char *uname,
/* Retrieve command line */
p = of_get_flat_dt_prop(node, "bootargs", &l);
if (p != NULL && l > 0)
-   strlcpy(data, p, min(l, COMMAND_LINE_SIZE));
+   strlcpy(cmdline, p, min(l, COMMAND_LINE_SIZE));
 
/*
 * CONFIG_CMDLINE is meant to be a default in case nothing else
@@ -1153,18 +1153,18 @@ int __init early_init_dt_scan_chosen(unsigned long 
node, const char *uname,
 */
 #ifdef CONFIG_CMDLINE
 #if defined(CONFIG_CMDLINE_EXTEND)
-   strlcat(data, " ", COMMAND_LINE_SIZE);
-   strlcat(data, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
+   strlcat(cmdline, " ", COMMAND_LINE_SIZE);
+   strlcat(cmdline, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
 #elif defined(CONFIG_CMDLINE_FORCE)
-   strlcpy(data, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
+   strlcpy(cmdline, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
 #else
/* No arguments from boot loader, use kernel's  cmdl*/
-   if (!((char *)data)[0])
-   strlcpy(data, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
+   if (!((char *)cmdline)[0])
+   strlcpy(cmdline, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
 #endif
 #endif /* CONFIG_CMDLINE */
 
-   pr_debug("Command line is: %s\n", (char *)data);
+   pr_debug("Command line is: %s\n", (char *)cmdline);
 
rng_seed = of_get_flat_dt_prop(node, "rng-seed", &l);
if (rng_seed && l > 0) {
@@ -1178,8 +1178,7 @@ int __init early_init_dt_scan_chosen(unsigned long node, 
const char *uname,
fdt_totalsize(initial_boot_params));
}
 
-   /* break now */
-   return 1;
+   return 0;
 }
 
 #ifndef MIN_MEMBLOCK_ADDR
@@ -1261,14 +1260,14 @@ bool __init early_init_dt_verify(void *params)
 
 void __init early_init_dt_scan_nodes(void)
 {
-   int rc = 0;
+   int rc;
 
/* Initialize {size,address}-cells info */
of_scan_flat_dt(early_init_dt_scan_root, NULL);
 
/* Retrieve various information from the /chosen node */
-   

[PATCH 0/3] of/fdt: Rework early FDT scanning functions

2021-11-18 Thread Rob Herring
The early FDT scanning functions use of_scan_flat_dt() which implements 
its own node walking method. This function predates libfdt and is an 
unnecessary indirection. This series reworks 
early_init_dt_scan_chosen(), early_init_dt_scan_root(), and 
early_init_dt_scan_memory() to be called directly and use libfdt calls.

Ultimately, I want to remove of_scan_flat_dt(). Most of the remaining 
of_scan_flat_dt() users are in powerpc.

Rob


Rob Herring (3):
  of/fdt: Rework early_init_dt_scan_chosen() to call directly
  of/fdt: Rework early_init_dt_scan_root() to call directly
  of/fdt: Rework early_init_dt_scan_memory() to call directly

 arch/mips/ralink/of.c|  16 +---
 arch/powerpc/kernel/prom.c   |  22 ++---
 arch/powerpc/mm/nohash/kaslr_booke.c |   4 +-
 drivers/of/fdt.c | 121 ++-
 include/linux/of_fdt.h   |   9 +-
 5 files changed, 79 insertions(+), 93 deletions(-)

-- 
2.32.0



Re: [PATCH 04/11] powerpc/xive: Introduce xive_core_debugfs_create()

2021-11-18 Thread Cédric Le Goater

On 11/18/21 10:21, Michael Ellerman wrote:

Cédric Le Goater  writes:


and fix some compile issues when !CONFIG_DEBUG_FS.

Signed-off-by: Cédric Le Goater 
---
  arch/powerpc/sysdev/xive/common.c | 17 ++---
  1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 3d558cad1f19..b71cc1020296 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c

...

@@ -1779,10 +1782,18 @@ static int xive_core_debug_show(struct seq_file *m, 
void *private)
  }
  DEFINE_SHOW_ATTRIBUTE(xive_core_debug);
  
+static void xive_core_debugfs_create(void)
+{
+   debugfs_create_file("xive", 0400, arch_debugfs_dir,
+   NULL, &xive_core_debug_fops);
+}
+
+#endif /* CONFIG_DEBUG_FS */
+
  int xive_core_debug_init(void)
  {
-   if (xive_enabled())
-   debugfs_create_file("xive", 0400, arch_debugfs_dir,
-   NULL, &xive_core_debug_fops);
+   if (xive_enabled() && IS_ENABLED(CONFIG_DEBUG_FS))
+   xive_core_debugfs_create();
+
return 0;
  }


For skiroot_defconfig this gives me:

   arch/powerpc/sysdev/xive/common.c: In function ‘xive_core_init’:
   arch/powerpc/sysdev/xive/common.c:1676:2: error: implicit declaration of 
function ‘xive_core_debugfs_create’; did you mean ‘xive_core_debug_init’? 
[-Werror=implicit-function-declaration]
1676 |  xive_core_debugfs_create();
 |  ^~~~
 |  xive_core_debug_init
   cc1: all warnings being treated as errors


We need an empty inline stub of xive_core_debugfs_create() for the
CONFIG_DEBUG_FS=n case.


I thought IS_ENABLED(CONFIG_DEBUG_FS) was protecting the compile from
that issue. Do you want a v2 for the whole series?

Or, I can send a replacement for patch 4 only.


I'm wondering though why do we have xive_core_debug_init() at all, why
don't we just initialise the debugfs files in xive_core_init()?


I think I tried that and there was an ordering issue.

Thanks,

C.


Re: [PATCH 11/11] powerpc/smp: Add a doorbell=off kernel parameter

2021-11-18 Thread Cédric Le Goater

On 11/18/21 10:24, Michael Ellerman wrote:

Cédric Le Goater  writes:

On 11/11/21 11:41, Michael Ellerman wrote:

Cédric Le Goater  writes:

On processors with a XIVE interrupt controller (POWER9 and above), the
kernel can use either doorbells or XIVE to generate CPU IPIs. Sending
doorbell is generally preferred to using the XIVE IC because it is
faster. There are cases where we want to avoid doorbells and use XIVE
only, for debug or performance. Only useful on POWER9 and above.


How much do we want this?


Yes. Thanks for asking. It is a recent need.

Here is some background I should have added in the first place. May be
for a v2.

We have different ways of doing IPIs on POWER9 and above processors,
depending on the platform and the underlying hypervisor.

- PowerNV uses global doorbells

- pSeries/KVM uses XIVE only because local doorbells are not
efficient, as there are emulated in the KVM hypervisor

- pSeries/PowerVM uses XIVE for remote cores and local doorbells for
threads on same core (SMT4 or 8)

This recent commit 5b06d1679f2f ("powerpc/pseries: Use doorbells even
if XIVE is available") introduced the optimization for PowerVM and
commit 107c55005fbd ("powerpc/pseries: Add KVM guest doorbell
restrictions") restricted the optimization.

We would like a way to turn off the optimization.


Just for test/debug though?


Yes. For now, this is the main target.


Kernel command line args are a bit of a pain, they tend to be poorly
tested, because someone has to explicitly enable them at boot time,
and then reboot to test the other case.


True. The "xive=off" parameter was poorly tested initially.


When would we want to enable this?


For bring-up, for debug, for tests. I have been using a similar switch
to compare the XIVE interrupt controller performance with doorbells on
POWER9 and POWER10.

A new need arises with PowerVM: some configurations will behave like KVM
(local doorbells are unsupported), and the doorbell=off parameter is a
simple way to handle this case today.


That's the first I've heard of that, what PowerVM configurations have
non-working doorbells?


It's not released yet.


Can we make the kernel smarter about when to use doorbells and make
it automated?


I don't think we want to probe all IPI methods to detect how well
local doorbells are supported on the platform. Do we?


We don't *want to*, but sounds like we might have to if they are not
working in some configurations as you mentioned above.


Yeah. This is too early. I will keep that patch for internal use
until we have clarified.

Thanks,

C.


Re: [PATCH v3 08/12] KVM: Propagate vcpu explicitly to mark_page_dirty_in_slot()

2021-11-18 Thread David Woodhouse
On Thu, 2021-11-18 at 13:04 +0100, Paolo Bonzini wrote:
> On 11/17/21 22:09, David Woodhouse wrote:
> > >   {
> > > - struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> > > + struct kvm_vcpu *running_vcpu = kvm_get_running_vcpu();
> > > 
> > > + WARN_ON_ONCE(vcpu && vcpu != running_vcpu);
> > >   WARN_ON_ONCE(vcpu->kvm != kvm);
> > Ah, that one needs to be changed to check running_vcpu instead. Or this
> > needs to go first:
> > 
> > I think I prefer making the vCPU a required argument. If anyone's going to
> > pull a vCPU pointer out of their posterior, let the caller do it.
> > 
> 
> I understand that feeling, but still using the running vCPU is by far 
> the common case, and it's not worth adding a new function parameter to 
> all call sites.
> 
> What about using a separate function, possibly __-prefixed, for the case 
> where you have a very specific vCPU?

We could certainly do a kvm_vcpu_mark_page_dirty_in_slot() and reduce
the collateral damage. I think this patch wouldn't need to touch
anything outside virt/kvm/ if we did that.

But bikeshedding aside, it occurs to me that I have a more substantial
concern about this approach — are we even *allowed* to touch the dirty
ring of another vCPU? It seems to be based on the assumption that it's
entirely a per-cpu structure on the kernel side. Wouldn't we need a
spinlock in kvm_dirty_ring_push() at the very least?

At this point I'm somewhat tempted to remove the mark_page_dirty() call
from gfn_to_pfn_cache_invalidate_start() and move it to the tail of 
kvm_gfn_to_pfn_cache_refresh() instead, where it unpins and memunmaps
the page. (We'd still tell the *kernel* by calling kvm_set_pfn_dirty()
immediately in the invalidation, just defer the KVM mark_page_dirty()
to be done in the context of an actual vCPU.)

If I do that, I can drop this 'propagate vcpu' patch completely.

I'm already happy enough that I'll fix the Xen shared_info page by
*never* bothering to dirty_log for it, and I can wash my hands of that
problem if I stop typing now. But...

That leaves the one in TDP MMU handle_changed_spte_dirty_log() which
AFAICT can trigger the same crash seen by butt3rflyh4ck — can't that
happen from a thread where kvm_get_running_vcpu() is NULL too? For that
one I'm not sure.





[PATCH v4 19/25] PCI/DPC: Use PCI_POSSIBLE_ERROR() to check read from hardware

2021-11-18 Thread Naveen Naidu
An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the
CPU read, so most hardware fabricates ~0 data.

Use PCI_POSSIBLE_ERROR() to check the response we get when we read
data from hardware.

This helps unify PCI error response checking and make error checks
consistent and easier to find.

Compile tested only.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index c556e7beafe3..3e9afee02e8d 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -79,7 +79,7 @@ static bool dpc_completed(struct pci_dev *pdev)
u16 status;
 
	pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS, &status);
-	if ((status != 0xffff) && (status & PCI_EXP_DPC_STATUS_TRIGGER))
+	if ((!PCI_POSSIBLE_ERROR(status)) && (status & PCI_EXP_DPC_STATUS_TRIGGER))
return false;
 
	if (test_bit(PCI_DPC_RECOVERING, &pdev->priv_flags))
@@ -312,7 +312,7 @@ static irqreturn_t dpc_irq(int irq, void *context)
 
	pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
 
-   if (!(status & PCI_EXP_DPC_STATUS_INTERRUPT) || status == (u16)(~0))
+	if (!(status & PCI_EXP_DPC_STATUS_INTERRUPT) || PCI_POSSIBLE_ERROR(status))
return IRQ_NONE;
 
pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
-- 
2.25.1



[PATCH v4 00/25] Unify PCI error response checking

2021-11-18 Thread Naveen Naidu
An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the 
CPU read, so most hardware fabricates ~0 data.

This patch series adds a PCI_ERROR_RESPONSE definition along with the
helper definitions PCI_SET_ERROR_RESPONSE and PCI_POSSIBLE_ERROR, and
uses them where appropriate to make these checks consistent and easier
to find.

This helps unify PCI error response checking and makes error checks
consistent and easier to find.

This series also ensures that the error response fabrication now happens
in the PCI_OP_READ and PCI_USER_READ_CONFIG. This removes the
responsibility from controller drivers to do the error response setting. 

Patch 1:
- Adds the PCI_ERROR_RESPONSE and other related definitions
- All other patches are dependent on this patch. This patch needs to
  be applied first, before the others

Patch 2:
- Error fabrication happens in PCI_OP_READ and PCI_USER_READ_CONFIG
  whenever the data read via the controller driver fails.
- This patch needs to be applied before Patch 4/24 to Patch 15/24 are
  applied.

Patch 3:
- Uses PCI_SET_ERROR_RESPONSE() when device is not found 

Patch 4 - 15:
- Removes redundant error fabrication that happens in controller 
  drivers when the read from a PCI device fails.
- These patches are dependent on Patch 2/24 of the series.
- These can be applied in any order.

Patch 16 - 22:
- Uses PCI_POSSIBLE_ERROR() to check the reads from hardware
- Patches can be applied in any order.

Patch 23 - 25:
- Edits the comments to include PCI_ERROR_RESPONSE along with
  0xffff, so that it becomes easier to grep for faulty
  hardware reads.

Changelog
=
v4:
   - Rename SET_PCI_ERROR_RESPONSE to PCI_SET_ERROR_RESPONSE
   - Rename RESPONSE_IS_PCI_ERROR to PCI_POSSIBLE_ERROR

v3:
   - Change RESPONSE_IS_PCI_ERROR macro definition
   - Fix the macros, Add () around macro parameters
   - Fix alignment issue in Patch 2/24
   - Add proper recipients for all the patches

v2:
- Instead of using SET_PCI_ERROR_RESPONSE in all controller drivers
  to fabricate error response, only use them in PCI_OP_READ and
  PCI_USER_READ_CONFIG

Naveen Naidu (25):
 [PATCH v4 1/25] PCI: Add PCI_ERROR_RESPONSE and it's related definitions
 [PATCH v4 2/25] PCI: Set error response in config access defines when ops->read() fails
 [PATCH v4 3/25] PCI: Use PCI_SET_ERROR_RESPONSE() when device not found
 [PATCH v4 4/25] PCI: Remove redundant error fabrication when device read fails
 [PATCH v4 5/25] PCI: thunder: Remove redundant error fabrication when device read fails
 [PATCH v4 6/25] PCI: iproc: Remove redundant error fabrication when device read fails
 [PATCH v4 7/25] PCI: mediatek: Remove redundant error fabrication when device read fails
 [PATCH v4 8/25] PCI: exynos: Remove redundant error fabrication when device read fails
 [PATCH v4 9/25] PCI: histb: Remove redundant error fabrication when device read fails
 [PATCH v4 10/25] PCI: kirin: Remove redundant error fabrication when device read fails
 [PATCH v4 11/25] PCI: aardvark: Remove redundant error fabrication when device read fails
 [PATCH v4 12/25] PCI: mvebu: Remove redundant error fabrication when device read fails
 [PATCH v4 13/25] PCI: altera: Remove redundant error fabrication when device read fails
 [PATCH v4 14/25] PCI: rcar: Remove redundant error fabrication when device read fails
 [PATCH v4 15/25] PCI: rockchip: Remove redundant error fabrication when device read fails
 [PATCH v4 16/25] PCI/ERR: Use PCI_POSSIBLE_ERROR() to check read from hardware
 [PATCH v4 17/25] PCI: vmd: Use PCI_POSSIBLE_ERROR() to check read from hardware
 [PATCH v4 18/25] PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check read from hardware
 [PATCH v4 19/25] PCI/DPC: Use PCI_POSSIBLE_ERROR() to check read from hardware
 [PATCH v4 20/25] PCI/PME: Use PCI_POSSIBLE_ERROR() to check read from hardware
 [PATCH v4 21/25] PCI: cpqphp: Use PCI_POSSIBLE_ERROR() to check read from hardware
 [PATCH v4 22/25] PCI: Use PCI_ERROR_RESPONSE to specify hardware error
 [PATCH v4 23/25] PCI: keystone: Use PCI_ERROR_RESPONSE to specify hardware error
 [PATCH v4 24/25] PCI: hv: Use PCI_ERROR_RESPONSE to specify hardware read error
 [PATCH v4 25/25] PCI: xgene: Use PCI_ERROR_RESPONSE to specify hardware error

 drivers/pci/access.c| 32 +++---
 drivers/pci/controller/dwc/pci-exynos.c |  4 +-
 drivers/pci/controller/dwc/pci-keystone.c   |  4 +-
 drivers/pci/controller/dwc/pcie-histb.c |  4 +-
 drivers/pci/controller/dwc/pcie-kirin.c |  4 +-
 drivers/pci/controller/pci-aardvark.c   |  4 +-
 drivers/pci/controller/pci-hyperv.c |  2 +-
 drivers/pci/controller/pci-mvebu.c  |  8 +---
 drivers/pci/controller/pci-thunder-ecam.c   | 46 +++--
 drivers/pci/controller/pci-thunder-pem.c|  4 +-
 drivers/pci/controller/pci-xgene.c  |  8 ++--
 

Re: [PATCH v3 01/25] PCI: Add PCI_ERROR_RESPONSE and it's related definitions

2021-11-18 Thread Naveen Naidu
On 17/11, Bjorn Helgaas wrote:
> On Thu, Oct 21, 2021 at 08:37:26PM +0530, Naveen Naidu wrote:
> > An MMIO read from a PCI device that doesn't exist or doesn't respond
> > causes a PCI error.  There's no real data to return to satisfy the
> > CPU read, so most hardware fabricates ~0 data.
> > 
> > Add a PCI_ERROR_RESPONSE definition for that and use it where
> > appropriate to make these checks consistent and easier to find.
> > 
> > Also add helper definitions SET_PCI_ERROR_RESPONSE and
> > RESPONSE_IS_PCI_ERROR to make the code more readable.
> > 
> > Suggested-by: Bjorn Helgaas 
> > Signed-off-by: Naveen Naidu 
> > ---
> >  include/linux/pci.h | 9 +
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index cd8aa6fce204..689c8277c584 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -154,6 +154,15 @@ enum pci_interrupt_pin {
> >  /* The number of legacy PCI INTx interrupts */
> >  #define PCI_NUM_INTX   4
> >  
> > +/*
> > + * Reading from a device that doesn't respond typically returns ~0.  A
> > + * successful read from a device may also return ~0, so you need additional
> > + * information to reliably identify errors.
> > + */
> > +#define PCI_ERROR_RESPONSE (~0ULL)
> > +#define SET_PCI_ERROR_RESPONSE(val)	(*(val) = ((typeof(*(val))) PCI_ERROR_RESPONSE))
> > +#define RESPONSE_IS_PCI_ERROR(val)	((val) == ((typeof(val)) PCI_ERROR_RESPONSE))
> 
> Beautiful!  I really like this.
>

Thank you very much for the review ^^

> I would prefer the macros to start with "PCI_", e.g.,
> PCI_SET_ERROR_RESPONSE().
> 

ACK

> I think "RESPONSE_IS_PCI_ERROR()" is too strong because (as the
> comment says), ~0 *may* indicate an error.  Or it may be a successful
> read of a register that happens to contain ~0.
> 
> Possibilities to convey the idea that this isn't definitive:
> 
>   PCI_POSSIBLE_ERROR_RESPONSE(val)  # a little long
>   PCI_LIKELY_ERROR(val) # we really have no idea whether
>   PCI_PROBABLE_ERROR(val)   #   likely or probable
>   PCI_POSSIBLE_ERROR(val)   # promising?
>

ACK. Will use PCI_POSSIBLE_ERROR()

> Can you rebase to my "main" branch (v5.16-rc1), tweak the above, and
> collect up the acks/reviews?
> 

ACK

> We should also browse drivers outside drivers/pci for places we could
> use these.  Not necessarily as part of this series, although if
> authors there object, it would be good to learn that earlier than
> later.
> 
> Drivers that implement pci_error_handlers might be a fruitful place to
> start.  But you've done a great job finding users of ~0 and 0x...
> in drivers/pci/, too.
> 

A quick grep showed that there are around 80 drivers which have
pci_error_handlers. I was thinking that it would be better if we handle
these drivers in another patch series since the current patch series is
itself 25 patches long. And in my short tenure reading LKML, I gathered
that folks generally are not so kind to a long list of patches in a
single patch series ^^' (I might be wrong though, Apologies)

The consensus on the patch series does seem slightly positive so
ideally, I was hoping that we would not have the case where a author
does not like the way we are handling this patch. Then again, I'm
pretty sure that I might be wrong ^^'

I hope it would be okay that I send in a new patch series with the
suggested changes and handle the other changes in another patch series
^^

Thanks,
Naveen
> > +
> >  /*
> >   * pci_power_t values must match the bits in the Capabilities PME_Support
> >   * and Control/Status PowerState fields in the Power Management capability.
> > -- 
> > 2.25.1
> > 
> > ___
> > Linux-kernel-mentees mailing list
> > linux-kernel-ment...@lists.linuxfoundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees


Re: [PATCH v3 08/12] KVM: Propagate vcpu explicitly to mark_page_dirty_in_slot()

2021-11-18 Thread Paolo Bonzini

On 11/17/21 22:09, David Woodhouse wrote:

  {
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+   struct kvm_vcpu *running_vcpu = kvm_get_running_vcpu();

+   WARN_ON_ONCE(vcpu && vcpu != running_vcpu);
WARN_ON_ONCE(vcpu->kvm != kvm);

Ah, that one needs to be changed to check running_vcpu instead. Or this
needs to go first:

I think I prefer making the vCPU a required argument. If anyone's going to
pull a vCPU pointer out of their posterior, let the caller do it.



I understand that feeling, but still using the running vCPU is by far 
the common case, and it's not worth adding a new function parameter to 
all call sites.


What about using a separate function, possibly __-prefixed, for the case 
where you have a very specific vCPU?


Paolo



Re: [PATCH 1/2] powerpc/mce: Avoid using irq_work_queue() in realmode

2021-11-18 Thread Ganesh

On 11/8/21 19:49, Nicholas Piggin wrote:


Excerpts from Ganesh Goudar's message of November 8, 2021 6:38 pm:

In the realmode mce handler we use irq_work_queue() to defer
the processing of mce events. irq_work_queue() can only
be called when translation is enabled because it touches
memory outside the RMA, hence we enable translation before
calling irq_work_queue() and disable it on return, though
this is not safe to do in realmode.

To avoid this, program the decrementer and call the event
processing functions from timer handler.

Signed-off-by: Ganesh Goudar
---
  arch/powerpc/include/asm/machdep.h   |  2 +
  arch/powerpc/include/asm/mce.h   |  2 +
  arch/powerpc/include/asm/paca.h  |  1 +
  arch/powerpc/kernel/mce.c| 51 +++-
  arch/powerpc/kernel/time.c   |  3 ++
  arch/powerpc/platforms/pseries/pseries.h |  1 +
  arch/powerpc/platforms/pseries/ras.c | 31 +-
  arch/powerpc/platforms/pseries/setup.c   |  1 +
  8 files changed, 34 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index 764f2732a821..c89cc03c0f97 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -103,6 +103,8 @@ struct machdep_calls {
/* Called during machine check exception to retrive fixup address. */
bool(*mce_check_early_recovery)(struct pt_regs *regs);
  
+	void(*machine_check_log_err)(void);

+
/* Motherboard/chipset features. This is a kind of general purpose
 * hook used to control some machine specific features (like reset
 * lines, chip power control, etc...).
diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 331d944280b8..187810f13669 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -235,8 +235,10 @@ extern void machine_check_print_event_info(struct 
machine_check_event *evt,
  unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
  extern void mce_common_process_ue(struct pt_regs *regs,
  struct mce_error_info *mce_err);
+extern void machine_check_raise_dec_intr(void);

No new externs on function declarations, they tell me.


ok.


  int mce_register_notifier(struct notifier_block *nb);
  int mce_unregister_notifier(struct notifier_block *nb);
+void mce_run_late_handlers(void);
  #ifdef CONFIG_PPC_BOOK3S_64
  void flush_and_reload_slb(void);
  void flush_erat(void);
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index dc05a862e72a..f49180f8c9be 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -280,6 +280,7 @@ struct paca_struct {
  #endif
  #ifdef CONFIG_PPC_BOOK3S_64
struct mce_info *mce_info;
+   atomic_t mces_to_process;
  #endif /* CONFIG_PPC_BOOK3S_64 */
  } cacheline_aligned;
  
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c

index fd829f7f25a4..45baa062ebc0 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -28,19 +28,9 @@
  
  #include "setup.h"
  
-static void machine_check_process_queued_event(struct irq_work *work);

-static void machine_check_ue_irq_work(struct irq_work *work);
  static void machine_check_ue_event(struct machine_check_event *evt);
  static void machine_process_ue_event(struct work_struct *work);
  
-static struct irq_work mce_event_process_work = {

-.func = machine_check_process_queued_event,
-};
-
-static struct irq_work mce_ue_event_irq_work = {
-   .func = machine_check_ue_irq_work,
-};
-
  static DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
  
  static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);

@@ -89,6 +79,12 @@ static void mce_set_error_info(struct machine_check_event *mce,
}
  }
  
+/* Raise decrementer interrupt */

+void machine_check_raise_dec_intr(void)
+{
+   set_dec(1);
+}

The problem here is that a timer can be scheduled (e.g., by an external
interrupt if it gets taken before the decrementer, then uses a
timer) and that sets decr > 1. See logic in decrementer_set_next_event.

I _think_ the way to get around this would be to have the machine check
just use arch_irq_work_raise.

Then you could also only call the mce handler inside the
test_irq_work_pending() check and avoid the added function call on every
timer. That test should also be marked unlikely come to think of it, but
that's a side patchlet.


Sure, I will use arch_irq_work_raise() and test_irq_work_pending().
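The control flow being agreed on here can be sketched outside the kernel; every name below is a `_sketch` stand-in for the real routine it mimics (arch_irq_work_raise(), save_mce_event(), mce_run_late_handlers(), and the timer interrupt body), with a plain flag standing in for the per-cpu irq-work state:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins: a per-cpu irq-work pending flag plus MCE bookkeeping. */
static bool irq_work_pending_flag;
static int mces_to_process;
static int mces_handled;

/* Realmode-safe side: raising irq work only sets a flag, touching no
 * memory outside the RMA (stand-in for arch_irq_work_raise()). */
static void arch_irq_work_raise_sketch(void)
{
	irq_work_pending_flag = true;
}

/* Called from the realmode MCE handler instead of irq_work_queue(). */
static void save_mce_event_sketch(void)
{
	mces_to_process++;
	arch_irq_work_raise_sketch();
}

static void mce_run_late_handlers_sketch(void)
{
	while (mces_to_process) {
		mces_handled++;	/* stand-in for log/process/ue work */
		mces_to_process--;
	}
}

/* Timer-interrupt side: only call the MCE handlers inside the
 * pending-irq-work check, as suggested in the review. */
static void timer_interrupt_sketch(void)
{
	if (irq_work_pending_flag) {	/* unlikely() in the kernel */
		irq_work_pending_flag = false;
		mce_run_late_handlers_sketch();
	}
}
```

This avoids the set_dec(1) race with decrementer_set_next_event(), and ordinary timer ticks pay only the flag test.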




+
  /*
   * Decode and save high level MCE information into per cpu buffer which
   * is an array of machine_check_event structure.
@@ -135,6 +131,8 @@ void save_mce_event(struct pt_regs *regs, long handled,
if (mce->error_type == MCE_ERROR_TYPE_UE)
mce->u.ue_error.ignore_event = mce_err->ignore_event;
  
+	atomic_inc(&local_paca->mces_to_process);

+
if (!addr)

Re: [PATCH 1/2] powerpc/mce: Avoid using irq_work_queue() in realmode

2021-11-18 Thread Ganesh

On 11/12/21 12:42, Daniel Axtens wrote:


In realmode mce handler we use irq_work_queue() to defer
the processing of mce events, irq_work_queue() can only
be called when translation is enabled because it touches
memory outside RMA, hence we enable translation before
calling irq_work_queue and disable on return, though it
is not safe to do in realmode.

To avoid this, program the decrementer and call the event
processing functions from timer handler.


This is an interesting approach, and it would indeed be nice to clear up
the MCE handling a bit.

I have a few questions:


diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 331d944280b8..187810f13669 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -235,8 +235,10 @@ extern void machine_check_print_event_info(struct 
machine_check_event *evt,
  unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
  extern void mce_common_process_ue(struct pt_regs *regs,
  struct mce_error_info *mce_err);
+extern void machine_check_raise_dec_intr(void);

Does this need an extern? I think that's the default...?


Not required, I will remove it.


  int mce_register_notifier(struct notifier_block *nb);
  int mce_unregister_notifier(struct notifier_block *nb);
+void mce_run_late_handlers(void);
  #ifdef CONFIG_PPC_BOOK3S_64
  void flush_and_reload_slb(void);
  void flush_erat(void);
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index dc05a862e72a..f49180f8c9be 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -280,6 +280,7 @@ struct paca_struct {
  #endif
  #ifdef CONFIG_PPC_BOOK3S_64
struct mce_info *mce_info;
+   atomic_t mces_to_process;

This is in the PACA, which is supposed to be a per-cpu structure: hey
does it need to be atomic_t? Isn't there only one CPU accessing it?


Yes it need not be atomic, got confused with some scenario and
made it atomic.


Does this variable provide anything new compared to mce_nest_count or
mce_queue_count + mce_ue_count? It looks like
machine_check_process_queued_event will clear a queue based on value of
mce_queue_count and machine_check_process_ue_event will clear a queue
based on mce_ue_count...

I think (absent nested interrupts which I talk about below) it should be
the case that mces_to_process == mce_queue_count + mce_ue_count but I
might be wrong?


If we hit an exception in process context, we may not increment mce_queue_count,
so the equation need not hold at all times.


  #endif /* CONFIG_PPC_BOOK3S_64 */
  } cacheline_aligned;
   

  /*
   * Decode and save high level MCE information into per cpu buffer which
   * is an array of machine_check_event structure.
@@ -135,6 +131,8 @@ void save_mce_event(struct pt_regs *regs, long handled,
if (mce->error_type == MCE_ERROR_TYPE_UE)
mce->u.ue_error.ignore_event = mce_err->ignore_event;
  
+	atomic_inc(&local_paca->mces_to_process);

+

Is there any chance the decrementer will fire between when you do this
atomic_inc() and when you finish adding all the information to the mce
data structure in the remainder of save_mce_event? (e.g. filling in the
tlb_errror.effective_address field)?

(Or does save_mce_event get called with interrupts masked? I find it
very hard to remember what parts of the MCE code path happen under what
circumstances!)


Yes, Interrupts will be disabled, I mean MSR[EE]=0 when mce is being handled.


if (!addr)
return;
  



@@ -338,7 +322,7 @@ static void machine_process_ue_event(struct work_struct 
*work)
   * process pending MCE event from the mce event queue. This function will be
   * called during syscall exit.

Is this comment still accurate if this patch is applied?


No, mpe has also pointed this out, we will clean it in a different patch.




   */
-static void machine_check_process_queued_event(struct irq_work *work)
+static void machine_check_process_queued_event(void)
  {
int index;
struct machine_check_event *evt;
@@ -363,6 +347,17 @@ static void machine_check_process_queued_event(struct 
irq_work *work)
}
  }
  
+void mce_run_late_handlers(void)

+{
+   if (unlikely(atomic_read(&local_paca->mces_to_process))) {
+   if (ppc_md.machine_check_log_err)
+   ppc_md.machine_check_log_err();
+   machine_check_process_queued_event();
+   machine_check_ue_work();
+   atomic_dec(&local_paca->mces_to_process);
+   }
+}

What happens if you get a nested MCE between log_err() and
process_queued_event()? If my very foggy memory of the MCE handling is
correct, we enable nested MCEs very early in the process because the
alternative is that a nested MCE will checkstop the box. So I think this
might be possible, albeit probably unlikely.

It looks like process_queued_event clears the entire MCE queue as
determined by the mce_queue_count. So, consider 

[PATCH] powerpc/32: Fix hardlockup on vmap stack overflow

2021-11-18 Thread Christophe Leroy
Since the commit c118c7303ad5 ("powerpc/32: Fix vmap stack - Do not
activate MMU before reading task struct") a vmap stack overflow
results in a hard lockup. This is because emergency_ctx is still
addressed with its virtual address although data MMU is not active
anymore at that time.

Fix it by using a physical address instead.

Fixes: c118c7303ad5 ("powerpc/32: Fix vmap stack - Do not activate MMU before 
reading task struct")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_32.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index 6b1ec9e3541b..349c4a820231 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -202,11 +202,11 @@ _ASM_NOKPROBE_SYMBOL(\name\()_virt)
mfspr   r1, SPRN_SPRG_THREAD
lwz r1, TASK_CPU - THREAD(r1)
	slwi	r1, r1, 3
-   addis   r1, r1, emergency_ctx@ha
+   addis   r1, r1, emergency_ctx-PAGE_OFFSET@ha
 #else
-   lis r1, emergency_ctx@ha
+   lis r1, emergency_ctx-PAGE_OFFSET@ha
 #endif
-   lwz r1, emergency_ctx@l(r1)
+   lwz r1, emergency_ctx-PAGE_OFFSET@l(r1)
	addi	r1, r1, THREAD_SIZE - INT_FRAME_SIZE
EXCEPTION_PROLOG_2 0 vmap_stack_overflow
prepare_transfer_to_handler
-- 
2.33.1
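The `emergency_ctx-PAGE_OFFSET@ha`/`@l` relocations above perform, at link time, the same subtraction this sketch does at run time. The PAGE_OFFSET value is an assumption for illustration (the classic ppc32 linear-map base; the real value is configuration dependent):

```c
#include <assert.h>

/* Assumed linear-map base for a typical ppc32 kernel (illustration only). */
#define PAGE_OFFSET 0xC0000000UL

/* With the data MMU off the CPU issues real addresses, so a linear-map
 * virtual address must have PAGE_OFFSET subtracted before it can be
 * loaded into r1 -- which is exactly what the patch arranges. */
static unsigned long virt_to_phys_linear(unsigned long vaddr)
{
	return vaddr - PAGE_OFFSET;
}
```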



Re: [PATCH 11/11] powerpc/smp: Add a doorbell=off kernel parameter

2021-11-18 Thread Michael Ellerman
Cédric Le Goater  writes:
> On 11/11/21 11:41, Michael Ellerman wrote:
>> Cédric Le Goater  writes:
>>> On processors with a XIVE interrupt controller (POWER9 and above), the
>>> kernel can use either doorbells or XIVE to generate CPU IPIs. Sending
>>> doorbell is generally preferred to using the XIVE IC because it is
>>> faster. There are cases where we want to avoid doorbells and use XIVE
>>> only, for debug or performance. Only useful on POWER9 and above.
>> 
>> How much do we want this?
>
> Yes. Thanks for asking. It is a recent need.
>
> Here is some background I should have added in the first place. May be
> for a v2.
>
> We have different ways of doing IPIs on POWER9 and above processors,
> depending on the platform and the underlying hypervisor.
>
> - PowerNV uses global doorbells
>
> - pSeries/KVM uses XIVE only because local doorbells are not
>    efficient, as they are emulated in the KVM hypervisor
>
> - pSeries/PowerVM uses XIVE for remote cores and local doorbells for
>    threads on the same core (SMT4 or 8)
>
> This recent commit 5b06d1679f2f ("powerpc/pseries: Use doorbells even
> if XIVE is available") introduced the optimization for PowerVM and
> commit 107c55005fbd ("powerpc/pseries: Add KVM guest doorbell
> restrictions") restricted the optimization.
>
> We would like a way to turn off the optimization.

Just for test/debug though?

>> Kernel command line args are a bit of a pain, they tend to be poorly
>> tested, because someone has to explicitly enable them at boot time,
>> and then reboot to test the other case.
>
> True. The "xive=off" parameter was poorly tested initially.
>
>> When would we want to enable this?
>
> For bring-up, for debug, for tests. I have been using a similar switch
> to compare the XIVE interrupt controller performance with doorbells on
> POWER9 and POWER10.
>
> A new need arises with PowerVM, some configurations will behave as KVM
> (local doorbell are unsupported) and the doorbell=off parameter is a
> simple way to handle this case today.

That's the first I've heard of that, what PowerVM configurations have
non-working doorbells?

>> Can we make the kernel smarter about when to use doorbells and make
>> it automated?
>
> I don't think we want to probe all IPI methods to detect how well
> local doorbells are supported on the platform. Do we ?

We don't *want to*, but sounds like we might have to if they are not
working in some configurations as you mentioned above.

cheers


Re: [PATCH 04/11] powerpc/xive: Introduce xive_core_debugfs_create()

2021-11-18 Thread Michael Ellerman
Cédric Le Goater  writes:

> and fix some compile issues when !CONFIG_DEBUG_FS.
>
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/sysdev/xive/common.c | 17 ++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 3d558cad1f19..b71cc1020296 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
...
> @@ -1779,10 +1782,18 @@ static int xive_core_debug_show(struct seq_file *m, 
> void *private)
>  }
>  DEFINE_SHOW_ATTRIBUTE(xive_core_debug);
>  
> +static void xive_core_debugfs_create(void)
> +{
> + debugfs_create_file("xive", 0400, arch_debugfs_dir,
> +	NULL, &xive_core_debug_fops);
> +}
> +
> +#endif /* CONFIG_DEBUG_FS */
> +
>  int xive_core_debug_init(void)
>  {
> - if (xive_enabled())
> - debugfs_create_file("xive", 0400, arch_debugfs_dir,
> -	NULL, &xive_core_debug_fops);
> + if (xive_enabled() && IS_ENABLED(CONFIG_DEBUG_FS))
> + xive_core_debugfs_create();
> +
>   return 0;
>  }

For skiroot_defconfig this gives me:

  arch/powerpc/sysdev/xive/common.c: In function ‘xive_core_init’:
  arch/powerpc/sysdev/xive/common.c:1676:2: error: implicit declaration of 
function ‘xive_core_debugfs_create’; did you mean ‘xive_core_debug_init’? 
[-Werror=implicit-function-declaration]
   1676 |  xive_core_debugfs_create();
|  ^~~~
|  xive_core_debug_init
  cc1: all warnings being treated as errors


We need an empty inline stub of xive_core_debugfs_create() for the
CONFIG_DEBUG_FS=n case.
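The shape of the requested fix, sketched so it builds outside the kernel (the counter is a stand-in for the real debugfs call, and xive_enabled() is assumed true; the commented-out body shows the kernel call from the patch):

```c
#include <assert.h>

/* Illustrative stand-in so the sketch builds outside the kernel. */
static int xive_files_created;

#ifdef CONFIG_DEBUG_FS
static void xive_core_debugfs_create(void)
{
	/* In the kernel this would be:
	 * debugfs_create_file("xive", 0400, arch_debugfs_dir,
	 *                     NULL, &xive_core_debug_fops);
	 */
	xive_files_created++;
}
#else
/* Empty inline stub for CONFIG_DEBUG_FS=n, as requested in the review,
 * so callers need no IS_ENABLED() guard of their own. */
static inline void xive_core_debugfs_create(void) { }
#endif

static int xive_core_debug_init_sketch(void)
{
	xive_core_debugfs_create();	/* compiles in both configurations */
	return 0;
}
```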

I'm wondering though why do we have xive_core_debug_init() at all, why
don't we just initialise the debugfs files in xive_core_init()?

cheers