[PATCH v2 2/2] KVM: PPC: Book3S PR: Disallow AIL != 0

2022-01-28 Thread Nicholas Piggin
KVM PR does not implement address translation modes on interrupt, so it
must not allow H_SET_MODE to succeed. The behaviour change caused by
this mode is architected and not advisory (interrupts *must* behave
differently).

QEMU does not deal with differences in AIL support in the host. The
solution to that is a spapr capability and corresponding KVM CAP, but
this patch does not break things more than before (the host behaviour
already differs, this change just disallows some modes that are not
implemented properly).

By happy coincidence, this allows PR Linux guests that are using the SCV
facility to boot and run, because Linux disables the use of SCV if AIL
can not be set to 3. This does not fix the underlying problem of missing
SCV support (an OS could implement real-mode SCV vectors and try to
enable the facility). The true fix for that is for KVM PR to emulate scv
interrupts from the facility unavailable interrupt.
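For reference, the return-value convention used above can be modeled in plain C. This is only a userspace sketch of the hypothetical h_set_mode_ail() helper: the constant values are assumed from Linux's hvcall.h (H_SUCCESS == 0, H_UNSUPPORTED_FLAG_START == -256), and only the AIL resource case is shown.

```c
#include <assert.h>

#define H_SUCCESS 0
#define H_UNSUPPORTED_FLAG_START (-256)	/* assumed, per Linux hvcall.h */

/*
 * Model of the H_SET_MODE AIL check: only mflags == 0 (AIL off) succeeds.
 * An unsupported flag N is reported as H_UNSUPPORTED_FLAG_START - N;
 * the AIL mode value lives in flag bit 63.
 */
static long h_set_mode_ail(unsigned long mflags)
{
	if (mflags == 0)
		return H_SUCCESS;
	return H_UNSUPPORTED_FLAG_START - 63;	/* -319 for any AIL != 0 */
}
```

So a guest asking for AIL=2 or AIL=3 sees a per-flag failure rather than a silent success with wrong interrupt behaviour.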

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_pr_papr.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index 1f10e7dfcdd0..dc4f51ac84bc 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -281,6 +281,22 @@ static int kvmppc_h_pr_logical_ci_store(struct kvm_vcpu *vcpu)
return EMULATE_DONE;
 }
 
+static int kvmppc_h_pr_set_mode(struct kvm_vcpu *vcpu)
+{
+   unsigned long mflags = kvmppc_get_gpr(vcpu, 4);
+   unsigned long resource = kvmppc_get_gpr(vcpu, 5);
+
+   if (resource == H_SET_MODE_RESOURCE_ADDR_TRANS_MODE) {
+   /* KVM PR does not provide AIL!=0 to guests */
+   if (mflags == 0)
+   kvmppc_set_gpr(vcpu, 3, H_SUCCESS);
+   else
+   kvmppc_set_gpr(vcpu, 3, H_UNSUPPORTED_FLAG_START - 63);
+   return EMULATE_DONE;
+   }
+   return EMULATE_FAIL;
+}
+
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 {
@@ -384,6 +400,8 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
return kvmppc_h_pr_logical_ci_load(vcpu);
case H_LOGICAL_CI_STORE:
return kvmppc_h_pr_logical_ci_store(vcpu);
+   case H_SET_MODE:
+   return kvmppc_h_pr_set_mode(vcpu);
case H_XIRR:
case H_CPPR:
case H_EOI:
@@ -421,6 +439,7 @@ int kvmppc_hcall_impl_pr(unsigned long cmd)
case H_CEDE:
case H_LOGICAL_CI_LOAD:
case H_LOGICAL_CI_STORE:
+   case H_SET_MODE:
 #ifdef CONFIG_KVM_XICS
case H_XIRR:
case H_CPPR:
@@ -447,6 +466,7 @@ static unsigned int default_hcall_list[] = {
H_BULK_REMOVE,
H_PUT_TCE,
H_CEDE,
+   H_SET_MODE,
 #ifdef CONFIG_KVM_XICS
H_XIRR,
H_CPPR,
-- 
2.23.0



[PATCH v2 1/2] KVM: PPC: Book3S PR: Disable SCV when AIL could be disabled

2022-01-28 Thread Nicholas Piggin
PR KVM does not support running with AIL enabled, and SCV is not
supported with AIL disabled. Fix this by ensuring the SCV facility is
disabled with FSCR while a CPU could be running with AIL=0.

The PowerNV host supports disabling AIL on a per-CPU basis, so SCV just
needs to be disabled when a vCPU is being run.

The pSeries machine can only switch AIL on a system-wide basis, so it
must disable SCV support at boot if the configuration can potentially
run a PR KVM guest.

Also ensure the FSCR[SCV] bit can not be enabled when emulating
mtFSCR for the guest.

SCV is not emulated for the PR guest at the moment; this just fixes the
host crashes.

Alternatives considered and rejected:
- SCV support can not be disabled by PR KVM after boot, because it is
  advertised to userspace with HWCAP.
- AIL can not be disabled on a per-CPU basis. At least when running on
  pseries it is a per-LPAR setting.
- Support for real-mode SCV vectors will not be added because they are
  at 0x17000 so making such a large fixed head space causes immediate
  value limits to be exceeded, requiring a lot of rework and more code.
- Disabling SCV for any kernel where PR KVM is possible would cause a
  slowdown when not using PR KVM.
- A boot time option to disable SCV to use PR KVM is user-hostile.
- System call instruction emulation for SCV facility unavailable
  instructions is too complex and old emulation code was subtly broken
  and removed.
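The mtFSCR guard described above amounts to masking one facility bit. A minimal userspace sketch, assuming the Linux FSCR_SCV bit position (bit 12); sanitize_guest_fscr() is a hypothetical name, not the kernel function:

```c
#include <assert.h>
#include <stdint.h>

#define FSCR_SCV (1ULL << 12)	/* assumed FSCR bit position for scv */

/*
 * Model of the mtFSCR emulation guard: whatever FSCR value the guest
 * asks for, never let it enable the SCV facility.
 */
static uint64_t sanitize_guest_fscr(uint64_t requested_fscr)
{
	return requested_fscr & ~FSCR_SCV;
}
```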

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S |  4 
 arch/powerpc/kernel/setup_64.c   | 28 
 arch/powerpc/kvm/Kconfig |  9 +
 arch/powerpc/kvm/book3s_pr.c | 20 ++--
 4 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 55caeee37c08..b66dd6f775a4 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -809,6 +809,10 @@ __start_interrupts:
  * - MSR_EE|MSR_RI is clear (no reentrant exceptions)
  * - Standard kernel environment is set up (stack, paca, etc)
  *
+ * KVM:
+ * These interrupts do not elevate HV 0->1, so HV is not involved. PR KVM
+ * ensures that FSCR[SCV] is disabled whenever it has to force AIL off.
+ *
  * Call convention:
  *
  * syscall register convention is in Documentation/powerpc/syscall64-abi.rst
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index be8577ac9397..7f7da641e551 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -197,6 +197,34 @@ static void __init configure_exceptions(void)
 
/* Under a PAPR hypervisor, we need hypercalls */
if (firmware_has_feature(FW_FEATURE_SET_MODE)) {
+   /*
+* - PR KVM does not support AIL mode interrupts in the host
+*   while a PR guest is running.
+*
+* - SCV system call interrupt vectors are only implemented for
+*   AIL mode interrupts.
+*
+* - On pseries, AIL mode can only be enabled and disabled
+*   system-wide so when a PR VM is created on a pseries host,
+*   all CPUs of the host are set to AIL=0 mode.
+*
+* - Therefore host CPUs must not execute scv while a PR VM
+*   exists.
+*
+* - SCV support can not be disabled dynamically because the
+*   feature is advertised to host userspace. Disabling the
+*   facility and emulating it would be possible but is not
+*   implemented.
+*
+* - So SCV support is blanket disabled if PR KVM could possibly
+*   run. That is, PR support compiled in, booting on pseries
+*   with hash MMU.
+*/
+   if (IS_ENABLED(CONFIG_KVM_BOOK3S_PR_POSSIBLE) && !radix_enabled()) {
+   init_task.thread.fscr &= ~FSCR_SCV;
+   cur_cpu_spec->cpu_user_features2 &= ~PPC_FEATURE2_SCV;
+   }
+
/* Enable AIL if possible */
if (!pseries_enable_reloc_on_exc()) {
init_task.thread.fscr &= ~FSCR_SCV;
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 18e58085447c..ddd88179110a 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -112,12 +112,21 @@ config KVM_BOOK3S_64_PR
  guest in user mode (problem state) and emulating all
  privileged instructions and registers.
 
+ This is only available for hash MMU mode and only supports
+ guests that use hash MMU mode.
+
  This is not as fast as using hypervisor mode, but works on
  machines where hypervisor mode is not available or not usable,
  and can emulate processors that are different from the host
   

[PATCH v2 0/2] KVM: PPC: Book3S PR: Fixes for AIL and SCV

2022-01-28 Thread Nicholas Piggin
The first patch in this series fixes a KVM PR host crash caused by a
guest executing the scv instruction, or, on a pseries SMP host, by the
host CPUs executing the scv instruction while a PR guest is running.

The second patch fixes unimplemented H_SET_MODE AIL modes by returning
failure from the hcall rather than succeeding but not implementing
the required behaviour. This works around missing host scv support for
scv-capable Linux guests by causing them to disable the facility.

Still looking at doing a proper capability for QEMU/KVM so we can get
consistency between HV, PR, and TCG. That shouldn't change these
patches though.

Thanks,
Nick

Nicholas Piggin (2):
  KVM: PPC: Book3S PR: Disable SCV when AIL could be disabled
  KVM: PPC: Book3S PR: Disallow AIL != 0

 arch/powerpc/kernel/exceptions-64s.S |  4 
 arch/powerpc/kernel/setup_64.c   | 28 
 arch/powerpc/kvm/Kconfig |  9 +
 arch/powerpc/kvm/book3s_pr.c | 20 ++--
 arch/powerpc/kvm/book3s_pr_papr.c| 20 
 5 files changed, 75 insertions(+), 6 deletions(-)

-- 
2.23.0



[PATCH] powerpc: platforms: 52xx: Fix a resource leak in an error handling path

2022-01-28 Thread Christophe JAILLET
In the error handling path of mpc52xx_lpbfifo_probe(), a request_irq()
is not balanced by a corresponding free_irq().

Add the missing call, as already done in the remove function.
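The general pattern being restored here is the usual probe error-unwind: each failure label releases, in reverse order, everything acquired before it. A self-contained sketch with stub functions (all names hypothetical, standing in for request_irq()/free_irq() and the bestcomm task allocation):

```c
#include <assert.h>

/* Stubs standing in for the real acquire/release calls (hypothetical). */
static int irq_requested;
static int request_irq_stub(void) { irq_requested = 1; return 0; }
static void free_irq_stub(void)   { irq_requested = 0; }
static int alloc_rx_task_stub(int fail) { return fail ? -1 : 0; }

/*
 * Probe pattern: a failure after request_irq() must jump to a label
 * that frees the IRQ before unwinding anything acquired earlier.
 */
static int probe_stub(int fail_rx)
{
	if (request_irq_stub())
		goto err_irq;
	if (alloc_rx_task_stub(fail_rx))
		goto err_bcom_rx;	/* must release the IRQ acquired above */
	return 0;

 err_bcom_rx:
	free_irq_stub();		/* the call the patch adds */
 err_irq:
	return -1;
}
```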

Fixes: 3c9059d79f5e ("powerpc/5200: add LocalPlus bus FIFO device driver")
Signed-off-by: Christophe JAILLET 
---
Another strange thing is that the remove function has:
/* Release the bestcomm transmit task */
free_irq(bcom_get_task_irq(lpbfifo.bcom_tx_task), &lpbfifo);
but I've not been able to find a corresponding request_irq().

Is it dead code? Is there something missing in the probe?
(...Is it working?...)
---
 arch/powerpc/platforms/52xx/mpc52xx_lpbfifo.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/52xx/mpc52xx_lpbfifo.c b/arch/powerpc/platforms/52xx/mpc52xx_lpbfifo.c
index b91ebebd9ff2..e0049b7df212 100644
--- a/arch/powerpc/platforms/52xx/mpc52xx_lpbfifo.c
+++ b/arch/powerpc/platforms/52xx/mpc52xx_lpbfifo.c
@@ -530,6 +530,7 @@ static int mpc52xx_lpbfifo_probe(struct platform_device *op)
  err_bcom_rx_irq:
bcom_gen_bd_rx_release(lpbfifo.bcom_rx_task);
  err_bcom_rx:
+   free_irq(lpbfifo.irq, &lpbfifo);
  err_irq:
iounmap(lpbfifo.regs);
lpbfifo.regs = NULL;
-- 
2.32.0



[PATCH] powerpc/fadump: Use swap() instead of open coding it

2022-01-28 Thread Jiapeng Chong
Clean the following coccicheck warning:

./arch/powerpc/kernel/fadump.c:1291:34-35: WARNING opportunity for
swap().
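The kernel's swap() is a typeof-based macro that replaces the open-coded three-assignment exchange. A userspace analogue of the pattern the patch applies in sort_and_merge_mem_ranges() (selection-style pass over range bases; names here are illustrative, not the fadump code):

```c
#include <assert.h>

/* Userspace analogue of the kernel's swap() macro. */
#define swap(a, b) \
	do { __typeof__(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

/* Pick the minimum base each pass and swap it into place. */
static void sort_bases(unsigned long long *base, int n)
{
	for (int i = 0; i < n - 1; i++) {
		int idx = i;

		for (int j = i + 1; j < n; j++)
			if (base[idx] > base[j])
				idx = j;
		if (idx != i)
			swap(base[idx], base[i]);
	}
}
```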

Reported-by: Abaci Robot 
Signed-off-by: Jiapeng Chong 
---
 arch/powerpc/kernel/fadump.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index d0ad86b67e66..de08dd078081 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -1271,7 +1271,6 @@ static void fadump_release_reserved_area(u64 start, u64 end)
 static void sort_and_merge_mem_ranges(struct fadump_mrange_info *mrange_info)
 {
struct fadump_memory_range *mem_ranges;
-   struct fadump_memory_range tmp_range;
u64 base, size;
int i, j, idx;
 
@@ -1286,11 +1285,8 @@ static void sort_and_merge_mem_ranges(struct fadump_mrange_info *mrange_info)
if (mem_ranges[idx].base > mem_ranges[j].base)
idx = j;
}
-   if (idx != i) {
-   tmp_range = mem_ranges[idx];
-   mem_ranges[idx] = mem_ranges[i];
-   mem_ranges[i] = tmp_range;
-   }
+   if (idx != i)
+   swap(mem_ranges[idx], mem_ranges[i]);
}
 
/* Merge adjacent reserved ranges */
-- 
2.20.1.7.g153144c



[PATCH 3/3] powerpc/pseries/vas: Disable window open during migration

2022-01-28 Thread Haren Myneni


The current partition migration implementation does not freeze the
user space, so user space can continue to open VAS windows. Hence,
when the migration_in_progress flag is set, the VAS open window
API returns -EBUSY.
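The key point of the diff below is that the flag is tested under the same vas_pseries_mutex the migration handler holds while setting it, so an open cannot race past a suspend already in progress. A userspace sketch of just that check (mutex calls left as comments; open_window() is a hypothetical name, and the EBUSY value of 16 is the usual Linux errno):

```c
#include <assert.h>

#define EBUSY_RC 16	/* assumed Linux errno value for EBUSY */

static int migration_in_progress;

/*
 * Model of the pseries VAS open path: fail fast with -EBUSY while a
 * migration is in progress, otherwise fall through to window setup.
 */
static int open_window(void)
{
	/* mutex_lock(&vas_pseries_mutex); */
	if (migration_in_progress)
		return -EBUSY_RC;
	/* rc = allocate_setup_window(...); */
	/* mutex_unlock(&vas_pseries_mutex); */
	return 0;
}
```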

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/pseries/vas.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
index b53e3fe02971..63316c401abb 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -29,6 +29,7 @@ static bool copypaste_feat;
 
 static struct vas_caps vascaps[VAS_MAX_FEAT_TYPE];
 static DEFINE_MUTEX(vas_pseries_mutex);
+static bool migration_in_progress;
 
 static long hcall_return_busy_check(long rc)
 {
@@ -355,8 +356,11 @@ static struct vas_window *vas_allocate_window(int vas_id, u64 flags,
 * same fault IRQ is not freed by the OS before.
 */
mutex_lock(&vas_pseries_mutex);
-   rc = allocate_setup_window(txwin, (u64 *)&domain[0],
-  cop_feat_caps->win_type);
+   if (migration_in_progress)
+   rc = -EBUSY;
+   else
+   rc = allocate_setup_window(txwin, (u64 *)&domain[0],
+  cop_feat_caps->win_type);
mutex_unlock(&vas_pseries_mutex);
if (rc)
goto out;
@@ -893,6 +897,11 @@ int vas_migration_handler(int action)
 
mutex_lock(&vas_pseries_mutex);
 
+   if (action == VAS_SUSPEND)
+   migration_in_progress = true;
+   else
+   migration_in_progress = false;
+
for (i = 0; i < VAS_MAX_FEAT_TYPE; i++) {
vcaps = &vascaps[i];
caps = &vcaps->caps;
-- 
2.27.0




[PATCH 2/3] powerpc/pseries/vas: Add VAS migration handler

2022-01-28 Thread Haren Myneni


Since the VAS windows belong to the VAS hardware resource, the
hypervisor expects the partition to close them on the source partition
and reopen them on the destination machine after the partition has
migrated.

This handler is called before pseries_suspend() to close these
windows and is invoked again after migration. All active windows
for both default and QoS types are closed and marked inactive
during suspend, and reopened after migration by this handler.
During the migration, user space receives a paste instruction
failure if it issues copy/paste on these inactive windows.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/pseries/mobility.c |  5 ++
 arch/powerpc/platforms/pseries/vas.c  | 86 +++
 arch/powerpc/platforms/pseries/vas.h  |  6 ++
 3 files changed, 97 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index 85033f392c78..70004243e25e 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include "pseries.h"
+#include "vas.h"   /* vas_migration_handler() */
 #include "../../kernel/cacheinfo.h"
 
 static struct kobject *mobility_kobj;
@@ -669,12 +670,16 @@ static int pseries_migrate_partition(u64 handle)
if (ret)
return ret;
 
+   vas_migration_handler(VAS_SUSPEND);
+
ret = pseries_suspend(handle);
if (ret == 0)
post_mobility_fixup();
else
pseries_cancel_migration(handle, ret);
 
+   vas_migration_handler(VAS_RESUME);
+
return ret;
 }
 
diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
index e4797fc73553..b53e3fe02971 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -873,6 +873,92 @@ static struct notifier_block pseries_vas_nb = {
.notifier_call = pseries_vas_notifier,
 };
 
+/*
+ * For LPM, all windows have to be closed on the source partition
+ * before migration and reopen them on the destination partition
+ * after migration. So closing windows during suspend and
+ * reopen them during resume.
+ */
+int vas_migration_handler(int action)
+{
+   struct hv_vas_cop_feat_caps *hv_caps;
+   struct vas_cop_feat_caps *caps;
+   int lpar_creds, new_creds = 0;
+   struct vas_caps *vcaps;
+   int i, rc = 0;
+
+   hv_caps = kmalloc(sizeof(*hv_caps), GFP_KERNEL);
+   if (!hv_caps)
+   return -ENOMEM;
+
+   mutex_lock(&vas_pseries_mutex);
+
+   for (i = 0; i < VAS_MAX_FEAT_TYPE; i++) {
+   vcaps = &vascaps[i];
+   caps = &vcaps->caps;
+   lpar_creds = atomic_read(&caps->target_creds);
+
+   rc = h_query_vas_capabilities(H_QUERY_VAS_CAPABILITIES,
+ vcaps->feat,
+ (u64)virt_to_phys(hv_caps));
+   if (!rc) {
+   new_creds = be16_to_cpu(hv_caps->target_lpar_creds);
+   /*
+* Should not happen. But in case it does, print messages, close
+* all windows in the list during suspend and reopen
+* windows based on new lpar_creds on the destination
+* system.
+*/
+   if (lpar_creds != new_creds) {
+   pr_err("state(%d): lpar creds: %d HV lpar creds: %d\n",
+   action, lpar_creds, new_creds);
+   pr_err("Used creds: %d, Active creds: %d\n",
+   atomic_read(&caps->used_creds),
+   vcaps->num_wins - vcaps->close_wins);
+   }
+   } else {
+   pr_err("state(%d): Get VAS capabilities failed with %d\n",
+   action, rc);
+   /*
+* We can not stop migration with the current lpm
+* implementation. So continue closing all windows in
+* the list (during suspend) and return without
+* opening windows (during resume) if VAS capabilities
+* HCALL failed.
+*/
+   if (action == VAS_RESUME)
+   goto out;
+   }
+
+   switch (action) {
+   case VAS_SUSPEND:
+   rc = reconfig_close_windows(vcaps, vcaps->num_wins,
+   true);
+   break;
+   case VAS_RESUME:
+   atomic_set(&caps->target_creds, new_creds);
+   rc = reconfig_open_windows(vcaps, new_creds, true);
+   break;

[PATCH 1/3] powerpc/pseries/vas: Modify reconfig open/close functions for migration

2022-01-28 Thread Haren Myneni


VAS is a hardware engine that stays on the chip. So when the partition
migrates, all VAS windows on the source system have to be closed and
reopened on the destination after migration.

This patch makes changes to the current reconfig_open/close_windows
functions to support migration:
- Set VAS_WIN_MIGRATE_CLOSE in the window status when closing windows,
  and reopen windows with the same status during resume.
- Continue to close all windows even if the deallocate HCALL fails
  (should not happen), since there is no way to stop migration with
  the current LPM implementation.
- If a DLPAR CPU event happens while migration is in progress, also
  set VAS_WIN_NO_CRED_CLOSE in the window status. A window is closed
  on the first event (migration or DLPAR) and reopened only on the
  last event (migration or DLPAR).
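The first-event-closes / last-event-reopens rule falls out naturally when each close reason is a separate status bit. A small sketch using the two flag values from the patch (win_close()/win_reopen() are hypothetical helper names, not the vas.c functions):

```c
#include <assert.h>

#define VAS_WIN_NO_CRED_CLOSE  0x0004
#define VAS_WIN_MIGRATE_CLOSE  0x0008
#define VAS_WIN_CLOSE_MASK     (VAS_WIN_NO_CRED_CLOSE | VAS_WIN_MIGRATE_CLOSE)

/* Close on the first event: only act if no close flag was set yet. */
static int win_close(unsigned int *status, unsigned int flag)
{
	int first_event = (*status & VAS_WIN_CLOSE_MASK) == 0;

	*status |= flag;
	return first_event;	/* 1 => actually close the HW window */
}

/* Reopen on the last event: clear this reason, reopen only when none remain. */
static int win_reopen(unsigned int *status, unsigned int flag)
{
	*status &= ~flag;
	return (*status & VAS_WIN_CLOSE_MASK) == 0;	/* 1 => reopen */
}
```

With a DLPAR credit loss arriving mid-migration, the second close and the first reopen both become no-ops, which is exactly the behaviour the bullet list describes.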

Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/vas.h   |  2 +
 arch/powerpc/platforms/pseries/vas.c | 88 ++--
 2 files changed, 73 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index ddc05a8fc2e3..f21e76f47175 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -42,6 +42,8 @@
 /* Linux status bits */
 #define VAS_WIN_NO_CRED_CLOSE  0x0004 /* Window is closed due to */
   /* lost credit */
+#define VAS_WIN_MIGRATE_CLOSE  0x0008 /* Window is closed due to */
+  /* migration */
 /*
  * Get/Set bit fields
  */
diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
index 3400f4fc6609..e4797fc73553 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -456,11 +456,12 @@ static int vas_deallocate_window(struct vas_window *vwin)
mutex_lock(&vas_pseries_mutex);
/*
 * VAS window is already closed in the hypervisor when
-* lost the credit. So just remove the entry from
-* the list, remove task references and free vas_window
+* lost the credit or with migration. So just remove the entry
+* from the list, remove task references and free vas_window
 * struct.
 */
-   if (win->vas_win.status & VAS_WIN_NO_CRED_CLOSE) {
+   if (!(win->vas_win.status & VAS_WIN_NO_CRED_CLOSE) &&
+   !(win->vas_win.status & VAS_WIN_MIGRATE_CLOSE)) {
rc = deallocate_free_window(win);
if (rc) {
mutex_unlock(&vas_pseries_mutex);
@@ -577,12 +578,14 @@ static int __init get_vas_capabilities(u8 feat, enum vas_cop_feat_type type,
  * by setting the remapping to new paste address if the window is
  * active.
  */
-static int reconfig_open_windows(struct vas_caps *vcaps, int creds)
+static int reconfig_open_windows(struct vas_caps *vcaps, int creds,
+bool migrate)
 {
long domain[PLPAR_HCALL9_BUFSIZE] = {VAS_DEFAULT_DOMAIN_ID};
struct vas_cop_feat_caps *caps = &vcaps->caps;
struct pseries_vas_window *win = NULL, *tmp;
int rc, mv_ents = 0;
+   int flag;
 
/*
 * Nothing to do if there are no closed windows.
@@ -601,8 +604,10 @@ static int reconfig_open_windows(struct vas_caps *vcaps, int creds)
 * (dedicated). If 1 core is added, this LPAR can have 20 more
 * credits. It means the kernel can reopen 20 windows. So move
 * 20 entries in the VAS windows lost and reopen next 20 windows.
+* For partition migration, reopen all windows that are closed
+* during resume.
 */
-   if (vcaps->close_wins > creds)
+   if ((vcaps->close_wins > creds) && !migrate)
mv_ents = vcaps->close_wins - creds;
 
list_for_each_entry_safe(win, tmp, &vcaps->list, win_list) {
@@ -612,12 +617,35 @@ static int reconfig_open_windows(struct vas_caps *vcaps, int creds)
mv_ents--;
}
 
+   /*
+* Open windows if they are closed only with migration or
+* DLPAR (lost credit) before.
+*/
+   if (migrate)
+   flag = VAS_WIN_MIGRATE_CLOSE;
+   else
+   flag = VAS_WIN_NO_CRED_CLOSE;
+
list_for_each_entry_safe_from(win, tmp, &vcaps->list, win_list) {
+   /*
+* This window is closed with DLPAR and migration events.
+* So reopen the window with the last event.
+* The user space is not suspended with the current
+* migration notifier. So the user space can issue DLPAR
+* CPU hotplug while migration in progress. In this case
+* this window will be opened with the last event.
+*/
+   if ((win->vas_win.status & VAS_WIN_NO_CRED_CLOSE) &&
+   (win->vas_win.status & VAS_WIN_MIGRATE_CLOSE)) {
+   win->vas_win.status &= ~flag;
+

[PATCH 0/3] powerpc/pseries/vas: VAS/NXGZIP support with LPM

2022-01-28 Thread Haren Myneni


Virtual Accelerator Switchboard (VAS) is an engine that stays on the
chip, so all windows opened on a specific engine belong to that chip.
The hypervisor expects the partition to close all active windows on
the source system and reopen them after migration on the destination
machine.

This patch series adds VAS support for partition migration. When
migration initiates, the VAS migration handler is invoked before
pseries_suspend() to close all active windows and mark them inactive
with the VAS_WIN_MIGRATE_CLOSE status. The same handler is called
after migration to reopen all windows that have the
VAS_WIN_MIGRATE_CLOSE status and make them active again. User space
gets a paste instruction failure when it sends requests on these
inactive windows.

These patches depend on VAS/DLPAR support patch series

Haren Myneni (3):
  powerpc/pseries/vas: Modify reconfig open/close functions for
migration
  powerpc/pseries/vas: Add VAS migration handler
  powerpc/pseries/vas: Disable window open during migration

 arch/powerpc/include/asm/vas.h|   2 +
 arch/powerpc/platforms/pseries/mobility.c |   5 +
 arch/powerpc/platforms/pseries/vas.c  | 187 +++---
 arch/powerpc/platforms/pseries/vas.h  |   6 +
 4 files changed, 181 insertions(+), 19 deletions(-)

-- 
2.27.0




Re: [PATCH] ftrace: Have architectures opt-in for mcount build time sorting

2022-01-28 Thread Steven Rostedt
On Fri, 28 Jan 2022 16:11:39 -0500
Joe Lawrence  wrote:

> The bisect finally landed on:
> 
>   72b3942a173c387b27860ba1069636726e208777 is the first bad commit
>   commit 72b3942a173c387b27860ba1069636726e208777
>   Author: Yinan Liu 
>   Date:   Sun Dec 12 19:33:58 2021 +0800
> 
>   scripts: ftrace - move the sort-processing in ftrace_init
> 
> and I can confirm that your updates today in "[for-linus][PATCH 00/10]
> tracing: Fixes for 5.17-rc1" fix or avoid the issue.  I just wanted to
> add my report in case this adds any future complications for mcount
> build time sorting.  Let me know if any additional tests would be
> helpful.

Thanks for letting me know. That patch set has already landed in Linus's
tree.


-- Steve


Re: [PATCH] ftrace: Have architectures opt-in for mcount build time sorting

2022-01-28 Thread Joe Lawrence
On Thu, Jan 27, 2022 at 11:42:49AM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
>
> First S390 complained that the sorting of the mcount sections at build
> time caused the kernel to crash on their architecture. Now PowerPC is
> complaining about it too. And also ARM64 appears to be having issues.
>
> It may be necessary to also update the relocation table for the values
> in the mcount table. Not only do we have to sort the table, but also
> update the relocations that may be applied to the items in the table.
>
> If the system is not relocatable, then it is fine to sort, but if it is,
> some architectures may have issues (although x86 does not as it shifts all
> addresses the same).
>
> Add a HAVE_BUILDTIME_MCOUNT_SORT that an architecture can set to say it is
> safe to do the sorting at build time.
>
> Also update the config to compile in build time sorting in the sorttable
> code in scripts/ to depend on CONFIG_BUILDTIME_MCOUNT_SORT.
>
> Link: 
> https://lore.kernel.org/all/944d10da-8200-4ba9-8d0a-3bed9aa99...@linux.ibm.com/
>
> Cc: Mark Rutland 
> Cc: Yinan Liu 
> Cc: Ard Biesheuvel 
> Cc: Kees Cook 
> Cc: linuxppc-dev@lists.ozlabs.org
> Reported-by: Sachin Sant 
> Tested-by: Sachin Sant 
> Fixes: 72b3942a173c ("scripts: ftrace - move the sort-processing in 
> ftrace_init")
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  arch/arm/Kconfig | 1 +
>  arch/x86/Kconfig | 1 +
>  kernel/trace/Kconfig | 8 +++-
>  scripts/Makefile | 2 +-
>  4 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index c2724d986fa0..5256ebe57451 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -82,6 +82,7 @@ config ARM
>   select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
>   select HAVE_CONTEXT_TRACKING
>   select HAVE_C_RECORDMCOUNT
> + select HAVE_BUILDTIME_MCOUNT_SORT
>   select HAVE_DEBUG_KMEMLEAK if !XIP_KERNEL
>   select HAVE_DMA_CONTIGUOUS if MMU
>   select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7399327d1eff..46080dea5dba 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -186,6 +186,7 @@ config X86
>   select HAVE_CONTEXT_TRACKING_OFFSTACK   if HAVE_CONTEXT_TRACKING
>   select HAVE_C_RECORDMCOUNT
>   select HAVE_OBJTOOL_MCOUNT  if STACK_VALIDATION
> + select HAVE_BUILDTIME_MCOUNT_SORT
>   select HAVE_DEBUG_KMEMLEAK
>   select HAVE_DMA_CONTIGUOUS
>   select HAVE_DYNAMIC_FTRACE
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 752ed89a293b..7e5b92090faa 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -70,10 +70,16 @@ config HAVE_C_RECORDMCOUNT
>   help
> C version of recordmcount available?
>
> +config HAVE_BUILDTIME_MCOUNT_SORT
> +   bool
> +   help
> + An architecture selects this if it sorts the mcount_loc section
> +  at build time.
> +
>  config BUILDTIME_MCOUNT_SORT
> bool
> default y
> -   depends on BUILDTIME_TABLE_SORT && !S390
> +   depends on HAVE_BUILDTIME_MCOUNT_SORT
> help
>   Sort the mcount_loc section at build time.
>
> diff --git a/scripts/Makefile b/scripts/Makefile
> index b082d2f93357..cedc1f0e21d8 100644
> --- a/scripts/Makefile
> +++ b/scripts/Makefile
> @@ -32,7 +32,7 @@ HOSTCFLAGS_sorttable.o += 
> -I$(srctree)/tools/arch/x86/include
>  HOSTCFLAGS_sorttable.o += -DUNWINDER_ORC_ENABLED
>  endif
>
> -ifdef CONFIG_DYNAMIC_FTRACE
> +ifdef CONFIG_BUILDTIME_MCOUNT_SORT
>  HOSTCFLAGS_sorttable.o += -DMCOUNT_SORT_ENABLED
>  endif
>
> --
> 2.33.0
>

Hi Steve,

I just finished bisecting what is probably the same problem... when
running the livepatch selftests for 5.17-rc1, x86_64 passes, but I kept
getting errors like this on ppc64le:

  kernel: livepatch: enabling patch 'test_klp_livepatch'
  kernel: livepatch: failed to find location for function 'cmdline_proc_show'
  kernel: livepatch: failed to patch object 'vmlinux'
  kernel: livepatch: failed to enable patch 'test_klp_livepatch'
  kernel: livepatch: 'test_klp_livepatch': unpatching complete

which means klp_get_ftrace_location() / ftrace_location_range() hit a
problem with that function.

The bisect finally landed on:

  72b3942a173c387b27860ba1069636726e208777 is the first bad commit
  commit 72b3942a173c387b27860ba1069636726e208777
  Author: Yinan Liu 
  Date:   Sun Dec 12 19:33:58 2021 +0800

  scripts: ftrace - move the sort-processing in ftrace_init

and I can confirm that your updates today in "[for-linus][PATCH 00/10]
tracing: Fixes for 5.17-rc1" fix or avoid the issue.  I just wanted to
add my report in case this adds any future complications for mcount
build time sorting.  Let me know if any additional tests would be
helpful.

Regards,

-- Joe



Re: ftrace hangs waiting for rcu

2022-01-28 Thread Paul E. McKenney
On Fri, Jan 28, 2022 at 08:15:47AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 28, 2022 at 04:11:57PM +, Mark Rutland wrote:
> > On Fri, Jan 28, 2022 at 05:08:48PM +0100, Sven Schnelle wrote:
> > > Hi Mark,
> > > 
> > > Mark Rutland  writes:
> > > 
> > > > On arm64 I bisected this down to:
> > > >
> > > >   7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for 
> > > > dynamic queue selection")
> > > >
> > > > Which was going wrong because ilog2() rounds down, and so the shift was 
> > > > wrong
> > > > for any nr_cpus that was not a power-of-two. Paul had already fixed 
> > > > that in
> > > > rcu-next, and just sent a pull request to Linus:
> > > >
> > > >   
> > > > https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/
> > > >
> > > > With that applied, I no longer see these hangs.
> > > >
> > > > Does your s390 test machine have a non-power-of-two nr_cpus, and does 
> > > > that fix
> > > > the issue for you?
> > > 
> > > We noticed the PR from Paul and are currently testing the fix. So far
> > > it's looking good. The configuration where we have seen the hang is a
> > > bit unusual:
> > > 
> > > - 16 physical CPUs on the kvm host
> > > - 248 logical CPUs inside kvm
> > 
> > Aha! 248 is notably *NOT* a power of two, and in this case the shift would 
> > be
> > wrong (ilog2() would give 7, when we need a shift of 8).
> > 
> > So I suspect you're hitting the same issue as I was.
> 
> And apparently no one runs -next on systems having a non-power-of-two
> number of CPUs.  ;-)

And the fix is now in mainline.

Thanx, Paul

> > Thanks,
> > Mark.
> > 
> > > - debug kernel both on the host and kvm guest
> > > 
> > > So things are likely a bit slow in the kvm guest. Interesting is that
> > > the number of CPUs is even. But maybe RCU sees an odd number of CPUs
> > > and gets confused before all cpus are brought up. Have to read code/test
> > > to see whether that could be possible.
> > > 
> > > Thanks for investigating!
> > > Sven


Re: ftrace hangs waiting for rcu

2022-01-28 Thread Sven Schnelle
Hi Mark,

Mark Rutland  writes:

> On Fri, Jan 28, 2022 at 05:08:48PM +0100, Sven Schnelle wrote:
>> We noticed the PR from Paul and are currently testing the fix. So far
>> it's looking good. The configuration where we have seen the hang is a
>> bit unusual:
>> 
>> - 16 physical CPUs on the kvm host
>> - 248 logical CPUs inside kvm
>
> Aha! 248 is notably *NOT* a power of two, and in this case the shift would be
> wrong (ilog2() would give 7, when we need a shift of 8).
>
> So I suspect you're hitting the same issue as I was.

Argh, indeed! I somehow changed 'power of two' to 'odd number' in my
head. I guess it's time for the weekend. :-)

Thanks!


Re: ftrace hangs waiting for rcu

2022-01-28 Thread Paul E. McKenney
On Fri, Jan 28, 2022 at 04:11:57PM +, Mark Rutland wrote:
> On Fri, Jan 28, 2022 at 05:08:48PM +0100, Sven Schnelle wrote:
> > Hi Mark,
> > 
> > Mark Rutland  writes:
> > 
> > > On arm64 I bisected this down to:
> > >
> > >   7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for 
> > > dynamic queue selection")
> > >
> > > Which was going wrong because ilog2() rounds down, and so the shift was 
> > > wrong
> > > for any nr_cpus that was not a power-of-two. Paul had already fixed that 
> > > in
> > > rcu-next, and just sent a pull request to Linus:
> > >
> > >   
> > > https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/
> > >
> > > With that applied, I no longer see these hangs.
> > >
> > > Does your s390 test machine have a non-power-of-two nr_cpus, and does 
> > > that fix
> > > the issue for you?
> > 
> > We noticed the PR from Paul and are currently testing the fix. So far
> > it's looking good. The configuration where we have seen the hang is a
> > bit unusual:
> > 
> > - 16 physical CPUs on the kvm host
> > - 248 logical CPUs inside kvm
> 
> Aha! 248 is notably *NOT* a power of two, and in this case the shift would be
> wrong (ilog2() would give 7, when we need a shift of 8).
> 
> So I suspect you're hitting the same issue as I was.

And apparently no one runs -next on systems having a non-power-of-two
number of CPUs.  ;-)

Thanx, Paul

> Thanks,
> Mark.
> 
> > - debug kernel both on the host and kvm guest
> > 
> > So things are likely a bit slow in the kvm guest. It is interesting that
> > the number of CPUs is even. But maybe RCU sees an odd number of CPUs
> > and gets confused before all CPUs are brought up. I have to read code/test
> > to see whether that could be possible.
> > 
> > Thanks for investigating!
> > Sven


Re: ftrace hangs waiting for rcu

2022-01-28 Thread Mark Rutland
On Fri, Jan 28, 2022 at 05:08:48PM +0100, Sven Schnelle wrote:
> Hi Mark,
> 
> Mark Rutland  writes:
> 
> > On arm64 I bisected this down to:
> >
> >   7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for 
> > dynamic queue selection")
> >
> > Which was going wrong because ilog2() rounds down, and so the shift was 
> > wrong
> > for any nr_cpus that was not a power-of-two. Paul had already fixed that in
> > rcu-next, and just sent a pull request to Linus:
> >
> >   
> > https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/
> >
> > With that applied, I no longer see these hangs.
> >
> > Does your s390 test machine have a non-power-of-two nr_cpus, and does that 
> > fix
> > the issue for you?
> 
> We noticed the PR from Paul and are currently testing the fix. So far
> it's looking good. The configuration where we have seen the hang is a
> bit unusual:
> 
> - 16 physical CPUs on the kvm host
> - 248 logical CPUs inside kvm

Aha! 248 is notably *NOT* a power of two, and in this case the shift would be
wrong (ilog2() would give 7, when we need a shift of 8).

So I suspect you're hitting the same issue as I was.

Thanks,
Mark.

> - debug kernel both on the host and kvm guest
> 
> So things are likely a bit slow in the kvm guest. It is interesting that
> the number of CPUs is even. But maybe RCU sees an odd number of CPUs
> and gets confused before all CPUs are brought up. I have to read code/test
> to see whether that could be possible.
> 
> Thanks for investigating!
> Sven


Re: ftrace hangs waiting for rcu

2022-01-28 Thread Sven Schnelle
Hi Mark,

Mark Rutland  writes:

> On arm64 I bisected this down to:
>
>   7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic 
> queue selection")
>
> Which was going wrong because ilog2() rounds down, and so the shift was wrong
> for any nr_cpus that was not a power-of-two. Paul had already fixed that in
> rcu-next, and just sent a pull request to Linus:
>
>   
> https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/
>
> With that applied, I no longer see these hangs.
>
> Does your s390 test machine have a non-power-of-two nr_cpus, and does that fix
> the issue for you?

We noticed the PR from Paul and are currently testing the fix. So far
it's looking good. The configuration where we have seen the hang is a
bit unusual:

- 16 physical CPUs on the kvm host
- 248 logical CPUs inside kvm
- debug kernel both on the host and kvm guest

So things are likely a bit slow in the kvm guest. It is interesting that
the number of CPUs is even. But maybe RCU sees an odd number of CPUs
and gets confused before all CPUs are brought up. I have to read code/test
to see whether that could be possible.

Thanks for investigating!
Sven


Re: ftrace hangs waiting for rcu (was: Re: [PATCH] ftrace: Have architectures opt-in for mcount build time sorting)

2022-01-28 Thread Mark Rutland
Hi Sven,

On Thu, Jan 27, 2022 at 07:42:35PM +0100, Sven Schnelle wrote:
> Mark Rutland  writes:
> 
> > * I intermittently see a hang when running the tests. I previously hit that
> >   when originally trying to bisect this issue (and IIRC that bisected down 
> > to
> >   some RCU changes, but I need to re-run that). When the tests hang I
> >   magic-sysrq + L tells me:
> >
> >   [  271.938438] sysrq: Show Blocked State
> >   [  271.939245] task:ftracetest  state:D stack:0 pid: 5687 ppid:  
> > 5627 flags:0x0200
> >   [  271.940961] Call trace:
> >   [  271.941472]  __switch_to+0x104/0x160
> >   [  271.942213]  __schedule+0x2b0/0x6e0
> >   [  271.942933]  schedule+0x5c/0xf0
> >   [  271.943586]  schedule_timeout+0x184/0x1c4
> >   [  271.944410]  wait_for_completion+0x8c/0x12c
> >   [  271.945274]  __wait_rcu_gp+0x184/0x190
> >   [  271.946047]  synchronize_rcu_tasks_rude+0x48/0x70
> >   [  271.947007]  update_ftrace_function+0xa4/0xec
> >   [  271.947897]  __unregister_ftrace_function+0xa4/0xf0
> >   [  271.948898]  unregister_ftrace_function+0x34/0x70
> >   [  271.949857]  wakeup_tracer_reset+0x4c/0x100
> >   [  271.950713]  tracing_set_tracer+0xd0/0x2b0
> >   [  271.951552]  tracing_set_trace_write+0xe8/0x150
> >   [  271.952477]  vfs_write+0xfc/0x284
> >   [  271.953171]  ksys_write+0x7c/0x110
> >   [  271.953874]  __arm64_sys_write+0x2c/0x40
> >   [  271.954678]  invoke_syscall+0x5c/0x130
> >   [  271.955442]  el0_svc_common.constprop.0+0x108/0x130
> >   [  271.956435]  do_el0_svc+0x74/0x90
> >   [  271.957124]  el0_svc+0x2c/0x90
> >   [  271.957757]  el0t_64_sync_handler+0xa8/0x12c
> >   [  271.958629]  el0t_64_sync+0x1a0/0x1a4

On arm64 I bisected this down to:

  7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic 
queue selection")

Which was going wrong because ilog2() rounds down, and so the shift was wrong
for any nr_cpus that was not a power-of-two. Paul had already fixed that in
rcu-next, and just sent a pull request to Linus:

  
https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/

With that applied, I no longer see these hangs.

Does your s390 test machine have a non-power-of-two nr_cpus, and does that fix
the issue for you?

On arm64 the startup tests didn't seem to trigger the hang, but I was able to
trigger the hang fairly reliably with the ftrace selftests, e.g.

  $ for N in $(seq 1 10); do ./ftracetest test.d/00basic/basic2.tc; done

... which prior to the fix, would hang between runs 2 to 5.

Thanks,
Mark.

> that's interesting. On s390 I'm seeing the same problem in CI, but with
> the startup ftrace tests. So that's likely not arm64 specific.
> 
> On s390, the last messages from ftrace are [5.663568] clocksource: 
> jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 
> 1911260446275 ns
> [5.667099] futex hash table entries: 65536 (order: 12, 16777216 bytes, 
> vmalloc)
> [5.739549] Running postponed tracer tests:
> [5.740662] Testing tracer function: PASSED
> [6.194635] Testing dynamic ftrace: PASSED
> [6.471213] Testing dynamic ftrace ops #1: 
> [6.558445] (1 0 1 0 0) 
> [6.558458] (1 1 2 0 0) 
> [6.699135] (2 1 3 0 764347) 
> [6.699252] (2 2 4 0 766466) 
> [6.759857] (3 2 4 0 1159604)
> [..] hangs here
> 
> The backtrace looks like this, which is very similar to the one above:
> 
> crash> bt 1
> PID: 1  TASK: 80e68100  CPU: 133  COMMAND: "swapper/0"
>  #0 [380004df808] __schedule at cda39f0e
>  #1 [380004df880] schedule at cda3a488
>  #2 [380004df8b0] schedule_timeout at cda41ef6
>  #3 [380004df978] wait_for_completion at cda3bd0a
>  #4 [380004df9d8] __wait_rcu_gp at cc92
>  #5 [380004dfa30] synchronize_rcu_tasks_generic at ccdde0aa
>  #6 [380004dfad8] ftrace_shutdown at cce7b050
>  #7 [380004dfb18] unregister_ftrace_function at cce7b192
>  #8 [380004dfb50] trace_selftest_ops at cda1e0fa
>  #9 [380004dfba0] run_tracer_selftest at cda1e4f2
> #10 [380004dfc00] trace_selftest_startup_function at ce74355c
> #11 [380004dfc58] run_tracer_selftest at cda1e2fc
> #12 [380004dfc98] init_trace_selftests at ce742d30
> #13 [380004dfcd0] do_one_initcall at cccdca16
> #14 [380004dfd68] do_initcalls at ce72e776
> #15 [380004dfde0] kernel_init_freeable at ce72ea60
> #16 [380004dfe50] kernel_init at cda333fe
> #17 [380004dfe68] __ret_from_fork at cccdf920
> #18 [380004dfe98] ret_from_fork at cda444ca
> 
> I haven't had success reproducing it so far, but it is good to know that
> this also happens when running the ftrace testsuite.
> 
> I have several crashdumps, so i could try to pull out some information
> if someone tells me what to look for.
> 
> Thanks,
> Sven


Re: [PATCH v3] PCI hotplug: rpaphp: Error out on busy status from get-sensor-state

2022-01-28 Thread Mahesh J Salgaonkar
On 2021-12-09 09:02:51 Thu, Nathan Lynch wrote:
> Mahesh Salgaonkar  writes:
> > To avoid this issue, fix the pci hotplug driver (rpaphp) to return an error
> > if the slot presence state can not be detected immediately. Current
> > implementation uses rtas_get_sensor() API which blocks the slot check state
> > until rtas call returns success. Change rpaphp_get_sensor_state() to invoke
> > rtas_call(get-sensor-state) directly and take actions based on rtas return
> > status. This patch now errors out immediately on busy return status from
> > rtas_call.
> >
> > Please note that, only on certain PHB failures, the slot presence check
> > returns BUSY condition. In normal cases it returns immediately with a
> > correct presence state value. Hence this change has no impact on normal pci
> > dlpar operations.
> 
> I was wondering about this. This seems to be saying -2/990x cannot
> happen in other cases. I couldn't find this specified in the
> architecture. It seems a bit risky to me to *always* error out on
> -2/990x - won't we have intermittent slot enable failures?

Sorry for the late response. Instead of always erroring out, how about
we error out only if the PE is going through EEH recovery? During
get_adapter_status I can check if pe->state is set to EEH_PE_RECOVERING
and only then return an error on busy; otherwise fall back to the
existing rtas_get_sensor method. Let me send out another version with
this approach.

Thanks,
-Mahesh.

> 
> > +/*
> > + * RTAS call get-sensor-state(DR_ENTITY_SENSE) return values as per PAPR:
> > + *    -1: Hardware Error
> > + *    -2: RTAS_BUSY
> > + *    -3: Invalid sensor. RTAS Parameter Error.
> > + * -9000: Need DR entity to be powered up and unisolated before RTAS call
> > + * -9001: Need DR entity to be powered up, but not unisolated, before RTAS 
> > call
> > + * -9002: DR entity unusable
> > + *  990x: Extended delay - where x is a number in the range of 0-5
> > + */
> > +#define RTAS_HARDWARE_ERROR	-1
> > +#define RTAS_INVALID_SENSOR	-3
> > +#define SLOT_UNISOLATED		-9000
> > +#define SLOT_NOT_UNISOLATED	-9001
> > +#define SLOT_NOT_USABLE		-9002
> > +
> > +static int rtas_to_errno(int rtas_rc)
> > +{
> > +   int rc;
> > +
> > +   switch (rtas_rc) {
> > +   case RTAS_HARDWARE_ERROR:
> > +   rc = -EIO;
> > +   break;
> > +   case RTAS_INVALID_SENSOR:
> > +   rc = -EINVAL;
> > +   break;
> > +   case SLOT_UNISOLATED:
> > +   case SLOT_NOT_UNISOLATED:
> > +   rc = -EFAULT;
> > +   break;
> > +   case SLOT_NOT_USABLE:
> > +   rc = -ENODEV;
> > +   break;
> > +   case RTAS_BUSY:
> > +   case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
> > +   rc = -EBUSY;
> > +   break;
> > +   default:
> > +   err("%s: unexpected RTAS error %d\n", __func__, rtas_rc);
> > +   rc = -ERANGE;
> > +   break;
> > +   }
> > +   return rc;
> > +}
> 
> These conversions look OK to me.

-- 
Mahesh J Salgaonkar


[PATCH RFC v1] drivers/base/node: consolidate node device subsystem initialization in node_dev_init()

2022-01-28 Thread David Hildenbrand
... and call node_dev_init() after memory_dev_init() from driver_init(),
so before any of the existing arch/subsys calls. All online nodes should
be known at that point.

This is in line with memory_dev_init(), which initializes the memory
device subsystem and creates all memory block devices.

Similar to memory_dev_init(), panic() if anything goes wrong; we don't
want to continue with such basic initialization errors.

The important part is that node_dev_init() gets called after
memory_dev_init() and after cpu_dev_init(), but before any of the
relevant archs call register_cpu() to register the new cpu device under
the node device. The latter should be the case for the current users
of topology_init().

Cc: Andrew Morton 
Cc: Greg Kroah-Hartman 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Thomas Bogendoerfer 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Paul Walmsley 
Cc: Palmer Dabbelt 
Cc: Albert Ou 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: "David S. Miller" 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: "Rafael J. Wysocki" 
Cc: x...@kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@kvack.org
Signed-off-by: David Hildenbrand 

---

RFC because I tested only on x86-64 and s390x, I think I cross-compiled all
applicable architectures except riscv and sparc.

This is somewhat a preparation for detecting if a memory block
(/sys/devices/system/memory/memory*) is managed by a single zone, and
storing the zone for the memory block -- to get rid of
test_pages_in_a_zone(). For that, we want to know all nodes that are
applicable for a single memory block (mem->nid), which is determined when
registering the node.

While this change might not be strictly required for that work, this
way it's easier to see when the nodes are getting created and
consequently when the node IDs for a memory block are determined.

---
 arch/arm64/kernel/setup.c   |  3 ---
 arch/ia64/kernel/topology.c | 10 --
 arch/mips/kernel/topology.c |  5 -
 arch/powerpc/kernel/sysfs.c | 17 -
 arch/riscv/kernel/setup.c   |  3 ---
 arch/s390/kernel/numa.c |  7 ---
 arch/sh/kernel/topology.c   |  5 -
 arch/sparc/kernel/sysfs.c   | 12 
 arch/x86/kernel/topology.c  |  5 -
 drivers/base/init.c |  1 +
 drivers/base/node.c | 30 +-
 include/linux/node.h|  4 
 12 files changed, 22 insertions(+), 80 deletions(-)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index f70573928f1b..3505789cf4bd 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -406,9 +406,6 @@ static int __init topology_init(void)
 {
int i;
 
-   for_each_online_node(i)
-   register_one_node(i);
-
for_each_possible_cpu(i) {
struct cpu *cpu = &per_cpu(cpu_data.cpu, i);
cpu->hotpluggable = cpu_can_disable(i);
diff --git a/arch/ia64/kernel/topology.c b/arch/ia64/kernel/topology.c
index e4992917a24b..94a848b06f15 100644
--- a/arch/ia64/kernel/topology.c
+++ b/arch/ia64/kernel/topology.c
@@ -70,16 +70,6 @@ static int __init topology_init(void)
 {
int i, err = 0;
 
-#ifdef CONFIG_NUMA
-   /*
-* MCD - Do we want to register all ONLINE nodes, or all POSSIBLE nodes?
-*/
-   for_each_online_node(i) {
-   if ((err = register_one_node(i)))
-   goto out;
-   }
-#endif
-
sysfs_cpus = kcalloc(NR_CPUS, sizeof(struct ia64_cpu), GFP_KERNEL);
if (!sysfs_cpus)
panic("kzalloc in topology_init failed - NR_CPUS too big?");
diff --git a/arch/mips/kernel/topology.c b/arch/mips/kernel/topology.c
index 08ad6371fbe0..9429d85a4703 100644
--- a/arch/mips/kernel/topology.c
+++ b/arch/mips/kernel/topology.c
@@ -12,11 +12,6 @@ static int __init topology_init(void)
 {
int i, ret;
 
-#ifdef CONFIG_NUMA
-   for_each_online_node(i)
-   register_one_node(i);
-#endif /* CONFIG_NUMA */
-
for_each_present_cpu(i) {
struct cpu *c = &per_cpu(cpu_devices, i);
 
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index d45a415d5374..2069bbb90a9a 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -1110,14 +1110,6 @@ EXPORT_SYMBOL_GPL(cpu_remove_dev_attr_group);
 /* NUMA stuff */
 
 #ifdef CONFIG_NUMA
-static void __init register_nodes(void)
-{
-   int i;
-
-   for (i = 0; i < MAX_NUMNODES; i++)
-   register_one_node(i);
-}
-
 int sysfs_add_device_to_node(struct device *dev, int nid)
 {
struct node *node = nod

Re: [PATCHv3] powerpc: mm: radix_tlb: rearrange the if-else block

2022-01-28 Thread Nathan Chancellor
On Fri, Jan 28, 2022 at 02:17:13PM +0100, Anders Roxell wrote:
> Clang warns:
> 
> arch/powerpc/mm/book3s64/radix_tlb.c:1191:23: error: variable 'hstart' is 
> uninitialized when used here [-Werror,-Wuninitialized]
> __tlbiel_va_range(hstart, hend, pid,
>   ^~
> arch/powerpc/mm/book3s64/radix_tlb.c:1175:23: note: initialize the variable 
> 'hstart' to silence this warning
> unsigned long hstart, hend;
> ^
>  = 0
> arch/powerpc/mm/book3s64/radix_tlb.c:1191:31: error: variable 'hend' is 
> uninitialized when used here [-Werror,-Wuninitialized]
> __tlbiel_va_range(hstart, hend, pid,
>   ^~~~
> arch/powerpc/mm/book3s64/radix_tlb.c:1175:29: note: initialize the variable 
> 'hend' to silence this warning
> unsigned long hstart, hend;
>   ^
>= 0
> 2 errors generated.
> 
> Rework the 'if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))' so hstart/hend
> always get initialized; this will silence the warnings. That will also
> simplify the 'else' path. Clang is getting confused here, but the
> warnings are false positives.
> 
> Suggested-by: Arnd Bergmann 
> Suggested-by: Nathan Chancellor 
> Reviewed-by: Christophe Leroy 
> Signed-off-by: Anders Roxell 

Reviewed-by: Nathan Chancellor 

> ---
>  arch/powerpc/mm/book3s64/radix_tlb.c | 11 ---
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
> b/arch/powerpc/mm/book3s64/radix_tlb.c
> index 7724af19ed7e..5172d5cec2c0 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -1171,15 +1171,12 @@ static inline void __radix__flush_tlb_range(struct 
> mm_struct *mm,
>   }
>   }
>   } else {
> - bool hflush = false;
> + bool hflush;
>   unsigned long hstart, hend;
>  
> - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> - hstart = (start + PMD_SIZE - 1) & PMD_MASK;
> - hend = end & PMD_MASK;
> - if (hstart < hend)
> - hflush = true;
> - }
> + hstart = (start + PMD_SIZE - 1) & PMD_MASK;
> + hend = end & PMD_MASK;
> + hflush = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hstart < 
> hend;
>  
>   if (type == FLUSH_TYPE_LOCAL) {
>   asm volatile("ptesync": : :"memory");
> -- 
> 2.34.1
> 


Re: [PATCH v2 4/5] modules: Add CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC

2022-01-28 Thread Daniel Thompson
On Thu, Jan 27, 2022 at 11:28:09AM +, Christophe Leroy wrote:
> Add CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC to allow architectures
> to request having modules data in vmalloc area instead of module area.
> 
> This is required on powerpc book3s/32 in order to set module data
> non-executable, because it is not possible to set executability on a
> per-page basis; it is done per 256 MB segment. The module area has exec
> rights, the vmalloc area has noexec.
> 
> This can also be useful on other powerpc/32 in order to maximize the
> chance of code being close enough to kernel core to avoid branch
> trampolines.
> 
> Signed-off-by: Christophe Leroy 
> Cc: Jason Wessel 
> Cc: Daniel Thompson 
> Cc: Douglas Anderson 

Thanks for your diligence in making sure kdb is up to date!

Acked-by: Daniel Thompson 


Daniel.


Re: [BUG] mtd: cfi_cmdset_0002: write regression since v4.17-rc1

2022-01-28 Thread Ahmad Fatoum
Hello Tokunori-san,

On 15.12.21 18:34, Tokunori Ikegami wrote:
> Hi Ahmad-san,

Thanks for your reply (and Thorsten for the reminder) and sorry for
the delay. I had a lot of backlog after my time off.

> Sorry for the regression caused by the change dfeae1073583.
> To make sure, could you please try with word writes instead of buffered
> writes?

The issue is still there with #define FORCE_WORD_WRITE 1:

  jffs2: Write clean marker to block at 0x000a failed: -5
  MTD do_write_oneword_once(): software timeout

> FYI: There are some changes to disable the buffered writes as below.
>   1. 
> https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=target/linux/ar71xx/patches-4.9/411-mtd-cfi_cmdset_0002-force-word-write.patch;h=ddd69f17e1ac16e8fc3a694c56231fee1e2ef149;hb=fec8fe806963c96a6506c2aebc3572d3a11f285f
>   2. 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/mtd/chips/cfi_cmdset_0002.c?h=v5.16-rc5&id=7e4404113686868858a34210c28ae122e967aa64
> 
> Note:
>   Currently I am not able to investigate the issue on the product for the 
> change before.
> 
>   By the way, in the past I investigated a similar issue on a Buffalo 
> WZR-HP-G300NH using the S29GL256N.
>   I was not able to find the root cause in that investigation, since it 
> was not actually required at the time.
>   Also, the buffered writes were disabled in the OpenWrt firmware by 
> change [1] above.
>   But I am not sure of the detailed reason for disabling buffered writes 
> in the OpenWrt firmware.
>   I thought the issue was not caused by change dfeae1073583, since the 
> issue happened without that change.
> 
>   So I am not sure why the change [2] above was needed to disable buffered 
> writes on the Buffalo WZR-HP-G300NH.
>   It probably is necessary to disable buffered writes on other firmware 
> as well, just not on the OpenWrt firmware.
> 
>   Anyway, there are differences from your regression issue, as below:
>     1. Flash device: S29GL064N (your regression issue) vs. S29GL256N 
> (WZR-HP-G300NH)
>     2. Regression issue: yes (your issue) vs. no (WZR-HP-G300NH, as I 
> investigated before)

Doesn't seem to be a buffered write issue here though as the writes
did work fine before dfeae1073583. Any other ideas?

Cheers,
Ahmad

> 
> Regards,
> Ikegami
> 
> On 2021/12/14 16:23, Thorsten Leemhuis wrote:
>> [TLDR: adding this regression to regzbot; most of this mail is compiled
>> from a few templates paragraphs some of you might have seen already.]
>>
>> Hi, this is your Linux kernel regression tracker speaking.
>>
>> Top-posting for once, to make this easily accessible to everyone.
>>
>> Thanks for the report.
>>
>> Adding the regression mailing list to the list of recipients, as it
>> should be in the loop for all regressions, as explained here:
>> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
>>
>> To be sure this issue doesn't fall through the cracks unnoticed, I'm
>> adding it to regzbot, my Linux kernel regression tracking bot:
>>
>> #regzbot ^introduced dfeae1073583
>> #regzbot title mtd: cfi_cmdset_0002: flash write accesses on the
>> hardware fail on a PowerPC MPC8313 to a 8-bit-parallel S29GL064N flash
>> #regzbot ignore-activity
>>
>> Reminder: when fixing the issue, please add a 'Link:' tag with the URL
>> to the report (the parent of this mail), then regzbot will automatically
>> mark the regression as resolved once the fix lands in the appropriate
>> tree. For more details about regzbot see footer.
>>
>> Sending this to everyone that got the initial report, to make all aware
>> of the tracking. I also hope that messages like this motivate people to
>> directly get at least the regression mailing list and ideally even
>> regzbot involved when dealing with regressions, as messages like this
>> wouldn't be needed then.
>>
>> Don't worry, I'll send further messages wrt to this regression just to
>> the lists (with a tag in the subject so people can filter them away), as
>> long as they are intended just for regzbot. With a bit of luck no such
>> messages will be needed anyway.
>>
>> Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat).
>>
>> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
>> on my table. I can only look briefly into most of them. Unfortunately
>> therefore I sometimes will get things wrong or miss something important.
>> I hope that's not the case here; if you think it is, don't hesitate to
>> tell me about it in a public reply. That's in everyone's interest, as
>> what I wrote above might be misleading to everyone reading this; any
>> suggestion I gave thus might sent someone reading this down the wrong
>> rabbit hole, which none of us wants.
>>
>> BTW, I have no personal interest in this issue, which is tracked using
>> regzbot, my Linux kernel regression tracking bot
>> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
>> this mail to get things rolling again and hence don't need to be CC

[PATCHv3] powerpc: mm: radix_tlb: rearrange the if-else block

2022-01-28 Thread Anders Roxell
Clang warns:

arch/powerpc/mm/book3s64/radix_tlb.c:1191:23: error: variable 'hstart' is 
uninitialized when used here [-Werror,-Wuninitialized]
__tlbiel_va_range(hstart, hend, pid,
  ^~
arch/powerpc/mm/book3s64/radix_tlb.c:1175:23: note: initialize the variable 
'hstart' to silence this warning
unsigned long hstart, hend;
^
 = 0
arch/powerpc/mm/book3s64/radix_tlb.c:1191:31: error: variable 'hend' is 
uninitialized when used here [-Werror,-Wuninitialized]
__tlbiel_va_range(hstart, hend, pid,
  ^~~~
arch/powerpc/mm/book3s64/radix_tlb.c:1175:29: note: initialize the variable 
'hend' to silence this warning
unsigned long hstart, hend;
  ^
   = 0
2 errors generated.

Rework the 'if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))' so hstart/hend
always get initialized; this will silence the warnings. That will also
simplify the 'else' path. Clang is getting confused here, but the
warnings are false positives.

Suggested-by: Arnd Bergmann 
Suggested-by: Nathan Chancellor 
Reviewed-by: Christophe Leroy 
Signed-off-by: Anders Roxell 
---
 arch/powerpc/mm/book3s64/radix_tlb.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index 7724af19ed7e..5172d5cec2c0 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1171,15 +1171,12 @@ static inline void __radix__flush_tlb_range(struct 
mm_struct *mm,
}
}
} else {
-   bool hflush = false;
+   bool hflush;
unsigned long hstart, hend;
 
-   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-   hstart = (start + PMD_SIZE - 1) & PMD_MASK;
-   hend = end & PMD_MASK;
-   if (hstart < hend)
-   hflush = true;
-   }
+   hstart = (start + PMD_SIZE - 1) & PMD_MASK;
+   hend = end & PMD_MASK;
+   hflush = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hstart < 
hend;
 
if (type == FLUSH_TYPE_LOCAL) {
asm volatile("ptesync": : :"memory");
-- 
2.34.1



Re: [PATCHv2] powerpc: mm: radix_tlb: rearrange the if-else block

2022-01-28 Thread Anders Roxell
On Fri, 28 Jan 2022 at 11:14, Christophe Leroy
 wrote:
>
>
>
> Le 28/01/2022 à 11:08, Anders Roxell a écrit :
> > Clang warns:
> >
> > arch/powerpc/mm/book3s64/radix_tlb.c:1191:23: error: variable 'hstart' is 
> > uninitialized when used here [-Werror,-Wuninitialized]
> >  __tlbiel_va_range(hstart, hend, pid,
> >^~
> > arch/powerpc/mm/book3s64/radix_tlb.c:1175:23: note: initialize the variable 
> > 'hstart' to silence this warning
> >  unsigned long hstart, hend;
> >  ^
> >   = 0
> > arch/powerpc/mm/book3s64/radix_tlb.c:1191:31: error: variable 'hend' is 
> > uninitialized when used here [-Werror,-Wuninitialized]
> >  __tlbiel_va_range(hstart, hend, pid,
> >^~~~
> > arch/powerpc/mm/book3s64/radix_tlb.c:1175:29: note: initialize the variable 
> > 'hend' to silence this warning
> >  unsigned long hstart, hend;
> >^
> > = 0
> > 2 errors generated.
> >
> > Rework the 'if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))' so hstart/hend
> > always get initialized; this will silence the warnings. That will also
> > simplify the 'else' path. Clang is getting confused here, but the
> > warnings are false positives.
> >
> > Suggested-by: Arnd Bergmann 
> > Suggested-by: Nathan Chancellor 
> > Signed-off-by: Anders Roxell 
> > ---
> >   arch/powerpc/mm/book3s64/radix_tlb.c | 9 +++--
> >   1 file changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
> > b/arch/powerpc/mm/book3s64/radix_tlb.c
> > index 7724af19ed7e..7d65965a0688 100644
> > --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> > +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> > @@ -1174,12 +1174,9 @@ static inline void __radix__flush_tlb_range(struct 
> > mm_struct *mm,
> >   bool hflush = false;
>
> You should then remove the default initialisation of hflush to false
> which has become pointless.
>
> With that fixed,
>
> Reviewed-by: Christophe Leroy 

Thank you for the review.
I will send a v3 shortly with that fixed.

Cheers,
Anders


Re: [PATCH 2/2] powerpc/uprobes: Reject uprobe on a system call instruction

2022-01-28 Thread Naveen N. Rao

On 2022-01-27 13:14, Nicholas Piggin wrote:

Excerpts from Michael Ellerman's message of January 25, 2022 9:45 pm:

Nicholas Piggin  writes:

Per the ISA, a Trace interrupt is not generated for a system call
[vectored] instruction. Reject uprobes on such instructions as we are
not emulating a system call [vectored] instruction anymore.


This should really be patch 1, otherwise there's a single commit window
where we allow uprobes on sc but don't honour them.


Yep true. I also messed up Naveen's attribution! Will re-send (or maybe
Naveen would take over the series).


Yes, let me come up with a better, more complete patch for this.






Signed-off-by: Naveen N. Rao 
[np: Switch to pr_info_ratelimited]
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/ppc-opcode.h | 1 +
 arch/powerpc/kernel/uprobes.c | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h

index 9675303b724e..8bbe16ce5173 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -411,6 +411,7 @@
 #define PPC_RAW_DCBFPS(a, b)		(0x7cac | ___PPC_RA(a) | 
___PPC_RB(b) | (4 << 21))
 #define PPC_RAW_DCBSTPS(a, b)		(0x7cac | ___PPC_RA(a) | 
___PPC_RB(b) | (6 << 21))

 #define PPC_RAW_SC()   (0x4402)
+#define PPC_RAW_SCV()  (0x4401)
 #define PPC_RAW_SYNC() (0x7c0004ac)
 #define PPC_RAW_ISYNC()(0x4c00012c)

diff --git a/arch/powerpc/kernel/uprobes.c 
b/arch/powerpc/kernel/uprobes.c

index c6975467d9ff..3779fde804bd 100644
--- a/arch/powerpc/kernel/uprobes.c
+++ b/arch/powerpc/kernel/uprobes.c
@@ -41,6 +41,12 @@ int arch_uprobe_analyze_insn(struct arch_uprobe 
*auprobe,

if (addr & 0x03)
return -EINVAL;

+   if (ppc_inst_val(ppc_inst_read(auprobe->insn)) == PPC_RAW_SC() ||
+   ppc_inst_val(ppc_inst_read(auprobe->insn)) == PPC_RAW_SCV()) {


We should probably reject hypercall too?

There's also a lot of reserved fields in `sc`, so doing an exact match
like this risks missing instructions that are badly formed but the CPU
will happily execute as `sc`.


Yeah, scv as well has lev != 0 unsupported so should be excluded.


We'd obviously never expect to see those in compiler generated code, but
it'd still be safer to mask. We could probably just reject opcode 17
entirely.


Indeed, thanks.



And I guess for a subsequent patch, but we should be rejecting some
others here as well shouldn't we? Like rfid etc.


Traps under discussion I guess. For uprobe, rfid will be just another
privilege fault. Is that dealt with somehow or do all privileged and
illegal instructions also need to be excluded from stepping? (I assume
we must handle that in a general way somehow)


Yes, this is all handled in our interrupt code if we emulate any of those
privileged instructions. Otherwise, if a signal is generated, that would
be caught by uprobe_deny_signal().


Thanks,
Naveen


Re: [PATCH 0/2] powerpc: Disable syscall emulation and stepping

2022-01-28 Thread Naveen N. Rao

On 2022-01-27 13:09, Nicholas Piggin wrote:

Excerpts from naverao1's message of January 25, 2022 8:48 pm:

On 2022-01-25 11:23, Christophe Leroy wrote:

Le 25/01/2022 à 04:04, Nicholas Piggin a écrit :

+Naveen (sorry missed cc'ing you at first)

Excerpts from Christophe Leroy's message of January 24, 2022 4:39 pm:



Le 24/01/2022 à 06:57, Nicholas Piggin a écrit :

As discussed previously

https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/238946.html

I'm wondering whether PPC32 should be returning -1 for syscall
instructions too here? That could be done in another patch anyway.



The 'Programming Environments Manual for 32-Bit Implementations of the
PowerPC™ Architecture' says:

The following are not traced:
• rfi instruction
• sc and trap instructions that trap
• Other instructions that cause interrupts (other than trace
interrupts)
• The first instruction of any interrupt handler
• Instructions that are emulated by software


So I think PPC32 should return -1 as well.


I agree.

What about the trap instructions? analyse_instr returns 0 for them
which falls through to return 0 for emulate_step, should they
return -1 as well or am I missing something?


Yeah, good point about the trap instructions.





For the traps I don't know. The manual says "trap instructions that
trap" are not traced. It means that "trap instructions that _don't_
trap" are traced. Taking into account that trap instructions don't trap
at least 99.9% of the time, not sure if returning -1 is needed.

Although that'd probably be the safest.


'trap' is a special case since it is predominantly used by debuggers
and/or tracing infrastructure. Kprobes and Uprobes do not allow probes
on a trap instruction. But, xmon can be asked to step on a trap
instruction and that can interfere with kprobes in weird ways.

So, I think it is best if we also exclude trap instructions from being
single stepped.



But then what happens with other instructions that will sparsely generate
an exception, like a DSI or so? If we do it for the traps then we should
do it for these as well, and then it becomes a never-ending story.


For a DSI, we restart the same instruction after handling the page
fault.
The single step exception is raised on the subsequent successful
completion of the instruction.


Although it can cause a signal, and the signal handler can decide
to resume somewhere else.


If a signal is generated while we are single-stepping, we delay signal
delivery (see uprobe_deny_signal()) until after the single stepping.
For fatal signals, single stepping is disabled before we allow the
signal to be delivered.


Or, the kernel-mode equivalent: it can go to a fixup handler and resume
somewhere else.


For kprobes, we do not allow probing instructions that have an extable
entry.

- Naveen


Re: [PATCH 0/2] powerpc: Disable syscall emulation and stepping

2022-01-28 Thread Naveen N. Rao
[Sorry if you receive this in duplicate. Resending since this message didn't hit the list]



On 2022-01-25 11:23, Christophe Leroy wrote:

Le 25/01/2022 à 04:04, Nicholas Piggin a écrit :

+Naveen (sorry missed cc'ing you at first)

Excerpts from Christophe Leroy's message of January 24, 2022 4:39 pm:



Le 24/01/2022 à 06:57, Nicholas Piggin a écrit :

As discussed previously

https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/238946.html

I'm wondering whether PPC32 should be returning -1 for syscall
instructions too here? That could be done in another patch anyway.



The 'Programming Environments Manual for 32-Bit Implementations of the
PowerPC™ Architecture' says:

The following are not traced:
• rfi instruction
• sc and trap instructions that trap
• Other instructions that cause interrupts (other than trace interrupts)
• The first instruction of any interrupt handler
• Instructions that are emulated by software


So I think PPC32 should return -1 as well.


I agree.

What about the trap instructions? analyse_instr returns 0 for them
which falls through to return 0 for emulate_step, should they
return -1 as well or am I missing something?


Yeah, good point about the trap instructions.





For the traps I don't know. The manual says "trap instructions that
trap" are not traced. It means that "trap instructions that _don't_
trap" are traced. Taking into account that trap instructions don't trap
at least 99.9% of the time, not sure if returning -1 is needed.

Although that'd probably be the safest.


'trap' is a special case since it is predominantly used by debuggers
and/or tracing infrastructure. Kprobes and Uprobes do not allow probes
on a trap instruction. But, xmon can be asked to step on a trap
instruction and that can interfere with kprobes in weird ways.

So, I think it is best if we also exclude trap instructions from being
single stepped.



But then what happens with other instructions that will sparsely generate
an exception, like a DSI or so? If we do it for the traps then we should
do it for these as well, and then it becomes a never-ending story.


For a DSI, we restart the same instruction after handling the page fault.
The single step exception is raised on the subsequent successful
completion of the instruction. For most other interrupts (alignment, vsx
unavailable, ...), we end up emulating the single step exception itself
(see emulate_single_step()). So, those are ok if caused by an instruction
being stepped.


- Naveen


[PATCH V3 1/2] mm/migration: Add trace events for THP migrations

2022-01-28 Thread Anshuman Khandual
This adds two trace events for PMD-based THP migration without split. These
events closely follow the implementation details, such as the setting and
removal of PMD migration entries, which are essential operations for THP
migration. CREATE_TRACE_POINTS is moved from powerpc into generic THP code
so that the new trace events are available on other platforms as well.

Cc: Steven Rostedt 
Cc: Ingo Molnar 
Cc: Andrew Morton 
Cc: Zi Yan 
Cc: Naoya Horiguchi 
Cc: John Hubbard 
Cc: Matthew Wilcox 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/mm/book3s64/trace.c |  1 -
 include/trace/events/thp.h   | 27 +++
 mm/huge_memory.c |  5 +
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/trace.c b/arch/powerpc/mm/book3s64/trace.c
index b86e7b906257..ccd64b5e6cac 100644
--- a/arch/powerpc/mm/book3s64/trace.c
+++ b/arch/powerpc/mm/book3s64/trace.c
@@ -3,6 +3,5 @@
  * This file is for defining trace points and trace related helpers.
  */
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define CREATE_TRACE_POINTS
 #include <trace/events/thp.h>
 #endif
diff --git a/include/trace/events/thp.h b/include/trace/events/thp.h
index ca3f2767828a..202b3e3e67ff 100644
--- a/include/trace/events/thp.h
+++ b/include/trace/events/thp.h
@@ -48,6 +48,33 @@ TRACE_EVENT(hugepage_update,
	TP_printk("hugepage update at addr 0x%lx and pte = 0x%lx clr = 0x%lx, set = 0x%lx", __entry->addr, __entry->pte, __entry->clr, __entry->set)
 );
 
+DECLARE_EVENT_CLASS(migration_pmd,
+
+   TP_PROTO(unsigned long addr, unsigned long pmd),
+
+   TP_ARGS(addr, pmd),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, addr)
+   __field(unsigned long, pmd)
+   ),
+
+   TP_fast_assign(
+   __entry->addr = addr;
+   __entry->pmd = pmd;
+   ),
+   TP_printk("addr=%lx, pmd=%lx", __entry->addr, __entry->pmd)
+);
+
+DEFINE_EVENT(migration_pmd, set_migration_pmd,
+   TP_PROTO(unsigned long addr, unsigned long pmd),
+   TP_ARGS(addr, pmd)
+);
+
+DEFINE_EVENT(migration_pmd, remove_migration_pmd,
+   TP_PROTO(unsigned long addr, unsigned long pmd),
+   TP_ARGS(addr, pmd)
+);
 #endif /* _TRACE_THP_H */
 
 /* This part must be outside protection */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..ab49f9a3e420 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -39,6 +39,9 @@
 #include 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/thp.h>
+
 /*
  * By default, transparent hugepage support is disabled in order to avoid
  * risking an increased memory footprint for applications that are not
@@ -3173,6 +3176,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
set_pmd_at(mm, address, pvmw->pmd, pmdswp);
page_remove_rmap(page, true);
put_page(page);
+   trace_set_migration_pmd(address, pmd_val(pmdswp));
 }
 
 void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
@@ -3206,5 +3210,6 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
mlock_vma_page(new);
update_mmu_cache_pmd(vma, address, pvmw->pmd);
+   trace_remove_migration_pmd(address, pmd_val(pmde));
 }
 #endif
-- 
2.25.1



Re: [PATCHv2] powerpc: mm: radix_tlb: rearrange the if-else block

2022-01-28 Thread Christophe Leroy


Le 28/01/2022 à 11:08, Anders Roxell a écrit :
> Clang warns:
> 
> arch/powerpc/mm/book3s64/radix_tlb.c:1191:23: error: variable 'hstart' is uninitialized when used here [-Werror,-Wuninitialized]
>  __tlbiel_va_range(hstart, hend, pid,
>^~
> arch/powerpc/mm/book3s64/radix_tlb.c:1175:23: note: initialize the variable 'hstart' to silence this warning
>  unsigned long hstart, hend;
>  ^
>   = 0
> arch/powerpc/mm/book3s64/radix_tlb.c:1191:31: error: variable 'hend' is uninitialized when used here [-Werror,-Wuninitialized]
>  __tlbiel_va_range(hstart, hend, pid,
>^~~~
> arch/powerpc/mm/book3s64/radix_tlb.c:1175:29: note: initialize the variable 'hend' to silence this warning
>  unsigned long hstart, hend;
>^
> = 0
> 2 errors generated.
> 
> Rework the 'if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))' block so that
> hstart/hend always get initialized; this silences the warnings and also
> simplifies the 'else' path. The warnings are false positives, but Clang
> gets confused by them.
> 
> Suggested-by: Arnd Bergmann 
> Suggested-by: Nathan Chancellor 
> Signed-off-by: Anders Roxell 
> ---
>   arch/powerpc/mm/book3s64/radix_tlb.c | 9 +++--
>   1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index 7724af19ed7e..7d65965a0688 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -1174,12 +1174,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>   bool hflush = false;

You should then remove the default initialisation of hflush to false,
which has become pointless.

With that fixed,

Reviewed-by: Christophe Leroy 


>   unsigned long hstart, hend;
>   
> - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> - hstart = (start + PMD_SIZE - 1) & PMD_MASK;
> - hend = end & PMD_MASK;
> - if (hstart < hend)
> - hflush = true;
> - }
> + hstart = (start + PMD_SIZE - 1) & PMD_MASK;
> + hend = end & PMD_MASK;
> +	hflush = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hstart < hend;
>   
>   if (type == FLUSH_TYPE_LOCAL) {
>   asm volatile("ptesync": : :"memory");

[PATCHv2] powerpc: mm: radix_tlb: rearrange the if-else block

2022-01-28 Thread Anders Roxell
Clang warns:

arch/powerpc/mm/book3s64/radix_tlb.c:1191:23: error: variable 'hstart' is uninitialized when used here [-Werror,-Wuninitialized]
__tlbiel_va_range(hstart, hend, pid,
  ^~
arch/powerpc/mm/book3s64/radix_tlb.c:1175:23: note: initialize the variable 'hstart' to silence this warning
unsigned long hstart, hend;
^
 = 0
arch/powerpc/mm/book3s64/radix_tlb.c:1191:31: error: variable 'hend' is uninitialized when used here [-Werror,-Wuninitialized]
__tlbiel_va_range(hstart, hend, pid,
  ^~~~
arch/powerpc/mm/book3s64/radix_tlb.c:1175:29: note: initialize the variable 'hend' to silence this warning
unsigned long hstart, hend;
  ^
   = 0
2 errors generated.

Rework the 'if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))' block so that
hstart/hend always get initialized; this silences the warnings and also
simplifies the 'else' path. The warnings are false positives, but Clang
gets confused by them.

Suggested-by: Arnd Bergmann 
Suggested-by: Nathan Chancellor 
Signed-off-by: Anders Roxell 
---
 arch/powerpc/mm/book3s64/radix_tlb.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 7724af19ed7e..7d65965a0688 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1174,12 +1174,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
bool hflush = false;
unsigned long hstart, hend;
 
-   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-   hstart = (start + PMD_SIZE - 1) & PMD_MASK;
-   hend = end & PMD_MASK;
-   if (hstart < hend)
-   hflush = true;
-   }
+   hstart = (start + PMD_SIZE - 1) & PMD_MASK;
+   hend = end & PMD_MASK;
+	hflush = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hstart < hend;
 
if (type == FLUSH_TYPE_LOCAL) {
asm volatile("ptesync": : :"memory");
-- 
2.34.1



powerpc: Set crashkernel offset to mid of RMA region

2022-01-28 Thread Sourabh Jain
On large config LPARs (having 192 and more cores), Linux fails to boot
due to insufficient memory in the first memblock. This is caused by the
memory reservation for the crash kernel, which starts at a 128MB offset
into the first memblock and doesn't leave enough space there to
accommodate other essential system resources.

The crash kernel start address was set to a 128MB offset by default to
ensure that the crash kernel gets some memory below the RMA region, which
used to be 256MB in size. But given that the RMA region size can be 512MB
or more, setting the crash kernel offset to the middle of the RMA region
will leave enough space for the kernel to allocate memory for other
system resources.

Since the above crash kernel offset change only applies to the LPAR
platform, LPAR feature detection is moved ahead of the crash kernel
reservation. The rest of the LPAR-specific initialization is still
done in pseries_probe_fw_features as usual.

Signed-off-by: Sourabh Jain 
Reported-and-tested-by: Abdul haleem 

---
 arch/powerpc/kernel/rtas.c |  4 
 arch/powerpc/kexec/core.c  | 15 +++
 2 files changed, 15 insertions(+), 4 deletions(-)

 ---
 Change in v3:
Dropped 1st and 2nd patch from v2. 1st and 2nd patch from v2 patch
series [1] try to discover 1T segment MMU feature support
BEFORE boot CPU paca allocation ([1] describes why it is needed).
MPE has posted a patch [2] that achieves a similar objective by moving
boot CPU paca allocation after mmu_early_init_devtree().

NOTE: This patch is dependent on the patch [2].

[1] 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20211018084434.217772-3-sourabhj...@linux.ibm.com/
[2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/239175.html
 ---

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 733e6ef36758..06df7464fb57 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1313,6 +1313,10 @@ int __init early_init_dt_scan_rtas(unsigned long node,
entryp = of_get_flat_dt_prop(node, "linux,rtas-entry", NULL);
sizep  = of_get_flat_dt_prop(node, "rtas-size", NULL);
 
+   /* need this feature to decide the crashkernel offset */
+   if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL))
+   powerpc_firmware_features |= FW_FEATURE_LPAR;
+
if (basep && entryp && sizep) {
rtas.base = *basep;
rtas.entry = *entryp;
diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 8b68d9f91a03..abf5897ae88c 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -134,11 +134,18 @@ void __init reserve_crashkernel(void)
if (!crashk_res.start) {
 #ifdef CONFIG_PPC64
/*
-* On 64bit we split the RMO in half but cap it at half of
-* a small SLB (128MB) since the crash kernel needs to place
-* itself and some stacks to be in the first segment.
+* On the LPAR platform, place the crash kernel at the middle of
+* the RMA region (512MB or more) to ensure the crash kernel
+* gets enough space to place itself and some stack in the
+* first segment. At the same time, the normal kernel also
+* gets enough space to allocate memory for essential system
+* resources in the first segment. The crash kernel still
+* starts at the 128MB offset on other platforms.
 */
-   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   crashk_res.start = ppc64_rma_size / 2;
+   else
+   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
 #else
crashk_res.start = KDUMP_KERNELBASE;
 #endif
-- 
2.34.1