Re: [PATCH kernel v2] KVM: PPC: Merge powerpc's debugfs entry content into generic entry

2021-09-08 Thread Alexey Kardashevskiy

[hopefully fixed my thunderbird now]

Huh, not sure anymore after reading d56f5136b0102 "KVM: let
kvm_destroy_vm_debugfs clean up vCPU debugfs directories" which removed
debugfs_dentry from vcpu. Paolo?


On 05/09/2021 12:27, Alexey Kardashevskiy wrote:

Please ignore this one, v3 is coming.

After I posted this, I suddenly realized that the vcpu debugfs entries
remain until the VM exits and this does not handle vcpu
hotunplug+hotplug (the ppc book3e did handle this). Thanks,


On 04/09/2021 23:35, Alexey Kardashevskiy wrote:

At the moment the generic KVM code creates an "%pid-%fd" entry per KVM
instance, and the PPC HV KVM creates its own at "vm%pid". The Book3E KVM
creates its own entry for timings.

The problems with the PPC entries are:
1. they do not allow multiple VMs in the same process (an extremely
rare case, mostly hit by the syzkaller fuzzer);
2. they are prone to race bugs like the one the generic KVM code fixed in
commit 85cd39af14f4 ("KVM: Do not leak memory for duplicate debugfs
directories").

This defines kvm_arch_create_kvm_debugfs(), similar to the one for vcpus.

This defines 2 hooks in kvmppc_ops to allow specific KVM
implementations to add the necessary entries, defines handlers
for HV KVM, and wires up the Book3E debugfs vcpu helper as a handler.

This makes use of the already existing kvm_arch_create_vcpu_debugfs
on PPC.

This removes the no longer used debugfs_dir pointers from the PPC kvm_arch structs.

Suggested-by: Fabiano Rosas 
Signed-off-by: Alexey Kardashevskiy 
[...]

[PATCH kernel v2] KVM: PPC: Merge powerpc's debugfs entry content into generic entry

2021-09-04 Thread Alexey Kardashevskiy
At the moment the generic KVM code creates an "%pid-%fd" entry per KVM
instance, and the PPC HV KVM creates its own at "vm%pid". The Book3E KVM
creates its own entry for timings.

The problems with the PPC entries are:
1. they do not allow multiple VMs in the same process (an extremely
rare case, mostly hit by the syzkaller fuzzer);
2. they are prone to race bugs like the one the generic KVM code fixed in
commit 85cd39af14f4 ("KVM: Do not leak memory for duplicate debugfs
directories").

This defines kvm_arch_create_kvm_debugfs(), similar to the one for vcpus.

This defines 2 hooks in kvmppc_ops to allow specific KVM
implementations to add the necessary entries, defines handlers
for HV KVM, and wires up the Book3E debugfs vcpu helper as a handler.

This makes use of the already existing kvm_arch_create_vcpu_debugfs
on PPC.

This removes the no longer used debugfs_dir pointers from the PPC kvm_arch structs.

Suggested-by: Fabiano Rosas 
Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* handled powerpc-booke
* s/kvm/vm/ in arch hooks
---
 arch/powerpc/include/asm/kvm_host.h|  7 +++---
 arch/powerpc/include/asm/kvm_ppc.h |  2 ++
 arch/powerpc/kvm/timing.h  |  7 +++---
 include/linux/kvm_host.h   |  3 +++
 arch/powerpc/kvm/book3s_64_mmu_hv.c|  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
 arch/powerpc/kvm/book3s_hv.c   | 30 +-
 arch/powerpc/kvm/e500.c|  1 +
 arch/powerpc/kvm/e500mc.c  |  1 +
 arch/powerpc/kvm/powerpc.c | 15 ++---
 arch/powerpc/kvm/timing.c  | 20 -
 virt/kvm/kvm_main.c|  3 +++
 12 files changed, 44 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 2bcac6da0a4b..f29b66cc2163 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -296,7 +296,6 @@ struct kvm_arch {
bool dawr1_enabled;
pgd_t *pgtable;
u64 process_table;
-   struct dentry *debugfs_dir;
struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
@@ -672,7 +671,6 @@ struct kvm_vcpu_arch {
u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
u64 timing_last_exit;
-   struct dentry *debugfs_exit_timing;
 #endif
 
 #ifdef CONFIG_PPC_BOOK3S
@@ -828,8 +826,6 @@ struct kvm_vcpu_arch {
struct kvmhv_tb_accumulator rm_exit;/* real-mode exit code */
struct kvmhv_tb_accumulator guest_time; /* guest execution */
struct kvmhv_tb_accumulator cede_time;  /* time napping inside guest */
-
-   struct dentry *debugfs_dir;
 #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
 };
 
@@ -868,4 +864,7 @@ static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu 
*vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
 
+#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_HAVE_ARCH_KVM_DEBUGFS
+
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 6355a6980ccf..fd841e844b90 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -316,6 +316,8 @@ struct kvmppc_ops {
int (*svm_off)(struct kvm *kvm);
int (*enable_dawr1)(struct kvm *kvm);
bool (*hash_v3_possible)(void);
+   void (*create_vm_debugfs)(struct kvm *kvm);
+   void (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry 
*debugfs_dentry);
 };
 
 extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/kvm/timing.h b/arch/powerpc/kvm/timing.h
index feef7885ba82..36f7c201c6f1 100644
--- a/arch/powerpc/kvm/timing.h
+++ b/arch/powerpc/kvm/timing.h
@@ -14,8 +14,8 @@
 #ifdef CONFIG_KVM_EXIT_TIMING
 void kvmppc_init_timing_stats(struct kvm_vcpu *vcpu);
 void kvmppc_update_timing_stats(struct kvm_vcpu *vcpu);
-void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id);
-void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu);
+void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu,
+   struct dentry *debugfs_dentry);
 
 static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type)
 {
@@ -27,8 +27,7 @@ static inline void kvmppc_set_exit_type(struct kvm_vcpu 
*vcpu, int type)
 static inline void kvmppc_init_timing_stats(struct kvm_vcpu *vcpu) {}
 static inline void kvmppc_update_timing_stats(struct kvm_vcpu *vcpu) {}
 static inline void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu,
-   unsigned int id) {}
-static inline void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu) {}
+ struct dentry *debug

[PATCH kernel] KVM: PPC: Book3S: Merge powerpc's debugfs entry content into generic entry

2021-09-02 Thread Alexey Kardashevskiy
At the moment the generic KVM code creates an "%pid-%fd" entry per KVM
instance, and the PPC HV KVM creates its own at "vm%pid".

The problems with the PPC entries are:
1. they do not allow multiple VMs in the same process (an extremely
rare case, mostly hit by the syzkaller fuzzer);
2. they are prone to race bugs like the one the generic KVM code fixed in
commit 85cd39af14f4 ("KVM: Do not leak memory for duplicate debugfs
directories").

This defines kvm_arch_create_kvm_debugfs(), similar to the one for vcpus.

This defines 2 hooks in kvmppc_ops to allow specific KVM
implementations to add the necessary entries.

This makes use of the already existing kvm_arch_create_vcpu_debugfs.

This removes the no longer used debugfs_dir pointers from the PPC kvm_arch structs.

Suggested-by: Fabiano Rosas 
Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/kvm_host.h|  6 +++---
 arch/powerpc/include/asm/kvm_ppc.h |  2 ++
 include/linux/kvm_host.h   |  3 +++
 arch/powerpc/kvm/book3s_64_mmu_hv.c|  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
 arch/powerpc/kvm/book3s_hv.c   | 30 +-
 arch/powerpc/kvm/powerpc.c | 12 +++
 virt/kvm/kvm_main.c|  3 +++
 8 files changed, 35 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 2bcac6da0a4b..e4f2feb67b53 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -296,7 +296,6 @@ struct kvm_arch {
bool dawr1_enabled;
pgd_t *pgtable;
u64 process_table;
-   struct dentry *debugfs_dir;
struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
@@ -828,8 +827,6 @@ struct kvm_vcpu_arch {
struct kvmhv_tb_accumulator rm_exit;/* real-mode exit code */
struct kvmhv_tb_accumulator guest_time; /* guest execution */
struct kvmhv_tb_accumulator cede_time;  /* time napping inside guest */
-
-   struct dentry *debugfs_dir;
 #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
 };
 
@@ -868,4 +865,7 @@ static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu 
*vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
 
+#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_HAVE_ARCH_KVM_DEBUGFS
+
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 6355a6980ccf..8b3f7f6e3f12 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -316,6 +316,8 @@ struct kvmppc_ops {
int (*svm_off)(struct kvm *kvm);
int (*enable_dawr1)(struct kvm *kvm);
bool (*hash_v3_possible)(void);
+   void (*create_kvm_debugfs)(struct kvm *kvm);
+   void (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry 
*debugfs_dentry);
 };
 
 extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ae7735b490b4..74d2c1c3df1b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1021,6 +1021,9 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long 
state);
 #ifdef __KVM_HAVE_ARCH_VCPU_DEBUGFS
 void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry 
*debugfs_dentry);
 #endif
+#ifdef __KVM_HAVE_ARCH_KVM_DEBUGFS
+void kvm_arch_create_kvm_debugfs(struct kvm *kvm);
+#endif
 
 int kvm_arch_hardware_enable(void);
 void kvm_arch_hardware_disable(void);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index c63e263312a4..33dae253a0ac 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -2112,7 +2112,7 @@ static const struct file_operations debugfs_htab_fops = {
 
 void kvmppc_mmu_debugfs_init(struct kvm *kvm)
 {
-   debugfs_create_file("htab", 0400, kvm->arch.debugfs_dir, kvm,
+   debugfs_create_file("htab", 0400, kvm->debugfs_dentry, kvm,
			    &debugfs_htab_fops);
 }
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index c5508744e14c..f4e083c20872 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1452,7 +1452,7 @@ static const struct file_operations debugfs_radix_fops = {
 
 void kvmhv_radix_debugfs_init(struct kvm *kvm)
 {
-   debugfs_create_file("radix", 0400, kvm->arch.debugfs_dir, kvm,
+   debugfs_create_file("radix", 0400, kvm->debugfs_dentry, kvm,
			    &debugfs_radix_fops);
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index c8f12b056968..325b388c725a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s

Re: [PATCH kernel] KVM: PPC: Book3S HV: Make unique debugfs nodename

2021-09-01 Thread Alexey Kardashevskiy




On 02/09/2021 00:23, Fabiano Rosas wrote:

Alexey Kardashevskiy  writes:


On 24/08/2021 18:37, Alexey Kardashevskiy wrote:



On 18/08/2021 08:20, Fabiano Rosas wrote:

Alexey Kardashevskiy  writes:


On 07/07/2021 14:13, Alexey Kardashevskiy wrote:



alternatively move this debugfs stuff under the platform-independent
directory, how about that?


That's a good idea. I only now realized we have two separate directories
for the same guest:

$ ls /sys/kernel/debug/kvm/ | grep $pid
19062-11
vm19062

Looks like we would have to implement kvm_arch_create_vcpu_debugfs for
the vcpu information and add a similar hook for the vm.


Something like that. From the git history, it looks like the ppc folder
was added first and then the generic kvm folder was added but apparently
they did not notice the ppc one due to natural reasons :)

If you are not too busy, can you please merge the ppc one into the
generic one and post the patch, so we won't need to fix these
duplication warnings again? Thanks,




Turns out it is not as straightforward as I thought, as the common KVM
debugfs entry is created after PPC HV KVM has created its own and there is
no obvious way to change the order (no "post init" hook in
kvmppc_ops).


That is why I mentioned creating a hook similar to
kvm_create_vcpu_debugfs in the common KVM code. kvm_create_vm_debugfs or
something.


ah sorry I missed that :-/



Alternatively, maybe kvm_create_vm_debugfs could be moved earlier into
kvm_create_vm, before kvm_arch_post_init_vm and we could move our code
into kvm_arch_post_init_vm.


kvm_arch_create_vcpu_debugfs() or kvm_arch_post_init_vm() will still 
require hooks in kvmppc_ops and such bikeshedding may take a while :)
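
For illustration, the hook route could look roughly like this (a sketch
only, reusing the hook names from the v1 patch above, not the final code):

/* virt/kvm/kvm_main.c, at the end of kvm_create_vm_debugfs(): */
#ifdef __KVM_HAVE_ARCH_KVM_DEBUGFS
	kvm_arch_create_kvm_debugfs(kvm);
#endif

/* arch/powerpc/kvm/powerpc.c, dispatching to whichever kvmppc_ops is active: */
void kvm_arch_create_kvm_debugfs(struct kvm *kvm)
{
	if (kvm->arch.kvm_ops->create_kvm_debugfs)
		kvm->arch.kvm_ops->create_kvm_debugfs(kvm);
}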





Also, unlike the common KVM debugfs setup, we do not allocate structures
to support debugfs nodes, so we do not leak anything and need not bother
with a mutex like 85cd39af14f4 did.

So I'd stick to the original patch to reduce the noise in the dmesg; it
also exposes the lpid, which I find rather useful for finding the right
partition scope tree in partition_tb.

Michael?







---
    arch/powerpc/kvm/book3s_hv.c | 2 +-
    1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c
b/arch/powerpc/kvm/book3s_hv.c
index 1d1fcc290fca..0223ddc0eed0 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5227,7 +5227,7 @@ static int kvmppc_core_init_vm_hv(struct kvm
*kvm)
    /*
     * Create a debugfs directory for the VM
     */
-    snprintf(buf, sizeof(buf), "vm%d", current->pid);
+    snprintf(buf, sizeof(buf), "vm%d-lp%ld", current->pid, lpid);
    kvm->arch.debugfs_dir = debugfs_create_dir(buf,
kvm_debugfs_dir);
    kvmppc_mmu_debugfs_init(kvm);
    if (radix_enabled())





--
Alexey


Re: [PATCH kernel] KVM: PPC: Book3S: Suppress warnings when allocating too big memory slots

2021-09-01 Thread Alexey Kardashevskiy




On 02/09/2021 00:59, Fabiano Rosas wrote:

Alexey Kardashevskiy  writes:


Userspace can trigger "vmalloc size %lu allocation failure: exceeds
total pages" via the KVM_SET_USER_MEMORY_REGION ioctl.

This silences the warning by checking the limit before calling vzalloc()
and returning -ENOMEM on failure.

This does not call the underlying vmalloc helpers as __vmalloc_node() is only
exported when CONFIG_TEST_VMALLOC_MODULE is set and __vmalloc_node_range() is
not exported at all.

Spotted by syzkaller.

Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/kvm/book3s_hv.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 474c0cfde384..a59f1cccbcf9 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4830,8 +4830,12 @@ static int kvmppc_core_prepare_memory_region_hv(struct 
kvm *kvm,
unsigned long npages = mem->memory_size >> PAGE_SHIFT;

if (change == KVM_MR_CREATE) {
-   slot->arch.rmap = vzalloc(array_size(npages,
- sizeof(*slot->arch.rmap)));
+   unsigned long cb = array_size(npages, sizeof(*slot->arch.rmap));


What does cb mean?


"count of bytes"

This is from my deep Windows past :)

https://docs.microsoft.com/en-us/windows/win32/stg/coding-style-conventions





+
+   if ((cb >> PAGE_SHIFT) > totalram_pages())
+   return -ENOMEM;
+
+   slot->arch.rmap = vzalloc(cb);
if (!slot->arch.rmap)
return -ENOMEM;
}


--
Alexey


[PATCH kernel] KVM: PPC: Book3S: Suppress failed alloc warning in H_COPY_TOFROM_GUEST

2021-09-01 Thread Alexey Kardashevskiy
H_COPY_TOFROM_GUEST is an hcall for an upper level VM to access its nested
VMs' memory. Userspace can trigger WARN_ON_ONCE(!(gfp & __GFP_NOWARN))
in __alloc_pages() by constructing a tiny VM which only does
H_COPY_TOFROM_GUEST with too big a GPR9 (the number of bytes to copy).

This silences the warning by adding __GFP_NOWARN.

Spotted by syzkaller.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/book3s_hv_nested.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index e57c08b968c0..a2e34efb8d31 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -580,7 +580,7 @@ long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu)
if (eaddr & (0xFFFUL << 52))
return H_PARAMETER;
 
-   buf = kzalloc(n, GFP_KERNEL);
+   buf = kzalloc(n, GFP_KERNEL | __GFP_NOWARN);
if (!buf)
return H_NO_MEM;
 
-- 
2.30.2



[PATCH kernel] KVM: PPC: Book3S: Suppress warnings when allocating too big memory slots

2021-09-01 Thread Alexey Kardashevskiy
Userspace can trigger "vmalloc size %lu allocation failure: exceeds
total pages" via the KVM_SET_USER_MEMORY_REGION ioctl.

This silences the warning by checking the limit before calling vzalloc()
and returning -ENOMEM on failure.

This does not call the underlying vmalloc helpers as __vmalloc_node() is only
exported when CONFIG_TEST_VMALLOC_MODULE is set and __vmalloc_node_range() is
not exported at all.

Spotted by syzkaller.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/book3s_hv.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 474c0cfde384..a59f1cccbcf9 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4830,8 +4830,12 @@ static int kvmppc_core_prepare_memory_region_hv(struct 
kvm *kvm,
unsigned long npages = mem->memory_size >> PAGE_SHIFT;
 
if (change == KVM_MR_CREATE) {
-   slot->arch.rmap = vzalloc(array_size(npages,
- sizeof(*slot->arch.rmap)));
+   unsigned long cb = array_size(npages, sizeof(*slot->arch.rmap));
+
+   if ((cb >> PAGE_SHIFT) > totalram_pages())
+   return -ENOMEM;
+
+   slot->arch.rmap = vzalloc(cb);
if (!slot->arch.rmap)
return -ENOMEM;
}
-- 
2.30.2




[PATCH kernel] KVM: PPC: Fix clearing never mapped TCEs in realmode

2021-08-26 Thread Alexey Kardashevskiy
Since e1a1ef84cd07, pages for TCE tables for KVM guests are allocated
only when needed. This allows skipping any update when clearing TCEs.
This works mostly fine as TCE updates are handled when the MMU is enabled.
The realmode handlers fail with H_TOO_HARD when pages are not yet
allocated, except when clearing a TCE, in which case KVM prints a warning
but proceeds to dereference a NULL pointer, which crashes the host OS.

This has not been caught so far as the change is reasonably new and
POWER9 runs mostly radix, which does not use the realmode handlers.
With hash, the default TCE table is memset() by QEMU at machine reset,
which triggers page faults that the KVM TCE device's kvm_spapr_tce_fault()
handles with the MMU on. And the huge DMA windows are not cleared
by VMs, which instead successfully create a DMA window big enough to map
the VM memory 1:1 and then just map everything without clearing.

This started crashing now as the upcoming sriov-under-powervm support added
a mode where a dynamic DMA window is not big enough to map the VM memory 1:1
but is used anyway, and the VM is now the first (i.e. not QEMU) to
clear a just created table. Note that the upstream QEMU needs to be
modified for the VM to trigger the host OS crash.

This replaces WARN_ON_ONCE_RM() with a check and return.
This adds another warning if TCE is not being cleared.

Cc: Leonardo Bras 
Fixes: e1a1ef84cd07 ("KVM: PPC: Book3S: Allocate guest TCEs on demand too")
Signed-off-by: Alexey Kardashevskiy 
---

With recent changes in the printk() department, calling pr_err() with the MMU
off causes lockdep lockups which I did not dig into any further, so we should
start getting rid of the realmode WARN_ON_ONCE_RM().
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c 
b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 083a4e037718..e5ba96c41f3f 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -173,10 +173,13 @@ static void kvmppc_rm_tce_put(struct 
kvmppc_spapr_tce_table *stt,
idx -= stt->offset;
page = stt->pages[idx / TCES_PER_PAGE];
/*
-* page must not be NULL in real mode,
-* kvmppc_rm_ioba_validate() must have taken care of this.
+* kvmppc_rm_ioba_validate() allows pages not be allocated if TCE is
+* being cleared, otherwise it returns H_TOO_HARD and we skip this.
 */
-   WARN_ON_ONCE_RM(!page);
+   if (!page) {
+   WARN_ON_ONCE_RM(tce != 0);
+   return;
+   }
tbl = kvmppc_page_address(page);
 
tbl[idx % TCES_PER_PAGE] = tce;
-- 
2.30.2




Re: [PATCH v5 08/11] powerpc/pseries/iommu: Update remove_dma_window() to accept property name

2021-08-24 Thread Alexey Kardashevskiy




On 17/08/2021 16:12, Leonardo Brás wrote:

On Tue, 2021-08-17 at 02:59 -0300, Leonardo Brás wrote:

Hello Fred, thanks for the feedback!

On Tue, 2021-07-20 at 19:51 +0200, Frederic Barrat wrote:



On 16/07/2021 10:27, Leonardo Bras wrote:

Update remove_dma_window() so it can be used to remove DDW with a
given
property name.

This enables the creation of new property names for DDW, so we
can
have different usage for it, like indirect mapping.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
   arch/powerpc/platforms/pseries/iommu.c | 21 +++
--
   1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c
b/arch/powerpc/platforms/pseries/iommu.c
index 108c3dcca686..17c6f4706e76 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -830,31 +830,32 @@ static void remove_dma_window(struct
device_node *np, u32 *ddw_avail,
 np, ret,
ddw_avail[DDW_REMOVE_PE_DMA_WIN],
liobn);
   }
   
-static void remove_ddw(struct device_node *np, bool remove_prop)

+static int remove_ddw(struct device_node *np, bool remove_prop,
const char *win_name)
   {



Why switch to returning an int? None of the callers check it.


IIRC, in a previous version it did make sense, which is not the case
anymore. I will revert this.

Thanks!


Oh, sorry about that, it is in fact still needed:



Then you should have added it in 10/11.



It will make sense in patch v5 10/11:
On iommu_reconfig_notifier(), if (action == OF_RECONFIG_DETACH_NODE),
we need to remove a DDW if it exists.

As there may be different window names, it tests for DIRECT64_PROPNAME,
and if it's not found, it tests for DMA64_PROPNAME.

This approach will skip scanning for DMA64_PROPNAME if
DIRECT64_PROPNAME was found, as both may not exist in the same node.
But for this approach to work we need remove_ddw() to return error if
the property is not found.

Does it make sense? or should I just test for both?


Or you could just try removing both without checking the return code; it
is one extra of_find_property() in a very rare code path. Not worth
reposting though imho. (sorry I was off last week, catching up). Thanks,
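
E.g. in iommu_reconfig_notifier() (a sketch; the second argument stays
whatever the existing OF_RECONFIG_DETACH_NODE call already passes, shown
here as a placeholder):

	case OF_RECONFIG_DETACH_NODE:
		/*
		 * The two window flavours cannot exist on the same node and
		 * removing a property that is not there is harmless, so try
		 * both without checking the return value.
		 */
		remove_ddw(np, remove_prop, DIRECT64_PROPNAME);
		remove_ddw(np, remove_prop, DMA64_PROPNAME);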




--
Alexey


Re: [PATCH kernel] KVM: PPC: Book3S HV: Make unique debugfs nodename

2021-08-13 Thread Alexey Kardashevskiy




On 07/07/2021 14:13, Alexey Kardashevskiy wrote:

Currently it is vm-$currentpid which works as long as there is just one
VM per userspace process (99.99% of cases) but produces a bunch
of "debugfs: Directory 'vm16679' with parent 'kvm' already present!"
messages when syzkaller (a syscall fuzzer) is running, so only one VM is
present in the debugfs for a given process.

This changes the debugfs node to include the LPID, which alone should be
system-wide unique. This leaves the existing pid for the convenience of
matching the VM's debugfs with the running userspace process (QEMU).

Signed-off-by: Alexey Kardashevskiy 


Looks like this is not enough as syzkaller still manages to cause the
error message, so I need a more robust approach as in
https://lore.kernel.org/patchwork/patch/1472025/ or, alternatively, to
move this debugfs stuff under the platform-independent directory, how
about that?




---
  arch/powerpc/kvm/book3s_hv.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1d1fcc290fca..0223ddc0eed0 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5227,7 +5227,7 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
/*
 * Create a debugfs directory for the VM
 */
-   snprintf(buf, sizeof(buf), "vm%d", current->pid);
+   snprintf(buf, sizeof(buf), "vm%d-lp%ld", current->pid, lpid);
kvm->arch.debugfs_dir = debugfs_create_dir(buf, kvm_debugfs_dir);
kvmppc_mmu_debugfs_init(kvm);
if (radix_enabled())



--
Alexey


[PATCH kernel v2] KVM: PPC: Use arch_get_random_seed_long instead of powernv variant

2021-08-05 Thread Alexey Kardashevskiy
powernv_get_random_long() does not work in nested KVM (which is
pseries) and crashes when accessing in_be64(rng->regs).

This replaces powernv_get_random_long with the ppc_md machine hook
wrapper.

Signed-off-by: Alexey Kardashevskiy 
---

Changes:
v2:
* replaces [PATCH kernel] powerpc/powernv: Check if powernv_rng is initialized

---
 arch/powerpc/kvm/book3s_hv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index be0cde26f156..ecfd133e0ca8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1165,7 +1165,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
break;
 #endif
case H_RANDOM:
-   if (!powernv_get_random_long(&vcpu->arch.regs.gpr[4]))
+   if (!arch_get_random_seed_long(&vcpu->arch.regs.gpr[4]))
ret = H_HARDWARE;
break;
case H_RPT_INVALIDATE:
-- 
2.30.2



[PATCH kernel] powerpc/powernv: Check if powernv_rng is initialized

2021-07-29 Thread Alexey Kardashevskiy
The powernv-rng driver has 2 users: the bare metal powernv platform and
KVM's H_RANDOM hcall. The hcall handler works fine in L0 KVM
but fails in L1 KVM as there is no support for the HW registers in L1 VMs
and such support is not advertised either (== no "ibm,power-rng" in
the FDT). So when a nested VM tries H_RANDOM, the L1 KVM crashes on
in_be64(rng->regs).

This checks the pointers and returns an error if the feature is not
set up.

Signed-off-by: Alexey Kardashevskiy 
---


Randomly randomized H_RANDOM:

00:00:45 executing program 10:
r0 = openat$kvm(0xff9c, &(0x7f00), 0x0, 0x0)
r1 = ioctl$KVM_CREATE_VM(r0, 0x2000ae01, 0x0)
r2 = ioctl$KVM_CREATE_VCPU(r1, 0x2000ae41, 0x0)
ioctl$KVM_SET_REGS(r2, 0x8188ae82, &(0x7f0001c0)={[0x0, 0x0, 
0xffe1, 0x0, 0x0, 0x20953, 0x0, 0xfffe, 0x0, 0x0, 
0x2], 0x2000})
syz_kvm_setup_cpu$ppc64(0x, r2, &(0x7fe8/0x18)=nil, 
0x0, 0x0, 0x0, 0x0, 0x0)
r3 = openat$kvm(0xff9c, &(0x7f000100), 0x0, 0x0)
syz_kvm_setup_cpu$ppc64(r1, r2, &(0x7fe7/0x18)=nil, 
&(0x7f80)=[{0x0, 
&(0x7f000280)="e03d0080ef61e403ef79ef650900ef61647b007ce03fff63e403ff7bff679952ff6370e63f7e603c6360e403637863640003636018a8803c28bf8460e4038478ef97846436888460b6f6a03c88d6a560e403a5781beda564d879a5602665c03cb08dc660e403c67806b3c664966fc660d53fe03cddf1e760e403e7785c41e7646623e7602244463fb1f2803e00809462e403947a946604009462a6a6607f4abb4c13603f7b63e4037b7b7b679a367b6332d9c17c201c994f7201004cbb7a603f72047b63e4037b7b955f7b6799947b636401607f",
 0xf0}], 0x1, 0x0, &(0x7fc0)=[@featur2={0x1, 0x1000}], 0x1)


cpu 0xd: Vector: 300 (Data Access) at [c0001599f590]
pc: c011d2bc: powernv_get_random_long+0x4c/0xc0
lr: c011d298: powernv_get_random_long+0x28/0xc0
sp: c0001599f830
   msr: 8280b033
   dar: 0
 dsisr: 4000
  current = 0xc000614c7f80
  paca= 0xc000fff81700   irqmask: 0x03   irq_happened: 0x01
pid   = 31576, comm = syz-executor.10

Linux version 5.14.0-rc2-le_f29cf1ff9a23_a+fstn1 (aik@fstn1-p1) (gcc (Ubuntu 
10.3.0-1ubuntu1) 10.3.0, GNU ld (GNU Binutils for Ubuntu) 2.36.1) #263 SMP Thu 
Jul 29 17:56:12 AEST 2021
enter ? for help
[c0001599f860] c01e45f8 kvmppc_pseries_do_hcall+0x5d8/0x2190
[c0001599f8f0] c01ea2dc kvmppc_vcpu_run_hv+0x31c/0x14d0
[c0001599f9c0] c01bd518 kvmppc_vcpu_run+0x48/0x60
[c0001599f9f0] c01b74b0 kvm_arch_vcpu_ioctl_run+0x580/0x7d0
[c0001599fa90] c019e6f8 kvm_vcpu_ioctl+0x418/0xd00
[c0001599fc70] c079d8c4 sys_ioctl+0xb44/0x2100
[c0001599fd90] c003b704 system_call_exception+0x224/0x410
[c0001599fe10] c000c0e8 system_call_vectored_common+0xe8/0x278



---
 arch/powerpc/platforms/powernv/rng.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/rng.c 
b/arch/powerpc/platforms/powernv/rng.c
index 72c25295c1c2..070d0963995d 100644
--- a/arch/powerpc/platforms/powernv/rng.c
+++ b/arch/powerpc/platforms/powernv/rng.c
@@ -105,6 +105,8 @@ int powernv_get_random_long(unsigned long *v)
struct powernv_rng *rng;
 
rng = get_cpu_var(powernv_rng);
+   if (!rng || !rng->regs)
+   return 0;
 
*v = rng_whiten(rng, in_be64(rng->regs));
 
-- 
2.30.2



Re: [PATCH v5 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-07-22 Thread Alexey Kardashevskiy




On 22/07/2021 01:04, Frederic Barrat wrote:



On 21/07/2021 05:32, Alexey Kardashevskiy wrote:

+    struct iommu_table *newtbl;
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
+    const unsigned long mask = IORESOURCE_MEM_64 | 
IORESOURCE_MEM;

+
+    /* Look for MMIO32 */
+    if ((pci->phb->mem_resources[i].flags & mask) == 
IORESOURCE_MEM)

+    break;
+    }
+
+    if (i == ARRAY_SIZE(pci->phb->mem_resources))
+    goto out_del_list;



So we exit and do nothing if there's no MMIO32 bar?
Isn't the intent just to figure out the MMIO32 area to reserve it 
when init'ing the table? In which case we could default to 0,0


I'm actually not clear why we are reserving this area on pseries.




If we do not reserve it, then the iommu code will allocate DMA pages
from there, and these addresses are MMIO32 from the kernel pov at
least. I saw crashes when (I think) a device tried DMAing to the top
2GB of the bus space which happened to be some other device's BAR.



hmmm... then figuring out the correct range needs more work. We could 
have more than one MMIO32 bar. And they don't have to be adjacent. 


They all have to be within the MMIO32 window of a PHB and we reserve the 
entire window here.


I don't see that we are reserving any range on the initial table though
(on pseries).

True, we did not need to, as the hypervisor always took care of DMA and
MMIO32 regions to not overlap.


And in this series we do not (strictly speaking) need this either, as
phyp never allocates more than one window dynamically and that only
window is always the second one, starting from 0x800.... It
is probably my mistake that KVM allows a new window to start from 0 -
PAPR did not prohibit this explicitly.


And for the KVM case, we do not need to remove the default window as KVM
can pretty much always allocate as many TCEs as the VM wants. But we
still allow removing the default window and creating a huge one instead
at 0x0, as this way we can allow 1:1 for every single PCI device even if
it only allows 48-bit (or similar, less than 64-bit) DMA. Hope this makes
sense. Thanks,
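
For reference, the reservation in the indirect mapping path boils down to
roughly this (a sketch based on the hunk quoted above; exact units and
rounding are per the final patch):

	unsigned long start = 0, end = 0;
	int i;

	for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
		const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;

		/* MEM without MEM_64 == the 32-bit MMIO window */
		if ((pci->phb->mem_resources[i].flags & mask) == IORESOURCE_MEM) {
			start = pci->phb->mem_resources[i].start;
			end = pci->phb->mem_resources[i].end;
			break;
		}
	}

	/*
	 * 0,0 (nothing reserved) if there is no MMIO32 window; otherwise the
	 * whole window is carved out of the new table so that iommu_alloc()
	 * never returns a DMA address aliasing a 32-bit BAR.
	 */
	iommu_init_table(newtbl, pci->phb->node, start, end);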



--
Alexey


Re: [PATCH v5 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-07-20 Thread Alexey Kardashevskiy




On 21/07/2021 04:12, Frederic Barrat wrote:



On 16/07/2021 10:27, Leonardo Bras wrote:

So far it's been assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes it as the user can
configure hundreds of VFs and, since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

By using DDW, indirect mapping can get more TCEs than available for the
default DMA window, and also get access to much larger pagesizes
(16MB as implemented in qemu vs 4k from the default DMA window), causing a
significant increase in the maximum amount of memory that can be IOMMU
mapped at the same time.

Indirect mapping will only be used if direct mapping is not a
possibility.

For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.

Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.

Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel can assume direct mapping, and skip
iommu_alloc(), causing undesirable behavior.
So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info"
was created to represent a DDW that does not allow direct mapping.

Signed-off-by: Leonardo Bras 
---
  arch/powerpc/platforms/pseries/iommu.c | 87 +-
  1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c

index 22d251e15b61..a67e71c49aeb 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -375,6 +375,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
  /* protects initializing window twice for same device */
  static DEFINE_MUTEX(direct_window_init_mutex);
  #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
  static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
  unsigned long num_pfn, const void *arg)
@@ -925,6 +926,7 @@ static int find_existing_ddw_windows(void)
  return 0;
  find_existing_ddw_windows_named(DIRECT64_PROPNAME);
+    find_existing_ddw_windows_named(DMA64_PROPNAME);
  return 0;
  }
@@ -1211,14 +1213,17 @@ static bool enable_ddw(struct pci_dev *dev, 
struct device_node *pdn)

  struct ddw_create_response create;
  int page_shift;
  u64 win_addr;
+    const char *win_name;
  struct device_node *dn;
  u32 ddw_avail[DDW_APPLICABLE_SIZE];
  struct direct_window *window;
  struct property *win64;
  bool ddw_enabled = false;
  struct failed_ddw_pdn *fpdn;
-    bool default_win_removed = false;
+    bool default_win_removed = false, direct_mapping = false;
  bool pmem_present;
+    struct pci_dn *pci = PCI_DN(pdn);
+    struct iommu_table *tbl = pci->table_group->tables[0];
  dn = of_find_node_by_type(NULL, "ibm,pmemory");
  pmem_present = dn != NULL;
@@ -1227,6 +1232,7 @@ static bool enable_ddw(struct pci_dev *dev, 
struct device_node *pdn)

  mutex_lock(&direct_window_init_mutex);
  if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
+    direct_mapping = (len >= max_ram_len);
  ddw_enabled = true;
  goto out_unlock;
  }
@@ -1307,8 +1313,7 @@ static bool enable_ddw(struct pci_dev *dev, 
struct device_node *pdn)

    query.page_size);
  goto out_failed;
  }
-    /* verify the window * number of ptes will map the partition */
-    /* check largest block * page size > max memory hotplug addr */
+
  /*
   * The "ibm,pmemory" can appear anywhere in the address space.
   * Assuming it is still backed by page structs, try 
MAX_PHYSMEM_BITS
@@ -1324,13 +1329,25 @@ static bool enable_ddw(struct pci_dev *dev, 
struct device_node *pdn)

  dev_info(>dev, "Skipping ibm,pmemory");
  }
+    /* check if the available block * number of ptes will map 
everything */

  if (query.largest_available_block < (1ULL << (len - page_shift))) {
  dev_dbg(&dev->dev,
  "can't map partition max 0x%llx with %llu %llu-sized 
pages\n",

  1ULL << len,
  query.largest_available_block,
  1ULL << page_shift);
-    goto out_failed;
+
+    /* DDW + IOMMU on single window may fail if there is any 
allocation */

+    if (default_win_removed && iommu_table_in_use(tbl)) {
+    dev_dbg(&dev->dev, "current IOMMU table in use, can't be 
replaced.\n");

+    goto out_failed;
+    }
+
+    len = 

Re: [PATCH v5 02/11] powerpc/kernel/iommu: Add new iommu_table_in_use() helper

2021-07-20 Thread Alexey Kardashevskiy




On 20/07/2021 15:38, Leonardo Brás wrote:

Hello Fred, thanks for this feedback!

Sorry if I miss anything; this snippet was written for v1 over a year
ago, and I have not taken a look at it since.

On Mon, 2021-07-19 at 15:53 +0200, Frederic Barrat wrote:



On 16/07/2021 10:27, Leonardo Bras wrote:

@@ -1099,18 +1105,13 @@ int iommu_take_ownership(struct iommu_table
*tbl)
 for (i = 0; i < tbl->nr_pools; i++)
 spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
   
-   iommu_table_release_pages(tbl);

-
-   if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
+   if (iommu_table_in_use(tbl)) {
 pr_err("iommu_tce: it_map is not empty");
 ret = -EBUSY;
-   /* Undo iommu_table_release_pages, i.e. restore
bit#0, etc */
-   iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
-   tbl->it_reserved_end);
-   } else {
-   memset(tbl->it_map, 0xff, sz);
 }
   
+   memset(tbl->it_map, 0xff, sz);

+



So if the table is not empty, we fail (EBUSY) but we now also completely
overwrite the bitmap. It was in an unexpected state, but we're making it
worse. Or am I missing something?


IIRC there was a reason to do that at the time, but TBH I don't really
remember it, and by looking at the code right now you seem to be
correct about this causing trouble.

I will send a v6 fixing it soon.
Please review the remaining patches for some issue I may be missing.

Alexey, any comments on that?



Agree with Fred, this is a bug, EBUSY is not that unexpected :-/ Thanks,
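
I.e. keep the memset on the success path only; a minimal fix against the
hunk quoted above:

	if (iommu_table_in_use(tbl)) {
		pr_err("iommu_tce: it_map is not empty");
		ret = -EBUSY;
	} else {
		/* it_map stays fully set while the table ownership is taken */
		memset(tbl->it_map, 0xff, sz);
	}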






    Fred



Again, thank you for reviewing Fred!
Best regards,
Leonardo Bras







--
Alexey


Re: [PATCH v4 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-07-14 Thread Alexey Kardashevskiy




On 13/07/2021 14:36, Leonardo Brás wrote:

On Tue, 2021-05-11 at 17:57 +1000, Alexey Kardashevskiy wrote:



On 01/05/2021 02:31, Leonardo Bras wrote:

[...]
   pmem_present = dn != NULL;
@@ -1218,8 +1224,12 @@ static bool enable_ddw(struct pci_dev *dev,
struct device_node *pdn)
   
  mutex_lock(&direct_window_init_mutex);
   
-   if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len))
-   goto out_unlock;
+   if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
+   direct_mapping = (len >= max_ram_len);
+
+   mutex_unlock(&direct_window_init_mutex);
+   return direct_mapping;


Doesn't this break the existing case when direct_mapping==true by
skipping setting dev->dev.bus_dma_limit before returning?



Yes, it does. Good catch!
I changed it to use a flag instead of win64 for the return, and now I can
use the same success exit path for both the new config and the config
found in the list (out_unlock).





+   }
   
 /*

  * If we already went through this for a previous function of
@@ -1298,7 +1308,6 @@ static bool enable_ddw(struct pci_dev *dev,
struct device_node *pdn)
 goto out_failed;
 }
 /* verify the window * number of ptes will map the partition
*/
-   /* check largest block * page size > max memory hotplug addr
*/
 /*
  * The "ibm,pmemory" can appear anywhere in the address
space.
  * Assuming it is still backed by page structs, try
MAX_PHYSMEM_BITS
@@ -1320,6 +1329,17 @@ static bool enable_ddw(struct pci_dev *dev,
struct device_node *pdn)
 1ULL << len,
 query.largest_available_block,
 1ULL << page_shift);
+
+   len = order_base_2(query.largest_available_block <<
page_shift);
+   win_name = DMA64_PROPNAME;


[1] 



+   } else {
+   direct_mapping = true;
+   win_name = DIRECT64_PROPNAME;
+   }
+
+   /* DDW + IOMMU on single window may fail if there is any
allocation */
+   if (default_win_removed && !direct_mapping &&
iommu_table_in_use(tbl)) {
+   dev_dbg(&dev->dev, "current IOMMU table in use, can't
be replaced.\n");



... remove !direct_mapping and move to [1]?



sure, done!





 		goto out_failed;
 	}
 
@@ -1331,8 +1351,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 				  create.liobn, dn);
 
 	win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
-	win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr,
-				    page_shift, len);
+	win64 = ddw_property_create(win_name, create.liobn, win_addr, page_shift, len);
 	if (!win64) {
 		dev_info(&dev->dev,
 			 "couldn't allocate property, property name, or value\n");
@@ -1350,12 +1369,47 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	if (!window)
 		goto out_del_prop;
 
-	ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
-			win64->value, tce_setrange_multi_pSeriesLP_walk);
-	if (ret) {
-		dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
-			 dn, ret);
-		goto out_del_list;
+	if (direct_mapping) {
+		/* DDW maps the whole partition, so enable direct DMA mapping */
+		ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
+					    win64->value, tce_setrange_multi_pSeriesLP_walk);
+		if (ret) {
+			dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
+				 dn, ret);
+			goto out_del_list;
+		}
+	} else {
+		struct iommu_table *newtbl;
+		int i;
+
+		/* New table for using DDW instead of the default DMA window */
+		newtbl = iommu_pseries_alloc_table(pci->phb->node);
+		if (!newtbl) {
+			dev_dbg(&dev->dev, "couldn't create new IOMMU table\n");
+			goto out_del_list;
+		}
+
+		for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
+			const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;
+
+			/* Look for MMIO32 */
+			if ((pci->phb->mem_resources[i].flags & mask) == IORESOURCE_MEM)
+				break;


What if there is no IORESOURCE_MEM? pci->phb->mem_resources[i].start
below will have garbage.




Yeah, that makes sense. I will add these lines after the 'for' loop:

if (i == ARRAY_SIZE(pci->phb->mem_resources))
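
A sketch of the complete guard (hypothetical: the iommu_tce_table_put()
cleanup of the freshly allocated newtbl is my assumption, not quoted from
the series):

	if (i == ARRAY_SIZE(pci->phb->mem_resources)) {
		iommu_tce_table_put(newtbl);	/* assumed: drop the unused new table */
		goto out_del_list;
	}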

Re: [PATCH v4 07/11] powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper

2021-07-14 Thread Alexey Kardashevskiy




On 13/07/2021 14:47, Leonardo Brás wrote:

Hello Alexey,

On Fri, 2021-06-18 at 19:26 -0300, Leonardo Brás wrote:



+				 unsigned long liobn, unsigned long win_addr,
+				 unsigned long window_size, unsigned long page_shift,
+				 unsigned long base, struct iommu_table_ops *table_ops)



iommu_table_setparms() rather than passing 0 around.

The same comment about "liobn" - set it in
iommu_table_setparms_lpar().
The reviewer will see what field matters in what situation imho.



The idea here was to keep all tbl parameter setting in
_iommu_table_setparms (or iommu_table_setparms_common).

I understand the idea that each one of those is optional in the other
case, but should we keep whatever value is present in the other
variable (not zeroing it), or do something like:

tbl->it_index = 0;
tbl->it_base = basep;
(in iommu_table_setparms)

tbl->it_index = liobn;
tbl->it_base = 0;
(in iommu_table_setparms_lpar)



This one is supposed to be a question, but I missed the question mark.
Sorry about that.


Ah ok :)


I would like to get your opinion on this :)


Besides making the "base" parameter a pointer, I really do not have a 
strong preference, just make it not hurt the eyes of a reader, that's all :)
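

For illustration, the shape of the helper under discussion could be
something like this (a sketch only; the name _iommu_table_setparms and the
exact field set are assumptions, with base kept as a plain value here):

	static void _iommu_table_setparms(struct iommu_table *tbl, unsigned long liobn,
					  unsigned long win_addr, unsigned long window_size,
					  unsigned long page_shift, unsigned long base,
					  struct iommu_table_ops *table_ops)
	{
		tbl->it_index = liobn;		/* 0 for the native (non-LPAR) caller */
		tbl->it_base = base;		/* 0 for the LPAR caller */
		tbl->it_offset = win_addr >> page_shift;
		tbl->it_size = window_size >> page_shift;
		tbl->it_page_shift = page_shift;
		tbl->it_ops = table_ops;
	}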


imho in general, rather than answering 5 weeks later, it is more 
productive to address whatever comments were made, add comments (in the 
code or commit logs) why you are sticking to your initial approach, 
rebase and repost the whole thing. Thanks,




--
Alexey


Re: [PATCH kernel] KVM: PPC: Book3S HV: Make unique debugfs nodename

2021-07-07 Thread Alexey Kardashevskiy




On 08/07/2021 03:48, Fabiano Rosas wrote:

Alexey Kardashevskiy  writes:


Currently it is vm-$currentpid which works as long as there is just one
VM per userspace process (99.99% of cases) but produces a bunch
of "debugfs: Directory 'vm16679' with parent 'kvm' already present!"
messages when syzkaller (a syscall fuzzer) is running, so that only one VM
is present in the debugfs for a given process.

This changes the debugfs node to include the LPID which alone should be
system wide unique. This leaves the existing pid for the convenience of
matching the VM's debugfs with the running userspace process (QEMU).

Signed-off-by: Alexey Kardashevskiy 


Reviewed-by: Fabiano Rosas 


thanks.

Strangely it also fixes a bunch of

BUG: unable to handle kernel NULL pointer dereference in corrupted
BUG: unable to handle kernel paging request in corrupted

I was seeing 3 of these for every hour of running syzkaller, and none 
anymore with this patch.






---
  arch/powerpc/kvm/book3s_hv.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1d1fcc290fca..0223ddc0eed0 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5227,7 +5227,7 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
/*
 * Create a debugfs directory for the VM
 */
-   snprintf(buf, sizeof(buf), "vm%d", current->pid);
+   snprintf(buf, sizeof(buf), "vm%d-lp%ld", current->pid, lpid);
kvm->arch.debugfs_dir = debugfs_create_dir(buf, kvm_debugfs_dir);
kvmppc_mmu_debugfs_init(kvm);
if (radix_enabled())


--
Alexey


[PATCH kernel] KVM: PPC: Book3S HV: Make unique debugfs nodename

2021-07-06 Thread Alexey Kardashevskiy
Currently it is vm-$currentpid which works as long as there is just one
VM per userspace process (99.99% of cases) but produces a bunch
of "debugfs: Directory 'vm16679' with parent 'kvm' already present!"
messages when syzkaller (a syscall fuzzer) is running, so that only one VM
is present in the debugfs for a given process.

This changes the debugfs node to include the LPID which alone should be
system wide unique. This leaves the existing pid for the convenience of
matching the VM's debugfs with the running userspace process (QEMU).

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/book3s_hv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1d1fcc290fca..0223ddc0eed0 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5227,7 +5227,7 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
/*
 * Create a debugfs directory for the VM
 */
-   snprintf(buf, sizeof(buf), "vm%d", current->pid);
+   snprintf(buf, sizeof(buf), "vm%d-lp%ld", current->pid, lpid);
kvm->arch.debugfs_dir = debugfs_create_dir(buf, kvm_debugfs_dir);
kvmppc_mmu_debugfs_init(kvm);
if (radix_enabled())
-- 
2.30.2



Re: [PATCH] Revert "powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE() to save TCEs"

2021-05-26 Thread Alexey Kardashevskiy




On 27/05/2021 00:45, Frederic Barrat wrote:

This reverts commit 3c0468d4451eb6b4f6604370639f163f9637a479.

That commit was breaking alignment guarantees for the DMA address when
allocating coherent mappings, as described in
Documentation/core-api/dma-api-howto.rst

It was also noticed by Mellanox' driver:
[ 1515.763621] mlx5_core c002:01:00.0: mlx5_frag_buf_alloc_node:146:(pid 
13402): unexpected map alignment: 0x08c61000, page_shift=16
[ 1515.763635] mlx5_core c002:01:00.0: mlx5_cqwq_create:181:(pid
13402): mlx5_frag_buf_alloc_node() failed, -12

Signed-off-by: Frederic Barrat 


Should it be

Fixes: 3c0468d4451e ("powerpc/kernel/iommu: Align size for 
IOMMU_PAGE_SIZE() to save TCEs")


?

Anyway,

Reviewed-by: Alexey Kardashevskiy 

I should have known better in the first place, sorry :-/ Thanks,
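

To spell out the guarantee the revert restores (my paraphrase of
Documentation/core-api/dma-api-howto.rst): the addresses returned by
dma_alloc_coherent() are aligned to the smallest PAGE_SIZE order which is
greater than or equal to the requested size. A driver-side check in the
spirit of the mlx5 message above could be sketched (hypothetical, not
mlx5's actual code) as:

	size_t aligned = PAGE_ALIGN(size);

	/* natural alignment: the DMA handle must not straddle its own order */
	if (dma_handle & (roundup_pow_of_two(aligned) - 1))
		dev_warn(dev, "unexpected map alignment: %pad\n", &dma_handle);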



---
  arch/powerpc/kernel/iommu.c | 11 +--
  1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 57d6b85e9b96..2af89a5e379f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -898,7 +898,6 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
 	unsigned int order;
 	unsigned int nio_pages, io_order;
 	struct page *page;
-	size_t size_io = size;
 
 	size = PAGE_ALIGN(size);
 	order = get_order(size);
@@ -925,9 +924,8 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
 	memset(ret, 0, size);
 
 	/* Set up tces to cover the allocated range */
-	size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
-	nio_pages = size_io >> tbl->it_page_shift;
-	io_order = get_iommu_order(size_io, tbl);
+	nio_pages = size >> tbl->it_page_shift;
+	io_order = get_iommu_order(size, tbl);
 	mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
 			      mask >> tbl->it_page_shift, io_order, 0);
 	if (mapping == DMA_MAPPING_ERROR) {
@@ -942,9 +940,10 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
 			 void *vaddr, dma_addr_t dma_handle)
 {
 	if (tbl) {
-		size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
-		unsigned int nio_pages = size_io >> tbl->it_page_shift;
+		unsigned int nio_pages;
 
+		size = PAGE_ALIGN(size);
+		nio_pages = size >> tbl->it_page_shift;
 		iommu_free(tbl, dma_handle, nio_pages);
 		size = PAGE_ALIGN(size);
 		free_pages((unsigned long)vaddr, get_order(size));



--
Alexey


Re: [RFC PATCH kernel] powerpc: Fix early setup to make early_ioremap work

2021-05-19 Thread Alexey Kardashevskiy




On 20/05/2021 15:46, Christophe Leroy wrote:



Le 20/05/2021 à 05:29, Alexey Kardashevskiy a écrit :

The immediate problem is that after
0bd3f9e953bd ("powerpc/legacy_serial: Use early_ioremap()")
the kernel silently reboots. The reason is that early_ioremap() returns
broken addresses as it uses the slot_virt[] array which is initialized with
offsets from FIXADDR_TOP == IOREMAP_END + FIXADDR_SIZE ==
KERN_IO_END - FIXADDR_SIZE + FIXADDR_SIZE == __kernel_io_end, which is 0
when early_ioremap_setup() is called. __kernel_io_end is initialized
a little bit later in early_init_mmu().

This fixes the initialization by swapping early_ioremap_setup and
early_init_mmu.


Hum ... Chris tested it on a T2080RDB, that must be a book3e.

So we missed it. I guess your fix is right.



Oh cool.



This also fixes IOREMAP_END to use FIXADDR_SIZE defined just next to it,
which seems to make sense, unless there is some weird logic that redefines
FIXADDR_SIZE as compilation proceeds.


Well, I don't think the order of defines matters, the change should be 
kept out of the fix.


When I see this:

#define IOREMAP_END	(KERN_IO_END - FIXADDR_SIZE)
#define FIXADDR_SIZE	SZ_32M


... I have to think harder about what FIXADDR_SIZE is in the first macro and 
in what order the preprocessor expands them.
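

For what it's worth, a tiny example of why the define order cannot matter
here: object-like macros are expanded where they are used, not where they
are defined, so FIXADDR_SIZE only has to be defined before any line that
actually uses IOREMAP_END:

	#define A (B + 1)	/* B does not need to exist yet */
	#define B 2
	int x = A;		/* expands here, to (2 + 1) */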



But if you want it anyway, then I'd suggest moving it before 
IOREMAP_BASE in order to keep the 3 IOREMAP_xxx defines together.


Up to Michael, I guess.






Signed-off-by: Alexey Kardashevskiy 


Reviewed-by: Christophe Leroy 


---
  arch/powerpc/include/asm/book3s/64/pgtable.h | 2 +-
  arch/powerpc/kernel/setup_64.c   | 3 ++-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h

index a666d561b44d..54a06129794b 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -325,8 +325,8 @@ extern unsigned long pci_io_base;
  #define  PHB_IO_END    (KERN_IO_START + FULL_IO_SIZE)
  #define IOREMAP_BASE    (PHB_IO_END)
  #define IOREMAP_START    (ioremap_bot)
-#define IOREMAP_END    (KERN_IO_END - FIXADDR_SIZE)
  #define FIXADDR_SIZE    SZ_32M
+#define IOREMAP_END    (KERN_IO_END - FIXADDR_SIZE)
  /* Advertise special mapping type for AGP */
  #define HAVE_PAGE_AGP
diff --git a/arch/powerpc/kernel/setup_64.c 
b/arch/powerpc/kernel/setup_64.c

index b779d25761cf..ce09fe5debf4 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -369,11 +369,12 @@ void __init early_setup(unsigned long dt_ptr)
  apply_feature_fixups();
  setup_feature_keys();
-    early_ioremap_setup();
  /* Initialize the hash table or TLB handling */
  early_init_mmu();
+    early_ioremap_setup();
+
  /*
   * After firmware and early platform setup code has set things up,
   * we note the SPR values for configurable control/performance



--
Alexey


Re: [RFC PATCH kernel] powerpc: Fix early setup to make early_ioremap work

2021-05-19 Thread Alexey Kardashevskiy

Hm, my thunderbird says it is not cc:'ed but git send-email says it did cc:


Server: localhost
MAIL FROM:
RCPT TO:
RCPT TO:
RCPT TO:
RCPT TO:
From: Alexey Kardashevskiy 
To: linuxppc-dev@lists.ozlabs.org
Cc: Alexey Kardashevskiy ,
Michael Ellerman ,
Christophe Leroy 
Subject: [RFC PATCH kernel] powerpc: Fix early setup to make 
early_ioremap work



Not sure what to believe.


On 20/05/2021 13:29, Alexey Kardashevskiy wrote:

The immediate problem is that after
0bd3f9e953bd ("powerpc/legacy_serial: Use early_ioremap()")
the kernel silently reboots. The reason is that early_ioremap() returns
broken addresses as it uses the slot_virt[] array which is initialized with
offsets from FIXADDR_TOP == IOREMAP_END + FIXADDR_SIZE ==
KERN_IO_END - FIXADDR_SIZE + FIXADDR_SIZE == __kernel_io_end, which is 0
when early_ioremap_setup() is called. __kernel_io_end is initialized
a little bit later in early_init_mmu().

This fixes the initialization by swapping early_ioremap_setup and
early_init_mmu.

This also fixes IOREMAP_END to use FIXADDR_SIZE defined just next to it,
which seems to make sense, unless there is some weird logic that redefines
FIXADDR_SIZE as compilation proceeds.

Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/include/asm/book3s/64/pgtable.h | 2 +-
  arch/powerpc/kernel/setup_64.c   | 3 ++-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index a666d561b44d..54a06129794b 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -325,8 +325,8 @@ extern unsigned long pci_io_base;
  #define  PHB_IO_END   (KERN_IO_START + FULL_IO_SIZE)
  #define IOREMAP_BASE  (PHB_IO_END)
  #define IOREMAP_START (ioremap_bot)
-#define IOREMAP_END	(KERN_IO_END - FIXADDR_SIZE)
 #define FIXADDR_SIZE	SZ_32M
+#define IOREMAP_END	(KERN_IO_END - FIXADDR_SIZE)
  
  /* Advertise special mapping type for AGP */

  #define HAVE_PAGE_AGP
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index b779d25761cf..ce09fe5debf4 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -369,11 +369,12 @@ void __init early_setup(unsigned long dt_ptr)
apply_feature_fixups();
setup_feature_keys();
  
-	early_ioremap_setup();
  
  	/* Initialize the hash table or TLB handling */

early_init_mmu();
  
+	early_ioremap_setup();

+
/*
 * After firmware and early platform setup code has set things up,
 * we note the SPR values for configurable control/performance



--
Alexey


[RFC PATCH kernel] powerpc: Fix early setup to make early_ioremap work

2021-05-19 Thread Alexey Kardashevskiy
The immediate problem is that after
0bd3f9e953bd ("powerpc/legacy_serial: Use early_ioremap()")
the kernel silently reboots. The reason is that early_ioremap() returns
broken addresses as it uses the slot_virt[] array which is initialized with
offsets from FIXADDR_TOP == IOREMAP_END + FIXADDR_SIZE ==
KERN_IO_END - FIXADDR_SIZE + FIXADDR_SIZE == __kernel_io_end, which is 0
when early_ioremap_setup() is called. __kernel_io_end is initialized
a little bit later in early_init_mmu().

This fixes the initialization by swapping early_ioremap_setup and
early_init_mmu.

This also fixes IOREMAP_END to use FIXADDR_SIZE defined just next to it,
which seems to make sense, unless there is some weird logic that redefines
FIXADDR_SIZE as compilation proceeds.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 2 +-
 arch/powerpc/kernel/setup_64.c   | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index a666d561b44d..54a06129794b 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -325,8 +325,8 @@ extern unsigned long pci_io_base;
 #define  PHB_IO_END	(KERN_IO_START + FULL_IO_SIZE)
 #define IOREMAP_BASE	(PHB_IO_END)
 #define IOREMAP_START	(ioremap_bot)
-#define IOREMAP_END	(KERN_IO_END - FIXADDR_SIZE)
 #define FIXADDR_SIZE	SZ_32M
+#define IOREMAP_END	(KERN_IO_END - FIXADDR_SIZE)
 
 /* Advertise special mapping type for AGP */
 #define HAVE_PAGE_AGP
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index b779d25761cf..ce09fe5debf4 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -369,11 +369,12 @@ void __init early_setup(unsigned long dt_ptr)
apply_feature_fixups();
setup_feature_keys();
 
-   early_ioremap_setup();
 
/* Initialize the hash table or TLB handling */
early_init_mmu();
 
+   early_ioremap_setup();
+
/*
 * After firmware and early platform setup code has set things up,
 * we note the SPR values for configurable control/performance
-- 
2.30.2



Re: [PATCH v2 2/2] powerpc/legacy_serial: Use early_ioremap()

2021-05-19 Thread Alexey Kardashevskiy




On 20/04/2021 23:32, Christophe Leroy wrote:

From: Christophe Leroy 

[0.00] ioremap() called early from 
find_legacy_serial_ports+0x3cc/0x474. Use early_ioremap() instead

find_legacy_serial_ports() is called early from setup_arch(), before
paging_init(). vmalloc is not available yet, ioremap shouldn't be
used that early.

Use early_ioremap() and switch to a regular ioremap() later.

Signed-off-by: Christophe Leroy 
Signed-off-by: Christophe Leroy 


My POWER9 box silently reboots with the upstream kernel which has this.

This hunk:

diff --git a/arch/powerpc/kernel/legacy_serial.c 
b/arch/powerpc/kernel/legacy_serial.c

index f061e06e9f51..6bdb3f5f64e3 100644
--- a/arch/powerpc/kernel/legacy_serial.c
+++ b/arch/powerpc/kernel/legacy_serial.c
@@ -336,6 +336,16 @@ static void __init setup_legacy_serial_console(int console)
 
 	if (addr == NULL)
 		return;
 	udbg_uart_init_mmio(addr, stride);
+
+
+	{
+		void *ea = early_ioremap(info->taddr, 0x1000);
+		pr_err("___K___ (%u) %s %u: ior=%lx early=%lx\n",
+			smp_processor_id(), __func__, __LINE__,
+			(unsigned long) addr, (unsigned long) ea);
+		early_iounmap(ea, 0x1000);
+	}
+


produced:

[0.00] ___K___ (0) setup_legacy_serial_console 345: 
ior=c00a83f8 early=ffc003f8 




The early address just does not look right - ffc003f8. Do you 
have a quick idea of what exactly is wrong, before I wake up and dig more? 
:)  It is powernv_defconfig. Thanks,





---
  arch/powerpc/kernel/legacy_serial.c | 33 +
  1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/legacy_serial.c 
b/arch/powerpc/kernel/legacy_serial.c
index f061e06e9f51..8b2c1a8553a0 100644
--- a/arch/powerpc/kernel/legacy_serial.c
+++ b/arch/powerpc/kernel/legacy_serial.c
@@ -15,6 +15,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #undef DEBUG
  
@@ -34,6 +35,7 @@ static struct legacy_serial_info {
 	unsigned int		clock;
 	int			irq_check_parent;
 	phys_addr_t		taddr;
+	void __iomem		*early_addr;
 } legacy_serial_infos[MAX_LEGACY_SERIAL_PORTS];
  
  static const struct of_device_id legacy_serial_parents[] __initconst = {

@@ -325,17 +327,16 @@ static void __init setup_legacy_serial_console(int console)
 {
 	struct legacy_serial_info *info = &legacy_serial_infos[console];
 	struct plat_serial8250_port *port = &legacy_serial_ports[console];
-   void __iomem *addr;
unsigned int stride;
  
  	stride = 1 << port->regshift;
  
  	/* Check if a translated MMIO address has been found */

if (info->taddr) {
-   addr = ioremap(info->taddr, 0x1000);
-   if (addr == NULL)
+   info->early_addr = early_ioremap(info->taddr, 0x1000);
+   if (info->early_addr == NULL)
return;
-   udbg_uart_init_mmio(addr, stride);
+   udbg_uart_init_mmio(info->early_addr, stride);
} else {
/* Check if it's PIO and we support untranslated PIO */
if (port->iotype == UPIO_PORT && isa_io_special)
@@ -353,6 +354,30 @@ static void __init setup_legacy_serial_console(int console)
udbg_uart_setup(info->speed, info->clock);
  }
  
+static int __init ioremap_legacy_serial_console(void)

+{
+	struct legacy_serial_info *info = &legacy_serial_infos[legacy_serial_console];
+	struct plat_serial8250_port *port = &legacy_serial_ports[legacy_serial_console];
+   void __iomem *vaddr;
+
+   if (legacy_serial_console < 0)
+   return 0;
+
+   if (!info->early_addr)
+   return 0;
+
+   vaddr = ioremap(info->taddr, 0x1000);
+   if (WARN_ON(!vaddr))
+   return -ENOMEM;
+
+   udbg_uart_init_mmio(vaddr, 1 << port->regshift);
+   early_iounmap(info->early_addr, 0x1000);
+   info->early_addr = NULL;
+
+   return 0;
+}
+early_initcall(ioremap_legacy_serial_console);
+
  /*
   * This is called very early, as part of setup_system() or eventually
   * setup_arch(), basically before anything else in this file. This function



--
Alexey


Re: [PATCH kernel v3] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-05-16 Thread Alexey Kardashevskiy




On 5/14/21 18:46, Segher Boessenkool wrote:

Hi!

On Fri, May 14, 2021 at 11:42:32AM +0900, Masahiro Yamada wrote:

My best guess is that powerpc adds the endian flag to CPP
because of this line in arch/powerpc/kernel/vdso64/vdso64.lds.S

#ifdef __LITTLE_ENDIAN__
OUTPUT_FORMAT("elf64-powerpcle", "elf64-powerpcle", "elf64-powerpcle")
#else
OUTPUT_FORMAT("elf64-powerpc", "elf64-powerpc", "elf64-powerpc")
#endif


Which is equivalent to

#ifdef __LITTLE_ENDIAN__
OUTPUT_FORMAT("elf64-powerpcle")
#else
OUTPUT_FORMAT("elf64-powerpc")
#endif

so please change that at the same time if you touch this :-)


"If you touch this" approach did not work well with this patch so sorry 
but no ;)


and for a separate patch, I'll have to dig to find out since when they have 
been equivalent, do you know?






__LITTLE_ENDIAN__  is defined by powerpc gcc and clang.


This predefined macro is required by the newer ABIs, but all older


That's good so I'll stick to it.


compilers have it as well.  _LITTLE_ENDIAN is not supported on all
platforms (but it is if your compiler targets Linux, which you cannot
necessarily rely on).  These macros are PowerPC-specific.

For GCC, for all targets, you can say
   #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
You do not need any of the other *ORDER__ macros in most cases.
See "info cpp" for the sordid details.


[2] powerpc-linux-gnu-gcc + -mlittle-endian-> __LITTLE_ENDIAN__ is defined


You can just write -mbig and -mlittle btw.  Those aren't available on
all targets, but neither are the long-winded -m{big,little}-endian
option names.  Pet peeve, I know :-)


I am looking for the same guarantees across modern enough gcc and clang, and 
I am not sure all of the above is valid for clang 10.0.something (or 
whatever we say we support) ;)



--
Alexey


[PATCH kernel] powerpc/makefile: Remove flag duplicates when generating vdso linker scripts

2021-05-13 Thread Alexey Kardashevskiy
The cmd_cpp_lds_S rule already has -P and -U$(ARCH) so there is no need
in duplicating these, clean that up. Since only -C is left and
scripts/Makefile.build have -C removed since
commit 5cb0512c02ec ("Kbuild: don't pass "-C" to preprocessor when processing 
linker scripts")
this follows the lead and removes CPPFLAGS_vdso(32|64).lds altogether.

Signed-off-by: Alexey Kardashevskiy 
---

scripts/checkpatch.pl complains as it does not handle quotes in
the commit subject line well. oh well.

---
 arch/powerpc/kernel/vdso32/Makefile | 1 -
 arch/powerpc/kernel/vdso64/Makefile | 1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/powerpc/kernel/vdso32/Makefile 
b/arch/powerpc/kernel/vdso32/Makefile
index 7d9a6fee0e3d..7420e88d5aa3 100644
--- a/arch/powerpc/kernel/vdso32/Makefile
+++ b/arch/powerpc/kernel/vdso32/Makefile
@@ -44,7 +44,6 @@ asflags-y := -D__VDSO32__ -s
 
 obj-y += vdso32_wrapper.o
 targets += vdso32.lds
-CPPFLAGS_vdso32.lds += -P -C -Upowerpc
 
 # link rule for the .so file, .lds has to be first
 $(obj)/vdso32.so.dbg: $(src)/vdso32.lds $(obj-vdso32) $(obj)/vgettimeofday.o 
FORCE
diff --git a/arch/powerpc/kernel/vdso64/Makefile 
b/arch/powerpc/kernel/vdso64/Makefile
index 2813e3f98db6..fb118630c334 100644
--- a/arch/powerpc/kernel/vdso64/Makefile
+++ b/arch/powerpc/kernel/vdso64/Makefile
@@ -30,7 +30,6 @@ ccflags-y := -shared -fno-common -fno-builtin -nostdlib \
 asflags-y := -D__VDSO64__ -s
 
 targets += vdso64.lds
-CPPFLAGS_vdso64.lds += -P -C -U$(ARCH)
 
 # link rule for the .so file, .lds has to be first
 $(obj)/vdso64.so.dbg: $(src)/vdso64.lds $(obj-vdso64) $(obj)/vgettimeofday.o 
FORCE
-- 
2.30.2



Re: [PATCH kernel v3] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-05-13 Thread Alexey Kardashevskiy




On 14/05/2021 12:42, Masahiro Yamada wrote:

On Fri, May 14, 2021 at 3:59 AM Nathan Chancellor  wrote:


On 5/13/2021 4:59 AM, Alexey Kardashevskiy wrote:

The $(CPP) (do only preprocessing) macro is already defined in Makefile.
However POWERPC redefines it and adds $(KBUILD_CFLAGS) which results
in flags duplication. Which is not a big deal by itself except for
the flags which depend on other flags and the compiler checks them
as it parses the command line.

Specifically, scripts/Makefile.build:304 generates ksyms for .S files.
If clang+llvm+sanitizer are enabled, this results in

-emit-llvm-bc -fno-lto -flto -fvisibility=hidden \
   -fsanitize=cfi-mfcall -fno-lto  ...

in the clang command line and triggers error:


I do not know how to reproduce this for powerpc.
Currently, only x86 and arm64 select
ARCH_SUPPORTS_LTO_CLANG.

Is this a fix for a potential issue?


Yeah, it is work in progress to enable LTO_CLANG for PPC64:

https://github.com/aik/linux/commits/lto








clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed with 
'-flto'

This removes the unnecessary CPP redefinition, which works fine as in most
places KBUILD_CFLAGS is passed to $CPP, except
arch/powerpc/kernel/vdso64/vdso(32|64).lds. To fix vdso, this does:
1. add -m(big|little)-endian to $CPP
2. add target to $KBUILD_CPPFLAGS as otherwise clang ignores 
-m(big|little)-endian if
the building platform does not support big endian (such as x86).

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v3:
* moved vdso cleanup in a separate patch
* only add target to KBUILD_CPPFLAGS for CLANG

v2:
* fix KBUILD_CPPFLAGS
* add CLANG_FLAGS to CPPFLAGS
---
   Makefile  | 1 +
   arch/powerpc/Makefile | 3 ++-
   2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 15b6476d0f89..5b545bef7653 100644
--- a/Makefile
+++ b/Makefile
@@ -576,6 +576,7 @@ CC_VERSION_TEXT = $(subst $(pound),,$(shell $(CC) --version 
2>/dev/null | head -
   ifneq ($(findstring clang,$(CC_VERSION_TEXT)),)
   ifneq ($(CROSS_COMPILE),)
   CLANG_FLAGS += --target=$(notdir $(CROSS_COMPILE:%-=%))
+KBUILD_CPPFLAGS  += --target=$(notdir $(CROSS_COMPILE:%-=%))


You can avoid the duplication here by just doing:

KBUILD_CPPFLAGS += $(CLANG_FLAGS)

I am still not super happy about the flag duplication but I am not sure
I can think of a better solution. If KBUILD_CPPFLAGS are always included
when building .o files, maybe we should just add $(CLANG_FLAGS) to
KBUILD_CPPFLAGS instead of KBUILD_CFLAGS?


Hmm, I think including --target=* in CPP flags is sensible,
but not all CLANG_FLAGS are CPP flags.
At least, -(no)-integrated-as is not a CPP flag.

We could introduce a separate CLANG_CPP_FLAGS, but
it would require more code changes...

So, I do not have a strong opinion either way.



BTW, another approach might be to modify the linker script.


My best guess is that powerpc adds the endian flag to CPP
because of this line in arch/powerpc/kernel/vdso64/vdso64.lds.S

#ifdef __LITTLE_ENDIAN__
OUTPUT_FORMAT("elf64-powerpcle", "elf64-powerpcle", "elf64-powerpcle")
#else
OUTPUT_FORMAT("elf64-powerpc", "elf64-powerpc", "elf64-powerpc")
#endif


You can use the CONFIG option to check the endian-ness.

#ifdef CONFIG_CPU_BIG_ENDIAN
OUTPUT_FORMAT("elf64-powerpc", "elf64-powerpc", "elf64-powerpc")
#else
OUTPUT_FORMAT("elf64-powerpcle", "elf64-powerpcle", "elf64-powerpcle")
#endif


All the big endian arches define CONFIG_CPU_BIG_ENDIAN.
(but not all little endian arches define CONFIG_CPU_LITTLE_ENDIAN)



This should work with .lds. But missing --target=* might still hit us 
somewhere else later: these .lds files each include 3 headers, and there 
might be endianness-dependent stuff in them.






So,
#ifdef CONFIG_CPU_BIG_ENDIAN
< big endian code >
#else
   < little endian code >
#endif

works for all architectures.


Only the exception is you cannot replace the one in uapi headers.
   arch/powerpc/include/uapi/asm/byteorder.h: #ifdef __LITTLE_ENDIAN__
since it is exported to userspace, where CONFIG options are not available.



BTW, various flags are historically used.

  -  CONFIG_CPU_BIG_ENDIAN   /  CONFIG_CPU_LITTLE_ENDIAN
  -  __BIG_ENDIAN   / __LITTLE_ENDIAN
  -  __LITTLE_ENDIAN__ (powerpc only)



__LITTLE_ENDIAN__  is defined by powerpc gcc and clang.

My experiments...


[1] powerpc-linux-gnu-gcc-> __BIG_ENDIAN__ is defined

masahiro@grover:~$ echo | powerpc-linux-gnu-gcc -E  -dM -x c - | grep ENDIAN
#define __ORDER_LITTLE_ENDIAN__ 1234
#define __BIG_ENDIAN__ 1
#define __FLOAT_WORD_ORDER__ __ORDER_BIG_ENDIAN__
#define __ORDER_PDP_ENDIAN__ 3412
#define _BIG_ENDIAN 1
#define __BYTE_ORDER__ __ORDER_BIG_ENDIAN__
#define __VEC_ELEMENT_REG_ORDER__ __ORDER_BIG_ENDIAN__
#define __ORDER_BIG_ENDIAN__ 4321


[2] powerpc-linux-gnu-gcc + -mlittle-endian-> __LITTLE_ENDIAN__ is defined


Re: [PATCH kernel v3] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-05-13 Thread Alexey Kardashevskiy




On 14/05/2021 04:59, Nathan Chancellor wrote:

On 5/13/2021 4:59 AM, Alexey Kardashevskiy wrote:

The $(CPP) (do only preprocessing) macro is already defined in Makefile.
However POWERPC redefines it and adds $(KBUILD_CFLAGS) which results
in flags duplication. Which is not a big deal by itself except for
the flags which depend on other flags and the compiler checks them
as it parses the command line.

Specifically, scripts/Makefile.build:304 generates ksyms for .S files.
If clang+llvm+sanitizer are enabled, this results in

-emit-llvm-bc -fno-lto -flto -fvisibility=hidden \
  -fsanitize=cfi-mfcall -fno-lto  ...

in the clang command line and triggers error:

clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed 
with '-flto'


This removes the unnecessary CPP redefinition, which works fine as in most
places KBUILD_CFLAGS is passed to $CPP, except
arch/powerpc/kernel/vdso64/vdso(32|64).lds. To fix vdso, this does:
1. add -m(big|little)-endian to $CPP
2. add target to $KBUILD_CPPFLAGS as otherwise clang ignores 
-m(big|little)-endian if

the building platform does not support big endian (such as x86).

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v3:
* moved vdso cleanup in a separate patch
* only add target to KBUILD_CPPFLAGS for CLANG

v2:
* fix KBUILD_CPPFLAGS
* add CLANG_FLAGS to CPPFLAGS
---
  Makefile  | 1 +
  arch/powerpc/Makefile | 3 ++-
  2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 15b6476d0f89..5b545bef7653 100644
--- a/Makefile
+++ b/Makefile
@@ -576,6 +576,7 @@ CC_VERSION_TEXT = $(subst $(pound),,$(shell $(CC) 
--version 2>/dev/null | head -

  ifneq ($(findstring clang,$(CC_VERSION_TEXT)),)
  ifneq ($(CROSS_COMPILE),)
  CLANG_FLAGS    += --target=$(notdir $(CROSS_COMPILE:%-=%))
+KBUILD_CPPFLAGS    += --target=$(notdir $(CROSS_COMPILE:%-=%))


You can avoid the duplication here by just doing:

KBUILD_CPPFLAGS    += $(CLANG_FLAGS)


This has the potential of duplicating even more flags, which is exactly what 
I am trying to avoid here.



I am still not super happy about the flag duplication but I am not sure 
I can think of a better solution. If KBUILD_CPPFLAGS are always included 
when building .o files,



My understanding is that KBUILD_CPPFLAGS should not be added for .o files. Who 
actually knows or decides for sure what CPPFLAGS are for? :)



maybe we should just add $(CLANG_FLAGS) to 
KBUILD_CPPFLAGS instead of KBUILD_CFLAGS?



  endif
  ifeq ($(LLVM_IAS),1)
  CLANG_FLAGS    += -integrated-as
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 3212d076ac6a..306bfd2797ad 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -76,6 +76,7 @@ endif
  ifdef CONFIG_CPU_LITTLE_ENDIAN
  KBUILD_CFLAGS    += -mlittle-endian
+KBUILD_CPPFLAGS    += -mlittle-endian
  KBUILD_LDFLAGS    += -EL
  LDEMULATION    := lppc
  GNUTARGET    := powerpcle
@@ -83,6 +84,7 @@ MULTIPLEWORD    := -mno-multiple
  KBUILD_CFLAGS_MODULE += $(call cc-option,-mno-save-toc-indirect)
  else
  KBUILD_CFLAGS += $(call cc-option,-mbig-endian)
+KBUILD_CPPFLAGS += $(call cc-option,-mbig-endian)
  KBUILD_LDFLAGS    += -EB
  LDEMULATION    := ppc
  GNUTARGET    := powerpc
@@ -208,7 +210,6 @@ KBUILD_CPPFLAGS    += -I $(srctree)/arch/$(ARCH) 
$(asinstr)

  KBUILD_AFLAGS    += $(AFLAGS-y)
  KBUILD_CFLAGS    += $(call cc-option,-msoft-float)
  KBUILD_CFLAGS    += -pipe $(CFLAGS-y)
-CPP    = $(CC) -E $(KBUILD_CFLAGS)
  CHECKFLAGS    += -m$(BITS) -D__powerpc__ -D__powerpc$(BITS)__
  ifdef CONFIG_CPU_BIG_ENDIAN





--
Alexey


[PATCH kernel v3] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-05-13 Thread Alexey Kardashevskiy
The $(CPP) (do only preprocessing) macro is already defined in Makefile.
However POWERPC redefines it and adds $(KBUILD_CFLAGS) which results
in flags duplication. Which is not a big deal by itself except for
the flags which depend on other flags and the compiler checks them
as it parses the command line.

Specifically, scripts/Makefile.build:304 generates ksyms for .S files.
If clang+llvm+sanitizer are enabled, this results in

-emit-llvm-bc -fno-lto -flto -fvisibility=hidden \
 -fsanitize=cfi-mfcall -fno-lto  ...

in the clang command line and triggers error:

clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed with 
'-flto'

This removes the unnecessary CPP redefinition, which works fine as in most
places KBUILD_CFLAGS is passed to $CPP, except
arch/powerpc/kernel/vdso64/vdso(32|64).lds. To fix vdso, this does:
1. add -m(big|little)-endian to $CPP
2. add target to $KBUILD_CPPFLAGS as otherwise clang ignores 
-m(big|little)-endian if
the building platform does not support big endian (such as x86).

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v3:
* moved vdso cleanup in a separate patch
* only add target to KBUILD_CPPFLAGS for CLANG

v2:
* fix KBUILD_CPPFLAGS
* add CLANG_FLAGS to CPPFLAGS
---
 Makefile  | 1 +
 arch/powerpc/Makefile | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 15b6476d0f89..5b545bef7653 100644
--- a/Makefile
+++ b/Makefile
@@ -576,6 +576,7 @@ CC_VERSION_TEXT = $(subst $(pound),,$(shell $(CC) --version 
2>/dev/null | head -
 ifneq ($(findstring clang,$(CC_VERSION_TEXT)),)
 ifneq ($(CROSS_COMPILE),)
 CLANG_FLAGS+= --target=$(notdir $(CROSS_COMPILE:%-=%))
+KBUILD_CPPFLAGS+= --target=$(notdir $(CROSS_COMPILE:%-=%))
 endif
 ifeq ($(LLVM_IAS),1)
 CLANG_FLAGS+= -integrated-as
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 3212d076ac6a..306bfd2797ad 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -76,6 +76,7 @@ endif
 
 ifdef CONFIG_CPU_LITTLE_ENDIAN
 KBUILD_CFLAGS  += -mlittle-endian
+KBUILD_CPPFLAGS+= -mlittle-endian
 KBUILD_LDFLAGS += -EL
 LDEMULATION:= lppc
 GNUTARGET  := powerpcle
@@ -83,6 +84,7 @@ MULTIPLEWORD  := -mno-multiple
 KBUILD_CFLAGS_MODULE += $(call cc-option,-mno-save-toc-indirect)
 else
 KBUILD_CFLAGS += $(call cc-option,-mbig-endian)
+KBUILD_CPPFLAGS += $(call cc-option,-mbig-endian)
 KBUILD_LDFLAGS += -EB
 LDEMULATION:= ppc
 GNUTARGET  := powerpc
@@ -208,7 +210,6 @@ KBUILD_CPPFLAGS += -I $(srctree)/arch/$(ARCH) $(asinstr)
 KBUILD_AFLAGS  += $(AFLAGS-y)
 KBUILD_CFLAGS  += $(call cc-option,-msoft-float)
 KBUILD_CFLAGS  += -pipe $(CFLAGS-y)
-CPP= $(CC) -E $(KBUILD_CFLAGS)
 
 CHECKFLAGS += -m$(BITS) -D__powerpc__ -D__powerpc$(BITS)__
 ifdef CONFIG_CPU_BIG_ENDIAN
-- 
2.30.2



Re: [RFC 01/10] powerpc/rtas: new APIs for busy and extended delay statuses

2021-05-13 Thread Alexey Kardashevskiy




On 04/05/2021 13:03, Nathan Lynch wrote:

Add new APIs for handling busy (-2) and extended delay
hint (9900...9905) statuses from RTAS. These are intended to be
drop-in replacements for existing uses of rtas_busy_delay().

A problem with rtas_busy_delay() and rtas_busy_delay_time() is that
they consider -2/busy to be equivalent to 9900 (wait 1ms). In fact,
the OS should call again as soon as it wants on -2, which at least on
PowerVM means RTAS is returning only to uphold the general requirement
that RTAS must return control to the OS in a "timely fashion" (250us).

Combine this with the fact that msleep(1) actually sleeps for more
like 20ms in practice: on busy VMs we schedule away for much longer
than necessary on -2 and 9900.

This is fixed in rtas_sched_if_busy(), which uses usleep_range() for
small delay hints, and only schedules away on -2 if there is other
work available. It also refuses to sleep longer than one second
regardless of the hinted value, on the assumption that even longer
running operations can tolerate polling at 1HZ.

rtas_spin_if_busy() and rtas_force_spin_if_busy() are provided for
atomic contexts which need to handle busy status and extended delay
hints.
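
For background on the 990x values (not part of the patch; my summary of the
PAPR convention the code relies on): status 9900+x hints a delay of 10^x
milliseconds, so the one-second cap described above could be sketched as:

	/* hypothetical helper: decode a 990x hint, capped at one second */
	static unsigned int rtas_extended_delay_ms(int status)
	{
		unsigned int ms = 1;
		int order = status - RTAS_EXTENDED_DELAY_MIN;	/* 0..5 */

		while (order-- > 0)
			ms *= 10;
		return min(ms, 1000U);
	}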

Signed-off-by: Nathan Lynch 
---
  arch/powerpc/include/asm/rtas.h |   4 +
  arch/powerpc/kernel/rtas.c  | 168 
  2 files changed, 172 insertions(+)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 9dc97d2f9d27..555ff3290f92 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -266,6 +266,10 @@ extern int rtas_set_rtc_time(struct rtc_time *rtc_time);
  extern unsigned int rtas_busy_delay_time(int status);
  extern unsigned int rtas_busy_delay(int status);
  
+bool rtas_sched_if_busy(int status);

+bool rtas_spin_if_busy(int status);
+bool rtas_force_spin_if_busy(int status);
+
  extern int early_init_dt_scan_rtas(unsigned long node,
const char *uname, int depth, void *data);
  
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c

index 6bada744402b..4a1dfbfa51ba 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -519,6 +519,174 @@ unsigned int rtas_busy_delay(int status)
  }
  EXPORT_SYMBOL(rtas_busy_delay);
  
+/**

+ * rtas_force_spin_if_busy() - Consume a busy or extended delay status
+ * in atomic context.
+ * @status: Return value from rtas_call() or similar function.
+ *
+ * Use this function when you cannot avoid using an RTAS function
+ * which may return an extended delay hint in atomic context. If
+ * possible, use rtas_spin_if_busy() or rtas_sched_if_busy() instead
+ * of this function.
+ *
+ * Return: True if @status is -2 or 990x, in which case
+ * rtas_spin_if_busy() will have delayed an appropriate amount
+ * of time, and the caller should call the RTAS function
+ * again. False otherwise.
+ */
+bool rtas_force_spin_if_busy(int status)


rtas_force_delay_if_busy()? Neither this one nor rtas_spin_if_busy() 
actually spins.




+{
+   bool was_busy = true;
+
+   switch (status) {
+   case RTAS_BUSY:
+   /* OK to call again immediately; do nothing. */
+   break;
+   case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
+   mdelay(1);
+   break;
+   default:
+   was_busy = false;
+   break;
+   }
+
+   return was_busy;
+}
+
+/**
+ * rtas_spin_if_busy() - Consume a busy status in atomic context.
+ * @status: Return value from rtas_call() or similar function.
+ *
+ * Prefer rtas_sched_if_busy() over this function. Prefer this
+ * function over rtas_force_spin_if_busy(). Use this function in
+ * atomic contexts with RTAS calls that are specified to return -2 but
+ * not 990x. This function will complain and execute a minimal delay
+ * if passed a 990x status.
+ *
+ * Return: True if @status is -2 or 990x, in which case
+ * rtas_spin_if_busy() will have delayed an appropriate amount
+ * of time, and the caller should call the RTAS function
+ * again. False otherwise.
+ */
+bool rtas_spin_if_busy(int status)


rtas_delay_if_busy()?



+{
+   bool was_busy = true;
+
+   switch (status) {
+   case RTAS_BUSY:
+   /* OK to call again immediately; do nothing. */
+   break;
+   case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
+   /*
+* Generally, RTAS functions which can return this
+* status should be considered too expensive to use in
+* atomic context. Change the calling code to use
+* rtas_sched_if_busy(), or if that's not possible,
+* use rtas_force_spin_if_busy().
+*/
+		pr_warn_once("%pS may use RTAS call in atomic context which returns extended delay.\n",
+			     __builtin_return_address(0));
+
+

Re: [PATCH kernel v2] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-05-11 Thread Alexey Kardashevskiy



On 11 May 2021 21:24:55 Segher Boessenkool  wrote:


Hi!

On Tue, May 11, 2021 at 02:48:12PM +1000, Alexey Kardashevskiy wrote:

--- a/arch/powerpc/kernel/vdso32/Makefile
+++ b/arch/powerpc/kernel/vdso32/Makefile
@@ -44,7 +44,7 @@ asflags-y := -D__VDSO32__ -s

obj-y += vdso32_wrapper.o
targets += vdso32.lds
-CPPFLAGS_vdso32.lds += -P -C -Upowerpc
+CPPFLAGS_vdso32.lds += -C

# link rule for the .so file, .lds has to be first
$(obj)/vdso32.so.dbg: $(src)/vdso32.lds $(obj-vdso32) 
$(obj)/vgettimeofday.o FORCE



--- a/arch/powerpc/kernel/vdso64/Makefile
+++ b/arch/powerpc/kernel/vdso64/Makefile
@@ -30,7 +30,7 @@ ccflags-y := -shared -fno-common -fno-builtin -nostdlib \
asflags-y := -D__VDSO64__ -s

targets += vdso64.lds
-CPPFLAGS_vdso64.lds += -P -C -U$(ARCH)
+CPPFLAGS_vdso64.lds += -C

# link rule for the .so file, .lds has to be first
$(obj)/vdso64.so.dbg: $(src)/vdso64.lds $(obj-vdso64) 
$(obj)/vgettimeofday.o FORCE


Why are you removing -P and -Upowerpc here?  "powerpc" is a predefined
macro on powerpc-linux (no underscores or anything, just the bareword).
This is historical, like "unix" and "linux".  If you use the C
preprocessor for things that are not C code (like the kernel does here)
you need to undefine these macros, if anything in the files you run
through the preprocessor contains those words, or funny / strange / bad
things will happen.  Presumably at some time in the past it did contain
"powerpc" somewhere.

-P is to inhibit line number output.  Whatever consumes the
preprocessor output will have to handle line directives if you remove
this flag.  Did you check if this will work for everything that uses
$(CPP)?


I don't know about everything for sure, but I checked a few configs and in all 
cases (except vdso) $CPP was receiving cflags.



In any case, please mention the reasoning (and the fact that you are
removing these flags!) in the commit message.  Thanks!



But I did mention this, in the last paragraph... they are duplicated.




Segher




Re: [PATCH kernel v2] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-05-11 Thread Alexey Kardashevskiy




On 5/12/21 05:18, Nathan Chancellor wrote:

On 5/10/2021 9:48 PM, Alexey Kardashevskiy wrote:

The $(CPP) (do only preprocessing) macro is already defined in Makefile.
However POWERPC redefines it and adds $(KBUILD_CFLAGS) which results
in flags duplication. Which is not a big deal by itself except for
the flags which depend on other flags and the compiler checks them
as it parses the command line.

Specifically, scripts/Makefile.build:304 generates ksyms for .S files.
If clang+llvm+sanitizer are enabled, this results in

-emit-llvm-bc -fno-lto -flto -fvisibility=hidden \
  -fsanitize=cfi-mfcall -fno-lto  ...

in the clang command line and triggers error:

clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed 
with '-flto'


This removes the unnecessary CPP redefinition, which works fine as in most
places KBUILD_CFLAGS is passed to $CPP, except
arch/powerpc/kernel/vdso64/vdso(32|64).lds (and probably some others,
not yet detected). To fix vdso, we do:
1. explicitly add -m(big|little)-endian to $CPP
2. (for clang) add $CLANG_FLAGS to $KBUILD_CPPFLAGS as otherwise clang
silently ignores -m(big|little)-endian if the building platform does not
support big endian (such as x86) so --prefix= is required.

While at it, remove some duplication from CPPFLAGS_vdso(32|64)
as cmd_cpp_lds_S has them anyway. It still puzzles me why we need the -C
(preserve comments in the preprocessor output) flag here.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* fix KBUILD_CPPFLAGS
* add CLANG_FLAGS to CPPFLAGS
---
  Makefile    | 1 +
  arch/powerpc/Makefile   | 3 ++-
  arch/powerpc/kernel/vdso32/Makefile | 2 +-
  arch/powerpc/kernel/vdso64/Makefile | 2 +-
  4 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index 72af8e423f11..13acd2183d55 100644
--- a/Makefile
+++ b/Makefile
@@ -591,6 +591,7 @@ CLANG_FLAGS    += 
--prefix=$(GCC_TOOLCHAIN_DIR)$(notdir $(CROSS_COMPILE))

  endif
  CLANG_FLAGS    += -Werror=unknown-warning-option
  KBUILD_CFLAGS    += $(CLANG_FLAGS)
+KBUILD_CPPFLAGS    += $(CLANG_FLAGS)


This is going to cause flag duplication, which would be nice to avoid. I 
do not know if we can get away with just adding $(CLANG_FLAGS) to 
KBUILD_CPPFLAGS instead of KBUILD_CFLAGS though. It seems like this 
assignment might be better in arch/powerpc/Makefile with the 
KBUILD_CPPFLAGS additions there.



It is a fair point about the duplication (which is eww, I often see 
-mbig-endian 3 - three - times) and I think I only need --prefix= there, 
but this is still exactly the place to do such a thing as it potentially 
affects all arches supporting both endiannesses (not many though, yeah). 
Thanks,







Cheers,
Nathan


  KBUILD_AFLAGS    += $(CLANG_FLAGS)
  export CLANG_FLAGS
  endif
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 3212d076ac6a..306bfd2797ad 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -76,6 +76,7 @@ endif
  ifdef CONFIG_CPU_LITTLE_ENDIAN
  KBUILD_CFLAGS    += -mlittle-endian
+KBUILD_CPPFLAGS    += -mlittle-endian
  KBUILD_LDFLAGS    += -EL
  LDEMULATION    := lppc
  GNUTARGET    := powerpcle
@@ -83,6 +84,7 @@ MULTIPLEWORD    := -mno-multiple
  KBUILD_CFLAGS_MODULE += $(call cc-option,-mno-save-toc-indirect)
  else
  KBUILD_CFLAGS += $(call cc-option,-mbig-endian)
+KBUILD_CPPFLAGS += $(call cc-option,-mbig-endian)
  KBUILD_LDFLAGS    += -EB
  LDEMULATION    := ppc
  GNUTARGET    := powerpc
@@ -208,7 +210,6 @@ KBUILD_CPPFLAGS    += -I $(srctree)/arch/$(ARCH) 
$(asinstr)

  KBUILD_AFLAGS    += $(AFLAGS-y)
  KBUILD_CFLAGS    += $(call cc-option,-msoft-float)
  KBUILD_CFLAGS    += -pipe $(CFLAGS-y)
-CPP    = $(CC) -E $(KBUILD_CFLAGS)
  CHECKFLAGS    += -m$(BITS) -D__powerpc__ -D__powerpc$(BITS)__
  ifdef CONFIG_CPU_BIG_ENDIAN
diff --git a/arch/powerpc/kernel/vdso32/Makefile 
b/arch/powerpc/kernel/vdso32/Makefile

index 7d9a6fee0e3d..ea001c6df1fa 100644
--- a/arch/powerpc/kernel/vdso32/Makefile
+++ b/arch/powerpc/kernel/vdso32/Makefile
@@ -44,7 +44,7 @@ asflags-y := -D__VDSO32__ -s
  obj-y += vdso32_wrapper.o
  targets += vdso32.lds
-CPPFLAGS_vdso32.lds += -P -C -Upowerpc
+CPPFLAGS_vdso32.lds += -C
  # link rule for the .so file, .lds has to be first
  $(obj)/vdso32.so.dbg: $(src)/vdso32.lds $(obj-vdso32) 
$(obj)/vgettimeofday.o FORCE
diff --git a/arch/powerpc/kernel/vdso64/Makefile 
b/arch/powerpc/kernel/vdso64/Makefile

index 2813e3f98db6..07eadba48c7a 100644
--- a/arch/powerpc/kernel/vdso64/Makefile
+++ b/arch/powerpc/kernel/vdso64/Makefile
@@ -30,7 +30,7 @@ ccflags-y := -shared -fno-common -fno-builtin 
-nostdlib \

  asflags-y := -D__VDSO64__ -s
  targets += vdso64.lds
-CPPFLAGS_vdso64.lds += -P -C -U$(ARCH)
+CPPFLAGS_vdso64.lds += -C
  # link rule for the .so file, .lds has to be first
  $(obj)/vdso64.so.dbg: $(src)/vdso64.lds $(obj-vdso64) 
$(obj)/vgettimeofday.o FORCE






--
Alexey


Re: [PATCH kernel v2] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-05-11 Thread Alexey Kardashevskiy




On 5/12/21 09:16, Segher Boessenkool wrote:

On Tue, May 11, 2021 at 11:30:17PM +1000, Alexey Kardashevskiy wrote:

In any case, please mention the reasoning (and the fact that you are
removing these flags!) in the commit message.  Thanks!


but i did mention this, the last paragraph... they are duplicated.


Oh!  I completely missed those few lines.  Sorry for that :-(


Well, I probably should have made it a separate patch anyway, I'll 
repost separately.




To compensate a bit:


It still puzzles me why we need -C
(preserve comments in the preprocessor output) flag here.


It is so that a human can look at the output and read it.  Comments are
very significant to human readers :-)


I seriously doubt anyone ever reads those :) I suspect this is to pull 
all the licenses into one place and do some checking, but I did not dig deep.



--
Alexey


Re: [PATCH v4 08/11] powerpc/pseries/iommu: Update remove_dma_window() to accept property name

2021-05-11 Thread Alexey Kardashevskiy




On 01/05/2021 02:31, Leonardo Bras wrote:

Update remove_dma_window() so it can be used to remove DDW with a given
property name.

This enables the creation of new property names for DDW, so we can
have different usage for it, like indirect mapping.

Signed-off-by: Leonardo Bras 



Reviewed-by: Alexey Kardashevskiy 



---
  arch/powerpc/platforms/pseries/iommu.c | 21 +++--
  1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 89cb6e9e9f31..f8922fcf34b6 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -823,31 +823,32 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
 		np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 }
 
-static void remove_ddw(struct device_node *np, bool remove_prop)
+static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_name)
  {
struct property *win;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
int ret = 0;
  
+	win = of_find_property(np, win_name, NULL);

+	if (!win)
+		return -EINVAL;
+
 	ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
 					 &ddw_avail[0], DDW_APPLICABLE_SIZE);
if (ret)
-   return;
-
-   win = of_find_property(np, DIRECT64_PROPNAME, NULL);
-   if (!win)
-   return;
+   return 0;
  
  	if (win->length >= sizeof(struct dynamic_dma_window_prop))

remove_dma_window(np, ddw_avail, win);
  
  	if (!remove_prop)

-   return;
+   return 0;
  
  	ret = of_remove_property(np, win);

if (ret)
pr_warn("%pOF: failed to remove direct window property: %d\n",
np, ret);
+   return 0;
  }
  
  static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)

@@ -900,7 +901,7 @@ static int find_existing_ddw_windows(void)
for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
 		direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
if (!direct64 || len < sizeof(*direct64)) {
-   remove_ddw(pdn, true);
+   remove_ddw(pdn, true, DIRECT64_PROPNAME);
continue;
}
  
@@ -1372,7 +1373,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)

win64 = NULL;
  
  out_remove_win:

-   remove_ddw(pdn, true);
+   remove_ddw(pdn, true, DIRECT64_PROPNAME);
  
  out_failed:

if (default_win_removed)
@@ -1536,7 +1537,7 @@ static int iommu_reconfig_notifier(struct notifier_block 
*nb, unsigned long acti
 * we have to remove the property when releasing
 * the device node.
 */
-   remove_ddw(np, false);
+   remove_ddw(np, false, DIRECT64_PROPNAME);
if (pci && pci->table_group)
iommu_pseries_free_group(pci->table_group,
np->full_name);



--
Alexey


Re: [PATCH v4 09/11] powerpc/pseries/iommu: Find existing DDW with given property name

2021-05-11 Thread Alexey Kardashevskiy




On 01/05/2021 02:31, Leonardo Bras wrote:

At the moment pseries stores information about a created directly-mapped
DDW window in DIRECT64_PROPNAME.

With the objective of implementing indirect DMA mapping with DDW, it's
necessary to have another property name to make sure kexec'ing into older
kernels does not break, as it would if we reused DIRECT64_PROPNAME.

In order to have this, find_existing_ddw_windows() needs to be able to
look for different property names.

Extract find_existing_ddw_windows() into find_existing_ddw_windows_named()
and call it with the current property name.

Signed-off-by: Leonardo Bras 
---
  arch/powerpc/platforms/pseries/iommu.c | 25 +++--
  1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index f8922fcf34b6..de54ddd9decd 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -888,24 +888,21 @@ static struct direct_window *ddw_list_new_entry(struct 
device_node *pdn,
return window;
  }
  
-static int find_existing_ddw_windows(void)

+static void find_existing_ddw_windows_named(const char *name)


I'd suggest find_existing_ddw_windows_by_name() but this is nitpicking.

Reviewed-by: Alexey Kardashevskiy 



  {
int len;
struct device_node *pdn;
struct direct_window *window;
-   const struct dynamic_dma_window_prop *direct64;
-
-   if (!firmware_has_feature(FW_FEATURE_LPAR))
-   return 0;
+   const struct dynamic_dma_window_prop *dma64;
  
-	for_each_node_with_property(pdn, DIRECT64_PROPNAME) {

-		direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
-		if (!direct64 || len < sizeof(*direct64)) {
-			remove_ddw(pdn, true, DIRECT64_PROPNAME);
+	for_each_node_with_property(pdn, name) {
+		dma64 = of_get_property(pdn, name, &len);
+		if (!dma64 || len < sizeof(*dma64)) {
+			remove_ddw(pdn, true, name);
 			continue;
 		}
  
-		window = ddw_list_new_entry(pdn, direct64);

+		window = ddw_list_new_entry(pdn, dma64);
 		if (!window)
 			break;
 
@@ -913,6 +910,14 @@ static int find_existing_ddw_windows(void)
 		list_add(&window->list, &direct_window_list);
 		spin_unlock(&direct_window_list_lock);
 	}
+}
+
+static int find_existing_ddw_windows(void)
+{
+   if (!firmware_has_feature(FW_FEATURE_LPAR))
+   return 0;
+
+   find_existing_ddw_windows_named(DIRECT64_PROPNAME);
  
  	return 0;

  }



--
Alexey


Re: [PATCH v4 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-05-11 Thread Alexey Kardashevskiy




On 01/05/2021 02:31, Leonardo Bras wrote:

So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes this as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

By using DDW, indirect mapping can get more TCEs than are available for the
default DMA window, and also get access to much larger pagesizes
(16MB as implemented in qemu vs 4k from the default DMA window), causing a
significant increase in the maximum amount of memory that can be IOMMU-mapped
at the same time.

Indirect mapping will only be used if direct mapping is not a
possibility.

For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.

Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.

Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel can assume direct mapping, and skip
iommu_alloc(), causing undesirable behavior.
So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info"
was created to represent a DDW that does not allow direct mapping.

Signed-off-by: Leonardo Bras 
---
  arch/powerpc/platforms/pseries/iommu.c | 87 +-
  1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index de54ddd9decd..572879af0211 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,6 +53,7 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
  };
  
+static phys_addr_t ddw_memory_hotplug_max(void);

  #ifdef CONFIG_IOMMU_API
 static int tce_exchange_pseries(struct iommu_table *tbl, long index, unsigned long *tce,
 				enum dma_data_direction *direction, bool realmode);
@@ -380,6 +381,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
  /* protects initializing window twice for same device */
  static DEFINE_MUTEX(direct_window_init_mutex);
  #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
  
  static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,

unsigned long num_pfn, const void *arg)
@@ -918,6 +920,7 @@ static int find_existing_ddw_windows(void)
return 0;
  
  	find_existing_ddw_windows_named(DIRECT64_PROPNAME);

+   find_existing_ddw_windows_named(DMA64_PROPNAME);
  
  	return 0;

  }
@@ -1207,10 +1210,13 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
+   const char *win_name;
struct property *win64 = NULL;
struct failed_ddw_pdn *fpdn;
-   bool default_win_removed = false;
+   bool default_win_removed = false, direct_mapping = false;
bool pmem_present;
+   struct pci_dn *pci = PCI_DN(pdn);
+   struct iommu_table *tbl = pci->table_group->tables[0];
  
  	dn = of_find_node_by_type(NULL, "ibm,pmemory");

pmem_present = dn != NULL;
@@ -1218,8 +1224,12 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
  
  	mutex_lock(_window_init_mutex);
  
-	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len))

-   goto out_unlock;
+   if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
+   direct_mapping = (len >= max_ram_len);
+
+   mutex_unlock(_window_init_mutex);
+   return direct_mapping;


Doesn't this break the existing case when direct_mapping==true by 
skipping setting dev->dev.bus_dma_limit before returning?
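Something like the following would keep that limit in place on the 
early-return path (an untested sketch, reusing the pmem handling from 
the tail of enable_ddw()):

	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
		direct_mapping = (len >= max_ram_len);

		mutex_unlock(&direct_window_init_mutex);

		/*
		 * If the window only covers RAM and pmem is present, the
		 * device must not DMA above the window, so clamp
		 * bus_dma_limit just like the end of enable_ddw() does.
		 */
		if (pmem_present && direct_mapping && len == max_ram_len)
			dev->dev.bus_dma_limit =
				dev->dev.archdata.dma_offset + (1ULL << len);

		return direct_mapping;
	}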





+   }
  
  	/*

 * If we already went through this for a previous function of
@@ -1298,7 +1308,6 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
goto out_failed;
}
/* verify the window * number of ptes will map the partition */
-   /* check largest block * page size > max memory hotplug addr */
/*
 * The "ibm,pmemory" can appear anywhere in the address space.
 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
@@ -1320,6 +1329,17 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
1ULL << len,
query.largest_available_block,
1ULL << page_shift);
+
+   len = 

[PATCH kernel v2] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-05-10 Thread Alexey Kardashevskiy
The $(CPP) (do only preprocessing) macro is already defined in Makefile.
However POWERPC redefines it and adds $(KBUILD_CFLAGS), which results
in flags duplication. This is not a big deal by itself, except for
flags which depend on other flags and which the compiler checks
as it parses the command line.

Specifically, scripts/Makefile.build:304 generates ksyms for .S files.
If clang+llvm+sanitizer are enabled, this results in

-emit-llvm-bc -fno-lto -flto -fvisibility=hidden \
 -fsanitize=cfi-mfcall -fno-lto  ...

in the clang command line and triggers error:

clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed with 
'-flto'

This removes the unnecessary CPP redefinition. This works fine as in most
places KBUILD_CFLAGS is passed to $CPP, except for
arch/powerpc/kernel/vdso64/vdso(32|64).lds (and probably some others,
not yet detected). To fix vdso, we do:
1. explicitly add -m(big|little)-endian to $CPP
2. (for clang) add $CLANG_FLAGS to $KBUILD_CPPFLAGS as otherwise clang
silently ignores -m(big|little)-endian if the building platform does not
support big endian (such as x86) so --prefix= is required.

While at it, remove some duplication from CPPFLAGS_vdso(32|64)
as cmd_cpp_lds_S has them anyway. It still puzzles me why we need the -C
(preserve comments in the preprocessor output) flag here.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* fix KBUILD_CPPFLAGS
* add CLANG_FLAGS to CPPFLAGS
---
 Makefile| 1 +
 arch/powerpc/Makefile   | 3 ++-
 arch/powerpc/kernel/vdso32/Makefile | 2 +-
 arch/powerpc/kernel/vdso64/Makefile | 2 +-
 4 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index 72af8e423f11..13acd2183d55 100644
--- a/Makefile
+++ b/Makefile
@@ -591,6 +591,7 @@ CLANG_FLAGS += --prefix=$(GCC_TOOLCHAIN_DIR)$(notdir 
$(CROSS_COMPILE))
 endif
 CLANG_FLAGS+= -Werror=unknown-warning-option
 KBUILD_CFLAGS  += $(CLANG_FLAGS)
+KBUILD_CPPFLAGS+= $(CLANG_FLAGS)
 KBUILD_AFLAGS  += $(CLANG_FLAGS)
 export CLANG_FLAGS
 endif
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 3212d076ac6a..306bfd2797ad 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -76,6 +76,7 @@ endif
 
 ifdef CONFIG_CPU_LITTLE_ENDIAN
 KBUILD_CFLAGS  += -mlittle-endian
+KBUILD_CPPFLAGS+= -mlittle-endian
 KBUILD_LDFLAGS += -EL
 LDEMULATION:= lppc
 GNUTARGET  := powerpcle
@@ -83,6 +84,7 @@ MULTIPLEWORD  := -mno-multiple
 KBUILD_CFLAGS_MODULE += $(call cc-option,-mno-save-toc-indirect)
 else
 KBUILD_CFLAGS += $(call cc-option,-mbig-endian)
+KBUILD_CPPFLAGS += $(call cc-option,-mbig-endian)
 KBUILD_LDFLAGS += -EB
 LDEMULATION:= ppc
 GNUTARGET  := powerpc
@@ -208,7 +210,6 @@ KBUILD_CPPFLAGS += -I $(srctree)/arch/$(ARCH) $(asinstr)
 KBUILD_AFLAGS  += $(AFLAGS-y)
 KBUILD_CFLAGS  += $(call cc-option,-msoft-float)
 KBUILD_CFLAGS  += -pipe $(CFLAGS-y)
-CPP= $(CC) -E $(KBUILD_CFLAGS)
 
 CHECKFLAGS += -m$(BITS) -D__powerpc__ -D__powerpc$(BITS)__
 ifdef CONFIG_CPU_BIG_ENDIAN
diff --git a/arch/powerpc/kernel/vdso32/Makefile 
b/arch/powerpc/kernel/vdso32/Makefile
index 7d9a6fee0e3d..ea001c6df1fa 100644
--- a/arch/powerpc/kernel/vdso32/Makefile
+++ b/arch/powerpc/kernel/vdso32/Makefile
@@ -44,7 +44,7 @@ asflags-y := -D__VDSO32__ -s
 
 obj-y += vdso32_wrapper.o
 targets += vdso32.lds
-CPPFLAGS_vdso32.lds += -P -C -Upowerpc
+CPPFLAGS_vdso32.lds += -C
 
 # link rule for the .so file, .lds has to be first
 $(obj)/vdso32.so.dbg: $(src)/vdso32.lds $(obj-vdso32) $(obj)/vgettimeofday.o 
FORCE
diff --git a/arch/powerpc/kernel/vdso64/Makefile 
b/arch/powerpc/kernel/vdso64/Makefile
index 2813e3f98db6..07eadba48c7a 100644
--- a/arch/powerpc/kernel/vdso64/Makefile
+++ b/arch/powerpc/kernel/vdso64/Makefile
@@ -30,7 +30,7 @@ ccflags-y := -shared -fno-common -fno-builtin -nostdlib \
 asflags-y := -D__VDSO64__ -s
 
 targets += vdso64.lds
-CPPFLAGS_vdso64.lds += -P -C -U$(ARCH)
+CPPFLAGS_vdso64.lds += -C
 
 # link rule for the .so file, .lds has to be first
 $(obj)/vdso64.so.dbg: $(src)/vdso64.lds $(obj-vdso64) $(obj)/vgettimeofday.o 
FORCE
-- 
2.30.2



Re: [PATCH v4 06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

2021-05-10 Thread Alexey Kardashevskiy




On 5/1/21 02:31, Leonardo Bras wrote:

Code used to create a ddw property that was previously scattered in
enable_ddw() is now gathered in ddw_property_create(), which deals with
allocation and filling the property, leaving it ready for
of_add_property(), which now occurs in sequence.

This created an opportunity to reorganize the second part of enable_ddw():

Without this patch enable_ddw() does, in order:
kzalloc() property & members, create_ddw(), fill ddwprop inside property,
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
of_add_property(), and list_add().

With this patch enable_ddw() does, in order:
create_ddw(), ddw_property_create(), of_add_property(),
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
and list_add().

This change requires of_remove_property() in case anything fails after
of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
in all memory, which looks like the most expensive operation, only if
everything else succeeds.

Signed-off-by: Leonardo Bras 



Reviewed-by: Alexey Kardashevskiy 




---
  arch/powerpc/platforms/pseries/iommu.c | 93 --
  1 file changed, 57 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 955cf095416c..5a70ecd579b8 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1122,6 +1122,35 @@ static void reset_dma_window(struct pci_dev *dev, struct 
device_node *par_dn)
 ret);
  }
  
+static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr,

+   u32 page_shift, u32 window_shift)
+{
+   struct dynamic_dma_window_prop *ddwprop;
+   struct property *win64;
+
+   win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
+   if (!win64)
+   return NULL;
+
+   win64->name = kstrdup(propname, GFP_KERNEL);
+   ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
+   win64->value = ddwprop;
+   win64->length = sizeof(*ddwprop);
+   if (!win64->name || !win64->value) {
+   kfree(win64->name);
+   kfree(win64->value);
+   kfree(win64);
+   return NULL;
+   }
+
+   ddwprop->liobn = cpu_to_be32(liobn);
+   ddwprop->dma_base = cpu_to_be64(dma_addr);
+   ddwprop->tce_shift = cpu_to_be32(page_shift);
+   ddwprop->window_shift = cpu_to_be32(window_shift);
+
+   return win64;
+}
+
  /* Return largest page shift based on "IO Page Sizes" output of 
ibm,query-pe-dma-window. */
  static int iommu_get_page_shift(u32 query_page_size)
  {
@@ -1167,11 +1196,11 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
+   u64 win_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64 = NULL;
-   struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false;
bool pmem_present;
@@ -1286,65 +1315,54 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
1ULL << page_shift);
goto out_failed;
}
-   win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
-   if (!win64) {
-   dev_info(&dev->dev,
-   "couldn't allocate property for 64bit dma window\n");
-   goto out_failed;
-   }
-   win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
-   win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
-   win64->length = sizeof(*ddwprop);
-   if (!win64->name || !win64->value) {
-   dev_info(&dev->dev,
-   "couldn't allocate property name and value\n");
-   goto out_free_prop;
-   }
  
  	ret = create_ddw(dev, ddw_avail, , page_shift, len);

if (ret != 0)
-   goto out_free_prop;
-
-   ddwprop->liobn = cpu_to_be32(create.liobn);
-   ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
-   create.addr_lo);
-   ddwprop->tce_shift = cpu_to_be32(page_shift);
-   ddwprop->window_shift = cpu_to_be32(len);
+   goto out_failed;
  
	dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",

  create.liobn, dn);
  
-	window = ddw_list_new_entry(pdn, ddwprop);

+   win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
+   win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr,
+   page_shift, len);
+   if (!win64) {
+   dev

Re: [PATCH v4 07/11] powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper

2021-05-10 Thread Alexey Kardashevskiy




On 5/1/21 02:31, Leonardo Bras wrote:

Add a new helper _iommu_table_setparms(), and use it in
iommu_table_setparms() and iommu_table_setparms_lpar() to avoid duplicated
code.

Also, setting tbl->it_ops was happening outside iommu_table_setparms*(),
so move it to the new helper. Since we need the iommu_table_ops to be
declared before being used, move iommu_table_lpar_multi_ops and
iommu_table_pseries_ops to before their respective iommu_table_setparms*().

Signed-off-by: Leonardo Bras 



This does not apply anymore as it conflicts with my 4be518d838809e2135.



---
  arch/powerpc/platforms/pseries/iommu.c | 100 -
  1 file changed, 50 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 5a70ecd579b8..89cb6e9e9f31 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,6 +53,11 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
  };
  
+#ifdef CONFIG_IOMMU_API

+static int tce_exchange_pseries(struct iommu_table *tbl, long index, unsigned 
long *tce,
+   enum dma_data_direction *direction, bool 
realmode);
+#endif



Instead of declaring this so far from the code which needs it, maybe 
add


struct iommu_table_ops iommu_table_lpar_multi_ops;

right before iommu_table_setparms() (as the struct is what you actually 
want there), and you won't need to move iommu_table_pseries_ops as well.




+
  static struct iommu_table *iommu_pseries_alloc_table(int node)
  {
struct iommu_table *tbl;
@@ -501,6 +506,28 @@ static int tce_setrange_multi_pSeriesLP_walk(unsigned long 
start_pfn,
return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
  }
  
+static inline void _iommu_table_setparms(struct iommu_table *tbl, unsigned long busno,



The underscore is confusing, maybe iommu_table_do_setparms()? 
iommu_table_setparms_common()? Not sure. I cannot recall a single 
function with just one leading underscore; I suspect I was pushed back 
when I tried adding one ages ago :) "inline" seems excessive, the 
compiler will probably figure it out anyway.





+unsigned long liobn, unsigned long 
win_addr,
+unsigned long window_size, unsigned 
long page_shift,
+unsigned long base, struct 
iommu_table_ops *table_ops)



Make "base" a pointer. Or, better, just keep setting it directly in 
iommu_table_setparms() rather than passing 0 around.


The same comment about "liobn" - set it in iommu_table_setparms_lpar(). 
The reviewer will see what field atters in what situation imho.
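For illustration, the common helper could look like this (an untested 
sketch, hypothetical name):

static void iommu_table_setparms_common(struct iommu_table *tbl,
					unsigned long busno,
					unsigned long win_addr,
					unsigned long window_size,
					unsigned long page_shift,
					struct iommu_table_ops *table_ops)
{
	tbl->it_busno = busno;
	tbl->it_offset = win_addr >> page_shift;
	tbl->it_size = window_size >> page_shift;
	tbl->it_page_shift = page_shift;
	tbl->it_blocksize = 16;
	tbl->it_type = TCE_PCI;
	tbl->it_ops = table_ops;
}

with iommu_table_setparms() then keeping
tbl->it_base = (unsigned long)__va(*basep); and
iommu_table_setparms_lpar() keeping tbl->it_index = liobn; next to the 
call.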





+{
+   tbl->it_busno = busno;
+   tbl->it_index = liobn;
+   tbl->it_offset = win_addr >> page_shift;
+   tbl->it_size = window_size >> page_shift;
+   tbl->it_page_shift = page_shift;
+   tbl->it_base = base;
+   tbl->it_blocksize = 16;
+   tbl->it_type = TCE_PCI;
+   tbl->it_ops = table_ops;
+}
+
+struct iommu_table_ops iommu_table_pseries_ops = {
+   .set = tce_build_pSeries,
+   .clear = tce_free_pSeries,
+   .get = tce_get_pseries
+};
+
  static void iommu_table_setparms(struct pci_controller *phb,
 struct device_node *dn,
 struct iommu_table *tbl)
@@ -509,8 +536,13 @@ static void iommu_table_setparms(struct pci_controller 
*phb,
const unsigned long *basep;
const u32 *sizep;
  
-	node = phb->dn;

+   /* Test if we are going over 2GB of DMA space */
+   if (phb->dma_window_base_cur + phb->dma_window_size > SZ_2G) {
+   udbg_printf("PCI_DMA: Unexpected number of IOAs under this 
PHB.\n");
+   panic("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
+   }
  
+	node = phb->dn;

basep = of_get_property(node, "linux,tce-base", NULL);
sizep = of_get_property(node, "linux,tce-size", NULL);
if (basep == NULL || sizep == NULL) {
@@ -519,33 +551,25 @@ static void iommu_table_setparms(struct pci_controller 
*phb,
return;
}
  
-	tbl->it_base = (unsigned long)__va(*basep);

+   _iommu_table_setparms(tbl, phb->bus->number, 0, 
phb->dma_window_base_cur,
+ phb->dma_window_size, IOMMU_PAGE_SHIFT_4K,
+ (unsigned long)__va(*basep), 
&iommu_table_pseries_ops);
  
  	if (!is_kdump_kernel())

memset((void *)tbl->it_base, 0, *sizep);
  
-	tbl->it_busno = phb->bus->number;

-   tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
-
-   /* Units of tce entries */
-   tbl->it_offset = phb->dma_window_base_cur >> tbl->it_page_shift;
-
-   /* Test if we are going over 2GB of DMA space */
-   if (phb->dma_window_base_cur + phb->dma_window_size > 0x8000ul) {
-   udbg_printf("PCI_DMA: Unexpected number of IOAs under this 
PHB.\n");
-   

Re: [PATCH v4 02/11] powerpc/kernel/iommu: Add new iommu_table_in_use() helper

2021-05-10 Thread Alexey Kardashevskiy




On 5/1/21 02:31, Leonardo Bras wrote:

Having a function to check if the iommu table has any allocation helps
decide whether a tbl can be reset for using a new DMA window.

It should be enough to replace all instances of !bitmap_empty(tbl...).

iommu_table_in_use() skips reserved memory, so we don't need to worry about
releasing it before testing. This causes iommu_table_release_pages() to
become unnecessary, given it is only used to remove reserved memory for
testing.

Also, only allow storing reserved memory values in tbl if they are valid
in the table, so there is no need to check it in the new helper.

Signed-off-by: Leonardo Bras 
---
  arch/powerpc/include/asm/iommu.h |  1 +
  arch/powerpc/kernel/iommu.c  | 65 
  2 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index deef7c94d7b6..bf3b84128525 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -154,6 +154,7 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
   */
  extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
int nid, unsigned long res_start, unsigned long res_end);
+bool iommu_table_in_use(struct iommu_table *tbl);
  
  #define IOMMU_TABLE_GROUP_MAX_TABLES	2
  
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c

index ad82dda81640..5e168bd91401 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -691,32 +691,24 @@ static void iommu_table_reserve_pages(struct iommu_table 
*tbl,
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
  
-	tbl->it_reserved_start = res_start;

-   tbl->it_reserved_end = res_end;
-
-   /* Check if res_start..res_end isn't empty and overlaps the table */
-   if (res_start && res_end &&
-   (tbl->it_offset + tbl->it_size < res_start ||
-res_end < tbl->it_offset))
-   return;
+   if (res_start < tbl->it_offset)
+   res_start = tbl->it_offset;
  
-	for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)

-   set_bit(i - tbl->it_offset, tbl->it_map);
-}
+   if (res_end > (tbl->it_offset + tbl->it_size))
+   res_end = tbl->it_offset + tbl->it_size;
  
-static void iommu_table_release_pages(struct iommu_table *tbl)

-{
-   int i;
+   /* Check if res_start..res_end is a valid range in the table */
+   if (res_start >= res_end) {
+   tbl->it_reserved_start = tbl->it_offset;
+   tbl->it_reserved_end = tbl->it_offset;
+   return;
+   }
  
-	/*

-* In case we have reserved the first bit, we should not emit
-* the warning below.
-*/
-   if (tbl->it_offset == 0)
-   clear_bit(0, tbl->it_map);
+   tbl->it_reserved_start = res_start;
+   tbl->it_reserved_end = res_end;
  
  	for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)

-   clear_bit(i - tbl->it_offset, tbl->it_map);
+   set_bit(i - tbl->it_offset, tbl->it_map);



git produced a messy chunk here. The new logic is:


static void iommu_table_reserve_pages(struct iommu_table *tbl,
unsigned long res_start, unsigned long res_end)
{
int i;

WARN_ON_ONCE(res_end < res_start);
/*
 * Reserve page 0 so it will not be used for any mappings.
 * This avoids buggy drivers that consider page 0 to be invalid
 * to crash the machine or even lose data.
 */
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);

if (res_start < tbl->it_offset)
res_start = tbl->it_offset;

if (res_end > (tbl->it_offset + tbl->it_size))
res_end = tbl->it_offset + tbl->it_size;

/* Check if res_start..res_end is a valid range in the table */
if (res_start >= res_end) {
tbl->it_reserved_start = tbl->it_offset;
tbl->it_reserved_end = tbl->it_offset;
return;
}


It is just hard to read. A code reviewer would assume res_end >= 
res_start (as there is a WARN_ON) but later we allow res_end to be less 
than res_start.


but maybe it is just me :)
Otherwise looks good.
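FWIW an explicit clamp would make the intent more obvious, something 
like (untested):

	/* Clamp the reserved range to the window; it may end up empty */
	res_start = max(res_start, tbl->it_offset);
	res_end = min(res_end, tbl->it_offset + tbl->it_size);
	if (res_start >= res_end) {
		/* Nothing to reserve */
		res_start = res_end = tbl->it_offset;
	}

	tbl->it_reserved_start = res_start;
	tbl->it_reserved_end = res_end;

	for (i = res_start; i < res_end; ++i)
		set_bit(i - tbl->it_offset, tbl->it_map);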

Reviewed-by: Alexey Kardashevskiy 



  }
  
  /*

@@ -781,6 +773,22 @@ struct iommu_table *iommu_init_table(struct iommu_table 
*tbl, int nid,
return tbl;
  }
  
+bool iommu_table_in_use(struct iommu_table *tbl)

+{
+   unsigned long start = 0, end;
+
+   /* ignore reserved bit0 */
+   if (tbl->it_offset == 0)
+   start = 1;
+   end = tbl->it_reserved_start - tbl->it_offset;
+   if (find_next_bit(tbl->i

Re: [PATCH v3 06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

2021-04-23 Thread Alexey Kardashevskiy




On 22/04/2021 17:07, Leonardo Bras wrote:

Code used to create a ddw property that was previously scattered in
enable_ddw() is now gathered in ddw_property_create(), which deals with
allocation and filling the property, leaving it ready for
of_add_property(), which now occurs in sequence.

This created an opportunity to reorganize the second part of enable_ddw():

Without this patch enable_ddw() does, in order:
kzalloc() property & members, create_ddw(), fill ddwprop inside property,
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
of_add_property(), and list_add().

With this patch enable_ddw() does, in order:
create_ddw(), ddw_property_create(), of_add_property(),
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
and list_add().

This change requires of_remove_property() in case anything fails after
of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
in all memory, which looks like the most expensive operation, only if
everything else succeeds.

Signed-off-by: Leonardo Bras 
---
  arch/powerpc/platforms/pseries/iommu.c | 93 --
  1 file changed, 57 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 955cf095416c..48c029386d94 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1122,6 +1122,35 @@ static void reset_dma_window(struct pci_dev *dev, struct 
device_node *par_dn)
 ret);
  }
  
+static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr,

+   u32 page_shift, u32 window_shift)
+{
+   struct dynamic_dma_window_prop *ddwprop;
+   struct property *win64;
+
+   win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
+   if (!win64)
+   return NULL;
+
+   win64->name = kstrdup(propname, GFP_KERNEL);
+   ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
+   win64->value = ddwprop;
+   win64->length = sizeof(*ddwprop);
+   if (!win64->name || !win64->value) {
+   kfree(win64);
+   kfree(win64->name);
+   kfree(win64->value);



Wrong order.
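The members have to be freed before the container, as the v4 version of 
this patch does:

	if (!win64->name || !win64->value) {
		kfree(win64->name);
		kfree(win64->value);
		kfree(win64);
		return NULL;
	}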





+   return NULL;
+   }
+
+   ddwprop->liobn = cpu_to_be32(liobn);
+   ddwprop->dma_base = cpu_to_be64(dma_addr);
+   ddwprop->tce_shift = cpu_to_be32(page_shift);
+   ddwprop->window_shift = cpu_to_be32(window_shift);
+
+   return win64;
+}
+
  /* Return largest page shift based on "IO Page Sizes" output of 
ibm,query-pe-dma-window. */
  static int iommu_get_page_shift(u32 query_page_size)
  {
@@ -1167,11 +1196,11 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
+   u64 win_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64 = NULL;
-   struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false;
bool pmem_present;
@@ -1286,65 +1315,54 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
1ULL << page_shift);
goto out_failed;
}
-   win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
-   if (!win64) {
-   dev_info(&dev->dev,
-   "couldn't allocate property for 64bit dma window\n");
-   goto out_failed;
-   }
-   win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
-   win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
-   win64->length = sizeof(*ddwprop);
-   if (!win64->name || !win64->value) {
-   dev_info(&dev->dev,
-   "couldn't allocate property name and value\n");
-   goto out_free_prop;
-   }
  
  	ret = create_ddw(dev, ddw_avail, , page_shift, len);

if (ret != 0)
-   goto out_free_prop;
-
-   ddwprop->liobn = cpu_to_be32(create.liobn);
-   ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
-   create.addr_lo);
-   ddwprop->tce_shift = cpu_to_be32(page_shift);
-   ddwprop->window_shift = cpu_to_be32(len);
+   goto out_failed;
  
	dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",

  create.liobn, dn);
  
-	window = ddw_list_new_entry(pdn, ddwprop);

+   win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
+   win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr,
+   page_shift, len);
+   if (!win64) {
+   dev_info(&dev->dev,
+"couldn't allocate property, property name, or 
value\n");
+   goto out_del_win;
+   }
+
+   ret = 

Re: [PATCH v3 01/11] powerpc/pseries/iommu: Replace hard-coded page shift

2021-04-23 Thread Alexey Kardashevskiy




On 22/04/2021 17:07, Leonardo Bras wrote:

Some functions assume IOMMU page size can only be 4K (pageshift == 12).
Update them to accept any page size passed, so we can use 64K pages.

In the process, some defines like TCE_SHIFT were made obsolete, and then
removed.

IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figures 3.4 and 3.5 show
a 52-bit RPN and consider a 12-bit pageshift, so there should be
no need to use TCE_RPN_MASK, which masks out any bit after 40 in rpn.
Its usage was removed from tce_build_pSeries(), tce_build_pSeriesLP(), and
tce_buildmulti_pSeriesLP().



After rereading the patch, I wonder why we had this TCE_RPN_MASK at all, 
but what is certain is that this has nothing to do with IODA3 as these 
TCEs are guest phys addresses in pseries and IODA3 is bare metal. Except...




Most places had a tbl struct, so using tbl->it_page_shift was simple.
tce_free_pSeriesLP() was a special case, since callers do not always have a
tbl struct, so adding a tceshift parameter seems the right thing to do.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
  arch/powerpc/include/asm/tce.h |  8 --
  arch/powerpc/platforms/pseries/iommu.c | 39 +++---
  2 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index db5fc2f2262d..0c34d2756d92 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -19,15 +19,7 @@
  #define TCE_VB0
  #define TCE_PCI   1
  
-/* TCE page size is 4096 bytes (1 << 12) */

-
-#define TCE_SHIFT  12
-#define TCE_PAGE_SIZE  (1 << TCE_SHIFT)
-
  #define TCE_ENTRY_SIZE8   /* each TCE is 64 bits 
*/
-
-#define TCE_RPN_MASK   0xfful  /* 40-bit RPN (4K pages) */
-#define TCE_RPN_SHIFT  12
  #define TCE_VALID 0x800   /* TCE valid */
  #define TCE_ALLIO 0x400   /* TCE valid for all lpars */
  #define TCE_PCI_WRITE 0x2 /* write from PCI allowed */
diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 67c9953a6503..796ab356341c 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -107,6 +107,8 @@ static int tce_build_pSeries(struct iommu_table *tbl, long 
index,
u64 proto_tce;
__be64 *tcep;
u64 rpn;
+   const unsigned long tceshift = tbl->it_page_shift;
+   const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);


(nit: only used once)

  
  	proto_tce = TCE_PCI_READ; // Read allowed
  
@@ -117,10 +119,10 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,



... this is pseries which is not pseriesLP, i.e. no LPAR == bare metal 
pseries such as ancient power5 or cellbe (I guess), and for those 
TCE_RPN_MASK may actually make sense, so keep it.


The rest of the patch looks good. Thanks,


  
  	while (npages--) {

/* can't move this out since we might cross MEMBLOCK boundary */
-   rpn = __pa(uaddr) >> TCE_SHIFT;
-   *tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << 
TCE_RPN_SHIFT);
+   rpn = __pa(uaddr) >> tceshift;
+   *tcep = cpu_to_be64(proto_tce | rpn << tceshift);
  
-		uaddr += TCE_PAGE_SIZE;

+   uaddr += pagesize;
tcep++;
}
return 0;
@@ -146,7 +148,7 @@ static unsigned long tce_get_pseries(struct iommu_table 
*tbl, long index)
return be64_to_cpu(*tcep);
  }
  
-static void tce_free_pSeriesLP(unsigned long liobn, long, long);

+static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
  static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
  
  static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,

@@ -166,12 +168,12 @@ static int tce_build_pSeriesLP(unsigned long liobn, long 
tcenum, long tceshift,
proto_tce |= TCE_PCI_WRITE;
  
  	while (npages--) {

-   tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
+   tce = proto_tce | rpn << tceshift;
rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);
  
  		if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {

ret = (int)rc;
-   tce_free_pSeriesLP(liobn, tcenum_start,
+   tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
   (npages_start - (npages + 1)));
break;
}
@@ -205,10 +207,11 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
long tcenum_start = tcenum, npages_start = npages;
int ret = 0;
unsigned long flags;
+   const unsigned long tceshift = tbl->it_page_shift;
  
  	if ((npages == 1) || !

Re: [PATCH kernel] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-04-22 Thread Alexey Kardashevskiy




On 23/04/2021 08:58, Daniel Axtens wrote:

Hi Alexey,


The $(CPP) (do only preprocessing) macro is already defined in Makefile.
However POWERPC redefines it and adds $(KBUILD_CFLAGS), which results
in flags duplication. This is not a big deal by itself, except for
flags which depend on other flags and which the compiler checks
as it parses the command line.

Specifically, scripts/Makefile.build:304 generates ksyms for .S files.
If clang+llvm+sanitizer are enabled, this results in
-fno-lto -flto -fsanitize=cfi-mfcall   -fno-lto -flto -fsanitize=cfi-mfcall


Checkpatch doesn't like this line:
WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a 
maximum 75 chars per line)
#14:
-fno-lto -flto -fsanitize=cfi-mfcall   -fno-lto -flto -fsanitize=cfi-mfcall
However, it doesn't make sense to wrap the line so we should just ignore
checkpatch here.


in the clang command line and triggers error:

clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed with 
'-flto'

This removes the unnecessary CPP redefinition.

Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/Makefile | 1 -
  1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index c9d2c7825cd6..3a2f2001c62b 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -214,7 +214,6 @@ KBUILD_CPPFLAGS += -I $(srctree)/arch/$(ARCH) $(asinstr)
  KBUILD_AFLAGS += $(AFLAGS-y)
  KBUILD_CFLAGS += $(call cc-option,-msoft-float)
  KBUILD_CFLAGS += -pipe $(CFLAGS-y)
-CPP= $(CC) -E $(KBUILD_CFLAGS)


I was trying to check the history to see why powerpc has its own
definition. It seems to date back at least as far as merging the two
powerpc platforms into one, maybe it was helpful then. I agree it
doesn't seem to be needed now.

Snowpatch claims that this breaks the build, but I haven't been able to
reproduce the issue in either pmac32 or ppc64 defconfig. I have sent it
off to a fork of mpe's linux-ci repo to see if any of those builds hit
any issues: https://github.com/daxtens/linux-ci/actions


To be precise, you need LLVM + LTO + byte code (-emit-llvm-bc); I am not 
even sure what the point is of having -flto without -emit-llvm-bc.


No flags duplication:

[fstn1-p1 1]$ /mnt/sdb/pbuild/llvm-no-lto/bin/clang-13  -emit-llvm-bc 
-fno-lto -flto -fvisibility=hidden -fsanitize=cfi-mfcall 
/mnt/sdb/pbuild/llvm-bugs/1/a.c
/usr/bin/ld: warning: cannot find entry symbol mit-llvm-bc; defaulting 
to 13e0
/usr/bin/ld: 
/usr/lib/powerpc64le-linux-gnu/crt1.o:(.data.rel.ro.local+0x8): 
undefined reference to `main'
clang-13: error: linker command failed with exit code 1 (use -v to see 
invocation)


=> command line is fine, the file is not (but it is for debugging this 
problem).



Now I am adding the second -fno-lto:

[fstn1-p1 1]$ /mnt/sdb/pbuild/llvm-no-lto/bin/clang-13  -emit-llvm-bc 
-fno-lto -flto -fvisibility=hidden -fsanitize=cfi-mfcall -fno-lto 
/mnt/sdb/pbuild/llvm-bugs/1/a.c
clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed 
with '-flto'



=> failed.


Assuming that doesn't break, this patch looks good to me:
Reviewed-by: Daniel Axtens 

Kind regards,
Daniel



--
Alexey


[PATCH kernel] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-04-22 Thread Alexey Kardashevskiy
The $(CPP) (do only preprocessing) macro is already defined in Makefile.
However POWERPC redefines it and adds $(KBUILD_CFLAGS), which results
in flags duplication. This is not a big deal by itself, except for
flags which depend on other flags and which the compiler checks
as it parses the command line.

Specifically, scripts/Makefile.build:304 generates ksyms for .S files.
If clang+llvm+sanitizer are enabled, this results in
-fno-lto -flto -fsanitize=cfi-mfcall   -fno-lto -flto -fsanitize=cfi-mfcall
in the clang command line and triggers error:

clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed with 
'-flto'

This removes the unnecessary CPP redefinition.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/Makefile | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index c9d2c7825cd6..3a2f2001c62b 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -214,7 +214,6 @@ KBUILD_CPPFLAGS += -I $(srctree)/arch/$(ARCH) $(asinstr)
 KBUILD_AFLAGS  += $(AFLAGS-y)
 KBUILD_CFLAGS  += $(call cc-option,-msoft-float)
 KBUILD_CFLAGS  += -pipe $(CFLAGS-y)
-CPP= $(CC) -E $(KBUILD_CFLAGS)
 
 CHECKFLAGS += -m$(BITS) -D__powerpc__ -D__powerpc$(BITS)__
 ifdef CONFIG_CPU_BIG_ENDIAN
-- 
2.25.1



Re: [PATCH 1/1] powerpc/pseries/iommu: Fix window size for direct mapping with pmem

2021-04-19 Thread Alexey Kardashevskiy




On 20/04/2021 14:54, Leonardo Bras wrote:

As of today, if the DDW is big enough to fit (1 << MAX_PHYSMEM_BITS) it's
possible to use direct DMA mapping even with a pmem region.

But, if that happens, the window size (len) is set to
(MAX_PHYSMEM_BITS - page_shift) instead of MAX_PHYSMEM_BITS, causing a
DDW a pagesize times smaller than needed to be created, which is
insufficient for correct usage.

Fix this so the correct window size is used in this case.


Good find indeed.

afaict this does not create a huge problem though as 
query.largest_available_block is always smaller than 
1ULL << (MAX_PHYSMEM_BITS - page_shift) where it matters (phyp).
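
As a worked example (numbers for illustration only): with 
MAX_PHYSMEM_BITS = 51 and 16MB IOMMU pages (page_shift = 24), the check 
requires query.largest_available_block >= 1ULL << (51 - 24), i.e. 2^27 
TCEs of 16MB each = 2^51 bytes, but the old code then created a window 
of only 1ULL << (51 - 24) = 128MB instead of the 1ULL << 51 bytes it 
had just verified it could map; len is a window shift in bits, hence 
MAX_PHYSMEM_BITS.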



Reviewed-by: Alexey Kardashevskiy 



Fixes: bf6e2d562bbc4 ("powerpc/dma: Fallback to dma_ops when persistent memory 
present")
Signed-off-by: Leonardo Bras 
---
  arch/powerpc/platforms/pseries/iommu.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9fc5217f0c8e..836cbbe0ecc5 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1229,7 +1229,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
if (pmem_present) {
if (query.largest_available_block >=
(1ULL << (MAX_PHYSMEM_BITS - page_shift)))
-   len = MAX_PHYSMEM_BITS - page_shift;
+   len = MAX_PHYSMEM_BITS;
else
dev_info(>dev, "Skipping ibm,pmemory");
}



--
Alexey


Re: [PATCH v2 13/14] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-04-13 Thread Alexey Kardashevskiy




On 13/04/2021 17:58, Leonardo Bras wrote:

On Tue, 2021-04-13 at 17:41 +1000, Alexey Kardashevskiy wrote:


On 13/04/2021 17:33, Leonardo Bras wrote:

On Tue, 2021-04-13 at 17:18 +1000, Alexey Kardashevskiy wrote:


On 13/04/2021 15:49, Leonardo Bras wrote:

Thanks for the feedback!

On Tue, 2020-09-29 at 13:56 +1000, Alexey Kardashevskiy wrote:

-static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
+static phys_addr_t ddw_memory_hotplug_max(void)



Please, forward declaration or a separate patch; this creates
unnecessary noise to the actual change.



Sure, done!




+   _iommu_table_setparms(tbl, pci->phb->bus->number, create.liobn, 
win_addr,
+ 1UL << len, page_shift, 0, 
&iommu_table_lpar_multi_ops);
+   iommu_init_table(tbl, pci->phb->node, 0, 0);



It is 0,0 only if win_addr>0 which is not the QEMU case.



Oh, ok.
I previously thought it was ok to use 0,0 here as any other usage in
this file was also 0,0.

What should I use to get the correct parameters? Is using the previous
tbl->it_reserved_start and tbl->it_reserved_end enough?


depends on whether you carry reserved start/end even if they are outside
of the dma window.



Oh, that makes sense.
On a previous patch (5/14 IIRC), I changed the behavior to only store
the valid range on tbl, but now I understand why it's important to
store the raw value.

Ok, I will change it back so the reserved range stays in tbl even if it
does not intersect with the DMA window. This way I can reuse the values
in case of indirect mapping with DDW.

Is that ok? Are the reserved values supposed to stay the same after
changing from the default DMA window to DDW?


I added them to know what bits in it_map to ignore when checking if
there is any active user of the table. If you have non-zero reserved
start/end but they do not affect it_map, then it is a rather weird way to
carry reserved start/end from DDW to no-DDW.



Ok, agreed.


  Maybe do not set these at
all for DDW with window start at 1<<59, and when going back to no-DDW (or
if DDW starts at 0) just set them from MMIO32, just as they are
initialized in the first place.



If I get it correctly from pci_of_scan.c, MMIO32 = {0, 32MB}, is that
correct?


No, under QEMU it is 0x8000.-0x1..:

/proc/device-tree/pci@8002000/ranges

7 cells for each resource, the second one is MMIO32 (the first is IO 
ports, the last is 64bit MMIO).
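
i.e. roughly (cell layout from memory, double-check against the tree):

/*
 * spapr PHB "ranges": 7 cells per entry =
 *	3 (PCI child address: phys.hi phys.mid phys.lo)
 *	+ 2 (parent/CPU address)
 *	+ 2 (size)
 * entry 0 = IO ports, entry 1 = 32bit MMIO, entry 2 = 64bit MMIO
 */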




So, if DDW starts at any value in this range (most probably at zero),
we should remove the rest, is that correct?

Could it always use iommu_init_table(..., 0, 32MB) here, so it always
reserves any part of the DMA window that's in this range? Or might there
be other reserved value ranges?


and when going back to no-DDW


After iommu_init_table() there should be no failure, so it looks like
there is no 'going back to no-DDW'. Am I missing something?


Well, a random driver could request 32bit DMA and if the new window is 
1:1, then it would break, but this does not seem to happen and we do not 
support it anyway, so no loss here.



--
Alexey


Re: [PATCH v2 13/14] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-04-13 Thread Alexey Kardashevskiy




On 13/04/2021 17:33, Leonardo Bras wrote:

On Tue, 2021-04-13 at 17:18 +1000, Alexey Kardashevskiy wrote:


On 13/04/2021 15:49, Leonardo Bras wrote:

Thanks for the feedback!

On Tue, 2020-09-29 at 13:56 +1000, Alexey Kardashevskiy wrote:

-static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
+static phys_addr_t ddw_memory_hotplug_max(void)



Please, forward declaration or a separate patch; this creates
unnecessary noise to the actual change.



Sure, done!




+   _iommu_table_setparms(tbl, pci->phb->bus->number, create.liobn, 
win_addr,
+ 1UL << len, page_shift, 0, 
&iommu_table_lpar_multi_ops);
+   iommu_init_table(tbl, pci->phb->node, 0, 0);



It is 0,0 only if win_addr>0 which is not the QEMU case.



Oh, ok.
I previously thought it was ok to use 0,0 here as any other usage in
this file was also 0,0.

What should I use to get the correct parameters? Is using the previous
tbl->it_reserved_start and tbl->it_reserved_end enough?


depends on whether you carry reserved start/end even if they are outside
of the dma window.



Oh, that makes sense.
On a previous patch (5/14 IIRC), I changed the behavior to only store
the valid range on tbl, but now I understand why it's important to
store the raw value.

Ok, I will change it back so the reserved range stays in tbl even if it
does not intersect with the DMA window. This way I can reuse the values
in case of indirect mapping with DDW.

Is that ok? Are the reserved values supposed to stay the same after
changing from the default DMA window to DDW?


I added them to know what bits in it_map to ignore when checking if 
there is any active user of the table. If you have non-zero reserved 
start/end but they do not affect it_map, then it is a rather weird way to 
carry reserved start/end from DDW to no-DDW. Maybe do not set these at 
all for DDW with window start at 1<<59, and when going back to no-DDW (or 
if DDW starts at 0) just set them from MMIO32, just as they are 
initialized in the first place.




--
Alexey


Re: [PATCH v2 13/14] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-04-13 Thread Alexey Kardashevskiy




On 13/04/2021 15:49, Leonardo Bras wrote:

Thanks for the feedback!

On Tue, 2020-09-29 at 13:56 +1000, Alexey Kardashevskiy wrote:

-static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
+static phys_addr_t ddw_memory_hotplug_max(void)



Please, forward declaration or a separate patch; this creates
unnecessary noise to the actual change.



Sure, done!




+   _iommu_table_setparms(tbl, pci->phb->bus->number, create.liobn, 
win_addr,
+ 1UL << len, page_shift, 0, 
&iommu_table_lpar_multi_ops);
+   iommu_init_table(tbl, pci->phb->node, 0, 0);



It is 0,0 only if win_addr>0 which is not the QEMU case.



Oh, ok.
I previously thought it was ok to use 0,0 here as any other usage in
this file was also 0,0.

What should I use to get the correct parameters? Is using the previous
tbl->it_reserved_start and tbl->it_reserved_end enough?


depends on whether you carry reserved start/end even if they are outside 
of the dma window.



--
Alexey


Re: [PATCH v6 33/48] KVM: PPC: Book3S HV P9: Improve exit timing accounting coverage

2021-04-09 Thread Alexey Kardashevskiy




On 05/04/2021 11:19, Nicholas Piggin wrote:

The C conversion caused exit timing to become a bit cramped. Expand it
to cover more of the entry and exit code.

Signed-off-by: Nicholas Piggin 



Reviewed-by: Alexey Kardashevskiy 


---
  arch/powerpc/kvm/book3s_hv_interrupt.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_interrupt.c 
b/arch/powerpc/kvm/book3s_hv_interrupt.c
index e93d2a6456ff..44c77f907f91 100644
--- a/arch/powerpc/kvm/book3s_hv_interrupt.c
+++ b/arch/powerpc/kvm/book3s_hv_interrupt.c
@@ -154,6 +154,8 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
if (hdec < 0)
return BOOK3S_INTERRUPT_HV_DECREMENTER;
  
+	start_timing(vcpu, &vcpu->arch.rm_entry);

+
if (vc->tb_offset) {
u64 new_tb = tb + vc->tb_offset;
mtspr(SPRN_TBU40, new_tb);
@@ -204,8 +206,6 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 */
mtspr(SPRN_HDEC, hdec);
  
-	start_timing(vcpu, &vcpu->arch.rm_entry);

-
vcpu->arch.ceded = 0;
  
  	WARN_ON_ONCE(vcpu->arch.shregs.msr & MSR_HV);

@@ -349,8 +349,6 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
  
	accumulate_time(vcpu, &vcpu->arch.rm_exit);
  
-	end_timing(vcpu);

-
/* Advance host PURR/SPURR by the amount used by guest */
purr = mfspr(SPRN_PURR);
spurr = mfspr(SPRN_SPURR);
@@ -415,6 +413,8 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
  
  	switch_mmu_to_host_radix(kvm, host_pidr);
  
+	end_timing(vcpu);

+
return trap;
  }
  EXPORT_SYMBOL_GPL(kvmhv_vcpu_entry_p9);



--
Alexey


Re: [PATCH v6 32/48] KVM: PPC: Book3S HV P9: Read machine check registers while MSR[RI] is 0

2021-04-09 Thread Alexey Kardashevskiy
arch.regs.gpr[11] = exsave[EX_R11/sizeof(u64)];
diff --git a/arch/powerpc/kvm/book3s_hv_ras.c b/arch/powerpc/kvm/book3s_hv_ras.c
index d4bca93b79f6..8d8a4d5f0b55 100644
--- a/arch/powerpc/kvm/book3s_hv_ras.c
+++ b/arch/powerpc/kvm/book3s_hv_ras.c
@@ -199,6 +199,8 @@ static void kvmppc_tb_resync_done(void)
   * know about the exact state of the TB value. Resync TB call will
   * restore TB to host timebase.
   *
+ * This could use the new OPAL_HANDLE_HMI2 to avoid resyncing TB every time.



Educating myself - is it because OPAL_HANDLE_HMI2 tells whether it is TB/TOD 
that is the problem, so we can avoid calling opal_resync_timebase() if 
it is not TB? OPAL_HANDLE_HMI2 does not seem to resync the TB itself. The 
comment just does not seem related to the rest of the patch.


Otherwise, looks good.

Reviewed-by: Alexey Kardashevskiy 



+ *
   * Things to consider:
   * - On TB error, HMI interrupt is reported on all the threads of the core
   *   that has encountered TB error irrespective of split-core mode.



--
Alexey


Re: [PATCH v2 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-04-08 Thread Alexey Kardashevskiy




On 08/04/2021 19:04, Michael Ellerman wrote:

Alexey Kardashevskiy  writes:

On 08/04/2021 15:37, Michael Ellerman wrote:

Leonardo Bras  writes:

According to LoPAR, ibm,query-pe-dma-window output named "IO Page Sizes"
will let the OS know all possible pagesizes that can be used for creating a
new DDW.

Currently Linux will only try using 3 of the 8 available options:
4K, 64K and 16M. According to LoPAR, Hypervisor may also offer 32M, 64M,
128M, 256M and 16G.


Do we know of any hardware & hypervisor combination that will actually
give us bigger pages?



On P8 16MB host pages and 16MB hardware iommu pages worked.

On P9, VM's 16MB IOMMU pages worked on top of 2MB host pages + 2MB
hardware IOMMU pages.


The current code already tries 16MB though.

I'm wondering if we're going to ask for larger sizes that have never
been tested and possibly expose bugs. But it sounds like this is mainly
targeted at future platforms.



I tried for fun to pass through a PCI device to a guest with this patch as:

pbuild/qemu-killslof-aiku1904le-ppc64/qemu-system-ppc64 \
-nodefaults \
-chardev stdio,id=STDIO0,signal=off,mux=on \
-device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
-mon id=MON0,chardev=STDIO0,mode=readline \
-nographic \
-vga none \
-enable-kvm \
-m 16G \
-kernel ./vmldbg \
-initrd /home/aik/t/le.cpio \
-device vfio-pci,id=vfio0001_01_00_0,host=0001:01:00.0 \
-mem-prealloc \
-mem-path qemu_hp_1G_node0 \
-global spapr-pci-host-bridge.pgsz=0xff000 \
-machine cap-cfpc=broken,cap-ccf-assist=off \
-smp 1,threads=1 \
-L /home/aik/t/qemu-ppc64-bios/ \
-trace events=qemu_trace_events \
-d guest_errors,mmu \
-chardev socket,id=SOCKET0,server=on,wait=off,path=qemu.mon.1_1_0_0 \
-mon chardev=SOCKET0,mode=control


The guest created a huge window:

xhci_hcd :00:00.0: ibm,create-pe-dma-window(2027) 0 800 2000 
22 22 returned 0 (liobn = 0x8001 starting addr = 800 0)


The first "22" is page_shift in hex (16GB), the second "22" is 
window_shift (so we have 1 TCE).
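
(0x22 = 34, so both the IOMMU page size and the window size are 
2^34 = 16GB, hence the single TCE.)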


On the host side the window#1 was created with 1GB pages:
pci 0001:01 : [PE# fd] Setting up window#1 
800..80007ff pg=4000



The XHCI seems working. Without the patch 16MB was the maximum.





diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9fc5217f0c8e..6cda1c92597d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,6 +53,20 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
   };


A comment saying where the values come from would be good.


+#define QUERY_DDW_PGSIZE_4K0x01
+#define QUERY_DDW_PGSIZE_64K   0x02
+#define QUERY_DDW_PGSIZE_16M   0x04
+#define QUERY_DDW_PGSIZE_32M   0x08
+#define QUERY_DDW_PGSIZE_64M   0x10
+#define QUERY_DDW_PGSIZE_128M  0x20
+#define QUERY_DDW_PGSIZE_256M  0x40
+#define QUERY_DDW_PGSIZE_16G   0x80


I'm not sure the #defines really gain us much vs just putting the
literal values in the array below?


Then someone says "u magic values" :) I do not mind either way. Thanks,


Yeah that's true. But #defining them doesn't make them less magic, if
you only use them in one place :)


Defining them with "QUERY_DDW" in the names kinda tells where they are 
from. Can also grep QEMU using these to see how the other side handles 
it. Dunno.


btw the bot complained about __builtin_ctz(SZ_16G) which should be 
__builtin_ctzl(SZ_16G) so we have to ask Leonardo to repost anyway :)
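
For the record, the problem is that SZ_16G does not fit in the unsigned 
int that __builtin_ctz() takes. A quick userspace demo (not kernel code, 
64-bit build assumed):

#include <stdio.h>

int main(void)
{
	unsigned long sz_16g = 1UL << 34;	/* SZ_16G */

	/*
	 * __builtin_ctz() takes unsigned int, so SZ_16G would be
	 * truncated to 0 and the result would be undefined; the long
	 * variant sees all 64 bits and returns the expected shift.
	 */
	printf("%d\n", __builtin_ctzl(sz_16g));	/* prints 34 */

	return 0;
}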




--
Alexey


Re: [PATCH v6 30/48] KVM: PPC: Book3S HV P9: Implement the rest of the P9 path in C

2021-04-08 Thread Alexey Kardashevskiy




On 05/04/2021 11:19, Nicholas Piggin wrote:

Almost all logic is moved to C, by introducing a new in_guest mode for
the P9 path that branches very early in the KVM interrupt handler to
P9 exit code.

The main P9 entry and exit assembly is now only about 160 lines of low
level stack setup and register save/restore, plus a bad-interrupt
handler.

There are two motivations for this: the first is just to make the code more
maintainable by being in C. The second is to reduce the amount of code
running in a special KVM mode, "realmode". In quotes because with radix
it is no longer necessarily real-mode in the MMU, but it still has to be
treated specially because it may be in real-mode, and has various
important registers like PID, DEC, TB, etc set to guest. This is hostile
to the rest of Linux and can't use arbitrary kernel functionality or be
instrumented well.

This initial patch is a reasonably faithful conversion of the asm code,
but it does lack any loop to return quickly back into the guest without
switching out of realmode in the case of unimportant or easily handled
interrupts. As explained in previous changes, handling HV interrupts
in real mode is not so important for P9.

Use of the Linux 64s interrupt entry code register conventions, including
paca EX_ save areas, is brought into the KVM code. There is no point
shuffling things into different paca save areas and making up a
different calling convention for KVM.

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/include/asm/asm-prototypes.h |   3 +-
  arch/powerpc/include/asm/kvm_asm.h|   3 +-
  arch/powerpc/include/asm/kvm_book3s_64.h  |   8 +
  arch/powerpc/include/asm/kvm_host.h   |   7 +-
  arch/powerpc/kernel/security.c|   5 +-
  arch/powerpc/kvm/Makefile |   1 +
  arch/powerpc/kvm/book3s_64_entry.S| 247 ++
  arch/powerpc/kvm/book3s_hv.c  |   9 +-
  arch/powerpc/kvm/book3s_hv_interrupt.c| 218 +++
  arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 125 +--
  10 files changed, 501 insertions(+), 125 deletions(-)
  create mode 100644 arch/powerpc/kvm/book3s_hv_interrupt.c

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 939f3c94c8f3..7c74c80ed994 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -122,6 +122,7 @@ extern s32 patch__call_flush_branch_caches3;
  extern s32 patch__flush_count_cache_return;
  extern s32 patch__flush_link_stack_return;
  extern s32 patch__call_kvm_flush_link_stack;
+extern s32 patch__call_kvm_flush_link_stack_p9;
  extern s32 patch__memset_nocache, patch__memcpy_nocache;
  
  extern long flush_branch_caches;

@@ -142,7 +143,7 @@ void kvmhv_load_host_pmu(void);
  void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use);
  void kvmhv_load_guest_pmu(struct kvm_vcpu *vcpu);
  
-int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);

+void kvmppc_p9_enter_guest(struct kvm_vcpu *vcpu);
  
  long kvmppc_h_set_dabr(struct kvm_vcpu *vcpu, unsigned long dabr);

  long kvmppc_h_set_xdabr(struct kvm_vcpu *vcpu, unsigned long dabr,
diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index a3633560493b..b4f9996bd331 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -146,7 +146,8 @@
  #define KVM_GUEST_MODE_GUEST  1
  #define KVM_GUEST_MODE_SKIP   2
  #define KVM_GUEST_MODE_GUEST_HV   3
-#define KVM_GUEST_MODE_HOST_HV 4
+#define KVM_GUEST_MODE_GUEST_HV_FAST   4 /* ISA v3.0 with host radix mode */
+#define KVM_GUEST_MODE_HOST_HV 5
  
  #define KVM_INST_FETCH_FAILED	-1
  
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h

index 9bb9bb370b53..c214bcffb441 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -153,9 +153,17 @@ static inline bool kvmhv_vcpu_is_radix(struct kvm_vcpu 
*vcpu)
return radix;
  }
  
+int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);

+
  #define KVM_DEFAULT_HPT_ORDER 24  /* 16MB HPT by default */
  #endif
  
+/*

+ * Invalid HDSISR value which is used to indicate when HW has not set the reg.
+ * Used to work around an errata.
+ */
+#define HDSISR_CANARY  0x7fff
+
  /*
   * We use a lock bit in HPTE dword 0 to synchronize updates and
   * accesses to each HPTE, and another bit to indicate non-present
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 05fb00d37609..fa0083345b11 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -690,7 +690,12 @@ struct kvm_vcpu_arch {
ulong fault_dar;
u32 fault_dsisr;
unsigned long intr_msr;
-   ulong fault_gpa;/* guest real address of page fault (POWER9) */
+   /*
+* POWER9 and later, fault_gpa contains the guest real address of page
+* fault for 

Re: [PATCH v2 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-04-08 Thread Alexey Kardashevskiy




On 08/04/2021 15:37, Michael Ellerman wrote:

Leonardo Bras  writes:

According to LoPAR, ibm,query-pe-dma-window output named "IO Page Sizes"
will let the OS know all possible pagesizes that can be used for creating a
new DDW.

Currently Linux will only try using 3 of the 8 available options:
4K, 64K and 16M. According to LoPAR, Hypervisor may also offer 32M, 64M,
128M, 256M and 16G.


Do we know of any hardware & hypervisor combination that will actually
give us bigger pages?



On P8 16MB host pages and 16MB hardware iommu pages worked.

On P9, VM's 16MB IOMMU pages worked on top of 2MB host pages + 2MB 
hardware IOMMU pages.






Enabling bigger pages would be interesting for direct mapping systems
with a lot of RAM, while using fewer TCE entries.

Signed-off-by: Leonardo Bras 
---
  arch/powerpc/platforms/pseries/iommu.c | 49 ++
  1 file changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9fc5217f0c8e..6cda1c92597d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,6 +53,20 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
  };


A comment saying where the values come from would be good.


+#define QUERY_DDW_PGSIZE_4K0x01
+#define QUERY_DDW_PGSIZE_64K   0x02
+#define QUERY_DDW_PGSIZE_16M   0x04
+#define QUERY_DDW_PGSIZE_32M   0x08
+#define QUERY_DDW_PGSIZE_64M   0x10
+#define QUERY_DDW_PGSIZE_128M  0x20
+#define QUERY_DDW_PGSIZE_256M  0x40
+#define QUERY_DDW_PGSIZE_16G   0x80


I'm not sure the #defines really gain us much vs just putting the
literal values in the array below?



Then someone says "u magic values" :) I do not mind either way. Thanks,




+struct iommu_ddw_pagesize {
+   u32 mask;
+   int shift;
+};
+
  static struct iommu_table_group *iommu_pseries_alloc_group(int node)
  {
struct iommu_table_group *table_group;
@@ -1099,6 +1113,31 @@ static void reset_dma_window(struct pci_dev *dev, struct 
device_node *par_dn)
 ret);
  }
  
+/* Returns page shift based on "IO Page Sizes" output at ibm,query-pe-dma-window. See LoPAR */

+static int iommu_get_page_shift(u32 query_page_size)
+{
+   const struct iommu_ddw_pagesize ddw_pagesize[] = {
+   { QUERY_DDW_PGSIZE_16G,  __builtin_ctz(SZ_16G)  },
+   { QUERY_DDW_PGSIZE_256M, __builtin_ctz(SZ_256M) },
+   { QUERY_DDW_PGSIZE_128M, __builtin_ctz(SZ_128M) },
+   { QUERY_DDW_PGSIZE_64M,  __builtin_ctz(SZ_64M)  },
+   { QUERY_DDW_PGSIZE_32M,  __builtin_ctz(SZ_32M)  },
+   { QUERY_DDW_PGSIZE_16M,  __builtin_ctz(SZ_16M)  },
+   { QUERY_DDW_PGSIZE_64K,  __builtin_ctz(SZ_64K)  },
+   { QUERY_DDW_PGSIZE_4K,   __builtin_ctz(SZ_4K)   }
+   };



cheers



--
Alexey


Re: [PATCH v4 29/46] KVM: PPC: Book3S HV P9: Implement the rest of the P9 path in C

2021-04-01 Thread Alexey Kardashevskiy
 exsave[EX_DAR/sizeof(u64)];
+   vcpu->arch.fault_dsisr = exsave[EX_DSISR/sizeof(u64)];
+   vcpu->arch.fault_gpa = mfspr(SPRN_ASDR);
+
+   } else if (trap == BOOK3S_INTERRUPT_H_INST_STORAGE) {
+   vcpu->arch.fault_gpa = mfspr(SPRN_ASDR);
+
+   } else if (trap == BOOK3S_INTERRUPT_H_FAC_UNAVAIL) {
+   vcpu->arch.hfscr = mfspr(SPRN_HFSCR);
+
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+   /*
+* Softpatch interrupt for transactional memory emulation cases
+* on POWER9 DD2.2.  This is early in the guest exit path - we
+* haven't saved registers or done a treclaim yet.
+*/
+   } else if (trap == BOOK3S_INTERRUPT_HV_SOFTPATCH) {
+   vcpu->arch.emul_inst = mfspr(SPRN_HEIR);
+
+   /*
+* The cases we want to handle here are those where the guest
+* is in real suspend mode and is trying to transition to
+* transactional mode.
+*/
+   if (local_paca->kvm_hstate.fake_suspend &&
+   (vcpu->arch.shregs.msr & MSR_TS_S)) {
+   if (kvmhv_p9_tm_emulation_early(vcpu)) {
+   /* Prevent it being handled again. */
+   trap = 0;
+   }
+   }
+#endif
+   }
+
+   radix_clear_slb();
+
+   __mtmsrd(msr, 0);



The asm code only sets RI but this potentially sets more bits including
MSR_EE, is it expected to be 0 when __kvmhv_vcpu_entry_p9() is called?


Yes.


+   mtspr(SPRN_CTRLT, 1);


What is this for? ISA does not shed much light:
===
63 RUN This bit controls an external I/O pin.
===


I don't think it even does that these days. It interacts with the PMU.
I was looking at whether it's feasible to move it into PMU code entirely,
but apparently some tool or something might sample it. I'm a bit
suspicious about that because an untrusted guest could be running and
claim not to, so I don't know what said tool really achieves, but I'll
go through that fight another day.

But KVM has to set it to 1 at exit because Linux host has it set to 1
except in CPU idle.



Is this CTRLT setting a new thing or does the asm do it too? I could not 
spot it.






+
+   accumulate_time(vcpu, &vcpu->arch.rm_exit);


This should not compile without CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.


It has an ifdef wrapper so it should work (it does on my local tree
which is slightly newer than what you have but I don't think I fixed
anything around this recently).



You are absolutely right, my bad.




+
+   end_timing(vcpu);
+
+   return trap;



The asm does "For hash guest, read the guest SLB and save it away", but this
code does not. Is this new fast-path-in-C only for radix-on-radix, or are
hash VMs supported too?


That asm code does not run for the "guest_exit_short_path" case (aka the
P9 path, aka the fast path).

Upstream code only supports radix host and radix guest in this path.
The old path supports hash and radix. That's unchanged with this patch.

After the series, the new path supports all P9 modes (hash/hash,
radix/radix, and radix/hash), and the old path supports P7 and P8 only.



Thanks for the clarification. Besides that CTRLT, I checked that the new C
code matches the old asm code (which made diving into the ISA incredibly
fun :) ) so fwiw


Reviewed-by: Alexey Kardashevskiy 


I'd really like to see longer commit logs clarifying all intended
changes, but it is probably just me.





Thanks,
Nick



--
Alexey


Re: [PATCH v4 29/46] KVM: PPC: Book3S HV P9: Implement the rest of the P9 path in C

2021-03-31 Thread Alexey Kardashevskiy




On 3/23/21 12:02 PM, Nicholas Piggin wrote:

Almost all logic is moved to C, by introducing a new in_guest mode that
selects and branches very early in the interrupt handler to the P9 exit
code.

The remaining assembly is only about 160 lines of low level stack setup,
with VCPU vs host register save and restore, plus a small shim to the
legacy paths in the interrupt handler.

There are two motivations for this: the first is just to make the code more
maintainable by being in C. The second is to reduce the amount of code
running in a special KVM mode, "realmode". I put that in quotes because
with radix it is no longer necessarily real-mode in the MMU, but it
still has to be treated specially because it may be in real-mode, and
has various important registers like PID, DEC, TB, etc set to guest.
This is hostile to the rest of Linux and can't use arbitrary kernel
functionality or be instrumented well.

This initial patch is a reasonably faithful conversion of the asm code.
It does lack any loop to return quickly back into the guest without
switching out of realmode in the case of unimportant or easily handled
interrupts; as explained in the previous change, handling HV interrupts
in real mode is not so important for P9.

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/include/asm/asm-prototypes.h |   3 +-
  arch/powerpc/include/asm/kvm_asm.h|   3 +-
  arch/powerpc/include/asm/kvm_book3s_64.h  |   8 +
  arch/powerpc/kernel/security.c|   5 +-
  arch/powerpc/kvm/Makefile |   3 +
  arch/powerpc/kvm/book3s_64_entry.S| 246 ++
  arch/powerpc/kvm/book3s_hv.c  |   9 +-
  arch/powerpc/kvm/book3s_hv_interrupt.c| 223 
  arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 123 +--
  9 files changed, 500 insertions(+), 123 deletions(-)
  create mode 100644 arch/powerpc/kvm/book3s_hv_interrupt.c

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 939f3c94c8f3..7c74c80ed994 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -122,6 +122,7 @@ extern s32 patch__call_flush_branch_caches3;
  extern s32 patch__flush_count_cache_return;
  extern s32 patch__flush_link_stack_return;
  extern s32 patch__call_kvm_flush_link_stack;
+extern s32 patch__call_kvm_flush_link_stack_p9;
  extern s32 patch__memset_nocache, patch__memcpy_nocache;
  
  extern long flush_branch_caches;

@@ -142,7 +143,7 @@ void kvmhv_load_host_pmu(void);
  void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use);
  void kvmhv_load_guest_pmu(struct kvm_vcpu *vcpu);
  
-int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);

+void kvmppc_p9_enter_guest(struct kvm_vcpu *vcpu);
  
  long kvmppc_h_set_dabr(struct kvm_vcpu *vcpu, unsigned long dabr);

  long kvmppc_h_set_xdabr(struct kvm_vcpu *vcpu, unsigned long dabr,
diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index a3633560493b..b4f9996bd331 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -146,7 +146,8 @@
  #define KVM_GUEST_MODE_GUEST  1
  #define KVM_GUEST_MODE_SKIP   2
  #define KVM_GUEST_MODE_GUEST_HV   3
-#define KVM_GUEST_MODE_HOST_HV 4
+#define KVM_GUEST_MODE_GUEST_HV_FAST   4 /* ISA v3.0 with host radix mode */
+#define KVM_GUEST_MODE_HOST_HV 5
  
  #define KVM_INST_FETCH_FAILED	-1
  
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h

index 9bb9bb370b53..c214bcffb441 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -153,9 +153,17 @@ static inline bool kvmhv_vcpu_is_radix(struct kvm_vcpu 
*vcpu)
return radix;
  }
  
+int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);

+
  #define KVM_DEFAULT_HPT_ORDER 24  /* 16MB HPT by default */
  #endif
  
+/*

+ * Invalid HDSISR value which is used to indicate when HW has not set the reg.
+ * Used to work around an errata.
+ */
+#define HDSISR_CANARY  0x7fff
+
  /*
   * We use a lock bit in HPTE dword 0 to synchronize updates and
   * accesses to each HPTE, and another bit to indicate non-present
diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index e4e1a94ccf6a..3a607c11f20f 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -430,16 +430,19 @@ device_initcall(stf_barrier_debugfs_init);
  
  static void update_branch_cache_flush(void)

  {
-   u32 *site;
+   u32 *site, __maybe_unused *site2;
  
  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE

site = __call_kvm_flush_link_stack;
+   site2 = __call_kvm_flush_link_stack_p9;
// This controls the branch from guest_exit_cont to kvm_flush_link_stack
if (link_stack_flush_type == BRANCH_CACHE_FLUSH_NONE) {
patch_instruction_site(site, ppc_inst(PPC_INST_NOP));
+   patch_instruction_site(site2, 

Re: [PATCH v4 24/46] KVM: PPC: Book3S HV P9: Use large decrementer for HDEC

2021-03-25 Thread Alexey Kardashevskiy




On 23/03/2021 12:02, Nicholas Piggin wrote:

On processors that don't suppress the HDEC exceptions when LPCR[HDICE]=0,
this could help reduce needless guest exits due to leftover exceptions on
entering the guest.

Reviewed-by: Alexey Kardashevskiy 
Signed-off-by: Nicholas Piggin 



ERROR: modpost: "decrementer_max" [arch/powerpc/kvm/kvm-hv.ko] undefined!


need this:

--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -89,6 +89,7 @@ static struct clocksource clocksource_timebase = {

 #define DECREMENTER_DEFAULT_MAX 0x7FFF
 u64 decrementer_max = DECREMENTER_DEFAULT_MAX;
+EXPORT_SYMBOL_GPL(decrementer_max);



---
  arch/powerpc/include/asm/time.h | 2 ++
  arch/powerpc/kvm/book3s_hv.c| 3 ++-
  2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 8dd3cdb25338..68d94711811e 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -18,6 +18,8 @@
  #include 
  
  /* time.c */

+extern u64 decrementer_max;
+
  extern unsigned long tb_ticks_per_jiffy;
  extern unsigned long tb_ticks_per_usec;
  extern unsigned long tb_ticks_per_sec;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8215430e6d5e..bb30c5ab53d1 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3658,7 +3658,8 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
vc->tb_offset_applied = 0;
}
  
-	mtspr(SPRN_HDEC, 0x7fff);

+   /* HDEC must be at least as large as DEC, so decrementer_max fits */
+   mtspr(SPRN_HDEC, decrementer_max);
  
  	switch_mmu_to_host_radix(kvm, host_pidr);
  



--
Alexey


Re: [PATCH v4 28/46] KVM: PPC: Book3S HV P9: Reduce irq_work vs guest decrementer races

2021-03-23 Thread Alexey Kardashevskiy




On 23/03/2021 12:02, Nicholas Piggin wrote:

irq_work's use of the DEC SPR is racy with guest<->host switch and guest
entry which flips the DEC interrupt to guest, which could lose a host
work interrupt.

This patch closes one race, and attempts to comment another class of
races.

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/kvm/book3s_hv.c | 15 ++-
  1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1f38a0abc611..989a1ff5ad11 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3745,6 +3745,18 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
if (!(vcpu->arch.ctrl & 1))
mtspr(SPRN_CTRLT, mfspr(SPRN_CTRLF) & ~1);
  
+	/*

+* When setting DEC, we must always deal with irq_work_raise via NMI vs
+* setting DEC. The problem occurs right as we switch into guest mode
+* if a NMI hits and sets pending work and sets DEC, then that will
+* apply to the guest and not bring us back to the host.
+*
+* irq_work_raise could check a flag (or possibly LPCR[HDICE] for
+* example) and set HDEC to 1? That wouldn't solve the nested hv
+* case which needs to abort the hcall or zero the time limit.
+*
+* XXX: Another day's problem.
+*/
mtspr(SPRN_DEC, vcpu->arch.dec_expires - tb);
  
  	if (kvmhv_on_pseries()) {

@@ -3879,7 +3891,8 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
vc->entry_exit_map = 0x101;
vc->in_guest = 0;
  
-	mtspr(SPRN_DEC, local_paca->kvm_hstate.dec_expires - tb);

+   set_dec_or_work(local_paca->kvm_hstate.dec_expires - tb);



set_dec_or_work() will write local_paca->kvm_hstate.dec_expires - tb - 1
to SPRN_DEC, which is not exactly the same; is this still alright?


I asked in v3 but it is probably lost :)
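For reference, the off-by-one comes from set_dec() itself; per the
asm/time.h context quoted later in this archive, it is essentially:

	static inline void set_dec(u64 val)
	{
		mtspr(SPRN_DEC, val - 1);
	}

so set_dec_or_work(val) also programs val - 1, and simply re-arms DEC to 1
if irq work raced in.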


+
mtspr(SPRN_SPRG_VDSO_WRITE, local_paca->sprg_vdso);
  
  	kvmhv_load_host_pmu();




--
Alexey


Re: [PATCH v4 22/46] KVM: PPC: Book3S HV P9: Stop handling hcalls in real-mode in the P9 path

2021-03-23 Thread Alexey Kardashevskiy




On 23/03/2021 20:16, Nicholas Piggin wrote:

Excerpts from Alexey Kardashevskiy's message of March 23, 2021 7:02 pm:



On 23/03/2021 12:02, Nicholas Piggin wrote:

In the interest of minimising the amount of code that is run in
"real-mode", don't handle hcalls in real mode in the P9 path.

POWER8 and earlier are much more expensive to exit from HV real mode
and switch to host mode, because on those processors HV interrupts get
to the hypervisor with the MMU off, and the other threads in the core
need to be pulled out of the guest, and SLBs all need to be saved,
ERATs invalidated, and host SLB reloaded before the MMU is re-enabled
in host mode. Hash guests also require a lot of hcalls to run. The
XICS interrupt controller requires hcalls to run.

By contrast, POWER9 has independent thread switching, and in radix mode
the hypervisor is already in a host virtual memory mode when the HV
interrupt is taken. Radix + xive guests don't need hcalls to handle
interrupts or manage translations.

So it's much less important to handle hcalls in real mode in P9.

Signed-off-by: Nicholas Piggin 
---
   arch/powerpc/include/asm/kvm_ppc.h  |  5 ++
   arch/powerpc/kvm/book3s_hv.c| 57 
   arch/powerpc/kvm/book3s_hv_rmhandlers.S |  5 ++
   arch/powerpc/kvm/book3s_xive.c  | 70 +
   4 files changed, 127 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 73b1ca5a6471..db6646c2ade2 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -607,6 +607,7 @@ extern void kvmppc_free_pimap(struct kvm *kvm);
   extern int kvmppc_xics_rm_complete(struct kvm_vcpu *vcpu, u32 hcall);
   extern void kvmppc_xics_free_icp(struct kvm_vcpu *vcpu);
   extern int kvmppc_xics_hcall(struct kvm_vcpu *vcpu, u32 cmd);
+extern int kvmppc_xive_xics_hcall(struct kvm_vcpu *vcpu, u32 req);
   extern u64 kvmppc_xics_get_icp(struct kvm_vcpu *vcpu);
   extern int kvmppc_xics_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
   extern int kvmppc_xics_connect_vcpu(struct kvm_device *dev,
@@ -639,6 +640,8 @@ static inline int kvmppc_xics_enabled(struct kvm_vcpu *vcpu)
   static inline void kvmppc_xics_free_icp(struct kvm_vcpu *vcpu) { }
   static inline int kvmppc_xics_hcall(struct kvm_vcpu *vcpu, u32 cmd)
{ return 0; }
+static inline int kvmppc_xive_xics_hcall(struct kvm_vcpu *vcpu, u32 req)
+   { return 0; }
   #endif
   
   #ifdef CONFIG_KVM_XIVE

@@ -673,6 +676,7 @@ extern int kvmppc_xive_set_irq(struct kvm *kvm, int 
irq_source_id, u32 irq,
   int level, bool line_status);
   extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
   extern void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu);
+extern void kvmppc_xive_cede_vcpu(struct kvm_vcpu *vcpu);
   
   static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)

   {
@@ -714,6 +718,7 @@ static inline int kvmppc_xive_set_irq(struct kvm *kvm, int 
irq_source_id, u32 ir
  int level, bool line_status) { return 
-ENODEV; }
   static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
   static inline void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu) { }
+static inline void kvmppc_xive_cede_vcpu(struct kvm_vcpu *vcpu) { }
   
   static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)

{ return 0; }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index fa7614c37e08..17739aaee3d8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1142,12 +1142,13 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
   }
   
   /*

- * Handle H_CEDE in the nested virtualization case where we haven't
- * called the real-mode hcall handlers in book3s_hv_rmhandlers.S.
+ * Handle H_CEDE in the P9 path where we don't call the real-mode hcall
+ * handlers in book3s_hv_rmhandlers.S.
+ *
* This has to be done early, not in kvmppc_pseries_do_hcall(), so
* that the cede logic in kvmppc_run_single_vcpu() works properly.
*/
-static void kvmppc_nested_cede(struct kvm_vcpu *vcpu)
+static void kvmppc_cede(struct kvm_vcpu *vcpu)
   {
vcpu->arch.shregs.msr |= MSR_EE;
vcpu->arch.ceded = 1;
@@ -1403,9 +1404,15 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
/* hcall - punt to userspace */
int i;
   
-		/* hypercall with MSR_PR has already been handled in rmode,

-* and never reaches here.
-*/
+   if (unlikely(vcpu->arch.shregs.msr & MSR_PR)) {
+   /*
+* Guest userspace executed sc 1, reflect it back as a
+* privileged program check interrupt.
+*/
+   kvmppc_core_queue_program(vcpu, SRR1_PROGPRIV);
+   r = RESUME_GUEST;
+   break;
+   }
   
   		

Re: [PATCH v4 22/46] KVM: PPC: Book3S HV P9: Stop handling hcalls in real-mode in the P9 path

2021-03-23 Thread Alexey Kardashevskiy
hv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
return trap;
  }
  
+static inline bool hcall_is_xics(unsigned long req)

+{
+   return (req == H_EOI || req == H_CPPR || req == H_IPI ||
+   req == H_IPOLL || req == H_XIRR || req == H_XIRR_X);


Do not need braces :)



+}
+
  /*
   * Virtual-mode guest entry for POWER9 and later when the host and
   * guest are both using the radix MMU.  The LPIDR has already been set.
@@ -3774,15 +3787,36 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
/* H_CEDE has to be handled now, not later */
if (trap == BOOK3S_INTERRUPT_SYSCALL && !vcpu->arch.nested &&
kvmppc_get_gpr(vcpu, 3) == H_CEDE) {
-   kvmppc_nested_cede(vcpu);
+   kvmppc_cede(vcpu);
kvmppc_set_gpr(vcpu, 3, 0);
trap = 0;
}
} else {
kvmppc_xive_push_vcpu(vcpu);
trap = kvmhv_load_hv_regs_and_go(vcpu, time_limit, lpcr);
+   if (trap == BOOK3S_INTERRUPT_SYSCALL && !vcpu->arch.nested &&
+   !(vcpu->arch.shregs.msr & MSR_PR)) {
+   unsigned long req = kvmppc_get_gpr(vcpu, 3);
+
+   /* H_CEDE has to be handled now, not later */
+   if (req == H_CEDE) {
+   kvmppc_cede(vcpu);
+   kvmppc_xive_cede_vcpu(vcpu); /* may un-cede */
+   kvmppc_set_gpr(vcpu, 3, 0);
+   trap = 0;
+
+   /* XICS hcalls must be handled before xive is pulled */
+   } else if (hcall_is_xics(req)) {
+   int ret;
+
+   ret = kvmppc_xive_xics_hcall(vcpu, req);
+   if (ret != H_TOO_HARD) {
+   kvmppc_set_gpr(vcpu, 3, ret);
+   trap = 0;
+   }
+   }
+   }
kvmppc_xive_pull_vcpu(vcpu);
-
}
  
  	vcpu->arch.slb_max = 0;

@@ -4442,8 +4476,11 @@ static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu)
else
r = kvmppc_run_vcpu(vcpu);
  
-		if (run->exit_reason == KVM_EXIT_PAPR_HCALL &&

-   !(vcpu->arch.shregs.msr & MSR_PR)) {
+   if (run->exit_reason == KVM_EXIT_PAPR_HCALL) {
+   if (WARN_ON_ONCE(vcpu->arch.shregs.msr & MSR_PR)) {
+   r = RESUME_GUEST;
+   continue;
+   }
trace_kvm_hcall_enter(vcpu);
r = kvmppc_pseries_do_hcall(vcpu);
trace_kvm_hcall_exit(vcpu, r);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c11597f815e4..2d0d14ed1d92 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1397,9 +1397,14 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
mr  r4,r9
bge fast_guest_return
  2:
+   /* If we came in through the P9 short path, no real mode hcalls */
+   lwz r0, STACK_SLOT_SHORT_PATH(r1)
+   cmpwi   r0, 0
+   bne no_try_real



BTW, is the MMU on at this point? Or does it get enabled by rfid at the end
of guest_exit_short_path?



anyway,


Reviewed-by: Alexey Kardashevskiy 




/* See if this is an hcall we can handle in real mode */
cmpwi   r12,BOOK3S_INTERRUPT_SYSCALL
beq hcall_try_real_mode
+no_try_real:
  
  	/* Hypervisor doorbell - exit only if host IPI flag set */

cmpwi   r12, BOOK3S_INTERRUPT_H_DOORBELL
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 741bf1f4387a..dcc07ceaf5ca 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -158,6 +158,40 @@ void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu)
  }
  EXPORT_SYMBOL_GPL(kvmppc_xive_pull_vcpu);
  
+void kvmppc_xive_cede_vcpu(struct kvm_vcpu *vcpu)

+{
+   void __iomem *esc_vaddr = (void __iomem *)vcpu->arch.xive_esc_vaddr;
+
+   if (!esc_vaddr)
+   return;
+
+   /* we are using XIVE with single escalation */
+
+   if (vcpu->arch.xive_esc_on) {
+   /*
+* If we still have a pending escalation, abort the cede,
+* and we must set PQ to 10 rather than 00 so that we don't
+* potentially end up with two entries for the escalation
+* interrupt in the XIVE interrupt queue.  In that case
+* we also don't want to set xive_esc_on to 1 here in
+* case we race with xive_esc_irq().
+*/
+ 

Re: [PATCH v4 04/46] KVM: PPC: Book3S HV: Prevent radix guests from setting LPCR[TC]

2021-03-23 Thread Alexey Kardashevskiy




On 23/03/2021 12:02, Nicholas Piggin wrote:

This bit only applies to hash partitions.

Signed-off-by: Nicholas Piggin 




Reviewed-by: Alexey Kardashevskiy 


---
  arch/powerpc/kvm/book3s_hv.c| 6 ++
  arch/powerpc/kvm/book3s_hv_nested.c | 3 +--
  2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index c5de7e3f22b6..1ffb0902e779 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1645,6 +1645,12 @@ static int kvm_arch_vcpu_ioctl_set_sregs_hv(struct 
kvm_vcpu *vcpu,
   */
  unsigned long kvmppc_filter_lpcr_hv(struct kvmppc_vcore *vc, unsigned long 
lpcr)
  {
+   struct kvm *kvm = vc->kvm;
+
+   /* LPCR_TC only applies to HPT guests */
+   if (kvm_is_radix(kvm))
+   lpcr &= ~LPCR_TC;
+
/* On POWER8 and above, userspace can modify AIL */
if (!cpu_has_feature(CPU_FTR_ARCH_207S))
lpcr &= ~LPCR_AIL;
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index f7b441b3eb17..851e3f527eb2 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -140,8 +140,7 @@ static void sanitise_hv_regs(struct kvm_vcpu *vcpu, struct 
hv_guest_state *hr)
/*
 * Don't let L1 change LPCR bits for the L2 except these:
 */
-   mask = LPCR_DPFD | LPCR_ILE | LPCR_TC | LPCR_AIL | LPCR_LD |
-   LPCR_LPES | LPCR_MER;
+   mask = LPCR_DPFD | LPCR_ILE | LPCR_AIL | LPCR_LD | LPCR_LPES | LPCR_MER;
hr->lpcr = kvmppc_filter_lpcr_hv(vc,
(vc->lpcr & ~mask) | (hr->lpcr & mask));
  



--
Alexey


Re: [PATCH 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-03-23 Thread Alexey Kardashevskiy




On 23/03/2021 06:09, Leonardo Bras wrote:

According to LoPAR, ibm,query-pe-dma-window output named "IO Page Sizes"
will let the OS know all possible pagesizes that can be used for creating a
new DDW.

Currently Linux will only try using 3 of the 8 available options:
4K, 64K and 16M. According to LoPAR, Hypervisor may also offer 32M, 64M,
128M, 256M and 16G.

Enabling bigger pages would be interesting for direct mapping systems
with a lot of RAM, while using less TCE entries.
Signed-off-by: Leonardo Bras 
---
  arch/powerpc/include/asm/iommu.h   |  8 
  arch/powerpc/platforms/pseries/iommu.c | 28 +++---
  2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index deef7c94d7b6..c170048b7a1b 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -19,6 +19,14 @@
  #include 
  #include 
  
+#define IOMMU_PAGE_SHIFT_16G	34

+#define IOMMU_PAGE_SHIFT_256M  28
+#define IOMMU_PAGE_SHIFT_128M  27
+#define IOMMU_PAGE_SHIFT_64M   26
+#define IOMMU_PAGE_SHIFT_32M   25
+#define IOMMU_PAGE_SHIFT_16M   24
+#define IOMMU_PAGE_SHIFT_64K   16



These are not very descriptive; they are just normal shifts, and could be
as simple as __builtin_ctz(SZ_4K) (gcc will optimize this) and so on.


OTOH the PAPR page sizes need macros as they are the ones which are 
weird and screaming for macros.


I'd steal/rework spapr_page_mask_to_query_mask() from QEMU. Thanks,
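To make the __builtin_ctz point concrete, a standalone userspace sketch
(with SZ_* redefined locally here rather than taken from linux/sizes.h):

	#include <stdio.h>

	#define SZ_4K	0x1000
	#define SZ_64K	0x10000
	#define SZ_16M	0x1000000

	int main(void)
	{
		/* gcc folds __builtin_ctz of a constant at compile time,
		 * so these cost the same as hardcoded shift values but
		 * stay self-documenting. */
		printf("4K  -> shift %d\n", __builtin_ctz(SZ_4K));	/* 12 */
		printf("64K -> shift %d\n", __builtin_ctz(SZ_64K));	/* 16 */
		printf("16M -> shift %d\n", __builtin_ctz(SZ_16M));	/* 24 */
		return 0;
	}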





+
  #define IOMMU_PAGE_SHIFT_4K  12
  #define IOMMU_PAGE_SIZE_4K   (ASM_CONST(1) << IOMMU_PAGE_SHIFT_4K)
  #define IOMMU_PAGE_MASK_4K   (~((1 << IOMMU_PAGE_SHIFT_4K) - 1))
diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9fc5217f0c8e..02958e80aa91 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1099,6 +1099,24 @@ static void reset_dma_window(struct pci_dev *dev, struct 
device_node *par_dn)
 ret);
  }
  
+/* Returns page shift based on "IO Page Sizes" output at ibm,query-pe-dma-window. See LoPAR */

+static int iommu_get_page_shift(u32 query_page_size)
+{
+   const int shift[] = {IOMMU_PAGE_SHIFT_4K,   IOMMU_PAGE_SHIFT_64K,  
IOMMU_PAGE_SHIFT_16M,
+IOMMU_PAGE_SHIFT_32M,  IOMMU_PAGE_SHIFT_64M,  
IOMMU_PAGE_SHIFT_128M,
+IOMMU_PAGE_SHIFT_256M, IOMMU_PAGE_SHIFT_16G};
+   int i = ARRAY_SIZE(shift) - 1;
+
+   /* Looks for the largest page size supported */
+   for (; i >= 0; i--) {
+   if (query_page_size & (1 << i))
+   return shift[i];
+   }
+
+   /* No valid page size found. */
+   return 0;
+}
+
  /*
   * If the PE supports dynamic dma windows, and there is space for a table
   * that can map all pages in a linear offset, then setup such a table,
@@ -1206,13 +1224,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
goto out_failed;
}
}
-   if (query.page_size & 4) {
-   page_shift = 24; /* 16MB */
-   } else if (query.page_size & 2) {
-   page_shift = 16; /* 64kB */
-   } else if (query.page_size & 1) {
-   page_shift = 12; /* 4kB */
-   } else {
+
+   page_shift = iommu_get_page_shift(query.page_size);
+   if (!page_shift) {
dev_dbg(>dev, "no supported direct page size in mask %x",
  query.page_size);
goto out_failed;



--
Alexey


Re: [PATCH v3 25/41] KVM: PPC: Book3S HV P9: Reduce irq_work vs guest decrementer races

2021-03-22 Thread Alexey Kardashevskiy




On 06/03/2021 02:06, Nicholas Piggin wrote:

irq_work's use of the DEC SPR is racy with guest<->host switch and guest
entry which flips the DEC interrupt to guest, which could lose a host
work interrupt.

This patch closes one race, and attempts to comment another class of
races.

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/kvm/book3s_hv.c | 15 ++-
  1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6f3e3aed99aa..b7a88960ac49 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3704,6 +3704,18 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
if (!(vcpu->arch.ctrl & 1))
mtspr(SPRN_CTRLT, mfspr(SPRN_CTRLF) & ~1);
  
+	/*

+* When setting DEC, we must always deal with irq_work_raise via NMI vs
+* setting DEC. The problem occurs right as we switch into guest mode
+* if a NMI hits and sets pending work and sets DEC, then that will
+* apply to the guest and not bring us back to the host.
+*
+* irq_work_raise could check a flag (or possibly LPCR[HDICE] for
+* example) and set HDEC to 1? That wouldn't solve the nested hv
+* case which needs to abort the hcall or zero the time limit.
+*
+* XXX: Another day's problem.
+*/
mtspr(SPRN_DEC, vcpu->arch.dec_expires - tb);
  
  	if (kvmhv_on_pseries()) {

@@ -3838,7 +3850,8 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
vc->entry_exit_map = 0x101;
vc->in_guest = 0;
  
-	mtspr(SPRN_DEC, local_paca->kvm_hstate.dec_expires - tb);

+   set_dec_or_work(local_paca->kvm_hstate.dec_expires - tb);


set_dec_or_work() will write local_paca->kvm_hstate.dec_expires - tb - 1
to SPRN_DEC, which is not exactly the same; is this still alright?




+
mtspr(SPRN_SPRG_VDSO_WRITE, local_paca->sprg_vdso);
  
  	kvmhv_load_host_pmu();




--
Alexey


Re: [PATCH v3 24/41] powerpc: add set_dec_or_work API for safely updating decrementer

2021-03-22 Thread Alexey Kardashevskiy




On 06/03/2021 02:06, Nicholas Piggin wrote:

Decrementer updates must always check for new irq work to avoid an
irq work decrementer interrupt being lost.

Add an API for this in the timer code so callers don't have to care
about details.

Signed-off-by: Nicholas Piggin 


Reviewed-by: Alexey Kardashevskiy 



---
  arch/powerpc/include/asm/time.h |  9 +
  arch/powerpc/kernel/time.c  | 20 +++-
  2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 0128cd9769bc..d62bde57bf02 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -78,6 +78,15 @@ static inline void set_dec(u64 val)
mtspr(SPRN_DEC, val - 1);
  }
  
+#ifdef CONFIG_IRQ_WORK

+void set_dec_or_work(u64 val);
+#else
+static inline void set_dec_or_work(u64 val)
+{
+   set_dec(val);
+}
+#endif
+
  static inline unsigned long tb_ticks_since(unsigned long tstamp)
  {
return mftb() - tstamp;
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index c5d524622c17..341cc8442e5e 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -562,6 +562,15 @@ void arch_irq_work_raise(void)
preempt_enable();
  }
  
+void set_dec_or_work(u64 val)

+{
+   set_dec(val);
+   /* We may have raced with new irq work */
+   if (unlikely(test_irq_work_pending()))
+   set_dec(1);
+}
+EXPORT_SYMBOL_GPL(set_dec_or_work);
+
  #else  /* CONFIG_IRQ_WORK */
  
  #define test_irq_work_pending()	0

@@ -629,10 +638,7 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(timer_interrupt)
} else {
now = *next_tb - now;
if (now <= decrementer_max)
-   set_dec(now);
-   /* We may have raced with new irq work */
-   if (test_irq_work_pending())
-   set_dec(1);
+   set_dec_or_work(now);
__this_cpu_inc(irq_stat.timer_irqs_others);
}
  
@@ -874,11 +880,7 @@ static int decrementer_set_next_event(unsigned long evt,

  struct clock_event_device *dev)
  {
__this_cpu_write(decrementers_next_tb, get_tb() + evt);
-   set_dec(evt);
-
-   /* We may have raced with new irq work */
-   if (test_irq_work_pending())
-   set_dec(1);
+   set_dec_or_work(evt);
  
  	return 0;

  }



--
Alexey


Re: [PATCH v3 20/41] KVM: PPC: Book3S HV P9: Move setting HDEC after switching to guest LPCR

2021-03-22 Thread Alexey Kardashevskiy




On 06/03/2021 02:06, Nicholas Piggin wrote:

LPCR[HDICE]=0 suppresses hypervisor decrementer exceptions on some
processors, so it must be enabled before HDEC is set.


Educating myself - is it not a processor bug when a processor does not
suppress HDEC exceptions with HDICE=0?


Also, why do we want to enable interrupts before writing HDEC? Enabling
it may cause an interrupt right away anyway.


Anyway, whatever the answers are, this is not changed by this patch, and
the change makes sense, so


Reviewed-by: Alexey Kardashevskiy 



Rather than set it in the host LPCR then setting HDEC, move the HDEC
update to after the guest MMU context (including LPCR) is loaded.
There shouldn't be much concern with delaying HDEC by some 10s or 100s
of nanoseconds by setting it a bit later.

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/kvm/book3s_hv.c | 19 +++
  1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1f2ba8955c6a..ffde1917ab68 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3505,20 +3505,9 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
host_dawrx1 = mfspr(SPRN_DAWRX1);
}
  
-	/*

-* P8 and P9 suppress the HDEC exception when LPCR[HDICE] = 0,
-* so set HDICE before writing HDEC.
-*/
-   mtspr(SPRN_LPCR, kvm->arch.host_lpcr | LPCR_HDICE);
-   isync();
-
hdec = time_limit - mftb();
-   if (hdec < 0) {
-   mtspr(SPRN_LPCR, kvm->arch.host_lpcr);
-   isync();
+   if (hdec < 0)
return BOOK3S_INTERRUPT_HV_DECREMENTER;
-   }
-   mtspr(SPRN_HDEC, hdec);
  
  	if (vc->tb_offset) {

u64 new_tb = mftb() + vc->tb_offset;
@@ -3564,6 +3553,12 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
  
  	switch_mmu_to_guest_radix(kvm, vcpu, lpcr);
  
+	/*

+* P9 suppresses the HDEC exception when LPCR[HDICE] = 0,
+* so set guest LPCR (with HDICE) before writing HDEC.
+*/
+   mtspr(SPRN_HDEC, hdec);
+
mtspr(SPRN_SRR0, vcpu->arch.shregs.srr0);
mtspr(SPRN_SRR1, vcpu->arch.shregs.srr1);
  



--
Alexey


Re: [PATCH v3 21/41] KVM: PPC: Book3S HV P9: Use large decrementer for HDEC

2021-03-22 Thread Alexey Kardashevskiy




On 06/03/2021 02:06, Nicholas Piggin wrote:

On processors that don't suppress the HDEC exceptions when LPCR[HDICE]=0,
this could help reduce needless guest exits due to leftover exceptions on
entering the guest.

Signed-off-by: Nicholas Piggin 


Reviewed-by: Alexey Kardashevskiy 



---
  arch/powerpc/include/asm/time.h | 2 ++
  arch/powerpc/kvm/book3s_hv.c| 3 ++-
  2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 8dd3cdb25338..68d94711811e 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -18,6 +18,8 @@
  #include 
  
  /* time.c */

+extern u64 decrementer_max;
+
  extern unsigned long tb_ticks_per_jiffy;
  extern unsigned long tb_ticks_per_usec;
  extern unsigned long tb_ticks_per_sec;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index ffde1917ab68..24b0680f0ad7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3623,7 +3623,8 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
vc->tb_offset_applied = 0;
}
  
-	mtspr(SPRN_HDEC, 0x7fff);

+   /* HDEC must be at least as large as DEC, so decrementer_max fits */
+   mtspr(SPRN_HDEC, decrementer_max);
  
  	switch_mmu_to_host_radix(kvm, host_pidr);
  





--
Alexey


Re: [PATCH v3 19/41] KVM: PPC: Book3S HV P9: Stop handling hcalls in real-mode in the P9 path

2021-03-22 Thread Alexey Kardashevskiy




On 06/03/2021 02:06, Nicholas Piggin wrote:

In the interest of minimising the amount of code that is run in
"real-mode", don't handle hcalls in real mode in the P9 path.

POWER8 and earlier are much more expensive to exit from HV real mode
and switch to host mode, because on those processors HV interrupts get
to the hypervisor with the MMU off, and the other threads in the core
need to be pulled out of the guest, and SLBs all need to be saved,
ERATs invalidated, and host SLB reloaded before the MMU is re-enabled
in host mode. Hash guests also require a lot of hcalls to run. The
XICS interrupt controller requires hcalls to run.

By contrast, POWER9 has independent thread switching, and in radix mode
the hypervisor is already in a host virtual memory mode when the HV
interrupt is taken. Radix + xive guests don't need hcalls to handle
interrupts or manage translations.

So it's much less important to handle hcalls in real mode in P9.


So acde25726bc6034b (which added "if (kvm_is_radix(vcpu->kvm)) return
H_TOO_HARD") can be reverted, pretty much?






Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/include/asm/kvm_ppc.h  |  5 +++
  arch/powerpc/kvm/book3s_hv.c| 46 +++
  arch/powerpc/kvm/book3s_hv_rmhandlers.S |  5 +++
  arch/powerpc/kvm/book3s_xive.c  | 60 +
  4 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 73b1ca5a6471..db6646c2ade2 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -607,6 +607,7 @@ extern void kvmppc_free_pimap(struct kvm *kvm);
  extern int kvmppc_xics_rm_complete(struct kvm_vcpu *vcpu, u32 hcall);
  extern void kvmppc_xics_free_icp(struct kvm_vcpu *vcpu);
  extern int kvmppc_xics_hcall(struct kvm_vcpu *vcpu, u32 cmd);
+extern int kvmppc_xive_xics_hcall(struct kvm_vcpu *vcpu, u32 req);
  extern u64 kvmppc_xics_get_icp(struct kvm_vcpu *vcpu);
  extern int kvmppc_xics_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
  extern int kvmppc_xics_connect_vcpu(struct kvm_device *dev,
@@ -639,6 +640,8 @@ static inline int kvmppc_xics_enabled(struct kvm_vcpu *vcpu)
  static inline void kvmppc_xics_free_icp(struct kvm_vcpu *vcpu) { }
  static inline int kvmppc_xics_hcall(struct kvm_vcpu *vcpu, u32 cmd)
{ return 0; }
+static inline int kvmppc_xive_xics_hcall(struct kvm_vcpu *vcpu, u32 req)
+   { return 0; }
  #endif
  
  #ifdef CONFIG_KVM_XIVE

@@ -673,6 +676,7 @@ extern int kvmppc_xive_set_irq(struct kvm *kvm, int 
irq_source_id, u32 irq,
   int level, bool line_status);
  extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
  extern void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu);
+extern void kvmppc_xive_cede_vcpu(struct kvm_vcpu *vcpu);
  
  static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)

  {
@@ -714,6 +718,7 @@ static inline int kvmppc_xive_set_irq(struct kvm *kvm, int 
irq_source_id, u32 ir
  int level, bool line_status) { return 
-ENODEV; }
  static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
  static inline void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu) { }
+static inline void kvmppc_xive_cede_vcpu(struct kvm_vcpu *vcpu) { }
  
  static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)

{ return 0; }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 497f216ad724..1f2ba8955c6a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1147,7 +1147,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
   * This has to be done early, not in kvmppc_pseries_do_hcall(), so
   * that the cede logic in kvmppc_run_single_vcpu() works properly.
   */
-static void kvmppc_nested_cede(struct kvm_vcpu *vcpu)
+static void kvmppc_cede(struct kvm_vcpu *vcpu)
  {
vcpu->arch.shregs.msr |= MSR_EE;
vcpu->arch.ceded = 1;
@@ -1403,9 +1403,15 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
/* hcall - punt to userspace */
int i;
  
-		/* hypercall with MSR_PR has already been handled in rmode,

-* and never reaches here.
-*/
+   if (unlikely(vcpu->arch.shregs.msr & MSR_PR)) {
+   /*
+* Guest userspace executed sc 1, reflect it back as a
+* privileged program check interrupt.
+*/
+   kvmppc_core_queue_program(vcpu, SRR1_PROGPRIV);
+   r = RESUME_GUEST;
+   break;
+   }
  
  		run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);

for (i = 0; i < 9; ++i)
@@ -3740,15 +3746,36 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
/* H_CEDE has to be handled now, not later */
if (trap == BOOK3S_INTERRUPT_SYSCALL && 

Re: [PATCH v3 16/41] KVM: PPC: Book3S HV P9: Move radix MMU switching instructions together

2021-03-22 Thread Alexey Kardashevskiy
it,
if (cpu_has_feature(CPU_FTR_ARCH_31))
asm volatile(PPC_CP_ABORT);
   
-	mtspr(SPRN_LPID, vcpu->kvm->arch.host_lpid);	/* restore host LPID */

-   isync();
-
vc->dpdes = mfspr(SPRN_DPDES);
vc->vtb = mfspr(SPRN_VTB);
mtspr(SPRN_DPDES, 0);
@@ -3605,7 +3625,8 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
}
   
   	mtspr(SPRN_HDEC, 0x7fff);

-   mtspr(SPRN_LPCR, vcpu->kvm->arch.host_lpcr);
+
+   switch_mmu_to_host_radix(kvm, host_pidr);
   
   	return trap;

   }
@@ -4138,7 +4159,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
   {
struct kvm_run *run = vcpu->run;
int trap, r, pcpu;
-   int srcu_idx, lpid;
+   int srcu_idx;
struct kvmppc_vcore *vc;
struct kvm *kvm = vcpu->kvm;
struct kvm_nested_guest *nested = vcpu->arch.nested;
@@ -4212,13 +4233,6 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
vc->vcore_state = VCORE_RUNNING;
trace_kvmppc_run_core(vc, 0);
   
-	if (cpu_has_feature(CPU_FTR_HVMODE)) {



The new location of mtspr(SPRN_LPID, lpid) does not check for
CPU_FTR_HVMODE anymore, is this going to work with HV KVM on pseries?


Yes, these are moved to HVMODE specific code now.


ah right, kvmhv_on_pseries() is !cpu_has_feature(CPU_FTR_HVMODE).
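For anyone else following along, that helper boils down to roughly this
(modulo the config guards in kvm_book3s_64.h):

	static inline bool kvmhv_on_pseries(void)
	{
		return !cpu_has_feature(CPU_FTR_HVMODE);
	}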


Reviewed-by: Alexey Kardashevskiy 


--
Alexey


Re: [PATCH v3 18/41] KVM: PPC: Book3S HV P9: Move xive vcpu context management into kvmhv_p9_guest_entry

2021-03-21 Thread Alexey Kardashevskiy




On 06/03/2021 02:06, Nicholas Piggin wrote:

Move the xive management up so the low level register switching can be
pushed further down in a later patch. XIVE MMIO CI operations can run in
higher level code with machine checks, tracing, etc., available.

Signed-off-by: Nicholas Piggin 




Reviewed-by: Alexey Kardashevskiy 



---
  arch/powerpc/kvm/book3s_hv.c | 7 +++
  1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index b265522fc467..497f216ad724 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3558,15 +3558,11 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
  
  	switch_mmu_to_guest_radix(kvm, vcpu, lpcr);
  
-	kvmppc_xive_push_vcpu(vcpu);

-
mtspr(SPRN_SRR0, vcpu->arch.shregs.srr0);
mtspr(SPRN_SRR1, vcpu->arch.shregs.srr1);
  
  	trap = __kvmhv_vcpu_entry_p9(vcpu);
  
-	kvmppc_xive_pull_vcpu(vcpu);

-
/* Advance host PURR/SPURR by the amount used by guest */
purr = mfspr(SPRN_PURR);
spurr = mfspr(SPRN_SPURR);
@@ -3749,7 +3745,10 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
trap = 0;
}
} else {
+   kvmppc_xive_push_vcpu(vcpu);
trap = kvmhv_load_hv_regs_and_go(vcpu, time_limit, lpcr);
+   kvmppc_xive_pull_vcpu(vcpu);
+
}
  
  	vcpu->arch.slb_max = 0;




--
Alexey


Re: [PATCH v3 17/41] KVM: PPC: Book3S HV P9: implement kvmppc_xive_pull_vcpu in C

2021-03-21 Thread Alexey Kardashevskiy




On 06/03/2021 02:06, Nicholas Piggin wrote:

This is more symmetric with kvmppc_xive_push_vcpu. The extra test in
the asm will go away in a later change.

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/include/asm/kvm_ppc.h  |  2 ++
  arch/powerpc/kvm/book3s_hv.c|  2 ++
  arch/powerpc/kvm/book3s_hv_rmhandlers.S |  5 
  arch/powerpc/kvm/book3s_xive.c  | 31 +
  4 files changed, 40 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 9531b1c1b190..73b1ca5a6471 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -672,6 +672,7 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 
icpval);
  extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
   int level, bool line_status);
  extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
+extern void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu);
  
  static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)

  {
@@ -712,6 +713,7 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu 
*vcpu, u64 icpval) { retur
  static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 
irq,
  int level, bool line_status) { return 
-ENODEV; }
  static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
+static inline void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu) { }
  
  static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)

{ return 0; }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index b9cae42b9cd5..b265522fc467 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3565,6 +3565,8 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
  
  	trap = __kvmhv_vcpu_entry_p9(vcpu);
  
+	kvmppc_xive_pull_vcpu(vcpu);

+
/* Advance host PURR/SPURR by the amount used by guest */
purr = mfspr(SPRN_PURR);
spurr = mfspr(SPRN_SPURR);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 75405ef53238..c11597f815e4 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1442,6 +1442,11 @@ guest_exit_cont: /* r9 = vcpu, r12 = trap, r13 = 
paca */
bl  kvmhv_accumulate_time
  #endif
  #ifdef CONFIG_KVM_XICS
+   /* If we came in through the P9 short path, xive pull is done in C */
+   lwz r0, STACK_SLOT_SHORT_PATH(r1)
+   cmpwi   r0, 0
+   bne 1f
+
/* We are exiting, pull the VP from the XIVE */
lbz r0, VCPU_XIVE_PUSHED(r9)
cmpwi   cr0, r0, 0
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index e7219b6f5f9a..52cdb9e2660a 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -127,6 +127,37 @@ void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu)
  }
  EXPORT_SYMBOL_GPL(kvmppc_xive_push_vcpu);
  
+/*

+ * Pull a vcpu's context from the XIVE on guest exit.
+ * This assumes we are in virtual mode (MMU on)
+ */
+void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu)
+{
+   void __iomem *tima = local_paca->kvm_hstate.xive_tima_virt;
+
+   if (!vcpu->arch.xive_pushed)
+   return;
+
+   /*
+* Sould not have been pushed if there is no tima



s/Sould/Should/

Otherwise good

Reviewed-by: Alexey Kardashevskiy 




+*/
+   if (WARN_ON(!tima))
+   return;
+
+   eieio();
+   /* First load to pull the context, we ignore the value */
+   __raw_readl(tima + TM_SPC_PULL_OS_CTX);
+   /* Second load to recover the context state (Words 0 and 1) */
+   vcpu->arch.xive_saved_state.w01 = __raw_readq(tima + TM_QW1_OS);
+
+   /* Fixup some of the state for the next load */
+   vcpu->arch.xive_saved_state.lsmfb = 0;
+   vcpu->arch.xive_saved_state.ack = 0xff;
+   vcpu->arch.xive_pushed = 0;
+   eieio();
+}
+EXPORT_SYMBOL_GPL(kvmppc_xive_pull_vcpu);
+
  /*
   * This is a simple trigger for a generic XIVE IRQ. This must
   * only be called for interrupts that support a trigger page



--
Alexey


Re: [PATCH v3 16/41] KVM: PPC: Book3S HV P9: Move radix MMU switching instructions together

2021-03-21 Thread Alexey Kardashevskiy




On 06/03/2021 02:06, Nicholas Piggin wrote:

Switching the MMU from radix<->radix mode is tricky particularly as the
MMU can remain enabled and requires a certain sequence of SPR updates.
Move these together into their own functions.

This also includes the radix TLB check / flush because it's tied in to
MMU switching due to tlbiel getting LPID from LPIDR.

(XXX: isync / hwsync synchronisation TBD)



Looks alright, but what is this comment about? Is something missing, or is
it just suboptimal?





Signed-off-by: Nicholas Piggin 




---
  arch/powerpc/kvm/book3s_hv.c | 55 +---
  1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index f1230f9d98ba..b9cae42b9cd5 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3449,12 +3449,38 @@ static noinline void kvmppc_run_core(struct 
kvmppc_vcore *vc)
trace_kvmppc_run_core(vc, 1);
  }
  
+static void switch_mmu_to_guest_radix(struct kvm *kvm, struct kvm_vcpu *vcpu, u64 lpcr)

+{
+   struct kvmppc_vcore *vc = vcpu->arch.vcore;
+   struct kvm_nested_guest *nested = vcpu->arch.nested;
+   u32 lpid;
+
+   lpid = nested ? nested->shadow_lpid : kvm->arch.lpid;
+
+   mtspr(SPRN_LPID, lpid);
+   mtspr(SPRN_LPCR, lpcr);
+   mtspr(SPRN_PID, vcpu->arch.pid);
+   isync();
+
+   /* TLBIEL must have LPIDR set, so set guest LPID before flushing. */
+   kvmppc_check_need_tlb_flush(kvm, vc->pcpu, nested);
+}
+
+static void switch_mmu_to_host_radix(struct kvm *kvm, u32 pid)
+{
+   mtspr(SPRN_PID, pid);
+   mtspr(SPRN_LPID, kvm->arch.host_lpid);
+   mtspr(SPRN_LPCR, kvm->arch.host_lpcr);
+   isync();
+}
+
  /*
   * Load up hypervisor-mode registers on P9.
   */
  static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 unsigned long lpcr)
  {
+   struct kvm *kvm = vcpu->kvm;
struct kvmppc_vcore *vc = vcpu->arch.vcore;
s64 hdec;
u64 tb, purr, spurr;
@@ -3477,12 +3503,12 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
 * P8 and P9 suppress the HDEC exception when LPCR[HDICE] = 0,
 * so set HDICE before writing HDEC.
 */
-   mtspr(SPRN_LPCR, vcpu->kvm->arch.host_lpcr | LPCR_HDICE);
+   mtspr(SPRN_LPCR, kvm->arch.host_lpcr | LPCR_HDICE);
isync();
  
  	hdec = time_limit - mftb();

if (hdec < 0) {
-   mtspr(SPRN_LPCR, vcpu->kvm->arch.host_lpcr);
+   mtspr(SPRN_LPCR, kvm->arch.host_lpcr);
isync();
return BOOK3S_INTERRUPT_HV_DECREMENTER;
}
@@ -3517,7 +3543,6 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
}
mtspr(SPRN_CIABR, vcpu->arch.ciabr);
mtspr(SPRN_IC, vcpu->arch.ic);
-   mtspr(SPRN_PID, vcpu->arch.pid);
  
  	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |

  (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
@@ -3531,8 +3556,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
  
  	mtspr(SPRN_AMOR, ~0UL);
  
-	mtspr(SPRN_LPCR, lpcr);

-   isync();
+   switch_mmu_to_guest_radix(kvm, vcpu, lpcr);
  
  	kvmppc_xive_push_vcpu(vcpu);
  
@@ -3571,7 +3595,6 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,

mtspr(SPRN_DAWR1, host_dawr1);
mtspr(SPRN_DAWRX1, host_dawrx1);
}
-   mtspr(SPRN_PID, host_pidr);
  
  	/*

 * Since this is radix, do a eieio; tlbsync; ptesync sequence in
@@ -3586,9 +3609,6 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
if (cpu_has_feature(CPU_FTR_ARCH_31))
asm volatile(PPC_CP_ABORT);
  
-	mtspr(SPRN_LPID, vcpu->kvm->arch.host_lpid);	/* restore host LPID */

-   isync();
-
vc->dpdes = mfspr(SPRN_DPDES);
vc->vtb = mfspr(SPRN_VTB);
mtspr(SPRN_DPDES, 0);
@@ -3605,7 +3625,8 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
}
  
  	mtspr(SPRN_HDEC, 0x7fff);

-   mtspr(SPRN_LPCR, vcpu->kvm->arch.host_lpcr);
+
+   switch_mmu_to_host_radix(kvm, host_pidr);
  
  	return trap;

  }
@@ -4138,7 +4159,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
  {
struct kvm_run *run = vcpu->run;
int trap, r, pcpu;
-   int srcu_idx, lpid;
+   int srcu_idx;
struct kvmppc_vcore *vc;
struct kvm *kvm = vcpu->kvm;
struct kvm_nested_guest *nested = vcpu->arch.nested;
@@ -4212,13 +4233,6 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
vc->vcore_state = VCORE_RUNNING;
trace_kvmppc_run_core(vc, 0);
  
-	if (cpu_has_feature(CPU_FTR_HVMODE)) {



The new location of mtspr(SPRN_LPID, lpid) does not check for 
CPU_FTR_HVMODE anymore, is this 

Re: [PATCH v3 15/41] KVM: PPC: Book3S 64: Minimise hcall handler calling convention differences

2021-03-21 Thread Alexey Kardashevskiy




On 06/03/2021 02:06, Nicholas Piggin wrote:

This sets up the same calling convention from interrupt entry to
KVM interrupt handler for system calls as exists for other interrupt
types.

This is a better API, it uses a save area rather than SPR, and it has
more registers free to use. Using a single common API helps maintain
it, and it becomes easier to use in C in a later patch.

Signed-off-by: Nicholas Piggin 



Reviewed-by: Alexey Kardashevskiy 




---
  arch/powerpc/kernel/exceptions-64s.S | 16 +++-
  arch/powerpc/kvm/book3s_64_entry.S   | 22 +++---
  2 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index b4eab5084964..ce6f5f863d3d 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1892,8 +1892,22 @@ EXC_VIRT_END(system_call, 0x4c00, 0x100)
  
  #ifdef CONFIG_KVM_BOOK3S_64_HANDLER

  TRAMP_REAL_BEGIN(kvm_hcall)
+   std r9,PACA_EXGEN+EX_R9(r13)
+   std r11,PACA_EXGEN+EX_R11(r13)
+   std r12,PACA_EXGEN+EX_R12(r13)
+   mfcrr9
mfctr   r10
-   SET_SCRATCH0(r10) /* Save r13 in SCRATCH0 */
+   std r10,PACA_EXGEN+EX_R13(r13)
+   li  r10,0
+   std r10,PACA_EXGEN+EX_CFAR(r13)
+   std r10,PACA_EXGEN+EX_CTR(r13)
+BEGIN_FTR_SECTION
+   mfspr   r10,SPRN_PPR
+   std r10,PACA_EXGEN+EX_PPR(r13)
+END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
+
+   HMT_MEDIUM
+
  #ifdef CONFIG_RELOCATABLE
/*
 * Requires __LOAD_FAR_HANDLER beause kvmppc_hcall lives
diff --git a/arch/powerpc/kvm/book3s_64_entry.S 
b/arch/powerpc/kvm/book3s_64_entry.S
index 7a6b060ceed8..129d3f81800e 100644
--- a/arch/powerpc/kvm/book3s_64_entry.S
+++ b/arch/powerpc/kvm/book3s_64_entry.S
@@ -14,24 +14,9 @@
  .global   kvmppc_hcall
  .balign IFETCH_ALIGN_BYTES
  kvmppc_hcall:
-   /*
-* This is a hcall, so register convention is as
-* Documentation/powerpc/papr_hcalls.rst, with these additions:
-* R13  = PACA
-* guest R13 saved in SPRN_SCRATCH0
-* R10  = free
-*/
-BEGIN_FTR_SECTION
-   mfspr   r10,SPRN_PPR
-   std r10,HSTATE_PPR(r13)
-END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
-   HMT_MEDIUM
-   mfcrr10
-   std r12,HSTATE_SCRATCH0(r13)
-   sldir12,r10,32
-   ori r12,r12,0xc00
-   ld  r10,PACA_EXGEN+EX_R10(r13)
-   b   do_kvm_interrupt
+   ld  r10,PACA_EXGEN+EX_R13(r13)
+   SET_SCRATCH0(r10)
+   li  r10,0xc00
  
  .global	kvmppc_interrupt

  .balign IFETCH_ALIGN_BYTES
@@ -62,7 +47,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
ld  r10,EX_R10(r11)
ld  r11,EX_R11(r11)
  
-do_kvm_interrupt:

/*
 * Hcalls and other interrupts come here after normalising register
 * contents and save locations:



--
Alexey


Re: [PATCH v3 14/41] KVM: PPC: Book3S 64: move bad_host_intr check to HV handler

2021-03-20 Thread Alexey Kardashevskiy




On 06/03/2021 02:06, Nicholas Piggin wrote:

This is not used by PR KVM.

Signed-off-by: Nicholas Piggin 



Reviewed-by: Alexey Kardashevskiy 

a small note - it probably makes sense to move this before 09/41, as this
one removes what 09/41 added to book3s_64_entry.S. Thanks,




---
  arch/powerpc/kvm/book3s_64_entry.S  | 3 ---
  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 4 +++-
  arch/powerpc/kvm/book3s_segment.S   | 7 +++
  3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_entry.S 
b/arch/powerpc/kvm/book3s_64_entry.S
index d06e81842368..7a6b060ceed8 100644
--- a/arch/powerpc/kvm/book3s_64_entry.S
+++ b/arch/powerpc/kvm/book3s_64_entry.S
@@ -78,11 +78,8 @@ do_kvm_interrupt:
beq-.Lmaybe_skip
  .Lno_skip:
  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-   cmpwi   r9,KVM_GUEST_MODE_HOST_HV
-   beq kvmppc_bad_host_intr
  #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
cmpwi   r9,KVM_GUEST_MODE_GUEST
-   ld  r9,HSTATE_SCRATCH2(r13)
beq kvmppc_interrupt_pr
  #endif
b   kvmppc_interrupt_hv
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index f976efb7e4a9..75405ef53238 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1265,6 +1265,7 @@ hdec_soon:
  kvmppc_interrupt_hv:
/*
 * Register contents:
+* R9   = HSTATE_IN_GUEST
 * R12  = (guest CR << 32) | interrupt vector
 * R13  = PACA
 * guest R12 saved in shadow VCPU SCRATCH0
@@ -1272,6 +1273,8 @@ kvmppc_interrupt_hv:
 * guest R9 saved in HSTATE_SCRATCH2
 */
/* We're now back in the host but in guest MMU context */
+   cmpwi   r9,KVM_GUEST_MODE_HOST_HV
+   beq kvmppc_bad_host_intr
li  r9, KVM_GUEST_MODE_HOST_HV
stb r9, HSTATE_IN_GUEST(r13)
  
@@ -3272,7 +3275,6 @@ END_FTR_SECTION_IFCLR(CPU_FTR_P9_TM_HV_ASSIST)

   * cfar is saved in HSTATE_CFAR(r13)
   * ppr is saved in HSTATE_PPR(r13)
   */
-.global kvmppc_bad_host_intr
  kvmppc_bad_host_intr:
/*
 * Switch to the emergency stack, but start half-way down in
diff --git a/arch/powerpc/kvm/book3s_segment.S 
b/arch/powerpc/kvm/book3s_segment.S
index 1f492aa4c8d6..ef1d88b869bf 100644
--- a/arch/powerpc/kvm/book3s_segment.S
+++ b/arch/powerpc/kvm/book3s_segment.S
@@ -167,8 +167,15 @@ kvmppc_interrupt_pr:
 * R12 = (guest CR << 32) | exit handler id
 * R13 = PACA
 * HSTATE.SCRATCH0 = guest R12
+*
+* If HV is possible, additionally:
+* R9  = HSTATE_IN_GUEST
+* HSTATE.SCRATCH2 = guest R9
 */
  #ifdef CONFIG_PPC64
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+   ld  r9,HSTATE_SCRATCH2(r13)
+#endif
/* Match 32-bit entry */
rotldi  r12, r12, 32  /* Flip R12 halves for stw */
stw r12, HSTATE_SCRATCH1(r13) /* CR is now in the low half */



--
Alexey


Re: [PATCH v3 13/41] KVM: PPC: Book3S 64: Move interrupt early register setup to KVM

2021-03-20 Thread Alexey Kardashevskiy




On 06/03/2021 02:06, Nicholas Piggin wrote:

Like the earlier patch for hcalls, KVM interrupt entry requires a
different calling convention than the Linux interrupt handlers
set up. Move the code that converts from one to the other into KVM.

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/kernel/exceptions-64s.S | 131 +--
  arch/powerpc/kvm/book3s_64_entry.S   |  34 ++-
  2 files changed, 55 insertions(+), 110 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index b7092ba87da8..b4eab5084964 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -187,7 +187,6 @@ do_define_int n
.endif
  .endm
  
-#ifdef CONFIG_KVM_BOOK3S_64_HANDLER

  /*
   * All interrupts which set HSRR registers, as well as SRESET and MCE and
   * syscall when invoked with "sc 1" switch to MSR[HV]=1 (HVMODE) to be taken,
@@ -220,54 +219,25 @@ do_define_int n
   * to KVM to handle.
   */
  
-.macro KVMTEST name

+.macro KVMTEST name handler
+#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
lbz r10,HSTATE_IN_GUEST(r13)
cmpwi   r10,0
-   bne \name\()_kvm
-.endm
-
-.macro GEN_KVM name
-   .balign IFETCH_ALIGN_BYTES
-\name\()_kvm:
-
-BEGIN_FTR_SECTION
-   ld  r10,IAREA+EX_CFAR(r13)
-   std r10,HSTATE_CFAR(r13)
-END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
-
-   ld  r10,IAREA+EX_CTR(r13)
-   mtctr   r10
-BEGIN_FTR_SECTION
-   ld  r10,IAREA+EX_PPR(r13)
-   std r10,HSTATE_PPR(r13)
-END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
-   ld  r11,IAREA+EX_R11(r13)
-   ld  r12,IAREA+EX_R12(r13)
-   std r12,HSTATE_SCRATCH0(r13)
-   sldir12,r9,32
-   ld  r9,IAREA+EX_R9(r13)
-   ld  r10,IAREA+EX_R10(r13)
/* HSRR variants have the 0x2 bit added to their trap number */
.if IHSRR_IF_HVMODE
BEGIN_FTR_SECTION
-   ori r12,r12,(IVEC + 0x2)
+   li  r10,(IVEC + 0x2)
FTR_SECTION_ELSE
-   ori r12,r12,(IVEC)
+   li  r10,(IVEC)
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
.elseif IHSRR
-   ori r12,r12,(IVEC+ 0x2)
+   li  r10,(IVEC + 0x2)
.else
-   ori r12,r12,(IVEC)
+   li  r10,(IVEC)
.endif
-   b   kvmppc_interrupt
-.endm
-
-#else
-.macro KVMTEST name
-.endm
-.macro GEN_KVM name
-.endm
+   bne \handler
  #endif
+.endm
  
  /*

   * This is the BOOK3S interrupt entry code macro.
@@ -409,7 +379,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
  DEFINE_FIXED_SYMBOL(\name\()_common_real)
  \name\()_common_real:
.if IKVM_REAL
-   KVMTEST \name
+   KVMTEST \name kvm_interrupt
.endif
  
  	ld	r10,PACAKMSR(r13)	/* get MSR value for kernel */

@@ -432,7 +402,7 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real)
  DEFINE_FIXED_SYMBOL(\name\()_common_virt)
  \name\()_common_virt:
.if IKVM_VIRT
-   KVMTEST \name
+   KVMTEST \name kvm_interrupt
  1:
.endif
.endif /* IVIRT */
@@ -446,7 +416,7 @@ DEFINE_FIXED_SYMBOL(\name\()_common_virt)
  DEFINE_FIXED_SYMBOL(\name\()_common_real)
  \name\()_common_real:
.if IKVM_REAL
-   KVMTEST \name
+   KVMTEST \name kvm_interrupt
.endif
  .endm
  
@@ -967,8 +937,6 @@ EXC_COMMON_BEGIN(system_reset_common)

EXCEPTION_RESTORE_REGS
RFI_TO_USER_OR_KERNEL
  
-	GEN_KVM system_reset

-
  
  /**

   * Interrupt 0x200 - Machine Check Interrupt (MCE).
@@ -1132,7 +1100,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
/*
 * Check if we are coming from guest. If yes, then run the normal
 * exception handler which will take the
-* machine_check_kvm->kvmppc_interrupt branch to deliver the MC event
+* machine_check_kvm->kvm_interrupt branch to deliver the MC event
 * to guest.
 */
lbz r11,HSTATE_IN_GUEST(r13)
@@ -1203,8 +1171,6 @@ EXC_COMMON_BEGIN(machine_check_common)
bl  machine_check_exception
b   interrupt_return
  
-	GEN_KVM machine_check

-
  
  #ifdef CONFIG_PPC_P7_NAP

  /*
@@ -1339,8 +1305,6 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
REST_NVGPRS(r1)
b   interrupt_return
  
-	GEN_KVM data_access

-
  
  /**

   * Interrupt 0x380 - Data Segment Interrupt (DSLB).
@@ -1390,8 +1354,6 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
bl  do_bad_slb_fault
b   interrupt_return
  
-	GEN_KVM data_access_slb

-
  
  /**

   * Interrupt 0x400 - Instruction Storage Interrupt (ISI).
@@ -1428,8 +1390,6 @@ MMU_FTR_SECTION_ELSE
  ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
b   interrupt_return
  
-	GEN_KVM instruction_access

-
  
  /**

   * Interrupt 0x480 - Instruction Segment Interrupt (ISLB).
@@ -1474,8 +1434,6 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
bl  

Re: PowerPC64 future proof kernel toc, revised for lld

2021-03-11 Thread Alexey Kardashevskiy




On 12/03/2021 10:32, Michael Ellerman wrote:

Alan Modra  writes:

On Wed, Mar 10, 2021 at 01:44:57PM +0100, Christophe Leroy wrote:


Le 10/03/2021 à 13:25, Alan Modra a écrit :

On Wed, Mar 10, 2021 at 08:33:37PM +1100, Alexey Kardashevskiy wrote:

One more question - the older version had a construct "DEFINED (.TOC.) ?
.TOC. : ..." in case .TOC. is not defined (too old ld? too old gcc?) but the
newer patch seems to assume it is always defined - when was it added? I have
the same check in SLOF, for example, do I still need it?


.TOC. symbol support was first added 2012-11-06, so you need
binutils-2.24 or later to use .TOC. as a symbol.



As of today, the minimum requirement to build the kernel is binutils 2.23, see 
https://www.kernel.org/doc/html/latest/process/changes.html#current-minimal-requirements


Yes, and arch/powerpc/Makefile complains about 2.24.  So for powerpc
that means you need to go to at least 2.25.


Not quite. It only complains for little endian builds, and only if you
have stock 2.24; it will allow a patched 2.24.

I do most of my builds with 2.34, so I have no issue with newer
binutils. But we try not to increase the minimum version too rapidly to
accommodate folks using older and/or "Enterprise" distros that are stuck
on old toolchains.

I think we are within our rights to increase the minimum requirement for
powerpc builds, if it brings advantages we can identify.

The way to do that would be to add a new check in our arch Makefile that
rejects the older versions.


The upstream llvm just learnt to handle the .TOC. symbol in linker 
scripts so we may delay the future for a bit longer :) @dja wanted 
upstream llvm anyway and the currently supported llvm 10.xx is not of 
much value for our experiments.


https://github.com/llvm/llvm-project/commit/e4f385d89448393b4d213339bbaa42b49489



--
Alexey


Re: [PATCH V2 2/2] powerpc/perf: Add platform specific check_attr_config

2021-03-10 Thread Alexey Kardashevskiy




On 26/02/2021 17:50, Madhavan Srinivasan wrote:

Add platform specific attr.config value checks. Patch
includes checks for both power9 and power10.

Signed-off-by: Madhavan Srinivasan 
---
Changelog v1:
- No changes.

  arch/powerpc/perf/isa207-common.c | 41 +++
  arch/powerpc/perf/isa207-common.h |  2 ++
  arch/powerpc/perf/power10-pmu.c   | 13 ++
  arch/powerpc/perf/power9-pmu.c| 13 ++
  4 files changed, 69 insertions(+)

diff --git a/arch/powerpc/perf/isa207-common.c 
b/arch/powerpc/perf/isa207-common.c
index e4f577da33d8..b255799f5b51 100644
--- a/arch/powerpc/perf/isa207-common.c
+++ b/arch/powerpc/perf/isa207-common.c
@@ -694,3 +694,44 @@ int isa207_get_alternatives(u64 event, u64 alt[], int 
size, unsigned int flags,
  
  	return num_alt;

  }
+
+int isa3_X_check_attr_config(struct perf_event *ev)



"isa300" is used everywhere else to refer to ISA 3.00.



+{
+   u64 val, sample_mode;
+   u64 event = ev->attr.config;
+
+   val = (event >> EVENT_SAMPLE_SHIFT) & EVENT_SAMPLE_MASK;


I am not familiar with the code - "Raw event encoding for Power9" from 
arch/powerpc/perf/power9-pmu.c - where is this from? Is this how Linux 
defines the encoding, or is it from the P9 UM or something?



+   sample_mode = val & 0x3;
+
+   /*
+* MMCRA[61:62] is Random Sampling Mode (SM).
+* value of 0b11 is reserved.
+*/
+   if (sample_mode == 0x3)
+   return -1;
+
+   /*
+* Check for all reserved value
+*/
+   switch (val) {
+   case 0x5:
+   case 0x9:
+   case 0xD:
+   case 0x19:
+   case 0x1D:
+   case 0x1A:
+   case 0x1E:



What spec did these numbers come from?


+   return -1;
+   }
+
+   /*
+* MMCRA[48:51]/[52:55]) Threshold Start/Stop
+* Events Selection.
+* 0b11110000/0b00001111 is reserved.


The mapping between the event and MMCRA is very unclear :) But there are 
more reserved values in MMCRA in PowerISA_public.v3.0B.pdf:


===
[table excerpt garbled in archiving - the ISA lists further SM encodings
as: Reserved for problem state access (SPR 770); Reserved for privileged
access (SPR 770 or 786); and Implementation-dependent]
===

Do not you need to filter these too?


+*/
+   val = (event >> EVENT_THR_CTL_SHIFT) & EVENT_THR_CTL_MASK;
+   if (((val & 0xF0) == 0xF0) || ((val & 0xF) == 0xF))
+   return -1;


Since the filters may differ for problem and privileged state, maybe make 
these check_attr_config() hooks return -EINVAL or -EPERM and pass that on 
in the caller? Not sure there is much value in it though.
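
For illustration, a rough sketch of that alternative, assuming the 
existing EVENT_SAMPLE_* macros; sm_reserved_for_problem_state() and the 
caller are hypothetical names, not proposed code:

===
#include <linux/errno.h>
#include <linux/perf_event.h>

static int isa3XX_check_attr_config(struct perf_event *ev)
{
	u64 val = (ev->attr.config >> EVENT_SAMPLE_SHIFT) & EVENT_SAMPLE_MASK;

	if ((val & 0x3) == 0x3)
		return -EINVAL;	/* architecturally reserved encoding */

	if (sm_reserved_for_problem_state(val))	/* hypothetical helper */
		return -EPERM;	/* reserved depending on privilege */

	return 0;
}

/* the caller would then just propagate the error code to user space
 * instead of flattening everything to -1: */
static int power_check_event(struct perf_event *event)
{
	return ppmu->check_attr_config ? ppmu->check_attr_config(event) : 0;
}
===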




+
+   return 0;
+}
diff --git a/arch/powerpc/perf/isa207-common.h 
b/arch/powerpc/perf/isa207-common.h
index 1af0e8c97ac7..ae8eaf05efd1 100644
--- a/arch/powerpc/perf/isa207-common.h
+++ b/arch/powerpc/perf/isa207-common.h
@@ -280,4 +280,6 @@ void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, 
u32 flags,
struct pt_regs *regs);
  void isa207_get_mem_weight(u64 *weight);
  
+int isa3_X_check_attr_config(struct perf_event *ev);

+
  #endif
diff --git a/arch/powerpc/perf/power10-pmu.c b/arch/powerpc/perf/power10-pmu.c
index a901c1348cad..bc64354cab6a 100644
--- a/arch/powerpc/perf/power10-pmu.c
+++ b/arch/powerpc/perf/power10-pmu.c
@@ -106,6 +106,18 @@ static int power10_get_alternatives(u64 event, unsigned 
int flags, u64 alt[])
return num_alt;
  }
  
+static int power10_check_attr_config(struct perf_event *ev)

+{
+   u64 val;
+   u64 event = ev->attr.config;
+
+   val = (event >> EVENT_SAMPLE_SHIFT) & EVENT_SAMPLE_MASK;
+   if (val == 0x10 || isa3_X_check_attr_config(ev))
+   return -1;
+
+   return 0;
+}
+
  GENERIC_EVENT_ATTR(cpu-cycles,PM_RUN_CYC);
  GENERIC_EVENT_ATTR(instructions,  PM_RUN_INST_CMPL);
  GENERIC_EVENT_ATTR(branch-instructions,   PM_BR_CMPL);
@@ -559,6 +571,7 @@ static struct power_pmu power10_pmu = {
.attr_groups= power10_pmu_attr_groups,
.bhrb_nr= 32,
.capabilities   = PERF_PMU_CAP_EXTENDED_REGS,
+   .check_attr_config  = power10_check_attr_config,
  };
  
  int init_power10_pmu(void)

diff --git a/arch/powerpc/perf/power9-pmu.c b/arch/powerpc/perf/power9-pmu.c
index 2a57e93a79dc..b3b9b226d053 100644
--- a/arch/powerpc/perf/power9-pmu.c
+++ b/arch/powerpc/perf/power9-pmu.c
@@ -151,6 +151,18 @@ static int power9_get_alternatives(u64 event, unsigned int 
flags, u64 alt[])
return num_alt;
  }
  
+static int power9_check_attr_config(struct perf_event *ev)

+{
+   u64 val;
+   u64 event = ev->attr.config;
+
+   val = (event >> EVENT_SAMPLE_SHIFT) & EVENT_SAMPLE_MASK;
+   if (val == 0xC || isa3_X_check_attr_config(ev))
+   return -1;
+
+   return 0;
+}
+
  GENERIC_EVENT_ATTR(cpu-cycles,PM_CYC);
  GENERIC_EVENT_ATTR(stalled-cycles-frontend,   

Re: PowerPC64 future proof kernel toc, revised for lld

2021-03-10 Thread Alexey Kardashevskiy
One more question - the older version had a construct "DEFINED (.TOC.) ? 
.TOC. : ..." in case .TOC. is not defined (too old ld? too old gcc?) but 
the newer patch seems to assume it is always defined - when was it added? 
I have the same check in SLOF, for example, do I still need it?





On 10/03/2021 16:07, Alan Modra wrote:

On Wed, Mar 10, 2021 at 03:44:44PM +1100, Alexey Kardashevskiy wrote:

For my own education, is .got for prom_init.o still generated by ld or gcc?


.got is generated by ld.


In other words, should "objdump -D -s -j .got" ever dump .got for any .o
file, like below?


No.  "objdump -r prom_init.o | grep GOT" will tell you whether
prom_init.o *may* cause ld to generate .got entries.  (Linker
optimisations or --gc-sections might remove the need for those .got
entries.)


objdump: section '.got' mentioned in a -j option, but not found in any input
file


Right, expected.



--
Alexey Kardashevskiy
IBM OzLabs, LTC Team

e-mail: a...@linux.ibm.com


Re: PowerPC64 future proof kernel toc, revised for lld

2021-03-10 Thread Alexey Kardashevskiy




On 10/03/2021 14:48, Alan Modra wrote:

This patch future-proofs the kernel against linker changes that might
put the toc pointer at some location other than .got+0x8000, by
replacing __toc_start+0x8000 with .TOC. throughout.  If the kernel's
idea of the toc pointer doesn't agree with the linker, bad things
happen.



Works great with gcc (v8, v10), ld (2.23), clang-11, lld-11.




prom_init.c code relocating its toc is also changed so that a symbolic
__prom_init_toc_start toc-pointer relative address is calculated
rather than assuming that it is always at toc-pointer - 0x8000.  The
length calculations loading values from the toc are also avoided.
It's a little incestuous to do that with unreloc_toc picking up
adjusted values (which is fine in practice, they both adjust by the
same amount if all goes well).

I've also changed the way .got is aligned in vmlinux.lds and
zImage.lds, mostly so that dumping out section info by objdump or
readelf plainly shows the alignment is 256.  This linker script
feature was added 2005-09-27, available in FSF binutils releases from
2.17 onwards.  Should be safe to use in the kernel, I think.

Finally, put *(.got) before the prom_init.o entry which only needs
*(.toc), so that the GOT header goes in the correct place.  I don't
believe this makes any difference for the kernel as it would for
dynamic objects being loaded by ld.so.  That change is just to stop
lusers who blindly copy kernel scripts being led astray.  Of course,
this change needs the prom_init.c changes.

Some notes on .toc and .got.

.toc is a compiler generated section of addresses.  .got is a linker
generated section of addresses, generally built when the linker sees
R_*_*GOT* relocations.  In the case of powerpc64 ld.bfd, there are
multiple generated .got sections, one per input object file.  So you
can somewhat reasonably write in a linker script an input section
statement like *prom_init.o(.got .toc) to mean "the .got and .toc
section for files matching *prom_init.o".



For my own education, is .got for prom_init.o still generated by ld or gcc?

In other words, should "objdump -D -s -j .got" ever dump .got for any .o 
file, like below?


===
objdump -D -s -j .got 
~/pbuild/kernel-llvm-ld/arch/powerpc/kernel/prom_init.o 




/home/aik/pbuild/kernel-llvm-ld/arch/powerpc/kernel/prom_init.o: 
file format elf64-powerpcle 




objdump: section '.got' mentioned in a -j option, but not found in any 
input file

===



--
Alexey Kardashevskiy
IBM OzLabs, LTC Team

e-mail: a...@linux.ibm.com


[PATCH kernel v2] powerpc/iommu: Annotate nested lock for lockdep

2021-02-28 Thread Alexey Kardashevskiy
The IOMMU table is divided into pools for concurrent mappings and each
pool has a separate spinlock. When taking the ownership of an IOMMU group
to pass through a device to a VM, we lock these spinlocks which triggers
a false negative warning in lockdep (below).

This fixes it by annotating the large pool's spinlock as a nest lock
which makes lockdep not complain about locking nested locks if
the nest lock is locked already.

===
WARNING: possible recursive locking detected
5.11.0-le_syzkaller_a+fstn1 #100 Not tainted

qemu-system-ppc/4129 is trying to acquire lock:
c000119bddb0 (&(p->lock)/1){}-{2:2}, at: iommu_take_ownership+0xac/0x1e0

but task is already holding lock:
c000119bdd30 (&(p->lock)/1){}-{2:2}, at: iommu_take_ownership+0xac/0x1e0

other info that might help us debug this:
 Possible unsafe locking scenario:

   CPU0
   
  lock(&(p->lock)/1);
  lock(&(p->lock)/1);
===

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* fixed iommu_release_ownership() as well

---
 arch/powerpc/kernel/iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index c1a5c366a664..d0df3e5ff5e0 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1089,7 +1089,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
 
	spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
-   spin_lock(&tbl->pools[i].lock);
+   spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
 
iommu_table_release_pages(tbl);
 
@@ -1117,7 +1117,7 @@ void iommu_release_ownership(struct iommu_table *tbl)
 
	spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
-   spin_lock(&tbl->pools[i].lock);
+   spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
 
memset(tbl->it_map, 0, sz);
 
-- 
2.17.1



Re: [PATCH kernel] powerpc/iommu: Annotate nested lock for lockdep

2021-02-22 Thread Alexey Kardashevskiy




On 18/02/2021 23:59, Frederic Barrat wrote:



On 16/02/2021 04:20, Alexey Kardashevskiy wrote:

The IOMMU table is divided into pools for concurrent mappings and each
pool has a separate spinlock. When taking the ownership of an IOMMU group
to pass through a device to a VM, we lock these spinlocks which triggers
a false negative warning in lockdep (below).

This fixes it by annotating the large pool's spinlock as a nest lock.

===
WARNING: possible recursive locking detected
5.11.0-le_syzkaller_a+fstn1 #100 Not tainted

qemu-system-ppc/4129 is trying to acquire lock:
c000119bddb0 (&(p->lock)/1){}-{2:2}, at: 
iommu_take_ownership+0xac/0x1e0


but task is already holding lock:
c000119bdd30 (&(p->lock)/1){}-{2:2}, at: 
iommu_take_ownership+0xac/0x1e0


other info that might help us debug this:
  Possible unsafe locking scenario:

    CPU0
    
   lock(&(p->lock)/1);
   lock(&(p->lock)/1);
===

Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/kernel/iommu.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 557a09dd5b2f..2ee642a6731a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1089,7 +1089,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
  spin_lock_irqsave(&tbl->large_pool.lock, flags);
  for (i = 0; i < tbl->nr_pools; i++)
-    spin_lock(&tbl->pools[i].lock);
+    spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);



We have the same pattern and therefore should have the same problem in 
iommu_release_ownership().


But as I understand, we're hacking our way around lockdep here, since 
conceptually, those locks are independent. I was wondering why it seems 
to fix it by worrying only about the large pool lock.


This is the other way around - we are telling lockdep not to worry about 
the small pool locks if the nest lock (== the large pool lock) is held. 
The warning is printed when a recursive lock is detected and lockdep 
checks whether there is a nest lock for it in check_deadlock().
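
To illustrate the annotation, a minimal toy sketch (struct and function 
names invented, not from the patch), assuming only the generic spinlock 
API:

===
#include <linux/spinlock.h>

/* All locks in pool_lock[] share one lockdep class, so taking two of
 * them normally triggers "possible recursive locking detected".
 * Passing big_lock as the nest lock tells lockdep the nesting is
 * intentional. */
struct toy_table {
	spinlock_t big_lock;		/* the nest lock */
	spinlock_t pool_lock[4];	/* same class for all four */
};

static void toy_take_all(struct toy_table *t)
{
	unsigned long flags;
	int i;

	spin_lock_irqsave(&t->big_lock, flags);
	for (i = 0; i < 4; i++)
		/* fine for lockdep: nested under big_lock */
		spin_lock_nest_lock(&t->pool_lock[i], &t->big_lock);

	/* ... operate on all pools atomically ... */

	for (i = 3; i >= 0; i--)
		spin_unlock(&t->pool_lock[i]);
	spin_unlock_irqrestore(&t->big_lock, flags);
}
===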



That loop can take 
many locks (up to 4 with current config). However, if the dma window is 
less than 1GB, we would only have one, so it would make sense for 
lockdep to stop complaining.


Why would it stop if the large pool is always there?

Is it what happened? In which case, this 
patch doesn't really fix it. Or I'm missing something :-)


I tried with 1 or 2 small pools, no difference at all. I might be 
missing something here too :)





   Fred




  iommu_table_release_pages(tbl);



--
Alexey


Re: [PATCH kernel] powerpc/iommu: Annotate nested lock for lockdep

2021-02-22 Thread Alexey Kardashevskiy




On 20/02/2021 14:49, Alexey Kardashevskiy wrote:



On 18/02/2021 23:59, Frederic Barrat wrote:



On 16/02/2021 04:20, Alexey Kardashevskiy wrote:

The IOMMU table is divided into pools for concurrent mappings and each
pool has a separate spinlock. When taking the ownership of an IOMMU 
group

to pass through a device to a VM, we lock these spinlocks which triggers
a false negative warning in lockdep (below).

This fixes it by annotating the large pool's spinlock as a nest lock.

===
WARNING: possible recursive locking detected
5.11.0-le_syzkaller_a+fstn1 #100 Not tainted

qemu-system-ppc/4129 is trying to acquire lock:
c000119bddb0 (&(p->lock)/1){}-{2:2}, at: 
iommu_take_ownership+0xac/0x1e0


but task is already holding lock:
c000119bdd30 (&(p->lock)/1){}-{2:2}, at: 
iommu_take_ownership+0xac/0x1e0


other info that might help us debug this:
  Possible unsafe locking scenario:

    CPU0
    
   lock(&(p->lock)/1);
   lock(&(p->lock)/1);
===

Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/kernel/iommu.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 557a09dd5b2f..2ee642a6731a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1089,7 +1089,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
  spin_lock_irqsave(&tbl->large_pool.lock, flags);
  for (i = 0; i < tbl->nr_pools; i++)
-    spin_lock(&tbl->pools[i].lock);
+    spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);



We have the same pattern and therefore should have the same problem in 
iommu_release_ownership().


But as I understand, we're hacking our way around lockdep here, since 
conceptually, those locks are independent. I was wondering why it 
seems to fix it by worrying only about the large pool lock. That loop 
can take many locks (up to 4 with current config). However, if the dma 
window is less than 1GB, we would only have one, so it would make 
sense for lockdep to stop complaining. Is it what happened? In which 
case, this patch doesn't really fix it. Or I'm missing something :-)



My rough understanding is that when spin_lock_nest_lock is called the 
first time, it does some magic with lockdep classes somewhere in 
__lock_acquire()/register_lock_class() and right after that the nested 
lock is not the same as before - it is annotated so that we cannot lock 
nested locks without locking the nest lock first and no (re)annotation 
is needed. I'll try to poke this code once again and see; it was just 
easier with p9/nested which is gone for now because of a little snow in 
one of the southern states :)



Turns out I have good imagination and in fact it does print this huge 
warning in the release hook as well so v2 is coming. Thanks,








   Fred




  iommu_table_release_pages(tbl);





--
Alexey


Re: [PATCH kernel 2/2] powerpc/iommu: Do not immediately panic when failed IOMMU table allocation

2021-02-21 Thread Alexey Kardashevskiy




On 18/02/2021 06:32, Leonardo Bras wrote:

On Tue, 2021-02-16 at 14:33 +1100, Alexey Kardashevskiy wrote:

Most platforms allocate IOMMU table structures (specifically it_map)
at boot time, and when this fails it is a valid reason for panic().

However the powernv platform allocates it_map after a device is returned
to the host OS after being passed through and this happens long after
the host OS booted. It is quite possible to trigger the it_map allocation
panic() and kill the host even though it is not necessary - the host OS
can still use the DMA bypass mode (requires a tiny fraction of it_map's
memory) and even if that fails, the host OS is runnable as it was without
the device for which allocating it_map causes the panic.

Instead of immediately crashing in a powernv/ioda2 system, this prints
an error and continues. All other platforms still call panic().

Signed-off-by: Alexey Kardashevskiy 


Hello Alexey,

This looks like a good change, that passes panic() decision to platform
code. Everything looks pretty straightforward, but I have a question
regarding this:


@@ -1930,16 +1931,16 @@ static long pnv_pci_ioda2_setup_default_config(struct 
pnv_ioda_pe *pe)
    res_start = pe->phb->ioda.m32_pci_base >> tbl->it_page_shift;
    res_end = min(window_size, SZ_4G) >> tbl->it_page_shift;
    }
-   iommu_init_table(tbl, pe->phb->hose->node, res_start, res_end);
-   rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);

+   if (iommu_init_table(tbl, pe->phb->hose->node, res_start, res_end))
+   rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
+   else
+   rc = -ENOMEM;
    if (rc) {
-   pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
-   rc);
+   pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n", 
rc);
    iommu_tce_table_put(tbl);
-   return rc;
+   tbl = NULL; /* This clears iommu_table_base below */
    }
-
    if (!pnv_iommu_bypass_disabled)
    pnv_pci_ioda2_set_bypass(pe, true);
  



If I could understand correctly, previously if iommu_init_table() did
not panic(), and pnv_pci_ioda2_set_window() returned something other
than 0, it would return rc in the if (rc) clause, but now it does not
happen anymore, going through if (!pnv_iommu_bypass_disabled) onwards.

Is that desired?



Yes. A PE (==device, pretty much) has 2 DMA windows:
- the default one which requires some RAM to operate
- a bypass mode which tells the hardware that PCI addresses are 
statically mapped to RAM 1:1.


This bypass mode does not require extra memory to work and is used in 
most cases on bare metal as long as the device supports 64bit DMA, 
which is everything except GPUs. Since it is cheap to enable and this 
is what we prefer anyway, there is no urge to fail.




As far as I could see, returning rc there seems a good procedure after
iommu_init_table returning -ENOMEM.


This change is intentional and yes, it could be done in a separate patch 
but I figured there is not that much value in splitting.




--
Alexey


Re: [PATCH kernel] powerpc/iommu: Annotate nested lock for lockdep

2021-02-19 Thread Alexey Kardashevskiy




On 18/02/2021 23:59, Frederic Barrat wrote:



On 16/02/2021 04:20, Alexey Kardashevskiy wrote:

The IOMMU table is divided into pools for concurrent mappings and each
pool has a separate spinlock. When taking the ownership of an IOMMU group
to pass through a device to a VM, we lock these spinlocks which triggers
a false negative warning in lockdep (below).

This fixes it by annotating the large pool's spinlock as a nest lock.

===
WARNING: possible recursive locking detected
5.11.0-le_syzkaller_a+fstn1 #100 Not tainted

qemu-system-ppc/4129 is trying to acquire lock:
c000119bddb0 (&(p->lock)/1){}-{2:2}, at: 
iommu_take_ownership+0xac/0x1e0


but task is already holding lock:
c000119bdd30 (&(p->lock)/1){}-{2:2}, at: 
iommu_take_ownership+0xac/0x1e0


other info that might help us debug this:
  Possible unsafe locking scenario:

    CPU0
    
   lock(&(p->lock)/1);
   lock(&(p->lock)/1);
===

Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/kernel/iommu.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 557a09dd5b2f..2ee642a6731a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1089,7 +1089,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
  spin_lock_irqsave(&tbl->large_pool.lock, flags);
  for (i = 0; i < tbl->nr_pools; i++)
-    spin_lock(&tbl->pools[i].lock);
+    spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);



We have the same pattern and therefore should have the same problem in 
iommu_release_ownership().


But as I understand, we're hacking our way around lockdep here, since 
conceptually, those locks are independent. I was wondering why it seems 
to fix it by worrying only about the large pool lock. That loop can take 
many locks (up to 4 with current config). However, if the dma window is 
less than 1GB, we would only have one, so it would make sense for 
lockdep to stop complaining. Is it what happened? In which case, this 
patch doesn't really fix it. Or I'm missing something :-)



My rough understanding is that when spin_lock_nest_lock is called the 
first time, it does some magic with lockdep classes somewhere in 
__lock_acquire()/register_lock_class() and right after that the nested 
lock is not the same as before - it is annotated so that we cannot lock 
nested locks without locking the nest lock first and no (re)annotation 
is needed. I'll try to poke this code once again and see; it was just 
easier with p9/nested which is gone for now because of a little snow in 
one of the southern states :)





   Fred




  iommu_table_release_pages(tbl);



--
Alexey


[PATCH kernel 2/2] powerpc/iommu: Do not immediately panic when failed IOMMU table allocation

2021-02-15 Thread Alexey Kardashevskiy
Most platforms allocate IOMMU table structures (specifically it_map)
at boot time, and when this fails it is a valid reason for panic().

However the powernv platform allocates it_map after a device is returned
to the host OS after being passed through and this happens long after
the host OS booted. It is quite possible to trigger the it_map allocation
panic() and kill the host even though it is not necessary - the host OS
can still use the DMA bypass mode (requires a tiny fraction of it_map's
memory) and even if that fails, the host OS is runnable as it was without
the device for which allocating it_map causes the panic.

Instead of immediately crashing in a powernv/ioda2 system, this prints
an error and continues. All other platforms still call panic().

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/iommu.c   |  6 --
 arch/powerpc/platforms/cell/iommu.c   |  3 ++-
 arch/powerpc/platforms/pasemi/iommu.c |  4 +++-
 arch/powerpc/platforms/powernv/pci-ioda.c | 15 ---
 arch/powerpc/platforms/pseries/iommu.c| 10 +++---
 arch/powerpc/sysdev/dart_iommu.c  |  3 ++-
 6 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 8eb6eb0afa97..c1a5c366a664 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -728,8 +728,10 @@ struct iommu_table *iommu_init_table(struct iommu_table 
*tbl, int nid,
sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
 
tbl->it_map = vzalloc_node(sz, nid);
-   if (!tbl->it_map)
-   panic("iommu_init_table: Can't allocate %ld bytes\n", sz);
+   if (!tbl->it_map) {
+   pr_err("%s: Can't allocate %ld bytes\n", __func__, sz);
+   return NULL;
+   }
 
iommu_table_reserve_pages(tbl, res_start, res_end);
 
diff --git a/arch/powerpc/platforms/cell/iommu.c 
b/arch/powerpc/platforms/cell/iommu.c
index 2124831cf57c..fa08699aedeb 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -486,7 +486,8 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct 
device_node *np,
window->table.it_size = size >> window->table.it_page_shift;
	window->table.it_ops = &cell_iommu_ops;
 
-   iommu_init_table(&window->table, iommu->nid, 0, 0);
+   if (!iommu_init_table(&window->table, iommu->nid, 0, 0))
+   panic("Failed to initialize iommu table");
 
pr_debug("\tioid  %d\n", window->ioid);
pr_debug("\tblocksize %ld\n", window->table.it_blocksize);
diff --git a/arch/powerpc/platforms/pasemi/iommu.c 
b/arch/powerpc/platforms/pasemi/iommu.c
index b500a6e47e6b..5be7242fbd86 100644
--- a/arch/powerpc/platforms/pasemi/iommu.c
+++ b/arch/powerpc/platforms/pasemi/iommu.c
@@ -146,7 +146,9 @@ static void iommu_table_iobmap_setup(void)
 */
iommu_table_iobmap.it_blocksize = 4;
	iommu_table_iobmap.it_ops = &iommu_table_iobmap_ops;
-   iommu_init_table(&iommu_table_iobmap, 0, 0, 0);
+   if (!iommu_init_table(&iommu_table_iobmap, 0, 0, 0))
+   panic("Failed to initialize iommu table");
+
pr_debug(" <- %s\n", __func__);
 }
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index f0f901683a2f..66c3c3337334 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1762,7 +1762,8 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb 
*phb,
	tbl->it_ops = &pnv_ioda1_iommu_ops;
pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
-   iommu_init_table(tbl, phb->hose->node, 0, 0);
+   if (!iommu_init_table(tbl, phb->hose->node, 0, 0))
+   panic("Failed to initialize iommu table");
 
pe->dma_setup_done = true;
return;
@@ -1930,16 +1931,16 @@ static long pnv_pci_ioda2_setup_default_config(struct 
pnv_ioda_pe *pe)
res_start = pe->phb->ioda.m32_pci_base >> tbl->it_page_shift;
res_end = min(window_size, SZ_4G) >> tbl->it_page_shift;
}
-   iommu_init_table(tbl, pe->phb->hose->node, res_start, res_end);
 
-   rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
+   if (iommu_init_table(tbl, pe->phb->hose->node, res_start, res_end))
+   rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
+   else
+   rc = -ENOMEM;
if (rc) {
-   pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
-   rc);
+   pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n", 
rc);
i

[PATCH kernel 0/2] powerpc/iommu: Stop crashing the host when VM is terminated

2021-02-15 Thread Alexey Kardashevskiy
Killing a VM on a host under memory pressure can kill the host, which is
annoying. 1/2 reduces the chances, 2/2 eliminates panic() on
ioda2.


This is based on sha1
f40ddce88593 Linus Torvalds "Linux 5.11".

Please comment. Thanks.



Alexey Kardashevskiy (2):
  powerpc/iommu: Allocate it_map by vmalloc
  powerpc/iommu: Do not immediately panic when failed IOMMU table
allocation

 arch/powerpc/kernel/iommu.c   | 19 ++-
 arch/powerpc/platforms/cell/iommu.c   |  3 ++-
 arch/powerpc/platforms/pasemi/iommu.c |  4 +++-
 arch/powerpc/platforms/powernv/pci-ioda.c | 15 ---
 arch/powerpc/platforms/pseries/iommu.c| 10 +++---
 arch/powerpc/sysdev/dart_iommu.c  |  3 ++-
 6 files changed, 28 insertions(+), 26 deletions(-)

-- 
2.17.1



[PATCH kernel 1/2] powerpc/iommu: Allocate it_map by vmalloc

2021-02-15 Thread Alexey Kardashevskiy
The IOMMU table uses the it_map bitmap to keep track of allocated DMA
pages. This has always been a contiguous array allocated at either
the boot time or when a passed through device is returned to the host OS.
The it_map memory is allocated by alloc_pages() which allocates
contiguous physical memory.

Such allocation method occasionally creates a problem when there is
no big chunk of memory available (no free memory or too fragmented).
On powernv/ioda2 the default DMA window requires 16MB for it_map.

This replaces alloc_pages_node() with vzalloc_node() which allocates
a contiguous block but in virtual memory. This should reduce chances of
failure but should not cause other behavioral changes as it_map is only
used by the kernel's DMA hooks/api when the MMU is on.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/iommu.c | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index c00214a4355c..8eb6eb0afa97 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -719,7 +719,6 @@ struct iommu_table *iommu_init_table(struct iommu_table 
*tbl, int nid,
 {
unsigned long sz;
static int welcomed = 0;
-   struct page *page;
unsigned int i;
struct iommu_pool *p;
 
@@ -728,11 +727,9 @@ struct iommu_table *iommu_init_table(struct iommu_table 
*tbl, int nid,
/* number of bytes needed for the bitmap */
sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
 
-   page = alloc_pages_node(nid, GFP_KERNEL, get_order(sz));
-   if (!page)
+   tbl->it_map = vzalloc_node(sz, nid);
+   if (!tbl->it_map)
panic("iommu_init_table: Can't allocate %ld bytes\n", sz);
-   tbl->it_map = page_address(page);
-   memset(tbl->it_map, 0, sz);
 
iommu_table_reserve_pages(tbl, res_start, res_end);
 
@@ -774,8 +771,6 @@ struct iommu_table *iommu_init_table(struct iommu_table 
*tbl, int nid,
 
 static void iommu_table_free(struct kref *kref)
 {
-   unsigned long bitmap_sz;
-   unsigned int order;
struct iommu_table *tbl;
 
tbl = container_of(kref, struct iommu_table, it_kref);
@@ -796,12 +791,8 @@ static void iommu_table_free(struct kref *kref)
if (!bitmap_empty(tbl->it_map, tbl->it_size))
pr_warn("%s: Unexpected TCEs\n", __func__);
 
-   /* calculate bitmap size in bytes */
-   bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
-
/* free bitmap */
-   order = get_order(bitmap_sz);
-   free_pages((unsigned long) tbl->it_map, order);
+   vfree(tbl->it_map);
 
/* free table */
kfree(tbl);
-- 
2.17.1



[PATCH kernel] powerpc/iommu: Annotate nested lock for lockdep

2021-02-15 Thread Alexey Kardashevskiy
The IOMMU table is divided into pools for concurrent mappings and each
pool has a separate spinlock. When taking the ownership of an IOMMU group
to pass through a device to a VM, we lock these spinlocks which triggers
a false negative warning in lockdep (below).

This fixes it by annotating the large pool's spinlock as a nest lock.

===
WARNING: possible recursive locking detected
5.11.0-le_syzkaller_a+fstn1 #100 Not tainted

qemu-system-ppc/4129 is trying to acquire lock:
c000119bddb0 (&(p->lock)/1){}-{2:2}, at: iommu_take_ownership+0xac/0x1e0

but task is already holding lock:
c000119bdd30 (&(p->lock)/1){}-{2:2}, at: iommu_take_ownership+0xac/0x1e0

other info that might help us debug this:
 Possible unsafe locking scenario:

   CPU0
   
  lock(&(p->lock)/1);
  lock(&(p->lock)/1);
===

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 557a09dd5b2f..2ee642a6731a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1089,7 +1089,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
 
	spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
-   spin_lock(&tbl->pools[i].lock);
+   spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
 
iommu_table_release_pages(tbl);
 
-- 
2.17.1



Re: [PATCH kernel] powerpc/perf: Stop crashing with generic_compat_pmu

2021-02-15 Thread Alexey Kardashevskiy




On 03/12/2020 16:27, Madhavan Srinivasan wrote:


On 12/2/20 8:31 AM, Alexey Kardashevskiy wrote:

Hi Maddy,

I just noticed that I still have "powerpc/perf: Add checks for 
reserved values" in my pile (pushed here 
https://github.com/aik/linux/commit/61e1bc3f2e19d450e2e2d39174d422160b21957b 
), do we still need it? The lockups I saw were fixed by 
https://github.com/aik/linux/commit/17899eaf88d689 but it is hardly a 
replacement. Thanks,


sorry missed this. Will look at this again. Since we will need 
generation specific checks for the reserve field.



So any luck with this? Cheers,






Maddy




On 04/06/2020 02:34, Madhavan Srinivasan wrote:



On 6/2/20 8:26 AM, Alexey Kardashevskiy wrote:
The bhrb_filter_map ("The Branch History Rolling Buffer") callback is
only defined in raw CPUs' power_pmu structs. The "architected" CPUs use
generic_compat_pmu which does not have this callback and crashes occur.

This adds a NULL pointer check for bhrb_filter_map() which behaves as if
the callback returned an error.

This does not add the same check for config_bhrb() as the only caller
checks for cpuhw->bhrb_users which remains zero if bhrb_filter_map==0.


Changes looks fine.
Reviewed-by: Madhavan Srinivasan 

The commit be80e758d0c2e ('powerpc/perf: Add generic compat mode pmu 
driver')

which introduced generic_compat_pmu was merged in v5.2.  So we need to
CC stable starting from 5.2 :( .  My bad,  sorry.

Maddy


Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/perf/core-book3s.c | 19 ++-
  1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c 
b/arch/powerpc/perf/core-book3s.c

index 3dcfecf858f3..36870569bf9c 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1515,9 +1515,16 @@ static int power_pmu_add(struct perf_event 
*event, int ef_flags)

  ret = 0;
   out:
  if (has_branch_stack(event)) {
-    power_pmu_bhrb_enable(event);
-    cpuhw->bhrb_filter = ppmu->bhrb_filter_map(
-    event->attr.branch_sample_type);
+    u64 bhrb_filter = -1;
+
+    if (ppmu->bhrb_filter_map)
+    bhrb_filter = ppmu->bhrb_filter_map(
+    event->attr.branch_sample_type);
+
+    if (bhrb_filter != -1) {
+    cpuhw->bhrb_filter = bhrb_filter;
+    power_pmu_bhrb_enable(event); /* Does bhrb_users++ */
+    }
  }

  perf_pmu_enable(event->pmu);
@@ -1839,7 +1846,6 @@ static int power_pmu_event_init(struct 
perf_event *event)

  int n;
  int err;
  struct cpu_hw_events *cpuhw;
-    u64 bhrb_filter;

  if (!ppmu)
  return -ENOENT;
@@ -1945,7 +1951,10 @@ static int power_pmu_event_init(struct 
perf_event *event)

  err = power_check_constraints(cpuhw, events, cflags, n + 1);

  if (has_branch_stack(event)) {
-    bhrb_filter = ppmu->bhrb_filter_map(
+    u64 bhrb_filter = -1;
+
+    if (ppmu->bhrb_filter_map)
+    bhrb_filter = ppmu->bhrb_filter_map(
  event->attr.branch_sample_type);

  if (bhrb_filter == -1) {






--
Alexey


[PATCH kernel v3] powerpc/uaccess: Skip might_fault() when user access is enabled

2021-02-04 Thread Alexey Kardashevskiy
The amount of code executed with enabled user space access (unlocked KUAP)
should be minimal. However with CONFIG_PROVE_LOCKING or
CONFIG_DEBUG_ATOMIC_SLEEP enabled, might_fault() may end up replaying
interrupts which in turn may access the user space and forget to restore
the KUAP state.

The problem places are:
1. strncpy_from_user (and similar) which unlock KUAP and call
unsafe_get_user -> __get_user_allowed -> __get_user_nocheck()
with do_allow=false to skip KUAP as the caller took care of it.
2. __put_user_nocheck_goto() which is called with unlocked KUAP.

This changes __get_user_nocheck() to look at @do_allow to decide whether
to skip might_fault(). Since strncpy_from_user/etc call might_fault()
anyway before unlocking KUAP, there should be no visible change.

This drops might_fault() in __put_user_nocheck_goto() as it is only
called from unsafe_xxx helpers which manage KUAP themselves.

Since keeping might_fault() is still desirable, this adds those
to user_access_begin/read/write which is the last point where
we can safely do so.

Fixes: 334710b1496a ("powerpc/uaccess: Implement unsafe_put_user() using 'asm 
goto'")
Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v3:
* removed might_fault() from __put_user_nocheck_goto
* added might_fault() to user(_|_read_|_write_)access_begin

v2:
* s/!do_allow/do_allow/

---

Here is more detail about the issue:
https://lore.kernel.org/linuxppc-dev/20210203084503.gx6...@kitsune.suse.cz/T/
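
For context, a condensed sketch (not from the patch; simplified from the 
strncpy_from_user() pattern) of why the unsafe accessors must not call 
might_fault() - the caller has already opened the user access window:

===
#include <linux/uaccess.h>

static long copy_name(char *dst, const char __user *src, long count)
{
	long i;

	if (!user_read_access_begin(src, count))	/* KUAP unlocked here */
		return -EFAULT;

	for (i = 0; i < count; i++) {
		char c;

		/* must not might_fault(): replaying irqs here may
		 * clobber the KUAP state we just set up */
		unsafe_get_user(c, src + i, efault);
		dst[i] = c;
		if (!c)
			break;
	}

	user_read_access_end();				/* KUAP locked again */
	return i;

efault:
	user_read_access_end();
	return -EFAULT;
}
===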

Another example of the problem:

Kernel attempted to write user page (22c3) - exploit attempt? (uid: 0)
[ cut here ]
Bug: Write fault blocked by KUAP!
WARNING: CPU: 1 PID: 16712 at 
/home/aik/p/kernel-syzkaller/arch/powerpc/mm/fault.c:229 
__do_page_fault+0xca4/0xf10

NIP [c06ff804] filldir64+0x484/0x820
LR [c06ff7fc] filldir64+0x47c/0x820
--- interrupt: 300
[c000589f3b40] [c08131b0] proc_fill_cache+0xf0/0x2b0
[c000589f3c60] [c0814658] proc_pident_readdir+0x1f8/0x390
[c000589f3cc0] [c06fd8e8] iterate_dir+0x108/0x370
[c000589f3d20] [c06fe3d8] sys_getdents64+0xa8/0x410
[c000589f3db0] [c004b708] system_call_exception+0x178/0x2b0
[c000589f3e10] [c000e060] system_call_common+0xf0/0x27c
---
 arch/powerpc/include/asm/uaccess.h | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 501c9a79038c..a789601998d3 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -216,8 +216,6 @@ do {
\
 #define __put_user_nocheck_goto(x, ptr, size, label)   \
 do {   \
__typeof__(*(ptr)) __user *__pu_addr = (ptr);   \
-   if (!is_kernel_addr((unsigned long)__pu_addr))  \
-   might_fault();  \
__chk_user_ptr(ptr);\
__put_user_size_goto((x), __pu_addr, (size), label);\
 } while (0)
@@ -313,7 +311,7 @@ do {
\
__typeof__(size) __gu_size = (size);\
\
__chk_user_ptr(__gu_addr);  \
-   if (!is_kernel_addr((unsigned long)__gu_addr))  \
+   if (do_allow && !is_kernel_addr((unsigned long)__gu_addr)) \
might_fault();  \
barrier_nospec();   \
if (do_allow)   
\
@@ -508,6 +506,8 @@ static __must_check inline bool user_access_begin(const 
void __user *ptr, size_t
 {
if (unlikely(!access_ok(ptr, len)))
return false;
+   if (!is_kernel_addr((unsigned long)ptr))
+   might_fault();
allow_read_write_user((void __user *)ptr, ptr, len);
return true;
 }
@@ -521,6 +521,8 @@ user_read_access_begin(const void __user *ptr, size_t len)
 {
if (unlikely(!access_ok(ptr, len)))
return false;
+   if (!is_kernel_addr((unsigned long)ptr))
+   might_fault();
allow_read_from_user(ptr, len);
return true;
 }
@@ -532,6 +534,8 @@ user_write_access_begin(const void __user *ptr, size_t len)
 {
if (unlikely(!access_ok(ptr, len)))
return false;
+   if (!is_kernel_addr((unsigned long)ptr))
+   might_fault();
allow_write_to_user((void __user *)ptr, len);
return true;
 }
-- 
2.17.1



[PATCH kernel v2] powerpc/uaccess: Skip might_fault() when user access is enabled

2021-02-02 Thread Alexey Kardashevskiy
The amount of code executed with enabled user space access (unlocked KUAP)
should be minimal. However with CONFIG_PROVE_LOCKING or
CONFIG_DEBUG_ATOMIC_SLEEP enabled, might_fault() may end up replaying
interrupts which in turn may access the user space and forget to restore
the KUAP state.

The problem places are strncpy_from_user (and similar) which unlock KUAP
and call unsafe_get_user -> __get_user_allowed -> __get_user_nocheck()
with do_allow=false to skip KUAP as the caller took care of it.

This changes __get_user_nocheck() to look at @do_allow to decide whether
to skip might_fault(). Since strncpy_from_user/etc call might_fault()
anyway before unlocking KUAP, there should be no visible change.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* s/!do_allow/do_allow/
---
 arch/powerpc/include/asm/uaccess.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 501c9a79038c..27e109866c42 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -313,7 +313,7 @@ do {
\
__typeof__(size) __gu_size = (size);\
\
__chk_user_ptr(__gu_addr);  \
-   if (!is_kernel_addr((unsigned long)__gu_addr))  \
+   if (do_allow && !is_kernel_addr((unsigned long)__gu_addr)) \
might_fault();  \
barrier_nospec();   \
if (do_allow)   
\
-- 
2.17.1



Re: [PATCH kernel] powerpc/uaccess: Skip might_fault() when user access is enabled

2021-02-02 Thread Alexey Kardashevskiy




On 02/02/2021 20:14, Alexey Kardashevskiy wrote:

The amount of code executed with enabled user space access (unlocked KUAP)
should be minimal. However with CONFIG_PROVE_LOCKING or
CONFIG_DEBUG_ATOMIC_SLEEP enabled, might_fault() may end up replaying
interrupts which in turn may access the user space and forget to restore
the KUAP state.

The problem places are strncpy_from_user (and similar) which unlock KUAP
and call unsafe_get_user -> __get_user_allowed -> __get_user_nocheck()
with do_allow=false to skip KUAP as the caller took care of it.

This changes __get_user_nocheck() to look at @do_allow to decide whether
to skip might_fault(). Since strncpy_from_user/etc call might_fault()
anyway before unlocking KUAP, there should be no visible change.

Signed-off-by: Alexey Kardashevskiy 
---


This an attempt to fix that KUAP restore problem from
"powerpc/kuap: Restore AMR after replaying soft interrupts".



---
  arch/powerpc/include/asm/uaccess.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 501c9a79038c..cd6c0427a9e2 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -313,7 +313,7 @@ do {
\
__typeof__(size) __gu_size = (size);\
\
__chk_user_ptr(__gu_addr);  \
-   if (!is_kernel_addr((unsigned long)__gu_addr))  \
+   if (!do_allow && !is_kernel_addr((unsigned long)__gu_addr)) \



ah my bad, must be "if (do_allow..."



might_fault();  \
barrier_nospec();   \
if (do_allow)   
\



--
Alexey


[PATCH kernel] powerpc/kuap: Restore AMR after replaying soft interrupts

2021-02-02 Thread Alexey Kardashevskiy
Since de78a9c "powerpc: Add a framework for Kernel Userspace Access
Protection", user access helpers call user_{read|write}_access_{begin|end}
when user space access is allowed.

890274c "powerpc/64s: Implement KUAP for Radix MMU" made the mentioned
helpers program a AMR special register to allow such access for a short
period of time, most of the time AMR is expected to block user memory
access by the kernel.

Since the code accesses the user space memory, unsafe_get_user()
calls might_fault() which calls arch_local_irq_restore() if either
CONFIG_PROVE_LOCKING or CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
arch_local_irq_restore() then attempts to replay pending soft interrupts
as KUAP regions have hardware interrupts enabled.
If a pending interrupt happens to do user access (performance interrupts
do that), it enables access for a short period of time so after returning
from the replay, the user access state remains blocked and if a user page
fault happens - "Bug: Read fault blocked by AMR!" appears and SIGSEGV is
sent.

This saves/restores AMR when replaying interrupts.

This adds a check if AMR was not blocked before replaying interrupts.

Found by syzkaller. The call stack for the bug is:

copy_from_user_nofault+0xf8/0x250
perf_callchain_user_64+0x3d8/0x8d0
perf_callchain_user+0x38/0x50
get_perf_callchain+0x28c/0x300
perf_callchain+0xb0/0x130
perf_prepare_sample+0x364/0xbf0
perf_event_output_forward+0xe0/0x280
__perf_event_overflow+0xa4/0x240
perf_swevent_hrtimer+0x1d4/0x1f0
__hrtimer_run_queues+0x328/0x900
hrtimer_interrupt+0x128/0x350
timer_interrupt+0x180/0x600
replay_soft_interrupts+0x21c/0x4f0
arch_local_irq_restore+0x94/0x150
lock_is_held_type+0x140/0x200
___might_sleep+0x220/0x330
__might_fault+0x88/0x120
do_strncpy_from_user+0x108/0x2b0
strncpy_from_user+0x1d0/0x2a0
getname_flags+0x88/0x2c0
do_sys_openat2+0x2d4/0x5f0
do_sys_open+0xcc/0x140
system_call_exception+0x160/0x240
system_call_common+0xf0/0x27c

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: Nicholas Piggin 
---
Changes:
v3:
* do not block/unblock if AMR was blocked
* reverted move of AMR_KUAP_***
* added pr_warn

v2:
* fixed compile on hash
* moved get/set to arch_local_irq_restore
* block KUAP before replaying

---

This is an example:

[ cut here ]
Bug: Read fault blocked by AMR!
WARNING: CPU: 0 PID: 1603 at 
/home/aik/p/kernel/arch/powerpc/include/asm/book3s/64/kup-radix.h:145 
__do_page_fau

Modules linked in:
CPU: 0 PID: 1603 Comm: amr Not tainted 5.10.0-rc6_v5.10-rc6_a+fstn1 #24
NIP:  c009ece8 LR: c009ece4 CTR: 
REGS: cdc63560 TRAP: 0700   Not tainted  (5.10.0-rc6_v5.10-rc6_a+fstn1)
MSR:  80021033   CR: 28002888  XER: 2004
CFAR: c01fa928 IRQMASK: 1
GPR00: c009ece4 cdc637f0 c2397600 001f
GPR04: c20eb318  cdc63494 0027
GPR08: c0007fe4de68 cdfe9180  0001
GPR12: 2000 c30a  
GPR16:    bfff
GPR20:  c000134a4020 c19c2218 0fe0
GPR24:   cd106200 4000
GPR28:  0300 cdc63910 c1946730
NIP [c009ece8] __do_page_fault+0xb38/0xde0
LR [c009ece4] __do_page_fault+0xb34/0xde0
Call Trace:
[cdc637f0] [c009ece4] __do_page_fault+0xb34/0xde0 (unreliable)
[cdc638a0] [c000c968] handle_page_fault+0x10/0x2c
--- interrupt: 300 at strncpy_from_user+0x290/0x440
LR = strncpy_from_user+0x284/0x440
[cdc63ba0] [c0c3dcb0] strncpy_from_user+0x2f0/0x440 (unreliable)
[cdc63c30] [c068b888] getname_flags+0x88/0x2c0
[cdc63c90] [c0662a44] do_sys_openat2+0x2d4/0x5f0
[cdc63d30] [c066560c] do_sys_open+0xcc/0x140
[cdc63dc0] [c0045e10] system_call_exception+0x160/0x240
[cdc63e20] [c000da60] system_call_common+0xf0/0x27c
Instruction dump:
409c0048 3fe2ff5b 3bfff128 fac10060 fae10068 482f7a85 6000 3c62ff5b
7fe4fb78 3863f250 4815bbd9 6000 <0fe0> 3c62ff5b 3863f2b8 4815c8b5
irq event stamp: 254
hardirqs last  enabled at (253): [] 
arch_local_irq_restore+0xa0/0x150
hardirqs last disabled at (254): [] 
data_access_common_virt+0x1b0/0x1d0
softirqs last  enabled at (0): [] copy_process+0x78c/0x2120
softirqs last disabled at (0): [<>] 0x0
---[ end trace ba98aec5151f3aeb ]---
---
 arch/powerpc/kernel/irq.c | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index cc7a6271b6b4..592abc798826 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -269,6 +269,23 @@ void replay_soft_interrupts(void)
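
[The hunk is cut off in this archive. A rough sketch of the pattern the
changelog describes - save the KUAP state, warn and block if it was left
unblocked, replay, then restore; the helper name is made up:]

===
#include <asm/kup.h>
#include <linux/printk.h>

static void replay_soft_interrupts_with_kuap(void)
{
	unsigned long amr = get_kuap();	/* save current KUAP state */

	if (unlikely(amr != AMR_KUAP_BLOCKED)) {
		pr_warn_ratelimited("Replaying irqs with user access unblocked\n");
		set_kuap(AMR_KUAP_BLOCKED);	/* block for the replay */
	}

	replay_soft_interrupts();

	if (unlikely(amr != AMR_KUAP_BLOCKED))
		set_kuap(amr);			/* restore saved state */
}
===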
  

[PATCH kernel] powerpc/uaccess: Skip might_fault() when user access is enabled

2021-02-02 Thread Alexey Kardashevskiy
The amount of code executed with enabled user space access (unlocked KUAP)
should be minimal. However with CONFIG_PROVE_LOCKING or
CONFIG_DEBUG_ATOMIC_SLEEP enabled, might_fault() may end up replaying
interrupts which in turn may access the user space and forget to restore
the KUAP state.

The problem places are strncpy_from_user (and similar) which unlock KUAP
and call unsafe_get_user -> __get_user_allowed -> __get_user_nocheck()
with do_allow=false to skip KUAP as the caller took care of it.

This changes __get_user_nocheck() to look at @do_allow to decide whether
to skip might_fault(). Since strncpy_from_user/etc call might_fault()
anyway before unlocking KUAP, there should be no visible change.

Signed-off-by: Alexey Kardashevskiy 
---


This an attempt to fix that KUAP restore problem from
"powerpc/kuap: Restore AMR after replaying soft interrupts".



---
 arch/powerpc/include/asm/uaccess.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 501c9a79038c..cd6c0427a9e2 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -313,7 +313,7 @@ do {
\
__typeof__(size) __gu_size = (size);\
\
__chk_user_ptr(__gu_addr);  \
-   if (!is_kernel_addr((unsigned long)__gu_addr))  \
+   if (!do_allow && !is_kernel_addr((unsigned long)__gu_addr)) \
might_fault();  \
barrier_nospec();   \
if (do_allow)   
\
-- 
2.17.1



Re: [PATCH 6/6] powerpc/rtas: constrain user region allocation to RMA

2021-01-22 Thread Alexey Kardashevskiy




On 22/01/2021 02:27, Nathan Lynch wrote:

Michael Ellerman  writes:

Nathan Lynch  writes:

Alexey Kardashevskiy  writes:

On 16/01/2021 02:38, Nathan Lynch wrote:

Alexey Kardashevskiy  writes:

On 15/01/2021 09:00, Nathan Lynch wrote:

Memory locations passed as arguments from the OS to RTAS usually need
to be addressable in 32-bit mode and must reside in the Real Mode
Area. On PAPR guests, the RMA starts at logical address 0 and is the
first logical memory block reported in the LPAR’s device tree.

On powerpc targets with RTAS, Linux makes available to user space a
region of memory suitable for arguments to be passed to RTAS via
sys_rtas(). This region (rtas_rmo_buf) is allocated via the memblock
API during boot in order to ensure that it satisfies the requirements
described above.

With radix MMU, the upper limit supplied to the memblock allocation
can exceed the bounds of the first logical memory block, since
ppc64_rma_size is ULONG_MAX and RTAS_INSTANTIATE_MAX is 1GB. (512MB is
a common size of the first memory block according to a small sample of
LPARs I have checked.) This leads to failures when user space invokes
an RTAS function that uses a work area, such as
ibm,configure-connector.

Alter the determination of the upper limit for rtas_rmo_buf's
allocation to consult the device tree directly, ensuring placement
within the RMA regardless of the MMU in use.


Can we tie this with RTAS (which also needs to be in RMA) and simply add
an extra 64K in prom_instantiate_rtas() and advertise this address
(ALIGN_UP(rtas-base + rtas-size, PAGE_SIZE)) to the user space? We do
not need this RMO area before that point.


Can you explain more about what advantage that would bring? I'm not
seeing it. It's a more significant change than what I've written
here.



We already allocate space for RTAS and (like RMO) it needs to be in RMA,
and RMO is useless without RTAS. We can reuse RTAS allocation code for
RMO like this:


When you say RMO I assume you are referring to rtas_rmo_buf? (I don't
think it is well-named.)

...

RMO (Real mode offset) is the old term we used to use to refer to what
is now called the RMA (Real mode area). There are still many references
to RMO in Linux, but they almost certainly all refer to what we now call
the RMA.


Yes... but I think in this discussion Alexey was using RMO to stand in
for rtas_rmo_buf, which was what I was trying to clarify.



Correct. Thanks for the clarification, appreciated.




Maybe store it in the FDT as "linux,rmo-base" next to "linux,rtas-base",
for clarity, as sharing symbols between prom and the main kernel is a bit
tricky.

The benefit is that we do not do the same thing (== find 64K in RMA)
in 2 different ways, and if the RMO allocated my way is broken - we'll
know it much sooner as RTAS itself will break too.


Implementation details aside... I'll grant that combining the
allocations into one in prom_init reduces some duplication in the sense
that both are subject to the same constraints (mostly - the RTAS data
area must not cross a 256MB boundary, while the user region may). But
they really are distinct concerns. The RTAS private data area is
specified in the platform architecture, the OS is obligated to allocate
it and pass it to instantiate-rtas, etc etc. However the user region
(rtas_rmo_buf) is purely a Linux construct which is there to support
sys_rtas.

Now, there are multiple sites in the kernel proper that must allocate
memory suitable for passing to RTAS. Obviously there is value in
consolidating the logic for that purpose in one place, so I'll work on
adding that in v2. OK?


I don't think we want to move any allocations into prom_init.c unless we
have to.

It's best thought of as a trampoline, that runs before the kernel
proper, to transition from live OF to a flat DT environment. One thing
that must be done as part of that is instantiating RTAS, because it's
basically a runtime copy of the live OF. But any other allocs are for
Linux to handle later, IMHO.


Agreed.


Then the only comment I have left is: maybe use of_address_to_resource() 
+ resource_size() instead of of_n_addr_cells()/of_n_size_cells() (like 
pseries_memory_block_size()). And now I shut up :) Thanks,



--
Alexey


Re: [PATCH 5/6] powerpc/rtas: rename RTAS_RMOBUF_MAX to RTAS_USER_REGION_SIZE

2021-01-19 Thread Alexey Kardashevskiy




On 20/01/2021 12:17, Nathan Lynch wrote:

Alexey Kardashevskiy  writes:

On 16/01/2021 02:56, Nathan Lynch wrote:

Alexey Kardashevskiy  writes:

On 15/01/2021 09:00, Nathan Lynch wrote:

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 332e1000ca0f..1aa7ab1cbc84 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -19,8 +19,11 @@
#define RTAS_UNKNOWN_SERVICE (-1)
#define RTAS_INSTANTIATE_MAX (1ULL<<30) /* Don't instantiate rtas at/above 
this value */

-/* Buffer size for ppc_rtas system call. */

-#define RTAS_RMOBUF_MAX (64 * 1024)
+/* Work areas shared with RTAS must be 4K, naturally aligned. */


Why exactly 4K and not (for example) PAGE_SIZE?


4K is a platform requirement and isn't related to Linux's configured
page size. See the PAPR specification for RTAS functions such as
ibm,configure-connector, ibm,update-nodes, ibm,update-properties.


Good, since we are documenting things here - add to the comment ("per
PAPR")?


But almost every constant in this header relates to a specification or
requirement in PAPR.



Yup, "almost".




There are other calls with work area parameters where alignment isn't
specified (e.g. ibm,get-system-parameter) but 4KB alignment is a safe
choice for those.


+#define RTAS_WORK_AREA_SIZE   4096
+
+/* Work areas allocated for user space access. */
+#define RTAS_USER_REGION_SIZE (RTAS_WORK_AREA_SIZE * 16)


This is still 64K but there is no clarity why. There are 16 of something -
what is it?


There are 16 4KB work areas in the region. I can name it
RTAS_NR_USER_WORK_AREAS or similar.



Why 16? PAPR (then add "per PAPR") or we just like 16 ("should be
enough")?


PAPR doesn't know anything about the user region; it's a Linux
construct. It's been 64KB since pre-git days and I'm not sure what the
original reason is. At this point, maintaining a kernel-user ABI seems
like enough justification for the value.


I am not arguing against keeping the numbers, but you are replacing one 
magic number with another, and for neither is it obvious where they 
came from. Is 16 the max number of concurrently running sys_rtas system 
calls? Does the userspace ensure there are no more than 16? btw where is 
that userspace code? I thought 
https://github.com/power-ras/ppc64-diag.git but no. Thanks,




--
Alexey


Re: [PATCH 6/6] powerpc/rtas: constrain user region allocation to RMA

2021-01-19 Thread Alexey Kardashevskiy




On 20/01/2021 11:39, Nathan Lynch wrote:

Alexey Kardashevskiy  writes:

On 16/01/2021 02:38, Nathan Lynch wrote:

Alexey Kardashevskiy  writes:

On 15/01/2021 09:00, Nathan Lynch wrote:

Memory locations passed as arguments from the OS to RTAS usually need
to be addressable in 32-bit mode and must reside in the Real Mode
Area. On PAPR guests, the RMA starts at logical address 0 and is the
first logical memory block reported in the LPAR’s device tree.

On powerpc targets with RTAS, Linux makes available to user space a
region of memory suitable for arguments to be passed to RTAS via
sys_rtas(). This region (rtas_rmo_buf) is allocated via the memblock
API during boot in order to ensure that it satisfies the requirements
described above.

With radix MMU, the upper limit supplied to the memblock allocation
can exceed the bounds of the first logical memory block, since
ppc64_rma_size is ULONG_MAX and RTAS_INSTANTIATE_MAX is 1GB. (512MB is
a common size of the first memory block according to a small sample of
LPARs I have checked.) This leads to failures when user space invokes
an RTAS function that uses a work area, such as
ibm,configure-connector.

Alter the determination of the upper limit for rtas_rmo_buf's
allocation to consult the device tree directly, ensuring placement
within the RMA regardless of the MMU in use.
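
For reference, a simplified sketch of the status quo being fixed (the
values in comments are from the description above; this is not verbatim
kernel code):

/* Under radix the RMA size is effectively unbounded, so the limit
 * degenerates to RTAS_INSTANTIATE_MAX even when the first memory
 * block is smaller. */
phys_addr_t limit = min_t(phys_addr_t,
			  ppc64_rma_size,	   /* ULONG_MAX under radix */
			  RTAS_INSTANTIATE_MAX);   /* 1ULL << 30, i.e. 1GB */

/* With limit == 1GB, memblock may place the buffer above a 512MB
 * first memory block, i.e. outside the RMA: */
rtas_rmo_buf = memblock_phys_alloc_range(RTAS_RMOBUF_MAX, PAGE_SIZE,
					 0, limit);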


Can we tie this with RTAS (which also needs to be in the RMA) and simply add
an extra 64K in prom_instantiate_rtas() and advertise this address
(ALIGN_UP(rtas-base + rtas-size, PAGE_SIZE)) to the user space? We do
not need this RMO area before that point.


Can you explain more about what advantage that would bring? I'm not
seeing it. It's a more significant change than what I've written
here.



We already allocate space for RTAS and (like RMO) it needs to be in RMA,
and RMO is useless without RTAS. We can reuse RTAS allocation code for
RMO like this:


When you say RMO I assume you are referring to rtas_rmo_buf? (I don't
think it is well-named.)



===
diff --git a/arch/powerpc/kernel/prom_init.c
b/arch/powerpc/kernel/prom_init.c
index e9d4eb6144e1..d9527d3e01d2 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1821,7 +1821,8 @@ static void __init prom_instantiate_rtas(void)
  if (size == 0)
  return;

-   base = alloc_down(size, PAGE_SIZE, 0);
+   /* One page for RTAS, one for RMO */


One page for RTAS? RTAS is ~20MB on LPARs I've checked:

# lsprop /proc/device-tree/rtas/{rtas-size,linux,rtas-base}
/proc/device-tree/rtas/rtas-size
 01370000 (20381696)


You are right, I did not sleep well when I replied, sorry about that :) I 
tried it with KVM, where RTAS is just a few KBs (20 constant bytes + MCE 
log, depending on the number of CPUs), so it worked for me.






+   base = alloc_down(size, PAGE_SIZE + PAGE_SIZE, 0);


This changes the alignment but not the size of the allocation.



Should be:

base = alloc_down(ALIGN_UP(size, PAGE_SIZE) + PAGE_SIZE, PAGE_SIZE, 0);





  if (base == 0)
  prom_panic("Could not allocate memory for RTAS\n");

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index d126d71ea5bd..885d95cf4ed3 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1186,6 +1186,7 @@ void __init rtas_initialize(void)
  rtas.size = size;
  no_entry = of_property_read_u32(rtas.dev, "linux,rtas-entry",
&entry);
  rtas.entry = no_entry ? rtas.base : entry;
+   rtas_rmo_buf = rtas.base + PAGE_SIZE;


I think this would overlay the user region on top of the RTAS private
data area, allowing user space to corrupt it.



Right, my bad. Should be:

rtas_rmo_buf = ALIGN_UP(rtas.base + rtas.size, PAGE_SIZE);
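
Spelled out, the layout these corrections aim for (sketch):

/*
 * rtas.base                                    RTAS private data,
 *                                              rtas.size bytes
 * ALIGN_UP(rtas.base + rtas.size, PAGE_SIZE)   user region
 *                                              (rtas_rmo_buf, one page)
 */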






  /* If RTAS was found, allocate the RMO buffer for it and look for
   * the stop-self token if any
@@ -1196,11 +1197,6 @@ void __init rtas_initialize(void)
  ibm_suspend_me_token = rtas_token("ibm,suspend-me");
  }
   #endif
-	rtas_rmo_buf = memblock_phys_alloc_range(RTAS_RMOBUF_MAX, PAGE_SIZE,
-						 0, rtas_region);
-	if (!rtas_rmo_buf)
-		panic("ERROR: RTAS: Failed to allocate %lx bytes below %pa\n",
-		      PAGE_SIZE, &rtas_region);
===

Maybe store it in the FDT as "linux,rmo-base" next to "linux,rtas-base",
for clarity, as sharing symbols between prom and the main kernel is a bit
tricky.

The benefit is that we do not do the same thing (== find 64K in the RMA)
in two different ways, and if the RMO allocated my way is broken, we'll
know it much sooner, as RTAS itself will break too.


Implementation details aside... I'll grant that combining the
allocations into one in prom_init reduces some duplication in the sense
that both are subject to the same constraints (mostly - the RTAS data
area must not cross a 256MB boundary, while the user region may). 

Re: [PATCH 5/6] powerpc/rtas: rename RTAS_RMOBUF_MAX to RTAS_USER_REGION_SIZE

2021-01-17 Thread Alexey Kardashevskiy




On 16/01/2021 02:56, Nathan Lynch wrote:

Alexey Kardashevskiy  writes:

On 15/01/2021 09:00, Nathan Lynch wrote:

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 332e1000ca0f..1aa7ab1cbc84 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -19,8 +19,11 @@
   #define RTAS_UNKNOWN_SERVICE (-1)
   #define RTAS_INSTANTIATE_MAX (1ULL<<30) /* Don't instantiate rtas at/above this value */
   
-/* Buffer size for ppc_rtas system call. */
-#define RTAS_RMOBUF_MAX (64 * 1024)
+/* Work areas shared with RTAS must be 4K, naturally aligned. */


Why exactly 4K and not (for example) PAGE_SIZE?


4K is a platform requirement and isn't related to Linux's configured
page size. See the PAPR specification for RTAS functions such as
ibm,configure-connector, ibm,update-nodes, ibm,update-properties.


Good, since we are documenting things here - add to the comment ("per 
PAPR")?




There are other calls with work area parameters where alignment isn't
specified (e.g. ibm,get-system-parameter) but 4KB alignment is a safe
choice for those.


+#define RTAS_WORK_AREA_SIZE   4096
+
+/* Work areas allocated for user space access. */
+#define RTAS_USER_REGION_SIZE (RTAS_WORK_AREA_SIZE * 16)


This is still 64K but there is no clarity why. There are 16 of something;
what is it?


There are 16 4KB work areas in the region. I can name it
RTAS_NR_USER_WORK_AREAS or similar.



Why 16? PAPR (then add "per PAPR") or we just like 16 ("should be enough")?


--
Alexey

