Re: [PATCH v4 kvmtool 00/12] arm64: Allow the user to set RAM base address

2022-06-21 Thread Suzuki K Poulose

On 16/06/2022 14:48, Alexandru Elisei wrote:

The series can be found at [1]. It is loosely based on the patches that
allow the user to define the VM memory layout (RAM + MMIO) [2]. I've
cherry-picked a handful of patches from that series, the rest I wrote from
scratch since there have been several changes to the way guest memory is
handled. I've chosen to focus on specifying the RAM layout with only one
RAM bank and leave the rest for a later series because this was relatively
easy to accomplish, while still being very useful.

What this series does: for arm64, the user can now specify the base address
for RAM:

$ ./lkvm run -m1G@2G .. # Equivalent to ./lkvm run -m1024

The memory units are B (bytes), K (kilobytes), M (megabytes), G
(gigabytes), T (terabytes), P (petabytes). Lowercase is also valid.
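As an illustration of the size-specifier handling (a minimal sketch only;
parse_mem_size and its exact behaviour are assumptions, not kvmtool's actual
parser):

#include <ctype.h>
#include <stdlib.h>

/* Sketch: turn "1G", "512m", "2t", ... into a byte count; 0 on error. */
static unsigned long long parse_mem_size(const char *str)
{
	char *end;
	unsigned long long bytes = strtoull(str, &end, 10);

	switch (toupper((unsigned char)*end)) {
	case 'P': bytes <<= 10;	/* fall through */
	case 'T': bytes <<= 10;	/* fall through */
	case 'G': bytes <<= 10;	/* fall through */
	case 'M': bytes <<= 10;	/* fall through */
	case 'K': bytes <<= 10;	/* fall through */
	case 'B':
	case '\0':
		break;
	default:
		return 0;	/* unknown suffix */
	}
	return bytes;
}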

Want to put RAM at the top of the physical address range? Easy:

$ ./lkvm run -m2G@1022G .. # Assumes the maximum is 40 bits of IPA

There is one limitation on the RAM base address: it must not overlap with the
MMIO range that kvmtool uses for arm/arm64, which lives below 2GB.
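A rough sketch of the overlap check implied here (the constant and function
names are made up; kvmtool's real layout values differ):

#include <stdbool.h>
#include <stdint.h>

#define ARM_MMIO_CEILING	(2ULL << 30)	/* assumption: MMIO window ends at 2GB */

/* Sketch: RAM must sit above the MMIO window and fit inside the IPA space. */
static bool ram_layout_is_valid(uint64_t ram_base, uint64_t ram_size,
				uint64_t ipa_limit)
{
	if (ram_base < ARM_MMIO_CEILING)
		return false;	/* overlaps the arm/arm64 MMIO range */
	if (ram_size > ipa_limit || ram_base > ipa_limit - ram_size)
		return false;	/* past the maximum IPA */
	return true;
}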

Why this is useful, in my opinion:

1. Testing how a payload handles different memory layouts without the need
to hack kvmtool or find the hardware that implements the desired layout.

2. It can serve as a development tool for adding support for larger PA
ranges for Linux and KVM (currently capped at 48 bits for 4k/16k pages), or
other payloads.

Summary of the series
=====================

* The series starts with refactoring how kvm->cfg.ram_size is validated
   and used, followed by several cleanups in the arm and arm64 code.

* Then patch #8 ("builtin_run: Allow standard size specifiers for memory")
   introduces the ability to specify the measurement unit for memory. I
   believe that typing the equivalent of 2TB in megabytes isn't appealing
   to anyone.

* More cleanups in the arm/arm64 code follow, which are needed for patch
   #12 ("arm64: Allow the user to specify the RAM base address"). This is
   where the ability to specify the RAM base address is introduced.

Testing
=======

Same testing as before:

- Build tested each patch for all architectures.

- Ran an x86 kernel with and without setting the amount of RAM using the
   memory specifiers; tested that setting the RAM address results in an
   error.

- Ran an arm64 kernel without setting the size, with setting the size and
   with setting the size and address; tried different addresses (2G, 3G,
   256G); also tested that going below 2G or above the maximum IPA correctly
   results in an error.

- Ran all arm64 kvm-unit-test tests with similar combinations of memory
   size and address (instead of 256G I used 128G, as that's where I/O lives
   for qemu and kvm-unit-tests maps that unconditionally as I/O).

- Ran all 32bit arm tests on an arm64 host with various combinations of
   memory size and address (base address at 2G and 2.5G only due to a
   limitation in the way the tests are set up).


I have tested this series on arm64 Fast model, with memory placed from
32bit to 48bit IPA and it works well.

For the series:

Reviewed-and-Tested-by: Suzuki K Poulose 


Re: [PATCH v11 03/40] arm64: cpufeature: Always specify and use a field width for capabilities

2022-02-10 Thread Suzuki K Poulose

On 07/02/2022 15:20, Mark Brown wrote:

Since all the fields in the main ID registers are 4 bits wide we have up
until now not bothered specifying the width in the code. Since we now
wish to use this mechanism to enumerate features from the floating point
feature registers which do not follow this pattern add a width to the
table.  This means updating all the existing table entries but makes it
less likely that we run into issues in future due to implicitly assuming
a 4 bit width.

Signed-off-by: Mark Brown 
Cc: Suzuki K Poulose 


Reviewed-by: Suzuki K Poulose 



Re: [PATCH v8 01/38] arm64: cpufeature: Always specify and use a field width for capabilities

2022-01-25 Thread Suzuki K Poulose

On 25/01/2022 00:10, Mark Brown wrote:

Since all the fields in the main ID registers are 4 bits wide we have up
until now not bothered specifying the width in the code. Since we now
wish to use this mechanism to enumerate features from the floating point
feature registers which do not follow this pattern add a width to the
table.  This means updating all the existing table entries but makes it
less likely that we run into issues in future due to implicitly assuming
a 4 bit width.

Signed-off-by: Mark Brown 
Cc: Suzuki K Poulose 
---
  arch/arm64/include/asm/cpufeature.h |   1 +
  arch/arm64/kernel/cpufeature.c  | 167 +---
  2 files changed, 102 insertions(+), 66 deletions(-)

diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index ef6be92b1921..2728abd9cae4 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -356,6 +356,7 @@ struct arm64_cpu_capabilities {
struct {/* Feature register checking */
u32 sys_reg;
u8 field_pos;
+   u8 field_width;
u8 min_field_value;
u8 hwcap_type;
bool sign;




diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index a46ab3b1c4d5..d9f09e40aaf6 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1307,7 +1307,9 @@ u64 __read_sysreg_by_encoding(u32 sys_id)
  static bool
  feature_matches(u64 reg, const struct arm64_cpu_capabilities *entry)
  {
-	int val = cpuid_feature_extract_field(reg, entry->field_pos, entry->sign);
+	int val = cpuid_feature_extract_field_width(reg, entry->field_pos,
+						    entry->field_width,
+						    entry->sign);
  


Could we do something like :

+	int val = cpuid_feature_extract_field_width(reg, entry->field_pos,
+						    entry->field_width ? : 4,
+						    ..
+						    );

and leave the existing structures as they are ?


return val >= entry->min_field_value;
  }
@@ -1952,6 +1954,7 @@ static const struct arm64_cpu_capabilities arm64_features[] = {


There is arm64_errata[] array in cpu_errata.c. So, if we use the above
proposal we could leave things unchanged for all existing uses.


.matches = has_cpuid_feature,
.sys_reg = SYS_ID_AA64MMFR0_EL1,
.field_pos = ID_AA64MMFR0_ECV_SHIFT,
+   .field_width = 4,
.sign = FTR_UNSIGNED,
.min_field_value = 1,
},

...


-#define HWCAP_CPUID_MATCH(reg, field, s, min_value)			\
+#define HWCAP_CPUID_MATCH(reg, field, width, s, min_value)		\
	.matches = has_cpuid_feature,					\
	.sys_reg = reg,							\
	.field_pos = field,						\
+	.field_width = width,						\
	.sign = s,							\
	.min_field_value = min_value,


And that could avoid these changes too. We could add :

#define HWCAP_CPUID_MATCH_WIDTH(...)

when/if we need one, which sets the width.
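Something like the following, as an untested sketch of that idea (the name and
layout are only a suggestion, not part of the patch):

/* Width-aware variant; the common 4-bit case keeps using HWCAP_CPUID_MATCH(). */
#define HWCAP_CPUID_MATCH_WIDTH(reg, field, width, s, min_value)	\
	HWCAP_CPUID_MATCH(reg, field, s, min_value)			\
	.field_width = width,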

Cheers
Suzuki


Re: [RFC PATCH v4 02/39] KVM: arm64: Add lock/unlock memslot user API

2021-10-18 Thread Suzuki K Poulose

On 25/08/2021 17:17, Alexandru Elisei wrote:

Stage 2 faults triggered by the profiling buffer attempting to write to
memory are reported by the SPE hardware by asserting a buffer management
event interrupt. Interrupts are by their nature asynchronous, which means
that the guest might have changed its stage 1 translation tables since the
attempted write. SPE reports the guest virtual address that caused the data
abort, not the IPA, which means that KVM would have to walk the guest's
stage 1 tables to find the IPA. Using the AT instruction to walk the
guest's tables in hardware is not an option because it doesn't report the
IPA in the case of a stage 2 fault on a stage 1 table walk.

Avoid both issues by pre-mapping the guest memory at stage 2. This is being
done by adding a capability that allows the user to pin the memory backing
a memslot. The same capability can be used to unlock a memslot, which
unpins the pages associated with the memslot, but doesn't unmap the IPA
range from stage 2; in this case, the addresses will be unmapped from stage
2 via the MMU notifiers when the process' address space changes.

For now, the capability doesn't actually do anything other than checking
that the usage is correct; the memory operations will be added in future
patches.

Signed-off-by: Alexandru Elisei 
---
  Documentation/virt/kvm/api.rst   | 56 +++
  arch/arm64/include/asm/kvm_mmu.h |  3 ++
  arch/arm64/kvm/arm.c | 42 --
  arch/arm64/kvm/mmu.c | 76 
  include/uapi/linux/kvm.h |  8 
  5 files changed, 181 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index dae68e68ca23..741327ef06b0 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6682,6 +6682,62 @@ MAP_SHARED mmap will result in an -EINVAL return.
  When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
  perform a bulk copy of tags to/from the guest.
  
+7.29 KVM_CAP_ARM_LOCK_USER_MEMORY_REGION
+-----------------------------------------
+
+:Architectures: arm64
+:Target: VM
+:Parameters: flags is one of KVM_ARM_LOCK_USER_MEMORY_REGION_FLAGS_LOCK or
+ KVM_ARM_LOCK_USER_MEMORY_REGION_FLAGS_UNLOCK
+ args[0] is the slot number
+ args[1] specifies the permissions when the memslot is locked or if
+ all memslots should be unlocked
+
+The presence of this capability indicates that KVM supports locking the memory
+associated with the memslot, and unlocking a previously locked memslot.
+
+The 'flags' parameter is defined as follows:
+
+7.29.1 KVM_ARM_LOCK_USER_MEMORY_REGION_FLAGS_LOCK
+--------------------------------------------------
+
+:Capability: 'flags' parameter to KVM_CAP_ARM_LOCK_USER_MEMORY_REGION
+:Architectures: arm64
+:Target: VM
+:Parameters: args[0] contains the memory slot number
+ args[1] contains the permissions for the locked memory:
+ KVM_ARM_LOCK_MEMORY_READ (mandatory) to map it with
+ read permissions and KVM_ARM_LOCK_MEMORY_WRITE
+ (optional) with write permissions
+:Returns: 0 on success; negative error code on failure
+
+Enabling this capability causes the memory described by the memslot to be
+pinned in the process address space and the corresponding stage 2 IPA range
+mapped at stage 2. The permissions specified in args[1] apply to both
+mappings. The memory pinned with this capability counts towards the max
+locked memory limit for the current process.
+
+The capability must be enabled before any VCPUs have run. The virtual memory
+range described by the memslot must be mapped in the userspace process without
+any gaps. It is considered an error if write permissions are specified for a
+memslot which logs dirty pages.
+
+7.29.2 KVM_ARM_LOCK_USER_MEMORY_REGION_FLAGS_UNLOCK
+----------------------------------------------------
+
+:Capability: 'flags' parameter to KVM_CAP_ARM_LOCK_USER_MEMORY_REGION
+:Architectures: arm64
+:Target: VM
+:Parameters: args[0] contains the memory slot number
+ args[1] optionally contains the flag KVM_ARM_UNLOCK_MEM_ALL,
+ which unlocks all previously locked memslots.
+:Returns: 0 on success; negative error code on failure
+
+Enabling this capability causes the memory pinned when locking the memslot
+specified in args[0] to be unpinned, or, optionally, the memory associated
+with all locked memslots, to be unpinned. The IPA range is not unmapped
+from stage 2.
+
  8. Other capabilities.
  ======================
  
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h

index b52c5c4b9a3d..ef079b5eb475 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -216,6 +216,9 @@ static inline void __invalidate_icache_guest_page(void *va, 
size_t size)
  void kvm_set_way_flush(struct 
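For illustration, a rough userspace sketch of how a VMM might enable the
proposed capability on one memslot. The cap and flag names are taken from the
RFC documentation above and do not exist upstream; treat this as an assumed
shape of the eventual API, not working code:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: lock one memslot with read permissions. */
static int lock_memslot(int vm_fd, __u32 slot)
{
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_ARM_LOCK_USER_MEMORY_REGION;
	cap.flags = KVM_ARM_LOCK_USER_MEMORY_REGION_FLAGS_LOCK;
	cap.args[0] = slot;			/* memory slot number */
	cap.args[1] = KVM_ARM_LOCK_MEMORY_READ;	/* add ..._WRITE for a writable mapping */

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}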

Re: [PATCH] KVM: arm64: Allow KVM to be disabled from the command line

2021-10-01 Thread Suzuki K Poulose

On 30/09/2021 11:29, Marc Zyngier wrote:

On Wed, 29 Sep 2021 11:35:46 +0100,
Suzuki K Poulose  wrote:


On 03/09/2021 10:16, Marc Zyngier wrote:

Although KVM can be compiled out of the kernel, it cannot be disabled
at runtime. Allow this possibility by introducing a new mode that
will prevent KVM from initialising.

This is useful in the (limited) circumstances where you don't want
KVM to be available (what is wrong with you?), or when you want
to install another hypervisor instead (good luck with that).

Signed-off-by: Marc Zyngier 
---
   Documentation/admin-guide/kernel-parameters.txt |  3 +++
   arch/arm64/include/asm/kvm_host.h   |  1 +
   arch/arm64/kernel/idreg-override.c  |  1 +
   arch/arm64/kvm/arm.c| 14 +-
   4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 91ba391f9b32..cc5f68846434 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2365,6 +2365,9 @@
kvm-arm.mode=
[KVM,ARM] Select one of KVM/arm64's modes of operation.
+			none: Forcefully disable KVM and run in nVHE mode,
+			      preventing KVM from ever initialising.
+
nvhe: Standard nVHE-based mode, without support for
  protected guests.
   diff --git a/arch/arm64/include/asm/kvm_host.h
b/arch/arm64/include/asm/kvm_host.h
index f8be56d5342b..019490c67976 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -58,6 +58,7 @@
   enum kvm_mode {
KVM_MODE_DEFAULT,
KVM_MODE_PROTECTED,
+   KVM_MODE_NONE,
   };
   enum kvm_mode kvm_get_mode(void);
   diff --git a/arch/arm64/kernel/idreg-override.c
b/arch/arm64/kernel/idreg-override.c
index d8e606fe3c21..57013c1b6552 100644
--- a/arch/arm64/kernel/idreg-override.c
+++ b/arch/arm64/kernel/idreg-override.c
@@ -95,6 +95,7 @@ static const struct {
charalias[FTR_ALIAS_NAME_LEN];
charfeature[FTR_ALIAS_OPTION_LEN];
   } aliases[] __initconst = {
+   { "kvm-arm.mode=none","id_aa64mmfr1.vh=0" },
{ "kvm-arm.mode=nvhe","id_aa64mmfr1.vh=0" },
{ "kvm-arm.mode=protected",   "id_aa64mmfr1.vh=0" },
{ "arm64.nobti",  "id_aa64pfr1.bt=0" },
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index fe102cd2e518..cdc70e238316 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -2064,6 +2064,11 @@ int kvm_arch_init(void *opaque)
return -ENODEV;
}
+	if (kvm_get_mode() == KVM_MODE_NONE) {
+		kvm_info("KVM disabled from command line\n");
+		return -ENODEV;
+	}
+
in_hyp_mode = is_kernel_in_hyp_mode();
	if (cpus_have_final_cap(ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE) ||
@@ -2137,8 +2142,15 @@ static int __init early_kvm_mode_cfg(char *arg)
return 0;
}
-	if (strcmp(arg, "nvhe") == 0 && !WARN_ON(is_kernel_in_hyp_mode()))
+   if (strcmp(arg, "nvhe") == 0 && !WARN_ON(is_kernel_in_hyp_mode())) {
+   kvm_mode = KVM_MODE_DEFAULT;
return 0;
+   }
+
+   if (strcmp(arg, "none") == 0 && !WARN_ON(is_kernel_in_hyp_mode())) {


nit: Does this really need to WARN here? Unlike the "nvhe" case, if the
user wants to keep KVM out of the picture for, say, debugging
something, it is perfectly OK to allow the kernel to keep running at EL2
without having to change the firmware to alter the landing EL for the
kernel.


Well, the doc says "run in nVHE mode" and the option forces
id_aa64mmfr1.vh=0. The WARN_ON() will only fire on broken^Wfruity HW
that is VHE only. Note that this doesn't rely on any firmware change
(we drop from EL2 to EL1 and stay there).


Ah, ok. So the "none" is in fact "nvhe + no-kvm". That's the bit I
missed. TBH, that name to me sounds like "no KVM" at all, which is what
we want. The question is, do we really need "none" to force vh == 0 ? I
understand this is only a problem on a rare set of HWs. But the generic
option looks deceiving.

That said, I am happy to leave this as is and the doc says so.



We could add another option (none-vhe?) that stays at EL2 and still
disables KVM if there is an appetite for it.


Na. Don't think that is necessary.




Otherwise,

Acked-by: Suzuki K Poulose 


Suzuki


Re: [PATCH] KVM: arm64: Allow KVM to be disabled from the command line

2021-09-29 Thread Suzuki K Poulose

On 03/09/2021 10:16, Marc Zyngier wrote:

Although KVM can be compiled out of the kernel, it cannot be disabled
at runtime. Allow this possibility by introducing a new mode that
will prevent KVM from initialising.

This is useful in the (limited) circumstances where you don't want
KVM to be available (what is wrong with you?), or when you want
to install another hypervisor instead (good luck with that).

Signed-off-by: Marc Zyngier 
---
  Documentation/admin-guide/kernel-parameters.txt |  3 +++
  arch/arm64/include/asm/kvm_host.h   |  1 +
  arch/arm64/kernel/idreg-override.c  |  1 +
  arch/arm64/kvm/arm.c| 14 +-
  4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 91ba391f9b32..cc5f68846434 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2365,6 +2365,9 @@
kvm-arm.mode=
[KVM,ARM] Select one of KVM/arm64's modes of operation.
  
+			none: Forcefully disable KVM and run in nVHE mode,
+			      preventing KVM from ever initialising.
+
nvhe: Standard nVHE-based mode, without support for
  protected guests.
  
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h

index f8be56d5342b..019490c67976 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -58,6 +58,7 @@
  enum kvm_mode {
KVM_MODE_DEFAULT,
KVM_MODE_PROTECTED,
+   KVM_MODE_NONE,
  };
  enum kvm_mode kvm_get_mode(void);
  
diff --git a/arch/arm64/kernel/idreg-override.c b/arch/arm64/kernel/idreg-override.c

index d8e606fe3c21..57013c1b6552 100644
--- a/arch/arm64/kernel/idreg-override.c
+++ b/arch/arm64/kernel/idreg-override.c
@@ -95,6 +95,7 @@ static const struct {
charalias[FTR_ALIAS_NAME_LEN];
charfeature[FTR_ALIAS_OPTION_LEN];
  } aliases[] __initconst = {
+   { "kvm-arm.mode=none","id_aa64mmfr1.vh=0" },
{ "kvm-arm.mode=nvhe","id_aa64mmfr1.vh=0" },
{ "kvm-arm.mode=protected",   "id_aa64mmfr1.vh=0" },
{ "arm64.nobti",  "id_aa64pfr1.bt=0" },
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index fe102cd2e518..cdc70e238316 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -2064,6 +2064,11 @@ int kvm_arch_init(void *opaque)
return -ENODEV;
}
  
+	if (kvm_get_mode() == KVM_MODE_NONE) {
+		kvm_info("KVM disabled from command line\n");
+		return -ENODEV;
+	}
+
in_hyp_mode = is_kernel_in_hyp_mode();
  
  	if (cpus_have_final_cap(ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE) ||

@@ -2137,8 +2142,15 @@ static int __init early_kvm_mode_cfg(char *arg)
return 0;
}
  
-	if (strcmp(arg, "nvhe") == 0 && !WARN_ON(is_kernel_in_hyp_mode()))

+   if (strcmp(arg, "nvhe") == 0 && !WARN_ON(is_kernel_in_hyp_mode())) {
+   kvm_mode = KVM_MODE_DEFAULT;
return 0;
+   }
+
+   if (strcmp(arg, "none") == 0 && !WARN_ON(is_kernel_in_hyp_mode())) {


nit: Does this really need to WARN here? Unlike the "nvhe" case, if the
user wants to keep KVM out of the picture for, say, debugging
something, it is perfectly OK to allow the kernel to keep running at EL2
without having to change the firmware to alter the landing EL for the
kernel.

Otherwise,

Acked-by: Suzuki K Poulose 


Re: [RFC PATCH v4 01/39] KVM: arm64: Make lock_all_vcpus() available to the rest of KVM

2021-09-22 Thread Suzuki K Poulose

On 25/08/2021 17:17, Alexandru Elisei wrote:

The VGIC code uses the lock_all_vcpus() function to make sure no VCPUs are
run while it fiddles with the global VGIC state. Move the declaration of
lock_all_vcpus() and the corresponding unlock function into asm/kvm_host.h
where it can be reused by other parts of KVM/arm64 and rename the functions
to kvm_{lock,unlock}_all_vcpus() to make them more generic.

Because the scope of the code potentially using the functions has
increased, add a lockdep check that the kvm->lock is held by the caller.
Holding the lock is necessary because otherwise userspace would be able to
create new VCPUs and run them while the existing VCPUs are locked.

No functional change intended.

Signed-off-by: Alexandru Elisei 



LGTM,

Reviewed-by: Suzuki K Poulose 


Re: [RFC PATCH v4 00/39] KVM: arm64: Add Statistical Profiling Extension (SPE) support

2021-09-22 Thread Suzuki K Poulose

On 25/08/2021 17:17, Alexandru Elisei wrote:

This is v4 of the SPE series posted at [1]. v2 can be found at [2], and the
original series at [3].

Statistical Profiling Extension (SPE) is an optional feature added in
ARMv8.2. It allows sampling at regular intervals of the operations executed
by the PE and storing a record of each operation in a memory buffer. A high
level overview of the extension is presented in an article on arm.com [4].

This is another complete rewrite of the series, and nothing is set in
stone. If you think of a better way to do things, please suggest it.


Features added
==============

The rewrite enabled me to add support for several features not
present in the previous iteration:

- Support for heterogeneous systems, where only some of the CPUs support SPE.
   This is accomplished via the KVM_ARM_VCPU_SUPPORTED_CPUS VCPU ioctl.

- Support for VM migration with the KVM_ARM_VCPU_SPE_CTRL(KVM_ARM_VCPU_SPE_STOP)
   VCPU ioctl.

- The requirement for userspace to mlock() the guest memory has been removed,
   and now userspace can make changes to memory contents after the memory is
   mapped at stage 2.

- Better debugging of guest memory pinning by printing a warning when we
   get an unexpected read or write fault. This helped me catch several bugs
   during development; it has already proven very useful. Many thanks to
   James, who suggested it when reviewing v3.


Missing features


I've tried to keep the series as small as possible to make it easier to review,
while implementing the core functionality needed for the SPE emulation. As such,
I've chosen to not implement several features:

- Host profiling a guest which has the SPE feature bit set (see open
   questions).

- No errata workarounds have been implemented yet, and there are quite a few of
   them for Neoverse N1 and Neoverse V1.

- Disabling CONFIG_NUMA_BALANCING is a hack to get KVM SPE to work and I am
   investigating other ways to get around automatic numa balancing, like
   requiring userspace to disable it via set_mempolicy(). I am also going to
   look at how VFIO gets around it. Suggestions welcome.

- There's plenty of room for optimization. Off the top of my head, using
   block mappings at stage 2, batch pinning of pages (similar to what VFIO
   does), optimize the way KVM keeps track of pinned pages (using a linked
   list triples the memory usage), context-switch the SPE registers on
   vcpu_load/vcpu_put on VHE if the host is not profiling, locking
   optimizations, etc, etc.

- ...and others. I'm sure I'm missing at least a few things which are
   important for someone.


Known issues


This is an RFC, so keep in mind that almost definitely there will be scary
bugs. For example, below is a list of known issues which don't affect the
correctness of the emulation, and which I'm planning to fix in a future
iteration:

- With CONFIG_PROVE_LOCKING=y, lockdep complains about lock contention when
   the VCPU executes the dcache clean pending ops.

- With CONFIG_PROVE_LOCKING=y, KVM will hit a BUG at
   kvm_lock_all_vcpus()->mutex_trylock(&vcpu->mutex) with more than 48
   VCPUs.

This BUG statement can also be triggered with mainline. To reproduce it,
compile kvmtool from this branch [5] and follow the instruction in the
kvmtool commit message.

One workaround could be to stop trying to lock all VCPUs when locking a
memslot and document the fact that it is required that no VCPUs are run
before the ioctl completes, otherwise bad things might happen to the VM.


Open questions
==============

1. Implementing support for host profiling a guest with the SPE feature
means setting the profiling buffer owning regime to EL2. While that is in
effect,  PMBIDR_EL1.P will equal 1. This has two consequences: if the guest
probes SPE during this time, the driver will fail; and the guest will be
able to determine when it is profiled. I see two options here:


This doesn't mean that EL2 owns the SPE. It only tells you that a
higher exception level owns it; it could just as well be EL3 (e.g.,
MDCR_EL3.NSPB == 0 or 1). So I think this is architecturally correct,
as long as we trap the guest's accesses to the other SPE registers and
inject an UNDEF.
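As a rough sketch of that behaviour (assumed shape only, not code from the
series), the trap handler could boil down to:

/* Sketch: deny a guest SPE sysreg access by injecting an UNDEF. */
static bool access_spe_denied(struct kvm_vcpu *vcpu,
			      struct sys_reg_params *p,
			      const struct sys_reg_desc *r)
{
	kvm_inject_undefined(vcpu);
	return false;
}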


Thanks
Suzuki


[PATCH] kvm: arm64: nvhe: Save the SPE context early

2021-03-02 Thread Suzuki K Poulose
The nVHE KVM hyp drains and disables the SPE buffer, before
entering the guest, as the EL1&0 translation regime
is going to be loaded with that of the guest.

But this operation is performed way too late, because:
  - The owning translation regime of the SPE buffer
is transferred to EL2. (MDCR_EL2_E2PB == 0)
  - The guest Stage1 is loaded.

Thus the flush could use the host EL1 virtual addresses, but walk
the EL2 translations instead of the host EL1 ones, when writing
out any cached data.

Fix this by moving the SPE buffer handling early enough.
The restore path is doing the right thing.

Fixes: 014c4c77aad7 ("KVM: arm64: Improve debug register save/restore flow")
Cc: sta...@vger.kernel.org
Cc: Christoffer Dall 
Cc: Marc Zyngier 
Cc: Will Deacon 
Cc: Catalin Marinas 
Cc: Mark Rutland 
Cc: Alexandru Elisei 
Reviewed-by: Alexandru Elisei 
Signed-off-by: Suzuki K Poulose 
---
 arch/arm64/include/asm/kvm_hyp.h   |  5 +
 arch/arm64/kvm/hyp/nvhe/debug-sr.c | 12 ++--
 arch/arm64/kvm/hyp/nvhe/switch.c   | 11 ++-
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index c0450828378b..385bd7dd3d39 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -83,6 +83,11 @@ void sysreg_restore_guest_state_vhe(struct kvm_cpu_context 
*ctxt);
 void __debug_switch_to_guest(struct kvm_vcpu *vcpu);
 void __debug_switch_to_host(struct kvm_vcpu *vcpu);
 
+#ifdef __KVM_NVHE_HYPERVISOR__
+void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu);
+void __debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu);
+#endif
+
 void __fpsimd_save_state(struct user_fpsimd_state *fp_regs);
 void __fpsimd_restore_state(struct user_fpsimd_state *fp_regs);
 
diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c 
b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
index 91a711aa8382..f401724f12ef 100644
--- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
@@ -58,16 +58,24 @@ static void __debug_restore_spe(u64 pmscr_el1)
write_sysreg_s(pmscr_el1, SYS_PMSCR_EL1);
 }
 
-void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
+void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu)
 {
/* Disable and flush SPE data generation */
	__debug_save_spe(&vcpu->arch.host_debug_state.pmscr_el1);
+}
+
+void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
+{
__debug_switch_to_guest_common(vcpu);
 }
 
-void __debug_switch_to_host(struct kvm_vcpu *vcpu)
+void __debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu)
 {
__debug_restore_spe(vcpu->arch.host_debug_state.pmscr_el1);
+}
+
+void __debug_switch_to_host(struct kvm_vcpu *vcpu)
+{
__debug_switch_to_host_common(vcpu);
 }
 
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index f3d0e9eca56c..59aa1045fdaf 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -192,6 +192,14 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
pmu_switch_needed = __pmu_switch_to_guest(host_ctxt);
 
__sysreg_save_state_nvhe(host_ctxt);
+   /*
+* We must flush and disable the SPE buffer for nVHE, as
+* the translation regime(EL1&0) is going to be loaded with
+* that of the guest. And we must do this before we change the
+* translation regime to EL2 (via MDCR_EL2_E2PB == 0) and
+* before we load guest Stage1.
+*/
+   __debug_save_host_buffers_nvhe(vcpu);
 
__adjust_pc(vcpu);
 
@@ -234,11 +242,12 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
if (vcpu->arch.flags & KVM_ARM64_FP_ENABLED)
__fpsimd_save_fpexc32(vcpu);
 
+   __debug_switch_to_host(vcpu);
/*
 * This must come after restoring the host sysregs, since a non-VHE
 * system may enable SPE here and make use of the TTBRs.
 */
-   __debug_switch_to_host(vcpu);
+   __debug_restore_host_buffers_nvhe(vcpu);
 
if (pmu_switch_needed)
__pmu_switch_to_host(host_ctxt);
-- 
2.24.1



Re: [PATCH v5 13/21] arm64: Allow ID_AA64MMFR1_EL1.VH to be overridden from the command line

2021-01-25 Thread Suzuki K Poulose

On 1/25/21 10:50 AM, Marc Zyngier wrote:

As we want to be able to disable VHE at runtime, let's match
"id_aa64mmfr1.vh=" from the command line as an override.
This doesn't have much effect yet as our boot code doesn't look
at the cpufeature, but only at the HW registers.

Signed-off-by: Marc Zyngier 
Acked-by: David Brazdil 
---
  arch/arm64/include/asm/cpufeature.h |  2 ++
  arch/arm64/kernel/cpufeature.c  |  5 -
  arch/arm64/kernel/idreg-override.c  | 11 +++
  3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index 4179cfc8ed57..b0ed37da4067 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -818,6 +818,8 @@ static inline unsigned int get_vmid_bits(u64 mmfr1)
return 8;
  }
  
+extern struct arm64_ftr_override id_aa64mmfr1_override;

+
  u32 get_kvm_ipa_limit(void);
  void dump_cpu_features(void);
  
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c

index 4b84a1e1dc51..c1d6712c4249 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -557,6 +557,8 @@ static const struct arm64_ftr_bits ftr_raz[] = {
  
  #define ARM64_FTR_REG(id, table) ARM64_FTR_REG_OVERRIDE(id, table, _override)
  
+struct arm64_ftr_override id_aa64mmfr1_override;


Does this need to be ro_after_init ?

Otherwise, looks good to me:

Acked-by: Suzuki K Poulose 


Re: [PATCH v4 10/21] arm64: cpufeature: Use IDreg override in __read_sysreg_by_encoding()

2021-01-23 Thread Suzuki K Poulose

On 1/22/21 6:53 PM, Catalin Marinas wrote:

On Mon, Jan 18, 2021 at 09:45:22AM +, Marc Zyngier wrote:

__read_sysreg_by_encoding() is used by a bunch of cpufeature helpers,
which should take the feature override into account. Let's do that.

For a good measure (and because we are likely to need to further
down the line), make this helper available to the rest of the
non-modular kernel.

Code that needs to know the *real* features of a CPU can still
use read_sysreg_s(), and find the bare, ugly truth.

Signed-off-by: Marc Zyngier 



diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index aaa075c6f029..48a011935d8c 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1149,14 +1149,17 @@ u64 read_sanitised_ftr_reg(u32 id)
  EXPORT_SYMBOL_GPL(read_sanitised_ftr_reg);
  
  #define read_sysreg_case(r)	\

-   case r: return read_sysreg_s(r)
+   case r: val = read_sysreg_s(r); break;
  
  /*

   * __read_sysreg_by_encoding() - Used by a STARTING cpu before cpuinfo is 
populated.
   * Read the system register on the current CPU
   */
-static u64 __read_sysreg_by_encoding(u32 sys_id)
+u64 __read_sysreg_by_encoding(u32 sys_id)
  {
+   struct arm64_ftr_reg *regp;
+   u64 val;
+
switch (sys_id) {
read_sysreg_case(SYS_ID_PFR0_EL1);
read_sysreg_case(SYS_ID_PFR1_EL1);
@@ -1199,6 +1202,14 @@ static u64 __read_sysreg_by_encoding(u32 sys_id)
BUG();
return 0;
}
+
+   regp  = get_arm64_ftr_reg(sys_id);
+   if (regp && regp->override_mask && regp->override_val) {
+   val &= ~*regp->override_mask;
+   val |= (*regp->override_val & *regp->override_mask);
+   }
+
+   return val;


Ah, now the previous patch makes more sense. I don't particularly like
this but I can't tell how to work around it. I was hoping that the
overriding feature behaves more like a secondary CPU that limits all the
overridden features. However, this approach would fail for FTR_EXACT
cases (like PAC, though I wonder whether it fails already with your
previous patch since the boot CPU value won't match the override, hence
dropping to the safe one).



Correct! For FTR_EXACT, we don't want to override a value that is not safe,
e.g. PAC. This is handled correctly in the previous patch and thus we are
covered.

Reviewed-by: Suzuki K Poulose 



Re: [PATCH v4 09/21] arm64: cpufeature: Add global feature override facility

2021-01-23 Thread Suzuki K Poulose

On 1/18/21 9:45 AM, Marc Zyngier wrote:

Add a facility to globally override a feature, no matter what
the HW says. Yes, this sounds dangerous, but we do respect the
"safe" value for a given feature. This doesn't mean the user
doesn't need to know what they are doing.

Nothing uses this yet, so we are pretty safe. For now.

Signed-off-by: Marc Zyngier 


Reviewed-by: Suzuki K Poulose 


Re: [PATCH v3 09/21] arm64: cpufeature: Add global feature override facility

2021-01-12 Thread Suzuki K Poulose

On 1/12/21 11:51 AM, Marc Zyngier wrote:

On 2021-01-12 11:50, Marc Zyngier wrote:

Hi Suzuki,

On 2021-01-12 09:17, Suzuki K Poulose wrote:

Hi Marc,

On 1/11/21 7:48 PM, Marc Zyngier wrote:


[...]


diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 894af60b9669..00d99e593b65 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -774,6 +774,7 @@ static void __init init_cpu_ftr_reg(u32 sys_reg, u64 new)
  u64 strict_mask = ~0x0ULL;
  u64 user_mask = 0;
  u64 valid_mask = 0;
+    u64 override_val = 0, override_mask = 0;

  const struct arm64_ftr_bits *ftrp;
  struct arm64_ftr_reg *reg = get_arm64_ftr_reg(sys_reg);
@@ -781,9 +782,35 @@ static void __init init_cpu_ftr_reg(u32 sys_reg, u64 new)
  if (!reg)
  return;

+    if (reg->override_mask && reg->override_val) {
+    override_mask = *reg->override_mask;
+    override_val = *reg->override_val;
+    }
+
  for (ftrp = reg->ftr_bits; ftrp->width; ftrp++) {
  u64 ftr_mask = arm64_ftr_mask(ftrp);
  s64 ftr_new = arm64_ftr_value(ftrp, new);
+    s64 ftr_ovr = arm64_ftr_value(ftrp, override_val);
+
+    if ((ftr_mask & override_mask) == ftr_mask) {
+    if (ftr_ovr < ftr_new) {


Here we assume that all the features are FTR_LOWER_SAFE. We could
probably use arm64_ftr_safe_value(ftrp, ftr_new, ftr_ovr) here?
That would cover us for both HIGHER_SAFE and LOWER_SAFE features.
However, that may be restrictive for FTR_EXACT, as the safe
value would be set to "ftr->safe_val". I guess that may be better
than forcing the use of an unsafe value for the boot CPU, which could
anyway conflict with the other CPUs and eventually trigger the
ftr value to be set to safe_val.


I like the idea of using the helper, as it cleans up the code a bit.
However, not being able to set a feature to a certain value could be
restrictive, as in general it means that we can only disable a feature
and not adjust its level of support.

Take PMUVER for example: with the helper, I can't override it from v8.4 to
v8.1. I can only go to v8.0.


Actually, we can only *disable* the PMU altogether. Same question though...


It depends on two things :

1) What is the safe value for an EXACT typed feature?
   Usually, that means either disabled, or the lowest possible value.

2) How is this value consumed?
   a) The per-CPU value: then none of these changes have any effect.
   b) The system wide value: then we get the safe value as "influenced"
      by the infrastructure.

The safe value we use for EXACT features is exclusively for making sure
that the system uses the safe assumption and thus should be the best
option.

To answer your question, for the PMU it is 0, which implies v8.0. Or we could
update the safe value to -1 (0xf) instead, which is a bit safer, kind of
implying that the PMU is not a standard one.


Cheers
Suzuki





     M.



Is it something we care about?

Thanks,

    M.






Re: [PATCH v3 09/21] arm64: cpufeature: Add global feature override facility

2021-01-12 Thread Suzuki K Poulose

On 1/12/21 11:50 AM, Marc Zyngier wrote:

Hi Suzuki,

On 2021-01-12 09:17, Suzuki K Poulose wrote:

Hi Marc,

On 1/11/21 7:48 PM, Marc Zyngier wrote:


[...]


diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 894af60b9669..00d99e593b65 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -774,6 +774,7 @@ static void __init init_cpu_ftr_reg(u32 sys_reg, u64 new)
  u64 strict_mask = ~0x0ULL;
  u64 user_mask = 0;
  u64 valid_mask = 0;
+    u64 override_val = 0, override_mask = 0;

  const struct arm64_ftr_bits *ftrp;
  struct arm64_ftr_reg *reg = get_arm64_ftr_reg(sys_reg);
@@ -781,9 +782,35 @@ static void __init init_cpu_ftr_reg(u32 sys_reg, u64 new)
  if (!reg)
  return;

+    if (reg->override_mask && reg->override_val) {
+    override_mask = *reg->override_mask;
+    override_val = *reg->override_val;
+    }
+
  for (ftrp = reg->ftr_bits; ftrp->width; ftrp++) {
  u64 ftr_mask = arm64_ftr_mask(ftrp);
  s64 ftr_new = arm64_ftr_value(ftrp, new);
+    s64 ftr_ovr = arm64_ftr_value(ftrp, override_val);
+
+    if ((ftr_mask & override_mask) == ftr_mask) {
+    if (ftr_ovr < ftr_new) {


Here we assume that all the features are FTR_LOWER_SAFE. We could
probably use arm64_ftr_safe_value(ftrp, ftr_new, ftr_ovr) here?
That would cover us for both HIGHER_SAFE and LOWER_SAFE features.
However, that may be restrictive for FTR_EXACT, as the safe
value would be set to "ftr->safe_val". I guess that may be better
than forcing the use of an unsafe value for the boot CPU, which could
anyway conflict with the other CPUs and eventually trigger the
ftr value to be set to safe_val.


I like the idea of using the helper, as it cleans up the code a bit.
However, not being able to set a feature to a certain value could be
restrictive, as in general it means that we can only disable a feature
and not adjust its level of support.

Take PMUVER for example: with the helper, I can't override it from v8.4 to
v8.1. I can only go to v8.0.


My point is, we set this only for the "init" of cpu features. So, even if we
init to a custom, non-(default-safe) value, the secondary CPUs could scream,
and the system wide safe value could fall back to the "safe" value for EXACT
features, no matter what you did to init it.

Also, it will be dangerous for things like PAC, as the lower level value may
not be compatible with the "actual" cpu feature supported.

So simply setting to a lower value for EXACT features is generally not safe,
though I understand some are exceptions.

Suzuki





Is it something we care about?

Thanks,

     M.




Re: [PATCH v3 09/21] arm64: cpufeature: Add global feature override facility

2021-01-12 Thread Suzuki K Poulose

Hi Marc,

On 1/11/21 7:48 PM, Marc Zyngier wrote:

Hi Catalin,

On 2021-01-11 18:41, Catalin Marinas wrote:

Hi Marc,

On Mon, Jan 11, 2021 at 01:27:59PM +, Marc Zyngier wrote:

Add a facility to globally override a feature, no matter what
the HW says. Yes, this is dangerous.


Yeah, it's dangerous. We can make it less so if we only allow safe
values (e.g. lower if FTR_UNSIGNED).


My plan was also to allow non-safe values in order to trigger features
that are not advertised by the HW. But I can understand if you are
reluctant to allow such a thing! :D


diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index 9a555809b89c..465d2cb63bfc 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -75,6 +75,8 @@ struct arm64_ftr_reg {
 u64    sys_val;
 u64    user_val;
 const struct arm64_ftr_bits    *ftr_bits;
+    u64    *override_val;
+    u64    *override_mask;
 };


At the arm64_ftr_reg level, we don't have any information about the safe
values for a feature. Could we instead move this to arm64_ftr_bits? We
probably only need a single field. When populating the feature values,
we can make sure it doesn't go above the hardware one.

I attempted a feature modification for MTE here, though I dropped the
entire series in the meantime as we clarified the ARM ARM:

https://lore.kernel.org/linux-arm-kernel/20200515171612.1020-24-catalin.mari...@arm.com/

Srinivas copied it in his patch (but forgot to give credit ;)):

https://lore.kernel.org/linux-arm-msm/1610152163-16554-3-git-send-email-sram...@codeaurora.org/

The above adds a filter function but, instead, just use your mechanism in
this series for idreg.feature setting via cmdline. The arm64_ftr_value()
function extracts the hardware value and lowers it if a cmdline argument
was passed.


One thing is that it is not always possible to sanitise the value passed
if it is required very early on, as I do with VHE. But in that case
I actually check that we are VHE capable before starting to poke at
VHE-specific state.

I came up with the following patch on top, which preserves the current
global approach (no per arm64_ftr_bits state), but checks (and alters)
the override as it iterates through the various fields.

For example, if I pass "arm64.nopauth kvm-arm.mode=nvhe id_aa64pfr1.bt=5"
to the FVP, I get the following output:

[    0.00] CPU features: SYS_ID_AA64ISAR1_EL1[31:28]: forced from 1 to 0
[    0.00] CPU features: SYS_ID_AA64ISAR1_EL1[11:8]: forced from 1 to 0
[    0.00] CPU features: SYS_ID_AA64MMFR1_EL1[11:8]: forced from 1 to 0
[    0.00] CPU features: SYS_ID_AA64PFR1_EL1[3:0]: not forcing 1 to 5
[    0.00] CPU features: detected: GIC system register CPU interface
[    0.00] CPU features: detected: Hardware dirty bit management
[    0.00] CPU features: detected: Spectre-v4
[    0.00] CPU features: detected: Branch Target Identification

showing that the PAC features have been downgraded, together with VHE,
but that BTI is still detected as value 5 was obviously bogus.

Thoughts?

     M.

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 894af60b9669..00d99e593b65 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -774,6 +774,7 @@ static void __init init_cpu_ftr_reg(u32 sys_reg, u64 new)
  u64 strict_mask = ~0x0ULL;
  u64 user_mask = 0;
  u64 valid_mask = 0;
+    u64 override_val = 0, override_mask = 0;

  const struct arm64_ftr_bits *ftrp;
  struct arm64_ftr_reg *reg = get_arm64_ftr_reg(sys_reg);
@@ -781,9 +782,35 @@ static void __init init_cpu_ftr_reg(u32 sys_reg, u64 new)
  if (!reg)
  return;

+    if (reg->override_mask && reg->override_val) {
+    override_mask = *reg->override_mask;
+    override_val = *reg->override_val;
+    }
+
  for (ftrp = reg->ftr_bits; ftrp->width; ftrp++) {
  u64 ftr_mask = arm64_ftr_mask(ftrp);
  s64 ftr_new = arm64_ftr_value(ftrp, new);
+    s64 ftr_ovr = arm64_ftr_value(ftrp, override_val);
+
+    if ((ftr_mask & override_mask) == ftr_mask) {
+    if (ftr_ovr < ftr_new) {


Here we assume that all the features are FTR_LOWER_SAFE. We could
probably use arm64_ftr_safe_value(ftrp, ftr_new, ftr_ovr) here?
That would cover us for both HIGHER_SAFE and LOWER_SAFE features.
However, that may be restrictive for FTR_EXACT, as the safe
value would be set to "ftr->safe_val". I guess that may be better
than forcing the use of an unsafe value for the boot CPU, which could
anyway conflict with the other CPUs and eventually trigger the
ftr value to be set to safe_val.

i.e.,
ftr_val = arm64_ftr_safe_value(ftrp, ftr_ovr, ftr_new);
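In context, the init_cpu_ftr_reg() loop body could then look roughly like this
(untested sketch of the suggestion above, with the unrelated mask handling
elided):

	for (ftrp = reg->ftr_bits; ftrp->width; ftrp++) {
		u64 ftr_mask = arm64_ftr_mask(ftrp);
		s64 ftr_new = arm64_ftr_value(ftrp, new);

		if ((ftr_mask & override_mask) == ftr_mask) {
			s64 ftr_ovr = arm64_ftr_value(ftrp, override_val);

			/* Let the per-field policy (LOWER/HIGHER_SAFE, EXACT) decide. */
			ftr_new = arm64_ftr_safe_value(ftrp, ftr_ovr, ftr_new);
		}

		/* ... carry on building the masks with ftr_new as before ... */
	}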


Suzuki


Re: [PATCH 03/11] KVM: arm64: Make kvm_skip_instr() and co private to HYP

2020-10-27 Thread Suzuki K Poulose

On 10/26/20 1:34 PM, Marc Zyngier wrote:

In an effort to remove the vcpu PC manipulations from EL1 on nVHE
systems, move kvm_skip_instr() to be HYP-specific. EL1's intent
to increment PC post emulation is now signalled via a flag in the
vcpu structure.

Signed-off-by: Marc Zyngier 
---
  arch/arm64/include/asm/kvm_emulate.h   | 27 +--
  arch/arm64/include/asm/kvm_host.h  |  1 +
  arch/arm64/kvm/handle_exit.c   |  6 +--
  arch/arm64/kvm/hyp/include/hyp/adjust_pc.h | 56 ++
  arch/arm64/kvm/hyp/include/hyp/switch.h|  2 +
  arch/arm64/kvm/hyp/nvhe/switch.c   |  3 ++
  arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c   |  2 +
  arch/arm64/kvm/hyp/vgic-v3-sr.c|  2 +
  arch/arm64/kvm/hyp/vhe/switch.c|  3 ++
  arch/arm64/kvm/mmio.c  |  2 +-
  arch/arm64/kvm/mmu.c   |  2 +-
  arch/arm64/kvm/sys_regs.c  |  2 +-
  12 files changed, 77 insertions(+), 31 deletions(-)
  create mode 100644 arch/arm64/kvm/hyp/include/hyp/adjust_pc.h

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index 0864f425547d..6d2b5d1aa7b3 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -472,32 +472,9 @@ static inline unsigned long vcpu_data_host_to_guest(struct 
kvm_vcpu *vcpu,
return data;/* Leave LE untouched */
  }
  
-static __always_inline void kvm_skip_instr(struct kvm_vcpu *vcpu)

+static __always_inline void kvm_incr_pc(struct kvm_vcpu *vcpu)
  {
-   if (vcpu_mode_is_32bit(vcpu)) {
-   kvm_skip_instr32(vcpu);
-   } else {
-   *vcpu_pc(vcpu) += 4;
-   *vcpu_cpsr(vcpu) &= ~PSR_BTYPE_MASK;
-   }
-
-   /* advance the singlestep state machine */
-   *vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;
-}
-
-/*
- * Skip an instruction which has been emulated at hyp while most guest sysregs
- * are live.
- */
-static __always_inline void __kvm_skip_instr(struct kvm_vcpu *vcpu)
-{
-   *vcpu_pc(vcpu) = read_sysreg_el2(SYS_ELR);
-   vcpu_gp_regs(vcpu)->pstate = read_sysreg_el2(SYS_SPSR);
-
-   kvm_skip_instr(vcpu);
-
-   write_sysreg_el2(vcpu_gp_regs(vcpu)->pstate, SYS_SPSR);
-   write_sysreg_el2(*vcpu_pc(vcpu), SYS_ELR);
+   vcpu->arch.flags |= KVM_ARM64_INCREMENT_PC;
  }
  
  #endif /* __ARM64_KVM_EMULATE_H__ */

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 0aecbab6a7fb..9a75de3ad8da 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -406,6 +406,7 @@ struct kvm_vcpu_arch {
  #define KVM_ARM64_GUEST_HAS_SVE   (1 << 5) /* SVE exposed to 
guest */
  #define KVM_ARM64_VCPU_SVE_FINALIZED  (1 << 6) /* SVE config completed */
  #define KVM_ARM64_GUEST_HAS_PTRAUTH   (1 << 7) /* PTRAUTH exposed to guest */
+#define KVM_ARM64_INCREMENT_PC (1 << 8) /* Increment PC */
  
  #define vcpu_has_sve(vcpu) (system_supports_sve() && \

((vcpu)->arch.flags & KVM_ARM64_GUEST_HAS_SVE))
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 30bf8e22df54..d4e00a864ee6 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -61,7 +61,7 @@ static int handle_smc(struct kvm_vcpu *vcpu)
 * otherwise return to the same address...
 */
vcpu_set_reg(vcpu, 0, ~0UL);
-   kvm_skip_instr(vcpu);
+   kvm_incr_pc(vcpu);
return 1;
  }
  
@@ -100,7 +100,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)

kvm_clear_request(KVM_REQ_UNHALT, vcpu);
}
  
-	kvm_skip_instr(vcpu);

+   kvm_incr_pc(vcpu);
  
  	return 1;

  }
@@ -221,7 +221,7 @@ static int handle_trap_exceptions(struct kvm_vcpu *vcpu)
 * that fail their condition code check"
 */
if (!kvm_condition_valid(vcpu)) {
-   kvm_skip_instr(vcpu);
+   kvm_incr_pc(vcpu);
handled = 1;
} else {
exit_handle_fn exit_handler;
diff --git a/arch/arm64/kvm/hyp/include/hyp/adjust_pc.h 
b/arch/arm64/kvm/hyp/include/hyp/adjust_pc.h
new file mode 100644
index ..4ecaf5cb2633
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/hyp/adjust_pc.h
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Guest PC manipulation helpers
+ *
+ * Copyright (C) 2012,2013 - ARM Ltd
+ * Copyright (C) 2020 - Google LLC
+ * Author: Marc Zyngier 
+ */
+
+#ifndef __ARM64_KVM_HYP_ADJUST_PC_H__
+#define __ARM64_KVM_HYP_ADJUST_PC_H__
+
+#include 
+#include 
+
+static inline void kvm_skip_instr(struct kvm_vcpu *vcpu)
+{
+   if (vcpu_mode_is_32bit(vcpu)) {
+   kvm_skip_instr32(vcpu);
+   } else {
+   *vcpu_pc(vcpu) += 4;
+   *vcpu_cpsr(vcpu) &= ~PSR_BTYPE_MASK;
+   }
+
+   /* advance the singlestep state machine */
+   *vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;
+}
+
+/*
+ * Skip 

Re: [PATCH 2/2] KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set

2020-08-11 Thread Suzuki K Poulose

On 08/11/2020 11:27 AM, Will Deacon wrote:

When an MMU notifier call results in unmapping a range that spans multiple
PGDs, we end up calling into cond_resched_lock() when crossing a PGD boundary,
since this avoids running into RCU stalls during VM teardown. Unfortunately,
if the VM is destroyed as a result of OOM, then blocking is not permitted
and the call to the scheduler triggers the following BUG():

  | BUG: sleeping function called from invalid context at 
arch/arm64/kvm/mmu.c:394
  | in_atomic(): 1, irqs_disabled(): 0, non_block: 1, pid: 36, name: oom_reaper
  | INFO: lockdep is turned off.
  | CPU: 3 PID: 36 Comm: oom_reaper Not tainted 5.8.0 #1
  | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
  | Call trace:
  |  dump_backtrace+0x0/0x284
  |  show_stack+0x1c/0x28
  |  dump_stack+0xf0/0x1a4
  |  ___might_sleep+0x2bc/0x2cc
  |  unmap_stage2_range+0x160/0x1ac
  |  kvm_unmap_hva_range+0x1a0/0x1c8
  |  kvm_mmu_notifier_invalidate_range_start+0x8c/0xf8
  |  __mmu_notifier_invalidate_range_start+0x218/0x31c
  |  mmu_notifier_invalidate_range_start_nonblock+0x78/0xb0
  |  __oom_reap_task_mm+0x128/0x268
  |  oom_reap_task+0xac/0x298
  |  oom_reaper+0x178/0x17c
  |  kthread+0x1e4/0x1fc
  |  ret_from_fork+0x10/0x30

Use the new 'flags' argument to kvm_unmap_hva_range() to ensure that we
only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is set in the notifier
flags.
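In other words, something like this simplified sketch (the function name is
illustrative; the actual patch threads the flags through unmap_stage2_range()):

static void stage2_unmap_range(struct kvm *kvm, phys_addr_t addr,
			       phys_addr_t end, unsigned int flags)
{
	bool may_block = flags & MMU_NOTIFIER_RANGE_BLOCKABLE;
	phys_addr_t next;

	assert_spin_locked(&kvm->mmu_lock);

	do {
		next = stage2_pgd_addr_end(kvm, addr, end);
		/* ... unmap [addr, next) from stage 2 here ... */
		if (next != end && may_block)
			cond_resched_lock(&kvm->mmu_lock);
	} while (addr = next, addr != end);
}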

Cc: 
Fixes: 8b3405e345b5 ("kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd")
Cc: Marc Zyngier 
Cc: Suzuki K Poulose 
Cc: James Morse 
Signed-off-by: Will Deacon 
---


Reviewed-by: Suzuki K Poulose 


Re: [PATCH] kvmtool: arm64: Report missing support for 32bit guests

2020-07-02 Thread Suzuki K Poulose

Hi Marc

On 07/01/2020 04:42 PM, Marc Zyngier wrote:

On 2020-07-01 15:20, Suzuki K Poulose wrote:

When the host doesn't support 32bit guests, kvmtool fails
without a proper message about what is wrong, i.e.,

 $ lkvm run -c 1 Image --aarch32
  # lkvm run -k Image -m 256 -c 1 --name guest-105618
  Fatal: Unable to initialise vcpu

Given that there is no other easy way to check if the host supports 32bit
guests, it is always good to report this by checking the capability, 
rather

than leaving the users to hunt this down by looking at the code!

After this patch:

 $ lkvm run -c 1 Image --aarch32
  # lkvm run -k Image -m 256 -c 1 --name guest-105695
  Fatal: 32bit guests are not supported


Fancy!



Cc: Will Deacon 
Reported-by: Sami Mujawar 
Signed-off-by: Suzuki K Poulose 
---
 arm/kvm-cpu.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arm/kvm-cpu.c b/arm/kvm-cpu.c
index 554414f..2acecae 100644
--- a/arm/kvm-cpu.c
+++ b/arm/kvm-cpu.c
@@ -46,6 +46,10 @@ struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm,
unsigned long cpu_id)
 .features = ARM_VCPU_FEATURE_FLAGS(kvm, cpu_id)
 };

+    if (kvm->cfg.arch.aarch32_guest &&
+    !kvm__supports_extension(kvm, KVM_CAP_ARM_EL1_32BIT))


Can you please check that this still compiles for 32bit host?


Yes, it does. I have built this on an arm32 rootfs with make ARCH=arm.
The kvm->cfg.arch is common across arm/arm64 and is defined here :

arm/include/arm-common/kvm-config-arch.h

And the aarch32 command line option is only available on aarch64 host.
So this is safe on an arm32 host.




+    die("32bit guests are not supported\n");
+
 vcpu = calloc(1, sizeof(struct kvm_cpu));
 if (!vcpu)
 return NULL;


With the above detail checked,

Acked-by: Marc Zyngier 


Thanks
Suzuki


[PATCH] kvmtool: arm64: Report missing support for 32bit guests

2020-07-01 Thread Suzuki K Poulose
When the host doesn't support 32bit guests, kvmtool fails
without a proper message about what is wrong, i.e.,

 $ lkvm run -c 1 Image --aarch32
  # lkvm run -k Image -m 256 -c 1 --name guest-105618
  Fatal: Unable to initialise vcpu

Given that there is no other easy way to check if the host supports 32bit
guests, it is always good to report this by checking the capability, rather
than leaving the users to hunt this down by looking at the code!

After this patch:

 $ lkvm run -c 1 Image --aarch32
  # lkvm run -k Image -m 256 -c 1 --name guest-105695
  Fatal: 32bit guests are not supported

Cc: Will Deacon 
Reported-by: Sami Mujawar 
Signed-off-by: Suzuki K Poulose 
---
 arm/kvm-cpu.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arm/kvm-cpu.c b/arm/kvm-cpu.c
index 554414f..2acecae 100644
--- a/arm/kvm-cpu.c
+++ b/arm/kvm-cpu.c
@@ -46,6 +46,10 @@ struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, unsigned 
long cpu_id)
.features = ARM_VCPU_FEATURE_FLAGS(kvm, cpu_id)
};
 
+   if (kvm->cfg.arch.aarch32_guest &&
+   !kvm__supports_extension(kvm, KVM_CAP_ARM_EL1_32BIT))
+   die("32bit guests are not supported\n");
+
vcpu = calloc(1, sizeof(struct kvm_cpu));
if (!vcpu)
return NULL;
-- 
2.24.1



Re: [PATCH V4 06/17] arm64/cpufeature: Introduce ID_MMFR5 CPU register

2020-05-19 Thread Suzuki K Poulose

On 05/19/2020 10:40 AM, Anshuman Khandual wrote:

This adds basic building blocks required for ID_MMFR5 CPU register which
provides information about the implemented memory model and memory
management support in AArch32 state. This is added per ARM DDI 0487F.a
specification.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Marc Zyngier 
Cc: Mark Rutland 
Cc: James Morse 
Cc: Suzuki K Poulose 
Cc: kvmarm@lists.cs.columbia.edu
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org

Suggested-by: Will Deacon 
Signed-off-by: Anshuman Khandual 


Reviewed-by: Suzuki K Poulose 


Re: [PATCH V4 05/17] arm64/cpufeature: Introduce ID_DFR1 CPU register

2020-05-19 Thread Suzuki K Poulose

On 05/19/2020 10:40 AM, Anshuman Khandual wrote:

This adds basic building blocks required for ID_DFR1 CPU register which
provides top level information about the debug system in AArch32 state.
This is added per ARM DDI 0487F.a specification.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Marc Zyngier 
Cc: Mark Rutland 
Cc: James Morse 
Cc: Suzuki K Poulose 
Cc: kvmarm@lists.cs.columbia.edu
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org

Suggested-by: Will Deacon 
Signed-off-by: Anshuman Khandual 



diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 600ce237c487..faf644a66e89 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -457,6 +457,11 @@ static const struct arm64_ftr_bits ftr_id_dfr0[] = {
ARM64_FTR_END,
  };
  
+static const struct arm64_ftr_bits ftr_id_dfr1[] = {

+   S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, 
ID_DFR1_MTPMU_SHIFT, 4, 0),
+   ARM64_FTR_END,
+};
+



diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index cb79b083f97f..50a281703d9d 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -362,6 +362,7 @@ static void __cpuinfo_store_cpu(struct cpuinfo_arm64 *info)
/* Update the 32bit ID registers only if AArch32 is implemented */
if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) {
info->reg_id_dfr0 = read_cpuid(ID_DFR0_EL1);
+   info->reg_id_dfr1 = read_cpuid(ID_DFR1_EL1);
info->reg_id_isar0 = read_cpuid(ID_ISAR0_EL1);
info->reg_id_isar1 = read_cpuid(ID_ISAR1_EL1);
info->reg_id_isar2 = read_cpuid(ID_ISAR2_EL1);
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index b784b156edb3..0723cfbff7e9 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1457,7 +1457,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
ID_SANITISED(MVFR2_EL1),
ID_UNALLOCATED(3,3),
ID_SANITISED(ID_PFR2_EL1),
-   ID_UNALLOCATED(3,5),
+   ID_HIDDEN(ID_DFR1_EL1),


It might be a good idea to mention why this is HIDDEN in the description.

With that :

Reviewed-by : Suzuki K Poulose 


Re: [PATCH] KVM: arm/arm64: release kvm->mmu_lock in loop to prevent starvation

2020-05-07 Thread Suzuki K Poulose

On 04/15/2020 09:42 AM, Jiang Yi wrote:

Do cond_resched_lock() in stage2_flush_memslot() like what is done in
unmap_stage2_range() and other places holding mmu_lock while processing
a possibly large range of memory.

Signed-off-by: Jiang Yi 
---
  virt/kvm/arm/mmu.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index e3b9ee268823..7315af2c52f8 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -417,16 +417,19 @@ static void stage2_flush_memslot(struct kvm *kvm,
phys_addr_t next;
pgd_t *pgd;
  
  	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);

do {
next = stage2_pgd_addr_end(kvm, addr, end);
if (!stage2_pgd_none(kvm, *pgd))
stage2_flush_puds(kvm, pgd, addr, next);
+
+   if (next != end)
+   cond_resched_lock(>mmu_lock);
} while (pgd++, addr = next, addr != end);
  }


Given that this is called under the srcu_lock this looks
good to me:

Reviewed-by: Suzuki K Poulose 

  
  /**

   * stage2_flush_vm - Invalidate cache for pages mapped in stage 2
   * @kvm: The struct kvm pointer
   *
   * Go through the stage 2 page tables and invalidate any cache lines





Re: [PATCH V3 05/16] arm64/cpufeature: Introduce ID_DFR1 CPU register

2020-05-03 Thread Suzuki K Poulose

On 05/02/2020 02:33 PM, Anshuman Khandual wrote:

This adds basic building blocks required for ID_DFR1 CPU register which
provides top level information about the debug system in AArch32 state.
This is added per ARM DDI 0487F.a specification.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Marc Zyngier 
Cc: Mark Rutland 
Cc: James Morse 
Cc: Suzuki K Poulose 
Cc: kvmarm@lists.cs.columbia.edu
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org

Suggested-by: Will Deacon 
Signed-off-by: Anshuman Khandual 
---
  arch/arm64/include/asm/cpu.h|  1 +
  arch/arm64/include/asm/sysreg.h |  3 +++
  arch/arm64/kernel/cpufeature.c  | 10 ++
  arch/arm64/kernel/cpuinfo.c |  1 +
  arch/arm64/kvm/sys_regs.c   |  2 +-
  5 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/cpu.h b/arch/arm64/include/asm/cpu.h
index 464e828a994d..d9a78bdec409 100644
--- a/arch/arm64/include/asm/cpu.h
+++ b/arch/arm64/include/asm/cpu.h
@@ -33,6 +33,7 @@ struct cpuinfo_arm64 {
u64 reg_id_aa64zfr0;
  
  	u32		reg_id_dfr0;

+   u32 reg_id_dfr1;
u32 reg_id_isar0;
u32 reg_id_isar1;
u32 reg_id_isar2;
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index c977449e02db..2e1e922e1409 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -154,6 +154,7 @@
  #define SYS_MVFR1_EL1 sys_reg(3, 0, 0, 3, 1)
  #define SYS_MVFR2_EL1 sys_reg(3, 0, 0, 3, 2)
  #define SYS_ID_PFR2_EL1   sys_reg(3, 0, 0, 3, 4)
+#define SYS_ID_DFR1_EL1sys_reg(3, 0, 0, 3, 5)
  
  #define SYS_ID_AA64PFR0_EL1		sys_reg(3, 0, 0, 4, 0)

  #define SYS_ID_AA64PFR1_EL1   sys_reg(3, 0, 0, 4, 1)
@@ -763,6 +764,8 @@
  #define ID_ISAR4_WITHSHIFTS_SHIFT 4
  #define ID_ISAR4_UNPRIV_SHIFT 0
  
+#define ID_DFR1_MTPMU_SHIFT		0

+
  #define ID_ISAR0_DIVIDE_SHIFT 24
  #define ID_ISAR0_DEBUG_SHIFT  20
  #define ID_ISAR0_COPROC_SHIFT 16
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index a8247bf92959..2ce952d9668d 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -451,6 +451,11 @@ static const struct arm64_ftr_bits ftr_id_dfr0[] = {
ARM64_FTR_END,
  };
  
+static const struct arm64_ftr_bits ftr_id_dfr1[] = {

+   S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, 
ID_DFR1_MTPMU_SHIFT, 4, 0),




-   ID_UNALLOCATED(3,5),
+   ID_SANITISED(ID_DFR1_EL1),
ID_UNALLOCATED(3,6),
ID_UNALLOCATED(3,7),
  


IIUC, we should not expose MTPMU to KVM guests. Either we could
drop this entire patch, or we should emulate MTPMU as 0 in KVM.
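Something along these lines when presenting the register to the guest
(a rough sketch only, not a tested patch; the shift comes from the hunk
above and the field is assumed to be 4 bits wide):

#define ID_DFR1_MTPMU_SHIFT     0       /* as introduced by this patch */

/* Rough sketch: hide MTPMU by clearing the field in the guest-visible value. */
static inline unsigned int id_dfr1_for_guest(unsigned int id_dfr1)
{
        return id_dfr1 & ~(0xfU << ID_DFR1_MTPMU_SHIFT);
}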


Suzuki


Re: [PATCH v4 3/4] KVM: arm64: Change CONFIG_KVM to a menuconfig entry

2020-05-01 Thread Suzuki K Poulose

On 04/21/2020 02:17 PM, Fuad Tabba wrote:

From: Will Deacon 

Changing CONFIG_KVM to be a 'menuconfig' entry in Kconfig mean that we
can straightforwardly enumerate optional features, such as the virtual
PMU device as dependent options.

Signed-off-by: Will Deacon 
Signed-off-by: Fuad Tabba 


Acked-by: Suzuki K Poulose 


Re: [PATCH v4 1/4] KVM: arm64: Kill off CONFIG_KVM_ARM_HOST

2020-05-01 Thread Suzuki K Poulose

On 04/21/2020 02:17 PM, Fuad Tabba wrote:

From: Will Deacon 

CONFIG_KVM_ARM_HOST is just a proxy for CONFIG_KVM, so remove it in favour
of the latter.

Signed-off-by: Will Deacon 
Signed-off-by: Fuad Tabba 



Acked-by: Suzuki K Poulose 


Re: [PATCH v4 2/4] KVM: arm64: Update help text

2020-05-01 Thread Suzuki K Poulose

On 04/21/2020 02:17 PM, Fuad Tabba wrote:

From: Will Deacon 

arm64 KVM supports 16k pages since 02e0b7600f83
("arm64: kvm: Add support for 16K pages"), so update the Kconfig help
text accordingly.

Signed-off-by: Will Deacon 
Signed-off-by: Fuad Tabba 
---
  arch/arm64/kvm/Kconfig | 2 --
  1 file changed, 2 deletions(-)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index ce724e526689..d2cf4f099454 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -44,8 +44,6 @@ config KVM
select TASK_DELAY_ACCT
---help---
  Support hosting virtualized guest machines.
- We don't support KVM with 16K page tables yet, due to the multiple
- levels of fake page tables.
  
  	  If unsure, say N.


Acked-by: Suzuki K Poulose 


Re: [PATCH 04/26] arm64: Detect the ARMv8.4 TTL feature

2020-04-27 Thread Suzuki K Poulose

On 04/22/2020 01:00 PM, Marc Zyngier wrote:

In order to reduce the cost of TLB invalidation, the ARMv8.4 TTL
feature allows TLBs to be issued with a level allowing for quicker
invalidation.

Let's detect the feature for now. Further patches will implement
its actual usage.

Signed-off-by: Marc Zyngier 
---




  arch/arm64/include/asm/cpucaps.h |  3 ++-
  arch/arm64/include/asm/sysreg.h  |  1 +
  arch/arm64/kernel/cpufeature.c   | 11 +++
  3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index 8eb5a088ae658..cabb0c49a1d11 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -61,7 +61,8 @@
  #define ARM64_HAS_AMU_EXTN51
  #define ARM64_HAS_ADDRESS_AUTH52
  #define ARM64_HAS_GENERIC_AUTH53
+#define ARM64_HAS_ARMv8_4_TTL  54
  
-#define ARM64_NCAPS54

+#define ARM64_NCAPS55
  
  #endif /* __ASM_CPUCAPS_H */

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 5d10c9148e844..79cf186b7e471 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -726,6 +726,7 @@
  
  /* id_aa64mmfr2 */

  #define ID_AA64MMFR2_E0PD_SHIFT   60
+#define ID_AA64MMFR2_TTL_SHIFT 48
  #define ID_AA64MMFR2_FWB_SHIFT40
  #define ID_AA64MMFR2_AT_SHIFT 32
  #define ID_AA64MMFR2_LVA_SHIFT16
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 9892a845d06c9..d8ab4f1e93bee 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -252,6 +252,7 @@ static const struct arm64_ftr_bits ftr_id_aa64mmfr1[] = {
  
  static const struct arm64_ftr_bits ftr_id_aa64mmfr2[] = {

ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, 
ID_AA64MMFR2_E0PD_SHIFT, 4, 0),
+   ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, 
ID_AA64MMFR2_TTL_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, 
ID_AA64MMFR2_FWB_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, 
ID_AA64MMFR2_AT_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, 
ID_AA64MMFR2_LVA_SHIFT, 4, 0),
@@ -1630,6 +1631,16 @@ static const struct arm64_cpu_capabilities 
arm64_features[] = {
.matches = has_cpuid_feature,
.cpu_enable = cpu_has_fwb,
},
+   {
+   .desc = "ARMv8.4 Translation Table Level",
+   .type = ARM64_CPUCAP_SYSTEM_FEATURE,
+   .capability = ARM64_HAS_ARMv8_4_TTL,
+   .sys_reg = SYS_ID_AA64MMFR2_EL1,
+   .sign = FTR_UNSIGNED,
+   .field_pos = ID_AA64MMFR2_TTL_SHIFT,
+   .min_field_value = 1,
+   .matches = has_cpuid_feature,
+   },


Reviewed-by : Suzuki K Poulose 


Re: [PATCH 01/26] KVM: arm64: Check advertised Stage-2 page size capability

2020-04-22 Thread Suzuki K Poulose

On 04/22/2020 03:07 PM, Marc Zyngier wrote:

Hi Suzuki,

On 2020-04-22 14:40, Suzuki K Poulose wrote:

Hi Marc,

On 04/22/2020 01:00 PM, Marc Zyngier wrote:

With ARMv8.5-GTG, the hardware (or more likely a hypervisor) can
advertise the supported Stage-2 page sizes.

Let's check this at boot time.

Signed-off-by: Marc Zyngier 
---
  arch/arm64/include/asm/kvm_host.h |  2 +-
  arch/arm64/include/asm/sysreg.h   |  3 +++
  arch/arm64/kernel/cpufeature.c    |  8 +++
  arch/arm64/kvm/reset.c    | 40 ---
  virt/kvm/arm/arm.c    |  4 +---
  5 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h

index 32c8a675e5a4a..7dd8fefa6aecd 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -670,7 +670,7 @@ static inline int kvm_arm_have_ssbd(void)
  void kvm_vcpu_load_sysregs(struct kvm_vcpu *vcpu);
  void kvm_vcpu_put_sysregs(struct kvm_vcpu *vcpu);
  -void kvm_set_ipa_limit(void);
+int kvm_set_ipa_limit(void);
    #define __KVM_HAVE_ARCH_VM_ALLOC
  struct kvm *kvm_arch_alloc_vm(void);
diff --git a/arch/arm64/include/asm/sysreg.h 
b/arch/arm64/include/asm/sysreg.h

index ebc6224328318..5d10c9148e844 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -686,6 +686,9 @@
  #define ID_AA64ZFR0_SVEVER_SVE2    0x1
    /* id_aa64mmfr0 */
+#define ID_AA64MMFR0_TGRAN4_2_SHIFT    40
+#define ID_AA64MMFR0_TGRAN64_2_SHIFT    36
+#define ID_AA64MMFR0_TGRAN16_2_SHIFT    32
  #define ID_AA64MMFR0_TGRAN4_SHIFT    28
  #define ID_AA64MMFR0_TGRAN64_SHIFT    24
  #define ID_AA64MMFR0_TGRAN16_SHIFT    20
diff --git a/arch/arm64/kernel/cpufeature.c 
b/arch/arm64/kernel/cpufeature.c

index 9fac745aa7bb2..9892a845d06c9 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -208,6 +208,14 @@ static const struct arm64_ftr_bits 
ftr_id_aa64zfr0[] = {

  };
    static const struct arm64_ftr_bits ftr_id_aa64mmfr0[] = {
+    /*
+ * Page size not being supported at Stage-2 are not fatal. You
+ * just give up KVM if PAGE_SIZE isn't supported there. Go fix
+ * your favourite nesting hypervisor.
+ */
+    ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, 
ID_AA64MMFR0_TGRAN4_2_SHIFT, 4, 1),
+    ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, 
ID_AA64MMFR0_TGRAN64_2_SHIFT, 4, 1),
+    ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, 
ID_AA64MMFR0_TGRAN16_2_SHIFT, 4, 1),


One minor issue with this is that if we get a system with CPUs having values
0 and 2 (both of which indicate Stage-2 support), we might reset the field
to 1, indicating that we don't support it, and block KVM. But we can blame
the nesting hypervisor for not emulating this as 2. Do we need a comment to
make this explicit here ?


Sure. How about something like:

"There is a small corner case where the hypervisor could explicitly 
advertise
  a given granule size at Stage-2 (value 2) on some vCPUs, and use the 
fallback
  to Stage-1 (value 0) for other vCPUs. Although this is not forbidden 
by the

  architecture, it indicates that the hypervisor is being silly (or buggy).
  We make no effort to cope with this and pretend that if these fields are
  inconsistent across vCPUs, then it isn't worth trying to bring KVM up."



Looks fine to me.

Reviewed-by: Suzuki K Poulose 


Re: [PATCH 02/26] KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h

2020-04-22 Thread Suzuki K Poulose

Hi Marc,


On 04/22/2020 01:00 PM, Marc Zyngier wrote:

Having __load_guest_stage2 in kvm_hyp.h is quickly going to trigger
a circular include problem. In order to avoid this, let's move
it to kvm_mmu.h, where it will be a better fit anyway.

In the process, drop the __hyp_text annotation, which doesn't help
as the function is marked as __always_inline.

Signed-off-by: Marc Zyngier 



---
  arch/arm64/include/asm/kvm_hyp.h | 18 --
  arch/arm64/include/asm/kvm_mmu.h | 17 +
  2 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index fe57f60f06a89..dcb63bf941057 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -10,7 +10,6 @@
  #include 
  #include 
  #include 
-#include 
  #include 
  
  #define __hyp_text __section(.hyp.text) notrace

@@ -88,22 +87,5 @@ void deactivate_traps_vhe_put(void);
  u64 __guest_enter(struct kvm_vcpu *vcpu, struct kvm_cpu_context *host_ctxt);
  void __noreturn __hyp_do_panic(unsigned long, ...);
  
-/*

- * Must be called from hyp code running at EL2 with an updated VTTBR
- * and interrupts disabled.
- */
-static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
-{
-   write_sysreg(kvm->arch.vtcr, vtcr_el2);
-   write_sysreg(kvm_get_vttbr(kvm), vttbr_el2);
-
-   /*
-* ARM errata 1165522 and 1530923 require the actual execution of the
-* above before we can switch to the EL1/EL0 translation regime used by
-* the guest.
-*/
-   asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT_VHE));
-}
-
  #endif /* __ARM64_KVM_HYP_H__ */
  
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h

index 30b0e8d6b8953..5ba1310639ec6 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -604,5 +604,22 @@ static __always_inline u64 kvm_get_vttbr(struct kvm *kvm)
return kvm_phys_to_vttbr(baddr) | vmid_field | cnp;
  }
  
+/*

+ * Must be called from hyp code running at EL2 with an updated VTTBR
+ * and interrupts disabled.
+ */
+static __always_inline void __load_guest_stage2(struct kvm *kvm)
+{
+   write_sysreg(kvm->arch.vtcr, vtcr_el2);
+   write_sysreg(kvm_get_vttbr(kvm), vttbr_el2);
+
+   /*
+* ARM erratum 1165522 requires the actual execution of the above


Is it intentional to drop the reference to errata 1530923 ?

Otherwise :

Reviewed-by: Suzuki K Poulose 



Re: [PATCH 01/26] KVM: arm64: Check advertised Stage-2 page size capability

2020-04-22 Thread Suzuki K Poulose

Hi Marc,

On 04/22/2020 01:00 PM, Marc Zyngier wrote:

With ARMv8.5-GTG, the hardware (or more likely a hypervisor) can
advertise the supported Stage-2 page sizes.

Let's check this at boot time.

Signed-off-by: Marc Zyngier 
---
  arch/arm64/include/asm/kvm_host.h |  2 +-
  arch/arm64/include/asm/sysreg.h   |  3 +++
  arch/arm64/kernel/cpufeature.c|  8 +++
  arch/arm64/kvm/reset.c| 40 ---
  virt/kvm/arm/arm.c|  4 +---
  5 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 32c8a675e5a4a..7dd8fefa6aecd 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -670,7 +670,7 @@ static inline int kvm_arm_have_ssbd(void)
  void kvm_vcpu_load_sysregs(struct kvm_vcpu *vcpu);
  void kvm_vcpu_put_sysregs(struct kvm_vcpu *vcpu);
  
-void kvm_set_ipa_limit(void);

+int kvm_set_ipa_limit(void);
  
  #define __KVM_HAVE_ARCH_VM_ALLOC

  struct kvm *kvm_arch_alloc_vm(void);
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index ebc6224328318..5d10c9148e844 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -686,6 +686,9 @@
  #define ID_AA64ZFR0_SVEVER_SVE2   0x1
  
  /* id_aa64mmfr0 */

+#define ID_AA64MMFR0_TGRAN4_2_SHIFT40
+#define ID_AA64MMFR0_TGRAN64_2_SHIFT   36
+#define ID_AA64MMFR0_TGRAN16_2_SHIFT   32
  #define ID_AA64MMFR0_TGRAN4_SHIFT 28
  #define ID_AA64MMFR0_TGRAN64_SHIFT24
  #define ID_AA64MMFR0_TGRAN16_SHIFT20
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 9fac745aa7bb2..9892a845d06c9 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -208,6 +208,14 @@ static const struct arm64_ftr_bits ftr_id_aa64zfr0[] = {
  };
  
  static const struct arm64_ftr_bits ftr_id_aa64mmfr0[] = {

+   /*
+* Page size not being supported at Stage-2 are not fatal. You
+* just give up KVM if PAGE_SIZE isn't supported there. Go fix
+* your favourite nesting hypervisor.
+*/
+   ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, 
ID_AA64MMFR0_TGRAN4_2_SHIFT, 4, 1),
+   ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, 
ID_AA64MMFR0_TGRAN64_2_SHIFT, 4, 1),
+   ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, 
ID_AA64MMFR0_TGRAN16_2_SHIFT, 4, 1),


One minor issue with this is that if we get a system with CPUs having values
0 and 2 (both of which indicate Stage-2 support), we might reset the field
to 1, indicating that we don't support it, and block KVM. But we can blame
the nesting hypervisor for not emulating this as 2. Do we need a comment to
make this explicit here ?
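To spell the concern out with a toy example (a standalone sketch, not
kernel code): FTR_EXACT falls back to the safe value on any mismatch, so

/* Toy model of an FTR_EXACT field with a safe value of 1 ("not supported"). */
static int sanitise_exact(int boot_val, int other_val, int safe_val)
{
        return (boot_val == other_val) ? boot_val : safe_val;
}

/*
 * sanitise_exact(0, 2, 1) == 1, i.e. the sanitised view says "not
 * supported", even though 0 and 2 both mean the page size is supported
 * at Stage-2.
 */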

Otherwise this looks fine to me

Suzuki




/*
 * We already refuse to boot CPUs that don't support our configured
 * page size, so we can only detect mismatches for a page size other
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 30b7ea680f66c..241db35a7ef4f 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -9,6 +9,7 @@
   */
  
  #include 

+#include 
  #include 
  #include 
  #include 
@@ -340,11 +341,42 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
return ret;
  }
  
-void kvm_set_ipa_limit(void)

+int kvm_set_ipa_limit(void)
  {
-   unsigned int ipa_max, pa_max, va_max, parange;
+   unsigned int ipa_max, pa_max, va_max, parange, tgran_2;
+   u64 mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
  
-	parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 0x7;

+   /*
+* Check with ARMv8.5-GTG that our PAGE_SIZE is supported at
+* Stage-2. If not, things will stop very quickly.
+*/
+   switch (PAGE_SIZE) {
+   default:
+   case SZ_4K:
+   tgran_2 = ID_AA64MMFR0_TGRAN4_2_SHIFT;
+   break;
+   case SZ_16K:
+   tgran_2 = ID_AA64MMFR0_TGRAN16_2_SHIFT;
+   break;
+   case SZ_64K:
+   tgran_2 = ID_AA64MMFR0_TGRAN64_2_SHIFT;
+   break;
+   }
+
+   switch (FIELD_GET(0xFUL << tgran_2, mmfr0)) {
+   default:
+   case 1:
+   kvm_err("PAGE_SIZE not supported at Stage-2, giving up\n");
+   return -EINVAL;
+   case 0:
+   kvm_debug("PAGE_SIZE supported at Stage-2 (default)\n");
+   break;
+   case 2:
+   kvm_debug("PAGE_SIZE supported at Stage-2 (advertised)\n");
+   break;
+   }
+
+   parange = mmfr0 & 0x7;
pa_max = id_aa64mmfr0_parange_to_phys_shift(parange);
  
  	/* Clamp the IPA limit to the PA size supported by the kernel */

@@ -378,6 +410,8 @@ void kvm_set_ipa_limit(void)
 "KVM IPA limit (%d bit) is smaller than default size\n", ipa_max);
kvm_ipa_limit = ipa_max;
kvm_info("IPA Size Limit: %dbits\n", kvm_ipa_limit);
+
+  

Re: [PATCH v2 8/8] arm64: cpufeature: Add an overview comment for the cpufeature framework

2020-04-21 Thread Suzuki K Poulose

On 04/21/2020 03:29 PM, Will Deacon wrote:

Now that Suzuki isn't within throwing distance, I thought I'd better add
a rough overview comment to cpufeature.c so that it doesn't take me days
to remember how it works next time.

Signed-off-by: Will Deacon 
---


Reviewed-by: Suzuki K Poulose 


Re: [PATCH 3/8] arm64: cpufeature: Add CPU capability for AArch32 EL1 support

2020-04-17 Thread Suzuki K Poulose

On 04/15/2020 02:22 PM, Marc Zyngier wrote:

On Wed, 15 Apr 2020 14:15:51 +0100
Suzuki K Poulose  wrote:


On 04/15/2020 11:14 AM, Will Deacon wrote:

On Wed, Apr 15, 2020 at 11:13:54AM +0100, Suzuki K Poulose wrote:

On 04/14/2020 10:31 PM, Will Deacon wrote:

Although we emit a "SANITY CHECK" warning and taint the kernel if we
detect a CPU mismatch for AArch32 support at EL1, we still online the
CPU with disastrous consequences for any running 32-bit VMs.

Introduce a capability for AArch32 support at EL1 so that late onlining
of incompatible CPUs is forbidden.

Signed-off-by: Will Deacon 


One of the other important missing sanity checks for KVM is the VMID width
check. I will code something up.


Cheers! Do we handle things like the IPA size already?


Good point. No, we don't. I will include this too.


There is also the question of the ARMv8.5-GTG extension. I have a patch
that treats it as non-strict, but that approach would fail with KVM if
we online a late CPU without support for the right page size at S2.


Good point. Again this can be added to the list of checks performed on
the hot-plugged CPUs along with IPA, VMID width.
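Roughly along the same lines as the existing late-CPU verification hooks;
e.g. for the VMID width, something like this (an illustrative sketch only,
names made up):

/* Illustrative only: refuse to online a late CPU with a narrower VMID width. */
static bool late_cpu_vmid_width_ok(unsigned int cpu_vmid_bits,
                                   unsigned int system_vmid_bits)
{
        return cpu_vmid_bits >= system_vmid_bits;
}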

Cheers
Suzuki



Re: [PATCH 7/8] arm64: cpufeature: Relax checks for AArch32 support at EL[0-2]

2020-04-17 Thread Suzuki K Poulose

On 04/15/2020 01:29 PM, Will Deacon wrote:

On Wed, Apr 15, 2020 at 12:37:31PM +0100, Suzuki K Poulose wrote:

On 04/15/2020 11:58 AM, Will Deacon wrote:

On Wed, Apr 15, 2020 at 11:50:58AM +0100, Suzuki K Poulose wrote:

On 04/14/2020 10:31 PM, Will Deacon wrote:

We don't need to be quite as strict about mismatched AArch32 support,
which is good because the friendly hardware folks have been busy
mismatching this to their hearts' content.

 * We don't care about EL2 or EL3 (there are silly comments concerning
   the latter, so remove those)

 * EL1 support is gated by the ARM64_HAS_32BIT_EL1 capability and handled
   gracefully when a mismatch occurs

 * EL1 support is gated by the ARM64_HAS_32BIT_EL0 capability and handled


s/EL1/EL0


   gracefully when a mismatch occurs

Relax the AArch32 checks to FTR_NONSTRICT.


Agreed. We should do something similar for the features exposed by the
ELF_HWCAP, of course in a separate series.


Hmm, I didn't think we needed to touch the HWCAPs, as they're derived from
the sanitised feature register values. What am I missing?


sorry, that was cryptic. I was suggesting to relax the ftr fields to
NONSTRICT for the fields covered by ELF HWCAPs (and other CPU hwcaps).


Ah, gotcha. Given that the HWCAPs usually describe EL0 features, I say we
can punt this down the road until people give us hardware with mismatched
AArch32 at EL0.


Btw, this is not just mismatched AArch32, but mismatched AArch64 HWCAPs
too, which I believe exist. Anyway, as you said, we can delay this
until we get the reports :-)

Suzuki


Re: [PATCH 8/8] arm64: cpufeature: Add an overview comment for the cpufeature framework

2020-04-16 Thread Suzuki K Poulose

Hi Will,

On 04/14/2020 10:31 PM, Will Deacon wrote:

Now that Suzuki isn't within throwing distance, I thought I'd better add
a rough overview comment to cpufeature.c so that it doesn't take me days
to remember how it works next time.

Signed-off-by: Will Deacon 
---
  arch/arm64/kernel/cpufeature.c | 43 ++
  1 file changed, 43 insertions(+)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 680a453ca8c4..421ca99dc8fc 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -3,6 +3,49 @@
   * Contains CPU feature definitions
   *
   * Copyright (C) 2015 ARM Ltd.
+ *
+ * A note for the weary kernel hacker: the code here is confusing and hard to
+ * follow! That's partly because it's solving a nasty problem, but also because
+ * there's a little bit of over-abstraction that tends to obscure what's going
+ * on behind a maze of helper functions and macros.


Thanks for writing this up !


+ *
+ * The basic problem is that hardware folks have started gluing together CPUs
+ * with distinct architectural features; in some cases even creating SoCs where
+ * user-visible instructions are available only on a subset of the available
+ * cores. We try to address this by snapshotting the feature registers of the
+ * boot CPU and comparing these with the feature registers of each secondary
+ * CPU when bringing them up. If there is a mismatch, then we update the
+ * snapshot state to indicate the lowest-common denominator of the feature,
+ * known as the "safe" value. This snapshot state can be queried to view the


I am not sure if the following is implied above (a rough sketch follows
below):

  1) Comparison against the "snapshot" state, where a mismatch triggers
     updating the "snapshot" state to reflect the "safe" value.

  2) Comparison against the CPU feature registers of *the boot CPU* for
     "FTR_STRICT" fields, where any mismatch triggers
     TAINT_CPU_OUT_OF_SPEC. This makes sure that a warning is generated
     for each OUT_OF_SPEC secondary CPU.
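Something like this toy model (a standalone sketch, not kernel code) is
what I mean by the two steps:

/* 1) FTR_LOWER_SAFE: the snapshot keeps the lowest common denominator. */
static int update_snapshot_lower_safe(int snapshot_val, int new_cpu_val)
{
        return (new_cpu_val < snapshot_val) ? new_cpu_val : snapshot_val;
}

/*
 * 2) In addition, FTR_STRICT fields are compared against the boot CPU's
 *    values; any mismatch emits the "SANITY CHECK" warning and taints
 *    the kernel with TAINT_CPU_OUT_OF_SPEC for each offending
 *    secondary CPU.
 */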


+ * "sanitised" value of a feature register.
+ *
+ * The sanitised register values are used to decide which capabilities we
+ * have in the system. These may be in the form of traditional "hwcaps"
+ * advertised to userspace or internal "cpucaps" which are used to configure
+ * things like alternative patching and static keys. While a feature mismatch
+ * may result in a TAINT_CPU_OUT_OF_SPEC kernel taint, a capability mismatch
+ * may prevent a CPU from being onlined at all.
+ *
+ * Some implementation details worth remembering:
+ *
+ * - Mismatched features are *always* sanitised to a "safe" value, which
+ *   usually indicates that the feature is not supported.
+ *
+ * - A mismatched feature marked with FTR_STRICT will cause a "SANITY CHECK"
+ *   warning when onlining an offending CPU and the kernel will be tainted
+ *   with TAINT_CPU_OUT_OF_SPEC.


As mentioned above, this check is against the "boot CPU" register state,
which may not be obvious from the statement.


+ *
+ * - Features marked as FTR_VISIBLE have their sanitised value visible to
+ *   userspace. FTR_VISIBLE features in registers that are only visible
+ *   to EL0 by trapping *must* have a corresponding HWCAP so that late
+ *   onlining of CPUs cannot lead to features disappearing at runtime.
+ *


As you mentioned in the other response, we could add information about
the guest view, something like :

  - KVM exposes the sanitised value of the feature registers to
    guests, and this is not affected by FTR_VISIBLE. However,
    depending on the individual feature support in the hypervisor,
    some of the fields may be capped/limited.

Cheers
Suzuki


Re: [PATCH 3/8] arm64: cpufeature: Add CPU capability for AArch32 EL1 support

2020-04-15 Thread Suzuki K Poulose

On 04/15/2020 11:14 AM, Will Deacon wrote:

On Wed, Apr 15, 2020 at 11:13:54AM +0100, Suzuki K Poulose wrote:

On 04/14/2020 10:31 PM, Will Deacon wrote:

Although we emit a "SANITY CHECK" warning and taint the kernel if we
detect a CPU mismatch for AArch32 support at EL1, we still online the
CPU with disastrous consequences for any running 32-bit VMs.

Introduce a capability for AArch32 support at EL1 so that late onlining
of incompatible CPUs is forbidden.

Signed-off-by: Will Deacon 


One of the other important missing sanity checks for KVM is the VMID width
check. I will code something up.


Cheers! Do we handle things like the IPA size already?


Good point. No, we don't. I will include this too.

Cheers
Suzuki


Re: [PATCH 7/8] arm64: cpufeature: Relax checks for AArch32 support at EL[0-2]

2020-04-15 Thread Suzuki K Poulose

On 04/15/2020 11:58 AM, Will Deacon wrote:

On Wed, Apr 15, 2020 at 11:50:58AM +0100, Suzuki K Poulose wrote:

On 04/14/2020 10:31 PM, Will Deacon wrote:

We don't need to be quite as strict about mismatched AArch32 support,
which is good because the friendly hardware folks have been busy
mismatching this to their hearts' content.

* We don't care about EL2 or EL3 (there are silly comments concerning
  the latter, so remove those)

* EL1 support is gated by the ARM64_HAS_32BIT_EL1 capability and handled
  gracefully when a mismatch occurs

* EL1 support is gated by the ARM64_HAS_32BIT_EL0 capability and handled


s/EL1/EL0


  gracefully when a mismatch occurs

Relax the AArch32 checks to FTR_NONSTRICT.


Agreed. We should do something similar for the features exposed by the
ELF_HWCAP, of course in a separate series.


Hmm, I didn't think we needed to touch the HWCAPs, as they're derived from
the sanitised feature register values. What am I missing?


sorry, that was cryptic. I was suggesting to relax the ftr fields to
NONSTRICT for the fields covered by ELF HWCAPs (and other CPU hwcaps).

Suzuki


Re: [PATCH 7/8] arm64: cpufeature: Relax checks for AArch32 support at EL[0-2]

2020-04-15 Thread Suzuki K Poulose

On 04/14/2020 10:31 PM, Will Deacon wrote:

We don't need to be quite as strict about mismatched AArch32 support,
which is good because the friendly hardware folks have been busy
mismatching this to their hearts' content.

   * We don't care about EL2 or EL3 (there are silly comments concerning
 the latter, so remove those)

   * EL1 support is gated by the ARM64_HAS_32BIT_EL1 capability and handled
 gracefully when a mismatch occurs

   * EL1 support is gated by the ARM64_HAS_32BIT_EL0 capability and handled


s/EL1/EL0


 gracefully when a mismatch occurs

Relax the AArch32 checks to FTR_NONSTRICT.


Agreed. We should do something similar for the features exposed by the
ELF_HWCAP, of course in a separate series.



Signed-off-by: Will Deacon 


Reviewed-by: Suzuki K Poulose 


Re: [PATCH 6/8] arm64: cpufeature: Relax AArch32 system checks if EL1 is 64-bit only

2020-04-15 Thread Suzuki K Poulose

On 04/14/2020 10:31 PM, Will Deacon wrote:

If AArch32 is not supported at EL1, the AArch32 feature register fields
no longer advertise support for some system features:

   * ISAR4.SMC
   * PFR1.{Virt_frac, Sec_frac, Virtualization, Security, ProgMod}

In which case, we don't need to emit "SANITY CHECK" failures for all of
them.

Add logic to relax the strictness of individual feature register fields
at runtime and use this for the fields above if 32-bit EL1 is not
supported.

Signed-off-by: Will Deacon 


Reviewed-by: Suzuki K Poulose 


Re: [PATCH 5/8] arm64: cpufeature: Factor out checking of AArch32 features

2020-04-15 Thread Suzuki K Poulose

On 04/14/2020 10:31 PM, Will Deacon wrote:

update_cpu_features() is pretty large, so split out the checking of the
AArch32 features into a separate function and call it after checking the
AArch64 features.

Signed-off-by: Will Deacon 
---
  arch/arm64/kernel/cpufeature.c | 108 +++--
  1 file changed, 61 insertions(+), 47 deletions(-)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 7dfcdd9e75c1..32828a77acc3 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -715,6 +715,65 @@ static int check_update_ftr_reg(u32 sys_id, int cpu, u64 
val, u64 boot)
return 1;
  }
  
+static int update_32bit_cpu_features(int cpu, struct cpuinfo_arm64 *info,

+struct cpuinfo_arm64 *boot)
+{


...


-
if (id_aa64pfr0_sve(info->reg_id_aa64pfr0)) {
taint |= check_update_ftr_reg(SYS_ZCR_EL1, cpu,
info->reg_zcr, boot->reg_zcr);
@@ -845,6 +857,8 @@ void update_cpu_features(int cpu,
sve_update_vq_map();
}
  
+	taint |= update_32bit_cpu_features(cpu, info, boot);

+


This relies on the assumption that the id_aa64pfr0 has been sanitised. 
It may be worth adding a comment to make sure people (hacking the

kernel) don't move this around and break that dependency.

Either ways:

Reviewed-by: Suzuki K Poulose 


Re: [PATCH 4/8] arm64: cpufeature: Remove redundant call to id_aa64pfr0_32bit_el0()

2020-04-15 Thread Suzuki K Poulose

On 04/14/2020 10:31 PM, Will Deacon wrote:

There's no need to call id_aa64pfr0_32bit_el0() twice because the
sanitised value of ID_AA64PFR0_EL1 has already been updated for the CPU
being onlined.

Remove the redundant function call.

Signed-off-by: Will Deacon 


Reviewed-by: Suzuki K Poulose 


---
  arch/arm64/kernel/cpufeature.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 838fe5cc8d7e..7dfcdd9e75c1 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -792,9 +792,7 @@ void update_cpu_features(int cpu,
 * If we have AArch32, we care about 32-bit features for compat.
 * If the system doesn't support AArch32, don't update them.
 */
-   if (id_aa64pfr0_32bit_el0(read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1)) 
&&
-   id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) {
-
+   if (id_aa64pfr0_32bit_el0(read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1))) 
{
taint |= check_update_ftr_reg(SYS_ID_DFR0_EL1, cpu,
info->reg_id_dfr0, boot->reg_id_dfr0);
taint |= check_update_ftr_reg(SYS_ID_ISAR0_EL1, cpu,





Re: [PATCH 3/8] arm64: cpufeature: Add CPU capability for AArch32 EL1 support

2020-04-15 Thread Suzuki K Poulose

On 04/14/2020 10:31 PM, Will Deacon wrote:

Although we emit a "SANITY CHECK" warning and taint the kernel if we
detect a CPU mismatch for AArch32 support at EL1, we still online the
CPU with disastrous consequences for any running 32-bit VMs.

Introduce a capability for AArch32 support at EL1 so that late onlining
of incompatible CPUs is forbidden.

Signed-off-by: Will Deacon 


One of the other important missing sanity checks for KVM is the VMID
width check. I will code something up.


Reviewed-by: Suzuki K Poulose 


Re: [PATCH 2/8] arm64: cpufeature: Spell out register fields for ID_ISAR4 and ID_PFR1

2020-04-15 Thread Suzuki K Poulose

On 04/14/2020 10:31 PM, Will Deacon wrote:

In preparation for runtime updates to the strictness of some AArch32
features, spell out the register fields for ID_ISAR4 and ID_PFR1 to make
things clearer to read. Note that this isn't functionally necessary, as
the feature arrays themselves are not modified dynamically and remain
'const'.

Signed-off-by: Will Deacon 
---
  arch/arm64/include/asm/sysreg.h | 17 +
  arch/arm64/kernel/cpufeature.c  | 28 ++--
  2 files changed, 43 insertions(+), 2 deletions(-)


Reviewed-by: Suzuki K Poulose 


Re: [PATCH 1/8] arm64: cpufeature: Relax check for IESB support

2020-04-15 Thread Suzuki K Poulose

Hi Will

On 04/14/2020 10:31 PM, Will Deacon wrote:

From: Sai Prakash Ranjan 

We don't care if IESB is supported or not as we always set
SCTLR_ELx.IESB and, if it works, that's really great.

Relax the ID_AA64MMFR2.IESB cpufeature check so that we don't warn and
taint if it's mismatched.

Signed-off-by: Sai Prakash Ranjan 
[will: rewrote commit message]
Signed-off-by: Will Deacon 


Reviewed-by: Suzuki K Poulose 


---
  arch/arm64/kernel/cpufeature.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 9fac745aa7bb..63df28e6a425 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -247,7 +247,7 @@ static const struct arm64_ftr_bits ftr_id_aa64mmfr2[] = {
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, 
ID_AA64MMFR2_FWB_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, 
ID_AA64MMFR2_AT_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, 
ID_AA64MMFR2_LVA_SHIFT, 4, 0),
-   ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, 
ID_AA64MMFR2_IESB_SHIFT, 4, 0),
+   ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, 
ID_AA64MMFR2_IESB_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, 
ID_AA64MMFR2_LSM_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, 
ID_AA64MMFR2_UAO_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, 
ID_AA64MMFR2_CNP_SHIFT, 4, 0),





Re: [PATCH 0/6] Introduce ID_PFR2 and other CPU feature changes

2020-03-20 Thread Suzuki K Poulose

Hi Anshuman

On 02/14/2020 04:23 AM, Anshuman Khandual wrote:



On 01/28/2020 06:09 PM, Anshuman Khandual wrote:

This series is primarily motivated by an ad-hoc list from Mark Rutland
during our ID_ISAR6 discussion [1]. Besides that, it also includes a patch
which replaces various open-coded bit shift encodings in the CPU ID
registers with macros. This series is based on linux-next 20200124.

[1] https://patchwork.kernel.org/patch/11287805/

If there is anything else apart from these changes that can be accommodated
in this series, please do let me know. Thank you.


Just a gentle ping. Any updates ? Does this series look okay ? Is there
anything else related to CPU ID register feature bits which can be added
here ? FWIW, the series still applies on v5.6-rc1.


Sorry for the delay ! The series looks good to me, except for some minor
comments. Please see the individual patches.

Cheers
Suzuki


Re: [PATCH 1/6] arm64/cpufeature: Introduce ID_PFR2 CPU register

2020-03-20 Thread Suzuki K Poulose

On 01/28/2020 12:39 PM, Anshuman Khandual wrote:

This adds basic building blocks required for ID_PFR2 CPU register which
provides information about the AArch32 programmers model which must be
interpreted along with ID_PFR0 and ID_PFR1 CPU registers.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Marc Zyngier 
Cc: James Morse 
Cc: Suzuki K Poulose 
Cc: Mark Rutland 
Cc: kvmarm@lists.cs.columbia.edu
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Anshuman Khandual 


Sorry for the delay !

Reviewed-by: Suzuki K Poulose 


Re: [PATCH v2 1/2] arm64: Combine workarounds for speculative AT errata

2019-11-13 Thread Suzuki K Poulose



On 13/11/2019 11:41, Steven Price wrote:

Cortex-A57/A72 have a similar erratum to Cortex-A76 regarding speculation
of the AT instruction. Since the workaround for A57/A72 doesn't require
VHE, the restriction enforcing VHE for A76 can be removed by combining
the workaround flag for both errata.

So combine WORKAROUND_1165522 and WORKAROUND_1319367 into
WORKAROUND_SPECULATIVE_AT. The majority of the code is contained within
VHE- or NVHE-specific functions; for the cases where the code is shared,
extra checks against has_vhe() are added.

This also paves the way for adding a similar erratum for Cortex-A55.

Signed-off-by: Steven Price 



diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 4f8187a4fc46..b801f8e832aa 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -744,6 +744,16 @@ static const struct midr_range erratum_1418040_list[] = {
  };
  #endif
  
+#ifdef CONFIG_ARM64_WORKAROUND_SPECULATIVE_AT

+static const struct midr_range erratum_speculative_at_list[] = {
+#ifdef CONFIG_ARM64_ERRATUM_1165522
+   /* Cortex A76 r0p0 to r2p0 */
+   MIDR_RANGE(MIDR_CORTEX_A76, 0, 0, 2, 0),
+#endif
+   {},
+};
+#endif
+
  const struct arm64_cpu_capabilities arm64_errata[] = {
  #ifdef CONFIG_ARM64_WORKAROUND_CLEAN_CACHE
{
@@ -868,12 +878,11 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
ERRATA_MIDR_RANGE_LIST(erratum_1418040_list),
},
  #endif
-#ifdef CONFIG_ARM64_ERRATUM_1165522
+#ifdef CONFIG_ARM64_WORKAROUND_SPECULATIVE_AT
{
-   /* Cortex-A76 r0p0 to r2p0 */
.desc = "ARM erratum 1165522",
-   .capability = ARM64_WORKAROUND_1165522,
-   ERRATA_MIDR_RANGE(MIDR_CORTEX_A76, 0, 0, 2, 0),
+   .capability = ARM64_WORKAROUND_SPECULATIVE_AT,
+   ERRATA_MIDR_RANGE_LIST(erratum_speculative_at_list),
},
  #endif
  #ifdef CONFIG_ARM64_ERRATUM_1463225
@@ -910,7 +919,7 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
  #ifdef CONFIG_ARM64_ERRATUM_1319367
{
.desc = "ARM erratum 1319367",
-   .capability = ARM64_WORKAROUND_1319367,
+   .capability = ARM64_WORKAROUND_SPECULATIVE_AT,
ERRATA_MIDR_RANGE_LIST(ca57_a72),
},
  #endif


Have you tested this patch with both the errata CONFIGs turned on ?
Having multiple entries for the same capability should trigger a WARNING at
boot with init_cpu_hwcaps_indirect_list_from_array().
You could simply add the MIDRs to the midr_list and update the description
to include all the Errata numbers.
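i.e. something along the lines of the below (an untested sketch; it assumes
the A57/A72 MIDRs simply move into this list, with a single capability entry
whose description mentions both erratum numbers):

static const struct midr_range erratum_speculative_at_list[] = {
#ifdef CONFIG_ARM64_ERRATUM_1165522
        /* Cortex-A76 r0p0 to r2p0 */
        MIDR_RANGE(MIDR_CORTEX_A76, 0, 0, 2, 0),
#endif
#ifdef CONFIG_ARM64_ERRATUM_1319367
        MIDR_ALL_VERSIONS(MIDR_CORTEX_A57),
        MIDR_ALL_VERSIONS(MIDR_CORTEX_A72),
#endif
        {},
};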

Suzuki


Re: [RFC PATCH v4 2/5] ptp: Reorganize ptp_kvm modules to make it arch-independent.

2019-09-27 Thread Suzuki K Poulose




On 26/09/2019 12:42, Jianyong Wu wrote:

Currently, the ptp_kvm module implementation is only for x86 and includes
a large part of arch-specific code. This patch moves all of that code
into a new arch-related file in the same directory.

Signed-off-by: Jianyong Wu 


...


+int kvm_arch_ptp_get_clock_fn(unsigned long *cycle, struct timespec64 *tspec,
+ struct clocksource **cs)




diff --git a/include/asm-generic/ptp_kvm.h b/include/asm-generic/ptp_kvm.h
new file mode 100644
index ..208e842bfa64
--- /dev/null
+++ b/include/asm-generic/ptp_kvm.h



+int kvm_arch_ptp_get_clock_fn(long *cycle,
+   struct timespec64 *tspec, void *cs);



Conflicting types for kvm_arch_ptp_get_clock_fn() ?

Suzuki


Re: [RFC PATCH v4 2/5] ptp: Reorganize ptp_kvm modules to make it arch-independent.

2019-09-27 Thread Suzuki K Poulose




On 27/09/2019 11:10, Jianyong Wu (Arm Technology China) wrote:

Hi Suzuki,


-Original Message-
From: Suzuki K Poulose 
Sent: Thursday, September 26, 2019 9:06 PM
To: Jianyong Wu (Arm Technology China) ;
net...@vger.kernel.org; yangbo...@nxp.com; john.stu...@linaro.org;
t...@linutronix.de; pbonz...@redhat.com; sean.j.christopher...@intel.com;
m...@kernel.org; richardcoch...@gmail.com; Mark Rutland
; Will Deacon 
Cc: linux-ker...@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
kvmarm@lists.cs.columbia.edu; k...@vger.kernel.org; Steve Capper
; Kaly Xin (Arm Technology China)
; Justin He (Arm Technology China)
; nd 
Subject: Re: [RFC PATCH v4 2/5] ptp: Reorganize ptp_kvm modules to make it
arch-independent.

Hi Jianyong,

On 26/09/2019 12:42, Jianyong Wu wrote:

Currently, the ptp_kvm module implementation is only for x86 and
includes a large part of arch-specific code. This patch moves all of
that code into a new arch-related file in the same directory.

Signed-off-by: Jianyong Wu 
---
   drivers/ptp/Makefile |  1 +
   drivers/ptp/{ptp_kvm.c => kvm_ptp.c} | 77 ++--
   drivers/ptp/ptp_kvm_x86.c| 87



   include/asm-generic/ptp_kvm.h| 12 
   4 files changed, 118 insertions(+), 59 deletions(-)
   rename drivers/ptp/{ptp_kvm.c => kvm_ptp.c} (63%)


minor nit: Could we not skip renaming the file ? Given you are following the
ptp_kvm_* for the arch specific files and the header files, wouldn't it be
good to keep ptp_kvm.c ?


If the module name ptp_kvm.ko is the same as its dependent object file
ptp_kvm.o, the build will emit a warning, so I changed ptp_kvm.c to
kvm_ptp.c to avoid that conflict.


Hmm, ok. How about ptp_kvm_common.c instead of kvm_ptp.c ?

Suzuki


Re: [PATCH 1/5] arm64: Add ARM64_WORKAROUND_1319367 for all A57 and A72 versions

2019-09-27 Thread Suzuki K Poulose




On 25/09/2019 12:19, Marc Zyngier wrote:

Rework the EL2 vector hardening that is only selected for A57 and A72
so that the table can also be used for ARM64_WORKAROUND_1319367.

Signed-off-by: Marc Zyngier 

Reviewed-by: Suzuki K Poulose 


Re: [RFC PATCH v4 2/5] ptp: Reorganize ptp_kvm modules to make it arch-independent.

2019-09-26 Thread Suzuki K Poulose

Hi Jianyong,

On 26/09/2019 12:42, Jianyong Wu wrote:

Currently, the ptp_kvm module implementation is only for x86 and includes
a large part of arch-specific code. This patch moves all of that code
into a new arch-related file in the same directory.

Signed-off-by: Jianyong Wu 
---
  drivers/ptp/Makefile |  1 +
  drivers/ptp/{ptp_kvm.c => kvm_ptp.c} | 77 ++--
  drivers/ptp/ptp_kvm_x86.c| 87 
  include/asm-generic/ptp_kvm.h| 12 
  4 files changed, 118 insertions(+), 59 deletions(-)
  rename drivers/ptp/{ptp_kvm.c => kvm_ptp.c} (63%)


minor nit: Could we not skip renaming the file ? Given
you are following the ptp_kvm_* for the arch specific
files and the header files, wouldn't it be good to
keep ptp_kvm.c ?

Rest looks fine.

Cheers
Suzuki


[PATCH] MAINTAINERS: Fix spelling mistake in my name

2019-07-17 Thread Suzuki K Poulose
Fix a typo in my name in the KVM-ARM reviewers entry.

Signed-off-by: Suzuki K Poulose 
---
Will,

Please could you pick this one too ? There might be conflict with
the other updates.

---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index c144bd6a432e..ce5e40d91041 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8795,7 +8795,7 @@ KERNEL VIRTUAL MACHINE FOR ARM/ARM64 (KVM/arm, KVM/arm64)
 M: Marc Zyngier 
 R: James Morse 
 R: Julien Thierry 
-R: Suzuki K Pouloze 
+R: Suzuki K Poulose 
 L: linux-arm-ker...@lists.infradead.org (moderated for non-subscribers)
 L: kvmarm@lists.cs.columbia.edu
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git
-- 
2.21.0



Re: [PATCH 07/59] KVM: arm64: nv: Add EL2 system registers to vcpu context

2019-07-01 Thread Suzuki K Poulose

Hi Marc,

On 21/06/2019 10:37, Marc Zyngier wrote:

From: Jintack Lim 

ARM v8.3 introduces a new bit in the HCR_EL2, which is the NV bit. When
this bit is set, accessing EL2 registers in EL1 traps to EL2. In
addition, executing the following instructions in EL1 will trap to EL2:
tlbi, at, eret, and msr/mrs instructions to access SP_EL1. Most of the
instructions that trap to EL2 with the NV bit were undef at EL1 prior to
ARM v8.3. The only instruction that was not undef is eret.

This patch sets up a handler for EL2 registers and SP_EL1 register
accesses at EL1. The host hypervisor keeps those register values in
memory, and will emulate their behavior.

This patch doesn't set the NV bit yet. It will be set in a later patch
once nested virtualization support is completed.

Signed-off-by: Jintack Lim 
Signed-off-by: Marc Zyngier 
---
  arch/arm64/include/asm/kvm_host.h | 37 +++-
  arch/arm64/include/asm/sysreg.h   | 50 -
  arch/arm64/kvm/sys_regs.c | 74 ---
  3 files changed, 154 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 4bcd9c1291d5..2d4290d2513a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -173,12 +173,47 @@ enum vcpu_sysreg {
APGAKEYLO_EL1,
APGAKEYHI_EL1,
  
-	/* 32bit specific registers. Keep them at the end of the range */

+   /* 32bit specific registers. */
DACR32_EL2, /* Domain Access Control Register */
IFSR32_EL2, /* Instruction Fault Status Register */
FPEXC32_EL2,/* Floating-Point Exception Control Register */
DBGVCR32_EL2,   /* Debug Vector Catch Register */
  
+	/* EL2 registers sorted ascending by Op0, Op1, CRn, CRm, Op2 */

+   FIRST_EL2_SYSREG,
+   VPIDR_EL2 = FIRST_EL2_SYSREG,
+   /* Virtualization Processor ID Register */
+   VMPIDR_EL2, /* Virtualization Multiprocessor ID Register */
+   SCTLR_EL2,  /* System Control Register (EL2) */
+   ACTLR_EL2,  /* Auxiliary Control Register (EL2) */
+   HCR_EL2,/* Hypervisor Configuration Register */
+   MDCR_EL2,   /* Monitor Debug Configuration Register (EL2) */
+   CPTR_EL2,   /* Architectural Feature Trap Register (EL2) */
+   HSTR_EL2,   /* Hypervisor System Trap Register */
+   HACR_EL2,   /* Hypervisor Auxiliary Control Register */
+   TTBR0_EL2,  /* Translation Table Base Register 0 (EL2) */
+   TTBR1_EL2,  /* Translation Table Base Register 1 (EL2) */
+   TCR_EL2,/* Translation Control Register (EL2) */
+   VTTBR_EL2,  /* Virtualization Translation Table Base Register */
+   VTCR_EL2,   /* Virtualization Translation Control Register */
+   SPSR_EL2,   /* EL2 saved program status register */
+   ELR_EL2,/* EL2 exception link register */
+   AFSR0_EL2,  /* Auxiliary Fault Status Register 0 (EL2) */
+   AFSR1_EL2,  /* Auxiliary Fault Status Register 1 (EL2) */
+   ESR_EL2,/* Exception Syndrome Register (EL2) */
+   FAR_EL2,/* Hypervisor IPA Fault Address Register */
+   HPFAR_EL2,  /* Hypervisor IPA Fault Address Register */
+   MAIR_EL2,   /* Memory Attribute Indirection Register (EL2) */
+   AMAIR_EL2,  /* Auxiliary Memory Attribute Indirection Register 
(EL2) */
+   VBAR_EL2,   /* Vector Base Address Register (EL2) */
+   RVBAR_EL2,  /* Reset Vector Base Address Register */
+   RMR_EL2,/* Reset Management Register */
+   CONTEXTIDR_EL2, /* Context ID Register (EL2) */
+   TPIDR_EL2,  /* EL2 Software Thread ID Register */
+   CNTVOFF_EL2,/* Counter-timer Virtual Offset register */
+   CNTHCTL_EL2,/* Counter-timer Hypervisor Control register */
+   SP_EL2, /* EL2 Stack Pointer */


Does this need to include the Counter-timer Hyp Physical timer registers
too, i.e. CNTHP_{CTL,CVAL,TVAL}_EL2 ? Or do we have some other magic with
NV for the virtual EL2 ?

Cheers
Suzuki



Re: [PATCH 05/59] KVM: arm64: nv: Reset VCPU to EL2 registers if VCPU nested virt is set

2019-06-24 Thread Suzuki K Poulose




On 21/06/2019 10:37, Marc Zyngier wrote:

From: Christoffer Dall 

Reset the VCPU with PSTATE.M = EL2h when the nested virtualization
feature is enabled on the VCPU.

Signed-off-by: Christoffer Dall 
Signed-off-by: Marc Zyngier 
---
  arch/arm64/kvm/reset.c | 7 +++
  1 file changed, 7 insertions(+)

diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 1140b4485575..675ca07dbb05 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -52,6 +52,11 @@ static const struct kvm_regs default_regs_reset = {
PSR_F_BIT | PSR_D_BIT),
  };
  
+static const struct kvm_regs default_regs_reset_el2 = {

+   .regs.pstate = (PSR_MODE_EL2h | PSR_A_BIT | PSR_I_BIT |
+   PSR_F_BIT | PSR_D_BIT),
+};
+
  static const struct kvm_regs default_regs_reset32 = {
.regs.pstate = (PSR_AA32_MODE_SVC | PSR_AA32_A_BIT |
PSR_AA32_I_BIT | PSR_AA32_F_BIT),
@@ -302,6 +307,8 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
if (!cpu_has_32bit_el1())
goto out;
cpu_reset = _regs_reset32;
+   } else if (test_bit(KVM_ARM_VCPU_NESTED_VIRT, 
vcpu->arch.features)) {


minor nit: "else if nested_virt_in_use(vcpu)" ?

Either ways:

Reviewed-by: Suzuki K Poulose 


Re: [PATCH v9 4/5] KVM: arm/arm64: remove pmc->bitmask

2019-06-17 Thread Suzuki K Poulose




On 17/06/2019 16:43, Andrew Murray wrote:

On Thu, Jun 13, 2019 at 05:50:43PM +0100, Suzuki K Poulose wrote:



On 13/06/2019 10:39, Andrew Murray wrote:

On Thu, Jun 13, 2019 at 08:30:51AM +0100, Julien Thierry wrote:



index ae1e886d4a1a..88ce24ae0b45 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -47,7 +47,10 @@ u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 
select_idx)
counter += perf_event_read_value(pmc->perf_event, ,
 );
-   return counter & pmc->bitmask;
+   if (select_idx != ARMV8_PMU_CYCLE_IDX)
+   counter = lower_32_bits(counter);


Shouldn't this depend on PMCR.LC as well? If PMCR.LC is clear we only
want the lower 32bits of the cycle counter.


Yes that's correct. The hunk should look like this:

-   return counter & pmc->bitmask;
+   if (!(select_idx == ARMV8_PMU_CYCLE_IDX &&
+ __vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_LC))
+   counter = lower_32_bits(counter);
+
+   return counter;


Maybe you could add a macro :

#define vcpu_pmu_counter_is_64bit(vcpu, idx) ?


Yes I think a helper would be useful - though I'd prefer the name
'kvm_pmu_idx_is_long_cycle_counter'. This seems a little clearer as
you could otherwise argue that a chained counter is also 64 bits.


When you get to adding 64bit PMU counters (ARMv8.5), this would be handy. So
having it de-coupled from the cycle counter may be a good idea. Anyway,
I leave that to you.

Cheers
Suzuki


Re: [PATCH v9 5/5] KVM: arm/arm64: support chained PMU counters

2019-06-14 Thread Suzuki K Poulose




On 12/06/2019 20:04, Andrew Murray wrote:

ARMv8 provides support for chained PMU counters: when an event type
of 0x001E is set for an odd-numbered counter, that counter will
increment by one for each overflow of the preceding even-numbered
counter. Let's emulate this in KVM by creating a 64 bit perf counter
when a user chains two emulated counters together.

For chained events we only support generating an overflow interrupt
on the high counter. We use the attributes of the low counter to
determine the attributes of the perf event.

Suggested-by: Marc Zyngier 
Signed-off-by: Andrew Murray 
Reviewed-by: Julien Thierry 


Reviewed-by: Suzuki K Poulose 


Re: [PATCH v9 4/5] KVM: arm/arm64: remove pmc->bitmask

2019-06-13 Thread Suzuki K Poulose




On 13/06/2019 10:39, Andrew Murray wrote:

On Thu, Jun 13, 2019 at 08:30:51AM +0100, Julien Thierry wrote:

Hi Andrew,

On 12/06/2019 20:04, Andrew Murray wrote:

We currently use pmc->bitmask to determine the width of the pmc - however
it's superfluous as the pmc index already describes if the pmc is a cycle
counter or event counter. The architecture clearly describes the widths of
these counters.

Let's remove the bitmask to simplify the code.

Signed-off-by: Andrew Murray 
---
  include/kvm/arm_pmu.h |  1 -
  virt/kvm/arm/pmu.c| 19 +--
  2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index b73f31baca52..2f0e28dc5a9e 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -28,7 +28,6 @@
  struct kvm_pmc {
u8 idx; /* index into the pmu->pmc array */
struct perf_event *perf_event;
-   u64 bitmask;
  };
  
  struct kvm_pmu {

diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index ae1e886d4a1a..88ce24ae0b45 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -47,7 +47,10 @@ u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 
select_idx)
counter += perf_event_read_value(pmc->perf_event, ,
 );
  
-	return counter & pmc->bitmask;

+   if (select_idx != ARMV8_PMU_CYCLE_IDX)
+   counter = lower_32_bits(counter);


Shouldn't this depend on PMCR.LC as well? If PMCR.LC is clear we only
want the lower 32bits of the cycle counter.


Yes that's correct. The hunk should look like this:

-   return counter & pmc->bitmask;
+   if (!(select_idx == ARMV8_PMU_CYCLE_IDX &&
+ __vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_LC))
+   counter = lower_32_bits(counter);
+
+   return counter;


May be you could add a macro :

#define vcpu_pmu_counter_is_64bit(vcpu, idx) ?

Cheers
Suzuki


Re: [PATCH v8 6/6] KVM: arm/arm64: support chained PMU counters

2019-06-10 Thread Suzuki K Poulose

Hi Andrew,


@@ -398,27 +531,43 @@ static void kvm_pmu_create_perf_event(struct kvm_vcpu 
*vcpu, u64 select_idx)
  
  	/* Software increment event does't need to be backed by a perf event */

if (eventsel == ARMV8_PMUV3_PERFCTR_SW_INCR &&
-   select_idx != ARMV8_PMU_CYCLE_IDX)
+   pmc->idx != ARMV8_PMU_CYCLE_IDX)
return;
  
	memset(&attr, 0, sizeof(struct perf_event_attr));

attr.type = PERF_TYPE_RAW;
attr.size = sizeof(attr);
attr.pinned = 1;
-   attr.disabled = !kvm_pmu_counter_is_enabled(vcpu, select_idx);
+   attr.disabled = !kvm_pmu_counter_is_enabled(vcpu, pmc->idx);
attr.exclude_user = data & ARMV8_PMU_EXCLUDE_EL0 ? 1 : 0;
attr.exclude_kernel = data & ARMV8_PMU_EXCLUDE_EL1 ? 1 : 0;
attr.exclude_hv = 1; /* Don't count EL2 events */
attr.exclude_host = 1; /* Don't count host events */
-   attr.config = (select_idx == ARMV8_PMU_CYCLE_IDX) ?
+   attr.config = (pmc->idx == ARMV8_PMU_CYCLE_IDX) ?
ARMV8_PMUV3_PERFCTR_CPU_CYCLES : eventsel;
  
-	counter = kvm_pmu_get_counter_value(vcpu, select_idx);

-   /* The initial sample period (overflow count) of an event. */
-   attr.sample_period = (-counter) & GENMASK(31, 0);
+   counter = kvm_pmu_get_pair_counter_value(vcpu, pmc);
+
+   if (kvm_pmu_idx_has_chain_evtype(vcpu, pmc->idx)) {
+   /**
+* The initial sample period (overflow count) of an event. For
+* chained counters we only support overflow interrupts on the
+* high counter.
+*/
+   attr.sample_period = (-counter) & GENMASK(63, 0);
+   event = perf_event_create_kernel_counter(&attr, -1, current,
+kvm_pmu_perf_overflow,
+pmc + 1);
  
-	event = perf_event_create_kernel_counter(&attr, -1, current,

+   if (kvm_pmu_counter_is_enabled(vcpu, pmc->idx + 1))
+   attr.config1 |= PERF_ATTR_CFG1_KVM_PMU_CHAINED;
+   } else {
+   /* The initial sample period (overflow count) of an event. */
+   attr.sample_period = (-counter) & GENMASK(31, 0);
+   event = perf_event_create_kernel_counter(&attr, -1, current,
 kvm_pmu_perf_overflow, pmc);
+   }
+


If this was the Cycle counter and the PMCR_LC was set, shouldn't we be using
64bit mask here ? We fall back to using the Cycle counter in 64bit mode for
"normal" (read guest) kernel. So shouldn't we reflect that here ?

Rest looks fine to me.

Suzuki
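
As a sketch of what is being suggested here (an assumption, not the final
patch): the overflow period would be 64 bits wide for a chained pair and for
the cycle counter when the guest sets PMCR_EL0.LC, and 32 bits otherwise:

	if (kvm_pmu_idx_has_chain_evtype(vcpu, pmc->idx) ||
	    (pmc->idx == ARMV8_PMU_CYCLE_IDX &&
	     (__vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_LC)))
		attr.sample_period = (-counter) & GENMASK(63, 0);
	else
		attr.sample_period = (-counter) & GENMASK(31, 0);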


Re: [PATCH v8 5/6] KVM: arm/arm64: remove pmc->bitmask

2019-06-10 Thread Suzuki K Poulose




On 22/05/2019 17:26, Andrew Murray wrote:

On Wed, May 22, 2019 at 05:07:31PM +0100, Marc Zyngier wrote:

On 22/05/2019 16:30, Andrew Murray wrote:

We currently use pmc->bitmask to determine the width of the pmc - however
it's superfluous as the pmc index already describes if the pmc is a cycle
counter or event counter. The architecture clearly describes the widths of
these counters.

Let's remove the bitmask to simplify the code.

Signed-off-by: Andrew Murray 
---
  include/kvm/arm_pmu.h |  1 -
  virt/kvm/arm/pmu.c| 15 +--
  2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index b73f31baca52..2f0e28dc5a9e 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -28,7 +28,6 @@
  struct kvm_pmc {
u8 idx; /* index into the pmu->pmc array */
struct perf_event *perf_event;
-   u64 bitmask;
  };
  




-
-   if (val & ARMV8_PMU_PMCR_LC) {
-   pmc = &pmu->pmc[ARMV8_PMU_CYCLE_IDX];
-   pmc->bitmask = 0xUL;
-   }
  }


...

  
  static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu *vcpu, u64 select_idx)

@@ -420,7 +415,7 @@ static void kvm_pmu_create_perf_event(struct kvm_vcpu 
*vcpu, u64 select_idx)
  
  	counter = kvm_pmu_get_counter_value(vcpu, select_idx);

/* The initial sample period (overflow count) of an event. */
-   attr.sample_period = (-counter) & pmc->bitmask;
+   attr.sample_period = (-counter) & GENMASK(31, 0);


Isn't this the one case where the bitmask actually matters? If we're
dealing with the cycle counter, it shouldn't be truncated, right?


Ah yes, that should be conditional on idx as well.


The mask for Cycle counter also depends on the PMCR.LC field set by the
guest, isn't it ? So unless we correlate that with the idx, we could be
passing in wrong results ?

Suzuki


[PATCH] mm, compaction: Make sure we isolate a valid PFN

2019-05-24 Thread Suzuki K Poulose
When we have holes in a normal memory zone, we could end up having
cached_migrate_pfns which may not necessarily be valid, under heavy memory
pressure with swapping enabled (via __reset_isolation_suitable(), triggered
by kswapd).

Later if we fail to find a page via fast_isolate_freepages(), we may
end up using the migrate_pfn we started the search with, as a valid
page. This could lead to NULL pointer dereferences like the one below,
due to an invalid mem_section pointer.

Unable to handle kernel NULL pointer dereference at virtual address 
0008 [47/1825]
 Mem abort info:
   ESR = 0x9604
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x0004
   CM = 0, WnR = 0
 user pgtable: 4k pages, 48-bit VAs, pgdp = 82f94ae9
 [0008] pgd=
 Internal error: Oops: 9604 [#1] SMP
 ...
 CPU: 10 PID: 6080 Comm: qemu-system-aar Not tainted 510-rc1+ #6
 Hardware name: AmpereComputing(R) OSPREY EV-883832-X3-0001/OSPREY, BIOS 4819 
09/25/2018
 pstate: 6005 (nZCv daif -PAN -UAO)
 pc : set_pfnblock_flags_mask+0x58/0xe8
 lr : compaction_alloc+0x300/0x950
 [...]
 Process qemu-system-aar (pid: 6080, stack limit = 0x95070da5)
 Call trace:
  set_pfnblock_flags_mask+0x58/0xe8
  compaction_alloc+0x300/0x950
  migrate_pages+0x1a4/0xbb0
  compact_zone+0x750/0xde8
  compact_zone_order+0xd8/0x118
  try_to_compact_pages+0xb4/0x290
  __alloc_pages_direct_compact+0x84/0x1e0
  __alloc_pages_nodemask+0x5e0/0xe18
  alloc_pages_vma+0x1cc/0x210
  do_huge_pmd_anonymous_page+0x108/0x7c8
  __handle_mm_fault+0xdd4/0x1190
  handle_mm_fault+0x114/0x1c0
  __get_user_pages+0x198/0x3c0
  get_user_pages_unlocked+0xb4/0x1d8
  __gfn_to_pfn_memslot+0x12c/0x3b8
  gfn_to_pfn_prot+0x4c/0x60
  kvm_handle_guest_abort+0x4b0/0xcd8
  handle_exit+0x140/0x1b8
  kvm_arch_vcpu_ioctl_run+0x260/0x768
  kvm_vcpu_ioctl+0x490/0x898
  do_vfs_ioctl+0xc4/0x898
  ksys_ioctl+0x8c/0xa0
  __arm64_sys_ioctl+0x28/0x38
  el0_svc_common+0x74/0x118
  el0_svc_handler+0x38/0x78
  el0_svc+0x8/0xc
 Code: f8607840 f11f 8b011401 9a801020 (f9400400)
 ---[ end trace af6a35219325a9b6 ]---

The issue was reported on an arm64 server with 128GB with holes in the zone
(e.g, [32GB@4GB, 96GB@544GB]), with a swap device enabled, while running 100 KVM
guest instances.

This patch fixes the issue by ensuring that the page belongs to a valid PFN
when we fallback to using the lower limit of the scan range upon failure in
fast_isolate_freepages().

Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a 
migration target")
Reported-by: Marc Zyngier 
Signed-off-by: Suzuki K Poulose 
---
 mm/compaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9febc8c..9e1b9ac 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1399,7 +1399,7 @@ fast_isolate_freepages(struct compact_control *cc)
page = pfn_to_page(highest);
cc->free_pfn = highest;
} else {
-   if (cc->direct_compaction) {
+   if (cc->direct_compaction && pfn_valid(min_pfn)) {
page = pfn_to_page(min_pfn);
cc->free_pfn = min_pfn;
}
-- 
2.7.4



Re: mm/compaction: BUG: NULL pointer dereference

2019-05-24 Thread Suzuki K Poulose

Hi Mel,

Thanks for your quick response.

On 24/05/2019 11:39, Mel Gorman wrote:

On Fri, May 24, 2019 at 10:20:19AM +0100, Suzuki K Poulose wrote:

Hi,

We are hitting NULL pointer dereferences while running stress tests with KVM.
See splat [0]. The test is to spawn 100 VMs all doing standard debian
installation (Thanks to Marc's automated scripts, available here [1] ).
The problem has been reproduced with a better rate of success from 5.1-rc6
onwards.

The issue is only reproducible when swapping is enabled and the entire
memory is used up while swapping heavily. Also, this issue is only reproducible
on one server with 128GB, which has the following memory layout:

[32GB@4GB, hole , 96GB@544GB]

Here is my non-expert analysis of the issue so far.

Under extreme memory pressure, kswapd could trigger
reset_isolation_suitable() to figure out the cached values for the
migrate/free pfn for a zone, by scanning through the entire zone. On our
server it does so in the range of [ 0x10_, 0xa00_ ], with the following
area of holes : [ 0x20_, 0x880_ ]. In the failing case, we end up setting
the cached migrate pfn as : 0x508_, which is right in the center of the
zone pfn range, i.e. ( 0x10_ + 0xa00_ ) / 2, with
reset_migrate = 0x88_4e00, reset_free = 0x10_.

Now these cached values are used by fast_isolate_freepages() to find a
pfn. However, since we can't find anything during the search, we fall back
to using the page belonging to the min_pfn (which is the migrate_pfn),
without proper checks to see if that is a valid PFN or not. This is then
passed on to fast_isolate_around(), which tries to do
set_pageblock_skip(page) on the page, which blows up due to a NULL
mem_section pointer.

The following patch seems to fix the issue for me, but I am not quite
convinced that it is the right fix. Thoughts ?



I think the patch is valid and the alternatives would be unnecessarily
complicated. During a normal scan for free pages to isolate, there
is a check for pageblock_pfn_to_page() which uses a pfn_valid check
for non-contiguous zones in __pageblock_pfn_to_page. Now, while the


I had the initial version with the pageblock_pfn_to_page(), but as you said,
it is a complicated way of performing the same check as pfn_valid().


non-contiguous check could be made in the area you highlight, it would be a
relatively small optimisation that would be unmeasurable overall. However,
it is definitely the case that if the PFN you highlight is invalid that
badness happens. If you want to express this as a signed-off patch with
an adjusted changelog then I'd be happy to add


Sure, will send it right away.



Reviewed-by: Mel Gorman 



Thanks.

Suzuki


mm/compaction: BUG: NULL pointer dereference

2019-05-24 Thread Suzuki K Poulose
Hi,

We are hitting NULL pointer dereferences while running stress tests with KVM.
See splat [0]. The test is to spawn 100 VMs all doing standard debian
installation (Thanks to Marc's automated scripts, available here [1] ).
The problem has been reproduced with a better rate of success from 5.1-rc6
onwards.

The issue is only reproducible when swapping is enabled and the entire
memory is used up while swapping heavily. Also, this issue is only reproducible
on one server with 128GB, which has the following memory layout:

[32GB@4GB, hole , 96GB@544GB]

Here is my non-expert analysis of the issue so far.

Under extreme memory pressure, kswapd could trigger
reset_isolation_suitable() to figure out the cached values for the
migrate/free pfn for a zone, by scanning through the entire zone. On our
server it does so in the range of [ 0x10_, 0xa00_ ], with the following
area of holes : [ 0x20_, 0x880_ ]. In the failing case, we end up setting
the cached migrate pfn as : 0x508_, which is right in the center of the
zone pfn range, i.e. ( 0x10_ + 0xa00_ ) / 2, with
reset_migrate = 0x88_4e00, reset_free = 0x10_.

Now these cached values are used by fast_isolate_freepages() to find a
pfn. However, since we can't find anything during the search, we fall back
to using the page belonging to the min_pfn (which is the migrate_pfn),
without proper checks to see if that is a valid PFN or not. This is then
passed on to fast_isolate_around(), which tries to do
set_pageblock_skip(page) on the page, which blows up due to a NULL
mem_section pointer.

The following patch seems to fix the issue for me, but I am not quite
convinced that it is the right fix. Thoughts ?


diff --git a/mm/compaction.c b/mm/compaction.c
index 9febc8c..9e1b9ac 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1399,7 +1399,7 @@ fast_isolate_freepages(struct compact_control *cc)
page = pfn_to_page(highest);
cc->free_pfn = highest;
} else {
-   if (cc->direct_compaction) {
+   if (cc->direct_compaction && pfn_valid(min_pfn)) {
page = pfn_to_page(min_pfn);
cc->free_pfn = min_pfn;
}


Suzuki


[ 0 ] Kernel splat
 Unable to handle kernel NULL pointer dereference at virtual address 
0008 [47/1825]
 Mem abort info:
   ESR = 0x9604
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x0004
   CM = 0, WnR = 0
 user pgtable: 4k pages, 48-bit VAs, pgdp = 82f94ae9
 [0008] pgd=
 Internal error: Oops: 9604 [#1] SMP
 ...
 CPU: 10 PID: 6080 Comm: qemu-system-aar Not tainted 510-rc1+ #6
 Hardware name: AmpereComputing(R) OSPREY EV-883832-X3-0001/OSPREY, BIOS 4819 
09/25/2018
 pstate: 6005 (nZCv daif -PAN -UAO)
 pc : set_pfnblock_flags_mask+0x58/0xe8
 lr : compaction_alloc+0x300/0x950
 sp : 1fc03010
 x29: 1fc03010 x28:  
 x27:  x26: 10bf7000 
 x25: 06445000 x24: 06444e00 
 x23: 7e018f138000 x22: 0003 
 x21: 0001 x20: 06444e00 
 x19: 0001 x18:  
 x17:  x16: 809f7fe97268 
 x15: 000191138000 x14:  
 x13: 0070 x12:  
 x11: 1fc03108 x10:  
 x9 : 09222400 x8 : 0187 
 x7 : 063c4e00 x6 : 06444e00 
 x5 : 0008 x4 : 0001 
 x3 : 0003 x2 : 809f7fe92840 
 x1 : 0220 x0 :  
 Process qemu-system-aar (pid: 6080, stack limit = 0x95070da5)
 Call trace:
  set_pfnblock_flags_mask+0x58/0xe8
  compaction_alloc+0x300/0x950
  migrate_pages+0x1a4/0xbb0
  compact_zone+0x750/0xde8
  compact_zone_order+0xd8/0x118
  try_to_compact_pages+0xb4/0x290
  __alloc_pages_direct_compact+0x84/0x1e0
  __alloc_pages_nodemask+0x5e0/0xe18
  alloc_pages_vma+0x1cc/0x210
  do_huge_pmd_anonymous_page+0x108/0x7c8
  __handle_mm_fault+0xdd4/0x1190
  handle_mm_fault+0x114/0x1c0
  __get_user_pages+0x198/0x3c0
  get_user_pages_unlocked+0xb4/0x1d8
  __gfn_to_pfn_memslot+0x12c/0x3b8
  gfn_to_pfn_prot+0x4c/0x60
  kvm_handle_guest_abort+0x4b0/0xcd8
  handle_exit+0x140/0x1b8
  kvm_arch_vcpu_ioctl_run+0x260/0x768
  kvm_vcpu_ioctl+0x490/0x898
  do_vfs_ioctl+0xc4/0x898
  ksys_ioctl+0x8c/0xa0
  __arm64_sys_ioctl+0x28/0x38
  el0_svc_common+0x74/0x118
  el0_svc_handler+0x38/0x78
  el0_svc+0x8/0xc
 Code: f8607840 f11f 8b011401 9a801020 (f9400400) 
 ---[ end trace af6a35219325a9b6 ]---


[1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/vminstall.git/

Re: [PATCH v7 4/5] arm64: perf: extract chain helper into header

2019-05-21 Thread Suzuki K Poulose




On 21/05/2019 16:52, Andrew Murray wrote:

The ARMv8 Performance Monitors Extension includes an architectural
event type named CHAIN which allows for chaining counters together.

Let's extract the test for this event into a header file such that
other users, such as KVM (for PMU emulation) can make use of.

Signed-off-by: Andrew Murray 
---
  arch/arm64/include/asm/perf_event.h | 5 +
  arch/arm64/kernel/perf_event.c  | 2 +-
  2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/perf_event.h 
b/arch/arm64/include/asm/perf_event.h
index c593761ba61c..cd13f3fd1055 100644
--- a/arch/arm64/include/asm/perf_event.h
+++ b/arch/arm64/include/asm/perf_event.h
@@ -219,6 +219,11 @@
  #define ARMV8_PMU_USERENR_CR  (1 << 2) /* Cycle counter can be read at EL0 */
  #define ARMV8_PMU_USERENR_ER  (1 << 3) /* Event counter can be read at EL0 */
  
+static inline bool armv8pmu_evtype_is_chain(u64 evtype)

+{
+   return (evtype == ARMV8_PMUV3_PERFCTR_CHAIN);
+}
+
  #ifdef CONFIG_PERF_EVENTS
  struct pt_regs;
  extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index 314b1adedf06..265bd835a724 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -879,7 +879,7 @@ static int armv8pmu_set_event_filter(struct hw_perf_event 
*event,
  static int armv8pmu_filter_match(struct perf_event *event)
  {
unsigned long evtype = event->hw.config_base & ARMV8_PMU_EVTYPE_EVENT;
-   return evtype != ARMV8_PMUV3_PERFCTR_CHAIN;
+   return !armv8pmu_evtype_is_chain(evtype);
  }
  
  static void armv8pmu_reset(void *info)




Reviewed-by: Suzuki K Poulose 


Re: [PATCH AUTOSEL 5.0 29/98] KVM: arm/arm64: Enforce PTE mappings at stage2 when needed

2019-04-23 Thread Suzuki K Poulose

Hi Sasha,

On 04/22/2019 08:40 PM, Sasha Levin wrote:

From: Suzuki K Poulose 

[ Upstream commit a80868f398554842b14d07060012c06efb57c456 ]

commit 6794ad5443a2118 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings")
made the checks to skip huge mappings stricter. However it introduced
a bug where we still use huge mappings, ignoring the flag to
use PTE mappings, by not resetting the vma_pagesize to PAGE_SIZE.

Also, the checks do not cover the PUD huge pages that were
under review during the same period. This patch fixes both
issues.

Fixes : 6794ad5443a2118 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings")
Reported-by: Zenghui Yu 
Cc: Zenghui Yu 
Cc: Christoffer Dall 
Signed-off-by: Suzuki K Poulose 
Signed-off-by: Marc Zyngier 
Signed-off-by: Sasha Levin (Microsoft) 


Please be aware that we need a follow up fix for this patch to fix the
problem for THP backed memory.

http://lists.infradead.org/pipermail/linux-arm-kernel/2019-April/645324.html


It should appear upstream soon.


Cheers
Suzuki


Re: [PATCH v13 8/8] arm64: docs: document perf event attributes

2019-04-15 Thread Suzuki K Poulose




On 09/04/2019 20:22, Andrew Murray wrote:

The interaction between the exclude_{host,guest} flags,
exclude_{user,kernel,hv} flags and presence of VHE can result in
different exception levels being filtered by the ARMv8 PMU. As this
can be confusing let's document how they work on arm64.

Signed-off-by: Andrew Murray 
---


Thats really helpful ! Thanks for writing one  !

Reviewed-by: Suzuki K Poulose 



Re: [PATCH v13 7/8] arm64: KVM: avoid isb's by using direct pmxevtyper sysreg

2019-04-15 Thread Suzuki K Poulose




On 09/04/2019 20:22, Andrew Murray wrote:

Upon entering or exiting a guest we may modify multiple PMU counters to
enable of disable EL0 filtering. We presently do this via the indirect
PMXEVTYPER_EL0 system register (where the counter we modify is selected
by PMSELR). With this approach it is necessary to order the writes via
isb instructions such that we select the correct counter before modifying
it.

Let's avoid potentially expensive instruction barriers by using the
direct PMEVTYPER_EL0 registers instead.

As the change to counter type relates only to EL0 filtering we can rely
on the implicit instruction barrier which occurs when we transition from
EL2 to EL1 on entering the guest. On returning to userspace we can, at the
latest, rely on the implicit barrier between EL2 and EL0. We can also
depend on the explicit isb in armv8pmu_select_counter to order our write
against any other kernel changes by the PMU driver to the type register as
a result of preemption.

Signed-off-by: Andrew Murray 
---
  arch/arm64/kvm/pmu.c | 84 ++--
  1 file changed, 74 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
index 3407a529e116..4d55193ccc71 100644
--- a/arch/arm64/kvm/pmu.c
+++ b/arch/arm64/kvm/pmu.c
@@ -91,6 +91,74 @@ void __hyp_text __pmu_switch_to_host(struct kvm_cpu_context 
*host_ctxt)
write_sysreg(pmu->events_host, pmcntenset_el0);
  }
  
+#define PMEVTYPER_READ_CASE(idx)\

+   case idx:   \
+   return read_sysreg(pmevtyper##idx##_el0)
+
+#define PMEVTYPER_WRITE_CASE(idx)  \
+   case idx:   \
+   write_sysreg(val, pmevtyper##idx##_el0);\
+   break
+
+#define PMEVTYPER_CASES(readwrite) \
+   PMEVTYPER_##readwrite##_CASE(0);\
+   PMEVTYPER_##readwrite##_CASE(1);\
+   PMEVTYPER_##readwrite##_CASE(2);\
+   PMEVTYPER_##readwrite##_CASE(3);\
+   PMEVTYPER_##readwrite##_CASE(4);\
+   PMEVTYPER_##readwrite##_CASE(5);\
+   PMEVTYPER_##readwrite##_CASE(6);\
+   PMEVTYPER_##readwrite##_CASE(7);\
+   PMEVTYPER_##readwrite##_CASE(8);\
+   PMEVTYPER_##readwrite##_CASE(9);\
+   PMEVTYPER_##readwrite##_CASE(10);   \
+   PMEVTYPER_##readwrite##_CASE(11);   \
+   PMEVTYPER_##readwrite##_CASE(12);   \
+   PMEVTYPER_##readwrite##_CASE(13);   \
+   PMEVTYPER_##readwrite##_CASE(14);   \
+   PMEVTYPER_##readwrite##_CASE(15);   \
+   PMEVTYPER_##readwrite##_CASE(16);   \
+   PMEVTYPER_##readwrite##_CASE(17);   \
+   PMEVTYPER_##readwrite##_CASE(18);   \
+   PMEVTYPER_##readwrite##_CASE(19);   \
+   PMEVTYPER_##readwrite##_CASE(20);   \
+   PMEVTYPER_##readwrite##_CASE(21);   \
+   PMEVTYPER_##readwrite##_CASE(22);   \
+   PMEVTYPER_##readwrite##_CASE(23);   \
+   PMEVTYPER_##readwrite##_CASE(24);   \
+   PMEVTYPER_##readwrite##_CASE(25);   \
+   PMEVTYPER_##readwrite##_CASE(26);   \
+   PMEVTYPER_##readwrite##_CASE(27);   \
+   PMEVTYPER_##readwrite##_CASE(28);   \
+   PMEVTYPER_##readwrite##_CASE(29);   \
+   PMEVTYPER_##readwrite##_CASE(30)
+


Don't we need case 31 and deal with the PMCCFILTR, instead of WARN_ON(1)
below ?

Otherwise,

Reviewed-by: Suzuki K Poulose 
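
For illustration, the extra case being asked about could look roughly like
this; the function name is an assumption, but index 31 is the cycle counter,
whose filtering lives in PMCCFILTR_EL0:

static u64 kvm_vcpu_pmu_read_evtype_direct(int idx)
{
	switch (idx) {
	PMEVTYPER_CASES(READ);
	case ARMV8_PMU_CYCLE_IDX:	/* counter index 31 */
		return read_sysreg(pmccfiltr_el0);
	default:
		WARN_ON(1);
	}

	return 0;
}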


Re: [PATCH v13 6/8] arm64: KVM: Enable VHE support for :G/:H perf event modifiers

2019-04-15 Thread Suzuki K Poulose




On 09/04/2019 20:22, Andrew Murray wrote:

With VHE different exception levels are used between the host (EL2) and
guest (EL1) with a shared exception level for userspace (EL0). We can take
advantage of this and use the PMU's exception level filtering to avoid
enabling/disabling counters in the world-switch code. Instead we just


s/Instead// ?


modify the counter type to include or exclude EL0 at vcpu_{load,put} time.





We also ensure that trapped PMU system register writes do not re-enable
EL0 when reconfiguring the backing perf events.

This approach completely avoids blackout windows seen with !VHE.

Suggested-by: Christoffer Dall 
Signed-off-by: Andrew Murray 
Acked-by: Will Deacon 





+/*
+ * On VHE ensure that only guest events have EL0 counting enabled
+ */
+void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpu_context *host_ctxt;
+   struct kvm_host_data *host;
+   u32 events_guest, events_host;
+
+   if (!has_vhe())
+   return;
+
+   host_ctxt = vcpu->arch.host_cpu_context;
+   host = container_of(host_ctxt, struct kvm_host_data, host_ctxt);
+   events_guest = host->pmu_events.events_guest;
+   events_host = host->pmu_events.events_host;
+
+   kvm_vcpu_pmu_enable_el0(events_guest);
+   kvm_vcpu_pmu_disable_el0(events_host);
+}
+
+/*
+ * On VHE ensure that only guest host have EL0 counting enabled


nit: s/guest/host/host events/


+ */
+void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpu_context *host_ctxt;
+   struct kvm_host_data *host;
+   u32 events_guest, events_host;
+
+   if (!has_vhe())
+   return;
+
+   host_ctxt = vcpu->arch.host_cpu_context;
+   host = container_of(host_ctxt, struct kvm_host_data, host_ctxt);
+   events_guest = host->pmu_events.events_guest;
+   events_host = host->pmu_events.events_host;
+
+   kvm_vcpu_pmu_enable_el0(events_host);
+   kvm_vcpu_pmu_disable_el0(events_guest);
+}
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 539feecda5b8..c7fa47ad2387 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -695,6 +695,7 @@ static bool access_pmcr(struct kvm_vcpu *vcpu, struct 
sys_reg_params *p,
val |= p->regval & ARMV8_PMU_PMCR_MASK;
__vcpu_sys_reg(vcpu, PMCR_EL0) = val;
kvm_pmu_handle_pmcr(vcpu, val);
+   kvm_vcpu_pmu_restore_guest(vcpu);


nit: I am not sure if we need to do this for PMCR accesses ? Unless
we have made some changes to the events (e.g., like the two instances
below). Or am I missing something here ?

Otherwise:

Reviewed-by: Suzuki K Poulose 
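
The kvm_vcpu_pmu_enable_el0()/kvm_vcpu_pmu_disable_el0() helpers used above
are not quoted in this hunk; a plausible sketch, assuming they simply toggle
the EL0 exclude bit in each selected counter's event type register:

static void kvm_vcpu_pmu_enable_el0(unsigned long events)
{
	u64 typer;
	u32 counter;

	for_each_set_bit(counter, &events, 32) {
		typer = kvm_vcpu_pmu_read_evtype_direct(counter);
		typer &= ~ARMV8_PMU_EXCLUDE_EL0;
		kvm_vcpu_pmu_write_evtype_direct(counter, typer);
	}
}

kvm_vcpu_pmu_disable_el0() would be identical except that it sets
ARMV8_PMU_EXCLUDE_EL0 instead of clearing it.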


Re: [PATCH v13 5/8] arm64: KVM: Enable !VHE support for :G/:H perf event modifiers

2019-04-15 Thread Suzuki K Poulose




On 09/04/2019 20:22, Andrew Murray wrote:

Enable/disable event counters as appropriate when entering and exiting
the guest to enable support for guest or host only event counting.

For both VHE and non-VHE we switch the counters between host/guest at
EL2.

The PMU may be on when we change which counters are enabled however
we avoid adding an isb as we instead rely on existing context
synchronisation events: the eret to enter the guest (__guest_enter)
and eret in kvm_call_hyp for __kvm_vcpu_run_nvhe on returning.

Signed-off-by: Andrew Murray 


Reviewed-by: Suzuki K Poulose 
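
The patch body is not quoted here; as a rough sketch of the world switch
described above (the structure follows the __pmu_switch_to_host() fragment
quoted earlier in this digest, the rest is an assumption):

static void __hyp_text __pmu_switch_to_guest(struct kvm_cpu_context *host_ctxt)
{
	struct kvm_host_data *host;
	struct kvm_pmu_events *pmu;

	host = container_of(host_ctxt, struct kvm_host_data, host_ctxt);
	pmu = &host->pmu_events;

	/* Stop host-only counters, start guest-only counters. */
	write_sysreg(pmu->events_host, pmcntenclr_el0);
	write_sysreg(pmu->events_guest, pmcntenset_el0);
}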


Re: [PATCH v13 4/8] arm64: arm_pmu: Add !VHE support for exclude_host/exclude_guest attributes

2019-04-11 Thread Suzuki K Poulose

On 04/09/2019 08:22 PM, Andrew Murray wrote:

Add support for the :G and :H attributes in perf by handling the
exclude_host/exclude_guest event attributes.

We notify KVM of counters that we wish to be enabled or disabled on
guest entry/exit and thus defer from starting or stopping events based
on their event attributes.

With !VHE we switch the counters between host/guest at EL2. We are able
to eliminate counters counting host events on the boundaries of guest
entry/exit when using :G by filtering out EL2 for exclude_host. When
using !exclude_hv there is a small blackout window at the guest
entry/exit where host events are not captured.

Signed-off-by: Andrew Murray 
Acked-by: Will Deacon 
---





  arch/arm64/kernel/perf_event.c | 43 --
  1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index cccf4fc86d67..6bb28aaf5aea 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -26,6 +26,7 @@
  
  #include 

  #include 
+#include 
  #include 
  #include 
  #include 
@@ -528,11 +529,21 @@ static inline int armv8pmu_enable_counter(int idx)
  
  static inline void armv8pmu_enable_event_counter(struct perf_event *event)

  {
+   struct perf_event_attr *attr = &event->attr;
int idx = event->hw.idx;
+   u32 counter_bits = BIT(ARMV8_IDX_TO_COUNTER(idx));
  
-	armv8pmu_enable_counter(idx);

if (armv8pmu_event_is_chained(event))
-   armv8pmu_enable_counter(idx - 1);
+   counter_bits |= BIT(ARMV8_IDX_TO_COUNTER(idx - 1));
+
+   kvm_set_pmu_events(counter_bits, attr);
+
+   /* We rely on the hypervisor switch code to enable guest counters */
+   if (!kvm_pmu_counter_deferred(attr)) {
+   armv8pmu_enable_counter(idx);
+   if (armv8pmu_event_is_chained(event))
+   armv8pmu_enable_counter(idx - 1);
+   }
  }
  
  static inline int armv8pmu_disable_counter(int idx)

@@ -545,11 +556,21 @@ static inline int armv8pmu_disable_counter(int idx)
  static inline void armv8pmu_disable_event_counter(struct perf_event *event)
  {
	struct hw_perf_event *hwc = &event->hw;
+   struct perf_event_attr *attr = &event->attr;
int idx = hwc->idx;
+   u32 counter_bits = BIT(ARMV8_IDX_TO_COUNTER(idx));
  
  	if (armv8pmu_event_is_chained(event))

-   armv8pmu_disable_counter(idx - 1);
-   armv8pmu_disable_counter(idx);
+   counter_bits |= BIT(ARMV8_IDX_TO_COUNTER(idx - 1));
+
+   kvm_clr_pmu_events(counter_bits);
+
+   /* We rely on the hypervisor switch code to disable guest counters */
+   if (!kvm_pmu_counter_deferred(attr)) {
+   if (armv8pmu_event_is_chained(event))
+   armv8pmu_disable_counter(idx - 1);
+   armv8pmu_disable_counter(idx);
+   }
  }
  
  static inline int armv8pmu_enable_intens(int idx)

@@ -829,11 +850,16 @@ static int armv8pmu_set_event_filter(struct hw_perf_event 
*event,
if (!attr->exclude_kernel)
config_base |= ARMV8_PMU_INCLUDE_EL2;
} else {
-   if (attr->exclude_kernel)
-   config_base |= ARMV8_PMU_EXCLUDE_EL1;
-   if (!attr->exclude_hv)
+   if (!attr->exclude_hv && !attr->exclude_host)
config_base |= ARMV8_PMU_INCLUDE_EL2;
}
+
+   /*
+* Filter out !VHE kernels and guest kernels
+*/
+   if (attr->exclude_kernel)
+   config_base |= ARMV8_PMU_EXCLUDE_EL1;
+
if (attr->exclude_user)
config_base |= ARMV8_PMU_EXCLUDE_EL0;
  
@@ -863,6 +889,9 @@ static void armv8pmu_reset(void *info)

armv8pmu_disable_intens(idx);
}
  
+	/* Clear the counters we flip at guest entry/exit */

+   kvm_clr_pmu_events(U32_MAX);
+
/*
 * Initialize & Reset PMNC. Request overflow interrupt for
 * 64 bit cycle counter but cheat in armv8pmu_write_counter().




Reviewed-by: Suzuki K Poulose 


Re: [PATCH 2/2] kvm: arm: Unify handling THP backed host memory

2019-04-11 Thread Suzuki K Poulose

Hi Zenghui,

On 11/04/2019 02:59, Zenghui Yu wrote:

Hi Suzuki,

On 2019/4/10 23:23, Suzuki K Poulose wrote:

We support mapping host memory backed by PMD transparent hugepages
at stage2 as huge pages. However the checks are now spread across
two different places. Let us unify the handling of the THPs to
keep the code cleaner (and future proof for PUD THP support).
This patch moves transparent_hugepage_adjust() closer to the caller
to avoid a forward declaration for fault_supports_stage2_huge_mappings().

Also, since we already handle the case where the host VA and the guest
PA may not be aligned, the explicit VM_BUG_ON() is not required.

Cc: Marc Zyngier 
Cc: Christoffer Dall 
Cc: Zenghui Yu 
Signed-off-by: Suzuki K Poulose 
---
   virt/kvm/arm/mmu.c | 123 
+++--
   1 file changed, 62 insertions(+), 61 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 6d73322..714eec2 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1380,53 +1380,6 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t 
guest_ipa,
return ret;
   }
   
-static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)

-{
-   kvm_pfn_t pfn = *pfnp;
-   gfn_t gfn = *ipap >> PAGE_SHIFT;
-   struct page *page = pfn_to_page(pfn);
-
-   /*
-* PageTransCompoundMap() returns true for THP and
-* hugetlbfs. Make sure the adjustment is done only for THP
-* pages.
-*/
-   if (!PageHuge(page) && PageTransCompoundMap(page)) {
-   unsigned long mask;
-   /*
-* The address we faulted on is backed by a transparent huge
-* page.  However, because we map the compound huge page and
-* not the individual tail page, we need to transfer the
-* refcount to the head page.  We have to be careful that the
-* THP doesn't start to split while we are adjusting the
-* refcounts.
-*
-* We are sure this doesn't happen, because mmu_notifier_retry
-* was successful and we are holding the mmu_lock, so if this
-* THP is trying to split, it will be blocked in the mmu
-* notifier before touching any of the pages, specifically
-* before being able to call __split_huge_page_refcount().
-*
-* We can therefore safely transfer the refcount from PG_tail
-* to PG_head and switch the pfn from a tail page to the head
-* page accordingly.
-*/
-   mask = PTRS_PER_PMD - 1;
-   VM_BUG_ON((gfn & mask) != (pfn & mask));
-   if (pfn & mask) {
-   *ipap &= PMD_MASK;
-   kvm_release_pfn_clean(pfn);
-   pfn &= ~mask;
-   kvm_get_pfn(pfn);
-   *pfnp = pfn;
-   }
-
-   return true;
-   }
-
-   return false;
-}
-
   /**
* stage2_wp_ptes - write protect PMD range
* @pmd: pointer to pmd entry
@@ -1677,6 +1630,61 @@ static bool fault_supports_stage2_huge_mapping(struct 
kvm_memory_slot *memslot,
   (hva & ~(map_size - 1)) + map_size <= uaddr_end;
   }
   
+/*

+ * Check if the given hva is backed by a transparent huge page (THP)
+ * and whether it can be mapped using block mapping in stage2. If so, adjust
+ * the stage2 PFN and IPA accordingly. Only PMD_SIZE THPs are currently
+ * supported. This will need to be updated to support other THP sizes.
+ *
+ * Returns the size of the mapping.
+ */
+static unsigned long
+transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
+   unsigned long hva, kvm_pfn_t *pfnp,
+   phys_addr_t *ipap)
+{
+   kvm_pfn_t pfn = *pfnp;
+   struct page *page = pfn_to_page(pfn);
+
+   /*
+* PageTransCompoundMap() returns true for THP and
+* hugetlbfs. Make sure the adjustment is done only for THP
+* pages. Also make sure that the HVA and IPA are sufficiently
+* aligned and that the  block map is contained within the memslot.
+*/
+   if (!PageHuge(page) && PageTransCompoundMap(page) &&


If we managed to get here, the caller has ensured that we only play
with normal size pages and that no hugetlbfs pages are involved, so
"!PageHuge(page)" will always return true and we can drop it.


I think that is a bit tricky. If someone ever modifies user_mem_abort()
and we end up getting called with a HugeTLB backed page, things could go
wrong.

I could remove the check, but I would like to add a WARN_ON_ONCE() to make
sure our assumption holds.

i.e,
WARN_ON_ONCE(PageHuge(page));

	if (PageTransCompoundMap(page) &&>> +	 
fa

Re: [PATCH 1/2] kvm: arm: Clean up the checking for huge mapping

2019-04-11 Thread Suzuki K Poulose

On 04/11/2019 02:48 AM, Zenghui Yu wrote:


On 2019/4/10 23:23, Suzuki K Poulose wrote:

If we are checking whether the stage2 can map PAGE_SIZE,
we don't have to do the boundary checks as both the host
VMA and the guest memslots are page aligned. Bail the case
easily.

Cc: Christoffer Dall 
Cc: Marc Zyngier 
Signed-off-by: Suzuki K Poulose 
---
  virt/kvm/arm/mmu.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index a39dcfd..6d73322 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1624,6 +1624,10 @@ static bool 
fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,

  hva_t uaddr_start, uaddr_end;
  size_t size;
+    /* The memslot and the VMA are guaranteed to be aligned to PAGE_SIZE */

+    if (map_size == PAGE_SIZE)
+    return true;
+
  size = memslot->npages * PAGE_SIZE;
  gpa_start = memslot->base_gfn << PAGE_SHIFT;


We can do a comment clean up as well in this patch.

s/<< PAGE_SIZE/<< PAGE_SHIFT/


Sure, I missed that. Will fix it in the next version.

Cheers
Suzuki


[PATCH 2/2] kvm: arm: Unify handling THP backed host memory

2019-04-10 Thread Suzuki K Poulose
We support mapping host memory backed by PMD transparent hugepages
at stage2 as huge pages. However the checks are now spread across
two different places. Let us unify the handling of the THPs to
keep the code cleaner (and future proof for PUD THP support).
This patch moves transparent_hugepage_adjust() closer to the caller
to avoid a forward declaration for fault_supports_stage2_huge_mappings().

Also, since we already handle the case where the host VA and the guest
PA may not be aligned, the explicit VM_BUG_ON() is not required.

Cc: Marc Zyngier 
Cc: Christoffer Dall 
Cc: Zenghui Yu 
Signed-off-by: Suzuki K Poulose 
---
 virt/kvm/arm/mmu.c | 123 +++--
 1 file changed, 62 insertions(+), 61 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 6d73322..714eec2 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1380,53 +1380,6 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t 
guest_ipa,
return ret;
 }
 
-static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
-{
-   kvm_pfn_t pfn = *pfnp;
-   gfn_t gfn = *ipap >> PAGE_SHIFT;
-   struct page *page = pfn_to_page(pfn);
-
-   /*
-* PageTransCompoundMap() returns true for THP and
-* hugetlbfs. Make sure the adjustment is done only for THP
-* pages.
-*/
-   if (!PageHuge(page) && PageTransCompoundMap(page)) {
-   unsigned long mask;
-   /*
-* The address we faulted on is backed by a transparent huge
-* page.  However, because we map the compound huge page and
-* not the individual tail page, we need to transfer the
-* refcount to the head page.  We have to be careful that the
-* THP doesn't start to split while we are adjusting the
-* refcounts.
-*
-* We are sure this doesn't happen, because mmu_notifier_retry
-* was successful and we are holding the mmu_lock, so if this
-* THP is trying to split, it will be blocked in the mmu
-* notifier before touching any of the pages, specifically
-* before being able to call __split_huge_page_refcount().
-*
-* We can therefore safely transfer the refcount from PG_tail
-* to PG_head and switch the pfn from a tail page to the head
-* page accordingly.
-*/
-   mask = PTRS_PER_PMD - 1;
-   VM_BUG_ON((gfn & mask) != (pfn & mask));
-   if (pfn & mask) {
-   *ipap &= PMD_MASK;
-   kvm_release_pfn_clean(pfn);
-   pfn &= ~mask;
-   kvm_get_pfn(pfn);
-   *pfnp = pfn;
-   }
-
-   return true;
-   }
-
-   return false;
-}
-
 /**
  * stage2_wp_ptes - write protect PMD range
  * @pmd:   pointer to pmd entry
@@ -1677,6 +1630,61 @@ static bool fault_supports_stage2_huge_mapping(struct 
kvm_memory_slot *memslot,
   (hva & ~(map_size - 1)) + map_size <= uaddr_end;
 }
 
+/*
+ * Check if the given hva is backed by a transparent huge page (THP)
+ * and whether it can be mapped using block mapping in stage2. If so, adjust
+ * the stage2 PFN and IPA accordingly. Only PMD_SIZE THPs are currently
+ * supported. This will need to be updated to support other THP sizes.
+ *
+ * Returns the size of the mapping.
+ */
+static unsigned long
+transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
+   unsigned long hva, kvm_pfn_t *pfnp,
+   phys_addr_t *ipap)
+{
+   kvm_pfn_t pfn = *pfnp;
+   struct page *page = pfn_to_page(pfn);
+
+   /*
+* PageTransCompoundMap() returns true for THP and
+* hugetlbfs. Make sure the adjustment is done only for THP
+* pages. Also make sure that the HVA and IPA are sufficiently
+* aligned and that the  block map is contained within the memslot.
+*/
+   if (!PageHuge(page) && PageTransCompoundMap(page) &&
+   fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
+   /*
+* The address we faulted on is backed by a transparent huge
+* page.  However, because we map the compound huge page and
+* not the individual tail page, we need to transfer the
+* refcount to the head page.  We have to be careful that the
+* THP doesn't start to split while we are adjusting the
+* refcounts.
+*
+* We are sure this doesn't happen, because mmu_notifier_retry
+* was successful and we are holding the mmu_lock, so if this
+* THP is trying to split, it will be block

[PATCH 1/2] kvm: arm: Clean up the checking for huge mapping

2019-04-10 Thread Suzuki K Poulose
If we are checking whether the stage2 can map PAGE_SIZE,
we don't have to do the boundary checks as both the host
VMA and the guest memslots are page aligned. Bail the case
easily.

Cc: Christoffer Dall 
Cc: Marc Zyngier 
Signed-off-by: Suzuki K Poulose 
---
 virt/kvm/arm/mmu.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index a39dcfd..6d73322 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1624,6 +1624,10 @@ static bool fault_supports_stage2_huge_mapping(struct 
kvm_memory_slot *memslot,
hva_t uaddr_start, uaddr_end;
size_t size;
 
+   /* The memslot and the VMA are guaranteed to be aligned to PAGE_SIZE */
+   if (map_size == PAGE_SIZE)
+   return true;
+
size = memslot->npages * PAGE_SIZE;
 
gpa_start = memslot->base_gfn << PAGE_SHIFT;
-- 
2.7.4



[PATCH 0/2] kvm: arm: Unify stage2 mapping for THP backed memory

2019-04-10 Thread Suzuki K Poulose


This series cleans up the handling of the stage2 huge mapping for THP.
Applies on top of [1]


[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2019-April/645324.html

Suzuki K Poulose (2):
  kvm: arm: Clean up the checking for huge mapping
  kvm: arm: Unify handling THP backed host memory

 virt/kvm/arm/mmu.c | 127 -
 1 file changed, 66 insertions(+), 61 deletions(-)

-- 
2.7.4



[PATCH v2] kvm: arm: Skip stage2 huge mappings for unaligned ipa backed by THP

2019-04-10 Thread Suzuki K Poulose
With commit a80868f398554842b14, we no longer ensure that the
THP page is properly aligned in the guest IPA. Skip the stage2
huge mapping for unaligned IPA backed by transparent hugepages.

Fixes: a80868f398554842b14 ("KVM: arm/arm64: Enforce PTE mappings at stage2 
when needed")
Reported-by: Eric Auger 
Cc: Marc Zyngier 
Cc: Christoffer Dall 
Cc: Zenghui Yu 
Cc: Zheng Xiang 
Cc: Andrew Murray 
Cc: Eric Auger 
Signed-off-by: Suzuki K Poulose 

---
Changes since V1:
 - Zenghui Yu reported that the first version misses the boundary
   check for the end address. Let's reuse the existing helper to
   check it.
---
 virt/kvm/arm/mmu.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 27c9583..a39dcfd 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1781,8 +1781,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 * Only PMD_SIZE transparent hugepages(THP) are
 * currently supported. This code will need to be
 * updated to support other THP sizes.
+*
+* Make sure the host VA and the guest IPA are sufficiently
+* aligned and that the block is contained within the memslot.
 */
-   if (transparent_hugepage_adjust(&pfn, &fault_ipa))
+   if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE) &&
+   transparent_hugepage_adjust(&pfn, &fault_ipa))
vma_pagesize = PMD_SIZE;
}
 
-- 
2.7.4



Re: [PATCH] kvm: arm: Skip stage2 huge mappings for unaligned ipa backed by THP

2019-04-10 Thread Suzuki K Poulose



On 10/04/2019 03:20, Zenghui Yu wrote:


On 2019/4/9 22:59, Suzuki K Poulose wrote:

Hi Zenghui

On 04/09/2019 09:05 AM, Zenghui Yu wrote:



On 2019/4/9 2:40, Suzuki K Poulose wrote:

Hi Zenghui,

On 04/08/2019 04:11 PM, Zenghui Yu wrote:

Hi Suzuki,

Thanks for the reply.



...


Hi Suzuki,

Why not making use of fault_supports_stage2_huge_mapping()?  Let
it do
some checks for us.

fault_supports_stage2_huge_mapping() was intended to do a *two-step*
check to tell us that can we create stage2 huge block mappings,
and this
check is both for hugetlbfs and THP.  With commit
a80868f398554842b14,
we pass PAGE_SIZE as "map_size" for normal size pages (which
turned out
to be almost meaningless), and unfortunately the THP check no longer
works.


Thats correct.



So we want to rework *THP* check process.  Your patch fixes the first
checking-step, but the second is still missed, am I wrong?


It fixes the step explicitly for the THP by making sure that the
GPA and
the HVA are aligned to the map size.


Yes, I understand how your patch had fixed the issue.  But what I'm
really concerned about here is the *second* checking-step in
fault_supports_stage2_huge_mapping().

We have to check if we are mapping a non-block aligned or non-block
sized memslot, if so, we can not create block mappings for the
beginning
and end of this memslot.  This is what the second part of
fault_supports_stage2_huge_mapping() had done.

I haven't seen this checking-step in your patch, did I miss something?



I see.


I don't think this calls for a VM_BUG_ON(). It is simply a case where
the GPA is not aligned to HVA, but for normal VMA that could be
made THP.

We had this VM_BUG_ON(), which would have never hit because we would
have set force_pte if they were not aligned.


Yes, I agree.


+    /* Skip memslots with unaligned IPA and user address */
+    if ((gfn & mask) != (pfn & mask))
+    return false;
   if (pfn & mask) {
   *ipap &= PMD_MASK;
   kvm_release_pfn_clean(pfn);



---8>---

Rework fault_supports_stage2_huge_mapping(), let it check THP again.

Signed-off-by: Zenghui Yu 
---
   virt/kvm/arm/mmu.c | 11 ++-
   1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 27c9583..5e1b258 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1632,6 +1632,15 @@ static bool
fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
   uaddr_end = uaddr_start + size;

   /*
+ * If the memslot is _not_ backed by hugetlbfs, then check if it
+ * can be backed by transparent hugepages.
+ *
+ * Currently only PMD_SIZE THPs are supported, revisit it later.
+ */
+    if (map_size == PAGE_SIZE)
+    map_size = PMD_SIZE;
+


This looks hackish. What is we support PUD_SIZE huge page in the
future
?


Yes, this might make the code a little difficult to understand. But by
doing so, we follow the same logic before commit a80868f398554842b14,
that said, we do the two-step checking for normal size pages in
fault_supports_stage2_huge_mapping(), to decide if we can create THP
mappings for these pages.

As for PUD_SIZE THPs, to be honest, I have no idea now :(


How about the following diff ?

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 97b5417..98e5cec 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1791,7 +1791,8 @@ static int user_mem_abort(struct kvm_vcpu
*vcpu, phys_addr_t fault_ipa,
    * currently supported. This code will need to be
    * updated to support other THP sizes.
    */
-    if (transparent_hugepage_adjust(&pfn, &fault_ipa))
+    if (fault_supports_stage2_huge_mappings(memslot, hva, PMD_SIZE) &&
+    transparent_hugepage_adjust(&pfn, &fault_ipa))
   vma_pagesize = PMD_SIZE;
   }


I think this is good enough for the issue.

(One minor concern: With this change, it seems that we no longer need
"force_pte" and can just use "logging_active" instead. But this is not
much related to what we're fixing.)


I would still leave the force_pte there to avoid checking for a THP case
in a situation where we are forced to PTE level mapping on a hugepage backed
VMA. It would serve to avoid another check.


Hi Suzuki,

Yes, I agree, thanks.


Cool, I have a patch to fix this properly and two other patches to clean up
and unify the way we handle the THP backed hugepages. Will send them out after
a bit of testing, later today.

Cheers
Suzuki


Re: [PATCH] kvm: arm: Skip stage2 huge mappings for unaligned ipa backed by THP

2019-04-09 Thread Suzuki K Poulose

Hi Zenghui

On 04/09/2019 09:05 AM, Zenghui Yu wrote:



On 2019/4/9 2:40, Suzuki K Poulose wrote:

Hi Zenghui

On 04/08/2019 04:11 PM, Zenghui Yu wrote:

Hi Suzuki,

Thanks for the reply.



...


Hi Suzuki,

Why not making use of fault_supports_stage2_huge_mapping()?  Let it do
some checks for us.

fault_supports_stage2_huge_mapping() was intended to do a *two-step*
check to tell us that can we create stage2 huge block mappings, and 
this

check is both for hugetlbfs and THP.  With commit a80868f398554842b14,
we pass PAGE_SIZE as "map_size" for normal size pages (which turned 
out

to be almost meaningless), and unfortunately the THP check no longer
works.


Thats correct.



So we want to rework *THP* check process.  Your patch fixes the first
checking-step, but the second is still missed, am I wrong?


It fixes the step explicitly for the THP by making sure that the GPA 
and

the HVA are aligned to the map size.


Yes, I understand how your patch had fixed the issue.  But what I'm
really concerned about here is the *second* checking-step in
fault_supports_stage2_huge_mapping().

We have to check if we are mapping a non-block aligned or non-block
sized memslot, if so, we can not create block mappings for the beginning
and end of this memslot.  This is what the second part of
fault_supports_stage2_huge_mapping() had done.

I haven't seen this checking-step in your patch, did I miss something?



I see.


I don't think this calls for a VM_BUG_ON(). It is simply a case where
the GPA is not aligned to HVA, but for normal VMA that could be made 
THP.


We had this VM_BUG_ON(), which would have never hit because we would
have set force_pte if they were not aligned.


Yes, I agree.


+    /* Skip memslots with unaligned IPA and user address */
+    if ((gfn & mask) != (pfn & mask))
+    return false;
  if (pfn & mask) {
  *ipap &= PMD_MASK;
  kvm_release_pfn_clean(pfn);



---8>---

Rework fault_supports_stage2_huge_mapping(), let it check THP again.

Signed-off-by: Zenghui Yu 
---
  virt/kvm/arm/mmu.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 27c9583..5e1b258 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1632,6 +1632,15 @@ static bool 
fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,

  uaddr_end = uaddr_start + size;

  /*
+ * If the memslot is _not_ backed by hugetlbfs, then check if it
+ * can be backed by transparent hugepages.
+ *
+ * Currently only PMD_SIZE THPs are supported, revisit it later.
+ */
+    if (map_size == PAGE_SIZE)
+    map_size = PMD_SIZE;
+


This looks hackish. What is we support PUD_SIZE huge page in the future
?


Yes, this might make the code a little difficult to understand. But by
doing so, we follow the same logic before commit a80868f398554842b14,
that said, we do the two-step checking for normal size pages in
fault_supports_stage2_huge_mapping(), to decide if we can create THP
mappings for these pages.

As for PUD_SIZE THPs, to be honest, I have no idea now :(


How about the following diff ?

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 97b5417..98e5cec 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1791,7 +1791,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,

   * currently supported. This code will need to be
   * updated to support other THP sizes.
   */
-    if (transparent_hugepage_adjust(&pfn, &fault_ipa))
+    if (fault_supports_stage2_huge_mappings(memslot, hva, PMD_SIZE) &&
+    transparent_hugepage_adjust(&pfn, &fault_ipa))
  vma_pagesize = PMD_SIZE;
  }


I think this is good enough for the issue.

(One minor concern: With this change, it seems that we no longer need
"force_pte" and can just use "logging_active" instead. But this is not
much related to what we're fixing.)


I would still leave the force_pte there to avoid checking for a THP case
in a situation where we are forced to PTE level mapping on a hugepage backed
VMA. It would serve to avoid another check.

Cheers
Suzuki





thanks.






Re: [PATCH] kvm: arm: Skip stage2 huge mappings for unaligned ipa backed by THP

2019-04-08 Thread Suzuki K Poulose

Hi Zenghui,

On 04/08/2019 04:11 PM, Zenghui Yu wrote:

Hi Suzuki,

Thanks for the reply.



...


Hi Suzuki,

Why not making use of fault_supports_stage2_huge_mapping()?  Let it do
some checks for us.

fault_supports_stage2_huge_mapping() was intended to do a *two-step*
check to tell us that can we create stage2 huge block mappings, and this
check is both for hugetlbfs and THP.  With commit a80868f398554842b14,
we pass PAGE_SIZE as "map_size" for normal size pages (which turned out
to be almost meaningless), and unfortunately the THP check no longer
works.


Thats correct.



So we want to rework *THP* check process.  Your patch fixes the first
checking-step, but the second is still missed, am I wrong?


It fixes the step explicitly for the THP by making sure that the GPA and
the HVA are aligned to the map size.


Yes, I understand how your patch had fixed the issue.  But what I'm
really concerned about here is the *second* checking-step in
fault_supports_stage2_huge_mapping().

We have to check if we are mapping a non-block aligned or non-block
sized memslot, if so, we can not create block mappings for the beginning
and end of this memslot.  This is what the second part of
fault_supports_stage2_huge_mapping() had done.

I haven't seen this checking-step in your patch, did I miss something?



I see.


I don't think this calls for a VM_BUG_ON(). It is simply a case where
the GPA is not aligned to HVA, but for normal VMA that could be made THP.

We had this VM_BUG_ON(), which would have never hit because we would
have set force_pte if they were not aligned.


Yes, I agree.


+    /* Skip memslots with unaligned IPA and user address */
+    if ((gfn & mask) != (pfn & mask))
+    return false;
  if (pfn & mask) {
  *ipap &= PMD_MASK;
  kvm_release_pfn_clean(pfn);



---8>---

Rework fault_supports_stage2_huge_mapping(), let it check THP again.

Signed-off-by: Zenghui Yu 
---
  virt/kvm/arm/mmu.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 27c9583..5e1b258 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1632,6 +1632,15 @@ static bool 
fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,

  uaddr_end = uaddr_start + size;

  /*
+ * If the memslot is _not_ backed by hugetlbfs, then check if it
+ * can be backed by transparent hugepages.
+ *
+ * Currently only PMD_SIZE THPs are supported, revisit it later.
+ */
+    if (map_size == PAGE_SIZE)
+    map_size = PMD_SIZE;
+


This looks hackish. What is we support PUD_SIZE huge page in the future
?


Yes, this might make the code a little difficult to understand. But by
doing so, we follow the same logic before commit a80868f398554842b14,
that said, we do the two-step checking for normal size pages in
fault_supports_stage2_huge_mapping(), to decide if we can create THP
mappings for these pages.

As for PUD_SIZE THPs, to be honest, I have no idea now :(


How about the following diff ?

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 97b5417..98e5cec 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1791,7 +1791,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,

 * currently supported. This code will need to be
 * updated to support other THP sizes.
 */
-   if (transparent_hugepage_adjust(&pfn, &fault_ipa))
+   if (fault_supports_stage2_huge_mappings(memslot, hva, PMD_SIZE) &&
+   transparent_hugepage_adjust(&pfn, &fault_ipa))
vma_pagesize = PMD_SIZE;
}

--
2.7.4



Suzuki
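
For reference, the 'two-step' condition discussed in this thread can be
sketched as follows (an illustration of the idea only, not the kernel helper
itself):

	/*
	 * Step 1: hva and gpa must share the same offset within a block of
	 * map_size, otherwise a block mapping would map the wrong pages.
	 * Step 2: the block containing hva must lie entirely within the
	 * memslot's userspace range, so the mapping never crosses its edges.
	 */
	supports_block = !((gpa ^ hva) & (map_size - 1)) &&
			 (hva & ~(map_size - 1)) >= uaddr_start &&
			 (hva & ~(map_size - 1)) + map_size <= uaddr_end;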


Re: [PATCH] kvm: arm: Skip stage2 huge mappings for unaligned ipa backed by THP

2019-04-08 Thread Suzuki K Poulose

On 04/08/2019 04:50 AM, Zenghui Yu wrote:



On 2019/4/2 19:06, Suzuki K Poulose wrote:

With commit a80868f398554842b14, we no longer ensure that the
THP page is properly aligned in the guest IPA. Skip the stage2
huge mapping for unaligned IPA backed by transparent hugepages.

Fixes: a80868f398554842b14 ("KVM: arm/arm64: Enforce PTE mappings at 
stage2 when needed")

Reported-by: Eric Auger 
Cc: Marc Zyngier 
Cc: Chirstoffer Dall 
Cc: Zenghui Yu 
Cc: Zheng Xiang 
Tested-by: Eric Auger 
Signed-off-by: Suzuki K Poulose 


Hi Suzuki,

Why not making use of fault_supports_stage2_huge_mapping()?  Let it do
some checks for us.

fault_supports_stage2_huge_mapping() was intended to do a *two-step*
check to tell us that can we create stage2 huge block mappings, and this
check is both for hugetlbfs and THP.  With commit a80868f398554842b14,
we pass PAGE_SIZE as "map_size" for normal size pages (which turned out
to be almost meaningless), and unfortunately the THP check no longer
works.


Thats correct.



So we want to rework *THP* check process.  Your patch fixes the first
checking-step, but the second is still missed, am I wrong?


It fixes the step explicitly for the THP by making sure that the GPA and
the HVA are aligned to the map size.



Can you please give a look at the below diff?


thanks,

zenghui


---
  virt/kvm/arm/mmu.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 27c9583..4a22f5b 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1412,7 +1412,9 @@ static bool 
transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)

   * page accordingly.
   */
  mask = PTRS_PER_PMD - 1;
-    VM_BUG_ON((gfn & mask) != (pfn & mask));

Somehow, I'd prefer keeping the VM_BUG_ON() here, let it report some
potential issues in the future (of course I hope none:) )


I don't think this calls for a VM_BUG_ON(). It is simply a case where
the GPA is not aligned to the HVA, for a normal VMA that could be made THP.

We had this VM_BUG_ON(), which would have never hit because we would
have set force_pte if they were not aligned.


+    /* Skip memslots with unaligned IPA and user address */
+    if ((gfn & mask) != (pfn & mask))
+    return false;
  if (pfn & mask) {
  *ipap &= PMD_MASK;
  kvm_release_pfn_clean(pfn);



---8>---

Rework fault_supports_stage2_huge_mapping(), let it check THP again.

Signed-off-by: Zenghui Yu 
---
  virt/kvm/arm/mmu.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 27c9583..5e1b258 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1632,6 +1632,15 @@ static bool 
fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,

  uaddr_end = uaddr_start + size;

  /*
+ * If the memslot is _not_ backed by hugetlbfs, then check if it
+ * can be backed by transparent hugepages.
+ *
+ * Currently only PMD_SIZE THPs are supported, revisit it later.
+ */
+    if (map_size == PAGE_SIZE)
+    map_size = PMD_SIZE;
+


This looks hackish. What if we support PUD_SIZE huge pages in the future?


+    /*
   * Pages belonging to memslots that don't have the same alignment
   * within a PMD/PUD for userspace and IPA cannot be mapped with 
stage-2

   * PMD/PUD entries, because we'll end up mapping the wrong pages.
@@ -1643,7 +1652,7 @@ static bool 
fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,

   *    |abcde|fgh  Stage-1 block  |    Stage-1 block tv|xyz|
   *    +-+++---+
   *
- *    memslot->base_gfn << PAGE_SIZE:
+ *    memslot->base_gfn << PAGE_SHIFT:


That could be fixed.


   *  +---+++-+
   *  |abc|def  Stage-2 block  |    Stage-2 block   |tvxyz|
   *  +---+++-+



But personally I don't think this is the right way to fix it. And as
mentioned, the VM_BUG_ON() is unnecessary and could easily be triggered
by the user by using an unaligned GPA/HVA. All we need to do is use page
mappings in such cases, which we do with my patch.
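(For readers following along: a tiny standalone illustration of the alignment
condition in question, using made-up example values rather than anything from
the patches above.)

#include <stdio.h>

int main(void)
{
	unsigned long page_shift   = 12;	/* 4K pages */
	unsigned long ptrs_per_pmd = 512;	/* one PMD block covers 2M */
	unsigned long mask         = ptrs_per_pmd - 1;

	/* Hypothetical guest IPA and host page frame for one fault */
	unsigned long gfn = 0x80100000UL >> page_shift;	/* guest page frame */
	unsigned long pfn = 0x12345UL;			/* host page frame  */

	/*
	 * Same condition as in transparent_hugepage_adjust(): the offsets
	 * within the 2M block must match, otherwise a block mapping would
	 * map the wrong pages and we must fall back to page mappings.
	 */
	if ((gfn & mask) != (pfn & mask))
		printf("unaligned (0x%lx vs 0x%lx): use page mappings\n",
		       gfn & mask, pfn & mask);
	else
		printf("aligned: a PMD block mapping is safe\n");

	return 0;
}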


Cheers
Suzuki
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH] kvm: arm: Skip stage2 huge mappings for unaligned ipa backed by THP

2019-04-02 Thread Suzuki K Poulose
With commit a80868f398554842b14, we no longer ensure that the
THP page is properly aligned in the guest IPA. Skip the stage2
huge mapping for unaligned IPA backed by transparent hugepages.

Fixes: a80868f398554842b14 ("KVM: arm/arm64: Enforce PTE mappings at stage2 
when needed")
Reported-by: Eric Auger 
Cc: Marc Zyngier 
Cc: Chirstoffer Dall 
Cc: Zenghui Yu 
Cc: Zheng Xiang 
Tested-by: Eric Auger 
Signed-off-by: Suzuki K Poulose 
---
 virt/kvm/arm/mmu.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 27c9583..4a22f5b 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1412,7 +1412,9 @@ static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, 
phys_addr_t *ipap)
 * page accordingly.
 */
mask = PTRS_PER_PMD - 1;
-   VM_BUG_ON((gfn & mask) != (pfn & mask));
+   /* Skip memslots with unaligned IPA and user address */
+   if ((gfn & mask) != (pfn & mask))
+   return false;
if (pfn & mask) {
*ipap &= PMD_MASK;
kvm_release_pfn_clean(pfn);
-- 
2.7.4

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 5/8] KVM: arm/arm64: Enforce PTE mappings at stage2 when needed

2019-04-02 Thread Suzuki K Poulose
On Mon, Apr 01, 2019 at 07:10:37PM +0200, Auger Eric wrote:
> Hi Suzuki,
> 
> On 3/28/19 2:36 PM, Marc Zyngier wrote:
> > From: Suzuki K Poulose 
> > 
> > commit 6794ad5443a2118 ("KVM: arm/arm64: Fix unintended stage 2 PMD 
> > mappings")
> > made the checks to skip huge mappings stricter. However it introduced
> > a bug where we still use huge mappings, ignoring the flag to
> > use PTE mappings, by not resetting the vma_pagesize to PAGE_SIZE.
> > 
> > Also, the checks do not cover PUD huge pages, which were
> > under review during the same period. This patch fixes both
> > issues.
> 
> I face a regression with this patch. My guest gets stuck. I am running
> on AMD Seattle. Reverting the patch makes things work again for me. I
> run with qemu. In this scenario I don't use hugepages. I use 64kB page
> size for both the host and guest.

Hi Eric,

Thanks for the testing. Does the following patch fix the issue for you ?


---8>---
kvm: arm: Skip transparent huge pages in unaligned memslots

We silently create stage2 huge mappings for a memslot with
unaligned IPA and user address.

Signed-off-by: Suzuki K Poulose 
---
 virt/kvm/arm/mmu.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 27c9583..4a22f5b 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1412,7 +1412,9 @@ static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, 
phys_addr_t *ipap)
 * page accordingly.
 */
mask = PTRS_PER_PMD - 1;
-   VM_BUG_ON((gfn & mask) != (pfn & mask));
+   /* Skip memslots with unaligned IPA and user address */
+   if ((gfn & mask) != (pfn & mask))
+   return false;
if (pfn & mask) {
*ipap &= PMD_MASK;
kvm_release_pfn_clean(pfn);
-- 
2.7.4


Kind regards
Suzuki


> 
> Thanks
> 
> Eric
> > 
> > Fixes : 6794ad5443a2118 ("KVM: arm/arm64: Fix unintended stage 2 PMD 
> > mappings")
> > Reported-by: Zenghui Yu 
> > Cc: Zenghui Yu 
> > Cc: Christoffer Dall 
> > Signed-off-by: Suzuki K Poulose 
> > Signed-off-by: Marc Zyngier 
> > ---
> >  virt/kvm/arm/mmu.c | 43 +--
> >  1 file changed, 21 insertions(+), 22 deletions(-)
> > 
> > diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> > index ffd7acdceac7..bcdf978c0d1d 100644
> > --- a/virt/kvm/arm/mmu.c
> > +++ b/virt/kvm/arm/mmu.c
> > @@ -1594,8 +1594,9 @@ static void kvm_send_hwpoison_signal(unsigned long 
> > address,
> > send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
> >  }
> >  
> > -static bool fault_supports_stage2_pmd_mappings(struct kvm_memory_slot 
> > *memslot,
> > -  unsigned long hva)
> > +static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot 
> > *memslot,
> > +  unsigned long hva,
> > +  unsigned long map_size)
> >  {
> > gpa_t gpa_start;
> > hva_t uaddr_start, uaddr_end;
> > @@ -1610,34 +1611,34 @@ static bool 
> > fault_supports_stage2_pmd_mappings(struct kvm_memory_slot *memslot,
> >  
> > /*
> >  * Pages belonging to memslots that don't have the same alignment
> > -* within a PMD for userspace and IPA cannot be mapped with stage-2
> > -* PMD entries, because we'll end up mapping the wrong pages.
> > +* within a PMD/PUD for userspace and IPA cannot be mapped with stage-2
> > +* PMD/PUD entries, because we'll end up mapping the wrong pages.
> >  *
> >  * Consider a layout like the following:
> >  *
> >  *memslot->userspace_addr:
> >  *+-+++---+
> > -*|abcde|fgh  Stage-1 PMD|Stage-1 PMD   tv|xyz|
> > +*|abcde|fgh  Stage-1 block  |Stage-1 block tv|xyz|
> >  *+-+++---+
> >  *
> >  *memslot->base_gfn << PAGE_SIZE:
> >  *  +---+++-+
> > -*  |abc|def  Stage-2 PMD|Stage-2 PMD |tvxyz|
> > +*  |abc|def  Stage-2 block  |Stage-2 block   |tvxyz|
> >  *  +---+++-+
> >  *
> > -* If we create those stage-2 PMDs, we'll end up with this incorrect
> > +* If we create those stage-2 blocks, we'll end up wit

Re: [PATCH v12 2/8] arm64: KVM: encapsulate kvm_cpu_context in kvm_host_data

2019-03-28 Thread Suzuki K Poulose

On 03/28/2019 10:37 AM, Andrew Murray wrote:

The virt/arm core allocates a kvm_cpu_context_t percpu; at present this is
a typedef to kvm_cpu_context and is used to store host cpu context. The
kvm_cpu_context structure is also used elsewhere to hold vcpu context.
In order to use the percpu to hold additional future host information we
encapsulate kvm_cpu_context in a new structure and rename the typedef and
percpu to match.

Signed-off-by: Andrew Murray 
---
  arch/arm/include/asm/kvm_host.h   |  8 ++--
  arch/arm64/include/asm/kvm_asm.h  |  3 ++-
  arch/arm64/include/asm/kvm_host.h | 16 ++--
  arch/arm64/kernel/asm-offsets.c   |  1 +
  virt/kvm/arm/arm.c| 14 --
  5 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 770d73257ad9..427c28be6452 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -150,7 +150,11 @@ struct kvm_cpu_context {
u32 cp15[NR_CP15_REGS];
  };
  
-typedef struct kvm_cpu_context kvm_cpu_context_t;

+struct kvm_host_data {
+   struct kvm_cpu_context host_ctxt;
+};
+
+typedef struct kvm_host_data kvm_host_data_t;
  
  static inline void kvm_init_host_cpu_context(kvm_cpu_context_t *cpu_ctxt,

 int cpu)


We need to fix this function prototype to accept struct kvm_cpu_context,
instead of the now removed kvm_cpu_context_t, to prevent a build break
on arm32 ?
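Something like this, purely as a sketch of the prototype change being asked
for (the existing arm32 body would stay as it is; it is elided here):

struct kvm_cpu_context;

static inline void kvm_init_host_cpu_context(struct kvm_cpu_context *cpu_ctxt,
					     int cpu)
{
	/* existing arm32 body unchanged; only the parameter type changes */
}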

With that :

Reviewed-by: Suzuki K Poulose 
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] KVM: arm/arm64: Comments cleanup in mmu.c

2019-03-25 Thread Suzuki K Poulose




On 25/03/2019 08:02, Zenghui Yu wrote:

Some comments in virt/kvm/arm/mmu.c are outdated and incorrect.
Get rid of these comments, to keep the code correct and
current as a whole.

Only a cleanup here, no functional change.

Signed-off-by: Zenghui Yu 
---
  virt/kvm/arm/mmu.c | 23 +--
  1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index ffd7acd..0a6efe7 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -102,8 +102,7 @@ static bool kvm_is_device_pfn(unsigned long pfn)
   * @addr: IPA
   * @pmd:  pmd pointer for IPA
   *
- * Function clears a PMD entry, flushes addr 1st and 2nd stage TLBs. Marks all
- * pages in the range dirty.
+ * Function clears a PMD entry, flushes addr 1st and 2nd stage TLBs.
   */
  static void stage2_dissolve_pmd(struct kvm *kvm, phys_addr_t addr, pmd_t *pmd)
  {
@@ -121,8 +120,7 @@ static void stage2_dissolve_pmd(struct kvm *kvm, 
phys_addr_t addr, pmd_t *pmd)
   * @addr: IPA
   * @pud:  pud pointer for IPA
   *
- * Function clears a PUD entry, flushes addr 1st and 2nd stage TLBs. Marks all
- * pages in the range dirty.
+ * Function clears a PUD entry, flushes addr 1st and 2nd stage TLBs.
   */
  static void stage2_dissolve_pud(struct kvm *kvm, phys_addr_t addr, pud_t 
*pudp)
  {
@@ -899,9 +897,8 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t 
size,
   * kvm_alloc_stage2_pgd - allocate level-1 table for stage-2 translation.
   * @kvm:  The KVM struct pointer for the VM.
   *
- * Allocates only the stage-2 HW PGD level table(s) (can support either full
- * 40-bit input addresses or limited to 32-bit input addresses). Clears the
- * allocated pages.
+ * Allocates only the stage-2 HW PGD level table(s) of size defined by
+ * stage2_pgd_size(kvm).
   *
   * Note we don't need locking here as this is only called when the VM is
   * created, which can only be done once.
@@ -1451,13 +1448,11 @@ static void stage2_wp_pmds(struct kvm *kvm, pud_t *pud,
  }
  
  /**

-  * stage2_wp_puds - write protect PGD range
-  * @pgd:  pointer to pgd entry
-  * @addr: range start address
-  * @end:  range end address
-  *
-  * Process PUD entries, for a huge PUD we cause a panic.
-  */
+ * stage2_wp_puds - write protect PGD range
+ * @pgd:   pointer to pgd entry
+ * @addr:  range start address
+ * @end:   range end address
+ */



Reviewed-by: Suzuki K Poulose 
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH RFC 01/14] arm64/mm: Introduce asid_info structure and move asid_generation/asid_map to it

2019-03-21 Thread Suzuki K Poulose

Hi Julien,

On 21/03/2019 16:36, Julien Grall wrote:

In an attempt to make the ASID allocator generic, create a new structure
asid_info to store all the information necessary for the allocator.

For now, move the variables asid_generation and asid_map to the new structure
asid_info. Follow-up patches will move more variables.

Note, to avoid more renaming afterwards, a local variable 'info' has been
created and is a pointer to the ASID allocator structure.

Signed-off-by: Julien Grall 
---
  arch/arm64/mm/context.c | 46 ++
  1 file changed, 26 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index 1f0ea2facf24..34db54f1a39a 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -30,8 +30,11 @@
  static u32 asid_bits;
  static DEFINE_RAW_SPINLOCK(cpu_asid_lock);
  
-static atomic64_t asid_generation;

-static unsigned long *asid_map;
+struct asid_info
+{
+   atomic64_t  generation;
+   unsigned long   *map;
+} asid_info;


Shouldn't this be static ? Rest looks fine.
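I.e., something along these lines (just a sketch of the suggested tweak, not
the final code):

static struct asid_info
{
	atomic64_t	generation;
	unsigned long	*map;
} asid_info;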

Cheers
Suzuki
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v2] kvm: arm: Fix handling of stage2 huge mappings

2019-03-20 Thread Suzuki K Poulose
We rely on the mmu_notifier callbacks to handle the split/merge
of huge pages and thus we are guaranteed that, while creating a
block mapping, either the entire block is unmapped at stage2 or it
is missing permission.

However, we miss a case where the block mapping is split for the dirty
logging case and could later be made a block mapping again, if we cancel the
dirty logging. This not only creates inconsistent TLB entries for
the pages in the block, but also leaks the table pages for
the PMD level.

Handle this corner case for the huge mappings at stage2 by
unmapping the non-huge mapping for the block. This could potentially
release the upper level table. So we need to restart the table walk
once we unmap the range.

Fixes : ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages")
Reported-by: Zheng Xiang 
Cc: Zheng Xiang 
Cc: Zhenghui Yu 
Cc: Marc Zyngier 
Cc: Christoffer Dall 
Signed-off-by: Suzuki K Poulose 
---

Changes since v1:
 - Fix build break on arm32
   - Add missing definitions for S2_PUD_*
   - Use kvm_pud_pfn() instead of pud_pfn()
 - Make PUD handling similar to PMD, by dropping the else case

 arch/arm/include/asm/stage2_pgtable.h |  2 ++
 virt/kvm/arm/mmu.c| 59 +--
 2 files changed, 45 insertions(+), 16 deletions(-)


diff --git a/arch/arm/include/asm/stage2_pgtable.h 
b/arch/arm/include/asm/stage2_pgtable.h
index de20895..9e11dce 100644
--- a/arch/arm/include/asm/stage2_pgtable.h
+++ b/arch/arm/include/asm/stage2_pgtable.h
@@ -75,6 +75,8 @@ static inline bool kvm_stage2_has_pud(struct kvm *kvm)
 
 #define S2_PMD_MASKPMD_MASK
 #define S2_PMD_SIZEPMD_SIZE
+#define S2_PUD_MASKPUD_MASK
+#define S2_PUD_SIZEPUD_SIZE
 
 static inline bool kvm_stage2_has_pmd(struct kvm *kvm)
 {
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index fce0983..97b5417 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1060,25 +1060,43 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct 
kvm_mmu_memory_cache
 {
pmd_t *pmd, old_pmd;
 
+retry:
pmd = stage2_get_pmd(kvm, cache, addr);
VM_BUG_ON(!pmd);
 
old_pmd = *pmd;
+   /*
+* Multiple vcpus faulting on the same PMD entry, can
+* lead to them sequentially updating the PMD with the
+* same value. Following the break-before-make
+* (pmd_clear() followed by tlb_flush()) process can
+* hinder forward progress due to refaults generated
+* on missing translations.
+*
+* Skip updating the page table if the entry is
+* unchanged.
+*/
+   if (pmd_val(old_pmd) == pmd_val(*new_pmd))
+   return 0;
+
if (pmd_present(old_pmd)) {
/*
-* Multiple vcpus faulting on the same PMD entry, can
-* lead to them sequentially updating the PMD with the
-* same value. Following the break-before-make
-* (pmd_clear() followed by tlb_flush()) process can
-* hinder forward progress due to refaults generated
-* on missing translations.
+* If we already have PTE level mapping for this block,
+* we must unmap it to avoid inconsistent TLB state and
+* leaking the table page. We could end up in this situation
+* if the memory slot was marked for dirty logging and was
+* reverted, leaving PTE level mappings for the pages accessed
+* during the period. So, unmap the PTE level mapping for this
+* block and retry, as we could have released the upper level
+* table in the process.
 *
-* Skip updating the page table if the entry is
-* unchanged.
+* Normal THP split/merge follows mmu_notifier callbacks and do
+* get handled accordingly.
 */
-   if (pmd_val(old_pmd) == pmd_val(*new_pmd))
-   return 0;
-
+   if (!pmd_thp_or_huge(old_pmd)) {
+   unmap_stage2_range(kvm, addr & S2_PMD_MASK, 
S2_PMD_SIZE);
+   goto retry;
+   }
/*
 * Mapping in huge pages should only happen through a
 * fault.  If a page is merged into a transparent huge
@@ -1090,8 +1108,7 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct 
kvm_mmu_memory_cache
 * should become splitting first, unmapped, merged,
 * and mapped back in on-demand.
 */
-   VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
-
+   WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
pmd_clear(pmd);
kvm_tlb_flush_vmid_ipa(kvm, addr);
} else {
@@ -

Re: [PATCH] kvm: arm: Fix handling of stage2 huge mappings

2019-03-20 Thread Suzuki K Poulose

Marc,

On 20/03/2019 10:35, Marc Zyngier wrote:

On Wed, 20 Mar 2019 10:23:39 +
Suzuki K Poulose  wrote:

Hi Suzuki,


Marc,

On 20/03/2019 10:11, Marc Zyngier wrote:

On Wed, 20 Mar 2019 09:44:38 +
Suzuki K Poulose  wrote:
   

Hi Marc,

On 20/03/2019 08:15, Marc Zyngier wrote:

Hi Suzuki,

On Tue, 19 Mar 2019 14:11:08 +,
Suzuki K Poulose  wrote:


...


+   if (!pmd_thp_or_huge(old_pmd)) {
+   unmap_stage2_range(kvm, addr & S2_PMD_MASK, 
S2_PMD_SIZE);
+   goto retry;
  



+   if (!stage2_pud_huge(kvm, old_pud)) {
+   unmap_stage2_range(kvm, addr & S2_PUD_MASK, 
S2_PUD_SIZE);
  



We should really get rid of the S2_P{U/M}D_* definitions, as they are
always the same as the host. The only thing that changes is the PGD size
which varies according to the IPA and the concatenation.
  


Also what do you think about using  P{M,U}D_* instead of S2_P{M,U}D_*
above ? I could make that change with the respin.


Given that this is a fix, I'd like it to be as small and as obvious as
possible, making it easier to backport.

I'm happy to take another patch for 5.2 that will drop the whole S2_P*
if we still think that this should be the case (though what I'd really
like is to have architectural levels instead of these arbitrary
definitions).


I only meant the two new instances added above in the patch. Of course, I
could send something to fix the existing ones.

Cheers
Suzuki
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] kvm: arm: Fix handling of stage2 huge mappings

2019-03-20 Thread Suzuki K Poulose

Marc,

On 20/03/2019 10:11, Marc Zyngier wrote:

On Wed, 20 Mar 2019 09:44:38 +
Suzuki K Poulose  wrote:


Hi Marc,

On 20/03/2019 08:15, Marc Zyngier wrote:

Hi Suzuki,

On Tue, 19 Mar 2019 14:11:08 +,
Suzuki K Poulose  wrote:


We rely on the mmu_notifier callbacks to handle the split/merge
of huge pages and thus we are guaranteed that, while creating a
block mapping, either the entire block is unmapped at stage2 or it
is missing permission.

However, we miss a case where the block mapping is split for the dirty
logging case and could later be made a block mapping again, if we cancel the
dirty logging. This not only creates inconsistent TLB entries for
the pages in the block, but also leaks the table pages for
the PMD level.

Handle this corner case for the huge mappings at stage2 by
unmapping the non-huge mapping for the block. This could potentially
release the upper level table. So we need to restart the table walk
once we unmap the range.

Fixes : ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages")
Reported-by: Zheng Xiang 
Cc: Zheng Xiang 
Cc: Zhengui Yu 
Cc: Marc Zyngier 
Cc: Christoffer Dall 
Signed-off-by: Suzuki K Poulose ...




+   if (!pmd_thp_or_huge(old_pmd)) {
+   unmap_stage2_range(kvm, addr & S2_PMD_MASK, 
S2_PMD_SIZE);
+   goto retry;





+   if (!stage2_pud_huge(kvm, old_pud)) {
+   unmap_stage2_range(kvm, addr & S2_PUD_MASK, 
S2_PUD_SIZE);





We should really get rid of the S2_P{U/M}D_* definitions, as they are
always the same as the host. The only thing that changes is the PGD size
which varies according to the IPA and the concatenation.



Also what do you think about using  P{M,U}D_* instead of S2_P{M,U}D_*
above ? I could make that change with the respin.



Sure, feel free to send a fixed version. I'll drop the currently queued
patch.




Thanks. Sorry for the trouble.

Cheers
Suzuki
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] kvm: arm: Fix handling of stage2 huge mappings

2019-03-20 Thread Suzuki K Poulose

Hi Marc,

On 20/03/2019 08:15, Marc Zyngier wrote:

Hi Suzuki,

On Tue, 19 Mar 2019 14:11:08 +,
Suzuki K Poulose  wrote:


We rely on the mmu_notifier callbacks to handle the split/merge
of huge pages and thus we are guaranteed that, while creating a
block mapping, either the entire block is unmapped at stage2 or it
is missing permission.

However, we miss a case where the block mapping is split for the dirty
logging case and could later be made a block mapping again, if we cancel the
dirty logging. This not only creates inconsistent TLB entries for
the pages in the block, but also leaks the table pages for
the PMD level.

Handle this corner case for the huge mappings at stage2 by
unmapping the non-huge mapping for the block. This could potentially
release the upper level table. So we need to restart the table walk
once we unmap the range.

Fixes : ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages")
Reported-by: Zheng Xiang 
Cc: Zheng Xiang 
Cc: Zhengui Yu 
Cc: Marc Zyngier 
Cc: Christoffer Dall 
Signed-off-by: Suzuki K Poulose 


...


+retry:
pmd = stage2_get_pmd(kvm, cache, addr);
VM_BUG_ON(!pmd);
  


...


if (pmd_present(old_pmd)) {
/*
-* Multiple vcpus faulting on the same PMD entry, can
-* lead to them sequentially updating the PMD with the
-* same value. Following the break-before-make
-* (pmd_clear() followed by tlb_flush()) process can
-* hinder forward progress due to refaults generated
-* on missing translations.
+* If we already have PTE level mapping for this block,
+* we must unmap it to avoid inconsistent TLB state and
+* leaking the table page. We could end up in this situation
+* if the memory slot was marked for dirty logging and was
+* reverted, leaving PTE level mappings for the pages accessed
+* during the period. So, unmap the PTE level mapping for this
+* block and retry, as we could have released the upper level
+* table in the process.
 *
-* Skip updating the page table if the entry is
-* unchanged.
+* Normal THP split/merge follows mmu_notifier callbacks and do
+* get handled accordingly.
 */
-   if (pmd_val(old_pmd) == pmd_val(*new_pmd))
-   return 0;
-
+   if (!pmd_thp_or_huge(old_pmd)) {
+   unmap_stage2_range(kvm, addr & S2_PMD_MASK, 
S2_PMD_SIZE);
+   goto retry;


This looks slightly dodgy. Doing this retry results in another call to
stage2_get_pmd(), which may or may not result in allocating a PUD. I
think this is safe, as if we managed to get here, it means the whole
hierarchy was already present and nothing was allocated in the first
round.

Somehow, I would feel more comfortable with just not even trying.
Unmap, don't fix the fault, let the vcpu come again for additional
punishment. But this is probably more invasive, as none of the
stage2_set_p*() return values is ever evaluated. Oh well.



Yes. The other option was to unmap_stage2_ptes() and get the page refcount
on the new pmd. But that kind of makes it a bit difficult to follow the
code.


if (stage2_pud_present(kvm, old_pud)) {
-   stage2_pud_clear(kvm, pudp);
-   kvm_tlb_flush_vmid_ipa(kvm, addr);
+   /*
+* If we already have table level mapping for this block, unmap
+* the range for this block and retry.
+*/
+   if (!stage2_pud_huge(kvm, old_pud)) {
+   unmap_stage2_range(kvm, addr & S2_PUD_MASK, 
S2_PUD_SIZE);


This broke 32bit. I've added the following hunk to fix it:


Grrr! Sorry about that.



diff --git a/arch/arm/include/asm/stage2_pgtable.h 
b/arch/arm/include/asm/stage2_pgtable.h
index de2089501b8b..b8f21088a744 100644
--- a/arch/arm/include/asm/stage2_pgtable.h
+++ b/arch/arm/include/asm/stage2_pgtable.h
@@ -68,6 +68,9 @@ stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, 
phys_addr_t end)
  #define stage2_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
  #define stage2_pud_table_empty(kvm, pudp) false
  
+#define S2_PUD_MASKPGDIR_MASK

+#define S2_PUD_SIZEPGDIR_SIZE
+


We should really get rid of the S2_P{U/M}D_* definitions, as they are
always the same as the host. The only thing that changes is the PGD size
which varies according to the IPA and the concatenation.


  static inline bool kvm_stage2_has_pud(struct kvm *kvm)
  {
return false;


+   goto retry;
+   } else {
+   WARN_ON_ONCE(pud_pfn(old_pud) != pud_pfn(*new_pudp));
+   stage2_pud_clear(kvm, pudp);
+   kvm_

[PATCH] kvm: arm: Fix handling of stage2 huge mappings

2019-03-19 Thread Suzuki K Poulose
We rely on the mmu_notifier callbacks to handle the split/merge
of huge pages and thus we are guaranteed that, while creating a
block mapping, either the entire block is unmapped at stage2 or it
is missing permission.

However, we miss a case where the block mapping is split for the dirty
logging case and could later be made a block mapping again, if we cancel the
dirty logging. This not only creates inconsistent TLB entries for
the pages in the block, but also leaks the table pages for
the PMD level.

Handle this corner case for the huge mappings at stage2 by
unmapping the non-huge mapping for the block. This could potentially
release the upper level table. So we need to restart the table walk
once we unmap the range.

Fixes : ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages")
Reported-by: Zheng Xiang 
Cc: Zheng Xiang 
Cc: Zhengui Yu 
Cc: Marc Zyngier 
Cc: Christoffer Dall 
Signed-off-by: Suzuki K Poulose 
---
 virt/kvm/arm/mmu.c | 63 ++
 1 file changed, 45 insertions(+), 18 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index fce0983..6ad6f19d 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1060,25 +1060,43 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct 
kvm_mmu_memory_cache
 {
pmd_t *pmd, old_pmd;
 
+retry:
pmd = stage2_get_pmd(kvm, cache, addr);
VM_BUG_ON(!pmd);
 
old_pmd = *pmd;
+   /*
+* Multiple vcpus faulting on the same PMD entry, can
+* lead to them sequentially updating the PMD with the
+* same value. Following the break-before-make
+* (pmd_clear() followed by tlb_flush()) process can
+* hinder forward progress due to refaults generated
+* on missing translations.
+*
+* Skip updating the page table if the entry is
+* unchanged.
+*/
+   if (pmd_val(old_pmd) == pmd_val(*new_pmd))
+   return 0;
+
if (pmd_present(old_pmd)) {
/*
-* Multiple vcpus faulting on the same PMD entry, can
-* lead to them sequentially updating the PMD with the
-* same value. Following the break-before-make
-* (pmd_clear() followed by tlb_flush()) process can
-* hinder forward progress due to refaults generated
-* on missing translations.
+* If we already have PTE level mapping for this block,
+* we must unmap it to avoid inconsistent TLB state and
+* leaking the table page. We could end up in this situation
+* if the memory slot was marked for dirty logging and was
+* reverted, leaving PTE level mappings for the pages accessed
+* during the period. So, unmap the PTE level mapping for this
+* block and retry, as we could have released the upper level
+* table in the process.
 *
-* Skip updating the page table if the entry is
-* unchanged.
+* Normal THP split/merge follows mmu_notifier callbacks and do
+* get handled accordingly.
 */
-   if (pmd_val(old_pmd) == pmd_val(*new_pmd))
-   return 0;
-
+   if (!pmd_thp_or_huge(old_pmd)) {
+   unmap_stage2_range(kvm, addr & S2_PMD_MASK, 
S2_PMD_SIZE);
+   goto retry;
+   }
/*
 * Mapping in huge pages should only happen through a
 * fault.  If a page is merged into a transparent huge
@@ -1090,8 +1108,7 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct 
kvm_mmu_memory_cache
 * should become splitting first, unmapped, merged,
 * and mapped back in on-demand.
 */
-   VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
-
+   WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
pmd_clear(pmd);
kvm_tlb_flush_vmid_ipa(kvm, addr);
} else {
@@ -1107,6 +1124,7 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct 
kvm_mmu_memory_cache *cac
 {
pud_t *pudp, old_pud;
 
+retry:
pudp = stage2_get_pud(kvm, cache, addr);
VM_BUG_ON(!pudp);
 
@@ -1114,16 +1132,25 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct 
kvm_mmu_memory_cache *cac
 
/*
 * A large number of vcpus faulting on the same stage 2 entry,
-* can lead to a refault due to the
-* stage2_pud_clear()/tlb_flush(). Skip updating the page
-* tables if there is no change.
+* can lead to a refault due to the stage2_pud_clear()/tlb_flush().
+* Skip updating the page tables if there is no change.
 */
if (pud_val(old_pud) == pud_val(*new_pudp))
return 0;
 
if (stage2_pud_presen

Re: [RFC] Question about TLB flush while set Stage-2 huge pages

2019-03-18 Thread Suzuki K Poulose
Hi !
On Sun, Mar 17, 2019 at 09:34:11PM +0800, Zenghui Yu wrote:
> Hi Suzuki,
> 
> On 2019/3/15 22:56, Suzuki K Poulose wrote:
> >Hi Zhengui,
> 
> s/Zhengui/Zheng/
> 
> (I think you must have wanted to say "Hi" to Zheng :-) )
> 

Sorry about that.

> 
> I have looked into your patch and the kernel log, and I believe that
> your patch had already addressed this issue. But I think we can do it
> a little better - two more points need to be handled with caution.
> 
> Take PMD hugepage (PMD_SIZE == 2M) for example:
>

...

> >That's bad, we seem to be making up to 4 unbalanced put_page() calls.
> >
> >>>>---
> >>>>   virt/kvm/arm/mmu.c | 51
> >>>>+++
> >>>>   1 file changed, 35 insertions(+), 16 deletions(-)
> >>>>
> >>>>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>>>index 66e0fbb5..04b0f9b 100644
> >>>>--- a/virt/kvm/arm/mmu.c
> >>>>+++ b/virt/kvm/arm/mmu.c
> >>>>@@ -1076,24 +1076,38 @@ static int stage2_set_pmd_huge(struct kvm
> >>>>*kvm, struct kvm_mmu_memory_cache
> >>>>    * Skip updating the page table if the entry is
> >>>>    * unchanged.
> >>>>    */
> >>>>-    if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> >>>>+    if (pmd_val(old_pmd) == pmd_val(*new_pmd)) {
> >>>>   return 0;
> >>>>-
> >>>>+    } else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
> >>>>   /*
> >>>>- * Mapping in huge pages should only happen through a
> >>>>- * fault.  If a page is merged into a transparent huge
> >>>>- * page, the individual subpages of that huge page
> >>>>- * should be unmapped through MMU notifiers before we
> >>>>- * get here.
> >>>>- *
> >>>>- * Merging of CompoundPages is not supported; they
> >>>>- * should become splitting first, unmapped, merged,
> >>>>- * and mapped back in on-demand.
> >>>>+ * If we have PTE level mapping for this block,
> >>>>+ * we must unmap it to avoid inconsistent TLB
> >>>>+ * state. We could end up in this situation if
> >>>>+ * the memory slot was marked for dirty logging
> >>>>+ * and was reverted, leaving PTE level mappings
> >>>>+ * for the pages accessed during the period.
> >>>>+ * Normal THP split/merge follows mmu_notifier
> >>>>+ * callbacks and do get handled accordingly.
> >>>>    */
> >>>>-    VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
> >>>>+    unmap_stage2_range(kvm, (addr & S2_PMD_MASK),
> >>>>S2_PMD_SIZE);
> 
> First, using unmap_stage2_range() here is not quite appropriate. Suppose
> we've only accessed one 2M page in HPA [x, x+1]Gib range, with other
> pages unaccessed.  What will happen if unmap_stage2_range(this_2M_page)?
> We'll unexpectedly reach clear_stage2_pud_entry(), and things are going
> to get really bad.  So we'd better use unmap_stage2_ptes() here since we
> only want to unmap a 2M range.

Yes, you're right. If this PMD entry is the only entry in the parent PUD table,
then the PUD table may get freed and we may end up installing the new entry in
a table that is no longer plugged into the page-table hierarchy.

> 
> 
> Second, consider below function stack:
> 
>   unmap_stage2_ptes()
> clear_stage2_pmd_entry()
>   put_page(virt_to_page(pmd))
> 
> It seems that we have one "redundant" put_page() here, (thus comes the
> bad kernel log ... ,) but actually we do not.  By stage2_set_pmd_huge(),
> the PMD table entry will then point to a 2M block (originally pointed
> to a PTE table), the _refcount of this PMD-level table page should _not_
> change after unmap_stage2_ptes().  So what we really should do is adding
> a get_page() after unmapping to keep the _refcount a balance!

Yes we need an additional refcount on the new huge pmd table, if we are
tearing down the PTE level table.
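A rough fragment of that idea, for illustration only; it assumes
unmap_stage2_ptes() keeps its current (kvm, pmd, addr, end) form and is not
the fix that was eventually applied:

	if (!pmd_thp_or_huge(old_pmd)) {
		/* Tear down only the PTEs covered by this PMD entry... */
		unmap_stage2_ptes(kvm, pmd, addr & S2_PMD_MASK,
				  (addr & S2_PMD_MASK) + S2_PMD_SIZE);
		/*
		 * ...and re-take the reference that was dropped on the PMD
		 * table page when its entry was cleared, since the same slot
		 * is about to be reused for the new block mapping.
		 */
		get_page(virt_to_page(pmd));
	}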

> 
> 
> thoughts ? A simple patch below (based on yours) for details.
> 
> 
> thanks,
> 
> zenghui
> 
> 
> >>
> >>It seems that kvm decreases the _refcount of the page twice in
> >>transparent_hugepage_adjust()
> >>and unmap_stage2_range().
> >
> >But I thought we should be doing that on the head_page already, as 

Re: [RFC] Question about TLB flush while set Stage-2 huge pages

2019-03-15 Thread Suzuki K Poulose

Hi Zhengui,

On 15/03/2019 08:21, Zheng Xiang wrote:

Hi Suzuki,

I have tested this patch; the VM doesn't hang and we get the expected WARNING log:


Thanks for the quick testing !


However, we also get the following unexpected log:

[  908.329900] BUG: Bad page state in process qemu-kvm  pfn:a2fb41cf
[  908.339415] page:7e28bed073c0 count:-4 mapcount:0 
mapping: index:0x0
[  908.339416] flags: 0x4e00()
[  908.339418] raw: 4e00 dead0100 dead0200 

[  908.339419] raw:   fffc 

[  908.339420] page dumped because: nonzero _refcount
[  908.339437] CPU: 32 PID: 72599 Comm: qemu-kvm Kdump: loaded Tainted: GB  
W5.0.0+ #1
[  908.339438] Call trace:
[  908.339439]  dump_backtrace+0x0/0x188
[  908.339441]  show_stack+0x24/0x30
[  908.339442]  dump_stack+0xa8/0xcc
[  908.339443]  bad_page+0xf0/0x150
[  908.339445]  free_pages_check_bad+0x84/0xa0
[  908.339446]  free_pcppages_bulk+0x4b8/0x750
[  908.339448]  free_unref_page_commit+0x13c/0x198
[  908.339449]  free_unref_page+0x84/0xa0
[  908.339451]  __free_pages+0x58/0x68
[  908.339452]  zap_huge_pmd+0x290/0x2d8
[  908.339454]  unmap_page_range+0x2b4/0x470
[  908.339455]  unmap_single_vma+0x94/0xe8
[  908.339457]  unmap_vmas+0x8c/0x108
[  908.339458]  exit_mmap+0xd4/0x178
[  908.339459]  mmput+0x74/0x180
[  908.339460]  do_exit+0x2b4/0x5b0
[  908.339462]  do_group_exit+0x3c/0xe0
[  908.339463]  __arm64_sys_exit_group+0x24/0x28
[  908.339465]  el0_svc_common+0xa0/0x180
[  908.339466]  el0_svc_handler+0x38/0x78
[  908.339467]  el0_svc+0x8/0xc


That's bad, we seem to be making up to 4 unbalanced put_page() calls.


---
   virt/kvm/arm/mmu.c | 51 +++
   1 file changed, 35 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 66e0fbb5..04b0f9b 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1076,24 +1076,38 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct 
kvm_mmu_memory_cache
    * Skip updating the page table if the entry is
    * unchanged.
    */
-    if (pmd_val(old_pmd) == pmd_val(*new_pmd))
+    if (pmd_val(old_pmd) == pmd_val(*new_pmd)) {
   return 0;
-
+    } else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
   /*
- * Mapping in huge pages should only happen through a
- * fault.  If a page is merged into a transparent huge
- * page, the individual subpages of that huge page
- * should be unmapped through MMU notifiers before we
- * get here.
- *
- * Merging of CompoundPages is not supported; they
- * should become splitting first, unmapped, merged,
- * and mapped back in on-demand.
+ * If we have PTE level mapping for this block,
+ * we must unmap it to avoid inconsistent TLB
+ * state. We could end up in this situation if
+ * the memory slot was marked for dirty logging
+ * and was reverted, leaving PTE level mappings
+ * for the pages accessed during the period.
+ * Normal THP split/merge follows mmu_notifier
+ * callbacks and do get handled accordingly.
    */
-    VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
+    unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);


It seems that kvm decreases the _refcount of the page twice in 
transparent_hugepage_adjust()
and unmap_stage2_range().


But I thought we should be doing that on the head_page already, as this is THP.
I will take a look and get back to you on this. Btw, is it possible for you
to turn on CONFIG_DEBUG_VM and re-run with the above patch ?

Kind regards
Suzuki


___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC] Question about TLB flush while set Stage-2 huge pages

2019-03-14 Thread Suzuki K Poulose
y level-3 permission fault caused by a write 
> >> on a same IPA. After
> >> analyzing the source code, I find KVM always return from the bellow *if* 
> >> statement in
> >> stage2_set_pmd_huge() even if we only have a single VCPU:
> >>
> >> /*
> >>  * Multiple vcpus faulting on the same PMD entry, can
> >>  * lead to them sequentially updating the PMD with the
> >>  * same value. Following the break-before-make
> >>  * (pmd_clear() followed by tlb_flush()) process can
> >>  * hinder forward progress due to refaults generated
> >>  * on missing translations.
> >>  *
> >>  * Skip updating the page table if the entry is
> >>  * unchanged.
> >>  */
> >> if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> >> return 0;
> >>
> >> The PMD has already set the PMD_S2_RDWR bit. I doubt 
> >> kvm_tlb_flush_vmid_ipa() does not invalidate
> >> Stage-2 for the subpages of the PMD(except the first PTE of this PMD). 
> >> Finally I add some debug
> >> code to flush tlb for all subpages of the PMD, as shown bellow:
> >>
> >> /*
> >>  * Mapping in huge pages should only happen through a
> >>  * fault.  If a page is merged into a transparent huge
> >>  * page, the individual subpages of that huge page
> >>  * should be unmapped through MMU notifiers before we
> >>  * get here.
> >>  *
> >>  * Merging of CompoundPages is not supported; they
> >>  * should become splitting first, unmapped, merged,
> >>  * and mapped back in on-demand.
> >>  */
> >> VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
> >>
> >> pmd_clear(pmd);
> >> for (cnt = 0; cnt < 512; cnt++)
> >> kvm_tlb_flush_vmid_ipa(kvm, addr + cnt*PAGE_SIZE);
> >>
> >> Then the problem no longer reproduce.
> > 
> > This makes very little sense. We shouldn't be able to enter this path
> > for anything else but a permission update, otherwise the VM_BUG_ON
> > should fire.
> 
> Hmm, I think I didn't describe it very clearly.
> Look at the following sequence:
> 
> 1) Set a PMD READONLY and logging_active.
> 
> 2) KVM handles permission fault caused by writing a subpage(assumpt *b*) 
> within this huge PMD.
> 
> 3) KVM dissolves PMD and invalidates TLB for this PMD. Then set a writable 
> PTE.
> 
> 4) Read another 511 PTEs and setup Stage-2 PTE table.
> 
> 5) Now remove logging_active and keep another 511 PTEs READONLY.
> 
> 6) VM continues to write a subpage(assumpt *c*) and cause permission fault.
> 
> 7) KVM handles this new fault and makes a new writable PMD after 
> transparent_hugepage_adjust().
> 
> 8) KVM invalidates TLB for the first page(*a*) of the PMD.
>Here another 511 RO PTEs entries still stay in TLB, especially *c* which 
> will be wrote later.
> 
> 9) KVM then set this new writable PMD.
>Step 8-9 is what stage2_set_pmd_huge() does.
> 
> 10) VM continues to write *c*, but this time it hits the RO PTE entry in TLB 
> and causes permission fault again.
>Sometimes it can also cause TLB conflict aborts.
> 
> 11) KVM repeats step 6 and goes to the following statement and return 0:
> 
>  * Skip updating the page table if the entry is
>  * unchanged.
>  */
> if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> return 0;
> 
> 12) Then it will repeat step 10-11 until the PTE entry is invalidated.
> 
> I think there is something abnormal in step 8.
> Should I blame my hardware? Or is it a kernel bug?

Marc and I had a discussion about this and it looks like we may have an
issue here. So with the cancellation of logging, we do not trigger the
mmu_notifiers (as the userspace memory mapping hasn't changed) and thus
have memory leaks while trying to install a huge mapping. Would it be
possible for you to try the patch below ? It will trigger a WARNING
to confirm our theory, but should not cause the hang. As we unmap
the PMD/PUD range of PTE mappings before reinstalling a block map.


---8>---

test: kvm: arm: Fix handling of stage2 huge mappings

We rely on the mmu_notifier call backs to handle the split/merging
of huge pages and thus we are guaranteed that while creating a
block mapping, the entire block is unmapped at stage2. However,
we miss a case where the block mapping is split for dirty logging
case and then could later be made block mapping, if we cancel the
dirty loggi

[PATCH] kvm: arm: Enforce PTE mappings at stage2 when needed

2019-03-12 Thread Suzuki K Poulose
commit 6794ad5443a2118 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings")
made the checks to skip huge mappings stricter. However it introduced
a bug where we still use huge mappings, ignoring the flag to
use PTE mappings, by not resetting the vma_pagesize to PAGE_SIZE.

Also, the checks do not cover PUD huge pages, which were
under review during the same period. This patch fixes both
issues.

Fixes : 6794ad5443a2118 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings")
Reported-by: Zenghui Yu 
Cc: Zenghui Yu 
Cc: Christoffer Dall 
Cc: Marc Zyngier 
Signed-off-by: Suzuki K Poulose 
---
 virt/kvm/arm/mmu.c | 43 +--
 1 file changed, 21 insertions(+), 22 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 30251e2..66e0fbb5 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1595,8 +1595,9 @@ static void kvm_send_hwpoison_signal(unsigned long 
address,
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
 }
 
-static bool fault_supports_stage2_pmd_mappings(struct kvm_memory_slot *memslot,
-  unsigned long hva)
+static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
+  unsigned long hva,
+  unsigned long map_size)
 {
gpa_t gpa_start, gpa_end;
hva_t uaddr_start, uaddr_end;
@@ -1612,34 +1613,34 @@ static bool fault_supports_stage2_pmd_mappings(struct 
kvm_memory_slot *memslot,
 
/*
 * Pages belonging to memslots that don't have the same alignment
-* within a PMD for userspace and IPA cannot be mapped with stage-2
-* PMD entries, because we'll end up mapping the wrong pages.
+* within a PMD/PUD for userspace and IPA cannot be mapped with stage-2
+* PMD/PUD entries, because we'll end up mapping the wrong pages.
 *
 * Consider a layout like the following:
 *
 *memslot->userspace_addr:
 *+-+++---+
-*|abcde|fgh  Stage-1 PMD|Stage-1 PMD   tv|xyz|
+*|abcde|fgh  Stage-1 block  |Stage-1 block tv|xyz|
 *+-+++---+
 *
 *memslot->base_gfn << PAGE_SIZE:
 *  +---+++-+
-*  |abc|def  Stage-2 PMD|Stage-2 PMD |tvxyz|
+*  |abc|def  Stage-2 block  |Stage-2 block   |tvxyz|
 *  +---+++-+
 *
-* If we create those stage-2 PMDs, we'll end up with this incorrect
+* If we create those stage-2 blocks, we'll end up with this incorrect
 * mapping:
 *   d -> f
 *   e -> g
 *   f -> h
 */
-   if ((gpa_start & ~S2_PMD_MASK) != (uaddr_start & ~S2_PMD_MASK))
+   if ((gpa_start & (map_size - 1)) != (uaddr_start & (map_size - 1)))
return false;
 
/*
 * Next, let's make sure we're not trying to map anything not covered
-* by the memslot. This means we have to prohibit PMD size mappings
-* for the beginning and end of a non-PMD aligned and non-PMD sized
+* by the memslot. This means we have to prohibit block size mappings
+* for the beginning and end of a non-block aligned and non-block sized
 * memory slot (illustrated by the head and tail parts of the
 * userspace view above containing pages 'abcde' and 'xyz',
 * respectively).
@@ -1648,8 +1649,8 @@ static bool fault_supports_stage2_pmd_mappings(struct 
kvm_memory_slot *memslot,
 * userspace_addr or the base_gfn, as both are equally aligned (per
 * the check above) and equally sized.
 */
-   return (hva & S2_PMD_MASK) >= uaddr_start &&
-  (hva & S2_PMD_MASK) + S2_PMD_SIZE <= uaddr_end;
+   return (hva & ~(map_size - 1)) >= uaddr_start &&
+  (hva & ~(map_size - 1)) + map_size <= uaddr_end;
 }
 
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
@@ -1678,12 +1679,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   if (!fault_supports_stage2_pmd_mappings(memslot, hva))
-   force_pte = true;
-
-   if (logging_active)
-   force_pte = true;
-
/* Let's check if we will get back a huge page backed by hugetlbfs */
down_read(&current->mm->mmap_sem);
vma = find_vma_intersection(current->mm, hva, hva + 1);
@@ -1694,6 +1689,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
}
 
vma_pagesize = vma_kernel_pagesize(

Re: [PATCH v11 3/8] arm64: KVM: add accessors to track guest/host only counters

2019-03-11 Thread Suzuki K Poulose

Hi Andrew,


On 08/03/2019 12:07, Andrew Murray wrote:

In order to efficiently switch events_{guest,host} perf counters at
guest entry/exit we add bitfields to kvm_cpu_context for guest and host
events as well as accessors for updating them.

A function is also provided which allows the PMU driver to determine
if a counter should start counting when it is enabled. With exclude_host
events, on !VHE we may only start counting when entering the guest.
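(As background, exclude_host/exclude_guest are ordinary perf_event_attr bits;
a minimal, purely illustrative way to request a guest-only event from
userspace might look like this. It is not part of the patch itself.)

#include <linux/perf_event.h>
#include <string.h>

/* Illustration only: a cycle counter meant to count guest execution
 * but not host execution. */
static void init_guest_only_cycles(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->type          = PERF_TYPE_HARDWARE;
	attr->size          = sizeof(*attr);
	attr->config        = PERF_COUNT_HW_CPU_CYCLES;
	attr->exclude_host  = 1;	/* do not count while the host runs */
	attr->exclude_guest = 0;	/* do count while the guest runs    */
}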


Some minor comments below.



Signed-off-by: Andrew Murray 
---
  arch/arm64/include/asm/kvm_host.h | 17 +++
  arch/arm64/kvm/Makefile   |  2 +-
  arch/arm64/kvm/pmu.c  | 49 +++
  3 files changed, 67 insertions(+), 1 deletion(-)
  create mode 100644 arch/arm64/kvm/pmu.c

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 1d36619d6650..4b7219128f2d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -207,8 +207,14 @@ struct kvm_cpu_context {
struct kvm_vcpu *__hyp_running_vcpu;
  };
  
+struct kvm_pmu_events {

+   u32 events_host;
+   u32 events_guest;
+};
+
  struct kvm_host_data {
struct kvm_cpu_context host_ctxt;
+   struct kvm_pmu_events pmu_events;
  };
  
  typedef struct kvm_host_data kvm_host_data_t;

@@ -479,11 +485,22 @@ void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
  void kvm_arch_vcpu_ctxsync_fp(struct kvm_vcpu *vcpu);
  void kvm_arch_vcpu_put_fp(struct kvm_vcpu *vcpu);
  
+static inline bool kvm_pmu_counter_defered(struct perf_event_attr *attr)


nit: s/defered/deferred/


+{
+   return attr->exclude_host;
+}
+


Does it need a definition for the !CONFIG_KVM case, to make sure that
the events are always enabled in that case? I.e., does this introduce
a change in behavior for the !CONFIG_KVM case?
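Something along these lines, purely as a sketch of the kind of !CONFIG_KVM
fallback being asked about (hypothetical, and using the corrected spelling
from the nit above):

#ifndef CONFIG_KVM
static inline bool kvm_pmu_counter_deferred(struct perf_event_attr *attr)
{
	return false;	/* no guests without KVM, so never defer enabling */
}
#endif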


  #ifdef CONFIG_KVM /* Avoid conflicts with core headers if CONFIG_KVM=n */
  static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
  {
return kvm_arch_vcpu_run_map_fp(vcpu);
  }
+
+void kvm_set_pmu_events(u32 set, struct perf_event_attr *attr);
+void kvm_clr_pmu_events(u32 clr);
+#else
+static inline void kvm_set_pmu_events(u32 set, struct perf_event_attr *attr) {}
+static inline void kvm_clr_pmu_events(u32 clr) {}
  #endif
  
  static inline void kvm_arm_vhe_guest_enter(void)

diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 0f2a135ba15b..f34cb49b66ae 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -19,7 +19,7 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/psci.o 
$(KVM)/arm/perf.o
  kvm-$(CONFIG_KVM_ARM_HOST) += inject_fault.o regmap.o va_layout.o
  kvm-$(CONFIG_KVM_ARM_HOST) += hyp.o hyp-init.o handle_exit.o
  kvm-$(CONFIG_KVM_ARM_HOST) += guest.o debug.o reset.o sys_regs.o 
sys_regs_generic_v8.o
-kvm-$(CONFIG_KVM_ARM_HOST) += vgic-sys-reg-v3.o fpsimd.o
+kvm-$(CONFIG_KVM_ARM_HOST) += vgic-sys-reg-v3.o fpsimd.o pmu.o
  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/aarch32.o
  
  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic/vgic.o

diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
new file mode 100644
index ..43965a3cc0f4
--- /dev/null
+++ b/arch/arm64/kvm/pmu.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * arch/arm64/kvm/pmu.c: Switching between guest and host counters


minor nit: You don't need the file name, it is superfluous.


+ *
+ * Copyright 2019 Arm Limited
+ * Author: Andrew Murray 
+ */
+#include 
+#include 
+



+DECLARE_PER_CPU(kvm_host_data_t, kvm_host_data);


nit: Do we really need this ? This is already in asm/kvm_host.h.


+
+/*
+ * Given the exclude_{host,guest} attributes, determine if we are going
+ * to need to switch counters at guest entry/exit.
+ */
+static bool kvm_pmu_switch_needed(struct perf_event_attr *attr)
+{
+   /* Only switch if attributes are different */
+   return (attr->exclude_host ^ attr->exclude_guest);


I back Julien's suggestion to keep this simple.


+}
+
+/*
+ * Add events to track that we may want to switch at guest entry/exit
+ * time.
+ */
+void kvm_set_pmu_events(u32 set, struct perf_event_attr *attr)
+{
+   struct kvm_host_data *ctx = this_cpu_ptr(&kvm_host_data);
+
+   if (!kvm_pmu_switch_needed(attr))
+   return;
+
+   if (!attr->exclude_host)
+   ctx->pmu_events.events_host |= set;
+   if (!attr->exclude_guest)
+   ctx->pmu_events.events_guest |= set;
+}
+
+/*
+ * Stop tracking events
+ */
+void kvm_clr_pmu_events(u32 clr)
+{
+   struct kvm_host_data *ctx = this_cpu_ptr(&kvm_host_data);
+
+   ctx->pmu_events.events_host &= ~clr;
+   ctx->pmu_events.events_guest &= ~clr;
+}



Rest looks fine.

Suzuki
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC PATCH] KVM: arm64: Force a PTE mapping when logging is enabled

2019-03-05 Thread Suzuki K Poulose

Hi Zenghui,

On 05/03/2019 11:09, Zenghui Yu wrote:

Hi Marc, Suzuki,

On 2019/3/5 1:34, Marc Zyngier wrote:

Hi Zenghui, Suzuki,

On 04/03/2019 17:13, Suzuki K Poulose wrote:

Hi Zenghui,

On Sun, Mar 03, 2019 at 11:14:38PM +0800, Zenghui Yu wrote:

I think there're still some problems in this patch... Details below.

On Sat, Mar 2, 2019 at 11:39 AM Zenghui Yu  wrote:


The idea behind this is: we don't want to keep tracking huge pages when
logging_active is true, which will result in performance degradation.  We
still need to set vma_pagesize to PAGE_SIZE, so that we can make use of it
to force a PTE mapping.


Yes, you're right. We are indeed ignoring the force_pte flag.



Cc: Suzuki K Poulose 
Cc: Punit Agrawal 
Signed-off-by: Zenghui Yu 

---
After looking into https://patchwork.codeaurora.org/patch/647985/ , the
"vma_pagesize = PAGE_SIZE" logic was not intended to be deleted. As far
as I can tell, we used to have "hugetlb" to force the PTE mapping, but
we have "vma_pagesize" currently instead. We should set it properly for
performance reasons (e.g., in VM migration). Did I miss something important?

---
   virt/kvm/arm/mmu.c | 7 +++
   1 file changed, 7 insertions(+)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 30251e2..7d41b16 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1705,6 +1705,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
   (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm))) &&
  !force_pte) {
  gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> 
PAGE_SHIFT;
+   } else {
+   /*
+* Fallback to PTE if it's not one of the stage2
+* supported hugepage sizes or the corresponding level
+* doesn't exist, or logging is enabled.


First, Instead of "logging is enabled", it should be "force_pte is true",
since "force_pte" will be true when:

  1) fault_supports_stage2_pmd_mappings() return false; or
  2) "logging is enabled" (e.g, in VM migration).

Second, fallback some unsupported hugepage sizes (e.g, 64K hugepage with
4K pages) to PTE is somewhat strange. And it will then _unexpectedly_
reach transparent_hugepage_adjust(), though no real adjustment will happen
since commit fd2ef358282c ("KVM: arm/arm64: Ensure only THP is candidate
for adjustment"). Keeping "vma_pagesize" there as it is will be better,
right?

So I'd just simplify the logic like:


We could fix this right in the beginning. See patch below:



  } else if (force_pte) {
  vma_pagesize = PAGE_SIZE;
  }


Will send a V2 later and wait for your comments :)



diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 30251e2..529331e 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1693,7 +1693,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
   
-	vma_pagesize = vma_kernel_pagesize(vma);

+   /* If we are forced to map at page granularity, force the pagesize here 
*/
+   vma_pagesize = force_pte ? PAGE_SIZE : vma_kernel_pagesize(vma);
+
/*
 * The stage2 has a minimum of 2 level table (For arm64 see
 * kvm_arm_setup_stage2()). Hence, we are guaranteed that we can
@@ -1701,11 +1703,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 * As for PUD huge maps, we must make sure that we have at least
 * 3 levels, i.e, PMD is not folded.
 */
-   if ((vma_pagesize == PMD_SIZE ||
-(vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm))) &&
-   !force_pte) {
+   if (vma_pagesize == PMD_SIZE ||
+   (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm)))
gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> 
PAGE_SHIFT;
-   }
+
up_read(&current->mm->mmap_sem);
   
   	/* We need minimum second+third level pages */


A nicer implementation and easier to understand, thanks!


That's pretty interesting, because this is almost what we already have
in the NV code:

https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/tree/virt/kvm/arm/mmu.c?h=kvm-arm64/nv-wip-v5.0-rc7#n1752

(note that force_pte is gone in that branch).


haha :-) sorry about that. I haven't looked into the NV code yet, so ...

But I'm still wondering: should we fix this wrong mapping size problem
before NV is introduced? This problem has not much to do with NV,
and 5.0 has already been released with it (and 5.1 will be too,
without the fix ...).


Yes, we must fix it. I will soon send out a patch, copying you on it.
It's just that I found some more issues around forcing the PTE
mappings with PUD huge pages. I will send something out soon.

Cheers
Suzuki

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC PATCH] KVM: arm64: Force a PTE mapping when logging is enabled

2019-03-04 Thread Suzuki K Poulose
Hi Zenghui,

On Sun, Mar 03, 2019 at 11:14:38PM +0800, Zenghui Yu wrote:
> I think there're still some problems in this patch... Details below.
> 
> On Sat, Mar 2, 2019 at 11:39 AM Zenghui Yu  wrote:
> >
> > The idea behind this is: we don't want to keep tracking huge pages when
> > logging_active is true, which will result in performance degradation.  We
> > still need to set vma_pagesize to PAGE_SIZE, so that we can make use of it
> > to force a PTE mapping.

Yes, you're right. We are indeed ignoring the force_pte flag.

> >
> > Cc: Suzuki K Poulose 
> > Cc: Punit Agrawal 
> > Signed-off-by: Zenghui Yu 
> >
> > ---
> > After looking into https://patchwork.codeaurora.org/patch/647985/ , the
> > "vma_pagesize = PAGE_SIZE" logic was not intended to be deleted. As far
> > as I can tell, we used to have "hugetlb" to force the PTE mapping, but
> > we have "vma_pagesize" currently instead. We should set it properly for
> > performance reasons (e.g., in VM migration). Did I miss something important?
> >
> > ---
> >  virt/kvm/arm/mmu.c | 7 +++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> > index 30251e2..7d41b16 100644
> > --- a/virt/kvm/arm/mmu.c
> > +++ b/virt/kvm/arm/mmu.c
> > @@ -1705,6 +1705,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> > phys_addr_t fault_ipa,
> >  (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm))) &&
> > !force_pte) {
> > gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> 
> > PAGE_SHIFT;
> > +   } else {
> > +   /*
> > +* Fallback to PTE if it's not one of the stage2
> > +* supported hugepage sizes or the corresponding level
> > +* doesn't exist, or logging is enabled.
> 
> First, Instead of "logging is enabled", it should be "force_pte is true",
> since "force_pte" will be true when:
> 
> 1) fault_supports_stage2_pmd_mappings() returns false; or
> 2) "logging is enabled" (e.g, in VM migration).
> 
> Second, falling back from some unsupported hugepage sizes (e.g., a 64K
> hugepage with 4K pages) to PTE is somewhat strange. And it will then _unexpectedly_
> reach transparent_hugepage_adjust(), though no real adjustment will happen
> since commit fd2ef358282c ("KVM: arm/arm64: Ensure only THP is candidate
> for adjustment"). Keeping "vma_pagesize" there as it is will be better,
> right?
> 
> So I'd just simplify the logic like:

We could fix this right in the beginning. See patch below:

> 
> } else if (force_pte) {
> vma_pagesize = PAGE_SIZE;
> }
> 
> 
> Will send a V2 later and waiting for your comments :)


diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 30251e2..529331e 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1693,7 +1693,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   vma_pagesize = vma_kernel_pagesize(vma);
+   /* If we are forced to map at page granularity, force the pagesize here */
+   vma_pagesize = force_pte ? PAGE_SIZE : vma_kernel_pagesize(vma);
+
/*
 * The stage2 has a minimum of 2 level table (For arm64 see
 * kvm_arm_setup_stage2()). Hence, we are guaranteed that we can
@@ -1701,11 +1703,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 * As for PUD huge maps, we must make sure that we have at least
 * 3 levels, i.e, PMD is not folded.
 */
-   if ((vma_pagesize == PMD_SIZE ||
-(vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm))) &&
-   !force_pte) {
+   if (vma_pagesize == PMD_SIZE ||
+   (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm)))
gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
-   }
+
up_read(&current->mm->mmap_sem);
 
/* We need minimum second+third level pages */
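For context, a minimal sketch of how force_pte ends up set earlier in
user_mem_abort() under the conditions Zenghui lists above (a failed
fault_supports_stage2_pmd_mappings() check, or active dirty logging). This is
a rough 5.0-era flow, not the exact upstream code, with the assignment from
the hunk above repeated at the end:

	bool force_pte = false;

	/* Dirty logging wants page granularity so writes can be tracked */
	if (logging_active)
		force_pte = true;

	/* The memslot/hva layout may rule out block mappings entirely */
	if (!fault_supports_stage2_pmd_mappings(memslot, hva))
		force_pte = true;

	/* ... and the hunk above then clamps the mapping size */
	vma_pagesize = force_pte ? PAGE_SIZE : vma_kernel_pagesize(vma);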

---

Suzuki


Re: [PATCH v2 5/5] KVM: arm/arm64: support chained PMU counters

2019-02-14 Thread Suzuki K Poulose




On 05/02/2019 14:33, Julien Thierry wrote:

Hi Andrew,

On 04/02/2019 16:53, Andrew Murray wrote:

Emulate chained PMU counters by creating a single 64 bit event counter
for a pair of chained KVM counters.

Signed-off-by: Andrew Murray 
---
  include/kvm/arm_pmu.h |   1 +
  virt/kvm/arm/pmu.c| 321 +-
  2 files changed, 269 insertions(+), 53 deletions(-)

diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index b73f31b..8e691ee 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -29,6 +29,7 @@ struct kvm_pmc {
u8 idx; /* index into the pmu->pmc array */
struct perf_event *perf_event;
u64 bitmask;
+   u64 overflow_count;
  };
  
  struct kvm_pmu {

diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index a64aeb2..9318130 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -24,9 +24,25 @@
  #include 
  #include 
  
+#define ARMV8_PMUV3_PERFCTR_CHAIN 0x1E


I find it a bit awkward to have this redefined here.

Maybe we could define a helper in kvm_host.h:
bool kvm_pmu_typer_is_chain(u64 typer);

That would always return false for arm32?


We don't support an ARMv7 host, so that doesn't matter. But
it is a good idea to wrap it in a function here.
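
A minimal sketch of such a helper, assuming it sits next to the existing PMU
emulation code and reuses ARMV8_PMU_EVTYPE_EVENT as the event selector mask:

	/* True if the programmed event type selects the CHAIN event (0x1E) */
	static bool kvm_pmu_typer_is_chain(u64 typer)
	{
		u64 eventsel = typer & ARMV8_PMU_EVTYPE_EVENT;

		return eventsel == ARMV8_PMUV3_PERFCTR_CHAIN;
	}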


kvm_vcpu_kick(vcpu);
@@ -440,6 +632,10 @@ static void kvm_pmu_create_perf_event(struct kvm_vcpu 
*vcpu, u64 select_idx)
select_idx != ARMV8_PMU_CYCLE_IDX)
return;
  
+	/* Handled by even event */

+   if (eventsel == ARMV8_PMUV3_PERFCTR_CHAIN)
+   return;
+
	memset(&attr, 0, sizeof(struct perf_event_attr));
attr.type = PERF_TYPE_RAW;
attr.size = sizeof(attr);
@@ -452,6 +648,9 @@ static void kvm_pmu_create_perf_event(struct kvm_vcpu 
*vcpu, u64 select_idx)
attr.config = (select_idx == ARMV8_PMU_CYCLE_IDX) ?
ARMV8_PMUV3_PERFCTR_CPU_CYCLES : eventsel;
  
+	if (kvm_pmu_event_is_chained(vcpu, select_idx))

+   attr.config1 |= 0x1;


I'm not very familiar with the usage of perf attributes configs, but is
there any chance we could name this flag? Even if only for the local
file? Something like PERF_ATTR_CFG1_KVM_PMU_CHAINED (unless there is an
existing naming convention for event attributes).


There is ARMPMU_EVT_64BIT in "linux/perf/arm_pmu.h". But in general we don't
specify where the bit goes (i.e., CFG1 etc.).
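
A sketch of what naming the bit could look like, using the
PERF_ATTR_CFG1_KVM_PMU_CHAINED name suggested above (the define is assumed to
stay local to the file and keep the value 0x1):

	#define PERF_ATTR_CFG1_KVM_PMU_CHAINED	0x1

	if (kvm_pmu_event_is_chained(vcpu, select_idx))
		attr.config1 |= PERF_ATTR_CFG1_KVM_PMU_CHAINED;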

Cheers
Suzuki


Re: [PATCH v2 4/5] KVM: arm/arm64: lazily create perf events on enable

2019-02-14 Thread Suzuki K Poulose

Hi Andrew,

On 04/02/2019 16:53, Andrew Murray wrote:

To prevent re-creating perf events every time the counter registers
are changed, let's instead lazily create the event when the event
is first enabled and destroy it when it changes.

Signed-off-by: Andrew Murray 
---
  virt/kvm/arm/pmu.c | 96 ++
  1 file changed, 68 insertions(+), 28 deletions(-)

diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index 95d74ec..a64aeb2 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -24,7 +24,10 @@
  #include 
  #include 
  
+static void kvm_pmu_sync_counter_enable(struct kvm_vcpu *vcpu, u64 select_idx);

  static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx);
+static void kvm_pmu_stop_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc);
+
  /**
   * kvm_pmu_get_counter_value - get PMU counter value
   * @vcpu: The vcpu pointer
@@ -59,13 +62,15 @@ u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 
select_idx)
  void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val)
  {
u64 reg;
+   struct kvm_pmu *pmu = &vcpu->arch.pmu;
+   struct kvm_pmc *pmc = &pmu->pmc[select_idx];
  
  	reg = (select_idx == ARMV8_PMU_CYCLE_IDX)

  ? PMCCNTR_EL0 : PMEVCNTR0_EL0 + select_idx;
	__vcpu_sys_reg(vcpu, reg) += (s64)val - kvm_pmu_get_counter_value(vcpu, select_idx);
  
-	/* Recreate the perf event to reflect the updated sample_period */

-   kvm_pmu_create_perf_event(vcpu, select_idx);
+   kvm_pmu_stop_counter(vcpu, pmc);
+   kvm_pmu_sync_counter_enable(vcpu, select_idx);
  }
  
  /**

@@ -83,6 +88,7 @@ static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc)
  
  /**

   * kvm_pmu_stop_counter - stop PMU counter
+ * @vcpu: The vcpu pointer
   * @pmc: The PMU counter pointer
   *
   * If this counter has been configured to monitor some event, release it here.
@@ -143,6 +149,24 @@ u64 kvm_pmu_valid_counter_mask(struct kvm_vcpu *vcpu)
  }
  
  /**

+ * kvm_pmu_enable_counter - create/enable a counter
+ * @vcpu: The vcpu pointer
+ * @select_idx: The counter index
+ */
+static void kvm_pmu_enable_counter(struct kvm_vcpu *vcpu, u64 select_idx)
+{
+   struct kvm_pmu *pmu = &vcpu->arch.pmu;
+   struct kvm_pmc *pmc = &pmu->pmc[select_idx];
+
+   if (!pmc->perf_event)
+   kvm_pmu_create_perf_event(vcpu, select_idx);
+
+   perf_event_enable(pmc->perf_event);
+   if (pmc->perf_event->state != PERF_EVENT_STATE_ACTIVE)
+   kvm_debug("failed to enable perf event\n");
+}
+
+/**
   * kvm_pmu_enable_counter_mask - enable selected PMU counters
   * @vcpu: The vcpu pointer
   * @val: the value guest writes to PMCNTENSET register
@@ -152,8 +176,6 @@ u64 kvm_pmu_valid_counter_mask(struct kvm_vcpu *vcpu)
  void kvm_pmu_enable_counter_mask(struct kvm_vcpu *vcpu, u64 val)
  {
int i;
-   struct kvm_pmu *pmu = &vcpu->arch.pmu;
-   struct kvm_pmc *pmc;
  
  	if (!(__vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E) || !val)

return;
@@ -162,16 +184,39 @@ void kvm_pmu_enable_counter_mask(struct kvm_vcpu *vcpu, 
u64 val)
if (!(val & BIT(i)))
continue;
  
-		pmc = &pmu->pmc[i];

-   if (pmc->perf_event) {
-   perf_event_enable(pmc->perf_event);
-   if (pmc->perf_event->state != PERF_EVENT_STATE_ACTIVE)
-   kvm_debug("fail to enable perf event\n");
-   }
+   kvm_pmu_enable_counter(vcpu, i);
}
  }
  
  /**

+ * kvm_pmu_sync_counter_enable - reenable a counter if it should be enabled
+ * @vcpu: The vcpu pointer
+ * @select_idx: The counter index
+ */
+static void kvm_pmu_sync_counter_enable(struct kvm_vcpu *vcpu,
+   u64 select_idx)
+{
+   u64 set = __vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
+
+   if (set & BIT(select_idx))
+   kvm_pmu_enable_counter_mask(vcpu, BIT(select_idx));


I think there is a problem here. We could be creating an event for a
counter beyond what is supported by the CPU. i.e, we don't
seem to validate that the mask we are creating is within the
kvm_pmu_valid_counter_mask(). The other callers seem to verify this.
I guess it may be better to add a check for this in the
kvm_pmu_enable_counter_mask().

minor nit: Feel free to ignore. If we move the check for PMCNTENSET_EL0 into
kvm_pmu_enable_counter_mask() we could get rid of the above function. Anyway,
we should only be enabling the counters set in PMCNTENSET_EL0.
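
A minimal sketch of that suggestion: filter on both the valid counter mask and
PMCNTENSET_EL0 inside the mask-enable path, which also makes the per-index
wrapper above unnecessary (placement and exact form are assumptions, not the
final patch):

	void kvm_pmu_enable_counter_mask(struct kvm_vcpu *vcpu, u64 val)
	{
		int i;

		/* Only act on implemented counters set in PMCNTENSET_EL0 */
		val &= kvm_pmu_valid_counter_mask(vcpu);
		val &= __vcpu_sys_reg(vcpu, PMCNTENSET_EL0);

		if (!(__vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E) || !val)
			return;

		for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++) {
			if (!(val & BIT(i)))
				continue;

			kvm_pmu_enable_counter(vcpu, i);
		}
	}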



+}
+
+/**
+ * kvm_pmu_disable_counter - disable selected PMU counter
+ * @vcpu: The vcpu pointer
+ * @pmc: The counter to disable
+ */
+static void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u64 select_idx)
+{
+   struct kvm_pmu *pmu = &vcpu->arch.pmu;
+   struct kvm_pmc *pmc = &pmu->pmc[select_idx];
+
+   if (pmc->perf_event)
+   perf_event_disable(pmc->perf_event);
+}
+
+/**
   * kvm_pmu_disable_counter_mask - 

Re: [PATCH] kvm: arm64: Fix comment for KVM_PHYS_SHIFT

2019-02-14 Thread Suzuki K Poulose

Hi

On 14/02/2019 01:45, Zenghui Yu wrote:

Since Suzuki K Poulose's work on Dynamic IPA support, KVM_PHYS_SHIFT will
be used only when machine_type's bits[7:0] equal to 0 (by default). Thus
the outdated comment should be fixed.

Cc: Suzuki K Poulose 
Signed-off-by: Zenghui Yu 
---
  arch/arm64/include/asm/kvm_mmu.h | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 8af4b1b..a12bf48 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -138,7 +138,8 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
})
  
  /*

- * We currently only support a 40bit IPA.
+ * We currently support using a VM-specified IPA size. For backward
+ * compatibility, the default IPA size is fixed to 40bits.
   */
  #define KVM_PHYS_SHIFT	(40)
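
For reference, a rough sketch of where this default now applies, assuming the
kvm_arm_setup_stage2() flow introduced by the Dynamic IPA work (not the exact
upstream code):

	/* The VM type encodes the requested IPA size in bits[7:0]; only when
	 * that field is 0 does the 40-bit default apply. */
	phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
	if (!phys_shift)
		phys_shift = KVM_PHYS_SHIFT;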


Thanks for the fix.

Reviewed-by: Suzuki K Poulose 


Re: [PATCH v2 3/5] KVM: arm/arm64: re-create event when setting counter value

2019-02-13 Thread Suzuki K Poulose

Hi Andrew,

On 04/02/2019 16:53, Andrew Murray wrote:

The perf event sample_period is currently set based upon the current
counter value, when PMXEVTYPER is written to and the perf event is created.
However the user may choose to write the type before the counter value in
which case sample_period will be set incorrectly. Let's instead decouple
event creation from PMXEVTYPER and (re)create the event in either
situation.
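
(For context on why the ordering matters: the sample period is seeded from the
current counter value when the perf event is created, roughly as in the sketch
below, which assumes the existing create path and the pmc->bitmask field.)

	/* The period programmed into perf reflects the counter value at
	 * event-creation time; a later counter write goes stale unless the
	 * event is recreated. */
	counter = kvm_pmu_get_counter_value(vcpu, select_idx);
	attr.sample_period = (-counter) & pmc->bitmask;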

Signed-off-by: Andrew Murray 
---
  virt/kvm/arm/pmu.c | 40 +++-
  1 file changed, 31 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index 6e7c179..95d74ec 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -24,6 +24,7 @@
  #include 
  #include 
  
+static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx);

  /**
   * kvm_pmu_get_counter_value - get PMU counter value
   * @vcpu: The vcpu pointer
@@ -62,6 +63,9 @@ void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 
select_idx, u64 val)
reg = (select_idx == ARMV8_PMU_CYCLE_IDX)
  ? PMCCNTR_EL0 : PMEVCNTR0_EL0 + select_idx;
	__vcpu_sys_reg(vcpu, reg) += (s64)val - kvm_pmu_get_counter_value(vcpu, select_idx);
+
+   /* Recreate the perf event to reflect the updated sample_period */
+   kvm_pmu_create_perf_event(vcpu, select_idx);
  }
  
  /**

@@ -378,23 +382,22 @@ static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu 
*vcpu, u64 select_idx)
  }
  
  /**

- * kvm_pmu_set_counter_event_type - set selected counter to monitor some event
+ * kvm_pmu_create_perf_event - create a perf event for a counter
   * @vcpu: The vcpu pointer
- * @data: The data guest writes to PMXEVTYPER_EL0
+ * @data: Type of event as per PMXEVTYPER_EL0 format
   * @select_idx: The number of selected counter


nit: data no longer exists.



- *
- * When OS accesses PMXEVTYPER_EL0, that means it wants to set a PMC to count an
- * event with given hardware event number. Here we call perf_event API to
- * emulate this action and create a kernel perf event for it.
   */
-void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u64 data,
-   u64 select_idx)
+static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx)
  {


With the comment from Julien addressed,

Reviewed-by: Suzuki K Poulose 


Re: [PATCH] KVM: arm/arm64: Update MAINTAINERS entries

2019-02-12 Thread Suzuki K Poulose

Hi Marc,

On 12/02/2019 11:06, Marc Zyngier wrote:

For historical reasons, KVM/arm and KVM/arm64 have had different
entries in the MAINTAINERS file. This makes little sense, as they are
maintained together.

On top of that, we have a bunch of talented people helping with
the reviewing, and they deserve to be mentioned in the consolidated
entry.


Thanks for the recognition !



Acked-by: Christoffer Dall 
Signed-off-by: Marc Zyngier 
---
  MAINTAINERS | 18 +++---
  1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 51029a425dbe..91ec0682443b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8297,29 +8297,25 @@ S:  Maintained
  F:arch/x86/include/asm/svm.h
  F:arch/x86/kvm/svm.c
  
-KERNEL VIRTUAL MACHINE FOR ARM (KVM/arm)

+KERNEL VIRTUAL MACHINE FOR ARM/ARM64 (KVM/arm, KVM/arm64)
  M:Christoffer Dall 
  M:Marc Zyngier 
+R: James Morse 
+R: Julien Thierry 



+R: Suzuki Pouloze 


Please note that it is: "Suzuki K Poulose"

With that:

Acked-by: Suzuki K Poulose 

