Re: [PATCH v6 03/10] KVM: selftests: Use flag CLOCK_MONOTONIC_RAW for timing

2021-04-19 Thread wangyanan (Y)



On 2021/4/19 16:22, David Laight wrote:

From: wangyanan (Y)

Sent: 19 April 2021 07:40

Hi Paolo,

On 2021/4/17 21:23, Paolo Bonzini wrote:

On 30/03/21 10:08, Yanan Wang wrote:

In addition to the function of CLOCK_MONOTONIC, the flag CLOCK_MONOTONIC_RAW can
also shield the possible impact of NTP, which can provide more robustness.

Suggested-by: Vitaly Kuznetsov
Signed-off-by: Yanan Wang
Reviewed-by: Ben Gardon
Reviewed-by: Andrew Jones

I'm not sure about this one, is the effect visible?


In practice, the difference between results obtained with CLOCK_MONOTONIC and
CLOCK_MONOTONIC_RAW is actually too small to be visible. But in theory,
CLOCK_MONOTONIC_RAW can ensure that the timing results of the compared tests
are based on the same local oscillator frequency, which is not subject to
possible NTP frequency adjustment. So the change in this patch is a bit of
an optimization.
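
As a rough illustration (a minimal standalone sketch, not the selftest code
itself), the kind of timing comparison being discussed looks like this; only
the clock id differs between the two measurements:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void workload(void)
{
	usleep(100 * 1000);	/* stand-in for the code being timed */
}

/* Elapsed seconds of workload() as seen by the given clock. */
static double time_it(clockid_t clk)
{
	struct timespec s, e;

	clock_gettime(clk, &s);
	workload();
	clock_gettime(clk, &e);
	return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
}

int main(void)
{
	/*
	 * CLOCK_MONOTONIC is frequency-adjusted (slewed) by NTP;
	 * CLOCK_MONOTONIC_RAW ticks at the raw oscillator rate, so
	 * repeated runs are compared against the same time base.
	 */
	printf("MONOTONIC:     %.6f s\n", time_it(CLOCK_MONOTONIC));
	printf("MONOTONIC_RAW: %.6f s\n", time_it(CLOCK_MONOTONIC_RAW));
	return 0;
}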

The real annoyance is when NTP is realigning the local clock.
This typically happens after boot - but can take quite a few
minutes (I don't think it can quite get to an hour).
(I think something similar is caused by leap seconds.)

During this period CLOCK_MONOTONIC can run at a significantly
different rate from 'real time'.
This may not matter for timing self tests, but is significant
for RTP audio.

The problem there is that you want the NTP corrected time
during 'normal running' because the small correction (for
crystal error) is useful.

But the kernel HR timers are only defined for CLOCK_MONOTONIC
and the userspace requests for CLOCK_MONOTONIC_RAW are likely
to be real system calls.

What you really want is a clock whose frequency is adjusted
by NTP but doesn't have the NTP offset adjustments.
In reality this ought to be CLOCK_MONOTONIC.

Hi David,

I see now, many thanks for the above explanation. :)
Still have a lot to learn about this part.

Thanks,
Yanan


David



Re: [PATCH v6 03/10] KVM: selftests: Use flag CLOCK_MONOTONIC_RAW for timing

2021-04-19 Thread wangyanan (Y)

Hi Paolo,

On 2021/4/17 21:23, Paolo Bonzini wrote:

On 30/03/21 10:08, Yanan Wang wrote:

In addition to the function of CLOCK_MONOTONIC, the flag CLOCK_MONOTONIC_RAW can
also shield the possible impact of NTP, which can provide more robustness.

Suggested-by: Vitaly Kuznetsov
Signed-off-by: Yanan Wang
Reviewed-by: Ben Gardon
Reviewed-by: Andrew Jones


I'm not sure about this one, is the effect visible?

In practice, the difference between results obtained with CLOCK_MONOTONIC and
CLOCK_MONOTONIC_RAW is actually too small to be visible. But in theory,
CLOCK_MONOTONIC_RAW can ensure that the timing results of the compared tests
are based on the same local oscillator frequency, which is not subject to
possible NTP frequency adjustment. So the change in this patch is a bit of
an optimization.


But either of these two flags works for me. If this patch is not convincing
enough to be accepted, I will post a later patch to fix the use of
CLOCK_MONOTONIC_RAW in kvm_page_table_test.c, just to be consistent with the
other kvm tests. Please queue. :)

Thanks,
Yanan




Re: [PATCH v4 1/2] KVM: arm64: Move CMOs from user_mem_abort to the fault handlers

2021-04-09 Thread wangyanan (Y)

Hi Quentin,

On 2021/4/9 16:08, Quentin Perret wrote:

Hi Yanan,

On Friday 09 Apr 2021 at 11:36:51 (+0800), Yanan Wang wrote:

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
+static void stage2_invalidate_icache(void *addr, u64 size)
+{
+   if (icache_is_aliasing()) {
+   /* Flush any kind of VIPT icache */
+   __flush_icache_all();
+   } else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
+   /* PIPT or VPIPT at EL2 */
+   invalidate_icache_range((unsigned long)addr,
+   (unsigned long)addr + size);
+   }
+}
+

I would recommend to try and rebase this patch on kvmarm/next because
we've made a few changes in pgtable.c recently. It is now linked into
the EL2 NVHE code which means there are constraints on what can be used
from there -- you'll need a bit of extra work to make some of these
functions available to EL2.

I see, thanks for reminding me of this.
I will work on kvmarm/next and send a new version later.

Thanks,
Yanan


Thanks,
Quentin
.


Re: [RFC PATCH v3 2/2] KVM: arm64: Distinguish cases of memcache allocations completely

2021-04-08 Thread wangyanan (Y)



On 2021/4/7 23:35, Alexandru Elisei wrote:

Hi Yanan,

On 3/26/21 3:16 AM, Yanan Wang wrote:

With a guest translation fault, the memcache pages are not needed if KVM
is only about to install a new leaf entry into the existing page table.
And with a guest permission fault, the memcache pages are also not needed
for a write_fault during dirty logging if KVM is only about to update
the existing leaf entry instead of collapsing a block entry into a table.

By comparing fault_granule and vma_pagesize, cases that require allocations
from memcache and cases that don't can be distinguished completely.

Signed-off-by: Yanan Wang 
---
  arch/arm64/kvm/mmu.c | 25 -
  1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1eec9f63bc6f..05af40dc60c1 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -810,19 +810,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
gfn = fault_ipa >> PAGE_SHIFT;
mmap_read_unlock(current->mm);
  
-	/*

-* Permission faults just need to update the existing leaf entry,
-* and so normally don't require allocations from the memcache. The
-* only exception to this is when dirty logging is enabled at runtime
-* and a write fault needs to collapse a block entry into a table.
-*/
-   if (fault_status != FSC_PERM || (logging_active && write_fault)) {
-   ret = kvm_mmu_topup_memory_cache(memcache,
-kvm_mmu_cache_min_pages(kvm));
-   if (ret)
-   return ret;
-   }
-
mmu_seq = vcpu->kvm->mmu_notifier_seq;
/*
 * Ensure the read of mmu_notifier_seq happens before we call
@@ -880,6 +867,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
prot |= KVM_PGTABLE_PROT_X;
  
+	/*

+* Allocations from the memcache are required only when granule of the
+* lookup level where the guest fault happened exceeds vma_pagesize,
+* which means new page tables will be created in the fault handlers.
+*/
+   if (fault_granule > vma_pagesize) {
+   ret = kvm_mmu_topup_memory_cache(memcache,
+kvm_mmu_cache_min_pages(kvm));
+   if (ret)
+   return ret;
+   }

As I explained in v1 [1], this looks correct to me. I still think that someone
else should have a look, but if Marc decides to pick up this patch as-is, he can
add my Reviewed-by: Alexandru Elisei .

Thanks again for this, Alex!

Hi Marc, Will,
Any thoughts about this patch?

Thanks,
Yanan

[1] https://lore.kernel.org/lkml/2c65bff2-be7f-b20c-9265-939bc7318...@arm.com/

Thanks,

Alex


+
/*
 * Under the premise of getting a FSC_PERM fault, we just need to relax
 * permissions only if vma_pagesize equals fault_granule. Otherwise,

.


Re: [RFC PATCH v3 1/2] KVM: arm64: Move CMOs from user_mem_abort to the fault handlers

2021-04-08 Thread wangyanan (Y)

Hi Alex,

On 2021/4/7 23:31, Alexandru Elisei wrote:

Hi Yanan,

On 3/26/21 3:16 AM, Yanan Wang wrote:

We currently uniformly perform CMOs of D-cache and I-cache in function
user_mem_abort before calling the fault handlers. If we get concurrent
guest faults (e.g. translation faults, permission faults) or some really
unnecessary guest faults caused by BBM, CMOs for the first vcpu are
I can't figure out what BBM means.

Just as Will has explained, it's the Break-Before-Make rule. When we need to
replace an old table entry with a new one, we should first invalidate the
old table entry (Break) before installation of the new entry (Make), roughly
as sketched below.
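
/*
 * Illustrative only: the rough Break-Before-Make sequence for replacing a
 * stage-2 table entry with a block entry. The helper names follow the
 * pgtable.c helpers quoted elsewhere in this thread, but this is a sketch,
 * not the real code.
 */
static void bbm_replace_with_block(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu,
				   kvm_pte_t new)
{
	/* Break: make the old entry invalid so the walker can't use it. */
	kvm_set_invalid_pte(ptep);

	/* Invalidate any TLB entries that were derived from the old entry. */
	kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);

	/* Make: only now install the replacement (block) entry. */
	WRITE_ONCE(*ptep, new);
}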

And I think this patch mainly introduces benefits in two specific scenarios:
1) During VM startup, it will improve the efficiency of handling page faults
incurred by vCPUs when initially populating the stage2 page tables.
2) After live migration, the heavy workload will be resumed on the destination
VMs, but all the stage2 page tables need to be rebuilt.

necessary while the others later are not.

By moving CMOs to the fault handlers, we can easily identify the conditions
where they are really needed and avoid the unnecessary ones. As performing
CMOs is a time-consuming process, especially when flushing a block range,
this solution reduces much of the load on KVM and improves the efficiency
of the page table code.

So let's move both the clean of D-cache and the invalidation of I-cache to
the map path, and move only the invalidation of I-cache to the permission
path. Since the original APIs for CMOs in mmu.c are only called in function
user_mem_abort, we now also move them to pgtable.c.

Signed-off-by: Yanan Wang 
---
  arch/arm64/include/asm/kvm_mmu.h | 31 ---
  arch/arm64/kvm/hyp/pgtable.c | 68 +---
  arch/arm64/kvm/mmu.c | 23 ++-
  3 files changed, 57 insertions(+), 65 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 90873851f677..c31f88306d4e 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -177,37 +177,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu 
*vcpu)
return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
  }
  
-static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)

-{
-   void *va = page_address(pfn_to_page(pfn));
-
-   /*
-* With FWB, we ensure that the guest always accesses memory using
-* cacheable attributes, and we don't have to clean to PoC when
-* faulting in pages. Furthermore, FWB implies IDC, so cleaning to
-* PoU is not required either in this case.
-*/
-   if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
-   return;
-
-   kvm_flush_dcache_to_poc(va, size);
-}
-
-static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
- unsigned long size)
-{
-   if (icache_is_aliasing()) {
-   /* any kind of VIPT cache */
-   __flush_icache_all();
-   } else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
-   /* PIPT or VPIPT at EL2 (see comment in 
__kvm_tlb_flush_vmid_ipa) */
-   void *va = page_address(pfn_to_page(pfn));
-
-   invalidate_icache_range((unsigned long)va,
-   (unsigned long)va + size);
-   }
-}
-
  void kvm_set_way_flush(struct kvm_vcpu *vcpu);
  void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);
  
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c

index 4d177ce1d536..829a34eea526 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -464,6 +464,43 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot 
prot,
return 0;
  }
  
+static bool stage2_pte_cacheable(kvm_pte_t pte)

+{
+   u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
+   return memattr == PAGE_S2_MEMATTR(NORMAL);
+}
+
+static bool stage2_pte_executable(kvm_pte_t pte)
+{
+   return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN);
+}
+
+static void stage2_flush_dcache(void *addr, u64 size)
+{
+   /*
+* With FWB, we ensure that the guest always accesses memory using
+* cacheable attributes, and we don't have to clean to PoC when
+* faulting in pages. Furthermore, FWB implies IDC, so cleaning to
+* PoU is not required either in this case.
+*/
+   if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
+   return;
+
+   __flush_dcache_area(addr, size);
+}
+
+static void stage2_invalidate_icache(void *addr, u64 size)
+{
+   if (icache_is_aliasing()) {
+   /* Flush any kind of VIPT icache */
+   __flush_icache_all();
+   } else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
+   /* PIPT or VPIPT at EL2 */
+   invalidate_icache_range((unsigned long)addr,
+

Re: [PATCH v6 00/10] KVM: selftests: some improvement and a new test for kvm page table

2021-04-05 Thread wangyanan (Y)

Kindly ping...

Hi Paolo,
Will this series be picked up soon, or is there any other work for me to do?

Regards,
Yanan


On 2021/3/30 16:08, Yanan Wang wrote:

Hi,
This v6 series mainly includes two parts.
Rebased on kvm queue branch: 
https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=queue

In the first part, all the known hugetlb backing src types specified
with different hugepage sizes are listed, so that we can specify use
of a hugetlb source of the exact granularity that we want, instead of
the system default ones. And as all the known hugetlb page sizes are
listed, it's appropriate for all architectures. Besides, a helper that
can get the granularity of different backing src types (anonymous/thp/hugetlb)
is added, so that we can use the accurate backing src granularity for
various kinds of alignment or guest memory accesses by the vcpus, as
sketched below.
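
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical usage sketch (not taken from the series): stride through a
 * test region using the granularity reported for the chosen backing src
 * type, e.g. the value returned by get_backing_src_pagesz().
 */
static void touch_test_mem(void *mem, size_t size, size_t backing_pagesz)
{
	uint8_t *p = mem;
	size_t offset;

	for (offset = 0; offset < size; offset += backing_pagesz)
		p[offset] = 0;	/* one access per backing page */
}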

In the second part, a new test is added:
This test is added to serve as a performance tester and a bug reproducer
for kvm page table code (GPA->HPA mappings), so it gives guidance for
people trying to make some improvement for kvm. The following explains
what exactly we can do through this test.

The function guest_code() can cover the conditions where a single vcpu or
multiple vcpus access guest pages within the same memory region, in three
VM stages (before dirty logging, during dirty logging, after dirty logging).
Besides, the backing src memory type (ANONYMOUS/THP/HUGETLB) of the tested
memory region can be specified by users, which means normal page mappings
or block mappings can be chosen by users to be created in the test.

If ANONYMOUS memory is specified, kvm will create normal page mappings
for the tested memory region before dirty logging, and update attributes
of the page mappings from RO to RW during dirty logging. If THP/HUGETLB
memory is specified, kvm will create block mappings for the tested memory
region before dirty logging, split the block mappings into normal page
mappings during dirty logging, and coalesce the page mappings back into
block mappings after dirty logging is stopped.

So in summary, as a performance tester, this test can present the
performance of kvm creating/updating normal page mappings, or the
performance of kvm creating/splitting/recovering block mappings,
through execution time.

When we need to coalesce the page mappings back to block mappings after
dirty logging is stopped, we have to first invalidate *all* the TLB
entries for the page mappings right before installation of the block entry,
because a TLB conflict abort error could occur if we can't invalidate the
TLB entries fully. We have hit this TLB conflict twice on an aarch64 software
implementation and fixed it. As this test can simulate the process of a VM
with block mappings going from dirty logging enabled to dirty logging stopped,
it can also reproduce this TLB conflict abort due to inadequate TLB
invalidation when coalescing tables.

Links about the TLB conflict abort:
https://lore.kernel.org/lkml/20201201201034.116760-3-wangyana...@huawei.com/

---

Change logs:

v5->v6:
- Address Andrew Jones's comments for v5 series
- Add Andrew Jones's R-b tags in some patches
- Rebased on newest kvm/queue tree
- v5: 
https://lore.kernel.org/lkml/20210323135231.24948-1-wangyana...@huawei.com/

v4->v5:
- Use synchronization(sem_wait) for time measurement
- Add a new patch about TEST_ASSERT(patch 4)
- Address Andrew Jones's comments for v4 series
- Add Andrew Jones's R-b tags in some patches
- v4: 
https://lore.kernel.org/lkml/20210302125751.19080-1-wangyana...@huawei.com/

v3->v4:
- Add a helper to get system default hugetlb page size
- Add tags of Reviewed-by of Ben in the patches
- v3: 
https://lore.kernel.org/lkml/20210301065916.11484-1-wangyana...@huawei.com/

v2->v3:
- Add tags of Suggested-by, Reviewed-by in the patches
- Add a generic macro to get hugetlb page sizes
- Some changes for suggestions about v2 series
- v2: 
https://lore.kernel.org/lkml/20210225055940.18748-1-wangyana...@huawei.com/

v1->v2:
- Add a patch to sync header files
- Add helpers to get granularity of different backing src types
- Some changes for suggestions about v1 series
- v1: 
https://lore.kernel.org/lkml/20210208090841.333724-1-wangyana...@huawei.com/

---

Yanan Wang (10):
   tools headers: sync headers of asm-generic/hugetlb_encode.h
   mm/hugetlb: Add a macro to get HUGETLB page sizes for mmap
   KVM: selftests: Use flag CLOCK_MONOTONIC_RAW for timing
   KVM: selftests: Print the errno besides error-string in TEST_ASSERT
   KVM: selftests: Make a generic helper to get vm guest mode strings
   KVM: selftests: Add a helper to get system configured THP page size
   KVM: selftests: Add a helper to get system default hugetlb page size
   KVM: selftests: List all hugetlb src types specified with page sizes
   KVM: selftests: Adapt vm_userspace_mem_region_add to new helpers
   KVM: selftests: Add a test for kvm page table code

  include/uapi/linux/mman.h |   2 +
  

Re: [RFC PATCH v5 10/10] KVM: selftests: Add a test for kvm page table code

2021-03-29 Thread wangyanan (Y)

Hi Drew,
Thanks for having a look.
On 2021/3/29 19:38, Andrew Jones wrote:

On Tue, Mar 23, 2021 at 09:52:31PM +0800, Yanan Wang wrote:

This test serves as a performance tester and a bug reproducer for
kvm page table code (GPA->HPA mappings), so it gives guidance for
people trying to make some improvement for kvm.

The function guest_code() can cover the conditions where a single vcpu or
multiple vcpus access guest pages within the same memory region, in three
VM stages(before dirty logging, during dirty logging, after dirty logging).
Besides, the backing src memory type(ANONYMOUS/THP/HUGETLB) of the tested
memory region can be specified by users, which means normal page mappings
or block mappings can be chosen by users to be created in the test.

If ANONYMOUS memory is specified, kvm will create normal page mappings
for the tested memory region before dirty logging, and update attributes
of the page mappings from RO to RW during dirty logging. If THP/HUGETLB
memory is specified, kvm will create block mappings for the tested memory
region before dirty logging, and split the block mappings into normal page
mappings during dirty logging, and coalesce the page mappings back into
block mappings after dirty logging is stopped.

So in summary, as a performance tester, this test can present the
performance of kvm creating/updating normal page mappings, or the
performance of kvm creating/splitting/recovering block mappings,
through execution time.

When we need to coalesce the page mappings back to block mappings after
dirty logging is stopped, we have to firstly invalidate *all* the TLB
entries for the page mappings right before installation of the block entry,
because a TLB conflict abort error could occur if we can't invalidate the
TLB entries fully. We have hit this TLB conflict twice on aarch64 software
implementation and fixed it. As this test can simulate the process of a VM
with block mappings going from dirty logging enabled to dirty logging stopped,
it can also reproduce this TLB conflict abort due to inadequate TLB
invalidation when coalescing tables.

Signed-off-by: Yanan Wang 
Reviewed-by: Ben Gardon 
---
  tools/testing/selftests/kvm/.gitignore|   1 +
  tools/testing/selftests/kvm/Makefile  |   3 +
  .../selftests/kvm/kvm_page_table_test.c   | 512 ++
  3 files changed, 516 insertions(+)
  create mode 100644 tools/testing/selftests/kvm/kvm_page_table_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore 
b/tools/testing/selftests/kvm/.gitignore
index 32b87cc77c8e..137ab7273be6 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -35,6 +35,7 @@
  /dirty_log_perf_test
  /hardware_disable_test
  /kvm_create_max_vcpus
+/kvm_page_table_test
  /memslot_modification_stress_test
  /set_memory_region_test
  /steal_time
diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index a6d61f451f88..75dc57db36b4 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -69,6 +69,7 @@ TEST_GEN_PROGS_x86_64 += dirty_log_test
  TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
  TEST_GEN_PROGS_x86_64 += hardware_disable_test
  TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
+TEST_GEN_PROGS_x86_64 += kvm_page_table_test
  TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test
  TEST_GEN_PROGS_x86_64 += set_memory_region_test
  TEST_GEN_PROGS_x86_64 += steal_time
@@ -79,6 +80,7 @@ TEST_GEN_PROGS_aarch64 += demand_paging_test
  TEST_GEN_PROGS_aarch64 += dirty_log_test
  TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
  TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
+TEST_GEN_PROGS_aarch64 += kvm_page_table_test
  TEST_GEN_PROGS_aarch64 += set_memory_region_test
  TEST_GEN_PROGS_aarch64 += steal_time
  
@@ -88,6 +90,7 @@ TEST_GEN_PROGS_s390x += s390x/sync_regs_test

  TEST_GEN_PROGS_s390x += demand_paging_test
  TEST_GEN_PROGS_s390x += dirty_log_test
  TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
+TEST_GEN_PROGS_s390x += kvm_page_table_test
  TEST_GEN_PROGS_s390x += set_memory_region_test
  
  TEST_GEN_PROGS += $(TEST_GEN_PROGS_$(UNAME_M))

diff --git a/tools/testing/selftests/kvm/kvm_page_table_test.c 
b/tools/testing/selftests/kvm/kvm_page_table_test.c
new file mode 100644
index ..bbd5c489d61f
--- /dev/null
+++ b/tools/testing/selftests/kvm/kvm_page_table_test.c
@@ -0,0 +1,512 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM page table test
+ *
+ * Copyright (C) 2021, Huawei, Inc.
+ *
+ * Make sure that THP has been enabled or enough HUGETLB pages with specific
+ * page size have been pre-allocated on your system, if you are planning to
+ * use hugepages to back the guest memory for testing.
+ */
+
+#define _GNU_SOURCE /* for program_invocation_name */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "guest_modes.h"
+
+#define TEST_MEM_SLOT_INDEX 1
+
+/* 

Re: [RFC PATCH 4/4] KVM: arm64: Distinguish cases of memcache allocations completely

2021-03-25 Thread wangyanan (Y)

Hi Alex,

On 2021/3/26 1:26, Alexandru Elisei wrote:

Hi Yanan,

On 2/8/21 11:22 AM, Yanan Wang wrote:

With a guest translation fault, the memcache pages are not needed if KVM
is only about to install a new leaf entry into the existing page table.
And with a guest permission fault, the memcache pages are also not needed
for a write_fault during dirty logging if KVM is only about to update
the existing leaf entry instead of collapsing a block entry into a table.

By comparing fault_granule and vma_pagesize, cases that require allocations
from memcache and cases that don't can be distinguished completely.

Signed-off-by: Yanan Wang 
---
  arch/arm64/kvm/mmu.c | 25 -
  1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d151927a7d62..550498a9104e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -815,19 +815,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
gfn = fault_ipa >> PAGE_SHIFT;
mmap_read_unlock(current->mm);
  
-	/*

-* Permission faults just need to update the existing leaf entry,
-* and so normally don't require allocations from the memcache. The
-* only exception to this is when dirty logging is enabled at runtime
-* and a write fault needs to collapse a block entry into a table.
-*/
-   if (fault_status != FSC_PERM || (logging_active && write_fault)) {
-   ret = kvm_mmu_topup_memory_cache(memcache,
-kvm_mmu_cache_min_pages(kvm));
-   if (ret)
-   return ret;
-   }
-
mmu_seq = vcpu->kvm->mmu_notifier_seq;
/*
 * Ensure the read of mmu_notifier_seq happens before we call
@@ -887,6 +874,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
prot |= KVM_PGTABLE_PROT_X;
  
+	/*

+* Allocations from the memcache are required only when granule of the
+* lookup level where the guest fault happened exceeds vma_pagesize,
+* which means new page tables will be created in the fault handlers.
+*/
+   if (fault_granule > vma_pagesize) {
+   ret = kvm_mmu_topup_memory_cache(memcache,
+kvm_mmu_cache_min_pages(kvm));
+   if (ret)
+   return ret;
+   }

I distinguish three situations:

1. fault_granule == vma_pagesize. If the stage 2 fault occurs at the leaf level,
then it means that all the tables that the translation table walker traversed
until the leaf are valid. No need to allocate a new page, as stage 2 will only
change the leaf to point to a valid PA.

2. fault_granule > vma_pagesize. This means that there's a table missing at some
point in the table walk, so we're going to need to allocate at least one table 
to
hold the leaf entry. We need to topup the memory cache.

3. fault_granule < vma_pagesize. From our discussion in patch #3, this can 
happen
only if the userspace translation tables use a block mapping, dirty page logging
is enabled, the fault_ipa is mapped as a last level entry, dirty page logging 
gets
disabled and then we get a fault. In this case, the PTE table will be coalesced
into a PMD block mapping, and the PMD table entry that pointed to the PTE table
will be changed to a block mapping. No table will be allocated.

Looks to me like this patch is valid, but getting it wrong can break a VM and I
would feel a lot more comfortable if someone who is more familiar with the code
would have a look.
Thanks for your explanation here. The above is also how I understood
this patch.
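
To keep the three cases next to the code, here is the hunk above again with
the reasoning condensed into a comment (the code itself is unchanged; the
annotation is only a summary of the discussion):

	/*
	 * fault_granule == vma_pagesize: the leaf already exists at the right
	 *     level and only that leaf entry is updated -> no topup needed.
	 * fault_granule >  vma_pagesize: at least one table level is missing
	 *     and must be allocated by the fault handler -> topup the memcache.
	 * fault_granule <  vma_pagesize: page mappings are coalesced back into
	 *     a block, reusing the existing table entry  -> no topup needed.
	 */
	if (fault_granule > vma_pagesize) {
		ret = kvm_mmu_topup_memory_cache(memcache,
						 kvm_mmu_cache_min_pages(kvm));
		if (ret)
			return ret;
	}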


Thanks,
Yanan


Thanks,

Alex


+
/*
 * Under the premise of getting a FSC_PERM fault, we just need to relax
 * permissions only if vma_pagesize equals fault_granule. Otherwise,

.


Re: [RFC PATCH v5 10/10] KVM: selftests: Add a test for kvm page table code

2021-03-23 Thread wangyanan (Y)

Hi Drew,

BTW, any thoughts about the change in this patch? :)

Thanks,
Yanan
On 2021/3/23 21:52, Yanan Wang wrote:

This test serves as a performance tester and a bug reproducer for
kvm page table code (GPA->HPA mappings), so it gives guidance for
people trying to make some improvement for kvm.

The function guest_code() can cover the conditions where a single vcpu or
multiple vcpus access guest pages within the same memory region, in three
VM stages(before dirty logging, during dirty logging, after dirty logging).
Besides, the backing src memory type(ANONYMOUS/THP/HUGETLB) of the tested
memory region can be specified by users, which means normal page mappings
or block mappings can be chosen by users to be created in the test.

If ANONYMOUS memory is specified, kvm will create normal page mappings
for the tested memory region before dirty logging, and update attributes
of the page mappings from RO to RW during dirty logging. If THP/HUGETLB
memory is specified, kvm will create block mappings for the tested memory
region before dirty logging, and split the block mappings into normal page
mappings during dirty logging, and coalesce the page mappings back into
block mappings after dirty logging is stopped.

So in summary, as a performance tester, this test can present the
performance of kvm creating/updating normal page mappings, or the
performance of kvm creating/splitting/recovering block mappings,
through execution time.

When we need to coalesce the page mappings back to block mappings after
dirty logging is stopped, we have to firstly invalidate *all* the TLB
entries for the page mappings right before installation of the block entry,
because a TLB conflict abort error could occur if we can't invalidate the
TLB entries fully. We have hit this TLB conflict twice on aarch64 software
implementation and fixed it. As this test can simulate the process of a VM
with block mappings going from dirty logging enabled to dirty logging stopped,
it can also reproduce this TLB conflict abort due to inadequate TLB
invalidation when coalescing tables.

Signed-off-by: Yanan Wang 
Reviewed-by: Ben Gardon 
---
  tools/testing/selftests/kvm/.gitignore|   1 +
  tools/testing/selftests/kvm/Makefile  |   3 +
  .../selftests/kvm/kvm_page_table_test.c   | 512 ++
  3 files changed, 516 insertions(+)
  create mode 100644 tools/testing/selftests/kvm/kvm_page_table_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore 
b/tools/testing/selftests/kvm/.gitignore
index 32b87cc77c8e..137ab7273be6 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -35,6 +35,7 @@
  /dirty_log_perf_test
  /hardware_disable_test
  /kvm_create_max_vcpus
+/kvm_page_table_test
  /memslot_modification_stress_test
  /set_memory_region_test
  /steal_time
diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index a6d61f451f88..75dc57db36b4 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -69,6 +69,7 @@ TEST_GEN_PROGS_x86_64 += dirty_log_test
  TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
  TEST_GEN_PROGS_x86_64 += hardware_disable_test
  TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
+TEST_GEN_PROGS_x86_64 += kvm_page_table_test
  TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test
  TEST_GEN_PROGS_x86_64 += set_memory_region_test
  TEST_GEN_PROGS_x86_64 += steal_time
@@ -79,6 +80,7 @@ TEST_GEN_PROGS_aarch64 += demand_paging_test
  TEST_GEN_PROGS_aarch64 += dirty_log_test
  TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
  TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
+TEST_GEN_PROGS_aarch64 += kvm_page_table_test
  TEST_GEN_PROGS_aarch64 += set_memory_region_test
  TEST_GEN_PROGS_aarch64 += steal_time
  
@@ -88,6 +90,7 @@ TEST_GEN_PROGS_s390x += s390x/sync_regs_test

  TEST_GEN_PROGS_s390x += demand_paging_test
  TEST_GEN_PROGS_s390x += dirty_log_test
  TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
+TEST_GEN_PROGS_s390x += kvm_page_table_test
  TEST_GEN_PROGS_s390x += set_memory_region_test
  
  TEST_GEN_PROGS += $(TEST_GEN_PROGS_$(UNAME_M))

diff --git a/tools/testing/selftests/kvm/kvm_page_table_test.c 
b/tools/testing/selftests/kvm/kvm_page_table_test.c
new file mode 100644
index ..bbd5c489d61f
--- /dev/null
+++ b/tools/testing/selftests/kvm/kvm_page_table_test.c
@@ -0,0 +1,512 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM page table test
+ *
+ * Copyright (C) 2021, Huawei, Inc.
+ *
+ * Make sure that THP has been enabled or enough HUGETLB pages with specific
+ * page size have been pre-allocated on your system, if you are planning to
+ * use hugepages to back the guest memory for testing.
+ */
+
+#define _GNU_SOURCE /* for program_invocation_name */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "guest_modes.h"
+
+#define TEST_MEM_SLOT_INDEX 1
+
+/* Default size(1GB) of 

Re: [RFC PATCH v5 00/10] KVM: selftests: some improvement and a new test for kvm page table

2021-03-23 Thread wangyanan (Y)



On 2021/3/23 23:58, Sean Christopherson wrote:

On Tue, Mar 23, 2021, Yanan Wang wrote:

Hi,
This v5 series mainly includes two parts.
Based on kvm queue branch: 
https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=queue

Given the number of Reviewed-by tags, I'm pretty sure you can drop the "RFC" :-)

Ah, yes! Will drop it in v6, where I hope everything will be fine. :)

Thanks,
Yanan

.


Re: [RFC PATCH v5 02/10] tools headers: Add a macro to get HUGETLB page sizes for mmap

2021-03-23 Thread wangyanan (Y)



On 2021/3/23 22:03, Andrew Jones wrote:

$SUBJECT says "tools headers", but this is actually changing
a UAPI header and then copying the change to tools.

Indeed. I think the head of the subject should be "mm/hugetlb".
I will fix it.

Thanks,
Yanan

Thanks,
drew

On Tue, Mar 23, 2021 at 09:52:23PM +0800, Yanan Wang wrote:

We know that if a system supports multiple hugetlb page sizes,
the desired hugetlb page size can be specified in bits [26:31]
of the flag arguments. The value in these 6 bits will be the
shift of each hugetlb page size.

So add a macro to get the page size shift and then calculate the
corresponding hugetlb page size, using flag x.
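
As a quick sanity check of the encoding that the macro below implements, a
small standalone illustration with the relevant constants inlined (the
values match the standard MAP_HUGE_* encodings, but this snippet is not
part of the patch):

#include <stdio.h>

/* Same encoding as uapi/linux/mman.h: the shift is stored in bits [26:31]. */
#define MAP_HUGE_SHIFT	26
#define MAP_HUGE_MASK	0x3f
#define MAP_HUGE_2MB	(21U << MAP_HUGE_SHIFT)
#define MAP_HUGE_1GB	(30U << MAP_HUGE_SHIFT)

#define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))

int main(void)
{
	printf("%llu\n", MAP_HUGE_PAGE_SIZE(MAP_HUGE_2MB)); /* 2097152 (2MB) */
	printf("%llu\n", MAP_HUGE_PAGE_SIZE(MAP_HUGE_1GB)); /* 1073741824 (1GB) */
	return 0;
}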

Cc: Ben Gardon 
Cc: Ingo Molnar 
Cc: Adrian Hunter 
Cc: Jiri Olsa 
Cc: Arnaldo Carvalho de Melo 
Cc: Arnd Bergmann 
Cc: Michael Kerrisk 
Cc: Thomas Gleixner 
Suggested-by: Ben Gardon 
Signed-off-by: Yanan Wang 
Reviewed-by: Ben Gardon 
---
  include/uapi/linux/mman.h   | 2 ++
  tools/include/uapi/linux/mman.h | 2 ++
  2 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index f55bc680b5b0..d72df73b182d 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -41,4 +41,6 @@
  #define MAP_HUGE_2GB  HUGETLB_FLAG_ENCODE_2GB
  #define MAP_HUGE_16GB HUGETLB_FLAG_ENCODE_16GB
  
+#define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))

+
  #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
index f55bc680b5b0..d72df73b182d 100644
--- a/tools/include/uapi/linux/mman.h
+++ b/tools/include/uapi/linux/mman.h
@@ -41,4 +41,6 @@
  #define MAP_HUGE_2GB  HUGETLB_FLAG_ENCODE_2GB
  #define MAP_HUGE_16GB HUGETLB_FLAG_ENCODE_16GB
  
+#define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))

+
  #endif /* _UAPI_LINUX_MMAN_H */
--
2.19.1


.


Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings

2021-03-22 Thread wangyanan (Y)

Hi Alex,

On 2021/3/19 23:07, Alexandru Elisei wrote:

Hi Yanan,

Sorry for taking so long to reply, been busy with other things unfortunately.

Still appreciate your patient reply! :)

I
did notice that you sent a new version of this series, but I would like to
continue our discussion on this patch, since it's easier to get the full 
context.

On 3/4/21 7:07 AM, wangyanan (Y) wrote:

Hi Alex,

On 2021/3/4 1:27, Alexandru Elisei wrote:

Hi Yanan,

On 3/3/21 11:04 AM, wangyanan (Y) wrote:

Hi Alex,

On 2021/3/3 1:13, Alexandru Elisei wrote:

Hello,

On 2/8/21 11:22 AM, Yanan Wang wrote:

When KVM needs to coalesce the normal page mappings into a block mapping,
we currently invalidate the old table entry first followed by invalidation
of TLB, then unmap the page mappings, and install the block entry at last.

It will take a long time to unmap the numerous page mappings, which means
there will be a long period when the table entry can be found invalid.
If other vCPUs access any guest page within the block range and find the
table entry invalid, they will all exit from the guest with a translation fault
which is not necessary. And KVM will make efforts to handle these faults,
especially when performing CMOs by block range.

So let's quickly install the block entry at first to ensure uninterrupted
memory access of the other vCPUs, and then unmap the page mappings after
installation. This will reduce most of the time when the table entry is
invalid, and avoid most of the unnecessary translation faults.

I'm not convinced I've fully understood what is going on yet, but it seems to me
that the idea is sound. Some questions and comments below.

What I am trying to do in this patch is to adjust the order of rebuilding block
mappings from page mappings.
Take the rebuilding of 1G block mappings as an example.
Before this patch, the order is like:
1) invalidate the table entry of the 1st level(PUD)
2) flush TLB by VMID
3) unmap the old PMD/PTE tables
4) install the new block entry to the 1st level(PUD)

So the entry in the 1st level can be found invalid by other vcpus in 1), 2), and 3),
and it takes a long time in 3) to unmap the numerous old PMD/PTE tables, which
means the total time the entry is invalid is long enough to affect the performance.

After this patch, the order is like:
1) invalidate the table entry of the 1st level(PUD)
2) flush TLB by VMID
3) install the new block entry to the 1st level(PUD)
4) unmap the old PMD/PTE tables

The change ensures that the period when the entry in the 1st level(PUD) is invalid
covers only 1) and 2), so if other vcpus access memory within 1G, there will be
less chance of finding the entry invalid and, as a result, triggering an
unnecessary translation fault.

Thank you for the explanation, that was my understanding of it also, and I believe
your idea is correct. I was more concerned that I got some of the details wrong,
and you have kindly corrected me below.


Signed-off-by: Yanan Wang 
---
    arch/arm64/kvm/hyp/pgtable.c | 26 --
    1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 78a560446f80..308c36b9cd21 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -434,6 +434,7 @@ struct stage2_map_data {
    kvm_pte_t    attr;
      kvm_pte_t    *anchor;
+    kvm_pte_t    *follow;
      struct kvm_s2_mmu    *mmu;
    struct kvm_mmu_memory_cache    *memcache;
@@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end,
u32 level,
    if (!kvm_block_mapping_supported(addr, end, data->phys, level))
    return 0;
    -    kvm_set_invalid_pte(ptep);
-
    /*
- * Invalidate the whole stage-2, as we may have numerous leaf
- * entries below us which would otherwise need invalidating
- * individually.
+ * If we need to coalesce existing table entries into a block here,
+ * then install the block entry first and the sub-level page mappings
+ * will be unmapped later.
     */
-    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
    data->anchor = ptep;
+    data->follow = kvm_pte_follow(*ptep);
+    stage2_coalesce_tables_into_block(addr, level, ptep, data);

Here's how stage2_coalesce_tables_into_block() is implemented from the previous
patch (it might be worth merging it with this patch, I found it impossible to
judge if the function is correct without seeing how it is used and what is
replacing):

Ok, will do this if a v2 is going to be posted.

static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
                     kvm_pte_t *ptep,
                     struct stage2_map_data *data)
{
   u64 granule = kvm_granule_size(level), phys = data->phys;
   kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);

   kvm_set_invalid_pte(ptep);

   /*
    * Invalidate the whole stage-2, as we may have num

Re: [RFC PATCH v4 9/9] KVM: selftests: Add a test for kvm page table code

2021-03-22 Thread wangyanan (Y)



On 2021/3/12 22:20, Andrew Jones wrote:

On Tue, Mar 02, 2021 at 08:57:51PM +0800, Yanan Wang wrote:

This test serves as a performance tester and a bug reproducer for
kvm page table code (GPA->HPA mappings), so it gives guidance for
people trying to make some improvement for kvm.

The function guest_code() can cover the conditions where a single vcpu or
multiple vcpus access guest pages within the same memory region, in three
VM stages(before dirty logging, during dirty logging, after dirty logging).
Besides, the backing src memory type(ANONYMOUS/THP/HUGETLB) of the tested
memory region can be specified by users, which means normal page mappings
or block mappings can be chosen by users to be created in the test.

If ANONYMOUS memory is specified, kvm will create normal page mappings
for the tested memory region before dirty logging, and update attributes
of the page mappings from RO to RW during dirty logging. If THP/HUGETLB
memory is specified, kvm will create block mappings for the tested memory
region before dirty logging, and split the block mappings into normal page
mappings during dirty logging, and coalesce the page mappings back into
block mappings after dirty logging is stopped.

So in summary, as a performance tester, this test can present the
performance of kvm creating/updating normal page mappings, or the
performance of kvm creating/splitting/recovering block mappings,
through execution time.

When we need to coalesce the page mappings back to block mappings after
dirty logging is stopped, we have to firstly invalidate *all* the TLB
entries for the page mappings right before installation of the block entry,
because a TLB conflict abort error could occur if we can't invalidate the
TLB entries fully. We have hit this TLB conflict twice on aarch64 software
implementation and fixed it. As this test can simulate the process of a VM
with block mappings going from dirty logging enabled to dirty logging stopped,
it can also reproduce this TLB conflict abort due to inadequate TLB
invalidation when coalescing tables.

Signed-off-by: Yanan Wang 
Reviewed-by: Ben Gardon 
---
  tools/testing/selftests/kvm/Makefile  |   3 +
  .../selftests/kvm/kvm_page_table_test.c   | 476 ++
  2 files changed, 479 insertions(+)
  create mode 100644 tools/testing/selftests/kvm/kvm_page_table_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index a6d61f451f88..bac81924166d 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -67,6 +67,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/xen_vmcall_test
  TEST_GEN_PROGS_x86_64 += demand_paging_test
  TEST_GEN_PROGS_x86_64 += dirty_log_test
  TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
+TEST_GEN_PROGS_x86_64 += kvm_page_table_test
  TEST_GEN_PROGS_x86_64 += hardware_disable_test
  TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
  TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test
@@ -78,6 +79,7 @@ TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list-sve
  TEST_GEN_PROGS_aarch64 += demand_paging_test
  TEST_GEN_PROGS_aarch64 += dirty_log_test
  TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
+TEST_GEN_PROGS_aarch64 += kvm_page_table_test
  TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
  TEST_GEN_PROGS_aarch64 += set_memory_region_test
  TEST_GEN_PROGS_aarch64 += steal_time
@@ -87,6 +89,7 @@ TEST_GEN_PROGS_s390x += s390x/resets
  TEST_GEN_PROGS_s390x += s390x/sync_regs_test
  TEST_GEN_PROGS_s390x += demand_paging_test
  TEST_GEN_PROGS_s390x += dirty_log_test
+TEST_GEN_PROGS_s390x += kvm_page_table_test
  TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
  TEST_GEN_PROGS_s390x += set_memory_region_test

Please add these three lines in alphabetic order. Also we're missing
the .gitignore entry.

Will fix.
  
diff --git a/tools/testing/selftests/kvm/kvm_page_table_test.c b/tools/testing/selftests/kvm/kvm_page_table_test.c

new file mode 100644
index ..032b49d1483b
--- /dev/null
+++ b/tools/testing/selftests/kvm/kvm_page_table_test.c
@@ -0,0 +1,476 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM page table test
+ *
+ * Copyright (C) 2021, Huawei, Inc.
+ *
+ * Make sure that THP has been enabled or enough HUGETLB pages with specific
+ * page size have been pre-allocated on your system, if you are planning to
+ * use hugepages to back the guest memory for testing.
+ */
+
+#define _GNU_SOURCE /* for program_invocation_name */
+
+#include 
+#include 
+#include 
+#include 
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "guest_modes.h"
+
+#define TEST_MEM_SLOT_INDEX 1
+
+/* Default size(1GB) of the memory for testing */
+#define DEFAULT_TEST_MEM_SIZE  (1 << 30)
+
+/* Default guest test virtual memory offset */
+#define DEFAULT_GUEST_TEST_MEM 0xc000
+
+/* Number of guest memory accessing types(read/write) */
+#define NUM_ACCESS_TYPES   2

This define doesn't really seem necessary.

Agreed!

+
+/* 

Re: [RFC PATCH v4 7/9] KVM: selftests: List all hugetlb src types specified with page sizes

2021-03-22 Thread wangyanan (Y)



On 2021/3/12 20:02, Andrew Jones wrote:

On Tue, Mar 02, 2021 at 08:57:49PM +0800, Yanan Wang wrote:

With VM_MEM_SRC_ANONYMOUS_HUGETLB, we currently can only use system
default hugetlb pages to back the testing guest memory. In order to
add flexibility, now list all the known hugetlb backing src types with
different page sizes, so that we can specify use of hugetlb pages of the
exact granularity that we want. And as all the known hugetlb page sizes
are listed, it's appropriate for all architectures.

Besides, the helper get_backing_src_pagesz() is added to get the
granularity of different backing src types (anonymous, thp, hugetlb).

Suggested-by: Ben Gardon 
Signed-off-by: Yanan Wang 
---
  .../testing/selftests/kvm/include/test_util.h | 18 +-
  tools/testing/selftests/kvm/lib/kvm_util.c|  2 +-
  tools/testing/selftests/kvm/lib/test_util.c   | 59 +++
  3 files changed, 66 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/test_util.h 
b/tools/testing/selftests/kvm/include/test_util.h
index e087174eefe5..fade3130eb01 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -71,16 +71,32 @@ enum vm_mem_backing_src_type {
VM_MEM_SRC_ANONYMOUS,
VM_MEM_SRC_ANONYMOUS_THP,
VM_MEM_SRC_ANONYMOUS_HUGETLB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_16KB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_64KB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_512KB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_1MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_8MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_16MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_32MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_256MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_512MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_2GB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_16GB,
+   NUM_SRC_TYPES,
  };
  
  struct vm_mem_backing_src_alias {

const char *name;
-   enum vm_mem_backing_src_type type;
+   uint32_t flag;
  };
  
  bool thp_configured(void);

  size_t get_trans_hugepagesz(void);
  size_t get_def_hugetlb_pagesz(void);
+const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i);
+size_t get_backing_src_pagesz(uint32_t i);
  void backing_src_help(void);
  enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
  
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c

index cc22c4ab7d67..b91c8e3a7ee1 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -757,7 +757,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
region->mmap_start = mmap(NULL, region->mmap_size,
  PROT_READ | PROT_WRITE,
  MAP_PRIVATE | MAP_ANONYMOUS
- | (src_type == VM_MEM_SRC_ANONYMOUS_HUGETLB ? 
MAP_HUGETLB : 0),
+ | vm_mem_backing_src_alias(src_type)->flag,
  -1, 0);
TEST_ASSERT(region->mmap_start != MAP_FAILED,
"test_malloc failed, mmap_start: %p errno: %i",
diff --git a/tools/testing/selftests/kvm/lib/test_util.c 
b/tools/testing/selftests/kvm/lib/test_util.c
index 80d68dbd72d2..df8a42eff1f8 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -11,6 +11,7 @@
  #include 
  #include 
  #include 
+#include 
  #include "linux/kernel.h"
  
  #include "test_util.h"

@@ -112,12 +113,6 @@ void print_skip(const char *fmt, ...)
puts(", skipping test");
  }
  
-const struct vm_mem_backing_src_alias backing_src_aliases[] = {

-   {"anonymous", VM_MEM_SRC_ANONYMOUS,},
-   {"anonymous_thp", VM_MEM_SRC_ANONYMOUS_THP,},
-   {"anonymous_hugetlb", VM_MEM_SRC_ANONYMOUS_HUGETLB,},
-};
-
  bool thp_configured(void)
  {
int ret;
@@ -180,22 +175,64 @@ size_t get_def_hugetlb_pagesz(void)
return 0;
  }
  
+const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i)

+{
+   static const struct vm_mem_backing_src_alias aliases[] = {
+   { "anonymous",   0},
+   { "anonymous_thp",   0},
+   { "anonymous_hugetlb",   MAP_HUGETLB  },
+   { "anonymous_hugetlb_16kb",  MAP_HUGETLB | MAP_HUGE_16KB  },
+   { "anonymous_hugetlb_64kb",  MAP_HUGETLB | MAP_HUGE_64KB  },
+   { "anonymous_hugetlb_512kb", MAP_HUGETLB | MAP_HUGE_512KB },
+   { "anonymous_hugetlb_1mb",   MAP_HUGETLB | MAP_HUGE_1MB   },
+   { "anonymous_hugetlb_2mb",   MAP_HUGETLB | MAP_HUGE_2MB   },
+   { "anonymous_hugetlb_8mb",   MAP_HUGETLB | MAP_HUGE_8MB   },
+   { "anonymous_hugetlb_16mb",  MAP_HUGETLB | 

Re: [RFC PATCH v4 7/9] KVM: selftests: List all hugetlb src types specified with page sizes

2021-03-22 Thread wangyanan (Y)



On 2021/3/12 19:49, Andrew Jones wrote:

On Tue, Mar 02, 2021 at 08:57:49PM +0800, Yanan Wang wrote:

With VM_MEM_SRC_ANONYMOUS_HUGETLB, we currently can only use system
default hugetlb pages to back the testing guest memory. In order to
add flexibility, now list all the known hugetlb backing src types with
different page sizes, so that we can specify use of hugetlb pages of the
exact granularity that we want. And as all the known hugetlb page sizes
are listed, it's appropriate for all architectures.

Besides, the helper get_backing_src_pagesz() is added to get the
granularity of different backing src types (anonymous, thp, hugetlb).

Suggested-by: Ben Gardon 
Signed-off-by: Yanan Wang 
---
  .../testing/selftests/kvm/include/test_util.h | 18 +-
  tools/testing/selftests/kvm/lib/kvm_util.c|  2 +-
  tools/testing/selftests/kvm/lib/test_util.c   | 59 +++
  3 files changed, 66 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/test_util.h 
b/tools/testing/selftests/kvm/include/test_util.h
index e087174eefe5..fade3130eb01 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -71,16 +71,32 @@ enum vm_mem_backing_src_type {
VM_MEM_SRC_ANONYMOUS,
VM_MEM_SRC_ANONYMOUS_THP,
VM_MEM_SRC_ANONYMOUS_HUGETLB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_16KB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_64KB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_512KB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_1MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_8MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_16MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_32MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_256MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_512MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_2GB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_16GB,
+   NUM_SRC_TYPES,
  };
  
  struct vm_mem_backing_src_alias {

const char *name;
-   enum vm_mem_backing_src_type type;
+   uint32_t flag;
  };
  
  bool thp_configured(void);

  size_t get_trans_hugepagesz(void);
  size_t get_def_hugetlb_pagesz(void);
+const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i);
+size_t get_backing_src_pagesz(uint32_t i);
  void backing_src_help(void);
  enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
  
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c

index cc22c4ab7d67..b91c8e3a7ee1 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -757,7 +757,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
region->mmap_start = mmap(NULL, region->mmap_size,
  PROT_READ | PROT_WRITE,
  MAP_PRIVATE | MAP_ANONYMOUS
- | (src_type == VM_MEM_SRC_ANONYMOUS_HUGETLB ? 
MAP_HUGETLB : 0),
+ | vm_mem_backing_src_alias(src_type)->flag,
  -1, 0);
TEST_ASSERT(region->mmap_start != MAP_FAILED,
"test_malloc failed, mmap_start: %p errno: %i",
diff --git a/tools/testing/selftests/kvm/lib/test_util.c 
b/tools/testing/selftests/kvm/lib/test_util.c
index 80d68dbd72d2..df8a42eff1f8 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -11,6 +11,7 @@
  #include 
  #include 
  #include 
+#include 
  #include "linux/kernel.h"
  
  #include "test_util.h"

@@ -112,12 +113,6 @@ void print_skip(const char *fmt, ...)
puts(", skipping test");
  }
  
-const struct vm_mem_backing_src_alias backing_src_aliases[] = {

-   {"anonymous", VM_MEM_SRC_ANONYMOUS,},
-   {"anonymous_thp", VM_MEM_SRC_ANONYMOUS_THP,},
-   {"anonymous_hugetlb", VM_MEM_SRC_ANONYMOUS_HUGETLB,},
-};
-
  bool thp_configured(void)
  {
int ret;
@@ -180,22 +175,64 @@ size_t get_def_hugetlb_pagesz(void)
return 0;
  }
  
+const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i)

+{
+   static const struct vm_mem_backing_src_alias aliases[] = {
+   { "anonymous",   0},
+   { "anonymous_thp",   0},
+   { "anonymous_hugetlb",   MAP_HUGETLB  },
+   { "anonymous_hugetlb_16kb",  MAP_HUGETLB | MAP_HUGE_16KB  },
+   { "anonymous_hugetlb_64kb",  MAP_HUGETLB | MAP_HUGE_64KB  },
+   { "anonymous_hugetlb_512kb", MAP_HUGETLB | MAP_HUGE_512KB },
+   { "anonymous_hugetlb_1mb",   MAP_HUGETLB | MAP_HUGE_1MB   },
+   { "anonymous_hugetlb_2mb",   MAP_HUGETLB | MAP_HUGE_2MB   },
+   { "anonymous_hugetlb_8mb",   MAP_HUGETLB | MAP_HUGE_8MB   },
+   { "anonymous_hugetlb_16mb",  MAP_HUGETLB | 

Re: [RFC PATCH v4 6/9] KVM: selftests: Add a helper to get system default hugetlb page size

2021-03-22 Thread wangyanan (Y)



On 2021/3/12 19:40, Andrew Jones wrote:

On Tue, Mar 02, 2021 at 08:57:48PM +0800, Yanan Wang wrote:

If HUGETLB is configured in the host kernel, then we can know the system
default hugetlb page size through *cat /proc/meminfo*; otherwise, no hugetlb
page information will appear in /proc/meminfo. So add a helper to determine
whether HUGETLB is configured and then get the default page size by reading
/proc/meminfo.

This helper can be useful when a program wants to use the default hugetlb
pages of the system and doesn't know the default page size.

Signed-off-by: Yanan Wang 
---
  .../testing/selftests/kvm/include/test_util.h |  1 +
  tools/testing/selftests/kvm/lib/test_util.c   | 27 +++
  2 files changed, 28 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/test_util.h 
b/tools/testing/selftests/kvm/include/test_util.h
index ef24c76ba89a..e087174eefe5 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -80,6 +80,7 @@ struct vm_mem_backing_src_alias {
  
  bool thp_configured(void);

  size_t get_trans_hugepagesz(void);
+size_t get_def_hugetlb_pagesz(void);
  void backing_src_help(void);
  enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
  
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c

index f2d133f76c67..80d68dbd72d2 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -153,6 +153,33 @@ size_t get_trans_hugepagesz(void)
return size;
  }
  
+size_t get_def_hugetlb_pagesz(void)

+{
+   char buf[64];
+   const char *tag = "Hugepagesize:";
+   FILE *f;
+
+   f = fopen("/proc/meminfo", "r");
+   TEST_ASSERT(f != NULL, "Error in opening /proc/meminfo: %d", errno);
+
+   while (fgets(buf, sizeof(buf), f) != NULL) {
+   if (strstr(buf, tag) == buf) {
+   fclose(f);
+   return strtoull(buf + strlen(tag), NULL, 10) << 10;
+   }
+   }
+
+   if (feof(f)) {
+   fclose(f);
+   TEST_FAIL("HUGETLB is not configured in host kernel");
+   } else {
+   fclose(f);
+   TEST_FAIL("Error in reading /proc/meminfo: %d", errno);
+   }

fclose() can be factored out.


+
+   return 0;
+}
+
  void backing_src_help(void)
  {
int i;
--
2.23.0


Besides the fclose comment and the same errno comment as the previous
patch

I will fix it and add your R-b in this patch.
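
For reference, one possible shape of the factored-out version (just a sketch,
assuming the existing test_util.c includes; it also folds the feof()/read-error
distinction into a single assert):

size_t get_def_hugetlb_pagesz(void)
{
	char buf[64];
	const char *tag = "Hugepagesize:";
	size_t size = 0;
	FILE *f;

	f = fopen("/proc/meminfo", "r");
	TEST_ASSERT(f != NULL, "Error in opening /proc/meminfo: %d", errno);

	while (fgets(buf, sizeof(buf), f) != NULL) {
		if (strstr(buf, tag) == buf) {
			size = strtoull(buf + strlen(tag), NULL, 10) << 10;
			break;
		}
	}
	/* Single exit point, so fclose() appears only once. */
	fclose(f);

	TEST_ASSERT(size, "HUGETLB is not configured in host kernel");
	return size;
}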

Thanks,
Yanan

Reviewed-by: Andrew Jones 

.


Re: [RFC PATCH v4 5/9] KVM: selftests: Add a helper to get system configured THP page size

2021-03-22 Thread wangyanan (Y)

Hi Drew,

Thanks for your attention to this series!
On 2021/3/12 19:31, Andrew Jones wrote:

On Tue, Mar 02, 2021 at 08:57:47PM +0800, Yanan Wang wrote:

If we want to have some tests about transparent hugepages, the system
configured THP hugepage size should be known to the tests, so that it can
be used for various kinds of alignment or guest memory accesses by the vcpus...
So it makes sense to add a helper to get the transparent hugepage size.

With VM_MEM_SRC_ANONYMOUS_THP specified in vm_userspace_mem_region_add(),
we now stat /sys/kernel/mm/transparent_hugepage to check whether THP is
configured in the host kernel before madvise(). Based on this, we can also
read file /sys/kernel/mm/transparent_hugepage/hpage_pmd_size to get THP
hugepage size.

Signed-off-by: Yanan Wang 
Reviewed-by: Ben Gardon 
---
  .../testing/selftests/kvm/include/test_util.h |  2 ++
  tools/testing/selftests/kvm/lib/test_util.c   | 36 +++
  2 files changed, 38 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/test_util.h 
b/tools/testing/selftests/kvm/include/test_util.h
index b7f41399f22c..ef24c76ba89a 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -78,6 +78,8 @@ struct vm_mem_backing_src_alias {
enum vm_mem_backing_src_type type;
  };
  
+bool thp_configured(void);

+size_t get_trans_hugepagesz(void);
  void backing_src_help(void);
  enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
  
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c

index c7c0627c6842..f2d133f76c67 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -10,6 +10,7 @@
  #include 
  #include 
  #include 
+#include 
  #include "linux/kernel.h"
  
  #include "test_util.h"

@@ -117,6 +118,41 @@ const struct vm_mem_backing_src_alias 
backing_src_aliases[] = {
{"anonymous_hugetlb", VM_MEM_SRC_ANONYMOUS_HUGETLB,},
  };
  
+bool thp_configured(void)

+{
+   int ret;
+   struct stat statbuf;
+
+   ret = stat("/sys/kernel/mm/transparent_hugepage", &statbuf);
+   TEST_ASSERT(ret == 0 || (ret == -1 && errno == ENOENT),
+   "Error in stating /sys/kernel/mm/transparent_hugepage: %d",
+   errno);

TEST_ASSERT will already output errno's string. Is that not sufficient? If
not, I think extending TEST_ASSERT to output errno too would be fine.
I think it's a good idea to output the errno together with its string
in TEST_ASSERT. It explicitly indicates that the string is error
information, and the errno is much easier to use for debugging than
the string alone. I will make this change a separate patch in the
next version and add your S-b tag.
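
The idea in test_assert() would be roughly along these lines (only a
sketch of the output, not the actual patch; the exact message format is
still to be decided):

	/* print the raw errno alongside its string form */
	fprintf(stderr, "  errno: %d (%s)\n", errno, strerror(errno));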

+
+   return ret == 0;
+}
+
+size_t get_trans_hugepagesz(void)
+{
+   size_t size;
+   char buf[16];
+   FILE *f;
+
+   TEST_ASSERT(thp_configured(), "THP is not configured in host kernel");
+
+   f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");
+   TEST_ASSERT(f != NULL,
+   "Error in opening transparent_hugepage/hpage_pmd_size: %d",
+   errno);

Same comment as above.


+
+   if (fread(buf, sizeof(char), sizeof(buf), f) == 0) {
+   fclose(f);
+   TEST_FAIL("Unable to read transparent_hugepage/hpage_pmd_size");
+   }
+
+   size = strtoull(buf, NULL, 10);

fscanf with %lld?

This makes sense. But it should be %ld to match size_t.
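
For example (a quick sketch, not the final patch):

size_t get_trans_hugepagesz(void)
{
	long size;
	FILE *f;

	TEST_ASSERT(thp_configured(), "THP is not configured in host kernel");

	f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");
	TEST_ASSERT(f != NULL,
		    "Error in opening transparent_hugepage/hpage_pmd_size: %d",
		    errno);

	/* hpage_pmd_size is the PMD hugepage size in bytes as a decimal string */
	TEST_ASSERT(fscanf(f, "%ld", &size) == 1,
		    "Unable to read transparent_hugepage/hpage_pmd_size");

	fclose(f);
	return size;
}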

Thanks,
Yanan.

+   return size;
+}
+
  void backing_src_help(void)
  {
int i;
--
2.23.0


Thanks,
drew

.


Re: [RFC PATCH v4 2/9] tools headers: Add a macro to get HUGETLB page sizes for mmap

2021-03-14 Thread wangyanan (Y)



On 2021/3/12 19:14, Andrew Jones wrote:

On Tue, Mar 02, 2021 at 08:57:44PM +0800, Yanan Wang wrote:

We know that if a system supports multiple hugetlb page sizes,
the desired hugetlb page size can be specified in bits [26:31]
of the flag arguments. The value in these 6 bits will be the
shift of each hugetlb page size.

So add a macro to get the page size shift and then calculate the
corresponding hugetlb page size, using flag x.

Cc: Ben Gardon 
Cc: Ingo Molnar 
Cc: Adrian Hunter 
Cc: Jiri Olsa 
Cc: Arnaldo Carvalho de Melo 
Cc: Arnd Bergmann 
Cc: Michael Kerrisk 
Cc: Thomas Gleixner 
Suggested-by: Ben Gardon 
Signed-off-by: Yanan Wang 
Reviewed-by: Ben Gardon 
---
  include/uapi/linux/mman.h   | 2 ++
  tools/include/uapi/linux/mman.h | 2 ++
  2 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index f55bc680b5b0..8bd41128a0ee 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -41,4 +41,6 @@
  #define MAP_HUGE_2GB  HUGETLB_FLAG_ENCODE_2GB
  #define MAP_HUGE_16GB HUGETLB_FLAG_ENCODE_16GB
  
+#define MAP_HUGE_PAGE_SIZE(x) (1 << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))

Needs to be '1ULL' to avoid shift overflow when given MAP_HUGE_16GB.

Thanks, drew. Will fix it.
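
i.e. something like this, I suppose (sketch of the fix):

#define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))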

Thanks,
drew


+
  #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
index f55bc680b5b0..8bd41128a0ee 100644
--- a/tools/include/uapi/linux/mman.h
+++ b/tools/include/uapi/linux/mman.h
@@ -41,4 +41,6 @@
  #define MAP_HUGE_2GB  HUGETLB_FLAG_ENCODE_2GB
  #define MAP_HUGE_16GB HUGETLB_FLAG_ENCODE_16GB
  
+#define MAP_HUGE_PAGE_SIZE(x) (1 << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))

+
  #endif /* _UAPI_LINUX_MMAN_H */
--
2.23.0


.


Re: [RFC PATCH v4 0/9] KVM: selftests: some improvement and a new test for kvm page table

2021-03-11 Thread wangyanan (Y)

Hi all,

Kindly ping :)!

Are there any further comments for this v4 series? Please let me know if
there is still something that needs fixing.

Or is this v4 series fine enough to be queued? Most of the patches already
have Reviewed-by tags. If there are merge conflicts with the newest branch,
please also let me know and I will send a new version with the conflicts fixed.

Regards,
Yanan

On 2021/3/2 20:57, Yanan Wang wrote:

Hi,
This v4 series can mainly include two parts.
Based on kvm queue branch: 
https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=queue
Links of v1: 
https://lore.kernel.org/lkml/20210208090841.333724-1-wangyana...@huawei.com/
Links of v2: 
https://lore.kernel.org/lkml/20210225055940.18748-1-wangyana...@huawei.com/
Links of v3: 
https://lore.kernel.org/lkml/20210301065916.11484-1-wangyana...@huawei.com/

In the first part, all the known hugetlb backing src types specified
with different hugepage sizes are listed, so that we can specify use
of hugetlb source of the exact granularity that we want, instead of
the system default ones. And as all the known hugetlb page sizes are
listed, it's appropriate for all architectures. Besides, a helper that
can get the granularity of different backing src types (anonymous/thp/hugetlb)
is added, so that we can use the accurate backing src granularity for
various alignments and for guest memory accesses by the vcpus.

In the second part, a new test is added:
This test is added to serve as a performance tester and a bug reproducer
for kvm page table code (GPA->HPA mappings), it gives guidance for the
people trying to make some improvement for kvm. And the following explains
what we can exactly do through this test.

The function guest_code() can cover the conditions where a single vcpu or
multiple vcpus access guest pages within the same memory region, in three
VM stages(before dirty logging, during dirty logging, after dirty logging).
Besides, the backing src memory type(ANONYMOUS/THP/HUGETLB) of the tested
memory region can be specified by users, which means normal page mappings
or block mappings can be chosen by users to be created in the test.

If ANONYMOUS memory is specified, kvm will create normal page mappings
for the tested memory region before dirty logging, and update attributes
of the page mappings from RO to RW during dirty logging. If THP/HUGETLB
memory is specified, kvm will create block mappings for the tested memory
region before dirty logging, and split the block mappings into normal page
mappings during dirty logging, and coalesce the page mappings back into
block mappings after dirty logging is stopped.

So in summary, as a performance tester, this test can present the
performance of kvm creating/updating normal page mappings, or the
performance of kvm creating/splitting/recovering block mappings,
through execution time.

When we need to coalesce the page mappings back to block mappings after
dirty logging is stopped, we have to firstly invalidate *all* the TLB
entries for the page mappings right before installation of the block entry,
because a TLB conflict abort error could occur if we can't invalidate the
TLB entries fully. We have hit this TLB conflict twice on aarch64 software
implementation and fixed it. As this test can simulate the process from dirty
logging enabled to dirty logging stopped for a VM with block mappings,
it can also reproduce this TLB conflict abort due to inadequate TLB
invalidation when coalescing tables.

Links about the TLB conflict abort:
https://lore.kernel.org/lkml/20201201201034.116760-3-wangyana...@huawei.com/

---

Change logs:

v3->v4:
- Add a helper to get system default hugetlb page size
- Add tags of Reviewed-by of Ben in the patches

v2->v3:
- Add tags of Suggested-by, Reviewed-by in the patches
- Add a generic micro to get hugetlb page sizes
- Some changes for suggestions about v2 series

v1->v2:
- Add a patch to sync header files
- Add helpers to get granularity of different backing src types
- Some changes for suggestions about v1 series

---

Yanan Wang (9):
   tools headers: sync headers of asm-generic/hugetlb_encode.h
   tools headers: Add a macro to get HUGETLB page sizes for mmap
   KVM: selftests: Use flag CLOCK_MONOTONIC_RAW for timing
   KVM: selftests: Make a generic helper to get vm guest mode strings
   KVM: selftests: Add a helper to get system configured THP page size
   KVM: selftests: Add a helper to get system default hugetlb page size
   KVM: selftests: List all hugetlb src types specified with page sizes
   KVM: selftests: Adapt vm_userspace_mem_region_add to new helpers
   KVM: selftests: Add a test for kvm page table code

  include/uapi/linux/mman.h |   2 +
  tools/include/asm-generic/hugetlb_encode.h|   3 +
  tools/include/uapi/linux/mman.h   |   2 +
  tools/testing/selftests/kvm/Makefile  |   3 +
  .../selftests/kvm/demand_paging_test.c|   8 +-
  .../selftests/kvm/dirty_log_perf_test.c   |  14 +-
  

Re: [PATCH 2/2] KVM: arm64: Skip the cache flush when coalescing tables into a block

2021-03-09 Thread wangyanan (Y)



On 2021/3/9 16:43, Marc Zyngier wrote:

On Tue, 09 Mar 2021 08:34:43 +,
"wangyanan (Y)"  wrote:


On 2021/3/9 0:34, Will Deacon wrote:

On Mon, Jan 25, 2021 at 10:10:44PM +0800, Yanan Wang wrote:

After dirty-logging is stopped for a VM configured with huge mappings,
KVM will recover the table mappings back to block mappings. As we only
replace the existing page tables with a block entry and the cacheability
has not been changed, the cache maintenance operations can be skipped.

Signed-off-by: Yanan Wang 
---
   arch/arm64/kvm/mmu.c | 12 +---
   1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8e8549ea1d70..37b427dcbc4f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -744,7 +744,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
   {
int ret = 0;
bool write_fault, writable, force_pte = false;
-   bool exec_fault;
+   bool exec_fault, adjust_hugepage;
bool device = false;
unsigned long mmu_seq;
struct kvm *kvm = vcpu->kvm;
@@ -872,12 +872,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
mark_page_dirty(kvm, gfn);
}
   -if (fault_status != FSC_PERM && !device)
+   /*
+* There is no necessity to perform cache maintenance operations if we
+* will only replace the existing table mappings with a block mapping.
+*/
+   adjust_hugepage = fault_granule < vma_pagesize ? true : false;

nit: you don't need the '? true : false' part

That said, your previous patch checks for 'fault_granule > vma_pagesize',
so I'm not sure the local variable helps all that much here because it
obscures the size checks in my opinion. It would be more straight-forward
if we could structure the logic as:


if (fault_granule < vma_pagesize) {

} else if (fault_granule > vma_page_size) {

} else {

}

With some comments describing what we can infer about the memcache and cache
maintenance requirements for each case.

Thanks for your suggestion here, Will.
But I have recently posted a newer series [1] (KVM: arm64: Improve
efficiency of stage2 page table),
which has the same theme but different solutions that I
think are better.
[1]
https://lore.kernel.org/lkml/20210208112250.163568-1-wangyana...@huawei.com/

Could you please comment on that series ?  I think it can be found in
your inbox :).

There were already a bunch of comments on that series, and I stopped
at the point where the cache maintenance was broken. Please respin
that series if you want further feedback on it.

Ok, I will send a new version later.


In the future, if you deprecate a series (which is completely
understandable), please leave a note on the list with a pointer to the
new series so that people don't waste time reviewing an obsolete
series. Or post the new series with a new version number so that it is
obvious that the original series has been superseded.

I apologize for this, I will be more careful in the future.

Thanks,

Yanan

Thanks,

M.



Re: [PATCH 2/2] KVM: arm64: Skip the cache flush when coalescing tables into a block

2021-03-09 Thread wangyanan (Y)



On 2021/3/9 0:34, Will Deacon wrote:

On Mon, Jan 25, 2021 at 10:10:44PM +0800, Yanan Wang wrote:

After dirty-logging is stopped for a VM configured with huge mappings,
KVM will recover the table mappings back to block mappings. As we only
replace the existing page tables with a block entry and the cacheability
has not been changed, the cache maintenance operations can be skipped.

Signed-off-by: Yanan Wang 
---
  arch/arm64/kvm/mmu.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8e8549ea1d70..37b427dcbc4f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -744,7 +744,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
  {
int ret = 0;
bool write_fault, writable, force_pte = false;
-   bool exec_fault;
+   bool exec_fault, adjust_hugepage;
bool device = false;
unsigned long mmu_seq;
struct kvm *kvm = vcpu->kvm;
@@ -872,12 +872,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
mark_page_dirty(kvm, gfn);
}
  
-	if (fault_status != FSC_PERM && !device)

+   /*
+* There is no necessity to perform cache maintenance operations if we
+* will only replace the existing table mappings with a block mapping.
+*/
+   adjust_hugepage = fault_granule < vma_pagesize ? true : false;

nit: you don't need the '? true : false' part

That said, your previous patch checks for 'fault_granule > vma_pagesize',
so I'm not sure the local variable helps all that much here because it
obscures the size checks in my opinion. It would be more straight-forward
if we could structure the logic as:


if (fault_granule < vma_pagesize) {

} else if (fault_granule > vma_page_size) {

} else {

}

With some comments describing what we can infer about the memcache and cache
maintenance requirements for each case.

Thanks for your suggestion here, Will.
But I have recently posted a newer series [1] (KVM: arm64: Improve
efficiency of stage2 page table),
which has the same theme but different solutions that I think
are better.
[1] 
https://lore.kernel.org/lkml/20210208112250.163568-1-wangyana...@huawei.com/


Could you please comment on that series ?  I think it can be found in 
your inbox :).


Thanks,

Yanan


Will
.


Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings

2021-03-03 Thread wangyanan (Y)



On 2021/3/4 15:07, wangyanan (Y) wrote:

Hi Alex,

On 2021/3/4 1:27, Alexandru Elisei wrote:

Hi Yanan,

On 3/3/21 11:04 AM, wangyanan (Y) wrote:

Hi Alex,

On 2021/3/3 1:13, Alexandru Elisei wrote:

Hello,

On 2/8/21 11:22 AM, Yanan Wang wrote:
When KVM needs to coalesce the normal page mappings into a block mapping,
we currently invalidate the old table entry first followed by invalidation
of TLB, then unmap the page mappings, and install the block entry at last.

It will cost a long time to unmap the numerous page mappings, which means
there will be a long period when the table entry can be found invalid.
If other vCPUs access any guest page within the block range and find the
table entry invalid, they will all exit from guest with a translation fault
which is not necessary. And KVM will make efforts to handle these faults,
especially when performing CMOs by block range.

So let's quickly install the block entry at first to ensure uninterrupted
memory access of the other vCPUs, and then unmap the page mappings after
installation. This will reduce most of the time when the table entry is
invalid, and avoid most of the unnecessary translation faults.
I'm not convinced I've fully understood what is going on yet, but it seems
to me that the idea is sound. Some questions and comments below.

What I am trying to do in this patch is to adjust the order of rebuilding
block mappings from page mappings.
Take the rebuilding of 1G block mappings as an example.
Before this patch, the order is like:
1) invalidate the table entry of the 1st level(PUD)
2) flush TLB by VMID
3) unmap the old PMD/PTE tables
4) install the new block entry to the 1st level(PUD)

So the entry in the 1st level can be found invalid by other vcpus in 1), 2),
and 3), and it's a long time in 3) to unmap the numerous old PMD/PTE tables,
which means the total time of the entry being invalid is long enough to
affect the performance.

After this patch, the order is like:
1) invalidate the table entry of the 1st level(PUD)
2) flush TLB by VMID
3) install the new block entry to the 1st level(PUD)
4) unmap the old PMD/PTE tables

The change ensures that the period of the entry in the 1st level(PUD) being
invalid is only in 1) and 2), so if other vcpus access memory within 1G,
there will be less chance to find the entry invalid, and as a result trigger
an unnecessary translation fault.
Thank you for the explanation, that was my understanding of it also, and
I believe your idea is correct. I was more concerned that I got some of
the details wrong, and you have kindly corrected me below.


Signed-off-by: Yanan Wang 
---
   arch/arm64/kvm/hyp/pgtable.c | 26 --
   1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c 
b/arch/arm64/kvm/hyp/pgtable.c

index 78a560446f80..308c36b9cd21 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -434,6 +434,7 @@ struct stage2_map_data {
   kvm_pte_t    attr;
     kvm_pte_t    *anchor;
+    kvm_pte_t    *follow;
     struct kvm_s2_mmu    *mmu;
   struct kvm_mmu_memory_cache    *memcache;
@@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 
addr, u64 end,

u32 level,
   if (!kvm_block_mapping_supported(addr, end, data->phys, 
level))

   return 0;
   -    kvm_set_invalid_pte(ptep);
-
   /*
- * Invalidate the whole stage-2, as we may have numerous leaf
- * entries below us which would otherwise need invalidating
- * individually.
+ * If we need to coalesce existing table entries into a block 
here,
+ * then install the block entry first and the sub-level page 
mappings

+ * will be unmapped later.
    */
-    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
   data->anchor = ptep;
+    data->follow = kvm_pte_follow(*ptep);
+    stage2_coalesce_tables_into_block(addr, level, ptep, data);
Here's how stage2_coalesce_tables_into_block() is implemented from 
the previous
patch (it might be worth merging it with this patch, I found it 
impossible to
judge if the function is correct without seeing how it is used and 
what is

replacing):

OK, will do this if a v2 is going to be posted.

static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
                    kvm_pte_t *ptep,
                    struct stage2_map_data *data)
{
  u64 granule = kvm_granule_size(level), phys = data->phys;
  kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, 
level);


  kvm_set_invalid_pte(ptep);

  /*
   * Invalidate the whole stage-2, as we may have numerous leaf 
entries
   * below us which would otherwise need invalidating 
individually.

   */
  kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
  smp_store_release(ptep, new);
  data->phys += granule;
}

This works because __kvm_pgtable_visit() saves the *ptep value 
before calling the
pre callback

Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings

2021-03-03 Thread wangyanan (Y)

Hi Alex,

On 2021/3/4 1:27, Alexandru Elisei wrote:

Hi Yanan,

On 3/3/21 11:04 AM, wangyanan (Y) wrote:

Hi Alex,

On 2021/3/3 1:13, Alexandru Elisei wrote:

Hello,

On 2/8/21 11:22 AM, Yanan Wang wrote:

When KVM needs to coalesce the normal page mappings into a block mapping,
we currently invalidate the old table entry first followed by invalidation
of TLB, then unmap the page mappings, and install the block entry at last.

It will cost a long time to unmap the numerous page mappings, which means
there will be a long period when the table entry can be found invalid.
If other vCPUs access any guest page within the block range and find the
table entry invalid, they will all exit from guest with a translation fault
which is not necessary. And KVM will make efforts to handle these faults,
especially when performing CMOs by block range.

So let's quickly install the block entry at first to ensure uninterrupted
memory access of the other vCPUs, and then unmap the page mappings after
installation. This will reduce most of the time when the table entry is
invalid, and avoid most of the unnecessary translation faults.

I'm not convinced I've fully understood what is going on yet, but it seems to me
that the idea is sound. Some questions and comments below.

What I am trying to do in this patch is to adjust the order of rebuilding block
mappings from page mappings.
Take the rebuilding of 1G block mappings as an example.
Before this patch, the order is like:
1) invalidate the table entry of the 1st level(PUD)
2) flush TLB by VMID
3) unmap the old PMD/PTE tables
4) install the new block entry to the 1st level(PUD)

So entry in the 1st level can be found invalid by other vcpus in 1), 2), and 3),
and it's a long time in 3) to unmap
the numerous old PMD/PTE tables, which means the total time of the entry being
invalid is long enough to
affect the performance.

After this patch, the order is like:
1) invalidate the table entry of the 1st level(PUD)
2) flush TLB by VMID
3) install the new block entry to the 1st level(PUD)
4) unmap the old PMD/PTE tables

The change ensures that period of entry in the 1st level(PUD) being invalid is
only in 1) and 2),
so if other vcpus access memory within 1G, there will be less chance to find the
entry invalid
and as a result trigger an unnecessary translation fault.

Thank you for the explanation, that was my understanding of it also, and I believe
your idea is correct. I was more concerned that I got some of the details wrong,
and you have kindly corrected me below.


Signed-off-by: Yanan Wang 
---
   arch/arm64/kvm/hyp/pgtable.c | 26 --
   1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 78a560446f80..308c36b9cd21 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -434,6 +434,7 @@ struct stage2_map_data {
   kvm_pte_t    attr;
     kvm_pte_t    *anchor;
+    kvm_pte_t    *follow;
     struct kvm_s2_mmu    *mmu;
   struct kvm_mmu_memory_cache    *memcache;
@@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end,
u32 level,
   if (!kvm_block_mapping_supported(addr, end, data->phys, level))
   return 0;
   -    kvm_set_invalid_pte(ptep);
-
   /*
- * Invalidate the whole stage-2, as we may have numerous leaf
- * entries below us which would otherwise need invalidating
- * individually.
+ * If we need to coalesce existing table entries into a block here,
+ * then install the block entry first and the sub-level page mappings
+ * will be unmapped later.
    */
-    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
   data->anchor = ptep;
+    data->follow = kvm_pte_follow(*ptep);
+    stage2_coalesce_tables_into_block(addr, level, ptep, data);

Here's how stage2_coalesce_tables_into_block() is implemented from the previous
patch (it might be worth merging it with this patch, I found it impossible to
judge if the function is correct without seeing how it is used and what is
replacing):

OK, will do this if a v2 is going to be posted.

static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
                    kvm_pte_t *ptep,
                    struct stage2_map_data *data)
{
  u64 granule = kvm_granule_size(level), phys = data->phys;
  kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);

  kvm_set_invalid_pte(ptep);

  /*
   * Invalidate the whole stage-2, as we may have numerous leaf entries
   * below us which would otherwise need invalidating individually.
   */
  kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
  smp_store_release(ptep, new);
  data->phys += granule;
}

This works because __kvm_pgtable_visit() saves the *ptep value before calling 
the
pre callback, and it visits the next level table based on the initial pte value,
n

Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings

2021-03-03 Thread wangyanan (Y)

Hi Alex,

On 2021/3/3 1:13, Alexandru Elisei wrote:

Hello,

On 2/8/21 11:22 AM, Yanan Wang wrote:

When KVM needs to coalesce the normal page mappings into a block mapping,
we currently invalidate the old table entry first followed by invalidation
of TLB, then unmap the page mappings, and install the block entry at last.

It will cost a long time to unmap the numerous page mappings, which means
there will be a long period when the table entry can be found invalid.
If other vCPUs access any guest page within the block range and find the
table entry invalid, they will all exit from guest with a translation fault
which is not necessary. And KVM will make efforts to handle these faults,
especially when performing CMOs by block range.

So let's quickly install the block entry at first to ensure uninterrupted
memory access of the other vCPUs, and then unmap the page mappings after
installation. This will reduce most of the time when the table entry is
invalid, and avoid most of the unnecessary translation faults.

I'm not convinced I've fully understood what is going on yet, but it seems to me
that the idea is sound. Some questions and comments below.
What I am trying to do in this patch is to adjust the order of 
rebuilding block mappings from page mappings.

Take the rebuilding of 1G block mappings as an example.
Before this patch, the order is like:
1) invalidate the table entry of the 1st level(PUD)
2) flush TLB by VMID
3) unmap the old PMD/PTE tables
4) install the new block entry to the 1st level(PUD)

So entry in the 1st level can be found invalid by other vcpus in 1), 2), 
and 3), and it's a long time in 3) to unmap
the numerous old PMD/PTE tables, which means the total time of the entry 
being invalid is long enough to

affect the performance.

After this patch, the order is like:
1) invalidate the table entry of the 1st level(PUD)
2) flush TLB by VMID
3) install the new block entry to the 1st level(PUD)
4) unmap the old PMD/PTE tables

The change ensures that period of entry in the 1st level(PUD) being 
invalid is only in 1) and 2),
so if other vcpus access memory within 1G, there will be less chance to 
find the entry invalid

and as a result trigger an unnecessary translation fault.

Signed-off-by: Yanan Wang 
---
  arch/arm64/kvm/hyp/pgtable.c | 26 --
  1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 78a560446f80..308c36b9cd21 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -434,6 +434,7 @@ struct stage2_map_data {
kvm_pte_t   attr;
  
  	kvm_pte_t			*anchor;

+   kvm_pte_t   *follow;
  
  	struct kvm_s2_mmu		*mmu;

struct kvm_mmu_memory_cache *memcache;
@@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, 
u32 level,
if (!kvm_block_mapping_supported(addr, end, data->phys, level))
return 0;
  
-	kvm_set_invalid_pte(ptep);

-
/*
-* Invalidate the whole stage-2, as we may have numerous leaf
-* entries below us which would otherwise need invalidating
-* individually.
+* If we need to coalesce existing table entries into a block here,
+* then install the block entry first and the sub-level page mappings
+* will be unmapped later.
 */
-   kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
data->anchor = ptep;
+   data->follow = kvm_pte_follow(*ptep);
+   stage2_coalesce_tables_into_block(addr, level, ptep, data);

Here's how stage2_coalesce_tables_into_block() is implemented from the previous
patch (it might be worth merging it with this patch, I found it impossible to
judge if the function is correct without seeing how it is used and what is 
replacing):

OK, will do this if a v2 is going to be posted.

static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
                       kvm_pte_t *ptep,
                       struct stage2_map_data *data)
{
     u64 granule = kvm_granule_size(level), phys = data->phys;
     kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);

     kvm_set_invalid_pte(ptep);

     /*
      * Invalidate the whole stage-2, as we may have numerous leaf entries
      * below us which would otherwise need invalidating individually.
      */
     kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
     smp_store_release(ptep, new);
     data->phys += granule;
}

This works because __kvm_pgtable_visit() saves the *ptep value before calling 
the
pre callback, and it visits the next level table based on the initial pte value,
not the new value written by stage2_coalesce_tables_into_block().
Right. So before replacing the initial pte value with the new value, we 
have to use
*data->follow = kvm_pte_follow(*ptep)* in stage2_map_walk_table_pre() to 
save
the initial pte value in advance. And data->follow will be used when  we 
start to


Re: [RFC PATCH v3 6/8] KVM: selftests: List all hugetlb src types specified with page sizes

2021-03-02 Thread wangyanan (Y)

Hi Ben,

On 2021/3/2 1:09, Ben Gardon wrote:

On Sun, Feb 28, 2021 at 11:00 PM Yanan Wang  wrote:

With VM_MEM_SRC_ANONYMOUS_HUGETLB, we currently can only use system
default hugetlb pages to back the testing guest memory. In order to
add flexibility, now list all the known hugetlb backing src types with
different page sizes, so that we can specify use of hugetlb pages of the
exact granularity that we want. And as all the known hugetlb page sizes
are listed, it's appropriate for all architectures.

Besides, the helper get_backing_src_pagesz() is added to get the
granularity of different backing src types (anonymous, thp, hugetlb).

Suggested-by: Ben Gardon 
Signed-off-by: Yanan Wang 
---
  .../testing/selftests/kvm/include/test_util.h | 19 ++-
  tools/testing/selftests/kvm/lib/kvm_util.c|  2 +-
  tools/testing/selftests/kvm/lib/test_util.c   | 56 +++
  3 files changed, 63 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/test_util.h 
b/tools/testing/selftests/kvm/include/test_util.h
index ef24c76ba89a..be5d08bcdca7 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -70,16 +70,31 @@ struct timespec timespec_div(struct timespec ts, int 
divisor);
  enum vm_mem_backing_src_type {
 VM_MEM_SRC_ANONYMOUS,
 VM_MEM_SRC_ANONYMOUS_THP,
-   VM_MEM_SRC_ANONYMOUS_HUGETLB,

I apologize I didn't catch this in v2, but it looks like this patch
removes a default hugetlb size option. I could see this being
intentional if we want to force developers to think about there being
multiple page sizes, but it might also be nice for folks to have an
option to use the system default hugepage size.
Thanks for pointing this out. I was trying to make developers use the
exact page size in all cases if they want to use hugetlb pages, so I
removed the default enum of VM_MEM_SRC_ANONYMOUS_HUGETLB. But maybe
it's not right to do so.

It's possible that a program just wants to use hugetlb pages and doesn't
really care about the page size, in which case the default option is the
best choice. Anyway, I will add VM_MEM_SRC_ANONYMOUS_HUGETLB back in the
next version :). As for the default hugetlb page size, it can be read
from /proc/meminfo.

Otherwise, this series looks good to me. Please feel free to add
Reviewed-by: Ben Gardon .


Thanks for your review of this series and the good suggestions,

Yanan


+   VM_MEM_SRC_ANONYMOUS_HUGETLB_16KB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_64KB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_512KB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_1MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_8MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_16MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_32MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_256MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_512MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_2GB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_16GB,
+   NUM_SRC_TYPES,
  };

  struct vm_mem_backing_src_alias {
 const char *name;
-   enum vm_mem_backing_src_type type;
+   uint32_t flag;
  };

  bool thp_configured(void);
  size_t get_trans_hugepagesz(void);
+const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i);
+size_t get_backing_src_pagesz(uint32_t i);
  void backing_src_help(void);
  enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index cc22c4ab7d67..b91c8e3a7ee1 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -757,7 +757,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 region->mmap_start = mmap(NULL, region->mmap_size,
   PROT_READ | PROT_WRITE,
   MAP_PRIVATE | MAP_ANONYMOUS
- | (src_type == VM_MEM_SRC_ANONYMOUS_HUGETLB ? 
MAP_HUGETLB : 0),
+ | vm_mem_backing_src_alias(src_type)->flag,
   -1, 0);
 TEST_ASSERT(region->mmap_start != MAP_FAILED,
 "test_malloc failed, mmap_start: %p errno: %i",
diff --git a/tools/testing/selftests/kvm/lib/test_util.c 
b/tools/testing/selftests/kvm/lib/test_util.c
index f2d133f76c67..1f5e7241c80e 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -11,6 +11,7 @@
  #include 
  #include 
  #include 
+#include 
  #include "linux/kernel.h"

  #include "test_util.h"
@@ -112,12 +113,6 @@ void print_skip(const char *fmt, ...)
 puts(", skipping test");
  }

-const struct vm_mem_backing_src_alias backing_src_aliases[] = {
-   {"anonymous", VM_MEM_SRC_ANONYMOUS,},
-   {"anonymous_thp", VM_MEM_SRC_ANONYMOUS_THP,},
-   {"anonymous_hugetlb", VM_MEM_SRC_ANONYMOUS_HUGETLB,},
-};

Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings

2021-02-28 Thread wangyanan (Y)



On 2021/2/8 19:22, Yanan Wang wrote:

When KVM needs to coalesce the normal page mappings into a block mapping,
we currently invalidate the old table entry first followed by invalidation
of TLB, then unmap the page mappings, and install the block entry at last.

It will cost a long time to unmap the numerous page mappings, which means
there will be a long period when the table entry can be found invalid.
If other vCPUs access any guest page within the block range and find the
table entry invalid, they will all exit from guest with a translation fault
which is not necessary. And KVM will make efforts to handle these faults,
especially when performing CMOs by block range.

So let's quickly install the block entry at first to ensure uninterrupted
memory access of the other vCPUs, and then unmap the page mappings after
installation. This will reduce most of the time when the table entry is
invalid, and avoid most of the unnecessary translation faults.
BTW: here is the benefit of this patch alone for reference (testing
based on patch 1). This patch aims to speed up the reconstruction of
block mappings (especially 1G blocks) after they have been split, and
the following test results show the significant change.
Selftest: 
https://lore.kernel.org/lkml/20210208090841.333724-1-wangyana...@huawei.com/ 



---

hardware platform: HiSilicon Kunpeng920 Server(FWB not supported)
host kernel: Linux mainline v5.11-rc6 (with series of 
https://lore.kernel.org/r/20210114121350.123684-4-wangyana...@huawei.com 
applied)


multiple vcpus concurrently access 20G memory.
execution time of KVM reconstituting the block mappings after dirty 
logging.


cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
   (20 vcpus, 20G memory, block mappings(HUGETLB 1G))
Before patch: KVM_ADJUST_MAPPINGS: 2.881s 2.883s 2.885s 2.879s 2.882s
After  patch: KVM_ADJUST_MAPPINGS: 0.310s 0.301s 0.312s 0.299s 0.306s  
*average 89% improvement*


cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
   (40 vcpus, 20G memory, block mappings(HUGETLB 1G))
Before patch: KVM_ADJUST_MAPPINGS: 2.954s 2.955s 2.949s 2.951s 2.953s
After  patch: KVM_ADJUST_MAPPINGS: 0.381s 0.366s 0.381s 0.380s 0.378s  
*average 87% improvement*


cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 60
   (60 vcpus, 20G memory, block mappings(HUGETLB 1G))
Before patch: KVM_ADJUST_MAPPINGS: 3.118s 3.112s 3.130s 3.128s 3.119s
After  patch: KVM_ADJUST_MAPPINGS: 0.524s 0.534s 0.536s 0.525s 0.539s  
*average 83% improvement*


---

Thanks,

Yanan


Signed-off-by: Yanan Wang 
---
  arch/arm64/kvm/hyp/pgtable.c | 26 --
  1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 78a560446f80..308c36b9cd21 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -434,6 +434,7 @@ struct stage2_map_data {
kvm_pte_t   attr;
  
  	kvm_pte_t			*anchor;

+   kvm_pte_t   *follow;
  
  	struct kvm_s2_mmu		*mmu;

struct kvm_mmu_memory_cache *memcache;
@@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, 
u32 level,
if (!kvm_block_mapping_supported(addr, end, data->phys, level))
return 0;
  
-	kvm_set_invalid_pte(ptep);

-
/*
-* Invalidate the whole stage-2, as we may have numerous leaf
-* entries below us which would otherwise need invalidating
-* individually.
+* If we need to coalesce existing table entries into a block here,
+* then install the block entry first and the sub-level page mappings
+* will be unmapped later.
 */
-   kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
data->anchor = ptep;
+   data->follow = kvm_pte_follow(*ptep);
+   stage2_coalesce_tables_into_block(addr, level, ptep, data);
return 0;
  }
  
@@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,

  kvm_pte_t *ptep,
  struct stage2_map_data *data)
  {
-   int ret = 0;
-
if (!data->anchor)
return 0;
  
-	free_page((unsigned long)kvm_pte_follow(*ptep));

-   put_page(virt_to_page(ptep));
-
-   if (data->anchor == ptep) {
+   if (data->anchor != ptep) {
+   free_page((unsigned long)kvm_pte_follow(*ptep));
+   put_page(virt_to_page(ptep));
+   } else {
+   free_page((unsigned long)data->follow);
data->anchor = NULL;
-   ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
}
  
-	return ret;

+   return 0;
  }
  
  /*


Re: [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler

2021-02-26 Thread wangyanan (Y)



On 2021/2/25 17:55, Marc Zyngier wrote:

Hi Yanan,

On Mon, 08 Feb 2021 11:22:47 +,
Yanan Wang  wrote:

We currently uniformly clean dcache in user_mem_abort() before calling the
fault handlers, if we take a translation fault and the pfn is cacheable.
But if there are concurrent translation faults on the same page or block,
clean of dcache for the first time is necessary while the others are not.

By moving clean of dcache to the map handler, we can easily identify the
conditions where CMOs are really needed and avoid the unnecessary ones.
As it's a time-consuming process to perform CMOs, especially when flushing
a block range, this solution reduces much of the load on kvm and improves
the efficiency of creating mappings.

That's an interesting approach. However, wouldn't it be better to
identify early that there is already something mapped, and return to
the guest ASAP?

Can you quantify the benefit of this patch alone?


Signed-off-by: Yanan Wang 
---
  arch/arm64/include/asm/kvm_mmu.h | 16 --
  arch/arm64/kvm/hyp/pgtable.c | 38 
  arch/arm64/kvm/mmu.c | 14 +++-
  3 files changed, 27 insertions(+), 41 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e52d82aeadca..4ec9879e82ed 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -204,22 +204,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu 
*vcpu)
return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
  }
  
-static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)

-{
-   void *va = page_address(pfn_to_page(pfn));
-
-   /*
-* With FWB, we ensure that the guest always accesses memory using
-* cacheable attributes, and we don't have to clean to PoC when
-* faulting in pages. Furthermore, FWB implies IDC, so cleaning to
-* PoU is not required either in this case.
-*/
-   if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
-   return;
-
-   kvm_flush_dcache_to_poc(va, size);
-}
-
  static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
  unsigned long size)
  {
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 4d177ce1d536..2f4f87021980 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -464,6 +464,26 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot 
prot,
return 0;
  }
  
+static bool stage2_pte_cacheable(kvm_pte_t pte)

+{
+   u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
+   return memattr == PAGE_S2_MEMATTR(NORMAL);
+}
+
+static void stage2_flush_dcache(void *addr, u64 size)
+{
+   /*
+* With FWB, we ensure that the guest always accesses memory using
+* cacheable attributes, and we don't have to clean to PoC when
+* faulting in pages. Furthermore, FWB implies IDC, so cleaning to
+* PoU is not required either in this case.
+*/
+   if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
+   return;
+
+   __flush_dcache_area(addr, size);
+}
+
  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
  kvm_pte_t *ptep,
  struct stage2_map_data *data)
@@ -495,6 +515,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, 
u32 level,
put_page(page);
}
  
+	/* Flush data cache before installation of the new PTE */

+   if (stage2_pte_cacheable(new))
+   stage2_flush_dcache(__va(phys), granule);
+
smp_store_release(ptep, new);
get_page(page);
data->phys += granule;
@@ -651,20 +675,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 
addr, u64 size,
return ret;
  }
  
-static void stage2_flush_dcache(void *addr, u64 size)

-{
-   if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
-   return;
-
-   __flush_dcache_area(addr, size);
-}
-
-static bool stage2_pte_cacheable(kvm_pte_t pte)
-{
-   u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
-   return memattr == PAGE_S2_MEMATTR(NORMAL);
-}
-
  static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
   enum kvm_pgtable_walk_flags flag,
   void * const arg)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..d151927a7d62 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -609,11 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm 
*kvm,
kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
  }
  
-static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)

-{
-   __clean_dcache_guest_page(pfn, size);
-}
-
  static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
  {

Re: [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler

2021-02-26 Thread wangyanan (Y)

Hi Marc, Alex,

On 2021/2/26 2:30, Marc Zyngier wrote:

On Thu, 25 Feb 2021 17:39:00 +,
Alexandru Elisei  wrote:

Hi Marc,

On 2/25/21 9:55 AM, Marc Zyngier wrote:

Hi Yanan,

On Mon, 08 Feb 2021 11:22:47 +,
Yanan Wang  wrote:

We currently uniformly clean dcache in user_mem_abort() before calling the
fault handlers, if we take a translation fault and the pfn is cacheable.
But if there are concurrent translation faults on the same page or block,
clean of dcache for the first time is necessary while the others are not.

By moving clean of dcache to the map handler, we can easily identify the
conditions where CMOs are really needed and avoid the unnecessary ones.
As it's a time-consuming process to perform CMOs, especially when flushing
a block range, this solution reduces much of the load on kvm and improves
the efficiency of creating mappings.

That's an interesting approach. However, wouldn't it be better to
identify early that there is already something mapped, and return to
the guest ASAP?

Wouldn't that introduce overhead for the common case, when there's
only one VCPU that faults on an address? For each data abort caused
by a missing stage 2 entry we would now have to determine if the IPA
isn't already mapped and that means walking the stage 2 tables.

The problem is that there is no easy to define "common case". It all
depends on what you are running in the guest.


Or am I mistaken and either:

(a) The common case is multiple simultaneous translation faults from
different VCPUs on the same IPA. Or

(b) There's a fast way to check if an IPA is mapped at stage 2 and
the overhead would be negligible.

Checking that something is mapped is simple enough: walk the S2 PT (in
SW or using AT/PAR), and return early if there is *anything*. You
already have taken the fault, which is the most expensive part of the
handling.
I think it could be better to move the CMOs (both dcache and icache)
to the fault handlers.
The map path and permission path are actually a page table walk, and we
can now easily distinguish
between the conditions that need CMOs and the ones that don't in those
paths. Why do we have
to add one more PTW early just to identify the CMO cases and
ignore the existing one?


Besides, if we know in advance there is already something mapped (page 
table is valid), maybe it's
not appropriate to just return early in all cases. What if we are going 
to change the output address (OA)
of the existing table entry? We can't just return in this case. I'm not 
sure whether this is a correct example :).


Actually, moving the CMOs to the fault handlers will not break the
existing stage2 page table framework, and it does not require much
code change. Please see below.

Can you quantify the benefit of this patch alone?

And this ^^^ part is crucial to evaluating the merit of this patch,
specially outside of the micro-benchmark space.
The following test results represent the benefit of this patch alone,
and they indicate that the benefit increases as the page table
granularity increases.
Selftest: 
https://lore.kernel.org/lkml/20210208090841.333724-1-wangyana...@huawei.com/ 



---
hardware platform: HiSilicon Kunpeng920 Server(FWB not supported)
host kernel: Linux mainline v5.11-rc6 (with series of 
https://lore.kernel.org/r/20210114121350.123684-4-wangyana...@huawei.com 
applied)


(1) multiple vcpus concurrently access 1G memory.
    execution time of: a) KVM create new page mappings(normal 4K), b) 
update the mappings from RO to RW.


cmdline: ./kvm_page_table_test -m 4 -t 0 -g 4K -s 1G -v 50
   (50 vcpus, 1G memory, page mappings(normal 4K))
a) Before patch: KVM_CREATE_MAPPINGS: 62.752s 62.123s 61.733s 62.562s 
61.847s
   After  patch: KVM_CREATE_MAPPINGS: 58.800s 58.364s 58.163s 58.370s 
58.677s *average 7% improvement*
b) Before patch: KVM_UPDATE_MAPPINGS: 49.083s 49.920s 49.484s 49.551s 
49.410s
   After  patch: KVM_UPDATE_MAPPINGS: 48.723s 49.259s 49.204s 48.207s 
49.112s *no change*


cmdline: ./kvm_page_table_test -m 4 -t 0 -g 4K -s 1G -v 100
   (100 vcpus, 1G memory, page mappings(normal 4K))
a) Before patch: KVM_CREATE_MAPPINGS: 129.70s 129.66s 126.78s 126.07s 
130.21s
   After  patch: KVM_CREATE_MAPPINGS: 120.69s 120.28s 120.68s 121.09s 
121.34s *average 9% improvement*
b) Before patch: KVM_UPDATE_MAPPINGS: 94.097s 94.501s 92.589s 93.957s 
94.317s
   After  patch: KVM_UPDATE_MAPPINGS: 93.677s 93.701s 93.036s 93.484s 
93.584s *no change*


(2) multiple vcpus concurrently access 20G memory.
    execution time of: a) KVM create new block mappings(THP 2M), b) 
split the blocks in dirty logging, c) reconstitute the blocks after 
dirty logging.


cmdline: ./kvm_page_table_test -m 4 -t 1 -g 2M -s 20G -v 20
   (20 vcpus, 20G memory, block mappings(THP 2M))
a) Before patch: KVM_CREATE_MAPPINGS: 12.546s 13.300s 12.448s 12.496s 
12.420s
   After  patch: KVM_CREATE_MAPPINGS:  5.679s  5.773s  5.759s 5.698s  
5.835s *average 54% improvement*
b) Before patch: 

Re: [RFC PATCH v2 3/7] KVM: selftests: Make a generic helper to get vm guest mode strings

2021-02-25 Thread wangyanan (Y)



On 2021/2/25 13:59, Yanan Wang wrote:

For generality and conciseness, make an API which can be used in all
kvm libs and selftests to get vm guest mode strings. And the index i
is checked in the API in case of possible faults.

Signed-off-by: Yanan Wang 
And here too, will include Suggested-by: Sean Christopherson 
.

---
  .../testing/selftests/kvm/include/kvm_util.h  |  4 +--
  tools/testing/selftests/kvm/lib/kvm_util.c| 29 ---
  2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h 
b/tools/testing/selftests/kvm/include/kvm_util.h
index 2d7eb6989e83..f52a7492f47f 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -68,9 +68,6 @@ enum vm_guest_mode {
  #define MIN_PAGE_SIZE (1U << MIN_PAGE_SHIFT)
  #define PTES_PER_MIN_PAGE ptes_per_page(MIN_PAGE_SIZE)
  
-#define vm_guest_mode_string(m) vm_guest_mode_string[m]

-extern const char * const vm_guest_mode_string[];
-
  struct vm_guest_mode_params {
unsigned int pa_bits;
unsigned int va_bits;
@@ -84,6 +81,7 @@ int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap 
*cap);
  int vcpu_enable_cap(struct kvm_vm *vm, uint32_t vcpu_id,
struct kvm_enable_cap *cap);
  void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
+const char *vm_guest_mode_string(uint32_t i);
  
  struct kvm_vm *vm_create(enum vm_guest_mode mode, uint64_t phy_pages, int perm);

  void kvm_vm_free(struct kvm_vm *vmp);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index d787cb802b4a..cc22c4ab7d67 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -141,17 +141,24 @@ static void vm_open(struct kvm_vm *vm, int perm)
"rc: %i errno: %i", vm->fd, errno);
  }
  
-const char * const vm_guest_mode_string[] = {

-   "PA-bits:52,  VA-bits:48,  4K pages",
-   "PA-bits:52,  VA-bits:48, 64K pages",
-   "PA-bits:48,  VA-bits:48,  4K pages",
-   "PA-bits:48,  VA-bits:48, 64K pages",
-   "PA-bits:40,  VA-bits:48,  4K pages",
-   "PA-bits:40,  VA-bits:48, 64K pages",
-   "PA-bits:ANY, VA-bits:48,  4K pages",
-};
-_Static_assert(sizeof(vm_guest_mode_string)/sizeof(char *) == NUM_VM_MODES,
-  "Missing new mode strings?");
+const char *vm_guest_mode_string(uint32_t i)
+{
+   static const char * const strings[] = {
+   [VM_MODE_P52V48_4K] = "PA-bits:52,  VA-bits:48,  4K pages",
+   [VM_MODE_P52V48_64K]= "PA-bits:52,  VA-bits:48, 64K pages",
+   [VM_MODE_P48V48_4K] = "PA-bits:48,  VA-bits:48,  4K pages",
+   [VM_MODE_P48V48_64K]= "PA-bits:48,  VA-bits:48, 64K pages",
+   [VM_MODE_P40V48_4K] = "PA-bits:40,  VA-bits:48,  4K pages",
+   [VM_MODE_P40V48_64K]= "PA-bits:40,  VA-bits:48, 64K pages",
+   [VM_MODE_PXXV48_4K] = "PA-bits:ANY, VA-bits:48,  4K pages",
+   };
+   _Static_assert(sizeof(strings)/sizeof(char *) == NUM_VM_MODES,
+  "Missing new mode strings?");
+
+   TEST_ASSERT(i < NUM_VM_MODES, "Guest mode ID %d too big", i);
+
+   return strings[i];
+}
  
  const struct vm_guest_mode_params vm_guest_mode_params[] = {

{ 52, 48,  0x1000, 12 },


Re: [RFC PATCH v2 6/7] KVM: selftests: Adapt vm_userspace_mem_region_add to new helpers

2021-02-25 Thread wangyanan (Y)



On 2021/2/26 7:44, Ben Gardon wrote:

On Wed, Feb 24, 2021 at 10:03 PM Yanan Wang  wrote:

With VM_MEM_SRC_ANONYMOUS_THP specified in vm_userspace_mem_region_add(),
we have to get the transparent hugepage size for HVA alignment. With the
new helpers, we can use get_backing_src_pagesz() to check whether THP is
configured and then get the exact configured hugepage size.

As different architectures may have different THP page sizes configured,
this can get the accurate THP page sizes on any platform.

Signed-off-by: Yanan Wang 
---
  tools/testing/selftests/kvm/lib/kvm_util.c | 27 +++---
  1 file changed, 8 insertions(+), 19 deletions(-)

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index b91c8e3a7ee1..0105fbfed036 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -18,7 +18,6 @@
  #include 
  #include 

-#define KVM_UTIL_PGS_PER_HUGEPG 512
  #define KVM_UTIL_MIN_PFN   2

  /* Aligns x up to the next multiple of size. Size must be a power of 2. */
@@ -686,7 +685,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
  {
 int ret;
 struct userspace_mem_region *region;
-   size_t huge_page_size = KVM_UTIL_PGS_PER_HUGEPG * vm->page_size;
+   size_t backing_src_pagesz = get_backing_src_pagesz(src_type);
 size_t alignment;

 TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
@@ -748,7 +747,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
  #endif

 if (src_type == VM_MEM_SRC_ANONYMOUS_THP)
-   alignment = max(huge_page_size, alignment);
+   alignment = max(backing_src_pagesz, alignment);

 /* Add enough memory to align up if necessary */
 if (alignment > 1)
@@ -767,22 +766,12 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 region->host_mem = align(region->mmap_start, alignment);

 /* As needed perform madvise */
-   if (src_type == VM_MEM_SRC_ANONYMOUS || src_type == 
VM_MEM_SRC_ANONYMOUS_THP) {
-   struct stat statbuf;
-
-   ret = stat("/sys/kernel/mm/transparent_hugepage", &statbuf);
-   TEST_ASSERT(ret == 0 || (ret == -1 && errno == ENOENT),
-   "stat /sys/kernel/mm/transparent_hugepage");
-
-   TEST_ASSERT(ret == 0 || src_type != VM_MEM_SRC_ANONYMOUS_THP,
-   "VM_MEM_SRC_ANONYMOUS_THP requires THP to be configured 
in the host kernel");
-
-   if (ret == 0) {
-   ret = madvise(region->host_mem, npages * vm->page_size,
- src_type == VM_MEM_SRC_ANONYMOUS ? 
MADV_NOHUGEPAGE : MADV_HUGEPAGE);
-   TEST_ASSERT(ret == 0, "madvise failed, addr: %p length: 
0x%lx src_type: %x",
-   region->host_mem, npages * vm->page_size, 
src_type);
-   }
+   if (src_type <= VM_MEM_SRC_ANONYMOUS_THP && thp_configured()) {

This check relies on an unstated property of the backing src type
enums where VM_MEM_SRC_ANONYMOUS and VM_MEM_SRC_ANONYMOUS_THP are
declared first.
It would probably be more readable for folks if the check was explicit:
if ((src_type == VM_MEM_SRC_ANONYMOUS || src_type ==
VM_MEM_SRC_ANONYMOUS_THP) && thp_configured()) {


Yes, this makes sense, I will fix it.
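
Something like this, I assume (a quick sketch of the adjusted condition,
otherwise unchanged from the patch):

	if ((src_type == VM_MEM_SRC_ANONYMOUS ||
	     src_type == VM_MEM_SRC_ANONYMOUS_THP) && thp_configured()) {
		ret = madvise(region->host_mem, npages * vm->page_size,
			      src_type == VM_MEM_SRC_ANONYMOUS ?
			      MADV_NOHUGEPAGE : MADV_HUGEPAGE);
		TEST_ASSERT(ret == 0,
			    "madvise failed, addr: %p length: 0x%lx src_type: %s",
			    region->host_mem, npages * vm->page_size,
			    vm_mem_backing_src_alias(src_type)->name);
	}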

Thanks,

Yanan


+   ret = madvise(region->host_mem, npages * vm->page_size,
+ src_type == VM_MEM_SRC_ANONYMOUS ? 
MADV_NOHUGEPAGE : MADV_HUGEPAGE);
+   TEST_ASSERT(ret == 0, "madvise failed, addr: %p length: 0x%lx 
src_type: %s",
+   region->host_mem, npages * vm->page_size,
+   vm_mem_backing_src_alias(src_type)->name);
 }

 region->unused_phy_pages = sparsebit_alloc();
--
2.19.1


.


Re: [RFC PATCH v2 5/7] KVM: selftests: List all hugetlb src types specified with page sizes

2021-02-25 Thread wangyanan (Y)



On 2021/2/26 7:42, Ben Gardon wrote:

On Wed, Feb 24, 2021 at 10:03 PM Yanan Wang  wrote:

With VM_MEM_SRC_ANONYMOUS_HUGETLB, we currently can only use system
default hugetlb pages to back the testing guest memory. In order to
add flexibility, now list all the known hugetlb backing src types with
different page sizes, so that we can specify use of hugetlb pages of the
exact granularity that we want. And as all the known hugetlb page sizes
are listed, it's appropriate for all architectures.

Besides, the helper get_backing_src_pagesz() is added to get the
granularity of different backing src types (anonymous, thp, hugetlb).

Signed-off-by: Yanan Wang 
---
  .../testing/selftests/kvm/include/test_util.h | 19 ++-
  tools/testing/selftests/kvm/lib/kvm_util.c|  2 +-
  tools/testing/selftests/kvm/lib/test_util.c   | 56 +++
  3 files changed, 63 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/test_util.h 
b/tools/testing/selftests/kvm/include/test_util.h
index ef24c76ba89a..be5d08bcdca7 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -70,16 +70,31 @@ struct timespec timespec_div(struct timespec ts, int 
divisor);
  enum vm_mem_backing_src_type {
 VM_MEM_SRC_ANONYMOUS,
 VM_MEM_SRC_ANONYMOUS_THP,
-   VM_MEM_SRC_ANONYMOUS_HUGETLB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_16KB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_64KB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_512KB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_1MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_8MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_16MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_32MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_256MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_512MB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_2GB,
+   VM_MEM_SRC_ANONYMOUS_HUGETLB_16GB,
+   NUM_SRC_TYPES,
  };

  struct vm_mem_backing_src_alias {
 const char *name;
-   enum vm_mem_backing_src_type type;
+   uint32_t flag;
  };

  bool thp_configured(void);
  size_t get_trans_hugepagesz(void);
+const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i);
+size_t get_backing_src_pagesz(uint32_t i);
  void backing_src_help(void);
  enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index cc22c4ab7d67..b91c8e3a7ee1 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -757,7 +757,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 region->mmap_start = mmap(NULL, region->mmap_size,
   PROT_READ | PROT_WRITE,
   MAP_PRIVATE | MAP_ANONYMOUS
- | (src_type == VM_MEM_SRC_ANONYMOUS_HUGETLB ? 
MAP_HUGETLB : 0),
+ | vm_mem_backing_src_alias(src_type)->flag,
   -1, 0);
 TEST_ASSERT(region->mmap_start != MAP_FAILED,
 "test_malloc failed, mmap_start: %p errno: %i",
diff --git a/tools/testing/selftests/kvm/lib/test_util.c 
b/tools/testing/selftests/kvm/lib/test_util.c
index f2d133f76c67..6780aa058f35 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -11,6 +11,7 @@
  #include 
  #include 
  #include 
+#include 
  #include "linux/kernel.h"

  #include "test_util.h"
@@ -112,12 +113,6 @@ void print_skip(const char *fmt, ...)
 puts(", skipping test");
  }

-const struct vm_mem_backing_src_alias backing_src_aliases[] = {
-   {"anonymous", VM_MEM_SRC_ANONYMOUS,},
-   {"anonymous_thp", VM_MEM_SRC_ANONYMOUS_THP,},
-   {"anonymous_hugetlb", VM_MEM_SRC_ANONYMOUS_HUGETLB,},
-};
-
  bool thp_configured(void)
  {
 int ret;
@@ -153,22 +148,61 @@ size_t get_trans_hugepagesz(void)
 return size;
  }

+const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i)
+{
+   static const struct vm_mem_backing_src_alias aliases[] = {
+   { "anonymous",   0},
+   { "anonymous_thp",   0},
+   { "anonymous_hugetlb_16kb",  MAP_HUGETLB | MAP_HUGE_16KB  },
+   { "anonymous_hugetlb_64kb",  MAP_HUGETLB | MAP_HUGE_64KB  },
+   { "anonymous_hugetlb_512kb", MAP_HUGETLB | MAP_HUGE_512KB },
+   { "anonymous_hugetlb_1mb",   MAP_HUGETLB | MAP_HUGE_1MB   },
+   { "anonymous_hugetlb_2mb",   MAP_HUGETLB | MAP_HUGE_2MB   },
+   { "anonymous_hugetlb_8mb",   MAP_HUGETLB | MAP_HUGE_8MB   },
+   { "anonymous_hugetlb_16mb",  MAP_HUGETLB | MAP_HUGE_16MB  },
+   { "anonymous_hugetlb_32mb",  MAP_HUGETLB | MAP_HUGE_32MB  },
+   

Re: [RFC PATCH v2 2/7] KVM: selftests: Use flag CLOCK_MONOTONIC_RAW for timing

2021-02-25 Thread wangyanan (Y)



On 2021/2/26 2:54, Andrew Jones wrote:

On Thu, Feb 25, 2021 at 01:59:35PM +0800, Yanan Wang wrote:

In addition to the function of CLOCK_MONOTONIC, the flag CLOCK_MONOTONIC_RAW can
also shield the possible impact of NTP, which can provide more robustness.

IIRC, this should include

Suggested-by: Vitaly Kuznetsov 


Oh, sorry for the oversight. I will include it in v3.

Thanks,

Yanan
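
For reference, a minimal self-contained sketch (not part of the patch) of the
timing pattern the selftests use, with CLOCK_MONOTONIC_RAW so the measurement
is based purely on the local oscillator and is not subject to NTP frequency
adjustment:

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec start, end;
	long sec, nsec;

	clock_gettime(CLOCK_MONOTONIC_RAW, &start);

	/* ... the code under test would run here ... */

	clock_gettime(CLOCK_MONOTONIC_RAW, &end);

	/* Borrow a second if the nanosecond part went negative. */
	sec  = end.tv_sec - start.tv_sec;
	nsec = end.tv_nsec - start.tv_nsec;
	if (nsec < 0) {
		sec--;
		nsec += 1000000000L;
	}
	printf("elapsed: %ld.%09lds\n", sec, nsec);
	return 0;
}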


Signed-off-by: Yanan Wang 
---
  tools/testing/selftests/kvm/demand_paging_test.c  |  8 
  tools/testing/selftests/kvm/dirty_log_perf_test.c | 14 +++---
  tools/testing/selftests/kvm/lib/test_util.c   |  2 +-
  tools/testing/selftests/kvm/steal_time.c  |  4 ++--
  4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c 
b/tools/testing/selftests/kvm/demand_paging_test.c
index 5f7a229c3af1..efbf0c1e9130 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -53,7 +53,7 @@ static void *vcpu_worker(void *data)
vcpu_args_set(vm, vcpu_id, 1, vcpu_id);
run = vcpu_state(vm, vcpu_id);
  
-	clock_gettime(CLOCK_MONOTONIC, &start);
+	clock_gettime(CLOCK_MONOTONIC_RAW, &start);
  
  	/* Let the guest access its memory */

ret = _vcpu_run(vm, vcpu_id);
@@ -86,7 +86,7 @@ static int handle_uffd_page_request(int uffd, uint64_t addr)
copy.len = perf_test_args.host_page_size;
copy.mode = 0;
  
-	clock_gettime(CLOCK_MONOTONIC, &start);
+	clock_gettime(CLOCK_MONOTONIC_RAW, &start);
  
 	r = ioctl(uffd, UFFDIO_COPY, &copy);

if (r == -1) {
@@ -123,7 +123,7 @@ static void *uffd_handler_thread_fn(void *arg)
struct timespec start;
struct timespec ts_diff;
  
-	clock_gettime(CLOCK_MONOTONIC, &start);
+	clock_gettime(CLOCK_MONOTONIC_RAW, &start);
while (!quit_uffd_thread) {
struct uffd_msg msg;
struct pollfd pollfd[2];
@@ -336,7 +336,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
  
  	pr_info("Finished creating vCPUs and starting uffd threads\n");
  
-	clock_gettime(CLOCK_MONOTONIC, &start);
+	clock_gettime(CLOCK_MONOTONIC_RAW, &start);
  
  	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {

		pthread_create(&vcpu_threads[vcpu_id], NULL, vcpu_worker,
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c 
b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index 04a2641261be..6cff4ccf9525 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -50,7 +50,7 @@ static void *vcpu_worker(void *data)
while (!READ_ONCE(host_quit)) {
int current_iteration = READ_ONCE(iteration);
  
-		clock_gettime(CLOCK_MONOTONIC, &start);
+		clock_gettime(CLOCK_MONOTONIC_RAW, &start);
ret = _vcpu_run(vm, vcpu_id);
ts_diff = timespec_elapsed(start);
  
@@ -141,7 +141,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)

iteration = 0;
host_quit = false;
  
-	clock_gettime(CLOCK_MONOTONIC, &start);
+	clock_gettime(CLOCK_MONOTONIC_RAW, &start);
for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
vcpu_last_completed_iteration[vcpu_id] = -1;
  
@@ -162,7 +162,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)

ts_diff.tv_sec, ts_diff.tv_nsec);
  
  	/* Enable dirty logging */

-	clock_gettime(CLOCK_MONOTONIC, &start);
+	clock_gettime(CLOCK_MONOTONIC_RAW, &start);
vm_mem_region_set_flags(vm, PERF_TEST_MEM_SLOT_INDEX,
KVM_MEM_LOG_DIRTY_PAGES);
ts_diff = timespec_elapsed(start);
@@ -174,7 +174,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 * Incrementing the iteration number will start the vCPUs
 * dirtying memory again.
 */
-	clock_gettime(CLOCK_MONOTONIC, &start);
+	clock_gettime(CLOCK_MONOTONIC_RAW, &start);
iteration++;
  
  		pr_debug("Starting iteration %d\n", iteration);

@@ -189,7 +189,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
pr_info("Iteration %d dirty memory time: %ld.%.9lds\n",
iteration, ts_diff.tv_sec, ts_diff.tv_nsec);
  
-		clock_gettime(CLOCK_MONOTONIC, &start);
+		clock_gettime(CLOCK_MONOTONIC_RAW, &start);
kvm_vm_get_dirty_log(vm, PERF_TEST_MEM_SLOT_INDEX, bmap);
  
  		ts_diff = timespec_elapsed(start);

@@ -199,7 +199,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
iteration, ts_diff.tv_sec, ts_diff.tv_nsec);
  
  		if (dirty_log_manual_caps) {

-			clock_gettime(CLOCK_MONOTONIC, &start);
+			clock_gettime(CLOCK_MONOTONIC_RAW, &start);
kvm_vm_clear_dirty_log(vm, PERF_TEST_MEM_SLOT_INDEX, 
bmap, 0,
   host_num_pages);
  
@@ -212,7 +212,7 @@ static void run_test(enum 

Re: [PATCH 04/15] KVM: selftests: Force stronger HVA alignment (1gb) for hugepages

2021-02-25 Thread wangyanan (Y)



On 2021/2/11 7:06, Sean Christopherson wrote:

Align the HVA for hugepage memslots to 1gb, as opposed to incorrectly
assuming all architectures' hugepages are 512*page_size.

For x86, multiplying by 512 is correct, but only for 2mb pages, e.g.
systems that support 1gb pages will never be able to use them for mapping
guest memory, and thus those flows will not be exercised.

For arm64, powerpc, and s390 (and mips?), hardcoding the multiplier to
512 is either flat out wrong, or at best correct only in certain
configurations.

Hardcoding the _alignment_ to 1gb is a compromise between correctness and
simplicity.  Due to the myriad flavors of hugepages across architectures,
attempting to enumerate the exact hugepage size is difficult, and likely
requires probing the kernel.

But, there is no need for precision since a stronger alignment will not
prevent creating a smaller hugepage.  For all but the most extreme cases,
e.g. arm64's 16gb contiguous PMDs, aligning to 1gb is sufficient to allow
KVM to back the guest with hugepages.
I have implemented a helper get_backing_src_pagesz() to get the
granularity of the different backing src types (anonymous/thp/hugetlb),
which is suitable for different architectures.
See:
https://lore.kernel.org/lkml/20210225055940.18748-6-wangyana...@huawei.com/
If it looks fine to you, maybe we can use the accurate page sizes for
GPA/HVA alignment. :)


Thanks,

Yanan
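
For context, here is a rough standalone sketch (my own illustration, not the
helper from the linked series) of how the exact hugepage granularity can be
probed from the kernel instead of being hardcoded; the sysfs and procfs files
below are the standard ones exposed by Linux:

#include <stdio.h>
#include <stdlib.h>

/* Probe the THP PMD size (in bytes) exported by the kernel. */
static size_t thp_pagesz(void)
{
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");
	size_t size = 0;

	if (f) {
		if (fscanf(f, "%zu", &size) != 1)
			size = 0;
		fclose(f);
	}
	return size;
}

/* Probe the default hugetlb page size (in bytes) from /proc/meminfo. */
static size_t default_hugetlb_pagesz(void)
{
	char line[128];
	size_t kb = 0;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Hugepagesize: %zu kB", &kb) == 1)
			break;
	}
	fclose(f);
	return kb * 1024;
}

int main(void)
{
	printf("THP granule:     %zu bytes\n", thp_pagesz());
	printf("Hugetlb granule: %zu bytes\n", default_hugetlb_pagesz());
	return 0;
}

With that information, the alignment could be set to the actual granule of the
chosen backing source rather than a fixed 1gb constant, which appears to be
the direction of get_backing_src_pagesz() in the series linked above.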

Add the new alignment in kvm_util.h so that it can be used by callers of
vm_userspace_mem_region_add(), e.g. to also ensure GPAs are aligned.

Cc: Ben Gardon 
Cc: Yanan Wang 
Cc: Andrew Jones 
Cc: Peter Xu 
Cc: Aaron Lewis 
Signed-off-by: Sean Christopherson 
---
  tools/testing/selftests/kvm/include/kvm_util.h | 13 +
  tools/testing/selftests/kvm/lib/kvm_util.c |  4 +---
  2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h 
b/tools/testing/selftests/kvm/include/kvm_util.h
index 4b5d2362a68a..a7dbdf46aa51 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -68,6 +68,19 @@ enum vm_guest_mode {
  #define MIN_PAGE_SIZE (1U << MIN_PAGE_SHIFT)
  #define PTES_PER_MIN_PAGE ptes_per_page(MIN_PAGE_SIZE)
  
+/*

+ * KVM_UTIL_HUGEPAGE_ALIGNMENT is selftest's required alignment for both host
+ * and guest addresses when backing guest memory with hugepages.  This is not
+ * the exact size of hugepages, rather it's a size that should allow backing
+ * the guest with hugepages on all architectures.  Precisely tracking the exact
+ * sizes across all architectures is more pain than gain, e.g. x86 supports 2mb
+ * and 1gb hugepages, arm64 supports 2mb and 1gb hugepages when using 4kb pages
+ * and 512mb hugepages when using 64kb pages (ignoring contiguous TLB entries),
+ * powerpc radix supports 1gb hugepages when using 64kb pages, s390 supports 
1mb
+ * hugepages, and so on and so forth.
+ */
+#define KVM_UTIL_HUGEPAGE_ALIGNMENT(1ULL << 30)
+
  #define vm_guest_mode_string(m) vm_guest_mode_string[m]
  extern const char * const vm_guest_mode_string[];
  
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c

index deaeb47b5a6d..2e497fbab6ae 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -18,7 +18,6 @@
  #include 
  #include 
  
-#define KVM_UTIL_PGS_PER_HUGEPG 512

  #define KVM_UTIL_MIN_PFN  2
  
  /*

@@ -670,7 +669,6 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
  {
int ret;
struct userspace_mem_region *region;
-   size_t huge_page_size = KVM_UTIL_PGS_PER_HUGEPG * vm->page_size;
size_t alignment;
  
  	TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,

@@ -733,7 +731,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
  
  	if (src_type == VM_MEM_SRC_ANONYMOUS_THP ||

src_type == VM_MEM_SRC_ANONYMOUS_HUGETLB)
-   alignment = max(huge_page_size, alignment);
+   alignment = max((size_t)KVM_UTIL_HUGEPAGE_ALIGNMENT, alignment);
else
ASSERT_EQ(src_type, VM_MEM_SRC_ANONYMOUS);
  


Re: [PATCH 03/15] KVM: selftests: Align HVA for HugeTLB-backed memslots

2021-02-24 Thread wangyanan (Y)

Hi Sean,

On 2021/2/11 7:06, Sean Christopherson wrote:

Align the HVA for HugeTLB memslots, not just THP memslots.  Add an
assert so any future backing types are forced to assess whether or not
they need to be aligned.

Cc: Ben Gardon 
Cc: Yanan Wang 
Cc: Andrew Jones 
Cc: Peter Xu 
Cc: Aaron Lewis 
Signed-off-by: Sean Christopherson 
---
  tools/testing/selftests/kvm/lib/kvm_util.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 584167c6dbc7..deaeb47b5a6d 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -731,8 +731,11 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
alignment = 1;
  #endif
  
-	if (src_type == VM_MEM_SRC_ANONYMOUS_THP)

+   if (src_type == VM_MEM_SRC_ANONYMOUS_THP ||
+   src_type == VM_MEM_SRC_ANONYMOUS_HUGETLB)

Sorry for the late reply, I just returned from vacation.
I am not sure HVA alignment is really necessary here for hugetlb pages.
Unlike hugetlb pages, the THP pages are dynamically allocated by the later
madvise(), so the HVA returned from mmap() is host page size aligned but
not THP page size aligned, and we indeed have to perform the alignment.
But hugetlb pages are pre-allocated on the system. The following test
results also indicate that, with the MAP_HUGETLB flag, the HVA returned
from mmap() is already aligned to the corresponding hugetlb page size.
So maybe the HVAs of hugetlb pages are aligned during their allocation or
in mmap()? If so, I think we had better not do this again here, because
the later *region->mmap_size += alignment* will cause one more hugetlb
page to be mapped but never used.

cmdline: ./kvm_page_table_test -m 4 -b 1G -s anonymous_hugetlb_1gb
some outputs:
Host  virtual  test memory offset: 0x4000
Host  virtual  test memory offset: 0x
Host  virtual  test memory offset: 0x4000

cmdline: ./kvm_page_table_test -m 4 -b 1G -s anonymous_hugetlb_2mb
some outputs:
Host  virtual  test memory offset: 0x4800
Host  virtual  test memory offset: 0x6540
Host  virtual  test memory offset: 0x6ba0

cmdline: ./kvm_page_table_test -m 4 -b 1G -s anonymous_hugetlb_32mb
some outputs:
Host  virtual  test memory offset: 0x7000
Host  virtual  test memory offset: 0x4c00
Host  virtual  test memory offset: 0x7200

cmdline: ./kvm_page_table_test -m 4 -b 1G -s anonymous_hugetlb_64kb
some outputs:
Host  virtual  test memory offset: 0x5823
Host  virtual  test memory offset: 0x6ef0
Host  virtual  test memory offset: 0x7c15

Thanks,
Yanan
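
For anyone who wants to double-check this behaviour, a small standalone sketch
(my own, not from the selftests) that maps hugetlb memory and reports whether
the returned HVA is already hugepage aligned; it assumes some 2MB hugetlb
pages have been reserved beforehand:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB	(21 << 26)	/* log2(2MB) << MAP_HUGE_SHIFT */
#endif

int main(void)
{
	const size_t size = 4UL << 20;		/* two 2MB hugetlb pages */
	void *hva = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
			 -1, 0);

	if (hva == MAP_FAILED) {
		perror("mmap");		/* likely no hugetlb pages reserved */
		return 1;
	}

	printf("HVA: %p, 2MB aligned: %s\n", hva,
	       ((unsigned long)hva & ((2UL << 20) - 1)) ? "no" : "yes");
	munmap(hva, size);
	return 0;
}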

alignment = max(huge_page_size, alignment);
+   else
+   ASSERT_EQ(src_type, VM_MEM_SRC_ANONYMOUS);
  
  	/* Add enough memory to align up if necessary */

if (alignment > 1)


Re: [RFC PATCH 0/4] KVM: arm64: Improve efficiency of stage2 page table

2021-02-24 Thread wangyanan (Y)



On 2021/2/25 1:20, Alexandru Elisei wrote:

Hi,

On 2/24/21 2:35 AM, wangyanan (Y) wrote:


Hi Alex,

On 2021/2/23 23:55, Alexandru Elisei wrote:

Hi Yanan,

I wanted to review the patches, but unfortunately I get an error when trying to
apply the first patch in the series:

Applying: KVM: arm64: Move the clean of dcache to the map handler
error: patch failed: arch/arm64/kvm/hyp/pgtable.c:464
error: arch/arm64/kvm/hyp/pgtable.c: patch does not apply
error: patch failed: arch/arm64/kvm/mmu.c:882
error: arch/arm64/kvm/mmu.c: patch does not apply
Patch failed at 0001 KVM: arm64: Move the clean of dcache to the map handler
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Tried this with Linux tags v5.11-rc1 to v5.11-rc7. It looks like pgtable.c and
mmu.c from your patch is different than what is found on upstream master. Did 
you
use another branch as the base for your patches?

Thanks for your attention.
Indeed, this series was more or less based on the patches I posted before (Link:
https://lore.kernel.org/r/20210114121350.123684-4-wangyana...@huawei.com).
And they have already been merged into the up-to-date upstream master (commit:
509552e65ae8287178a5cdea2d734dcd2d6380ab), but not into tags v5.11-rc1 to
v5.11-rc7.
Could you please try the newest upstream master (since commit:
509552e65ae8287178a5cdea2d734dcd2d6380ab)? I have tested locally and no
apply errors occur.

That worked for me, thank you for the quick reply.

Just to double check, when you run the benchmarks, the before results are for a
kernel built from commit 509552e65ae8 ("KVM: arm64: Mark the page dirty only if
the fault is handled successfully"), and the after results are with this series 
on
top, right?


Yes, that's right. So the performance change results have nothing to do 
with the series of commit 509552e65ae8.


Thanks,

Yanan



Thanks,

Alex


Thanks,

Yanan.


Thanks,

Alex

On 2/8/21 11:22 AM, Yanan Wang wrote:

Hi,

This series makes some efficiency improvement of stage2 page table code,
and there are some test results to present the performance changes, which
were tested by a kvm selftest [1] that I have post:
[1] https://lore.kernel.org/lkml/20210208090841.333724-1-wangyana...@huawei.com/

About patch 1:
We currently uniformly clean dcache in user_mem_abort() before calling the
fault handlers, if we take a translation fault and the pfn is cacheable.
But if there are concurrent translation faults on the same page or block,
the dcache clean for the first fault is necessary while the subsequent ones are not.

By moving the dcache clean to the map handler, we can easily identify the
conditions where CMOs are really needed and avoid the unnecessary ones.
As performing CMOs is a time-consuming process, especially when flushing
a block range, this solution reduces much of KVM's load and improves the
efficiency of creating mappings.

Test results:
(1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
KVM create block mappings time: 52.83s -> 3.70s
KVM recover block mappings time(after dirty-logging): 52.0s -> 2.87s

(2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
KVM creating block mappings time: 104.56s -> 3.70s
KVM recover block mappings time(after dirty-logging): 103.93s -> 2.96s

About patch 2, 3:
When KVM needs to coalesce the normal page mappings into a block mapping,
we currently invalidate the old table entry first followed by invalidation
of TLB, then unmap the page mappings, and install the block entry at last.

It will cost a lot of time to unmap the numerous page mappings, which means
the table entry will be left invalid for a long time before installation of
the block entry, and this will cause many spurious translation faults.

So let's quickly install the block entry at first to ensure uninterrupted
memory access of the other vCPUs, and then unmap the page mappings after
installation. This will reduce most of the time when the table entry is
invalid, and avoid most of the unnecessary translation faults.

Test results based on patch 1:
(1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
KVM recover block mappings time(after dirty-logging): 2.87s -> 0.30s

(2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
KVM recover block mappings time(after dirty-logging): 2.96s -> 0.35s

So combined with patch 1, it makes a big difference of KVM creating mappings
and recovering block mappings with not much code change.

About patch 4:
A new method to distinguish cases of memcache allocations is introduced.
By comparing fault_granule and vma_pagesize, cases that require allocations
from memcache and cases that don't can be distinguished completely.

---

Details of test results
platform: HiSilicon Kunpeng920 (FWB 

Re: [RFC PATCH 0/4] KVM: arm64: Improve efficiency of stage2 page table

2021-02-23 Thread wangyanan (Y)

Hi Alex,

On 2021/2/23 23:55, Alexandru Elisei wrote:

Hi Yanan,

I wanted to review the patches, but unfortunately I get an error when trying to
apply the first patch in the series:

Applying: KVM: arm64: Move the clean of dcache to the map handler
error: patch failed: arch/arm64/kvm/hyp/pgtable.c:464
error: arch/arm64/kvm/hyp/pgtable.c: patch does not apply
error: patch failed: arch/arm64/kvm/mmu.c:882
error: arch/arm64/kvm/mmu.c: patch does not apply
Patch failed at 0001 KVM: arm64: Move the clean of dcache to the map handler
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Tried this with Linux tags v5.11-rc1 to v5.11-rc7. It looks like pgtable.c and
mmu.c from your patch is different than what is found on upstream master. Did 
you
use another branch as the base for your patches?

Thanks for your attention.
Indeed, this series was more or less based on the patches I posted before
(Link:
https://lore.kernel.org/r/20210114121350.123684-4-wangyana...@huawei.com).
And they have already been merged into the up-to-date upstream master
(commit: 509552e65ae8287178a5cdea2d734dcd2d6380ab), but not into tags
v5.11-rc1 to v5.11-rc7.
Could you please try the newest upstream master (since commit:
509552e65ae8287178a5cdea2d734dcd2d6380ab)? I have tested locally
and no apply errors occur.


Thanks,

Yanan.


Thanks,

Alex

On 2/8/21 11:22 AM, Yanan Wang wrote:

Hi,

This series makes some efficiency improvement of stage2 page table code,
and there are some test results to present the performance changes, which
were tested by a kvm selftest [1] that I have post:
[1] https://lore.kernel.org/lkml/20210208090841.333724-1-wangyana...@huawei.com/

About patch 1:
We currently uniformly clean dcache in user_mem_abort() before calling the
fault handlers, if we take a translation fault and the pfn is cacheable.
But if there are concurrent translation faults on the same page or block,
the dcache clean for the first fault is necessary while the subsequent ones are not.

By moving the dcache clean to the map handler, we can easily identify the
conditions where CMOs are really needed and avoid the unnecessary ones.
As performing CMOs is a time-consuming process, especially when flushing
a block range, this solution reduces much of KVM's load and improves the
efficiency of creating mappings.

Test results:
(1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
KVM create block mappings time: 52.83s -> 3.70s
KVM recover block mappings time(after dirty-logging): 52.0s -> 2.87s

(2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
KVM creating block mappings time: 104.56s -> 3.70s
KVM recover block mappings time(after dirty-logging): 103.93s -> 2.96s

About patch 2, 3:
When KVM needs to coalesce the normal page mappings into a block mapping,
we currently invalidate the old table entry first followed by invalidation
of TLB, then unmap the page mappings, and install the block entry at last.

It will cost a lot of time to unmap the numerous page mappings, which means
the table entry will be left invalid for a long time before installation of
the block entry, and this will cause many spurious translation faults.

So let's quickly install the block entry at first to ensure uninterrupted
memory access of the other vCPUs, and then unmap the page mappings after
installation. This will reduce most of the time when the table entry is
invalid, and avoid most of the unnecessary translation faults.

Test results based on patch 1:
(1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
KVM recover block mappings time(after dirty-logging): 2.87s -> 0.30s

(2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
KVM recover block mappings time(after dirty-logging): 2.96s -> 0.35s

So combined with patch 1, it makes a big difference of KVM creating mappings
and recovering block mappings with not much code change.

About patch 4:
A new method to distinguish cases of memcache allocations is introduced.
By comparing fault_granule and vma_pagesize, cases that require allocations
from memcache and cases that don't can be distinguished completely.

---

Details of test results
platform: HiSilicon Kunpeng920 (FWB not supported)
host kernel: Linux mainline (v5.11-rc6)

(1) performance change of patch 1
cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
   (20 vcpus, 20G memory, block mappings(granule 1G))
Before patch: KVM_CREATE_MAPPINGS: 52.8338s 52.8327s 52.8336s 52.8255s 52.8303s
After  patch: KVM_CREATE_MAPPINGS:  3.7022s  3.7031s  3.7028s  3.7012s  3.7024s

Before patch: KVM_ADJUST_MAPPINGS: 52.0466s 52.0473s 52.0550s 52.0518s 52.0467s
After  patch: KVM_ADJUST_MAPPINGS:  2.8787s  2.8781s  2.8785s  2.8742s  2.8759s

cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
   

Re: [PATCH][next] net: hns3: Fix uninitialized return from function

2021-02-17 Thread lipeng (Y)



在 2021/2/10 23:26, Colin King 写道:

From: Colin Ian King 

Currently function hns3_reset_notify_uninit_enet is returning
the contents of the uninitialized variable ret.  Fix this by
removing ret (since it is no longer used) and replacing it with
a return of the literal value 0.



You cannot remove "ret" this way.

It would be better to change "int hns3_uninit_all_ring" to "void
hns3_uninit_all_ring" and fix the related code accordingly.





Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: 64749c9c38a9 ("net: hns3: remove redundant return value of 
hns3_uninit_all_ring()")
Signed-off-by: Colin Ian King 
---
  drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 9565b7999426..bf4302a5cf95 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -4640,7 +4640,6 @@ static int hns3_reset_notify_uninit_enet(struct 
hnae3_handle *handle)
  {
struct net_device *netdev = handle->kinfo.netdev;
struct hns3_nic_priv *priv = netdev_priv(netdev);
-   int ret;
  
  	if (!test_and_clear_bit(HNS3_NIC_STATE_INITED, &priv->state)) {

netdev_warn(netdev, "already uninitialized\n");
@@ -4662,7 +4661,7 @@ static int hns3_reset_notify_uninit_enet(struct 
hnae3_handle *handle)
  
  	hns3_put_ring_config(priv);
  
-	return ret;

+   return 0;
  }
  
  static int hns3_reset_notify(struct hnae3_handle *handle,


Re: [RFC PATCH 2/2] KVM: selftests: Add a test for kvm page table code

2021-02-10 Thread wangyanan (Y)



On 2021/2/10 1:57, Ben Gardon wrote:

On Tue, Feb 9, 2021 at 1:43 AM wangyanan (Y)  wrote:


On 2021/2/9 4:29, Ben Gardon wrote:

On Mon, Feb 8, 2021 at 1:08 AM Yanan Wang  wrote:

This test serves as a performance tester and a bug reproducer for
kvm page table code (GPA->HPA mappings), so it gives guidance for
people trying to make some improvement for kvm.

The function guest_code() is designed to cover conditions where a single vcpu
or multiple vcpus access guest pages within the same memory range, in three
VM stages(before dirty-logging, during dirty-logging, after dirty-logging).
Besides, the backing source memory type(ANONYMOUS/THP/HUGETLB) of the tested
memory region can be specified by users, which means normal page mappings or
block mappings can be chosen by users to be created in the test.

If use of ANONYMOUS memory is specified, kvm will create page mappings for the
tested memory region before dirty-logging, and update attributes of the page
mappings from RO to RW during dirty-logging. If use of THP/HUGETLB memory is
specified, kvm will create block mappings for the tested memory region before
dirty-logging, and split the block mappings into page mappings during
dirty-logging, and coalesce the page mappings back into block mappings after
dirty-logging is stopped.

So in summary, as a performance tester, this test can present the performance
of kvm creating/updating normal page mappings, or the performance of kvm
creating/splitting/recovering block mappings, through execution time.

When we need to coalesce the page mappings back to block mappings after dirty
logging is stopped, we have to firstly invalidate *all* the TLB entries for the
page mappings right before installation of the block entry, because a TLB 
conflict
abort error could occur if we can't invalidate the TLB entries fully. We have
hit this TLB conflict twice on aarch64 software implementation and fixed it.
As this test can simulate the process from dirty-logging enabled to dirty-logging
stopped for a VM with block mappings, it can also reproduce this TLB conflict
abort due to inadequate TLB invalidation when coalescing tables.

Signed-off-by: Yanan Wang 

Thanks for sending this! Happy to see more tests for weird TLB
flushing edge cases and races.

Just out of curiosity, were you unable to replicate the bug with the
dirty_log_perf_test and setting the wr_fract option?
With "KVM: selftests: Disable dirty logging with vCPUs running"
(https://lkml.org/lkml/2021/2/2/1431), the dirty_log_perf_test has
most of the same features as this one.
Please correct me if I'm wrong, but it seems like the major difference
here is a more careful pattern of which pages are dirtied when.

Within Google we have a system for pre-specifying sets of arguments to
e.g. the dirty_log_perf_test. I wonder if something similar, even as
simple as a script that just runs dirty_log_perf_test several times
would be helpful for cases where different arguments are needed for
the test to cover different specific cases. Even with this test, for
example, I assume the test doesn't work very well with just 1 vCPU,
but it's still a good default in the test, so having some kind of
configuration (lite) file would be useful.


---
   tools/testing/selftests/kvm/Makefile  |   3 +
   .../selftests/kvm/kvm_page_table_test.c   | 518 ++
   2 files changed, 521 insertions(+)
   create mode 100644 tools/testing/selftests/kvm/kvm_page_table_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index fe41c6a0fa67..697318019bd4 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -62,6 +62,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test
   TEST_GEN_PROGS_x86_64 += demand_paging_test
   TEST_GEN_PROGS_x86_64 += dirty_log_test
   TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
+TEST_GEN_PROGS_x86_64 += kvm_page_table_test
   TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
   TEST_GEN_PROGS_x86_64 += set_memory_region_test
   TEST_GEN_PROGS_x86_64 += steal_time
@@ -71,6 +72,7 @@ TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list-sve
   TEST_GEN_PROGS_aarch64 += demand_paging_test
   TEST_GEN_PROGS_aarch64 += dirty_log_test
   TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
+TEST_GEN_PROGS_aarch64 += kvm_page_table_test
   TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
   TEST_GEN_PROGS_aarch64 += set_memory_region_test
   TEST_GEN_PROGS_aarch64 += steal_time
@@ -80,6 +82,7 @@ TEST_GEN_PROGS_s390x += s390x/resets
   TEST_GEN_PROGS_s390x += s390x/sync_regs_test
   TEST_GEN_PROGS_s390x += demand_paging_test
   TEST_GEN_PROGS_s390x += dirty_log_test
+TEST_GEN_PROGS_s390x += kvm_page_table_test
   TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
   TEST_GEN_PROGS_s390x += set_memory_region_test

diff --git a/tools/testing/selftests/kvm/kvm_page_table_test.c 
b/tools/testing/selftests/kvm/kvm_page_table_test.c
new file mode 100644
index ..b09c05288937
--- /dev/nul

Re: [RFC PATCH 2/2] KVM: selftests: Add a test for kvm page table code

2021-02-09 Thread wangyanan (Y)



On 2021/2/10 1:38, Ben Gardon wrote:

On Mon, Feb 8, 2021 at 11:22 PM wangyanan (Y)  wrote:

Hi Ben,

On 2021/2/9 4:29, Ben Gardon wrote:

On Mon, Feb 8, 2021 at 1:08 AM Yanan Wang  wrote:

This test serves as a performance tester and a bug reproducer for
kvm page table code (GPA->HPA mappings), so it gives guidance for
people trying to make some improvement for kvm.

The function guest_code() is designed to cover conditions where a single vcpu
or multiple vcpus access guest pages within the same memory range, in three
VM stages(before dirty-logging, during dirty-logging, after dirty-logging).
Besides, the backing source memory type(ANONYMOUS/THP/HUGETLB) of the tested
memory region can be specified by users, which means normal page mappings or
block mappings can be chosen by users to be created in the test.

If use of ANONYMOUS memory is specified, kvm will create page mappings for the
tested memory region before dirty-logging, and update attributes of the page
mappings from RO to RW during dirty-logging. If use of THP/HUGETLB memory is
specified, kvm will create block mappings for the tested memory region before
dirty-logging, and split the block mappings into page mappings during
dirty-logging, and coalesce the page mappings back into block mappings after
dirty-logging is stopped.

So in summary, as a performance tester, this test can present the performance
of kvm creating/updating normal page mappings, or the performance of kvm
creating/splitting/recovering block mappings, through execution time.

When we need to coalesce the page mappings back to block mappings after dirty
logging is stopped, we have to firstly invalidate *all* the TLB entries for the
page mappings right before installation of the block entry, because a TLB 
conflict
abort error could occur if we can't invalidate the TLB entries fully. We have
hit this TLB conflict twice on aarch64 software implementation and fixed it.
As this test can simulate the process from dirty-logging enabled to dirty-logging
stopped for a VM with block mappings, it can also reproduce this TLB conflict
abort due to inadequate TLB invalidation when coalescing tables.

Signed-off-by: Yanan Wang 

Thanks for sending this! Happy to see more tests for weird TLB
flushing edge cases and races.

Just out of curiosity, were you unable to replicate the bug with the
dirty_log_perf_test and setting the wr_fract option?
With "KVM: selftests: Disable dirty logging with vCPUs running"
(https://lkml.org/lkml/2021/2/2/1431), the dirty_log_perf_test has
most of the same features as this one.
Please correct me if I'm wrong, but it seems like the major difference
here is a more careful pattern of which pages are dirtied when.

Actually the procedures in the KVM_UPDATE_MAPPINGS stage are specially
designed for reproducing the TLB conflict bug. The following explains why.
In x86 implementation, the related page mappings will be all destroyed
in advance when
stopping dirty logging while vcpus are still running. So after dirty
logging is successfully
stopped, there will certainly be page faults when accessing memory, and
KVM will handle
the faults and create block mappings once again. (Is this right?)
So in this case, dirty_log_perf_test can replicate the bug theoretically.

But there is difference in ARM implementation. The related page mappings
will not be
destroyed immediately when stopping dirty logging and will  be kept
instead. And after
dirty logging, KVM will destroy these mappings together with creation of
block mappings
when handling a guest fault (page fault or permission fault).  So based
on guest_code() in
dirty_log_perf_test, there will not be any page faults after dirty
logging because all the
page mappings have been created and KVM has no chance to recover block
mappings
at all. So this is why I left half of the pages clean and another half
dirtied.

Ah okay, I'm sorry. I shouldn't have assumed that ARM does the same
thing as x86 when disabling dirty logging. It makes sense then why
your guest code is so carefully structured. Does that mean that if a
VM dirties all its memory during dirty logging, that it will never be
able to reconstitute the broken down mappings into large page / block
mappings?


Indeed, but it's really a rare case to happen. I think both the x86 way
and the ARM way have their own benefits and are based on different
considerations. Anyway, the more carefully structured code can cover the
TLB bug on different architectures.


Within Google we have a system for pre-specifying sets of arguments to
e.g. the dirty_log_perf_test. I wonder if something similar, even as
simple as a script that just runs dirty_log_perf_test several times
would be helpful for cases where different arguments are needed for
the test to cover different specific cases. Even with this test, for

I'm not sure I have got your point :), but it depends on what exactly the
specific cases are, and sometimes we have to use different arguments. Is this right?

Exac

Re: [RFC PATCH 1/2] KVM: selftests: Add a macro to get string of vm_mem_backing_src_type

2021-02-09 Thread wangyanan (Y)



On 2021/2/10 1:35, Sean Christopherson wrote:

On Tue, Feb 09, 2021, Ben Gardon wrote:

On Tue, Feb 9, 2021 at 3:21 AM wangyanan (Y)  wrote:


On 2021/2/9 2:13, Ben Gardon wrote:

On Mon, Feb 8, 2021 at 1:08 AM Yanan Wang  wrote:

Add a macro to get string of the backing source memory type, so that
application can add choices for source types in the help() function,
and users can specify which type to use for testing.

Coincidentally, I sent out a change last week to do the same thing:
"KVM: selftests: Add backing src parameter to dirty_log_perf_test"
(https://lkml.org/lkml/2021/2/2/1430)
Whichever way this ends up being implemented, I'm happy to see others
interested in testing different backing source types too.

Thanks Ben! I have a little question here.

Can we just present three IDs (0/1/2) instead of strings for users to
choose which backing_src_type to use, like the way of guest modes,

That would be fine with me. The string names are easier for me to read
than an ID number (especially if you were to add additional options
e.g. 1G hugetlb or file backed  / shared memory) but it's mostly an
aesthetic preference, so I don't have strong feelings either way.

I vote to expose/consume strings, being able to do ".dirty_log_perf_test --help"
and understand the backing options without having to dig into source was super
nice.
Fine then :), I will make some changes based on
(https://lkml.org/lkml/2021/2/2/1430), thanks!


Re: [RFC PATCH 1/2] KVM: selftests: Add a macro to get string of vm_mem_backing_src_type

2021-02-09 Thread wangyanan (Y)



On 2021/2/9 2:13, Ben Gardon wrote:

On Mon, Feb 8, 2021 at 1:08 AM Yanan Wang  wrote:

Add a macro to get string of the backing source memory type, so that
application can add choices for source types in the help() function,
and users can specify which type to use for testing.

Coincidentally, I sent out a change last week to do the same thing:
"KVM: selftests: Add backing src parameter to dirty_log_perf_test"
(https://lkml.org/lkml/2021/2/2/1430)
Whichever way this ends up being implemented, I'm happy to see others
interested in testing different backing source types too.


Thanks Ben! I have a little question here.

Can we just present three IDs (0/1/2) instead of strings for users to
choose which backing_src_type to use, like the way of guest modes,
which I think can make the cmdlines more concise and easier to print?
And would it be better to make a universal API to get the backing_src
strings, as Sean has suggested, so that the API can be used elsewhere?


Signed-off-by: Yanan Wang 
---
  tools/testing/selftests/kvm/include/kvm_util.h | 3 +++
  tools/testing/selftests/kvm/lib/kvm_util.c | 8 
  2 files changed, 11 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h 
b/tools/testing/selftests/kvm/include/kvm_util.h
index 5cbb861525ed..f5fc29dc9ee6 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -69,7 +69,9 @@ enum vm_guest_mode {
  #define PTES_PER_MIN_PAGE  ptes_per_page(MIN_PAGE_SIZE)

  #define vm_guest_mode_string(m) vm_guest_mode_string[m]
+#define vm_mem_backing_src_type_string(s) vm_mem_backing_src_type_string[s]
  extern const char * const vm_guest_mode_string[];
+extern const char * const vm_mem_backing_src_type_string[];

  struct vm_guest_mode_params {
 unsigned int pa_bits;
@@ -83,6 +85,7 @@ enum vm_mem_backing_src_type {
 VM_MEM_SRC_ANONYMOUS,
 VM_MEM_SRC_ANONYMOUS_THP,
 VM_MEM_SRC_ANONYMOUS_HUGETLB,
+   NUM_VM_BACKING_SRC_TYPES,
  };

  int kvm_check_cap(long cap);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index fa5a90e6c6f0..a9b651c7f866 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -165,6 +165,14 @@ const struct vm_guest_mode_params vm_guest_mode_params[] = 
{
  _Static_assert(sizeof(vm_guest_mode_params)/sizeof(struct 
vm_guest_mode_params) == NUM_VM_MODES,
"Missing new mode params?");

+const char * const vm_mem_backing_src_type_string[] = {
+   "VM_MEM_SRC_ANONYMOUS",
+   "VM_MEM_SRC_ANONYMOUS_THP",
+   "VM_MEM_SRC_ANONYMOUS_HUGETLB",
+};
+_Static_assert(sizeof(vm_mem_backing_src_type_string)/sizeof(char *) == 
NUM_VM_BACKING_SRC_TYPES,
+  "Missing new source type strings?");
+
  /*
   * VM Create
   *
--
2.23.0




Re: [RFC PATCH 1/2] KVM: selftests: Add a macro to get string of vm_mem_backing_src_type

2021-02-09 Thread wangyanan (Y)

Hi Sean,

On 2021/2/9 1:43, Sean Christopherson wrote:

On Mon, Feb 08, 2021, Yanan Wang wrote:

Add a macro to get string of the backing source memory type, so that
application can add choices for source types in the help() function,
and users can specify which type to use for testing.

Signed-off-by: Yanan Wang 
---
  tools/testing/selftests/kvm/include/kvm_util.h | 3 +++
  tools/testing/selftests/kvm/lib/kvm_util.c | 8 
  2 files changed, 11 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h 
b/tools/testing/selftests/kvm/include/kvm_util.h
index 5cbb861525ed..f5fc29dc9ee6 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -69,7 +69,9 @@ enum vm_guest_mode {
  #define PTES_PER_MIN_PAGE ptes_per_page(MIN_PAGE_SIZE)
  
  #define vm_guest_mode_string(m) vm_guest_mode_string[m]

+#define vm_mem_backing_src_type_string(s) vm_mem_backing_src_type_string[s]

Oof, I see this is just following vm_guest_mode_string.  IMO, defining the
string to look like a function is unnecessary and rather mean.


  extern const char * const vm_guest_mode_string[];
+extern const char * const vm_mem_backing_src_type_string[];
  
  struct vm_guest_mode_params {

unsigned int pa_bits;
@@ -83,6 +85,7 @@ enum vm_mem_backing_src_type {
VM_MEM_SRC_ANONYMOUS,
VM_MEM_SRC_ANONYMOUS_THP,
VM_MEM_SRC_ANONYMOUS_HUGETLB,
+   NUM_VM_BACKING_SRC_TYPES,
  };
  
  int kvm_check_cap(long cap);

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index fa5a90e6c6f0..a9b651c7f866 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -165,6 +165,14 @@ const struct vm_guest_mode_params vm_guest_mode_params[] = 
{
  _Static_assert(sizeof(vm_guest_mode_params)/sizeof(struct 
vm_guest_mode_params) == NUM_VM_MODES,
   "Missing new mode params?");
  
+const char * const vm_mem_backing_src_type_string[] = {

A shorter name would be nice, though I don't have a good suggestion.


+   "VM_MEM_SRC_ANONYMOUS",
+   "VM_MEM_SRC_ANONYMOUS_THP",
+   "VM_MEM_SRC_ANONYMOUS_HUGETLB",

It'd be more robust to explicitly assign indices, that way tweaks to
vm_mem_backing_src_type won't cause silent breakage.  Ditto for the existing
vm_guest_mode_string.

E.g. I think something like this would work (completely untested)

const char *vm_guest_mode_string(int i)
{
static const char *const strings[] = {
[VM_MODE_P52V48_4K] = "PA-bits:52,  VA-bits:48,  4K pages",
[VM_MODE_P52V48_64K]= "PA-bits:52,  VA-bits:48, 64K pages",
[VM_MODE_P48V48_4K] = "PA-bits:48,  VA-bits:48,  4K pages",
[VM_MODE_P48V48_64K]= "PA-bits:48,  VA-bits:48, 64K pages",
[VM_MODE_P40V48_4K] = "PA-bits:40,  VA-bits:48,  4K pages",
[VM_MODE_P40V48_64K]= "PA-bits:40,  VA-bits:48, 64K pages",
[VM_MODE_PXXV48_4K] = "PA-bits:ANY, VA-bits:48,  4K pages",
};

_Static_assert(sizeof(strings)/sizeof(char *) == NUM_VM_MODES,
   "Missing new mode strings?");

TEST_ASSERT(i < NUM_VM_MODES);

return strings[i];
}


I think this is better. Moving these three pieces together into a single
function and checking the indices there is more reasonable.


Thanks,

Yanan.
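
To make it concrete, a sketch of the same pattern applied to the backing
source strings (untested, and note that the selftests' TEST_ASSERT() also
takes a message argument, which the snippet above omits):

const char *vm_mem_backing_src_type_string(int i)
{
	static const char * const strings[] = {
		[VM_MEM_SRC_ANONYMOUS]		= "VM_MEM_SRC_ANONYMOUS",
		[VM_MEM_SRC_ANONYMOUS_THP]	= "VM_MEM_SRC_ANONYMOUS_THP",
		[VM_MEM_SRC_ANONYMOUS_HUGETLB]	= "VM_MEM_SRC_ANONYMOUS_HUGETLB",
	};

	_Static_assert(sizeof(strings) / sizeof(char *) == NUM_VM_BACKING_SRC_TYPES,
		       "Missing new backing src type strings?");

	TEST_ASSERT(i >= 0 && i < NUM_VM_BACKING_SRC_TYPES,
		    "Unknown backing src type: %d", i);

	return strings[i];
}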




+};
+_Static_assert(sizeof(vm_mem_backing_src_type_string)/sizeof(char *) == 
NUM_VM_BACKING_SRC_TYPES,
+  "Missing new source type strings?");
+
  /*
   * VM Create
   *
--
2.23.0




Re: [RFC PATCH 2/2] KVM: selftests: Add a test for kvm page table code

2021-02-09 Thread wangyanan (Y)



On 2021/2/9 4:29, Ben Gardon wrote:

On Mon, Feb 8, 2021 at 1:08 AM Yanan Wang  wrote:

This test serves as a performance tester and a bug reproducer for
kvm page table code (GPA->HPA mappings), so it gives guidance for
people trying to make some improvement for kvm.

The function guest_code() is designed to cover conditions where a single vcpu
or multiple vcpus access guest pages within the same memory range, in three
VM stages(before dirty-logging, during dirty-logging, after dirty-logging).
Besides, the backing source memory type(ANONYMOUS/THP/HUGETLB) of the tested
memory region can be specified by users, which means normal page mappings or
block mappings can be chosen by users to be created in the test.

If use of ANONYMOUS memory is specified, kvm will create page mappings for the
tested memory region before dirty-logging, and update attributes of the page
mappings from RO to RW during dirty-logging. If use of THP/HUGETLB memory is
specified, kvm will create block mappings for the tested memory region before
dirty-logging, and split the block mappings into page mappings during
dirty-logging, and coalesce the page mappings back into block mappings after
dirty-logging is stopped.

So in summary, as a performance tester, this test can present the performance
of kvm creating/updating normal page mappings, or the performance of kvm
creating/splitting/recovering block mappings, through execution time.

When we need to coalesce the page mappings back to block mappings after dirty
logging is stopped, we have to firstly invalidate *all* the TLB entries for the
page mappings right before installation of the block entry, because a TLB 
conflict
abort error could occur if we can't invalidate the TLB entries fully. We have
hit this TLB conflict twice on aarch64 software implementation and fixed it.
As this test can simulate the process from dirty-logging enabled to dirty-logging
stopped for a VM with block mappings, it can also reproduce this TLB conflict
abort due to inadequate TLB invalidation when coalescing tables.

Signed-off-by: Yanan Wang 

Thanks for sending this! Happy to see more tests for weird TLB
flushing edge cases and races.

Just out of curiosity, were you unable to replicate the bug with the
dirty_log_perf_test and setting the wr_fract option?
With "KVM: selftests: Disable dirty logging with vCPUs running"
(https://lkml.org/lkml/2021/2/2/1431), the dirty_log_perf_test has
most of the same features as this one.
Please correct me if I'm wrong, but it seems like the major difference
here is a more careful pattern of which pages are dirtied when.

Within Google we have a system for pre-specifying sets of arguments to
e.g. the dirty_log_perf_test. I wonder if something similar, even as
simple as a script that just runs dirty_log_perf_test several times
would be helpful for cases where different arguments are needed for
the test to cover different specific cases. Even with this test, for
example, I assume the test doesn't work very well with just 1 vCPU,
but it's still a good default in the test, so having some kind of
configuration (lite) file would be useful.


---
  tools/testing/selftests/kvm/Makefile  |   3 +
  .../selftests/kvm/kvm_page_table_test.c   | 518 ++
  2 files changed, 521 insertions(+)
  create mode 100644 tools/testing/selftests/kvm/kvm_page_table_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index fe41c6a0fa67..697318019bd4 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -62,6 +62,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test
  TEST_GEN_PROGS_x86_64 += demand_paging_test
  TEST_GEN_PROGS_x86_64 += dirty_log_test
  TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
+TEST_GEN_PROGS_x86_64 += kvm_page_table_test
  TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
  TEST_GEN_PROGS_x86_64 += set_memory_region_test
  TEST_GEN_PROGS_x86_64 += steal_time
@@ -71,6 +72,7 @@ TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list-sve
  TEST_GEN_PROGS_aarch64 += demand_paging_test
  TEST_GEN_PROGS_aarch64 += dirty_log_test
  TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
+TEST_GEN_PROGS_aarch64 += kvm_page_table_test
  TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
  TEST_GEN_PROGS_aarch64 += set_memory_region_test
  TEST_GEN_PROGS_aarch64 += steal_time
@@ -80,6 +82,7 @@ TEST_GEN_PROGS_s390x += s390x/resets
  TEST_GEN_PROGS_s390x += s390x/sync_regs_test
  TEST_GEN_PROGS_s390x += demand_paging_test
  TEST_GEN_PROGS_s390x += dirty_log_test
+TEST_GEN_PROGS_s390x += kvm_page_table_test
  TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
  TEST_GEN_PROGS_s390x += set_memory_region_test

diff --git a/tools/testing/selftests/kvm/kvm_page_table_test.c 
b/tools/testing/selftests/kvm/kvm_page_table_test.c
new file mode 100644
index ..b09c05288937
--- /dev/null
+++ b/tools/testing/selftests/kvm/kvm_page_table_test.c
@@ -0,0 +1,518 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM page 

Re: [RFC PATCH 2/2] KVM: selftests: Add a test for kvm page table code

2021-02-08 Thread wangyanan (Y)

Hi Ben,

On 2021/2/9 4:29, Ben Gardon wrote:

On Mon, Feb 8, 2021 at 1:08 AM Yanan Wang  wrote:

This test serves as a performance tester and a bug reproducer for
kvm page table code (GPA->HPA mappings), so it gives guidance for
people trying to make some improvement for kvm.

The function guest_code() is designed to cover conditions where a single vcpu
or multiple vcpus access guest pages within the same memory range, in three
VM stages(before dirty-logging, during dirty-logging, after dirty-logging).
Besides, the backing source memory type(ANONYMOUS/THP/HUGETLB) of the tested
memory region can be specified by users, which means normal page mappings or
block mappings can be chosen by users to be created in the test.

If use of ANONYMOUS memory is specified, kvm will create page mappings for the
tested memory region before dirty-logging, and update attributes of the page
mappings from RO to RW during dirty-logging. If use of THP/HUGETLB memory is
specified, kvm will create block mappings for the tested memory region before
dirty-logging, and split the block mappings into page mappings during
dirty-logging, and coalesce the page mappings back into block mappings after
dirty-logging is stopped.

So in summary, as a performance tester, this test can present the performance
of kvm creating/updating normal page mappings, or the performance of kvm
creating/splitting/recovering block mappings, through execution time.

When we need to coalesce the page mappings back to block mappings after dirty
logging is stopped, we have to firstly invalidate *all* the TLB entries for the
page mappings right before installation of the block entry, because a TLB 
conflict
abort error could occur if we can't invalidate the TLB entries fully. We have
hit this TLB conflict twice on aarch64 software implementation and fixed it.
As this test can simulate the process from dirty-logging enabled to dirty-logging
stopped for a VM with block mappings, it can also reproduce this TLB conflict
abort due to inadequate TLB invalidation when coalescing tables.

Signed-off-by: Yanan Wang 

Thanks for sending this! Happy to see more tests for weird TLB
flushing edge cases and races.

Just out of curiosity, were you unable to replicate the bug with the
dirty_log_perf_test and setting the wr_fract option?
With "KVM: selftests: Disable dirty logging with vCPUs running"
(https://lkml.org/lkml/2021/2/2/1431), the dirty_log_perf_test has
most of the same features as this one.
Please correct me if I'm wrong, but it seems like the major difference
here is a more careful pattern of which pages are dirtied when.
Actually the procedures in the KVM_UPDATE_MAPPINGS stage are specially
designed for reproducing the TLB conflict bug. The following explains why.
In x86 implementation, the related page mappings will be all destroyed 
in advance when
stopping dirty logging while vcpus are still running. So after dirty 
logging is successfully
stopped, there will certainly be page faults when accessing memory, and 
KVM will handle

the faults and create block mappings once again. (Is this right?)
So in this case, dirty_log_perf_test can replicate the bug theoretically.

But there is difference in ARM implementation. The related page mappings 
will not be
destroyed immediately when stopping dirty logging and will  be kept 
instead. And after
dirty logging, KVM will destroy these mappings together with creation of 
block mappings
when handling a guest fault (page fault or permission fault).  So based 
on guest_code() in
dirty_log_perf_test, there will not be any page faults after dirty 
logging because all the
page mappings have been created and KVM has no chance to recover block 
mappings
at all. So this is why I left half of the pages clean and another half 
dirtied.
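
In other words, the guest loop looks roughly like the sketch below (a
simplification with made-up names, not the test's actual guest_code()):
during dirty logging only the first half of the pages is written, so once
dirty logging stops a later access to the clean half can still take a fault
and gives KVM the chance to coalesce the mappings back into a block.

#include <stdint.h>

/* Simplified sketch of the access pattern described above. */
static void guest_touch_pages(uint64_t gva, uint64_t pages, uint64_t page_size)
{
	uint64_t i;

	for (i = 0; i < pages; i++) {
		uint64_t addr = gva + i * page_size;

		if (i < pages / 2)
			*(uint64_t *)addr = 0x0123456789ABCDEFULL;	/* dirtied half */
		else
			(void)*(volatile uint64_t *)addr;	/* read-only half stays clean */
	}
}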

Within Google we have a system for pre-specifying sets of arguments to
e.g. the dirty_log_perf_test. I wonder if something similar, even as
simple as a script that just runs dirty_log_perf_test several times
would be helpful for cases where different arguments are needed for
the test to cover different specific cases. Even with this test, for
I'm not sure I have got your point :), but it depends on what exactly the
specific cases are, and sometimes we have to use different arguments. Is this right?

example, I assume the test doesn't work very well with just 1 vCPU,
but it's still a good default in the test, so having some kind of
configuration (lite) file would be useful.
Actually it's only with 1 vCPU that the real efficiency of the KVM page
table code path can be tested, such as the efficiency of creating new
mappings or of updating existing mappings.
And with numerous vCPUs, the efficiency of KVM handling concurrent
conditions can be tested.



---
  tools/testing/selftests/kvm/Makefile  |   3 +
  .../selftests/kvm/kvm_page_table_test.c   | 518 ++
  2 files changed, 521 insertions(+)
  create mode 100644 tools/testing/selftests/kvm/kvm_page_table_test.c

diff --git 

Re: [RFC PATCH 2/2] KVM: selftests: Add a test for kvm page table code

2021-02-08 Thread wangyanan (Y)

Hi Vitaly,

On 2021/2/8 18:21, Vitaly Kuznetsov wrote:

Yanan Wang  writes:


This test serves as a performance tester and a bug reproducer for
kvm page table code (GPA->HPA mappings), so it gives guidance for
people trying to make some improvement for kvm.

The function guest_code() is designed to cover conditions where a single vcpu
or multiple vcpus access guest pages within the same memory range, in three
VM stages(before dirty-logging, during dirty-logging, after dirty-logging).
Besides, the backing source memory type(ANONYMOUS/THP/HUGETLB) of the tested
memory region can be specified by users, which means normal page mappings or
block mappings can be chosen by users to be created in the test.

If use of ANONYMOUS memory is specified, kvm will create page mappings for the
tested memory region before dirty-logging, and update attributes of the page
mappings from RO to RW during dirty-logging. If use of THP/HUGETLB memory is
specified, kvm will create block mappings for the tested memory region before
dirty-logging, and split the block mappings into page mappings during
dirty-logging, and coalesce the page mappings back into block mappings after
dirty-logging is stopped.

So in summary, as a performance tester, this test can present the performance
of kvm creating/updating normal page mappings, or the performance of kvm
creating/splitting/recovering block mappings, through execution time.

When we need to coalesce the page mappings back to block mappings after dirty
logging is stopped, we have to firstly invalidate *all* the TLB entries for the
page mappings right before installation of the block entry, because a TLB 
conflict
abort error could occur if we can't invalidate the TLB entries fully. We have
hit this TLB conflict twice on aarch64 software implementation and fixed it.
As this test can simulate the process from dirty-logging enabled to dirty-logging
stopped for a VM with block mappings, it can also reproduce this TLB conflict
abort due to inadequate TLB invalidation when coalescing tables.

Signed-off-by: Yanan Wang 

This looks like a really useful thing, thanks! A few nitpicks below.


---
  tools/testing/selftests/kvm/Makefile  |   3 +
  .../selftests/kvm/kvm_page_table_test.c   | 518 ++
  2 files changed, 521 insertions(+)
  create mode 100644 tools/testing/selftests/kvm/kvm_page_table_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index fe41c6a0fa67..697318019bd4 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -62,6 +62,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test
  TEST_GEN_PROGS_x86_64 += demand_paging_test
  TEST_GEN_PROGS_x86_64 += dirty_log_test
  TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
+TEST_GEN_PROGS_x86_64 += kvm_page_table_test
  TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
  TEST_GEN_PROGS_x86_64 += set_memory_region_test
  TEST_GEN_PROGS_x86_64 += steal_time
@@ -71,6 +72,7 @@ TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list-sve
  TEST_GEN_PROGS_aarch64 += demand_paging_test
  TEST_GEN_PROGS_aarch64 += dirty_log_test
  TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
+TEST_GEN_PROGS_aarch64 += kvm_page_table_test
  TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
  TEST_GEN_PROGS_aarch64 += set_memory_region_test
  TEST_GEN_PROGS_aarch64 += steal_time
@@ -80,6 +82,7 @@ TEST_GEN_PROGS_s390x += s390x/resets
  TEST_GEN_PROGS_s390x += s390x/sync_regs_test
  TEST_GEN_PROGS_s390x += demand_paging_test
  TEST_GEN_PROGS_s390x += dirty_log_test
+TEST_GEN_PROGS_s390x += kvm_page_table_test
  TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
  TEST_GEN_PROGS_s390x += set_memory_region_test
  
diff --git a/tools/testing/selftests/kvm/kvm_page_table_test.c b/tools/testing/selftests/kvm/kvm_page_table_test.c

new file mode 100644
index ..b09c05288937
--- /dev/null
+++ b/tools/testing/selftests/kvm/kvm_page_table_test.c
@@ -0,0 +1,518 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM page table test
+ * Based on dirty_log_test.c
+ * Based on dirty_log_perf_test.c
+ *
+ * Copyright (C) 2018, Red Hat, Inc.
+ * Copyright (C) 2020, Google, Inc.
+ * Copyright (C) 2021, Huawei, Inc.

[Paolo's call but] I think we can drop 'based on .. ' and all but the
last copyright notices as I don't quite see what value this gives. Yes,
when a new test is implemented we use something else as a template but
these are just tests after all.

Ok, I will remove it.

+ *
+ * Make sure that enough THP/HUGETLB pages have been allocated on systems
+ * to cover the testing memory region before running this program, if you
+ * wish to create block mappings in this test.
+ */
+
+#define _GNU_SOURCE /* for program_invocation_name */
+
+#include 
+#include 
+#include 
+#include 
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "guest_modes.h"
+
+#define TEST_MEM_SLOT_INDEX 1
+
+/* Default size(1GB) of the memory for testing */
+#define 

Re: [PATCH v2 3/3] KVM: arm64: Mark the page dirty only if the fault is handled successfully

2021-01-14 Thread wangyanan (Y)



On 2021/1/13 23:51, Will Deacon wrote:

On Wed, Dec 16, 2020 at 08:28:44PM +0800, Yanan Wang wrote:

We now mark the page dirty and set the bitmap before calling fault handlers
in user_mem_abort(), and we might end up having spurious dirty pages if
update of permissions or mapping has failed.
So, mark the page dirty only if the fault is handled successfully.

Let the guest directly enter again but not return to userspace if we were
trying to recreate the same mapping or only change access permissions
with BBM, which is not permitted in the mapping path.

Signed-off-by: Yanan Wang 
---
  arch/arm64/kvm/mmu.c | 18 ++
  1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 75814a02d189..72e516a10914 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -879,11 +879,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
if (vma_pagesize == PAGE_SIZE && !force_pte)
vma_pagesize = transparent_hugepage_adjust(memslot, hva,
   , _ipa);
-   if (writable) {
+   if (writable)
prot |= KVM_PGTABLE_PROT_W;
-   kvm_set_pfn_dirty(pfn);
-   mark_page_dirty(kvm, gfn);
-   }
  
  	if (fault_status != FSC_PERM && !device)

clean_dcache_guest_page(pfn, vma_pagesize);
@@ -911,6 +908,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 memcache);
}
  
+	/* Mark the page dirty only if the fault is handled successfully */

+   if (writable && !ret) {
+   kvm_set_pfn_dirty(pfn);
+   mark_page_dirty(kvm, gfn);
+   }
+
+   /* Let the guest directly enter again if we were trying to recreate the
+* same mapping or only change access permissions with BBM, which is not
+* permitted in the mapping path.
+*/
+   if (ret == -EAGAIN)
+   ret = 0;

Maybe just 'return ret != -EAGAIN ? ret : 0;' at the end of the function?

Yes, it's more concise.


Re: [PATCH v2 2/3] KVM: arm64: Add prejudgement for relaxing permissions only case in stage2 translation fault handler

2021-01-14 Thread wangyanan (Y)



On 2021/1/13 23:44, Will Deacon wrote:

On Wed, Dec 16, 2020 at 08:28:43PM +0800, Yanan Wang wrote:

During dirty logging, after dirty logging has been stopped, or even during
normal running time of a guest configured with huge mappings and a number of
vCPUs, translation faults by different vCPUs on the same GPA can occur in
quick succession, almost at the same time. There are two reasons for this.

(1) If there are some vCPUs accessing the same GPA at the same time and
the leaf PTE is not set yet, then they will all cause translation faults
and the first vCPU holding mmu_lock will set valid leaf PTE, and the
others will later update the old PTE with a new one if they are different.

(2) When changing a leaf entry or a table entry with break-before-make,
if there are some vCPUs accessing the same GPA just catch the moment when
the target PTE is set invalid in a BBM procedure coincidentally, they will
all cause translation faults and will later update the old PTE with a new
one if they are different.

The worst case can be like this: vCPU A causes a translation fault with RW
prot and sets the leaf PTE with RW permissions, and then the next vCPU B
with RO prot updates the PTE back to RO permissions with break-before-make.
And the BBM-invalid moment may trigger more unnecessary translation faults,
then some useless small loops might occur, which could lead to vCPUs getting stuck.

To avoid unnecessary update and small loops, add prejudgement in the
translation fault handler: Skip updating the PTE with break-before-make
if we are trying to recreate the exact same mapping or only change the
access permissions. Actually, change of permissions will be handled
through the relax_perms path next time if necessary.

Signed-off-by: Yanan Wang 
---
  arch/arm64/kvm/hyp/pgtable.c | 28 +++-
  1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 350f9f810930..8225ced49bad 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -45,6 +45,10 @@
  
  #define KVM_PTE_LEAF_ATTR_HI_S2_XN	BIT(54)
  
+#define KVM_PTE_LEAF_ATTR_S2_PERMS	(KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \

+KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
+KVM_PTE_LEAF_ATTR_HI_S2_XN)
+
  struct kvm_pgtable_walk_data {
struct kvm_pgtable  *pgt;
struct kvm_pgtable_walker   *walker;
@@ -460,7 +464,7 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot 
prot,
return 0;
  }
  
-static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,

+static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
   kvm_pte_t *ptep,
   struct stage2_map_data *data)
  {
@@ -469,13 +473,18 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, 
u32 level,
struct page *page = virt_to_page(ptep);
  
  	if (!kvm_block_mapping_supported(addr, end, phys, level))

-   return false;
+   return 1;

It would probably be cleaner to return another error code here, as we
have failed to install a mapping (e.g. E2BIG or perhaps more perversely,
ENOTBLK). Then the caller can decide to install a table.


Ok, I will make some adjustment for this and post the v3 soon after.

Thanks,

Yanan.



Re: [PATCH v2 0/3] RFC: Solve several problems in stage 2 translation

2020-12-27 Thread wangyanan (Y)

Hi Will, Marc,

Gentle ping.

Are there any comments about this v2 series?


Many thanks,

Yanan.

On 2020/12/16 20:28, Yanan Wang wrote:

Hi, this is the second version, thanks for reading.

PATCH1/3:
Procedures of hyp stage 1 mapping and guest stage 2 mapping are different, but
they are tied closely by function kvm_set_valid_leaf_pte(). So separate them by
rewriting kvm_set_valid_leaf_pte().

PATCH2/3:
To avoid unnecessary update and small loops, add prejudgement in the translation
fault handler: Skip updating the PTE with break-before-make if we are trying to
recreate the exact same mapping or only change the access permissions. Actually,
change of permissions will be handled through the relax_perms path next time if
necessary.

(1) If there are some vCPUs accessing the same GPA at the same time and the leaf
PTE is not set yet, then they will all cause translation faults and the first
vCPU holding mmu_lock will set a valid leaf PTE, and the others will later
update the old PTE with a new one if they are different.

(2) When changing a leaf entry or a table entry with break-before-make, if there
are some vCPUs accessing the same GPA that just catch the moment when the target
PTE is set invalid in a BBM procedure coincidentally, they will all cause
translation faults and will later update the old PTE with a new one if they are
different.

The worst case can be like this: vCPU A causes a translation fault with RW prot
and sets the leaf PTE with RW permissions, and then the next vCPU B with RO prot
updates the PTE back to RO permissions with break-before-make. And the
BBM-invalid moment may trigger more unnecessary translation faults, and then
some useless small loops might occur, which could lead to vCPUs getting stuck.

PATCH3/3:
We now mark the page dirty and set the bitmap before calling fault handlers in
user_mem_abort(), and we might end up having spurious dirty pages if update of
permissions or mapping has failed. So, mark the page dirty only if the fault is
handled successfully.

Let the guest directly enter again but not return to userspace if we were trying
to recreate the same mapping or only change access permissions with BBM, which
is not permitted in the mapping path.

Changes from v1:
- Make part of the diff as an independent patch (PATCH1/3),
   and add Will's Signed-off-by.
- Use *return -EPERM* way when changing permissions only in the mapping path.
- Add a new patch (PATCH3/3).

Yanan Wang (3):
   KVM: arm64: Decouple partial code of hyp stage 1 mapping and guest
 stage 2 mapping
   KVM: arm64: Add prejudgement for relaxing permissions only case in
 stage2 translation fault handler
   KVM: arm64: Mark the page dirty only if the fault is handled
 successfully

  arch/arm64/kvm/hyp/pgtable.c | 78 
  arch/arm64/kvm/mmu.c | 18 +++--
  2 files changed, 58 insertions(+), 38 deletions(-)



Re: [PATCH -next] ext4: use DEFINE_MUTEX (and mutex_init() had been too late)

2020-12-23 Thread Theodore Y. Ts'o
On Wed, Dec 23, 2020 at 10:12:54PM +0800, Zheng Yongjun wrote:
> Signed-off-by: Zheng Yongjun 

Why is mutex_init() too late?  We only take the mutex after we mount
an ext4 file system, and that can't happen until ext4_init_fs
is called.
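
For illustration, a minimal sketch (hypothetical names, not the ext4 code) of
why the two forms are interchangeable here: DEFINE_MUTEX() initializes the
mutex at compile time, while mutex_init() does the same work at runtime, and
either is fine as long as it happens before the first mutex_lock():

#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/init.h>

static DEFINE_MUTEX(example_mtx);	/* statically initialized */

static struct mutex example_mtx2;	/* initialized at module init time ... */

static int __init example_init(void)
{
	mutex_init(&example_mtx2);	/* ... before any possible user */
	return 0;
}
module_init(example_init);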

- Ted

>  fs/ext4/super.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 94472044f4c1..8776f06a639d 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -59,7 +59,7 @@
>  #include 
>  
>  static struct ext4_lazy_init *ext4_li_info;
> -static struct mutex ext4_li_mtx;
> +static DEFINE_MUTEX(ext4_li_mtx);
>  static struct ratelimit_state ext4_mount_msg_ratelimit;
>  
>  static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
> @@ -6640,7 +6640,6 @@ static int __init ext4_init_fs(void)
>  
>   ratelimit_state_init(_mount_msg_ratelimit, 30 * HZ, 64);
>   ext4_li_info = NULL;
> - mutex_init(_li_mtx);
>  
>   /* Build-time check for flags consistency */
>   ext4_check_flag_values();
> -- 
> 2.22.0
> 


[GIT PULL] ext4 updates for v5.11-rc1

2020-12-22 Thread Theodore Y. Ts'o
The following changes since commit 418baf2c28f3473039f2f7377760bd8f6897ae18:

  Linux 5.10-rc5 (2020-11-22 15:36:08 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus

for you to fetch changes up to be993933d2e997fdb72b8b1418d2a84df79b8962:

  ext4: remove unnecessary wbc parameter from ext4_bio_write_page (2020-12-22 
13:08:45 -0500)

NOTE: The reason why the branch had recently changed was to add a
one-line fix which added a flush_work() call to an error/cleanup path,
to address a syzbot reported failure.  See the thread at:

http://lore.kernel.org/r/1faff305b709b...@google.com

There were also some commit description updates to add some Cc:
sta...@kernel.org tags.

This branch was tested and passes xfstests regression tests, and in
any case, it's all bug fixes and cleanups:

TESTRUNID: tytso-20201222152130
KERNEL:5.10.0-rc5-xfstests-00029-gbe993933d2e9 #2064 SMP Tue Dec 22 
15:19:12 EST 2020 x86_64
CMDLINE:   -c ext4/4k -g auto
CPUS:  2
MEM:   7680

ext4/4k: 520 tests, 43 skipped, 6608 seconds
Totals: 477 tests, 43 skipped, 0 failures, 0 errors, 6554s


Various bug fixes and cleanups for ext4; no new features this cycle.



Alexander Lochmann (1):
  Updated locking documentation for transaction_t

Chunguang Xu (7):
  ext4: use ASSERT() to replace J_ASSERT()
  ext4: remove redundant mb_regenerate_buddy()
  ext4: simplify the code of mb_find_order_for_block
  ext4: update ext4_data_block_valid related comments
  ext4: delete nonsensical (commented-out) code inside 
ext4_xattr_block_set()
  ext4: fix a memory leak of ext4_free_data
  ext4: avoid s_mb_prefetch to be zero in individual scenarios

Colin Ian King (1):
  ext4: remove redundant assignment of variable ex

Dan Carpenter (1):
  ext4: fix an IS_ERR() vs NULL check

Gustavo A. R. Silva (1):
  ext4: fix fall-through warnings for Clang

Harshad Shirwadkar (3):
  ext4: add docs about fast commit idempotence
  ext4: make fast_commit.h byte identical with e2fsprogs/fast_commit.h
  jbd2: add a helper to find out number of fast commit blocks

Jan Kara (8):
  ext4: fix deadlock with fs freezing and EA inodes
  ext4: don't remount read-only with errors=continue on reboot
  ext4: remove redundant sb checksum recomputation
  ext4: standardize error message in ext4_protect_reserved_inode()
  ext4: make ext4_abort() use __ext4_error()
  ext4: move functions in super.c
  ext4: simplify ext4 error translation
  ext4: defer saving error info from atomic context

Kaixu Xia (2):
  ext4: remove redundant operation that set bh to NULL
  ext4: remove the unused EXT4_CURRENT_REV macro

Lei Chen (1):
  ext4: remove unnecessary wbc parameter from ext4_bio_write_page

Roman Anufriev (2):
  ext4: add helpers for checking whether quota can be enabled/is journalled
  ext4: print quota journalling mode on (re-)mount

Theodore Ts'o (1):
  ext4: check for invalid block size early when mounting a file system

Xianting Tian (1):
  ext4: remove the null check of bio_vec page

 Documentation/filesystems/ext4/journal.rst |  50 ++
 fs/ext4/balloc.c   |   2 +-
 fs/ext4/block_validity.c   |  16 +-
 fs/ext4/ext4.h |  77 ++---
 fs/ext4/ext4_jbd2.c|   4 +-
 fs/ext4/ext4_jbd2.h|   9 +-
 fs/ext4/extents.c  |   5 +-
 fs/ext4/fast_commit.c  |  99 +++-
 fs/ext4/fast_commit.h  |  78 +++--
 fs/ext4/fsync.c|   2 +-
 fs/ext4/indirect.c |   4 +-
 fs/ext4/inode.c|  35 ++--
 fs/ext4/mballoc.c  |  39 ++---
 fs/ext4/namei.c|  12 +-
 fs/ext4/page-io.c  |   5 +-
 fs/ext4/super.c| 422 
-
 fs/ext4/xattr.c|   1 -
 fs/jbd2/journal.c  |   8 +-
 include/linux/jbd2.h   |  14 +-
 19 files changed, 504 insertions(+), 378 deletions(-)


Re: general protection fault in ext4_commit_super

2020-12-22 Thread Theodore Y. Ts'o
On Tue, Dec 22, 2020 at 12:28:53PM +0100, Jan Kara wrote:
> > Fix e810c942a325 ("ext4: save error info to sb through journal if 
> > available")
> > by flushing work as part of rollback.
> 
> Thanks for having a look. I don't think the fix is quite correct though. The
> flush_work() should be at failed_mount3: label. So something like attached
> fixup. Ted, can you please fold it into the buggy commit?

Done.  I folded it into "ext4: defer saving error info from atomic
context" since this is the commit where we introduced the s_error_work
workqueue.

Thanks!!

- Ted


Re: [PATCH v2 0/3] add support for metadata encryption to F2FS

2020-12-17 Thread Theodore Y. Ts'o
On Thu, Dec 17, 2020 at 08:51:14PM +, Satya Tangirala wrote:
> On Thu, Dec 17, 2020 at 01:08:49PM -0500, Theodore Y. Ts'o wrote:
> > On Thu, Dec 17, 2020 at 03:04:32PM +, Satya Tangirala wrote:
> > > This patch series adds support for metadata encryption to F2FS using
> > > blk-crypto.
> > 
> > Is there a companion patch series needed so that f2fstools can
> > check/repair a file system with metadata encryption enabled?
> > 
> > - Ted
> Yes! It's at
> https://lore.kernel.org/linux-f2fs-devel/20201217151013.1513045-1-sat...@google.com/

Cool, I've been meaning to update f2fs-tools in Debian, and including
these patches will allow us to generate {kvm,gce,android}-xfstests
images with this support.  I'm hoping to get to it sometime between
Christmas and New Year's.

I guess there will be some additional work needed to create
the f2fs image with fixed keys for a particular file system in
xfstests-bld, and then to mount and check said image with the
appropriate keys as well.   Is that something you've put together?

Cheers,

- Ted


Re: [PATCH] ext4: Don't leak old mountpoint samples

2020-12-17 Thread Theodore Y. Ts'o
On Tue, Dec 01, 2020 at 04:13:01PM +0100, Richard Weinberger wrote:
> As soon as the first file is opened, ext4 samples the mountpoint
> of the filesystem in 64 bytes of the super block.
> It does so using strlcpy(); this means that the remaining bytes
> in the super block string buffer are untouched.
> If the mount point before had a longer path than the current one,
> it can be reconstructed.
> 
> Consider the case where the fs was mounted to "/media/johnjdeveloper"
> and later to "/".
> The super block buffer then contains "/\x00edia/johnjdeveloper".
> 
> This case was seen in the wild and caused confusion about how the name
> of a developer ends up on the super block of a filesystem used
> in production...
> 
> Fix this by clearing the string buffer before writing to it.
> 
> Signed-off-by: Richard Weinberger 

Thanks for reporting this issue.  In fact, the better fix is to use
strncpy().  See my revised patch for an explanation of why.

commit cdc9ad7d3f201a77749432878fb4caa490862de6
Author: Theodore Ts'o 
Date:   Thu Dec 17 13:24:15 2020 -0500

ext4: don't leak old mountpoint samples

When the first file is opened, ext4 samples the mountpoint of the
filesystem in 64 bytes of the super block.  It does so using
strlcpy(); this means that the remaining bytes in the super block
string buffer are untouched.  If the mount point before had a longer
path than the current one, it can be reconstructed.

Consider the case where the fs was mounted to "/media/johnjdeveloper"
and later to "/".  The super block buffer then contains
"/\x00edia/johnjdeveloper".

This case was seen in the wild and caused confusion about how the name
of a developer ends up on the super block of a filesystem used
in production...

Fix this by using strncpy() instead of strlcpy().  The superblock
field is defined to be a fixed-size char array, and it is already
marked using __nonstring in fs/ext4/ext4.h.  The consumer of the field
in e2fsprogs already assumes that in the case of a 64+ byte mount
path, that s_last_mounted will not be NUL terminated.

Reported-by: Richard Weinberger 
Signed-off-by: Theodore Ts'o 

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1cd3d26e3217..349b27f0dda0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -810,7 +810,7 @@ static int ext4_sample_last_mounted(struct super_block *sb,
if (err)
goto out_journal;
lock_buffer(sbi->s_sbh);
-   strlcpy(sbi->s_es->s_last_mounted, cp,
+   strncpy(sbi->s_es->s_last_mounted, cp,
sizeof(sbi->s_es->s_last_mounted));
ext4_superblock_csum_set(sb);
unlock_buffer(sbi->s_sbh);
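
For illustration, a minimal userspace sketch (not part of the patch; it assumes
libbsd for strlcpy() on glibc systems, and the kernel's strlcpy() behaves the
same way) showing why strlcpy() leaves the old contents in place while
strncpy() zero-pads the rest of the buffer:

#include <stdio.h>
#include <string.h>
#include <bsd/string.h>		/* strlcpy(); link with -lbsd */

int main(void)
{
	char buf[64];

	/* First, the longer mount point is recorded. */
	strlcpy(buf, "/media/johnjdeveloper", sizeof(buf));

	/* Remount to "/": strlcpy() writes "/\0" and stops ... */
	strlcpy(buf, "/", sizeof(buf));
	printf("after strlcpy: \"%s\"\n", buf + 2);	/* old path still visible */

	/* ... whereas strncpy() pads the remaining bytes with NULs. */
	strncpy(buf, "/", sizeof(buf));
	printf("after strncpy: \"%s\"\n", buf + 2);	/* empty */

	return 0;
}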


Re: [PATCH v2 0/3] add support for metadata encryption to F2FS

2020-12-17 Thread Theodore Y. Ts'o
On Thu, Dec 17, 2020 at 03:04:32PM +, Satya Tangirala wrote:
> This patch series adds support for metadata encryption to F2FS using
> blk-crypto.

Is there a companion patch series needed so that f2fstools can
check/repair a file system with metadata encryption enabled?

- Ted


Re: [PATCH] fs: ext4: remove unnecessary wbc parameter from ext4_bio_write_page

2020-12-16 Thread Theodore Y. Ts'o
On Fri, Dec 11, 2020 at 02:54:24PM +0800, chenle...@gmail.com wrote:
> From: Lei Chen 
> 
> ext4_bio_write_page does not need wbc parameter, since its parameter
> io contains the io_wbc field. The io::io_wbc is initialized by
> ext4_io_submit_init which is called in ext4_writepages and
> ext4_writepage functions prior to ext4_bio_write_page.
> Therefor, when ext4_bio_write_page is called, wbc info
> has already been included in io parameter.
> 
> Signed-off-by: Lei Chen 

Thanks, applied.

- Ted


Re: [PATCH 031/141] ext4: Fix fall-through warnings for Clang

2020-12-16 Thread Theodore Y. Ts'o
On Fri, Nov 20, 2020 at 12:28:32PM -0600, Gustavo A. R. Silva wrote:
> In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
> by explicitly adding a break statement instead of just letting the code
> fall through to the next case.
> 
> Link: https://github.com/KSPP/linux/issues/115
> Signed-off-by: Gustavo A. R. Silva 

Thanks, applied.

- Ted


Re: [PATCH] ext4: fix -Wstringop-truncation warnings

2020-12-15 Thread Theodore Y. Ts'o
On Thu, Nov 12, 2020 at 05:33:24PM +0800, Kang Wenlin wrote:
> From: Wenlin Kang 
> 
> The strncpy() function may create an unterminated string,
> use strscpy_pad() instead.
> 
> This fixes the following warning:
> 
> fs/ext4/super.c: In function '__save_error_info':
> fs/ext4/super.c:349:2: warning: 'strncpy' specified bound 32 equals 
> destination size [-Wstringop-truncation]
>   strncpy(es->s_last_error_func, func, sizeof(es->s_last_error_func));
>   ^~~
> fs/ext4/super.c:353:3: warning: 'strncpy' specified bound 32 equals 
> destination size [-Wstringop-truncation]
>strncpy(es->s_first_error_func, func,
>^
> sizeof(es->s_first_error_func));
> ~~~

What compiler are you using?  s_last_error_func is defined to not
necessarily be NUL terminated.  So strscpy_pad() is not a proper
replacement for strncpy() in this use case.

>From Documentation/process/deprecated:

   If a caller is using non-NUL-terminated strings, strncpy() can
   still be used, but destinations should be marked with the `__nonstring
   `_
   attribute to avoid future compiler warnings.

s_{first,last}_error_func is properly annotated with __nonstring in
fs/ext4/ext4.h.
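
For illustration, the pattern being described, with hypothetical names (the
real fields live in fs/ext4/ext4.h):

#include <linux/compiler_attributes.h>
#include <linux/string.h>

struct example_error_info {
	/* Fixed-size, not necessarily NUL-terminated. */
	char	last_func[32] __nonstring;
};

static void example_save_func(struct example_error_info *ei, const char *func)
{
	/* strncpy() is correct here; strscpy_pad() would always NUL-terminate
	 * and so lose the last byte of a 32-character function name. */
	strncpy(ei->last_func, func, sizeof(ei->last_func));
}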

- Ted


Re: UBSAN: shift-out-of-bounds in ext4_fill_super

2020-12-14 Thread Theodore Y. Ts'o
(Dropping off-topic lists)

On Mon, Dec 14, 2020 at 03:37:37PM +0100, Dmitry Vyukov wrote:
> > It's going to make everyone else's tags who pull from ext4.git messy,
> > though, with gobs of tags that probably won't be of use to them.  It
> > does avoid the need to use git fetch --tags --force, and I guess
> > people are used to the need to GC tags with the linux-repo.

(I had meant to say linux-next repo above.)

> syzbot is now prepared and won't fail next time, nor on other similar
> trees. Which is good.
> So it's really up to you.

I'm curious --- are you having to do anything special in terms of
deleting old tags to keep the size of the repo under control?  Git
will keep a tag around indefinitely, so if you have huge numbers of
next-MMDD tags in your repo, the size will grow without bound.
Are you doing anything to automatically garbage collect tags to prevent
this from being a problem?

(I am not pulling linux-next every day; only when I need to debug a
bug reported against the -next tree, so I just manually delete the
tags as necessary.  So I'm curious what folks who are following
linux-next are doing, and whether they have something specific for
linux-next tags, or whether they have a more general solution.)

Cheers,

- Ted


Re: [RFC PATCH] KVM: arm64: Add prejudgement for relaxing permissions only case in stage2 translation fault handler

2020-12-13 Thread wangyanan (Y)

Hi Will, Marc,

On 2020/12/11 18:00, Will Deacon wrote:

On Fri, Dec 11, 2020 at 09:49:28AM +, Marc Zyngier wrote:

On 2020-12-11 08:01, Yanan Wang wrote:

@@ -461,25 +462,56 @@ static int stage2_map_set_prot_attr(enum
kvm_pgtable_prot prot,
return 0;
  }

+static bool stage2_set_valid_leaf_pte_pre(u64 addr, u32 level,
+ kvm_pte_t *ptep, kvm_pte_t new,
+ struct stage2_map_data *data)
+{
+   kvm_pte_t old = *ptep, old_attr, new_attr;
+
+   if ((old ^ new) & (~KVM_PTE_LEAF_ATTR_PERMS))
+   return false;
+
+   /*
+* Skip updating if we are trying to recreate exactly the same mapping
+* or to reduce the access permissions only. And update the valid leaf
+* PTE without break-before-make if we are trying to add more access
+* permissions only.
+*/
+   old_attr = (old & KVM_PTE_LEAF_ATTR_PERMS) ^
KVM_PTE_LEAF_ATTR_HI_S2_XN;
+   new_attr = (new & KVM_PTE_LEAF_ATTR_PERMS) ^
KVM_PTE_LEAF_ATTR_HI_S2_XN;
+   if (new_attr <= old_attr)
+   return true;
+
+   WRITE_ONCE(*ptep, new);
+   kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);

I think what bothers me the most here is that we are turning a mapping into
a permission update, which makes the code really hard to read, and mixes
two things that were so far separate.

I wonder whether we should instead abort the update and simply take the
fault
again, if we ever need to do it.

That's a nice idea. If we could enforce that we don't alter permissions on
the map path, and instead just return e.g. -EAGAIN then that would be a
very neat solution and would cement the permission vs translation fault
division.


I agree that we can indeed simplify the code and separate permission-relaxing
from mapping by simply returning in that case, although the possible cost is
one more vCPU trap on a permission fault next time.

So how about the two new diffs below? I split them into two patches with
different aims.


Thanks,

Yanan.


diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 23a01dfcb27a..a74a62283012 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -170,10 +170,9 @@ static void kvm_set_table_pte(kvm_pte_t *ptep, 
kvm_pte_t *childp)

    smp_store_release(ptep, pte);
 }

-static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,
-  u32 level)
+static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 level)
 {
-   kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(pa);
+   kvm_pte_t pte = kvm_phys_to_pte(pa);
    u64 type = (level == KVM_PGTABLE_MAX_LEVELS - 1) ? 
KVM_PTE_TYPE_PAGE :

KVM_PTE_TYPE_BLOCK;

@@ -181,12 +180,7 @@ static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, 
u64 pa, kvm_pte_t attr,

    pte |= FIELD_PREP(KVM_PTE_TYPE, type);
    pte |= KVM_PTE_VALID;

-   /* Tolerate KVM recreating the exact same mapping. */
-   if (kvm_pte_valid(old))
-   return old == pte;
-
-   smp_store_release(ptep, pte);
-   return true;
+   return pte;
 }

 static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, 
u64 addr,
@@ -341,12 +335,17 @@ static int hyp_map_set_prot_attr(enum 
kvm_pgtable_prot prot,

 static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
    kvm_pte_t *ptep, struct 
hyp_map_data *data)

 {
+   kvm_pte_t new, old = *ptep;
    u64 granule = kvm_granule_size(level), phys = data->phys;

    if (!kvm_block_mapping_supported(addr, end, phys, level))
    return false;

-   WARN_ON(!kvm_set_valid_leaf_pte(ptep, phys, data->attr, level));
+   /* Tolerate KVM recreating the exact same mapping. */
+   new = kvm_init_valid_leaf_pte(phys, data->attr, level);
+   if (old != new && !WARN_ON(kvm_pte_valid(old)))
+   smp_store_release(ptep, new);
+
    data->phys += granule;
    return true;
 }
@@ -465,21 +464,29 @@ static bool stage2_map_walker_try_leaf(u64 addr, 
u64 end, u32 level,

   kvm_pte_t *ptep,
   struct stage2_map_data *data)
 {
+   kvm_pte_t new, old = *ptep;
    u64 granule = kvm_granule_size(level), phys = data->phys;
+   struct page *page = virt_to_page(ptep);

    if (!kvm_block_mapping_supported(addr, end, phys, level))
    return false;

-   if (kvm_pte_valid(*ptep))
-   put_page(virt_to_page(ptep));
+   new = kvm_init_valid_leaf_pte(phys, data->attr, level);
+   if (kvm_pte_valid(old)) {
+   /* Tolerate KVM recreating the exact same mapping. */
+   if (old == new)
+   goto out;

-   if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
-   goto out;
+   /* There's an existing 

Re: [RFC PATCH] KVM: arm64: Add prejudgement for relaxing permissions only case in stage2 translation fault handler

2020-12-13 Thread wangyanan (Y)



On 2020/12/11 17:49, Marc Zyngier wrote:

Hi Yanan,

On 2020-12-11 08:01, Yanan Wang wrote:

During dirty logging, after dirty logging has been stopped, or even during
normal running time of a guest configured with huge mappings and a number of
vCPUs, translation faults by different vCPUs on the same GPA can occur in
quick succession, almost at the same time. There are two reasons for this.

(1) If there are some vCPUs accessing the same GPA at the same time
and the leaf PTE is not set yet, then they will all cause translation
faults and the first vCPU holding mmu_lock will set valid leaf PTE,
and the others will later choose to update the leaf PTE or not.

(2) When changing a leaf entry or a table entry with break-before-make,
if there are some vCPUs accessing the same GPA just catch the moment
when the target PTE is set invalid in a BBM procedure coincidentally,
they will all cause translation faults and will later choose to update
the leaf PTE or not.

The worst case can be like this: some vCPUs cause translation faults
on the same GPA with different prots, and they will fight each other by
changing back access permissions of the PTE with break-before-make.
And the BBM-invalid moment might trigger more unnecessary translation
faults. As a result, some useless small loops will occur, which could
lead to vCPUs getting stuck.

To avoid unnecessary update and small loops, add prejudgement in the
translation fault handler: Skip updating the valid leaf PTE if we are
trying to recreate exactly the same mapping or to reduce access
permissions only(such as RW-->RO). And update the valid leaf PTE without
break-before-make if we are trying to add more permissions only.


I'm a bit perplexed with this: why are you skipping the update if the
permissions need to be reduced? Even more, how can we reduce the
permissions from a vCPU fault? I can't really think of a scenario where
that happens.

Or are you describing a case where two vcpus fault simultaneously with
conflicting permissions:

- Both vcpus fault on translation fault
- vcpu A wants W access
- vpcu B wants R access

and 'A' gets in first, set the permissions to RW (because R is
implicitly added to W), followed by 'B' which downgrades it to RO?

If that's what you are describing, then I agree we could do better.

Yes, this is exactly what I want to describe.




Signed-off-by: Yanan Wang 
---
 arch/arm64/kvm/hyp/pgtable.c | 73 +---
 1 file changed, 52 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 23a01dfcb27a..f8b3248cef1c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -45,6 +45,8 @@

 #define KVM_PTE_LEAF_ATTR_HI_S2_XN    BIT(54)

+#define KVM_PTE_LEAF_ATTR_PERMS    (GENMASK(7, 6) | BIT(54))
+
 struct kvm_pgtable_walk_data {
 struct kvm_pgtable    *pgt;
 struct kvm_pgtable_walker    *walker;
@@ -170,10 +172,9 @@ static void kvm_set_table_pte(kvm_pte_t *ptep,
kvm_pte_t *childp)
 smp_store_release(ptep, pte);
 }

-static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, 
kvm_pte_t attr,

-   u32 level)
+static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 
level)

 {
-    kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(pa);
+    kvm_pte_t pte = kvm_phys_to_pte(pa);
 u64 type = (level == KVM_PGTABLE_MAX_LEVELS - 1) ? 
KVM_PTE_TYPE_PAGE :

    KVM_PTE_TYPE_BLOCK;

@@ -181,12 +182,7 @@ static bool kvm_set_valid_leaf_pte(kvm_pte_t
*ptep, u64 pa, kvm_pte_t attr,
 pte |= FIELD_PREP(KVM_PTE_TYPE, type);
 pte |= KVM_PTE_VALID;

-    /* Tolerate KVM recreating the exact same mapping. */
-    if (kvm_pte_valid(old))
-    return old == pte;
-
-    smp_store_release(ptep, pte);
-    return true;
+    return pte;
 }

 static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data 
*data, u64 addr,

@@ -341,12 +337,17 @@ static int hyp_map_set_prot_attr(enum
kvm_pgtable_prot prot,
 static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 kvm_pte_t *ptep, struct hyp_map_data *data)
 {
+    kvm_pte_t new, old = *ptep;
 u64 granule = kvm_granule_size(level), phys = data->phys;

 if (!kvm_block_mapping_supported(addr, end, phys, level))
 return false;

-    WARN_ON(!kvm_set_valid_leaf_pte(ptep, phys, data->attr, level));
+    /* Tolerate KVM recreating the exact same mapping. */
+    new = kvm_init_valid_leaf_pte(phys, data->attr, level);
+    if (old != new && !WARN_ON(kvm_pte_valid(old)))
+    smp_store_release(ptep, new);
+
 data->phys += granule;
 return true;
 }
@@ -461,25 +462,56 @@ static int stage2_map_set_prot_attr(enum
kvm_pgtable_prot prot,
 return 0;
 }

+static bool stage2_set_valid_leaf_pte_pre(u64 addr, u32 level,
+  kvm_pte_t *ptep, kvm_pte_t new,
+  struct stage2_map_data *data)
+{
+    kvm_pte_t old = *ptep, old_attr, new_attr;
+
+    if ((old ^ new) & (~KVM_PTE_LEAF_ATTR_PERMS))
+   

Re: [RFC PATCH] KVM: arm64: Add prejudgement for relaxing permissions only case in stage2 translation fault handler

2020-12-13 Thread wangyanan (Y)



On 2020/12/11 17:53, Will Deacon wrote:

Hi Yanan,

On Fri, Dec 11, 2020 at 04:01:15PM +0800, Yanan Wang wrote:

During dirty logging, after dirty logging has been stopped, or even during
normal running time of a guest configured with huge mappings and a number of
vCPUs, translation faults by different vCPUs on the same GPA can occur in
quick succession, almost at the same time. There are two reasons for this.

(1) If there are some vCPUs accessing the same GPA at the same time
and the leaf PTE is not set yet, then they will all cause translation
faults and the first vCPU holding mmu_lock will set valid leaf PTE,
and the others will later choose to update the leaf PTE or not.

(2) When changing a leaf entry or a table entry with break-before-make,
if there are some vCPUs accessing the same GPA just catch the moment
when the target PTE is set invalid in a BBM procedure coincidentally,
they will all cause translation faults and will later choose to update
the leaf PTE or not.

The worst case can be like this: some vCPUs cause translation faults
on the same GPA with different prots, and they will fight each other by
changing back access permissions of the PTE with break-before-make.
And the BBM-invalid moment might trigger more unnecessary translation
faults. As a result, some useless small loops will occur, which could
lead to vCPUs getting stuck.

To avoid unnecessary update and small loops, add prejudgement in the
translation fault handler: Skip updating the valid leaf PTE if we are
trying to recreate exactly the same mapping or to reduce access
permissions only(such as RW-->RO). And update the valid leaf PTE without
break-before-make if we are trying to add more permissions only.

Signed-off-by: Yanan Wang 
---
  arch/arm64/kvm/hyp/pgtable.c | 73 +---
  1 file changed, 52 insertions(+), 21 deletions(-)

Cheers for this. Given that this patch is solving a few different problems,
do you think you could split it up please? That would certainly make it much
easier to review, as there's quite a lot going on here. A chunk of the
changes seem to be the diff I posted previously:

https://lore.kernel.org/r/20201201141632.GC26973@willie-the-truck

so maybe that could be its own patch?

Yeah, I will split the diff into two patches in the next version, thanks.



diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 23a01dfcb27a..f8b3248cef1c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -45,6 +45,8 @@
  
  #define KVM_PTE_LEAF_ATTR_HI_S2_XN	BIT(54)
  
+#define KVM_PTE_LEAF_ATTR_PERMS	(GENMASK(7, 6) | BIT(54))

You only use this on the S2 path, so how about:

#define KVM_PTE_LEAF_ATTR_S2_PERMS  KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
KVM_PTE_LEAF_ATTR_HI_S2_XN

or something like that?

Yes, it's more reasonable.

  struct kvm_pgtable_walk_data {
struct kvm_pgtable  *pgt;
struct kvm_pgtable_walker   *walker;
@@ -170,10 +172,9 @@ static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t 
*childp)
smp_store_release(ptep, pte);
  }
  
-static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,

-  u32 level)
+static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 level)
  {
-   kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(pa);
+   kvm_pte_t pte = kvm_phys_to_pte(pa);
u64 type = (level == KVM_PGTABLE_MAX_LEVELS - 1) ? KVM_PTE_TYPE_PAGE :
   KVM_PTE_TYPE_BLOCK;
  
@@ -181,12 +182,7 @@ static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,

pte |= FIELD_PREP(KVM_PTE_TYPE, type);
pte |= KVM_PTE_VALID;
  
-	/* Tolerate KVM recreating the exact same mapping. */

-   if (kvm_pte_valid(old))
-   return old == pte;
-
-   smp_store_release(ptep, pte);
-   return true;
+   return pte;
  }
  
  static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,

@@ -341,12 +337,17 @@ static int hyp_map_set_prot_attr(enum kvm_pgtable_prot 
prot,
  static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
kvm_pte_t *ptep, struct hyp_map_data *data)
  {
+   kvm_pte_t new, old = *ptep;
u64 granule = kvm_granule_size(level), phys = data->phys;
  
  	if (!kvm_block_mapping_supported(addr, end, phys, level))

return false;
  
-	WARN_ON(!kvm_set_valid_leaf_pte(ptep, phys, data->attr, level));

+   /* Tolerate KVM recreating the exact same mapping. */
+   new = kvm_init_valid_leaf_pte(phys, data->attr, level);
+   if (old != new && !WARN_ON(kvm_pte_valid(old)))
+   smp_store_release(ptep, new);
+
data->phys += granule;
return true;
  }
@@ -461,25 +462,56 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot 
prot,

Re: UBSAN: shift-out-of-bounds in ext4_fill_super

2020-12-10 Thread Theodore Y. Ts'o
On Thu, Dec 10, 2020 at 09:09:51AM +0100, Dmitry Vyukov wrote:
> >  * [new tag]   ext4-for-linus-5.8-rc1-2 -> 
> > ext4-for-linus-5.8-rc1-2
> >  ! [rejected]  ext4_for_linus   -> ext4_for_linus  
> > (would clobber existing tag)
> 
> Interesting. First time I see this. Should syzkaller use 'git fetch
> --tags --force"?...
> StackOverflow suggests it should help:
> https://stackoverflow.com/questions/58031165/how-to-get-rid-of-would-clobber-existing-tag

Yeah, sorry, ext4_for_linus is a signed tag which is only used to
authenticate my pull request to Linus.  After Linus accepts the pull,
the digital signature is going to be upstream, and so I end up
deleting and then reusing that tag for the next merge window.

I guess I could just start always using ext4_for_linus- and
just delete the tags once they have been accepted, to keep my list of
tags clean. 

It's going to make everyone else's tags who pull from ext4.git messy,
though, with gobs of tags that probably won't be of use to them.  It
does avoid the need to use git fetch --tags --force, and I guess
people are used to the need to GC tags with the linux-repo.  So maybe
that's the right thing to do going forward.


- Ted


Re: UBSAN: shift-out-of-bounds in ext4_fill_super

2020-12-09 Thread Theodore Y. Ts'o
On Tue, Dec 08, 2020 at 11:33:11PM -0800, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:15ac8fdb Add linux-next specific files for 20201207
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1125c92350
> kernel config:  https://syzkaller.appspot.com/x/.config?x=3696b8138207d24d
> dashboard link: https://syzkaller.appspot.com/bug?extid=345b75652b1d24227443
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=151bf86b50
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=139212cb50

#syz test git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
e360ba58d067a30a4e3e7d55ebdd919885a058d6

>From 3d3bc303a8a8f7123cf486f49fa9060116fa1465 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o 
Date: Wed, 9 Dec 2020 15:59:11 -0500
Subject: [PATCH] ext4: check for invalid block size early when mounting a file
 system

Check for valid block size directly by validating s_log_block_size; we
were doing this in two places.  First, by calculating blocksize via
BLOCK_SIZE << s_log_block_size, and then checking that the blocksize
was valid.  And then secondly, by checking s_log_block_size directly.

The first check is not reliable, and can trigger an UBSAN warning if
s_log_block_size on a maliciously corrupted superblock is greater than
22.  This is harmless, since the second test will correctly reject the
maliciously fuzzed file system, but to make syzbot shut up, and
because the two checks are duplicative in any case, delete the
blocksize check, and move the s_log_block_size check earlier in
ext4_fill_super().

Signed-off-by: Theodore Ts'o 
Reported-by: syzbot+345b75652b1d24227...@syzkaller.appspotmail.com
---
 fs/ext4/super.c | 40 
 1 file changed, 16 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f86220a8df50..4a16bbf0432c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4202,18 +4202,25 @@ static int ext4_fill_super(struct super_block *sb, void 
*data, int silent)
 */
sbi->s_li_wait_mult = EXT4_DEF_LI_WAIT_MULT;
 
-   blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
-
-   if (blocksize == PAGE_SIZE)
-   set_opt(sb, DIOREAD_NOLOCK);
-
-   if (blocksize < EXT4_MIN_BLOCK_SIZE ||
-   blocksize > EXT4_MAX_BLOCK_SIZE) {
+   if (le32_to_cpu(es->s_log_block_size) >
+   (EXT4_MAX_BLOCK_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
ext4_msg(sb, KERN_ERR,
-  "Unsupported filesystem blocksize %d (%d 
log_block_size)",
-blocksize, le32_to_cpu(es->s_log_block_size));
+"Invalid log block size: %u",
+le32_to_cpu(es->s_log_block_size));
goto failed_mount;
}
+   if (le32_to_cpu(es->s_log_cluster_size) >
+   (EXT4_MAX_CLUSTER_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
+   ext4_msg(sb, KERN_ERR,
+"Invalid log cluster size: %u",
+le32_to_cpu(es->s_log_cluster_size));
+   goto failed_mount;
+   }
+
+   blocksize = EXT4_MIN_BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
+
+   if (blocksize == PAGE_SIZE)
+   set_opt(sb, DIOREAD_NOLOCK);
 
if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV) {
sbi->s_inode_size = EXT4_GOOD_OLD_INODE_SIZE;
@@ -4432,21 +4439,6 @@ static int ext4_fill_super(struct super_block *sb, void 
*data, int silent)
if (!ext4_feature_set_ok(sb, (sb_rdonly(sb
goto failed_mount;
 
-   if (le32_to_cpu(es->s_log_block_size) >
-   (EXT4_MAX_BLOCK_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
-   ext4_msg(sb, KERN_ERR,
-"Invalid log block size: %u",
-le32_to_cpu(es->s_log_block_size));
-   goto failed_mount;
-   }
-   if (le32_to_cpu(es->s_log_cluster_size) >
-   (EXT4_MAX_CLUSTER_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
-   ext4_msg(sb, KERN_ERR,
-"Invalid log cluster size: %u",
-le32_to_cpu(es->s_log_cluster_size));
-   goto failed_mount;
-   }
-
if (le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks) > (blocksize / 4)) {
ext4_msg(sb, KERN_ERR,
 "Number of reserved GDT blocks insanely large: %d",
-- 
2.28.0



Re: Pass modules to Linux kernel without initrd

2020-12-08 Thread Theodore Y. Ts'o
On Tue, Dec 08, 2020 at 10:24:08AM +0100, Paul Menzel wrote:
> Dear Linux folks,
> 
> Trying to reduce the boot time of standard distributions, I would like to
> get rid of the initrd. The initrd is for mounting the root file system and
> on most end user systems with standard distributions that means loading the
> bus driver for the drive and the file system driver. Everyone could build
> their own Linux kernel and build the drivers into the Linux kernel, but most
> users enjoy using the distribution Linux kernel, which build the drivers as
> modules to support a lot of systems. (I think Fedora builds the default file
> system driver (of the installer) into the Linux kernel.)

It's unclear what you are trying to speed up by replacing the initrd
with "appending the required modules to the Linux kernel image".  Why
do you think this will speed things up?  What do you think is
currently slow with using an initrd?

If what you are concerned about is the speed to load an initrd which
has all of the kernel modules shipped by the distribution, including
those not needed by a particular hardware platform, there are
distributions which can be configured to automatically include only
those kernel modules needed for a particular system.

There are also some shell scripts which some people have written that
will automatically create a kernel config file which only has the
device drivers needed for a particular system.  Creating a system
which used such a script, and then compiled a custom kernel image
would also not be hard.

You seem to be assuming that building a custom kernel image is
hard(tm), and so no user would want to do this.  If this were 
automated, what is your objection to such an approach?

Without a clear understanding of what part of the boot process you think
is slow, and which you are trying to optimize, and what precisely your
constraints are, or at least, what you *think* your constraints are,
and why you think things have to be that way, it's going to be hard to
comment further.

Cheers,

- Ted


[PATCH] dmaengine: bam_dma: fix return of bam_dma_irq()

2020-12-06 Thread Parth Y Shah
While performing suspend/resume, we were getting the kernel crash below.

[   54.541672] [FTS][Info]gesture suspend...
[   54.605256] [FTS][Error][GESTURE]Enter into gesture(suspend) failed!
[   54.605256]
[   58.345850] irq event 10: bogus return value fff3
..

[   58.345966] [] el1_irq+0xb0/0x124
[   58.345971] [] arch_cpu_idle+0x10/0x18
[   58.345975] [] do_idle+0x1ac/0x1e0
[   58.345979] [] cpu_startup_entry+0x20/0x28
[   58.345983] [] rest_init+0xd0/0xdc
[   58.345988] [] start_kernel+0x390/0x3a4
[   58.345990] handlers:
[   58.345994] [] bam_dma_irq

The reason for the crash, we found, is that bam_dma_irq() was returning a
negative value when the device resumed under some conditions.

In addition, an IRQ handler should return one of the values below.

IRQ_NONE	interrupt was not from this device or was not handled
IRQ_HANDLED	interrupt was handled by this device
IRQ_WAKE_THREAD	handler requests to wake the handler thread

Therefore, to resolve this crash, we have changed the return value to
IRQ_NONE.

Signed-off-by: Parth Y Shah 
---
 drivers/dma/qcom/bam_dma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/dma/qcom/bam_dma.c b/drivers/dma/qcom/bam_dma.c
index 4eeb8bb..d5773d4 100644
--- a/drivers/dma/qcom/bam_dma.c
+++ b/drivers/dma/qcom/bam_dma.c
@@ -875,7 +875,7 @@ static irqreturn_t bam_dma_irq(int irq, void *data)
 
ret = bam_pm_runtime_get_sync(bdev->dev);
if (ret < 0)
-   return ret;
+   return IRQ_NONE;
 
if (srcs & BAM_IRQ) {
clr_mask = readl_relaxed(bam_addr(bdev, 0, BAM_IRQ_STTS));
-- 
2.7.4
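
For reference, the return-value contract the fix follows, as a minimal
hypothetical handler (names and structure are illustrative only, not the
bam_dma code):

#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/pm_runtime.h>

static irqreturn_t example_irq(int irq, void *data)
{
	struct device *dev = data;

	if (pm_runtime_get_sync(dev) < 0) {
		pm_runtime_put_noidle(dev);
		return IRQ_NONE;	/* never a raw -errno value */
	}

	/* ... read the status registers and handle the device's events ... */

	pm_runtime_put(dev);
	return IRQ_HANDLED;
}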



Re: Why the auxiliary cipher in gss_krb5_crypto.c?

2020-12-04 Thread Theodore Y. Ts'o
On Fri, Dec 04, 2020 at 02:59:35PM +, David Howells wrote:
> Hi Chuck, Bruce,
> 
> Why is gss_krb5_crypto.c using an auxiliary cipher?  For reference, the
> gss_krb5_aes_encrypt() code looks like the attached.
> 
> From what I can tell, in AES mode, the difference between the main cipher and
> the auxiliary cipher is that the latter is "cbc(aes)" whereas the former is
> "cts(cbc(aes))" - but they have the same key.
> 
> Reading up on CTS, I'm guessing the reason it's like this is that CTS is the
> same as the non-CTS, except for the last two blocks, but the non-CTS one is
> more efficient.

The reason to use CTS is if you don't want to pad the size of the
ciphertext up to a multiple of the cipher block size.  E.g., if you have a 53 byte
plaintext, and you can't afford to let the ciphertext be 56 bytes, the
cryptographic engineer will reach for CTS instead of CBC.

So that probably explains the decision to use CTS (and it's
required by the spec in any case).  As far as why CBC is being used
instead of CTS, the only reason I can think of is the one you posted.
Perhaps there was some hardware or software configuration where
cbc(aes) was hardware accelerated, and cts(cbc(aes)) would not be?

In any case, using cbc(aes) for all but the last two blocks, and using
cts(cbc(aes)) for the last two blocks, is identical to using
cts(cbc(aes)) for the whole encryption.  So the only reason to do this
in the more complex way would be for performance reasons.
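
To make the equivalence concrete, a rough sketch (illustrative only, with
simplified linear-buffer handling, the assumption that len is at least two AES
blocks, and hypothetical function names; this is not the actual sunrpc code) of
doing the bulk with cbc(aes) and the tail with cts(cbc(aes)), chaining the IV
between the two:

#include <crypto/aes.h>
#include <crypto/skcipher.h>
#include <linux/scatterlist.h>
#include <linux/types.h>

static int example_split_encrypt(struct crypto_sync_skcipher *cbc_tfm,	/* "cbc(aes)" */
				 struct crypto_sync_skcipher *cts_tfm,	/* "cts(cbc(aes))", same key */
				 u8 *buf, unsigned int len, u8 *iv)
{
	unsigned int bs = AES_BLOCK_SIZE;
	unsigned int bulk = (len - 2 * bs) & ~(bs - 1);	/* everything but the last two blocks */
	struct scatterlist sg;
	int err;

	if (bulk) {
		SYNC_SKCIPHER_REQUEST_ON_STACK(req, cbc_tfm);

		sg_init_one(&sg, buf, bulk);
		skcipher_request_set_sync_tfm(req, cbc_tfm);
		skcipher_request_set_callback(req, 0, NULL, NULL);
		skcipher_request_set_crypt(req, &sg, &sg, bulk, iv);
		err = crypto_skcipher_encrypt(req);	/* iv now holds the last CBC block */
		skcipher_request_zero(req);
		if (err)
			return err;
	}

	{
		SYNC_SKCIPHER_REQUEST_ON_STACK(req, cts_tfm);

		sg_init_one(&sg, buf + bulk, len - bulk);
		skcipher_request_set_sync_tfm(req, cts_tfm);
		skcipher_request_set_callback(req, 0, NULL, NULL);
		skcipher_request_set_crypt(req, &sg, &sg, len - bulk, iv);
		err = crypto_skcipher_encrypt(req);
		skcipher_request_zero(req);
	}
	return err;
}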

 - Ted


Re: [PATCH V2] uapi: fix statx attribute value overlap for DAX & MOUNT_ROOT

2020-12-04 Thread Theodore Y. Ts'o
On Thu, Dec 03, 2020 at 08:18:23AM +0200, Amir Goldstein wrote:
> Here is a recent example, where during patch review, I requested NOT to 
> include
> any stable backport triggers [1]:
> "...We should consider sending this to stable, but maybe let's merge
> first and let it
>  run in master for a while before because it is not a clear and
> immediate danger..."
>
> As a developer and as a reviewer, I wish (as Dave implied) that I had a way to
> communicate to AUTOSEL that auto backport of this patch has more risk than
> the risk of not backporting.

My suggestion is that we could put something in the MAINTAINERS file
which indicates what the preferred delay time should be for (a)
patches explicitly cc'ed to stable, and (b) patches which are
AUTOSEL'ed for stable for that subsystem.  That time might be either
in days/weeks, or "after N -rc releases", "after the next full
release", or "never" (which would be a way for a subsystem to opt out
of the AUTOSEL process).

It should also be possible to specify the delay in a trailer, e.g.:

Stable-Defer: 
Auto-Stable-Defer: 

Specifying the delay relative to when the patches show up in Linus's
tree helps deal with the case where the submaintainer might not be
sure when their patches will get pushed to Linus by the maintainer.

Cheers,

- Ted


Re: [Ksummit-discuss] crediting bug reports and fixes folded into original patch

2020-12-03 Thread Theodore Y. Ts'o
On Thu, Dec 03, 2020 at 12:43:52AM +0100, Vlastimil Babka wrote:
> 
> there was a bit of debate on Twitter about this, so I thought I would bring it
> here. Imagine a scenario where patch sits as a commit in -next and there's a 
> bug
> report or fix, possibly by a bot or with some static analysis. The maintainer
> decides to fold it into the original patch, which makes sense for e.g.
> bisectability. But there seem to be no clear rules about attribution in this
> case, which looks like there should be, probably in
> Documentation/maintainer/modifying-patches.rst

I don't think there should be any kind of fixed, inflexible rules
about this.  

1) Sometimes there will be a *huge* number of comments and
suggestions.  Do we really want to require links to dozens of mail
message id's, and/or dozens or more e-mail addresses?

2) Sometimes a fixup is pretty trivial even if it is expressed in the
form of a one-line patch; compare that with someone who does a
detailed review of a patch but doesn't actually end up appending an
explicit Reviewed-by, perhaps because he or she didn't completely
agree with the final version of the patch.

3) I think this very much should be up to the maintainer's discretion,
as opposed to making rules that may result in some ridiculous amount
of bloat in the git log.

4) It's really unhealthy, in my opinion, for people to be fixated on
counting attributions.  If we create fixed rules, this can turn into
people trying to game the system.  It's the same reason why I'm not
terribly enthusiastic about people trying to game Signed-off-by counts
by sending gazillions of whitespace or spelling fixes.

If the fix is large enough that for copyright reasons we need to
acknowledge the work, then folding in the SoB for DCO reasons makes
perfect sense.  But if it's a trivial patch (the kind where projects
that require copyright assignment wouldn't require executed legal
agreements), then perhaps attribution is not always a requirement.
Again, there are times when people who put a lot of work into
discussing a patch may not get attribution, even while someone else,
who did little more than send in a one-line whitespace fix as a patch
with a signed-off-by, demands that the attribution be preserved.

Common sense really needs to prevail here, and I'm concerned that
people who like to create rules don't realize what a mess this can
create when contributors approach their participation with a sense of
entitlement.

Cheers,

- Ted


Re: [PATCH v3] Updated locking documentation for transaction_t

2020-12-03 Thread Theodore Y. Ts'o
On Thu, Dec 03, 2020 at 03:38:40PM +0100, Alexander Lochmann wrote:
> 
> 
> On 03.12.20 15:04, Theodore Y. Ts'o wrote:
> > On Thu, Oct 15, 2020 at 03:26:28PM +0200, Alexander Lochmann wrote:
> > > Hi folks,
> > > 
> > > I've updated the lock documentation according to our finding for
> > > transaction_t.
> > > Does this patch look good to you?
> > 
> > I updated the annotations to match with the local usage, e.g:
> > 
> >  * When commit was requested [journal_t.j_state_lock]
> > 
> > became:
> > 
> >  * When commit was requested [j_state_lock]
> What do you mean by local usage?
> The annotations of other members of transaction_t?

Yes, I'd like the annotations of the other objects to be consistent,
and just use j_state_lock, j_list_lock, etc., for the other annotations.

> Shouldn't the annotation look like this?
> [t_journal->j_state_lock]
> It would be more precise.

It's more precise, but it's also unnecessary in this case, since all
of the elements of the journal have a j_ prefix, elements of a
transaction_t have a t_ prefix, etc.  There is also no other structure
element which has a j_state_lock name *other* than in journal_t.
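
To make the convention concrete, a minimal illustrative sketch (the structures
and field names below are hypothetical, not the real jbd2 definitions):

#include <linux/spinlock.h>

/* Stand-in for journal_t: the only structure with a j_state_lock. */
struct example_journal {
	spinlock_t	j_state_lock;
};

/* Stand-in for transaction_t: each member is annotated with the short
 * name of the lock in the parent journal that protects it. */
struct example_transaction {
	struct example_journal	*t_journal;

	/* When commit was requested [j_state_lock] */
	unsigned long		t_requested;
};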

Cheers,

- Ted


Re: [PATCH][next] ext4: remove redundant assignment of variable ex

2020-12-03 Thread Theodore Y. Ts'o
On Wed, Oct 21, 2020 at 02:23:26PM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> Variable ex is assigned a variable that is not being read, the assignment
> is redundant and can be removed.
> 
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 

Thanks, applied.

- Ted


Re: [PATCH] ext4: remove the null check of bio_vec page

2020-12-03 Thread Theodore Y. Ts'o
On Wed, Oct 21, 2020 at 12:25:03PM +0200, Jan Kara wrote:
> On Tue 20-10-20 16:22:01, Xianting Tian wrote:
> > bv_page can't be NULL in a valid bio_vec, so we can remove the NULL check,
> > as we did in other places when calling bio_for_each_segment_all() to go
> > through all bio_vec of a bio.
> > 
> > Signed-off-by: Xianting Tian 
> 
> Thanks for the patch. It looks good to me. You can add:
> 
> Reviewed-by: Jan Kara 

Applied, thanks.

- Ted


Re: [PATCH v3] Updated locking documentation for transaction_t

2020-12-03 Thread Theodore Y. Ts'o
On Thu, Oct 15, 2020 at 03:26:28PM +0200, Alexander Lochmann wrote:
> Hi folks,
> 
> I've updated the lock documentation according to our finding for
> transaction_t.
> Does this patch look good to you?

I updated the annotations to match with the local usage, e.g:

 * When commit was requested [journal_t.j_state_lock]

became:

 * When commit was requested [j_state_lock]

Otherwise, looks good.  Thanks for the patch!

   - Ted


Re: [PATCH v7 6/8] ext4: support direct I/O with fscrypt using blk-crypto

2020-12-03 Thread Theodore Y. Ts'o
On Tue, Nov 17, 2020 at 02:07:06PM +, Satya Tangirala wrote:
> From: Eric Biggers 
> 
> Wire up ext4 with fscrypt direct I/O support. Direct I/O with fscrypt is
> only supported through blk-crypto (i.e. CONFIG_BLK_INLINE_ENCRYPTION must
> have been enabled, the 'inlinecrypt' mount option must have been specified,
> and either hardware inline encryption support must be present or
> CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK must have been enabled). Further,
> direct I/O on encrypted files is only supported when the *length* of the
> I/O is aligned to the filesystem block size (which is *not* necessarily the
> same as the block device's block size).
> 
> fscrypt_limit_io_blocks() is called before setting up the iomap to ensure
> that the blocks of each bio that iomap will submit will have contiguous
> DUNs. Note that fscrypt_limit_io_blocks() is normally a no-op, as normally
> the DUNs simply increment along with the logical blocks. But it's needed
> to handle an edge case in one of the fscrypt IV generation methods.
> 
> Signed-off-by: Eric Biggers 
> Co-developed-by: Satya Tangirala 
> Signed-off-by: Satya Tangirala 
> Reviewed-by: Jaegeuk Kim 

Acked-by: Theodore Ts'o 



Re: [PATCH v9 2/2] fs: ext4: Modify inode-test.c to use KUnit parameterized testing feature

2020-12-02 Thread Theodore Y. Ts'o
On Mon, Nov 16, 2020 at 11:11:50AM +0530, Arpitha Raghunandan wrote:
> Modify fs/ext4/inode-test.c to use the parameterized testing
> feature of KUnit.
> 
> Signed-off-by: Arpitha Raghunandan <98.a...@gmail.com>
> Signed-off-by: Marco Elver 

Acked-by: Theodore Ts'o 



Re: [PATCH v9 1/2] kunit: Support for Parameterized Testing

2020-12-02 Thread Theodore Y. Ts'o
On Mon, Nov 30, 2020 at 02:22:22PM -0800, 'Brendan Higgins' via KUnit 
Development wrote:
> 
> Looks good to me. I would definitely like to pick this up. But yeah,
> in order to pick up 2/2 we will need an ack from either Ted or Iurii.
> 
> Ted seems to be busy right now, so I think I will just ask Shuah to go
> ahead and pick this patch up by itself and we or Ted can pick up patch
> 2/2 later.

I have been paying attention to this patch series, but I had presumed
that this was much more of a kunit change than an ext4 change, and the
critical bits was a review of the kunit infrastructure.  I certainly
have no objection to changing the ext4 test to use the new
parameterized testing, and if you'd like me to give a quick review,
I'll take a quick look.  I assume, Brendan, that you've already tried
doing a compile and run test of the patch series, so I'm not going to
do that?

- Ted


Re: [PATCH v2 0/3] Fix several bugs in KVM stage 2 translation

2020-12-02 Thread wangyanan (Y)



On 2020/12/2 20:23, Marc Zyngier wrote:

Hi Yanan,

[...]


BTW: there are two more things below that I want to talk about.

1.  Recently, I have been focusing on the ARMv8.4-TTRem feature which
is aimed at changing block size in stage 2 mapping.

I have a plan to implement this feature for stage 2 translation when
splitting a block into tables or merging tables into a block.

This feature supports changing block size without performing
*break-before-make*, which might have some improvement on performance.

What do you think about this?


It would be interesting if you can demonstrate some significant
performance improvements compared to the same workload with BBM.

I'm not completely convinced this would change much, given that
it is only when moving from a table to a block mapping that you
can elide BBM when the support level is 1 or 2. As far as I can
tell, this only happens in the "stop logging" case.

Is that something that happens often enough to justify the added
complexity? Having to handle TLB Conflict Abort is annoying, for
example.


I will give more consideration to the necessity, and maybe run some

performance tests later.


Thanks,


Yanan




2. Given that the issues we discussed before were found in practice
when the guest state changes from dirty logging enabled to dirty logging canceled,

I could add a test file covering this case to selftests/ or kvm-unit-tests/,
if necessary.


That would be awesome, and I'd be very grateful if you did. It is the
second time we break this exact case, and having a reliable way to
verify it would definitely help.

Thanks,

    M.


Re: [PATCH v2 0/3] Fix several bugs in KVM stage 2 translation

2020-12-02 Thread wangyanan (Y)

Hi Will, Marc,
On 2020/12/2 4:59, Will Deacon wrote:

On Wed, Dec 02, 2020 at 04:10:31AM +0800, Yanan Wang wrote:

When installing a new pte entry or updating an old valid entry in stage 2
translation, we use get_page()/put_page() to record page_count of the page-table
pages. PATCH 1/3 aims to fix incorrect use of get_page()/put_page() in stage 2,
which might make page-table pages unable to be freed when unmapping a range.

When dirty logging of a guest with hugepages is finished, we should merge tables
back into a block entry if adjustment of huge mapping is found necessary.
In addition to installing the block entry, we should not only free the non-huge
page-table pages but also invalidate all the TLB entries of non-huge mappings 
for
the block. PATCH 2/3 adds enough TLBI when merging tables into a block entry.

The rewrite of page-table code and fault handling add two different handlers
for "just relaxing permissions" and "map by stage2 page-table walk", that's
good improvement. Yet, in function user_mem_abort(), conditions where we choose
the above two fault handlers are not strictly distinguished. This will cause
guest errors such as an infinite loop (a soft lockup will occur as a result), because
the inappropriate fault handler gets called. So, a solution that can strictly
distinguish the conditions is introduced in PATCH 3/3.

For the series:

Acked-by: Will Deacon 

Thanks for reporting these, helping me to understand the issues and then
spinning a v2 so promptly.

Will
.


Thanks for the help and suggestions.

BTW: there are two more things below that I want to talk about.

1.  Recently, I have been focusing on the ARMv8.4-TTRem feature which is 
aimed at changing block size in stage 2 mapping.


I have a plan to implement this feature for stage 2 translation when 
splitting a block into tables or merging tables into a block.


This feature supports changing block size without performing 
*break-before-make*, which might have some improvement on performance.


What do you think about this?


2. Given that the issues we discussed before were found in practice when the 
guest state changes from dirty logging enabled to dirty logging canceled,


I could add a test file covering this case to selftests/ or kvm-unit-tests/, 
if necessary.



Yanan



Re: [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2

2020-12-01 Thread wangyanan (Y)



On 2020/12/2 2:15, Will Deacon wrote:

On Wed, Dec 02, 2020 at 01:19:35AM +0800, wangyanan (Y) wrote:

On 2020/12/1 22:16, Will Deacon wrote:

On Tue, Dec 01, 2020 at 03:21:23PM +0800, wangyanan (Y) wrote:

On 2020/11/30 21:21, Will Deacon wrote:

On Mon, Nov 30, 2020 at 08:18:45PM +0800, Yanan Wang wrote:

@@ -476,6 +477,7 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, 
u32 level,
/* There's an existing valid leaf entry, so perform break-before-make */
kvm_set_invalid_pte(ptep);
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
+   put_page(virt_to_page(ptep));
kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
out:
data->phys += granule;

Isn't this hunk alone sufficient to solve the problem?


Not sufficient enough. When the old ptep is valid and old pte equals new
pte, in this case, "True" is also returned by kvm_set_valid_leaf_pte()

and get_page() will still be called.

I had a go at fixing this without introducing refcounting to the hyp stage-1
case, and ended up with the diff below. What do you think?

Functionally this diff looks fine to me. A small comment inline, please see
below.

I had made an alternative fix (after sending v1) and it looks much more
concise.

If you're ok with it, I can send it as v2 (together with patch#2 and #3)
after some tests.

Thanks.


diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..b232bdd142a6 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -470,6 +470,9 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64
end, u32 level,
     if (!kvm_block_mapping_supported(addr, end, phys, level))
     return false;

+   if (kvm_pte_valid(*ptep))
+   put_page(virt_to_page(ptep));
+
     if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
     goto out;

This is certainly smaller, but see below.


diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..78e2c0dc47ae 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -170,23 +170,16 @@ static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t 
*childp)
smp_store_release(ptep, pte);
   }
-static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,
-  u32 level)
+static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 level)
   {
-   kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(pa);
+   kvm_pte_t pte = kvm_phys_to_pte(pa);
u64 type = (level == KVM_PGTABLE_MAX_LEVELS - 1) ? KVM_PTE_TYPE_PAGE :
   KVM_PTE_TYPE_BLOCK;
pte |= attr & (KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI);
pte |= FIELD_PREP(KVM_PTE_TYPE, type);
pte |= KVM_PTE_VALID;
-
-   /* Tolerate KVM recreating the exact same mapping. */
-   if (kvm_pte_valid(old))
-   return old == pte;
-
-   smp_store_release(ptep, pte);
-   return true;
+   return pte;
   }
   static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 
addr,
@@ -341,12 +334,17 @@ static int hyp_map_set_prot_attr(enum kvm_pgtable_prot 
prot,
   static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
kvm_pte_t *ptep, struct hyp_map_data *data)
   {
+   kvm_pte_t new, old = *ptep;
u64 granule = kvm_granule_size(level), phys = data->phys;
if (!kvm_block_mapping_supported(addr, end, phys, level))
return false;
-   WARN_ON(!kvm_set_valid_leaf_pte(ptep, phys, data->attr, level));
+   /* Tolerate KVM recreating the exact same mapping. */
+   new = kvm_init_valid_leaf_pte(phys, data->attr, level);
+   if (old != new && !WARN_ON(kvm_pte_valid(old)))
+   smp_store_release(ptep, new);
+
data->phys += granule;
return true;
   }
@@ -465,19 +463,24 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, 
u32 level,
   kvm_pte_t *ptep,
   struct stage2_map_data *data)
   {
+   kvm_pte_t new, old = *ptep;
u64 granule = kvm_granule_size(level), phys = data->phys;
if (!kvm_block_mapping_supported(addr, end, phys, level))
return false;
-   if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
-   goto out;
+   new = kvm_init_valid_leaf_pte(phys, data->attr, level);
+   if (kvm_pte_valid(old)) {
+   /*
+* There's an existing valid leaf entry, so perform
+* break-before-make.
+*/
+   kvm_set_invalid_pte(ptep);
+   kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
+   put_page(virt_to_page(ptep));

When old PTE is v

Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

2020-12-01 Thread wangyanan (Y)

On 2020/12/1 22:35, Marc Zyngier wrote:


Hi Yanan,

On 2020-12-01 14:11, wangyanan (Y) wrote:

On 2020/12/1 21:46, Will Deacon wrote:

On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:


[...]

The point is at b.iii where the TLBI is not enough. There are many page
mappings that we need to merge into a block mapping.

We invalidate the TLB for the input address without level hint at b.iii, but
this operation just flushes the TLB for one page mapping; there
are still some TLB entries for the other page mappings in the cache, and the MMU
hardware walker can still hit these entries next time.

Ah, yes, I see. Thanks. I hadn't considered the case where there are table
entries beneath the anchor. So how about the diff below?

Will

--->8


Hi, I think it's inappropriate to put the TLBI of all the leaf entries
in function stage2_map_walk_table_post(),
because the *ptep must be an upper table entry when we enter
stage2_map_walk_table_post().

We should issue the TLBI for every leaf entry (not the table entry) in the
last lookup level, just as I am proposing
to add the additional TLBI in function stage2_map_walk_leaf().


Could you make your concerns explicit? As far as I can tell, this should
address the bug you found, at least from a correctness perspective.

Are you worried about the impact of the full S2 invalidation? Or do you
see another correctness issue?



Hi Will, Marc,


After rechecking the diff, the full S2 invalidation in 
stage2_map_walk_table_post() should be enough to solve this problem.


But I was wondering if we can add the full S2 invalidation in 
stage2_map_walk_table_pre() instead, where __kvm_tlb_flush_vmid() will be called 
only once.


If we add the full TLBI in stage2_map_walk_table_post(), 
__kvm_tlb_flush_vmid() might be called many times in the loop and 
lots of (unnecessary) CPU instructions will be wasted.


What I'm saying is something like below; please let me know what you 
think.


If this is OK, I can update the diff in v2 and send it with your SOB (is 
it appropriate?) after some tests.



diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b232bdd142a6..f11fb2996080 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -496,7 +496,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 
end, u32 level,

    return 0;

    kvm_set_invalid_pte(ptep);
-   kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
+   kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
    data->anchor = ptep;
    return 0;
 }


Thanks,

Yanan



Re: [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2

2020-12-01 Thread wangyanan (Y)

On 2020/12/1 22:16, Will Deacon wrote:


On Tue, Dec 01, 2020 at 03:21:23PM +0800, wangyanan (Y) wrote:

On 2020/11/30 21:21, Will Deacon wrote:

On Mon, Nov 30, 2020 at 08:18:45PM +0800, Yanan Wang wrote:

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..696b6aa83faf 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -186,6 +186,7 @@ static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, 
kvm_pte_t attr,
return old == pte;
smp_store_release(ptep, pte);
+   get_page(virt_to_page(ptep));

This is also used for the hypervisor stage-1 page-table, so I'd prefer to
leave this function as-is.

I agree at this point.

return true;
   }
@@ -476,6 +477,7 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, 
u32 level,
/* There's an existing valid leaf entry, so perform break-before-make */
kvm_set_invalid_pte(ptep);
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
+   put_page(virt_to_page(ptep));
kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
   out:
data->phys += granule;

Isn't this hunk alone sufficient to solve the problem?

Will
.

Not sufficient enough. When the old ptep is valid and old pte equals new
pte, in this case, "True" is also returned by kvm_set_valid_leaf_pte()

and get_page() will still be called.

I had a go at fixing this without introducing refcounting to the hyp stage-1
case, and ended up with the diff below. What do you think?


Hi Will,

Functionally this diff looks fine to me. A small comment inline, please 
see below.


I had made an alternative fix (after sending v1) and it looks much more 
concise.


If you're ok with it, I can send it as v2 (together with patch#2 and #3) 
after some tests.



Thanks,

Yanan


diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..b232bdd142a6 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -470,6 +470,9 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 
end, u32 level,

    if (!kvm_block_mapping_supported(addr, end, phys, level))
    return false;

+   if (kvm_pte_valid(*ptep))
+   put_page(virt_to_page(ptep));
+
    if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
    goto out;



--->8

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..78e2c0dc47ae 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -170,23 +170,16 @@ static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t 
*childp)
smp_store_release(ptep, pte);
  }
  
-static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,

-  u32 level)
+static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 level)
  {
-   kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(pa);
+   kvm_pte_t pte = kvm_phys_to_pte(pa);
u64 type = (level == KVM_PGTABLE_MAX_LEVELS - 1) ? KVM_PTE_TYPE_PAGE :
   KVM_PTE_TYPE_BLOCK;
  
  	pte |= attr & (KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI);

pte |= FIELD_PREP(KVM_PTE_TYPE, type);
pte |= KVM_PTE_VALID;
-
-   /* Tolerate KVM recreating the exact same mapping. */
-   if (kvm_pte_valid(old))
-   return old == pte;
-
-   smp_store_release(ptep, pte);
-   return true;
+   return pte;
  }
  
  static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,

@@ -341,12 +334,17 @@ static int hyp_map_set_prot_attr(enum kvm_pgtable_prot 
prot,
  static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
kvm_pte_t *ptep, struct hyp_map_data *data)
  {
+   kvm_pte_t new, old = *ptep;
u64 granule = kvm_granule_size(level), phys = data->phys;
  
  	if (!kvm_block_mapping_supported(addr, end, phys, level))

return false;
  
-	WARN_ON(!kvm_set_valid_leaf_pte(ptep, phys, data->attr, level));

+   /* Tolerate KVM recreating the exact same mapping. */
+   new = kvm_init_valid_leaf_pte(phys, data->attr, level);
+   if (old != new && !WARN_ON(kvm_pte_valid(old)))
+   smp_store_release(ptep, new);
+
data->phys += granule;
return true;
  }
@@ -465,19 +463,24 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, 
u32 level,
   kvm_pte_t *ptep,
   struct stage2_map_data *data)
  {
+   kvm_pte_t new, old = *ptep;
u64 granule = kvm_granule_size(level), phys = data->phys;
  
  	if (!kvm_block_mapping_supported(addr, end, phys, level))

return false;
  
-	if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))

-   goto out;
+ 

Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

2020-12-01 Thread wangyanan (Y)



On 2020/12/1 21:46, Will Deacon wrote:

On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:

On 2020/12/1 0:01, Will Deacon wrote:

On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:

On 2020/11/30 21:34, Will Deacon wrote:

On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 696b6aa83faf..fec8dc9f2baa 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 
level,
return 0;
}
+static void stage2_flush_dcache(void *addr, u64 size);
+static bool stage2_pte_cacheable(kvm_pte_t pte);
+
static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t 
*ptep,
struct stage2_map_data *data)
{
@@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 
level, kvm_pte_t *ptep,
struct page *page = virt_to_page(ptep);
if (data->anchor) {
-   if (kvm_pte_valid(pte))
+   if (kvm_pte_valid(pte)) {
+   kvm_set_invalid_pte(ptep);
+   kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
+addr, level);
put_page(page);

This doesn't make sense to me: the page-table pages we're walking when the
anchor is set are not accessible to the hardware walker because we unhooked
the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
TLB invalidation.

Are you seeing a problem in practice here?

Yes, I indeed find a problem in practice.

When the migration was cancelled, a TLB conflict abort was found in the guest.

This problem is fixed before rework of the page table code, you can have a
look in the following two links:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277

https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html

Ok, let's go through this, because I still don't see the bug. Please correct
me if you spot any mistakes:

1. We have a block mapping for X => Y
2. Dirty logging is enabled, so the block mapping is write-protected and
   ends up being split into page mappings
3. Dirty logging is disabled due to a failed migration.

--- At this point, I think we agree that the state of the MMU is alright ---

4. We take a stage-2 fault and want to reinstall the block mapping:

   a. kvm_pgtable_stage2_map() is invoked to install the block mapping
   b. stage2_map_walk_table_pre() finds a table where we would like to
  install the block:

i.   The anchor is set to point at this entry
ii.  The entry is made invalid
iii. We invalidate the TLB for the input address. This is
 TLBI IPAS2SE1IS without level hint and then TLBI VMALLE1IS.

*** At this point, the page-table pointed to by the old table entry
is not reachable to the hardware walker ***

   c. stage2_map_walk_leaf() is called for each leaf entry in the
  now-unreachable subtree, dropping page-references for each valid
entry it finds.
   d. stage2_map_walk_table_post() is eventually called for the entry
  which we cleared back in b.ii, so we install the new block mapping.

You are proposing to add additional TLB invalidation to (c), but I don't
think that is necessary, thanks to the invalidation already performed in
b.iii. What am I missing here?

The point is at b.iii where the TLBI is not enough. There are many page
mappings that we need to merge into a block mapping.

We invalidate the TLB for the input address without level hint at b.iii, but
this operation just flushes the TLB for one page mapping; there
are still some TLB entries for the other page mappings in the cache, and the MMU
hardware walker can still hit these entries next time.

Ah, yes, I see. Thanks. I hadn't considered the case where there are table
entries beneath the anchor. So how about the diff below?

Will

--->8


Hi, I think it's inappropriate to put the TLBI of all the leaf entries 
in function stage2_map_walk_table_post(),


because the *ptep must be an upper table entry when we enter 
stage2_map_walk_table_post().


We should issue the TLBI for every leaf entry (not the table entry) in the last 
lookup level, just as I am proposing 
to add the additional TLBI in function stage2_map_walk_leaf().

Thanks.


Yanan



diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..12526d8c7ae4 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 
level,
return 0;
  
  	kvm_set_invalid_pte(ptep);

-   kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
+   /* TLB invalidation is deferred until th

Re: [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2

2020-11-30 Thread wangyanan (Y)

Hi Will,

On 2020/11/30 21:21, Will Deacon wrote:

On Mon, Nov 30, 2020 at 08:18:45PM +0800, Yanan Wang wrote:

When installing a new leaf pte onto an invalid ptep, we need to get_page(ptep).
When just updating a valid leaf ptep, we shouldn't get_page(ptep).
Incorrect page_count of translation tables might lead to memory leak,
when unmapping a stage 2 memory range.

Did you find this by inspection, or did you hit this in practice? I'd be
interested to see the backtrace for mapping over an existing mapping.


Actually this is found by inspection.

In the current code, get_page() will be uniformly called at "out_get_page" 
in function stage2_map_walk_leaf(), 
no matter whether the old ptep is valid or not.

When using the stage2_unmap_walker() API to unmap a memory range, some 
page-table pages might not be 
freed if the page_count of the pages is not right.


Signed-off-by: Yanan Wang 
---
  arch/arm64/kvm/hyp/pgtable.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..696b6aa83faf 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -186,6 +186,7 @@ static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, 
kvm_pte_t attr,
return old == pte;
  
  	smp_store_release(ptep, pte);

+   get_page(virt_to_page(ptep));

This is also used for the hypervisor stage-1 page-table, so I'd prefer to
leave this function as-is.

I agree at this point.

return true;
  }
  
@@ -476,6 +477,7 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,

/* There's an existing valid leaf entry, so perform break-before-make */
kvm_set_invalid_pte(ptep);
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
+   put_page(virt_to_page(ptep));
kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
  out:
data->phys += granule;

Isn't this hunk alone sufficient to solve the problem?

Will
.


Not sufficient enough. When the old ptep is valid and old pte equals new 
pte, in this case, "True" is also returned by kvm_set_valid_leaf_pte()


and get_page() will still be called.


Yanan



Re: [RFC PATCH 3/3] KVM: arm64: Add usage of stage 2 fault lookup level in user_mem_abort()

2020-11-30 Thread wangyanan (Y)

Hi Will,

On 2020/11/30 21:49, Will Deacon wrote:

On Mon, Nov 30, 2020 at 08:18:47PM +0800, Yanan Wang wrote:

If we get a FSC_PERM fault, just using (logging_active && writable) to determine
calling kvm_pgtable_stage2_map(). There will be two more cases we should 
consider.

> (1) After logging_active is configured back to false from true. When we get a
FSC_PERM fault with write_fault and adjustment of hugepage is needed, we should
merge tables back to a block entry. This case is ignored by still calling
kvm_pgtable_stage2_relax_perms(), which will lead to an endless loop and guest
panic due to soft lockup.

(2) We use (FSC_PERM && logging_active && writable) to determine collapsing
a block entry into a table by calling kvm_pgtable_stage2_map(). But sometimes
we may only need to relax permissions when trying to write to a page other than
a block. In this condition, using kvm_pgtable_stage2_relax_perms() will be fine.

> The ISS field bits[1:0] in the ESR_EL2 register indicate the stage2 lookup level
> at which a D-abort or I-abort occurred. By comparing the granule of the fault lookup
level with vma_pagesize, we can strictly distinguish conditions of calling
kvm_pgtable_stage2_relax_perms() or kvm_pgtable_stage2_map(), and the above
two cases will be well considered.

Suggested-by: Keqian Zhu 
Signed-off-by: Yanan Wang 
---
  arch/arm64/include/asm/esr.h |  1 +
  arch/arm64/include/asm/kvm_emulate.h |  5 +
  arch/arm64/kvm/mmu.c | 11 +--
  3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index 22c81f1edda2..85a3e49f92f4 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -104,6 +104,7 @@
  /* Shared ISS fault status code(IFSC/DFSC) for Data/Instruction aborts */
  #define ESR_ELx_FSC   (0x3F)
  #define ESR_ELx_FSC_TYPE  (0x3C)
+#define ESR_ELx_FSC_LEVEL  (0x03)
  #define ESR_ELx_FSC_EXTABT(0x10)
  #define ESR_ELx_FSC_SERROR(0x11)
  #define ESR_ELx_FSC_ACCESS(0x08)
diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index 5ef2669ccd6c..2e0e8edf6306 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -350,6 +350,11 @@ static __always_inline u8 
kvm_vcpu_trap_get_fault_type(const struct kvm_vcpu *vc
return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_TYPE;
  }
  
+static __always_inline u8 kvm_vcpu_trap_get_fault_level(const struct kvm_vcpu *vcpu)

+{
+   return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_LEVEL;
+}
+
  static __always_inline bool kvm_vcpu_abt_issea(const struct kvm_vcpu *vcpu)
  {
switch (kvm_vcpu_trap_get_fault(vcpu)) {
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1a01da9fdc99..75814a02d189 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -754,10 +754,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
gfn_t gfn;
kvm_pfn_t pfn;
bool logging_active = memslot_is_logging(memslot);
-   unsigned long vma_pagesize;
+   unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
+   unsigned long vma_pagesize, fault_granule;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
struct kvm_pgtable *pgt;
  
+	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);

I like the idea, but is this macro reliable for stage-2 page-tables, given
that we could have a concatenated pgd?

Will
.


Yes, it's fine even when we have a concatenated pgd table.

No matter whether a concatenated pgd is used or not, the initial lookup 
level (start_level) is set in the VTCR_EL2 register.


The MMU hardware walker will know the start_level according to 
the information in VTCR_EL2.


This idea runs well in practice on a host where ia_bits is 40, PAGE_SIZE 
is 4K, and a concatenated pgd is made for guest stage 2.


According to the kernel info printed, the start_level is 1, and stage 2 
translation runs as expected.



Yanan
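
As a side note, a minimal userspace sketch (assuming the generic arm64
definition of ARM64_HW_PGTABLE_LEVEL_SHIFT() from pgtable-hwdef.h and 4K
pages) of how fault_granule falls out of the reported fault level:

#include <stdio.h>

#define PAGE_SHIFT	12	/* 4K pages assumed */
#define ARM64_HW_PGTABLE_LEVEL_SHIFT(n)	((PAGE_SHIFT - 3) * (4 - (n)) + 3)

int main(void)
{
	/* level 0 -> 512G, 1 -> 1G, 2 -> 2M, 3 -> 4K */
	for (int level = 0; level <= 3; level++) {
		unsigned long long granule =
			1ULL << ARM64_HW_PGTABLE_LEVEL_SHIFT(level);

		printf("fault level %d: granule %#llx bytes\n", level, granule);
	}
	return 0;
}

The fault_granule computed this way is what the patch compares with
vma_pagesize to decide between kvm_pgtable_stage2_relax_perms() and
kvm_pgtable_stage2_map().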



Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

2020-11-30 Thread wangyanan (Y)

Hi Will,

On 2020/12/1 0:01, Will Deacon wrote:

Hi,

Cheers for the quick reply. See below for more questions...

On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:

On 2020/11/30 21:34, Will Deacon wrote:

On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 696b6aa83faf..fec8dc9f2baa 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 
level,
return 0;
   }
+static void stage2_flush_dcache(void *addr, u64 size);
+static bool stage2_pte_cacheable(kvm_pte_t pte);
+
   static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t 
*ptep,
struct stage2_map_data *data)
   {
@@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 
level, kvm_pte_t *ptep,
struct page *page = virt_to_page(ptep);
if (data->anchor) {
-   if (kvm_pte_valid(pte))
+   if (kvm_pte_valid(pte)) {
+   kvm_set_invalid_pte(ptep);
+   kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
+addr, level);
put_page(page);

This doesn't make sense to me: the page-table pages we're walking when the
anchor is set are not accessible to the hardware walker because we unhooked
the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
TLB invalidation.

Are you seeing a problem in practice here?

Yes, I indeed find a problem in practice.

When the migration was cancelled, a TLB conflict abort was found in the guest.

This problem is fixed before rework of the page table code, you can have a
look in the following two links:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277

https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html

Ok, let's go through this, because I still don't see the bug. Please correct
me if you spot any mistakes:

   1. We have a block mapping for X => Y
   2. Dirty logging is enabled, so the block mapping is write-protected and
  ends up being split into page mappings
   3. Dirty logging is disabled due to a failed migration.

--- At this point, I think we agree that the state of the MMU is alright ---

   4. We take a stage-2 fault and want to reinstall the block mapping:

  a. kvm_pgtable_stage2_map() is invoked to install the block mapping
  b. stage2_map_walk_table_pre() finds a table where we would like to
 install the block:

i.   The anchor is set to point at this entry
ii.  The entry is made invalid
iii. We invalidate the TLB for the input address. This is
 TLBI IPAS2SE1IS without level hint and then TLBI VMALLE1IS.

*** At this point, the page-table pointed to by the old table entry
is not reachable to the hardware walker ***

  c. stage2_map_walk_leaf() is called for each leaf entry in the
 now-unreachable subtree, dropping page-references for each valid
entry it finds.
  d. stage2_map_walk_table_post() is eventually called for the entry
 which we cleared back in b.ii, so we install the new block mapping.

You are proposing to add additional TLB invalidation to (c), but I don't
think that is necessary, thanks to the invalidation already performed in
b.iii. What am I missing here?


The point is at b.iii where the TLBI is not enough. There are many page 
mappings that we need to merge into a block mapping.

We invalidate the TLB for the input address without level hint at b.iii, 
but this operation just flushes the TLB for one page mapping; there 
are still some TLB entries for the other page mappings in the cache, and the 
MMU hardware walker can still hit these entries next time.

Maybe we can imagine a concrete example here. Suppose we now need to merge 
page mappings into a 1G block mapping, and the 
fault_ipa to user_mem_abort() is 0x225043000. After alignment to 1G, the 
input address will be 0x200000000, so the TLBI 
operation at b.iii will invalidate the TLB entry for address 
0x200000000. But what about addresses 0x200001000, 0x200002000 ... ?

After the installation of the 1G block mapping is finished, when the fault_ipa 
is 0x200007000, the MMU can still hit the stale TLB entry for the page 
mapping in the cache and can also access memory through the new block entry.

So adding a TLBI operation in stage2_map_walk_leaf() aims to invalidate the 
TLB entries for all the page mappings that will be merged.

In this way, after the block mapping is installed, the MMU can only access 
memory through the new block entry.
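
(To put a number on the example above, a small sketch of the arithmetic,
with the usual 4K page / 1G block sizes assumed:)

#include <stdio.h>

int main(void)
{
	unsigned long long fault_ipa = 0x225043000ULL;
	unsigned long long block = 1ULL << 30;	/* 1G block */
	unsigned long long page = 1ULL << 12;	/* 4K pages */
	unsigned long long base = fault_ipa & ~(block - 1);

	/* the single TLBI at b.iii only covers the block base address */
	printf("block base: %#llx\n", base);	/* 0x200000000 */

	/* every other page-level entry in the block may still be cached */
	printf("page mappings possibly left in the TLB: %llu\n",
	       block / page - 1);
	return 0;
}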



+   if (stage2_pte_cacheable(pte))
+   stage2_flush_dcache(kvm_pte_follow(pte),
+   kvm_granule_size(level));

I don't u

Re: drivers/char/random.c needs a (new) maintainer

2020-11-30 Thread Theodore Y. Ts'o
On Mon, Nov 30, 2020 at 04:15:23PM +0100, Jason A. Donenfeld wrote:
> I am willing to maintain random.c and have intentions to have a
> formally verified RNG. I've mentioned this to Ted before.
> 
> But I think Ted's reluctance to not accept the recent patches sent to
> this list is mostly justified, and I have no desire to see us rush
> into replacing random.c with something suboptimal or FIPSy.

Being a maintainer is not about *accepting* patches, it's about
*reviewing* them.  I do plan to make time to catch up on reviewing
patches this cycle.  One thing that would help me is if folks
(especially Jason, if you would) could start with a detailed review of
Nicolai's patches.  His incremental approach is I believe the best one
from a review perspective, and certainly his cleanup patches are ones
which I would expect are no-brainers.

- Ted


Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

2020-11-30 Thread wangyanan (Y)



On 2020/11/30 21:34, Will Deacon wrote:

On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:

In dirty logging case(logging_active == True), we need to collapse a block
entry into a table if necessary. After dirty logging is canceled, when merging
tables back into a block entry, we should not only free the non-huge page
tables but also unmap the non-huge mapping for the block. Without the unmap,
inconsistent TLB entries for the pages in the block will be created.

We could also use unmap_stage2_range API to unmap the non-huge mapping,
but this could potentially free the upper level page-table page which
will be useful later.

Signed-off-by: Yanan Wang 
---
  arch/arm64/kvm/hyp/pgtable.c | 15 +--
  1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 696b6aa83faf..fec8dc9f2baa 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 
level,
return 0;
  }
  
+static void stage2_flush_dcache(void *addr, u64 size);

+static bool stage2_pte_cacheable(kvm_pte_t pte);
+
  static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
struct stage2_map_data *data)
  {
@@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 
level, kvm_pte_t *ptep,
struct page *page = virt_to_page(ptep);
  
  	if (data->anchor) {

-   if (kvm_pte_valid(pte))
+   if (kvm_pte_valid(pte)) {
+   kvm_set_invalid_pte(ptep);
+   kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
+addr, level);
put_page(page);

This doesn't make sense to me: the page-table pages we're walking when the
anchor is set are not accessible to the hardware walker because we unhooked
the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
TLB invalidation.

Are you seeing a problem in practice here?


Yes, I indeed find a problem in practice.

When the migration was cancelled, a TLB conflict abort was found in the guest.

This problem is fixed before rework of the page table code, you can have 
a look in the following two links:


https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277

https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html


+   if (stage2_pte_cacheable(pte))
+   stage2_flush_dcache(kvm_pte_follow(pte),
+   kvm_granule_size(level));

I don't understand the need for the flush either, as we're just coalescing
existing entries into a larger block mapping.


In my opinion, after unmapping, it is necessary to ensure cache 
coherency, because theoretically it is unknown whether the memory attribute 
of the future mapping will be changed or not (cacheable -> non-cacheable).



Will
.


[GIT PULL] ext4 bug fixes for 5.10-rc

2020-11-22 Thread Theodore Y. Ts'o
The following changes since commit 09162bc32c880a791c6c0668ce0745cf7958f576:

  Linux 5.10-rc4 (2020-11-15 16:44:31 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus_fixes2

for you to fetch changes up to f902b216501094495ff75834035656e8119c537f:

  ext4: fix bogus warning in ext4_update_dx_flag() (2020-11-19 22:41:10 -0500)


A final set of miscellaneous bug fixes for ext4


Jan Kara (1):
  ext4: fix bogus warning in ext4_update_dx_flag()

Mauro Carvalho Chehab (1):
  jbd2: fix kernel-doc markups

Theodore Ts'o (1):
  ext4: drop fast_commit from /proc/mounts

 fs/ext4/ext4.h|  3 ++-
 fs/ext4/super.c   |  4 
 fs/jbd2/journal.c | 34 ++
 fs/jbd2/transaction.c | 31 ---
 include/linux/jbd2.h  |  2 +-
 5 files changed, 37 insertions(+), 37 deletions(-)


Re: [PATCH v4 12/27] jbd2: fix kernel-doc markups

2020-11-19 Thread Theodore Y. Ts'o
On Mon, Nov 16, 2020 at 11:18:08AM +0100, Mauro Carvalho Chehab wrote:
> Kernel-doc markup should use this format:
> identifier - description
> 
> They should not have any type before that, as otherwise
> the parser won't do the right thing.
> 
> Also, some identifiers have different names between their
> prototypes and the kernel-doc markup.
> 
> Reviewed-by: Jan Kara 
> Signed-off-by: Mauro Carvalho Chehab 

Applied to the ext4 tree, thanks!

- Ted


Re: [RESEND][PATCH] ima: Set and clear FMODE_CAN_READ in ima_calc_file_hash()

2020-11-17 Thread Theodore Y. Ts'o
On Tue, Nov 17, 2020 at 10:23:58AM -0800, Linus Torvalds wrote:
> On Mon, Nov 16, 2020 at 10:35 AM Mimi Zohar  wrote:
> >
> > We need to differentiate between signed files, which by definition are
> > immutable, and those that are mutable.  Appending to a mutable file,
> > for example, would result in the file hash not being updated.
> > Subsequent reads would fail.
> 
> Why would that require any reading of the file at all AT WRITE TIME?
> 
> Don't do it. Really.
> 
> When opening the file write-only, you just invalidate the hash. It
> doesn't matter anyway - you're only writing.
> 
> Later on, when reading, only at that point does the hash matter, and
> then you can do the verification.
> 
> Although honestly, I don't even see the point. You know the hash won't
> match, if you wrote to the file.

I think the use case the IMA folks might be thinking about is where
they want to validate the file at open time, *before* the userspace
application starts writing to the file, since there might be some
subtle attacks where Boris changes the first part of the file before
Alice appends "I agree" to said file.

Of course, Boris will be able to modify the file after Alice has
modified it, so it's a bit of a moot point, but one could imagine a
scenario where the file is modified while the system is not running
(via an evil hotel maid) and then after Alice modifies the file, of
*course* the hash will be invalid, so no one would notice.  A sane
application would have read the file to make sure it contained the
proper contents before appending "I agree" to said file, so it's a bit
of an esoteric point.

The other case I could imagine is if the file is marked execute-only,
without read access, and IMA wanted to be able to read the file to
check the hash.  But we already make an exception for allowing the
file to be read during page faults, so that's probably less
controversial.

- Ted



Re: [PATCH v2 1/3] libfs: Add generic function for setting dentry_ops

2020-11-17 Thread Theodore Y. Ts'o
On Tue, Nov 17, 2020 at 04:03:13AM +, Daniel Rosenberg wrote:
> This adds a function to set dentry operations at lookup time that will
> work for both encrypted filenames and casefolded filenames.
> 
> A filesystem that supports both features simultaneously can use this
> function during lookup preparations to set up its dentry operations once
> fscrypt no longer does that itself.
> 
> Currently the casefolding dentry operations are always set if the
> filesystem defines an encoding, because the feature is toggleable on
> empty directories. Since we don't know what set of functions we'll
> eventually need, and cannot change them later, we just add them.
> 
> Signed-off-by: Daniel Rosenberg 

Reviewed-by: Theodore Ts'o 

- Ted


Re: [PATCH v2 2/3] fscrypt: Have filesystems handle their d_ops

2020-11-17 Thread Theodore Y. Ts'o
On Tue, Nov 17, 2020 at 09:04:11AM -0800, Jaegeuk Kim wrote:
> 
> I'd like to pick this patch series in f2fs/dev for -next, so please let me 
> know
> if you have any concern.

No concern for me as far as ext4 is concerned, thanks!

 - Ted


Re: [PATCH v7 0/8] add support for direct I/O with fscrypt using blk-crypto

2020-11-17 Thread Theodore Y. Ts'o
What is the expected use case for Direct I/O using fscrypt?  This
isn't a problem which is unique to fscrypt, but one of the really
unfortunate aspects of the DIO interface is the silent fallback to
buffered I/O.  We've lived with this because DIO goes back decades,
and the original use case was to keep enterprise databases happy, and
the rules around what is necessary for DIO to work was relatively well
understood.

But with fscrypt, there's going to be some additional requirements
(e.g., using inline crypto) required or else DIO silently fall back to
buffered I/O for encrypted files.  Depending on the intended use case
of DIO with fscrypt, this caveat might or might not be unfortunately
surprising for applications.

I wonder if we should have some kind of interface so we can more
explicitly allow applications to query exactly what the requirements
might be for a particular file vis-a-vis Direct I/O.  What are the
memory alignment requirements, what are the file offset alignment
requirements, what are the write size requirements, for a particular
file.

- Ted
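
To make the caveat concrete, here is a minimal userspace sketch (not an
existing kernel interface -- the block size is just guessed from fstatfs(),
and the real requirements for an encrypted file may be stricter) of the
alignment gymnastics an application currently has to do before an O_DIRECT
write will not silently fall back to buffered I/O:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/vfs.h>
#include <unistd.h>

int dio_write_aligned(const char *path, const char *data, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
	if (fd < 0)
		return -1;

	struct statfs st;
	if (fstatfs(fd, &st) < 0) {
		close(fd);
		return -1;
	}

	size_t bsize = st.f_bsize;			/* fs block size guess */
	size_t padded = (len + bsize - 1) / bsize * bsize;

	void *buf;
	if (posix_memalign(&buf, bsize, padded)) {	/* memory alignment */
		close(fd);
		return -1;
	}
	memset(buf, 0, padded);				/* pad the tail; a real */
	memcpy(buf, data, len);				/* app would handle it  */

	ssize_t ret = write(fd, buf, padded);		/* length alignment */
	free(buf);
	close(fd);
	return ret < 0 ? -1 : 0;
}

An interface that reported the required memory, offset and length alignment
per file would let applications get this right instead of guessing.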


Re: [PATCH v2 2/3] fscrypt: Have filesystems handle their d_ops

2020-11-17 Thread Theodore Y. Ts'o
On Tue, Nov 17, 2020 at 04:03:14AM +, Daniel Rosenberg wrote:
> This shifts the responsibility of setting up dentry operations from
> fscrypt to the individual filesystems, allowing them to have their own
> operations while still setting fscrypt's d_revalidate as appropriate.
> 
> Most filesystems can just use generic_set_encrypted_ci_d_ops, unless
> they have their own specific dentry operations as well. That operation
> will set the minimal d_ops required under the circumstances.
> 
> Since the fscrypt d_ops are set later on, we must set all d_ops there,
> since we cannot adjust those later on. This should not result in any
> change in behavior.
> 
> Signed-off-by: Daniel Rosenberg 

Acked-by: Theodore Ts'o 



[GIT PULL] more ext4 fixes for v5.10-rc4

2020-11-12 Thread Theodore Y. Ts'o
The following changes since commit 52d1998d09af92d44ffce7454637dd3fd1afdc7d:

  Merge tag 'fscrypt-for-linus' of 
git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt (2020-11-10 10:05:37 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus_bugfixes

for you to fetch changes up to d196e229a80c39254f4adbc312f55f5198e98941:

  Revert "ext4: fix superblock checksum calculation race" (2020-11-11 14:24:18 
-0500)


Two ext4 bug fixes, one via a revert of a commit sent during the merge window.


Harshad Shirwadkar (1):
  ext4: handle dax mount option collision

Theodore Ts'o (1):
  Revert "ext4: fix superblock checksum calculation race"

 fs/ext4/ext4.h  |  6 +++---
 fs/ext4/super.c | 11 ---
 2 files changed, 3 insertions(+), 14 deletions(-)


Re: How to enable auto-suspend by default

2020-11-10 Thread Theodore Y. Ts'o
One note...  I'll double check, but on my XPS 13 9380, as I recall, I
have to manually disable autosuspend on all of the XHCI controllers
and internal hubs after running "powertop --auto-tune", or else any
external mouse attached to said USB device will be dead to the world
for 2-3 seconds if the autosuspend timeout has kicked in, which was
***super*** annoying.

- Ted
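
For reference, a small sketch (assuming the standard sysfs runtime-PM knob,
/sys/bus/usb/devices/*/power/control) of the manual step described above --
switching USB devices back to "on", i.e. never autosuspend:

#include <glob.h>
#include <stdio.h>

int main(void)
{
	glob_t g;

	if (glob("/sys/bus/usb/devices/*/power/control", 0, NULL, &g))
		return 0;			/* no matching sysfs entries */

	for (size_t i = 0; i < g.gl_pathc; i++) {
		FILE *f = fopen(g.gl_pathv[i], "w");

		if (!f)
			continue;		/* usually needs root */
		fputs("on", f);			/* "on" = no autosuspend */
		fclose(f);
	}
	globfree(&g);
	return 0;
}

Whether to do this for every device or only the XHCI controllers and
internal hubs (as above) depends on which peripherals misbehave.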


Re: [PATCH 0/2] Tristate moount option comatibility fixup

2020-11-10 Thread Theodore Y. Ts'o
On Mon, Nov 09, 2020 at 08:10:07PM +0100, Michal Suchanek wrote:
> Hello,
> 
> after the tristate dax option change some applications fail to detect
> pmem devices because the dax option no longer shows in mtab when device
> is mounted with -o dax.

Which applications?  Name them.

We *really* don't want to encourage applications to make decisions
only based on the mount options.  For example, it could be that the
application's files will have the S_DAX flag set.

It would be a real shame if we are actively encourage applications to
use a broken configuration mechanism which was only used as a hack
while DAX was in experimental status.

- Ted
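
For instance, rather than parsing mtab, an application can ask about the
file itself -- a minimal sketch assuming the statx() STATX_ATTR_DAX
attribute that recent kernels expose for per-file DAX state:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

#ifndef STATX_ATTR_DAX
#define STATX_ATTR_DAX	0x00200000	/* value from linux/stat.h */
#endif

static int file_is_dax(const char *path)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx))
		return -1;

	/* a robust caller would also check stx.stx_attributes_mask */
	return !!(stx.stx_attributes & STATX_ATTR_DAX);
}

int main(int argc, char **argv)
{
	for (int i = 1; i < argc; i++)
		printf("%s: %s\n", argv[i],
		       file_is_dax(argv[i]) > 0 ? "DAX" : "not DAX");
	return 0;
}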


[GIT PULL] ext4 cleanups for 5.10-rc4

2020-11-09 Thread Theodore Y. Ts'o
(Resent with missing cc's, sorry.)

The following changes since commit 3cea11cd5e3b00d91caf0b4730194039b45c5891:

  Linux 5.10-rc2 (2020-11-01 14:43:51 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus_cleanups

for you to fetch changes up to 05d5233df85e9621597c5838e95235107eb624a2:

  jbd2: fix up sparse warnings in checkpoint code (2020-11-07 00:09:08 -0500)


More fixes and cleanups for the new fast_commit features, but also a
few other miscellaneous bug fixes and a cleanup for the MAINTAINERS
file.


Chao Yu (1):
  MAINTAINERS: add missing file in ext4 entry

Dan Carpenter (1):
  ext4: silence an uninitialized variable warning

Harshad Shirwadkar (22):
  ext4: describe fast_commit feature flags
  ext4: mark fc ineligible if inode gets evictied due to mem pressure
  ext4: drop redundant calls ext4_fc_track_range
  ext4: fixup ext4_fc_track_* functions' signature
  jbd2: rename j_maxlen to j_total_len and add jbd2_journal_max_txn_bufs
  ext4: clean up the JBD2 API that initializes fast commits
  jbd2: drop jbd2_fc_init documentation
  jbd2: don't use state lock during commit path
  jbd2: don't pass tid to jbd2_fc_end_commit_fallback()
  jbd2: add todo for a fast commit performance optimization
  jbd2: don't touch buffer state until it is filled
  jbd2: don't read journal->j_commit_sequence without taking a lock
  ext4: dedpulicate the code to wait on inode that's being committed
  ext4: fix code documentatioon
  ext4: mark buf dirty before submitting fast commit buffer
  ext4: remove unnecessary fast commit calls from ext4_file_mmap
  ext4: fix inode dirty check in case of fast commits
  ext4: disable fast commit with data journalling
  ext4: issue fsdev cache flush before starting fast commit
  ext4: make s_mount_flags modifications atomic
  jbd2: don't start fast commit on aborted journal
  ext4: cleanup fast commit mount options

Joseph Qi (1):
  ext4: unlock xattr_sem properly in ext4_inline_data_truncate()

Kaixu Xia (1):
  ext4: correctly report "not supported" for {usr,grp}jquota when 
!CONFIG_QUOTA

Theodore Ts'o (2):
  ext4: fix sparse warnings in fast_commit code
  jbd2: fix up sparse warnings in checkpoint code

 Documentation/filesystems/ext4/journal.rst |   6 ++
 Documentation/filesystems/ext4/super.rst   |   7 +++
 Documentation/filesystems/journalling.rst  |   6 +-
 MAINTAINERS|   1 +
 fs/ext4/ext4.h |  66 ++--
 fs/ext4/extents.c  |   7 +--
 fs/ext4/fast_commit.c  | 174 
+++--
 fs/ext4/fast_commit.h  |   6 +-
 fs/ext4/file.c |   6 +-
 fs/ext4/fsmap.c|   2 +-
 fs/ext4/fsync.c|   2 +-
 fs/ext4/inline.c   |   1 +
 fs/ext4/inode.c|  19 +++---
 fs/ext4/mballoc.c  |   6 +-
 fs/ext4/namei.c|  61 +--
 fs/ext4/super.c|  47 ---
 fs/jbd2/checkpoint.c   |   2 +
 fs/jbd2/commit.c   |  11 +++-
 fs/jbd2/journal.c  | 138 
+++---
 fs/jbd2/recovery.c |   6 +-
 fs/jbd2/transaction.c  |   4 +-
 fs/ocfs2/journal.c |   2 +-
 include/linux/jbd2.h   |  23 ---
 include/trace/events/ext4.h|  10 +--
 24 files changed, 342 insertions(+), 271 deletions(-)


Re: [PATCH] MAINTAINERS: add missing file in ext4 entry

2020-11-06 Thread Theodore Y. Ts'o
On Fri, Oct 30, 2020 at 10:24:35AM +0800, Chao Yu wrote:
> include/trace/events/ext4.h belongs to ext4 module, add the file path into
> ext4 entry in MAINTAINERS.
> 
> Signed-off-by: Chao Yu 

Thanks, applied.

- Ted


Re: [PATCH] ext4: Use generic casefolding support

2020-10-29 Thread Theodore Y. Ts'o
On Wed, Oct 28, 2020 at 05:08:20AM +, Daniel Rosenberg wrote:
> This switches ext4 over to the generic support provided in libfs.
> 
> Since casefolded dentries behave the same in ext4 and f2fs, we decrease
> the maintenance burden by unifying them, and any optimizations will
> immediately apply to both.
> 
> Signed-off-by: Daniel Rosenberg 
> Reviewed-by: Eric Biggers 

Applied, thanks.

- Ted

