Re: [RFC] Question about TLB flush while set Stage-2 huge pages

2019-03-14 Thread Zenghui Yu

Hi Suzuki,

On 2019/3/14 18:55, Suzuki K Poulose wrote:

Hi Zheng,

On Wed, Mar 13, 2019 at 05:45:31PM +0800, Zheng Xiang wrote:



On 2019/3/13 2:18, Marc Zyngier wrote:

Hi Zheng,

On 12/03/2019 15:30, Zheng Xiang wrote:

Hi Marc,

On 2019/3/12 19:32, Marc Zyngier wrote:

Hi Zheng,

On 11/03/2019 16:31, Zheng Xiang wrote:

Hi all,

When a page is merged into a transparent huge page, KVM invalidates
Stage-2 for the base address of the huge page and the whole of Stage-1.
However, this only invalidates the first page within the huge page; the
other pages are not invalidated, see below:

 +---+---+---+---+---+--------------+
 | a | b | c | d | e |     ...      |  <- one 2MB page
 +---+---+---+---+---+--------------+

 TLB before setting new pmd:
 +---+--+
 |  VA   |PAGESIZE  |
 +---+--+
 |  a|  4KB |
 +---+--+
 |  b|  4KB |
 +---+--+
 |  c|  4KB |
 +---+--+
 |  d|  4KB |
 +---+--+

 TLB after setting new pmd:
 +---+--+
 |  VA   |PAGESIZE  |
 +---+--+
 |  a|  2MB |
 +---+--+
 |  b|  4KB |
 +---+--+
 |  c|  4KB |
 +---+--+
 |  d|  4KB |
 +---+--+

When the VM accesses address *b*, it will hit the TLB and may result in
TLB conflict aborts or other potential exceptions.


That's really bad. I can only imagine two scenarios:

1) We fail to unmap a,b,c,d (and potentially another 508 PTEs), losing
the PTE table in the process, and place the PMD instead. I can't see
this happening.

2) We fail to invalidate on unmap, and that's slightly less bad (but
still quite bad).

Which of the two cases are you seeing?


For example, we need to keep track of the VM's dirty memory pages during
live migration. KVM sets the memslot READONLY and splits the huge pages.
After the live migration is canceled and aborted, the pages are merged
back into THPs. Later accesses to these READONLY pages cause level-3
Permission Faults until the entries are invalidated.

So should we invalidate the TLB entries for all related pages (e.g.
a,b,c,d), like __flush_tlb_range()? Or we could call
__kvm_tlb_flush_vmid() to invalidate all TLB entries.


We should perform an invalidate on each unmap. unmap_stage2_range seems
to do the right thing. __flush_tlb_range only caters for Stage1
mappings, and __kvm_tlb_flush_vmid() is too big a hammer, as it nukes
TLBs for the whole VM.
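[For reference, a minimal sketch of that invalidate-on-unmap behaviour
(Stage-2 break-before-make), assuming the real helpers kvm_set_pte() and
kvm_tlb_flush_vmid_ipa(); the loop itself is illustrative, not the
kernel's exact code:

/*
 * Hedged sketch: clear each valid PTE first ("break"), then issue an
 * IPA-scoped Stage-2 invalidation for that page, so no stale 4KB TLB
 * entry can survive the unmap. Helper names mirror virt/kvm/arm/mmu.c.
 */
static void sketch_unmap_stage2_ptes(struct kvm *kvm, pte_t *pte,
				     phys_addr_t addr, phys_addr_t end)
{
	do {
		if (!pte_none(*pte)) {
			kvm_set_pte(pte, __pte(0));        /* break...       */
			kvm_tlb_flush_vmid_ipa(kvm, addr); /* ...then flush  */
		}
	} while (pte++, addr += PAGE_SIZE, addr != end);
}
]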

I'd really like to understand what you're seeing, and how to reproduce
it. Do you have a minimal example I could run on my own HW?


When I start live migration for a VM, qemu begins to log and count dirty
pages. During the live migration, KVM sets the pages READONLY so that we
can count how many pages will be written afterwards.

Everything is OK until I cancel the live migration and qemu stops
logging. Later the VM hangs. The trace log repeatedly shows a level-3
permission fault caused by a write to the same IPA. After analyzing the
source code, I find that KVM always returns from the *if* statement
below in stage2_set_pmd_huge() even though we only have a single VCPU:

 /*
  * Multiple vcpus faulting on the same PMD entry, can
  * lead to them sequentially updating the PMD with the
  * same value. Following the break-before-make
  * (pmd_clear() followed by tlb_flush()) process can
  * hinder forward progress due to refaults generated
  * on missing translations.
  *
  * Skip updating the page table if the entry is
  * unchanged.
  */
 if (pmd_val(old_pmd) == pmd_val(*new_pmd))
 return 0;

The PMD already has the PMD_S2_RDWR bit set. I suspect
kvm_tlb_flush_vmid_ipa() does not invalidate Stage-2 for the subpages of
the PMD (except the first PTE of this PMD). Finally I added some debug
code to flush the TLB for all subpages of the PMD, as shown below:

 /*
  * Mapping in huge pages should only happen through a
  * fault.  If a page is merged into a transparent huge
  * page, the individual subpages of that huge page
  * should be unmapped through MMU notifiers before we
  * get here.
  *
  * Merging of CompoundPages is not supported; they
 * should instead be split first, unmapped, merged,
  * and mapped back in on-demand.
  */
 VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));

	pmd_clear(pmd);
	for (cnt = 0; cnt < 512; cnt++)
		kvm_tlb_flush_vmid_ipa(kvm, addr + cnt * PAGE_SIZE);

Then the 

Re: [RFC] arm/cpu: fix soft lockup panic after resuming from stop

2019-03-14 Thread Heyi Guo

Hi Peter and Steven,


On 2019/3/13 18:11, Steven Price wrote:

On 12/03/2019 14:12, Marc Zyngier wrote:

Hi Peter,

On 12/03/2019 10:08, Peter Maydell wrote:

On Tue, 12 Mar 2019 at 06:10, Heyi Guo  wrote:

When we stop a VM for more than 30 seconds and then resume it via the
qemu monitor commands "stop" and "cont", Linux in the VM complains of
"soft lockup - CPU#x stuck for xxs!" as below:

[ 2783.809517] watchdog: BUG: soft lockup - CPU#3 stuck for 2395s!
[ 2783.809559] watchdog: BUG: soft lockup - CPU#2 stuck for 2395s!
[ 2783.809561] watchdog: BUG: soft lockup - CPU#1 stuck for 2395s!
[ 2783.809563] Modules linked in...

This is because the guest Linux uses the generic timer's virtual counter
as a software watchdog, and CNTVCT_EL0 does not stop while the VM is
stopped by qemu.

This patch fixes the issue by saving the value of CNTVCT_EL0 when
stopping and restoring it when resuming.

An alternative way of fixing this particular issue ("stop"/"cont"
commands in QEMU) would be to wire up KVM_KVMCLOCK_CTRL for arm to allow
QEMU to signal to the guest that it was forcibly stopped for a while
(and so the watchdog timeout can be ignored by the guest).


Hi -- I know we have issues with the passage of time in Arm VMs
running under KVM when the VM is suspended, but the topic is
a tricky one, and it's not clear to me that this is the correct
way to fix it. I would prefer to see us start with a discussion
on the kvm-arm mailing list about the best approach to the problem.

I've cc'd that list and a couple of the Arm KVM maintainers
for their opinion.

QEMU patch left below for context -- the brief summary is that
it uses KVM_GET_ONE_REG/KVM_SET_ONE_REG on the timer CNT register
to save it on VM pause and write that value back on VM resume.
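[For illustration, a minimal userspace sketch of that save/restore,
assuming an open per-vcpu fd and the arm64 UAPI register id
KVM_REG_ARM_TIMER_CNT; error handling elided:

/* Hedged sketch: read CNTVCT on "stop", write it back on "cont". */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int timer_cnt_get(int vcpu_fd, uint64_t *val)
{
	struct kvm_one_reg reg = {
		.id   = KVM_REG_ARM_TIMER_CNT,
		.addr = (uintptr_t)val,
	};
	return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
}

static int timer_cnt_set(int vcpu_fd, uint64_t val)
{
	struct kvm_one_reg reg = {
		.id   = KVM_REG_ARM_TIMER_CNT,
		.addr = (uintptr_t)&val,
	};
	return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
}
]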

That's probably good enough for this particular use case, but I think
there is more. I can get into similar trouble if I suspend my laptop, or
suspend QEMU. It also has the slightly bizarre effect of skewing time,
and this will affect timekeeping in the guest in ways that are much more
subtle than just shouty CPUs.

Indeed this is the bigger issue - user space doesn't get an opportunity
to be involved when suspending/resuming, so saving/restoring (or using
KVM_KVMCLOCK_CTRL) in user space won't fix these cases.


Christoffer and Steve had some stuff regarding Live Physical Time, which
should cover that, and other cases such as host system suspend, and QEMU
being suspended.

Live Physical Time (LPT) is only part of the solution - this handles the
mess that otherwise would occur when moving to a new host with a
different clock frequency.

Personally I think what we need is:

* Either a patch like the one from Heyi Guo (save/restore CNTVCT_EL0) or
alternatively hooking up KVM_KVMCLOCK_CTRL to prevent the watchdog
firing when user space explicitly stops scheduling the guest for a while.

Per Steven's comments, the above change is still needed even when LPT is
implemented, so does it make sense for us to apply this patch first? At
least it fixes something and does not conflict with the future solution.
And do you think it is better to use KVM_KVMCLOCK_CTRL?

Thanks,
Heyi



* KVM itself saving/restoring CNTVCT_EL0 during suspend/resume so the
guest doesn't see time pass during a suspend.

* Something equivalent to MSR_KVM_WALL_CLOCK_NEW for arm which allows
the guest to query the wall clock time from the host and provides an
offset between CNTVCT_EL0 and wall clock time, which KVM can update
during suspend/resume. This means that across a suspend/resume the guest
can observe that wall clock time has passed, without having to be
bothered about CNTVCT_EL0 jumping forwards. (A rough sketch of such a
structure follows below.)
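[A rough sketch of the kind of guest-visible record that last point
implies; purely illustrative, not an existing ABI:

/* Hedged sketch, NOT an existing ABI: a host-maintained snapshot pairing
 * wall-clock time with the CNTVCT_EL0 value read at the same instant.
 * The host refreshes it across suspend/resume; the guest derives wall
 * time without caring whether CNTVCT_EL0 jumped. */
#include <stdint.h>

struct wall_clock_snapshot {
	uint32_t seq;      /* odd while the host is updating */
	uint64_t wall_ns;  /* wall-clock nanoseconds at the snapshot */
	uint64_t cntvct;   /* CNTVCT_EL0 value at the same instant */
};

/* Guest side (illustrative):
 *   now_ns = wall_ns + (CNTVCT_EL0 - cntvct) * NSEC_PER_SEC / cntfrq;
 */
]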

Steve





___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Kick cpu when WFI in single-threaded kvm integration

2019-03-14 Thread Jan Bolke
Hi all,

Currently I am working on a SystemC integration of KVM on arm.
For this I use the KVM API and, of course, SystemC (a library for
simulating hardware platforms in C++).

Since I need the virtual CPU to interrupt its execution loop from time
to time so the rest of the SystemC simulation can run, I use a
perf_event and have the kernel send a signal on overflow to the
simulation thread, which kicks the virtual CPU (suggested by this
mailing list, thanks again). Thus I am able to implement a quantum
mechanism for the virtual CPU, sketched below.
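[For context, a minimal sketch of that quantum mechanism, assuming a
per-thread instruction counter whose overflow is routed to the thread as
a signal via O_ASYNC; the attribute choices are illustrative:

/* Hedged sketch: count guest instructions with perf, overflow every
 * `quantum` instructions, and have the kernel deliver a signal whose
 * arrival makes the blocked KVM_RUN ioctl return with -EINTR. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int arm_quantum_counter(unsigned long long quantum)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_INSTRUCTIONS;
	attr.sample_period = quantum;	/* overflow every quantum insns */
	attr.wakeup_events = 1;		/* notify on each overflow */
	attr.exclude_host = 1;		/* count guest execution only */

	fd = syscall(SYS_perf_event_open, &attr, 0 /* this thread */,
		     -1 /* any cpu */, -1 /* no group */, 0);
	if (fd < 0)
		return -1;

	/* Route overflow notifications to us as a signal (SIGIO). */
	fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);
	fcntl(fd, F_SETOWN, getpid());
	return fd;
}
]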

When running benchmarks (e.g. Coremark) on my virtual platform, this
works fine.

I can also boot Linux until it spawns the terminal and then waits for
interrupts from my virtual UART. Here comes the problem:
the perf event counter increments its instruction count very, very
slowly while the virtual CPU executes WFI, so my whole simulation starts
to hang. As my simulation is single-threaded, I need the signal from the
kernel to kick my CPU so that the virtual UART can deliver its interrupt
in response to my input. I tried the request_interrupt_window flag, but
that does not seem to work.

Is there a way to kick the virtual CPU while it is waiting for
interrupts, or do I have to patch my KVM code?
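[One possibility, sketched under the assumption that delivery of any
unblocked signal forces KVM_RUN to return with -EINTR even while the
guest sleeps in WFI: arm a wall-clock POSIX timer alongside the
instruction counter, so a kick still arrives when no instructions
retire. Signal number and period here are illustrative:

/* Hedged sketch: a wall-clock fallback kick. CLOCK_MONOTONIC keeps
 * running while the vCPU idles in WFI, and the signal's arrival alone
 * makes the blocked KVM_RUN ioctl return. */
#include <signal.h>
#include <string.h>
#include <time.h>

static void kick_handler(int sig) { /* arrival alone does the kick */ }

static void install_wfi_kick(long period_ns)
{
	struct sigaction sa;
	struct sigevent sev;
	struct itimerspec its;
	timer_t t;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = kick_handler;
	sigaction(SIGRTMIN, &sa, NULL);

	memset(&sev, 0, sizeof(sev));
	sev.sigev_notify = SIGEV_SIGNAL;
	sev.sigev_signo = SIGRTMIN;
	timer_create(CLOCK_MONOTONIC, &sev, &t);

	memset(&its, 0, sizeof(its));
	its.it_value.tv_nsec = period_ns;
	its.it_interval.tv_nsec = period_ns;
	timer_settime(t, 0, &its, NULL);
}
]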

Thanks in advance
Jan Bölke



Re: [RFC] Question about TLB flush while set Stage-2 huge pages

2019-03-14 Thread Suzuki K Poulose
Hi Zheng,

On Wed, Mar 13, 2019 at 05:45:31PM +0800, Zheng Xiang wrote:
> 
> 
> On 2019/3/13 2:18, Marc Zyngier wrote:
> > Hi Zheng,
> > 
> > On 12/03/2019 15:30, Zheng Xiang wrote:
> >> Hi Marc,
> >>
> >> On 2019/3/12 19:32, Marc Zyngier wrote:
> >>> Hi Zheng,
> >>>
> >>> On 11/03/2019 16:31, Zheng Xiang wrote:
>  Hi all,
> 
>  When a page is merged into a transparent huge page, KVM invalidates
>  Stage-2 for the base address of the huge page and the whole of
>  Stage-1. However, this only invalidates the first page within the
>  huge page; the other pages are not invalidated, see below:
> 
>  +---+---+---+---+---+--------------+
>  | a | b | c | d | e |     ...      |  <- one 2MB page
>  +---+---+---+---+---+--------------+
> 
>  TLB before setting new pmd:
>  +---+--+
>  |  VA   |PAGESIZE  |
>  +---+--+
>  |  a|  4KB |
>  +---+--+
>  |  b|  4KB |
>  +---+--+
>  |  c|  4KB |
>  +---+--+
>  |  d|  4KB |
>  +---+--+
> 
>  TLB after setting new pmd:
>  +---+--+
>  |  VA   |PAGESIZE  |
>  +---+--+
>  |  a|  2MB |
>  +---+--+
>  |  b|  4KB |
>  +---+--+
>  |  c|  4KB |
>  +---+--+
>  |  d|  4KB |
>  +---+--+
> 
>  When the VM accesses address *b*, it will hit the TLB and may result
>  in TLB conflict aborts or other potential exceptions.
> >>>
> >>> That's really bad. I can only imagine two scenarios:
> >>>
> >>> 1) We fail to unmap a,b,c,d (and potentially another 508 PTEs), losing
> >>> the PTE table in the process, and place the PMD instead. I can't see
> >>> this happening.
> >>>
> >>> 2) We fail to invalidate on unmap, and that's slightly less bad (but
> >>> still quite bad).
> >>>
> >>> Which of the two cases are you seeing?
> >>>
>  For example, we need to keep track of the VM's dirty memory pages
>  during live migration. KVM sets the memslot READONLY and splits the
>  huge pages. After the live migration is canceled and aborted, the
>  pages are merged back into THPs. Later accesses to these READONLY
>  pages cause level-3 Permission Faults until the entries are
>  invalidated.
> 
>  So should we invalidate the TLB entries for all related pages (e.g.
>  a,b,c,d), like __flush_tlb_range()? Or we could call
>  __kvm_tlb_flush_vmid() to invalidate all TLB entries.
> >>>
> >>> We should perform an invalidate on each unmap. unmap_stage2_range seems
> >>> to do the right thing. __flush_tlb_range only caters for Stage1
> >>> mappings, and __kvm_tlb_flush_vmid() is too big a hammer, as it nukes
> >>> TLBs for the whole VM.
> >>>
> >>> I'd really like to understand what you're seeing, and how to reproduce
> >>> it. Do you have a minimal example I could run on my own HW?
> >>
> >> When I start live migration for a VM, qemu begins to log and count
> >> dirty pages. During the live migration, KVM sets the pages READONLY
> >> so that we can count how many pages will be written afterwards.
> >>
> >> Everything is OK until I cancel the live migration and qemu stops
> >> logging. Later the VM hangs. The trace log repeatedly shows a level-3
> >> permission fault caused by a write to the same IPA. After analyzing
> >> the source code, I find that KVM always returns from the *if*
> >> statement below in stage2_set_pmd_huge() even though we only have a
> >> single VCPU:
> >>
> >> /*
> >>  * Multiple vcpus faulting on the same PMD entry, can
> >>  * lead to them sequentially updating the PMD with the
> >>  * same value. Following the break-before-make
> >>  * (pmd_clear() followed by tlb_flush()) process can
> >>  * hinder forward progress due to refaults generated
> >>  * on missing translations.
> >>  *
> >>  * Skip updating the page table if the entry is
> >>  * unchanged.
> >>  */
> >> if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> >> return 0;
> >>
> >> The PMD already has the PMD_S2_RDWR bit set. I suspect
> >> kvm_tlb_flush_vmid_ipa() does not invalidate Stage-2 for the subpages
> >> of the PMD (except the first PTE of this PMD). Finally I added some
> >> debug code to flush the TLB for all subpages of the PMD, as shown
> >> below:
> >>
> >> /*
> >>  * Mapping in huge pages should only happen through a
> >>   

RE: [PATCH] KVM: arm/arm64: Set ICH_HCR_EN in guest anyway when using gicv4

2019-03-14 Thread Tangnianyao (ICT)
Hi, Marc, Shameer

> -Original Message-
> From: Marc Zyngier [mailto:marc.zyng...@arm.com]
> Sent: Thursday, March 14, 2019 12:31 AM
> To: Shameerali Kolothum Thodi ;
> Tangnianyao (ICT) ; Christoffer Dall
> ; james.mo...@arm.com; julien.thie...@arm.com;
> suzuki.poul...@arm.com; catalin.mari...@arm.com; will.dea...@arm.com;
> alex.ben...@linaro.org; mark.rutl...@arm.com; andre.przyw...@arm.com;
> Zhangshaokun ; keesc...@chromium.org;
> linux-arm-ker...@lists.infradead.org; kvmarm@lists.cs.columbia.edu;
> linux-ker...@vger.kernel.org
> Cc: Linuxarm 
> Subject: Re: [PATCH] KVM: arm/arm64: Set ICH_HCR_EN in guest anyway when
> using gicv4
> 
> On 13/03/2019 15:59, Shameerali Kolothum Thodi wrote:
> 
> Hi Shameer,
> 
> >> Can you please give the following patch a go? I can't test it, but
> >> hopefully you can.
> >
> > Thanks for your quick response. I just did a quick test on one of our
> > platforms with VHE+GICv4 and it seems to fix the performance issue we
> > were seeing when GICv4 is enabled.
> >
> > Test setup:
> >
> > Host connected to a PC over a 10G port.
> > Launch Guest with an assigned vf dev.
> > Run iperf from the Guest:
> >
> > 5.0 kernel:
> > [ ID]  Interval   Transfer Bandwidth
> > [  3]  0.0-10.0 sec  1.30 GBytes  1.12 Gbits/sec
> >
> > +Patch:
> >
> > [ ID]   Interval   Transfer Bandwidth
> > [  3]  0.0-10.0 sec  10.9 GBytes  9.39 Gbits/sec
> 
> Ah, that looks much better! I'll try to write it properly, as I think what we 
> have
> today is slightly bizarre (we write ICH_HCR_EL2 from too many paths, and I
> need to remember how the whole thing works).
> 
> Thanks,
> 
>   M.
> --
> Jazz is not dead. It just smells funny...

Test setup:
VHE enabled.
Connected to an Intel 8180 server over a 10G port.
Launch the guest with an assigned VF device.
The Intel server acts as the server, the Arm guest VM as the client.
Run netperf in the guest with a 512-byte packet length:

5.0.0-rc3 kernel:
without patch: 2600 Mbits/s
with patch: 5800 Mbits/s


Thanks,
-Nianyao Tang