Re: [RFC] Question about TLB flush while set Stage-2 huge pages
Hi Suzuki,

On 2019/3/14 18:55, Suzuki K Poulose wrote:
> Hi Zheng,
>
> On Wed, Mar 13, 2019 at 05:45:31PM +0800, Zheng Xiang wrote:
>> On 2019/3/13 2:18, Marc Zyngier wrote:
>>> Hi Zheng,
>>>
>>> On 12/03/2019 15:30, Zheng Xiang wrote:
>>>> Hi Marc,
>>>>
>>>> On 2019/3/12 19:32, Marc Zyngier wrote:
>>>>> Hi Zheng,
>>>>>
>>>>> On 11/03/2019 16:31, Zheng Xiang wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> While a page is merged into a transparent huge page, KVM will
>>>>>> invalidate Stage-2 for the base address of the huge page and the
>>>>>> whole of Stage-1. However, this only invalidates the first page
>>>>>> within the huge page; the other pages are not invalidated, see
>>>>>> below:
>>>>>>
>>>>>> +----------------+
>>>>>> |abcde   2MB-Page|
>>>>>> +----------------+
>>>>>>
>>>>>> TLB before setting new pmd:
>>>>>> +------+----------+
>>>>>> |  VA  | PAGESIZE |
>>>>>> +------+----------+
>>>>>> |  a   |   4KB    |
>>>>>> +------+----------+
>>>>>> |  b   |   4KB    |
>>>>>> +------+----------+
>>>>>> |  c   |   4KB    |
>>>>>> +------+----------+
>>>>>> |  d   |   4KB    |
>>>>>> +------+----------+
>>>>>>
>>>>>> TLB after setting new pmd:
>>>>>> +------+----------+
>>>>>> |  VA  | PAGESIZE |
>>>>>> +------+----------+
>>>>>> |  a   |   2MB    |
>>>>>> +------+----------+
>>>>>> |  b   |   4KB    |
>>>>>> +------+----------+
>>>>>> |  c   |   4KB    |
>>>>>> +------+----------+
>>>>>> |  d   |   4KB    |
>>>>>> +------+----------+
>>>>>>
>>>>>> When the VM accesses address *b*, it will hit the TLB and result
>>>>>> in TLB conflict aborts or other potential exceptions.
>>>>>
>>>>> That's really bad. I can only imagine two scenarios:
>>>>>
>>>>> 1) We fail to unmap a,b,c,d (and potentially another 508 PTEs),
>>>>> losing the PTE table in the process, and place the PMD instead. I
>>>>> can't see this happening.
>>>>>
>>>>> 2) We fail to invalidate on unmap, and that is slightly less bad
>>>>> (but still quite bad).
>>>>>
>>>>> Which of the two cases are you seeing?
>>>>>
>>>>>> For example, we need to keep track of the VM's dirty memory pages
>>>>>> when the VM is in live migration. KVM will set the memslot
>>>>>> READONLY and split the huge pages. After live migration is
>>>>>> canceled and aborted, the pages will be merged into THP. Later
>>>>>> accesses to these pages, which are READONLY, will cause level-3
>>>>>> Permission Faults until they are invalidated.
>>>>>>
>>>>>> So should we invalidate the tlb entries for all related pages
>>>>>> (e.g. a,b,c,d), like __flush_tlb_range()? Or we can call
>>>>>> __kvm_tlb_flush_vmid() to invalidate all tlb entries.
>>>>>
>>>>> We should perform an invalidate on each unmap. unmap_stage2_range
>>>>> seems to do the right thing. __flush_tlb_range only caters for
>>>>> Stage1 mappings, and __kvm_tlb_flush_vmid() is too big a hammer, as
>>>>> it nukes TLBs for the whole VM.
>>>>>
>>>>> I'd really like to understand what you're seeing, and how to
>>>>> reproduce it. Do you have a minimal example I could run on my own
>>>>> HW?
>>>>
>>>> When I start the live migration for a VM, qemu begins to log and
>>>> count dirty pages. During the live migration, KVM sets the pages
>>>> READONLY so that we can count how many pages are written
>>>> afterwards.
>>>>
>>>> Everything is OK until I cancel the live migration and qemu stops
>>>> logging. Later the VM hangs. The trace log shows repeated level-3
>>>> permission faults caused by a write to the same IPA. After
>>>> analyzing the source code, I find KVM always returns from the *if*
>>>> statement below in stage2_set_pmd_huge(), even if we only have a
>>>> single VCPU:
>>>>
>>>>     /*
>>>>      * Multiple vcpus faulting on the same PMD entry, can
>>>>      * lead to them sequentially updating the PMD with the
>>>>      * same value. Following the break-before-make
>>>>      * (pmd_clear() followed by tlb_flush()) process can
>>>>      * hinder forward progress due to refaults generated
>>>>      * on missing translations.
>>>>      *
>>>>      * Skip updating the page table if the entry is
>>>>      * unchanged.
>>>>      */
>>>>     if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>>>>             return 0;
>>>>
>>>> The PMD already has the PMD_S2_RDWR bit set. I suspect
>>>> kvm_tlb_flush_vmid_ipa() does not invalidate Stage-2 for the
>>>> subpages of the PMD (except the first PTE of this PMD). Finally I
>>>> added some debug code to flush the tlb for all subpages of the PMD,
>>>> as shown below:
>>>>
>>>>     /*
>>>>      * Mapping in huge pages should only happen through a
>>>>      * fault. If a page is merged into a transparent huge
>>>>      * page, the individual subpages of that huge page
>>>>      * should be unmapped through MMU notifiers before we
>>>>      * get here.
>>>>      *
>>>>      * Merging of CompoundPages is not supported; they
>>>>      * should become splitting first, unmapped, merged,
>>>>      * and mapped back in on-demand.
>>>>      */
>>>>     VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>>>>
>>>>     pmd_clear(pmd);
>>>>     for (cnt = 0; cnt < 512; cnt++)
>>>>             kvm_tlb_flush_vmid_ipa(kvm, addr + cnt*PAGE_SIZE);
>>>>
>>>> Then the
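
To make the suspected failure mode concrete, here is a minimal user-space sketch. It is a toy model, not kernel code: it assumes a 2MB block made of 512 4KB pages, as in the thread, and models the TLB as a plain array of cached 4KB entries in which an invalidate-by-IPA drops only the entry for that one page. All names (`toy_tlb`, `toy_flush_ipa`, and so on) are hypothetical, and real hardware may well behave differently from this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u                          /* 4KB base page */
#define TOY_PMD_SIZE  (2u * 1024 * 1024)             /* 2MB block */
#define TOY_SUBPAGES  (TOY_PMD_SIZE / TOY_PAGE_SIZE) /* 512 subpages */

/* One cached 4KB translation; a hypothetical stand-in for a TLB entry. */
struct toy_tlb_entry {
    uint64_t ipa;
    bool valid;
};

struct toy_tlb {
    struct toy_tlb_entry e[TOY_SUBPAGES];
};

/* Cache a 4KB entry for every subpage of the 2MB block at 'base'. */
void toy_tlb_fill(struct toy_tlb *tlb, uint64_t base)
{
    for (size_t i = 0; i < TOY_SUBPAGES; i++) {
        tlb->e[i].ipa = base + (uint64_t)i * TOY_PAGE_SIZE;
        tlb->e[i].valid = true;
    }
}

/* Model assumption: invalidating by IPA drops only that page's entry. */
void toy_flush_ipa(struct toy_tlb *tlb, uint64_t ipa)
{
    uint64_t page = ipa & ~(uint64_t)(TOY_PAGE_SIZE - 1);

    for (size_t i = 0; i < TOY_SUBPAGES; i++)
        if (tlb->e[i].valid && tlb->e[i].ipa == page)
            tlb->e[i].valid = false;
}

size_t toy_tlb_count_valid(const struct toy_tlb *tlb)
{
    size_t n = 0;

    for (size_t i = 0; i < TOY_SUBPAGES; i++)
        n += tlb->e[i].valid ? 1 : 0;
    return n;
}

/* Flush only the base address; return how many stale entries remain. */
size_t toy_stale_after_base_flush(uint64_t base)
{
    struct toy_tlb tlb;

    toy_tlb_fill(&tlb, base);
    toy_flush_ipa(&tlb, base);
    return toy_tlb_count_valid(&tlb);
}

/* Flush every subpage, mirroring the debug loop quoted in the thread. */
size_t toy_stale_after_range_flush(uint64_t base)
{
    struct toy_tlb tlb;

    toy_tlb_fill(&tlb, base);
    for (size_t cnt = 0; cnt < TOY_SUBPAGES; cnt++)
        toy_flush_ipa(&tlb, base + (uint64_t)cnt * TOY_PAGE_SIZE);
    return toy_tlb_count_valid(&tlb);
}
```

In this model a single flush of the base address leaves 511 stale 4KB entries, while the per-subpage loop leaves none. Note that the model deliberately skips the unmap step: as Marc points out, those entries are expected to be invalidated already when the subpages are unmapped, so the sketch only illustrates what would go wrong if that invalidation were missed.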
Re: [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Hi Peter and Steven,

On 2019/3/13 18:11, Steven Price wrote:
> On 12/03/2019 14:12, Marc Zyngier wrote:
>> Hi Peter,
>>
>> On 12/03/2019 10:08, Peter Maydell wrote:
>>> On Tue, 12 Mar 2019 at 06:10, Heyi Guo wrote:
>>>> When we stop a VM for more than 30 seconds and then resume it,
>>>> using the qemu monitor commands "stop" and "cont", Linux in the VM
>>>> will complain of "soft lockup - CPU#x stuck for xxs!" as below:
>>>>
>>>> [ 2783.809517] watchdog: BUG: soft lockup - CPU#3 stuck for 2395s!
>>>> [ 2783.809559] watchdog: BUG: soft lockup - CPU#2 stuck for 2395s!
>>>> [ 2783.809561] watchdog: BUG: soft lockup - CPU#1 stuck for 2395s!
>>>> [ 2783.809563] Modules linked in...
>>>>
>>>> This is because Guest Linux uses the generic timer virtual counter
>>>> as a software watchdog, and CNTVCT_EL0 does not stop when the VM
>>>> is stopped by qemu.
>>>>
>>>> This patch fixes the issue by saving the value of CNTVCT_EL0 when
>>>> stopping and restoring it when resuming.
>
> An alternative way of fixing this particular issue ("stop"/"cont"
> commands in QEMU) would be to wire up KVM_KVMCLOCK_CTRL for arm to
> allow QEMU to signal to the guest that it was forcibly stopped for a
> while (and so the watchdog timeout can be ignored by the guest).
>
>>> Hi -- I know we have issues with the passage of time in Arm VMs
>>> running under KVM when the VM is suspended, but the topic is a
>>> tricky one, and it's not clear to me that this is the correct way
>>> to fix it. I would prefer to see us start with a discussion on the
>>> kvm-arm mailing list about the best approach to the problem.
>>>
>>> I've cc'd that list and a couple of the Arm KVM maintainers for
>>> their opinion.
>>>
>>> QEMU patch left below for context -- the brief summary is that it
>>> uses KVM_GET_ONE_REG/KVM_SET_ONE_REG on the timer CNT register to
>>> save it on VM pause and write that value back on VM resume.
>>
>> That's probably good enough for this particular use case, but I
>> think there is more. I can get into similar trouble if I suspend my
>> laptop, or suspend QEMU. It also has the slightly bizarre effect of
>> skewing time, and this will affect timekeeping in the guest in ways
>> that are much more subtle than just shouty CPUs.
>
> Indeed this is the bigger issue - user space doesn't get an
> opportunity to be involved when suspending/resuming, so
> saving/restoring (or using KVM_KVMCLOCK_CTRL) in user space won't fix
> these cases.
>
>> Christoffer and Steve had some stuff regarding Live Physical Time,
>> which should cover that, and other cases such as host system
>> suspend, and QEMU being suspended.
>
> Live Physical Time (LPT) is only part of the solution - this handles
> the mess that otherwise would occur when moving to a new host with a
> different clock frequency.
>
> Personally I think what we need is:
>
> * Either a patch like the one from Heyi Guo (save/restore CNTVCT_EL0)
> or alternatively hooking up KVM_KVMCLOCK_CTRL to prevent the watchdog
> firing when user space explicitly stops scheduling the guest for a
> while.

Per Steven's comments, the above change is still needed even when LPT
is implemented, so does it make sense for us to apply this patch
first? At least it fixes something and does not conflict with a future
solution.

And do you think it is better to use KVM_KVMCLOCK_CTRL?

Thanks,
Heyi

> * KVM itself saving/restoring CNTVCT_EL0 during suspend/resume so the
> guest doesn't see time pass during a suspend.
>
> * Something equivalent to MSR_KVM_WALL_CLOCK_NEW for arm which allows
> the guest to query the wall clock time from the host and provides an
> offset between CNTVCT_EL0 and wall clock time which KVM can update
> during suspend/resume. This means that during a suspend/resume the
> guest can observe that wall clock time has passed, without having to
> be bothered about CNTVCT_EL0 jumping forwards.
>
> Steve

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
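
The save/restore idea discussed in this thread can be illustrated with a small toy model. This is a sketch under stated assumptions, not QEMU or KVM code: it relies on the architectural relation that the guest-visible virtual counter is the physical counter minus an offset (CNTVCT_EL0 = CNTPCT_EL0 - CNTVOFF_EL2), and it assumes that pausing records the current virtual count while resuming recomputes the offset so that the guest sees the recorded value again. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-VM timer state for the toy model. */
struct toy_vtimer {
    uint64_t offset;     /* models CNTVOFF_EL2 */
    uint64_t saved_vcnt; /* virtual count recorded when the VM was paused */
};

/* Guest-visible counter: physical count minus the offset. */
uint64_t toy_read_vcnt(const struct toy_vtimer *t, uint64_t phys_now)
{
    return phys_now - t->offset;
}

/* On "stop": remember what the guest last saw. */
void toy_pause(struct toy_vtimer *t, uint64_t phys_now)
{
    t->saved_vcnt = toy_read_vcnt(t, phys_now);
}

/* On "cont": pick the offset that makes the guest see the saved value. */
void toy_resume(struct toy_vtimer *t, uint64_t phys_now)
{
    t->offset = phys_now - t->saved_vcnt;
}

/*
 * Scenario helper: pause at 'pause_at', resume at 'resume_at', then read
 * at 'read_at' (all in physical counter ticks). Returns how many ticks
 * the guest believes have elapsed since just before the pause.
 */
uint64_t toy_guest_elapsed_across_pause(uint64_t pause_at,
                                        uint64_t resume_at,
                                        uint64_t read_at)
{
    struct toy_vtimer t = { .offset = 0, .saved_vcnt = 0 };
    uint64_t before = toy_read_vcnt(&t, pause_at);

    toy_pause(&t, pause_at);
    toy_resume(&t, resume_at);
    return toy_read_vcnt(&t, read_at) - before;
}
```

With a pause at tick 1000, a resume at tick 5000 and a read at tick 5100, the guest in this model observes only 100 elapsed ticks, so a watchdog based on the virtual counter would not see the 4000-tick gap. The same arithmetic also shows the drawback raised in the thread: the guest's counter now lags wall-clock time by the length of the pause.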
Kick cpu when WFI in single-threaded kvm integration
Hi all,

Currently I am working on a SystemC integration of kvm on arm. For
this I use the kvm api and, of course, SystemC (a library for
simulating hardware platforms in C++).

As I need the virtual cpu to interrupt its execution loop from time to
time to let the rest of the SystemC simulation execute, I use a
perf_event and let the kernel send a signal on overflow to the
simulation thread, which kicks the virtual cpu (suggested by this
mailing list, thanks again). Thus I am able to simulate a quantum
mechanism for the virtual cpu.

When I run benchmarks (e.g. Coremark) on my virtual platform this
works fine. I can also boot Linux until it spawns the terminal and
then waits for interrupts from my virtual uart.

Here comes the problem: the perf event counter increments its
instruction count very slowly while the virtual cpu executes wfi, so
my whole simulation starts to hang. As my simulation is single
threaded, I need the signal from the kernel to kick my cpu so that the
virtual uart can deliver its interrupt in reaction to my input.

I tried to use the request_interrupt_window flag but this does not
seem to work.

Is there a way to kick the virtual cpu when it is waiting for
interrupts? Or do I have to patch my kvm code?

Thanks in advance
Jan Bölke
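
One general mechanism that may be relevant to the question above: on Linux, a signal delivered to a thread blocked in an interruptible system call makes that call return with EINTR, and the KVM API documentation lists EINTR as a KVM_RUN result when an unmasked signal is pending. The sketch below does not use KVM at all. It only demonstrates the signal/EINTR behaviour, with a wall-clock interval timer (`setitimer`) as the signal source and `nanosleep` standing in for a blocked KVM_RUN. Whether a wall-clock timer is an acceptable complement to the instruction-count perf event in this SystemC setup is an assumption, and the `toy_` names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* expose sigaction/setitimer/nanosleep under strict -std */
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

/* Number of timer signals handled so far. */
static volatile sig_atomic_t toy_kicks;

static void toy_on_alarm(int signo)
{
    (void)signo;
    toy_kicks++;
}

/*
 * Arm a one-shot wall-clock timer that fires after kick_ms milliseconds
 * (kick_ms must be > 0), then block in nanosleep() for up to block_sec
 * seconds. Returns 1 if the blocking call was cut short by the signal,
 * 0 if it ran to completion, and -1 if setup failed.
 */
int toy_block_until_kicked(long kick_ms, long block_sec)
{
    struct sigaction sa;
    struct itimerval it;
    struct timespec ts;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = toy_on_alarm; /* SA_RESTART deliberately not set */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    memset(&it, 0, sizeof(it)); /* zero interval: fire once */
    it.it_value.tv_sec = kick_ms / 1000;
    it.it_value.tv_usec = (kick_ms % 1000) * 1000;
    if (setitimer(ITIMER_REAL, &it, NULL) != 0)
        return -1;

    ts.tv_sec = block_sec;
    ts.tv_nsec = 0;
    if (nanosleep(&ts, NULL) == -1 && errno == EINTR && toy_kicks > 0)
        return 1;
    return 0;
}
```

The sketch has a small race: if the signal arrives before the blocking call starts, the call is not interrupted. A real integration would have to close that window, for example with the `immediate_exit` field of `struct kvm_run` or with signal masking via KVM_SET_SIGNAL_MASK, both of which are part of the KVM API; how well they fit this particular simulation loop is not something this sketch can answer.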
RE: [PATCH] KVM: arm/arm64: Set ICH_HCR_EN in guest anyway when using gicv4
Hi, Marc, Shameer

> -----Original Message-----
> From: Marc Zyngier [mailto:marc.zyng...@arm.com]
> Sent: Thursday, March 14, 2019 12:31 AM
> To: Shameerali Kolothum Thodi ; Tangnianyao (ICT) ; Christoffer Dall ;
> james.mo...@arm.com; julien.thie...@arm.com; suzuki.poul...@arm.com;
> catalin.mari...@arm.com; will.dea...@arm.com; alex.ben...@linaro.org;
> mark.rutl...@arm.com; andre.przyw...@arm.com; Zhangshaokun ;
> keesc...@chromium.org; linux-arm-ker...@lists.infradead.org;
> kvmarm@lists.cs.columbia.edu; linux-ker...@vger.kernel.org
> Cc: Linuxarm
> Subject: Re: [PATCH] KVM: arm/arm64: Set ICH_HCR_EN in guest anyway when
> using gicv4
>
> On 13/03/2019 15:59, Shameerali Kolothum Thodi wrote:
>
> Hi Shameer,
>
> >> Can you please give the following patch a go? I can't test it, but
> >> hopefully you can.
> >
> > Thanks for your quick response. I just did a quick test on one of our
> > platforms with VHE+GICv4 and it seems to fix the performance issue we
> > were seeing when GICv4 is enabled.
> >
> > Test setup:
> >
> > Host connected to a PC over a 10G port.
> > Launch Guest with an assigned vf dev.
> > Run iperf from Guest,
> >
> > 5.0 kernel:
> > [ ID] Interval       Transfer     Bandwidth
> > [  3]  0.0-10.0 sec  1.30 GBytes  1.12 Gbits/sec
> >
> > +Patch:
> >
> > [ ID] Interval       Transfer     Bandwidth
> > [  3]  0.0-10.0 sec  10.9 GBytes  9.39 Gbits/sec
>
> Ah, that looks much better! I'll try to write it properly, as I think
> what we have today is slightly bizarre (we write ICH_HCR_EL2 from too
> many paths, and I need to remember how the whole thing works).
>
> Thanks,
>
> M.
> --
> Jazz is not dead. It just smells funny...

Test setup:

VHE enabled.
Connected to an Intel 8180 server over a 10G port.
Launch guest with an assigned vf dev.
Intel server as server, Arm guest VM as client.
Run netperf in guest, packet length 512 bytes.

5.0.0-rc3 kernel:
without patch: 2600 Mbits/s
with patch:    5800 Mbits/s

Thanks,
-Nianyao Tang