Re: [PATCH v4 06/10] KVM: MMU: fast path of handling guest page fault
On 05/05/2012 10:08 PM, Marcelo Tosatti wrote:

I am confused with '_everywhere_', does it mean all of the paths that read/update the spte? Why not only verify the paths which depend on is_writable_pte()?

I meant any path that updates from present->present.

OK, got it. So let us focus on mmu_spte_update() only. :)

For the reason that it is easy to verify that it is correct? But these paths are safe since they do not care about PT_WRITABLE_MASK at all. What these paths do care about is that the Dirty bit and Accessed bit are not lost; that is why we always treat the spte as volatile if it can be updated out of mmu-lock.

For further development? We can add a delta comment for is_writable_pte() to warn developers to use it more carefully. It is also very hard to verify the spte everywhere. :(

Actually, the current code that cares about PT_WRITABLE_MASK is just for tlb flush; maybe we can fold it into mmu_spte_update.

[ There are three ways to modify a spte: present -> nonpresent, nonpresent -> present, present -> present. But we only need to care about present -> present for lockless. ]

Also need to take memory ordering into account, which was not an issue before. So it is not only TLB flush.

It seems we do not need an explicit barrier; we always use atomic xchg to update the spte, which has already guaranteed the memory ordering. In mmu_spte_update():

	/* The return value indicates whether the tlb needs to be flushed. */
	static bool mmu_spte_update(u64 *sptep, u64 new_spte)
	{
		u64 old_spte = *sptep;
		bool flush = false;

		old_spte = xchg(sptep, new_spte);

		if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
			flush = true;
		...
	}

	/*
	 * Return true means we need to flush tlbs caused by changing the
	 * spte from writable to read-only.
	 */
	bool mmu_update_spte(u64 *sptep, u64 spte)
	{
		u64 last_spte, old_spte = *sptep;
		bool flush = false;

		last_spte = xchg(sptep, spte);

		if ((is_writable_pte(last_spte) ||
		     spte_has_updated_lockless(old_spte, last_spte)) &&
		    !is_writable_pte(spte))
			flush = true;

		/* track Dirty/Accessed bit ... */

		return flush;
	}

Furthermore, the style of "if (spte-has-changed) goto beginning" is feasible in set_spte since this path is a fast path. (It can speed up mmu_need_write_protect.)

What do you mean exactly?

It would be better if all these complications introduced by lockless updates could be avoided, say by using A/D bits as Avi suggested.

Anyway, i do not object to it if we have a better way to do these, but ..
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] KVM: Add APIs for unlocked TLB flush
On 05/03/2012 10:11 PM, Avi Kivity wrote: On 05/03/2012 04:23 PM, Xiao Guangrong wrote: On 05/03/2012 07:22 PM, Avi Kivity wrote:

Currently we flush the TLB while holding mmu_lock. This increases the lock hold time by the IPI round-trip time, increasing contention, and makes dropping the lock (for latency reasons) harder.

This patch changes TLB management to be usable locklessly, introducing the following APIs:

  kvm_mark_tlb_dirty() - mark the TLB as containing stale entries
  kvm_cond_flush_remote_tlbs() - flush the TLB if it was marked as dirty

These APIs can be used without holding mmu_lock (though if the TLB became stale due to shadow page table modifications, typically it will need to be called with the lock held to prevent other threads from seeing the modified page tables with the TLB unmarked and unflushed).

Signed-off-by: Avi Kivity a...@redhat.com
---
 Documentation/virtual/kvm/locking.txt | 14 ++
 arch/x86/kvm/paging_tmpl.h            |  4 ++--
 include/linux/kvm_host.h              | 22 +-
 virt/kvm/kvm_main.c                   | 29 -
 4 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt
index 3b4cd3b..f6c90479 100644
--- a/Documentation/virtual/kvm/locking.txt
+++ b/Documentation/virtual/kvm/locking.txt
@@ -23,3 +23,17 @@
 Arch:		x86
 Protects:	- kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
 		- tsc offset in vmcb
 Comment:	'raw' because updating the tsc offsets must not be preempted.
+
+3. TLB control
+--
+
+The following APIs should be used for TLB control:
+
+ - kvm_mark_tlb_dirty() - indicates that the TLB is out of sync wrt
+   either guest or host page tables.
+ - kvm_flush_remote_tlbs() - unconditionally flush the tlbs
+ - kvm_cond_flush_remote_tlbs() - flush the TLBs if previously marked
+
+These may be used without mmu_lock, though kvm_mark_tlb_dirty() needs to be
+used while holding mmu_lock if it is called due to host page table changes
+(contrast to guest page table changes).
In these patches, it seems that kvm_mark_tlb_dirty is always used under the protection of mmu-lock, yes?

Correct. It's possible we'll find a use outside mmu_lock, but this isn't likely.

If we need to call kvm_mark_tlb_dirty outside mmu-lock, just use kvm_flush_remote_tlbs instead:

	if (need-flush-tlb)
		flush = true;

	do something...

	if (flush)
		kvm_flush_remote_tlbs()

If both kvm_mark_tlb_dirty and kvm_cond_flush_remote_tlbs are used outside of mmu-lock, i think we can use kvm_flush_remote_tlbs instead. If it is so, dirtied_count/flushed_count need not be atomic.

But we always mark with mmu_lock held.

Yes, so we can change kvm_mark_tlb_dirty to:

+static inline void kvm_mark_tlb_dirty(struct kvm *kvm)
+{
+	/*
+	 * Make any changes to the page tables visible to remote flushers.
+	 */
+	smp_mb();
+	kvm->tlb_state.dirtied_count++;
+}
Re: possible circular locking dependency
On 05/07/2012 06:47 AM, Paul E. McKenney wrote: On Sun, May 06, 2012 at 11:34:39PM +0300, Sergey Senozhatsky wrote: On (05/06/12 09:42), Paul E. McKenney wrote: On Sun, May 06, 2012 at 11:55:30AM +0300, Avi Kivity wrote: On 05/03/2012 11:02 PM, Sergey Senozhatsky wrote: Hello, 3.4-rc5

Whoa. Looks like inconsistent locking between cpufreq and synchronize_srcu_expedited(). kvm triggered this because it is one of the few users of synchronize_srcu_expedited(), but I don't think it is doing anything wrong directly. Dave, Paul?

SRCU hasn't changed much in mainline for quite some time. Holding the hotplug mutex across a synchronize_srcu() is a bad idea, though. However, there is a reworked implementation (courtesy of Lai Jiangshan) in -rcu that does not acquire the hotplug mutex. Could you try that out?

Paul, should I try solely -rcu or are there several commits to pick up and apply on top of the -linus tree?

If you want the smallest possible change, take the rcu/srcu branch of -rcu. If you want the works, take the rcu/next branch of -rcu. You can find -rcu at: git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

To make the difference even smaller, merge the above branch with v3.4-rc5.

-- error compiling committee.c: too many arguments to function
RE: [PATCH] kvm: Enable device LTR/OBFF capibility before doing guest device assignment
-Original Message- From: Avi Kivity [mailto:a...@redhat.com] Sent: Sunday, May 06, 2012 11:34 PM To: Xudong Hao Cc: mtosa...@redhat.com; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; Zhang, Xiantao; Hao, Xudong; Alex Williamson Subject: Re: [PATCH] kvm: Enable device LTR/OBFF capability before doing guest device assignment

On 05/06/2012 06:24 PM, Xudong Hao wrote:

Enable device LTR/OBFF capability before doing device assignment, so that the guest can benefit from them.

cc += Alex

@@ -166,6 +166,10 @@ int kvm_assign_device(struct kvm *kvm,
 	if (pdev == NULL)
 		return -ENODEV;

+	/* Enable some device capabilities before doing device assignment,
+	 * so that the guest can benefit from them.
+	 */
+	kvm_iommu_enable_dev_caps(pdev);
 	r = iommu_attach_device(domain, pdev->dev);

Suppose we fail here. Do we need to disable_dev_caps()?

I don't think so. When a device is assigned to a guest, it is owned by the pci-stub driver, so iommu_attach_device() failing here does not affect anything. If the host wants to use it, the host device driver has its own enable/disable of the dev_caps.

 	if (r) {
 		printk(KERN_ERR "assign device %x:%x:%x.%x failed",

@@ -228,6 +232,7 @@ int kvm_deassign_device(struct kvm *kvm,
 		PCI_SLOT(assigned_dev->host_devfn),
 		PCI_FUNC(assigned_dev->host_devfn));

+	kvm_iommu_disable_dev_caps(pdev);
 	return 0;
 }

@@ -351,3 +356,30 @@ int kvm_iommu_unmap_guest(struct kvm *kvm)
 	iommu_domain_free(domain);
 	return 0;
 }
+
+static void kvm_iommu_enable_dev_caps(struct pci_dev *pdev)
+{
+	/* set default value */
+	unsigned long type = PCI_EXP_OBFF_SIGNAL_ALWAYS;
+	int snoop_lat_ns = 1024, nosnoop_lat_ns = 1024;

Where does this magic number come from?

The number is the max value that the register supports; it is set as a default here. We do not have any device at this point and do not know what the proper value is, so it sets a default value first.

+
+	/* LTR (latency tolerance reporting) allows devices to send
+	 * messages to the root complex indicating their latency
+	 * tolerance for snooped and unsnooped memory transactions.
+	 */
+	pci_enable_ltr(pdev);
+	pci_set_ltr(pdev, snoop_lat_ns, nosnoop_lat_ns);
+
+	/* OBFF (optimized buffer flush/fill), where supported,
+	 * can help improve energy efficiency by giving devices
+	 * information about when interrupts and other activity
+	 * will have a reduced power impact.
+	 */
+	pci_enable_obff(pdev, type);
+}
+
+static void kvm_iommu_disable_dev_caps(struct pci_dev *pdev)
+{
+	pci_disable_obff(pdev);
+	pci_disable_ltr(pdev);
+}

Do we need to communicate something about these capabilities to the guest?

I guess you mean that the host doesn't know whether the guest wants to enable them, right? The new LTR/OBFF features are supposed to be enabled by the guest if the platform and device support them.

-- error compiling committee.c: too many arguments to function
Re: [PATCH 1/4] KVM: Add APIs for unlocked TLB flush
On 05/07/2012 10:06 AM, Xiao Guangrong wrote: On 05/03/2012 10:11 PM, Avi Kivity wrote: On 05/03/2012 04:23 PM, Xiao Guangrong wrote: On 05/03/2012 07:22 PM, Avi Kivity wrote:

Currently we flush the TLB while holding mmu_lock. This increases the lock hold time by the IPI round-trip time, increasing contention, and makes dropping the lock (for latency reasons) harder.

This patch changes TLB management to be usable locklessly, introducing the following APIs:

  kvm_mark_tlb_dirty() - mark the TLB as containing stale entries
  kvm_cond_flush_remote_tlbs() - flush the TLB if it was marked as dirty

These APIs can be used without holding mmu_lock (though if the TLB became stale due to shadow page table modifications, typically it will need to be called with the lock held to prevent other threads from seeing the modified page tables with the TLB unmarked and unflushed).

Signed-off-by: Avi Kivity a...@redhat.com
---
 Documentation/virtual/kvm/locking.txt | 14 ++
 arch/x86/kvm/paging_tmpl.h            |  4 ++--
 include/linux/kvm_host.h              | 22 +-
 virt/kvm/kvm_main.c                   | 29 -
 4 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt
index 3b4cd3b..f6c90479 100644
--- a/Documentation/virtual/kvm/locking.txt
+++ b/Documentation/virtual/kvm/locking.txt
@@ -23,3 +23,17 @@
 Arch:		x86
 Protects:	- kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
 		- tsc offset in vmcb
 Comment:	'raw' because updating the tsc offsets must not be preempted.
+
+3. TLB control
+--
+
+The following APIs should be used for TLB control:
+
+ - kvm_mark_tlb_dirty() - indicates that the TLB is out of sync wrt
+   either guest or host page tables.
+ - kvm_flush_remote_tlbs() - unconditionally flush the tlbs
+ - kvm_cond_flush_remote_tlbs() - flush the TLBs if previously marked
+
+These may be used without mmu_lock, though kvm_mark_tlb_dirty() needs to be
+used while holding mmu_lock if it is called due to host page table changes
+(contrast to guest page table changes).

In these patches, it seems that kvm_mark_tlb_dirty is always used under the protection of mmu-lock, yes?

Correct. It's possible we'll find a use outside mmu_lock, but this isn't likely.

If we need to call kvm_mark_tlb_dirty outside mmu-lock, just use kvm_flush_remote_tlbs instead:

	if (need-flush-tlb)
		flush = true;

	do something...

	if (flush)
		kvm_flush_remote_tlbs()

It depends on how need-flush-tlb is computed. If it depends on sptes, then we must use kvm_cond_flush_remote_tlbs().

If both kvm_mark_tlb_dirty and kvm_cond_flush_remote_tlbs are used outside of mmu-lock, i think we can use kvm_flush_remote_tlbs instead. If it is so, dirtied_count/flushed_count need not be atomic.

But we always mark with mmu_lock held.

Yes, so we can change kvm_mark_tlb_dirty to:

+static inline void kvm_mark_tlb_dirty(struct kvm *kvm)
+{
+	/*
+	 * Make any changes to the page tables visible to remote flushers.
+	 */
+	smp_mb();
+	kvm->tlb_state.dirtied_count++;
+}

Yes. We'll have to change it again if we ever dirty sptes outside the lock, but that's okay.

-- error compiling committee.c: too many arguments to function
Re: kvm: KVM internal error. Suberror: 1
On 05/06/2012 08:19 PM, Sasha Levin wrote:

Hi all, During some fuzzing with trinity in a KVM guest running on qemu, I got the following error:

KVM internal error. Suberror: 1
emulation failure
RAX= RBX=8800284108e0 RCX=0001 RDX=84482008
RSI=1030 RDI=8180 RBP=880028723d38 RSP=880028723ce8
R8 =0206 R9 =f7e80206 R10= R11=
R12=88002841 R13=846ba1c0 R14=84a74970 R15=9530
RIP=8111c862 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =
CS =0010 00a09b00 DPL=0 CS64 [-RA]
SS =0018 00c09300 DPL=0 DS [-WA]
DS =
FS = 7f955873b700
GS = 880035a0
LDT=
TR =0040 880035bd2480 2087 8b00 DPL=0 TSS64-busy
GDT= 880035a04000 007f
IDT= 8436a000 0fff
CR0=8005003b CR2=7f5cfdad0518 CR3=1a154000 CR4=000407e0
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
EFER=0d01
Code=66 90 e8 7b 97 ff ff b8 01 00 00 00 eb 1c 0f 1f 40 00 31 c0 83 3d 97 9f c7 02 00 0f 95 c0 eb 0a 66 90 31 c0 66 0f 1f 44 00 00 48 8b 5d d8 4c 8b 65 e0

KVM internal error. Suberror: 1
emulation failure

This is cmpl $0x0,0x2c79f97(%rip) # 0x83d96800. I don't understand why it failed, we do emulate cmp. I'll try to write a unit test for it.

RAX=88000d5f8000 RBX=88000d600010 RCX=0001 RDX=
RSI=0001 RDI=88000d5f8000 RBP=88000d601ec8 RSP=88000d601ec8
R8 =0001 R9 = R10= R11=
R12=83fed960 R13= R14= R15=
RIP=8107d696 RFL=0286 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =
CS =0010 00a09b00 DPL=0 CS64 [-RA]
SS =0018 00c09300 DPL=0 DS [-WA]
DS =
FS =
GS = 88002980
LDT=
TR =0040 8800299d2480 2087 8b00 DPL=0 TSS64-busy
GDT= 880029804000 007f
IDT= 8436a000 0fff
CR0=8005003b CR2=7fcfa03f9e9c CR3=03a1c000 CR4=000407e0
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
EFER=0d01
Code=89 e5 fb c9 c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 c9 c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 c9 c3 66 0f 1f 84 00 00 00 00 00 55 8b 07 48

KVM internal error.
Suberror: 1
emulation failure
RAX=88000d5db000 RBX=88000d5ce010 RCX=0001 RDX=
RSI=0001 RDI=88000d5db000 RBP=88000d5cfec8 RSP=88000d5cfec8
R8 =0001 R9 = R10= R11=
R12=83fed960 R13= R14= R15=
RIP=8107d696 RFL=0286 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =
CS =0010 00a09b00 DPL=0 CS64 [-RA]
SS =0018 00c09300 DPL=0 DS [-WA]
DS =
FS =
GS = 88001b80
LDT=
TR =0040 88001b9d2480 2087 8b00 DPL=0 TSS64-busy
GDT= 88001b804000 007f
IDT= 8436a000 0fff
CR0=8005003b CR2=7fcfa076b518 CR3=1a148000 CR4=000407e0
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
EFER=0d01
Code=89 e5 fb c9 c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 c9 c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 c9 c3 66 0f 1f 84 00 00 00 00 00 55 8b 07 48

The assembly doesn't quite make sense, and the fact that I got 3 of these in a row makes me believe that it isn't an actual emulation error, but something else.

The assembly makes sense, it's sti; hlt; leaveq. What doesn't make sense is that we have to emulate leaveq - rsp and rbp point at normal memory as far as I can tell. The fact that it often happens after hlt makes me suspect interrupts are involved. Please run this again with a trace so we see what happens prior to the error.

-- error compiling committee.c: too many arguments to function
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
* Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote:

This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementations for both Xen and KVM. (Targeted for the 3.5 window.)

Note: This needs the debugfs changes patch that should be in Xen / linux-next: https://lkml.org/lkml/2012/3/30/687

Changes in V8:
 - Rebased patches to 3.4-rc4
 - Combined the KVM changes with the ticketlock + Xen changes (Ingo)
 - Removed CAP_PV_UNHALT since it is redundant (Avi). But note that we need a newer qemu which uses the KVM_GET_SUPPORTED_CPUID ioctl.
 - Rewrote the GET_MP_STATE condition (Avi)
 - Made pv_unhalt a bool (Avi)
 - Moved the pv_unhalt reset code out to vcpu_run from vcpu_block (Gleb)
 - Documentation changes (Rob Landley)
 - Added a printk to recognize that paravirt spinlock is enabled (Nikunj)
 - Moved the kick hypercall out of CONFIG_PARAVIRT_SPINLOCK so that it can be used for other optimizations such as flush_tlb_ipi_others etc. (Nikunj)

Ticket locks have an inherent problem in a virtualized case, because the vCPUs are scheduled rather than running concurrently (ignoring gang scheduled vCPUs). This can result in catastrophic performance collapses when the vCPU scheduler doesn't schedule the correct next vCPU, and ends up scheduling a vCPU which burns its entire timeslice spinning. (Note that this is not the same problem as lock-holder preemption, which this series also addresses; that's also a problem, but not catastrophic.)

(See Thomas Friebel's talk "Prevent Guests from Spinning Around", http://www.xen.org/files/xensummitboston08/LHP.pdf, for more details.)

Currently we deal with this by having PV spinlocks, which add a layer of indirection in front of all the spinlock functions, and define a completely new implementation for Xen (and for other pvops users, but there are none at present).
PV ticketlocks keep the existing ticketlock implementation (fastpath) as-is, but add a couple of pvops for the slow paths:

 - If a CPU has been waiting for a spinlock for SPIN_THRESHOLD iterations, then call out to the __ticket_lock_spinning() pvop, which allows a backend to block the vCPU rather than spinning. This pvop can set the lock into slowpath state.

 - When releasing a lock, if it is in slowpath state, then call __ticket_unlock_kick() to kick the next vCPU in line awake. If the lock is no longer in contention, it also clears the slowpath flag.

The slowpath state is stored in the LSB of the lock tail ticket. This has the effect of reducing the max number of CPUs by half (so, a small ticket can deal with 128 CPUs, and a large ticket 32768).

For KVM, one hypercall is introduced in the hypervisor, which allows a vcpu to kick another vcpu out of halt state. The blocking of the vcpu is done using halt() in the (lock_spinning) slowpath.

Overall, it results in a large reduction in code, it makes the native and virtualized cases closer, and it removes a layer of indirection around all the spinlock functions. The fast path (taking an uncontended lock which isn't in slowpath state) is optimal, identical to the non-paravirtualized case.
The inner part of the ticket lock code becomes:

	inc = xadd(&lock->tickets, inc);
	inc.tail &= ~TICKET_SLOWPATH_FLAG;

	if (likely(inc.head == inc.tail))
		goto out;

	for (;;) {
		unsigned count = SPIN_THRESHOLD;

		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
out:	barrier();

which results in:

	push   %rbp
	mov    %rsp,%rbp
	mov    $0x200,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f	# Slowpath if lock in contention
	pop    %rbp
	retq

	### SLOWPATH START
1:	and    $-2,%edx
	movzbl %dl,%esi
2:	mov    $0x800,%eax
	jmp    4f
3:	pause
	sub    $0x1,%eax
	je     5f
4:	movzbl (%rdi),%ecx
	cmp    %cl,%dl
	jne    3b
	pop    %rbp
	retq
5:	callq  *__ticket_lock_spinning
	jmp    2b
	### SLOWPATH END

with CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly, where the fastpath case is straight through (taking the lock without contention), and the spin loop is out of line:

	push   %rbp
	mov    %rsp,%rbp
	mov    $0x100,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f
	pop    %rbp
	retq

	### SLOWPATH START
1:	pause
	movzbl (%rdi),%eax
	cmp    %dl,%al
	jne    1b
	pop    %rbp
	retq
	### SLOWPATH END

The unlock code is complicated by the need to
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 11:29 AM, Ingo Molnar wrote: This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree? No objections, instead an Acked-by: Avi Kivity a...@redhat.com -- error compiling committee.c: too many arguments to function
Re: [PATCH] KVM: X86: Add mmx movq emulation
On Sun, May 06, 2012 at 01:08:05PM +0300, Avi Kivity wrote: On 05/04/2012 02:47 PM, Joerg Roedel wrote:

Add support for the MMX versions of the movq instructions to the instruction emulator. Also handle possible exceptions they may cause.

This is already in (cbe2c9d30, e59717550e). Are you looking at master instead of next?

Right, I was looking at master. Probably I should have re-read your mail about the new git workflow.

Since you've just thought of the issues involved, I'd appreciate a review of the commits above, both wrt correctness and maintainability.

The patches above look correct to me. In fact cbe2c9d30 is more general than my implementation because it fetches all possible mmx operands. My implementation, on the other side, should be a bit faster because it checks for FP exceptions directly when the registers are accessed, which saves one get_fpu/put_fpu cycle (and an fwait instruction).

In fact I already see one difference - my patches do reg &= 7, while your patches generate #UD for %mm8-%mm15. Your version is correct. Documentation says that REX prefixes are ignored where not supported or misplaced. I also tried that directly on hardware and it works as documented and implemented in KVM.

Regards, Joerg

-- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
Re: [PATCH] KVM: X86: Add mmx movq emulation
On 05/07/2012 11:56 AM, Joerg Roedel wrote:

Since you've just thought of the issues involved, I'd appreciate a review of the commits above, both wrt correctness and maintainability.

The patches above look correct to me. In fact cbe2c9d30 is more general than my implementation because it fetches all possible mmx operands. My implementation, on the other side, should be a bit faster because it checks for FP exceptions directly when the registers are accessed, which saves one get_fpu/put_fpu cycle (and an fwait instruction).

The get_fpu/put_fpu are nops (unless we schedule in between), since put_fpu() doesn't really unload the fpu. You're correct about the fwait; my motivation was to get the #MF exception early instead of doing the accesses first.

In fact I already see one difference - my patches do reg &= 7, while your patches generate #UD for %mm8-%mm15. Your version is correct. Documentation says that REX prefixes are ignored where not supported or misplaced. I also tried that directly on hardware and it works as documented and implemented in KVM.

Thanks for verifying.

-- error compiling committee.c: too many arguments to function
Re: [PATCH] KVM: X86: Remove stale values from ctxt-memop before emulation
On Sun, May 06, 2012 at 11:21:52AM +0300, Avi Kivity wrote:

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d4bf50c..1b516ec 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3937,6 +3937,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 	struct opcode opcode;

 	ctxt->memop.type = OP_NONE;
+	ctxt->memop.val = 0;
 	ctxt->memopp = NULL;
 	ctxt->_eip = ctxt->eip;
 	ctxt->fetch.start = ctxt->_eip;

This only works for long sized values - it doesn't initialize val64 on i386, for example. So I think it's better to change bsr (and family) to use emulate_2op_SrcV_nobyte() instead (which has the added benefit of using the same values as the processor for the undefined bits).

Right, that's a better solution. How about the attached patch? The zf check shouldn't be necessary anymore because the generated assembly uses dst.val as input and output, so writeback shouldn't do anything wrong. The bsr and bsf unit tests all pass again with this patch.

	Joerg

From e9262f18e90111d32b584084c0b5564cbd728d65 Mon Sep 17 00:00:00 2001
From: Joerg Roedel joerg.roe...@amd.com
Date: Mon, 7 May 2012 12:05:28 +0200
Subject: [PATCH] KVM: X86: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()

The instruction emulation for bsrw is broken in KVM because the code always uses bsr with 32 or 64 bit operand size for emulation. Fix that by using the emulate_2op_SrcV_nobyte() macro to use the guest operand size for emulation.
Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/emulate.c | 26 ++
 1 file changed, 2 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 0d151e2..a6f8488 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3134,35 +3134,13 @@ static int em_btc(struct x86_emulate_ctxt *ctxt)

 static int em_bsf(struct x86_emulate_ctxt *ctxt)
 {
-	u8 zf;
-
-	__asm__ ("bsf %2, %0; setz %1"
-		 : "=r"(ctxt->dst.val), "=q"(zf)
-		 : "r"(ctxt->src.val));
-
-	ctxt->eflags &= ~X86_EFLAGS_ZF;
-	if (zf) {
-		ctxt->eflags |= X86_EFLAGS_ZF;
-		/* Disable writeback. */
-		ctxt->dst.type = OP_NONE;
-	}
+	emulate_2op_SrcV_nobyte(ctxt, "bsf");
 	return X86EMUL_CONTINUE;
 }

 static int em_bsr(struct x86_emulate_ctxt *ctxt)
 {
-	u8 zf;
-
-	__asm__ ("bsr %2, %0; setz %1"
-		 : "=r"(ctxt->dst.val), "=q"(zf)
-		 : "r"(ctxt->src.val));
-
-	ctxt->eflags &= ~X86_EFLAGS_ZF;
-	if (zf) {
-		ctxt->eflags |= X86_EFLAGS_ZF;
-		/* Disable writeback. */
-		ctxt->dst.type = OP_NONE;
-	}
+	emulate_2op_SrcV_nobyte(ctxt, "bsr");
 	return X86EMUL_CONTINUE;
 }
--
1.7.9.5

-- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
Re: [PATCH] KVM: X86: Remove stale values from ctxt-memop before emulation
On 05/07/2012 01:12 PM, Joerg Roedel wrote: On Sun, May 06, 2012 at 11:21:52AM +0300, Avi Kivity wrote:

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d4bf50c..1b516ec 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3937,6 +3937,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 	struct opcode opcode;

 	ctxt->memop.type = OP_NONE;
+	ctxt->memop.val = 0;
 	ctxt->memopp = NULL;
 	ctxt->_eip = ctxt->eip;
 	ctxt->fetch.start = ctxt->_eip;

This only works for long sized values - it doesn't initialize val64 on i386, for example. So I think it's better to change bsr (and family) to use emulate_2op_SrcV_nobyte() instead (which has the added benefit of using the same values as the processor for the undefined bits).

Right, that's a better solution. How about the attached patch? The zf check shouldn't be necessary anymore because the generated assembly uses dst.val as input and output, so writeback shouldn't do anything wrong. The bsr and bsf unit tests all pass again with this patch.

From e9262f18e90111d32b584084c0b5564cbd728d65 Mon Sep 17 00:00:00 2001
From: Joerg Roedel joerg.roe...@amd.com
Date: Mon, 7 May 2012 12:05:28 +0200
Subject: [PATCH] KVM: X86: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()

The instruction emulation for bsrw is broken in KVM because the code always uses bsr with 32 or 64 bit operand size for emulation. Fix that by using the emulate_2op_SrcV_nobyte() macro to use the guest operand size for emulation.

It looks fine. Do you know what triggered this regression? (For figuring out if it's 3.4 material.)

-- error compiling committee.c: too many arguments to function
Re: [PATCH] KVM: X86: Add mmx movq emulation
On Sun, May 06, 2012 at 01:08:05PM +0300, Avi Kivity wrote:

This is already in (cbe2c9d30, e59717550e). Are you looking at master instead of next?

Btw. your mail about the new git-workflow states something about an auto-next branch. But I don't see that branch in the KVM tree (looking at git://git.kernel.org/pub/scm/virt/kvm/kvm.git). Is there another branch that contains all fixes and everything for the next merge window? Basically I am looking for a branch which has the new master and next merged.

Thanks, Joerg
Re: [PATCH] KVM: X86: Add mmx movq emulation
On 05/07/2012 01:28 PM, Joerg Roedel wrote:

Btw. your mail about the new git-workflow states something about an auto-next branch. But I don't see that branch in the KVM tree (looking at git://git.kernel.org/pub/scm/virt/kvm/kvm.git).

We forgot to generate it.

Is there another branch that contains all fixes and everything for the next merge window? Basically I am looking for a branch which has the new master and next merged.

Right. I'll get my scripts to generate it. (btw: auto-next = merge(upstream, master, next)).

-- error compiling committee.c: too many arguments to function
Re: [PATCH RFC 0/5] apic: eoi optimization support
* Michael S. Tsirkin m...@redhat.com wrote:

I'm looking at reducing the interrupt overhead for virtualized guests: some workloads spend a large part of their time processing interrupts. This patchset supplies infrastructure to reduce the IRQ ack overhead on x86: the idea is to add an eoi_write callback that we can then optimize without touching other apic functionality.

The main user will be kvm: on kvm, an EOI write from the guest causes an expensive exit to the host; we can avoid this using shared memory, as the last patch in the series demonstrates. But I also wrote a micro-optimized version for the regular x2apic: this shaves off a branch and about 9 instructions from EOI when x2apic is used, and a comment in ack_APIC_irq implies that someone counted instructions there, at some point. Also included in the patchset are a couple of trivial macro fixes.

The patches work fine on my boxes and I did look at the objdump output to verify that the generated code for the micro-optimization patch looks right and actually is shorter. Some benchmark results below (not sure what kind of testing is the most appropriate) show a tiny but measurable improvement. The tests were run on an AMD box with 24 cpus.

- A clean kernel build after reboot shows a tiny but measurable improvement in system time, which means lower CPU overhead (though not measurable in total time - that is dominated by user time and fluctuates too much):

linux# reboot -f
...
linux# make clean
linux# time make -j 64 LOCALVERSION= 2>&1 > /dev/null

Before:
real	2m52.244s
user	35m53.833s
sys	6m7.194s

After:
real	2m52.827s
user	35m48.916s
sys	6m2.305s

- perf micro-benchmarks seem to consistently show a tiny improvement in total time as well, but it's below the confidence level of 3 std deviations:

# ./tools/perf/perf stat --sync --repeat 100 --null perf bench sched messaging
...
0.414666797 seconds time elapsed ( +- 1.29% ) Performance counter stats for 'perf bench sched messaging' (100 runs): 0.395370891 seconds time elapsed ( +- 1.04% ) # ./tools/perf/perf stat --sync --repeat 100 --null perf bench sched pipe -l 1 0.307019664 seconds time elapsed ( +- 0.10% ) 0.304738024 seconds time elapsed ( +- 0.08% ) The patches are against 3.4-rc3 - let me know if I need to rebase. I think patches 1-2 are definitely a good idea, and patches 3-4 might be a good idea. Please review, and consider patches 1-4 for linux 3.5. Thanks, MST Michael S. Tsirkin (5): apic: fix typo EIO_ACK -> EOI_ACK and document apic: use symbolic APIC_EOI_ACK x86: add apic->eoi_write callback x86: eoi micro-optimization kvm_para: guest side for eoi avoidance arch/x86/include/asm/apic.h | 22 -- arch/x86/include/asm/apicdef.h | 2 +- arch/x86/include/asm/bitops.h | 6 ++- arch/x86/include/asm/kvm_para.h | 2 + arch/x86/kernel/apic/apic_flat_64.c | 2 + arch/x86/kernel/apic/apic_noop.c | 1 + arch/x86/kernel/apic/apic_numachip.c | 1 + arch/x86/kernel/apic/bigsmp_32.c | 1 + arch/x86/kernel/apic/es7000_32.c | 2 + arch/x86/kernel/apic/numaq_32.c | 1 + arch/x86/kernel/apic/probe_32.c | 1 + arch/x86/kernel/apic/summit_32.c | 1 + arch/x86/kernel/apic/x2apic_cluster.c | 1 + arch/x86/kernel/apic/x2apic_phys.c | 1 + arch/x86/kernel/apic/x2apic_uv_x.c | 1 + arch/x86/kernel/kvm.c | 51 ++-- arch/x86/platform/visws/visws_quirks.c | 2 +- 17 files changed, 88 insertions(+), 10 deletions(-) No objections from the x86 side. In terms of advantages, could you please create perf stat runs that count the number of MMIOs or so? That should show a pretty obvious improvement - and that is enough as proof, no need to try to reproduce the performance win in a noisy benchmark. Thanks, Ingo
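The "below the confidence level of 3 std deviations" claim can be checked directly from the numbers reported above. A quick sketch (an assumption here: perf's "+- N%" is treated as the relative standard deviation of the mean, which is how `perf stat --repeat` reports it):

```python
import math

def mean_sd(mean, pct):
    """Convert perf's 'x seconds ( +- y% )' output to (mean, absolute sd)."""
    return mean, mean * pct / 100.0

# perf bench sched messaging, 100 runs (figures quoted in the mail)
before, s_before = mean_sd(0.414666797, 1.29)
after,  s_after  = mean_sd(0.395370891, 1.04)

diff = before - after                          # observed improvement, seconds
sigma = math.sqrt(s_before**2 + s_after**2)    # sd of the difference

print(f"improvement {diff:.4f}s = {diff / sigma:.2f} sigma")
# The improvement is positive but the significance works out just under
# 3 sigma, matching the mail's statement.
```

So the measured ~19 ms improvement sits at roughly 2.9 sigma, which is why it is described as real-looking but below the 3-sigma threshold.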
Re: [PATCH] KVM: X86: Remove stale values from ctxt->memop before emulation
On Mon, May 07, 2012 at 01:18:01PM +0300, Avi Kivity wrote: On 05/07/2012 01:12 PM, Joerg Roedel wrote: The instruction emulation for bsrw is broken in KVM because the code always uses bsr with 32 or 64 bit operand size for emulation. Fix that by using the emulate_2op_SrcV_nobyte() macro to use the guest operand size for emulation. It looks fine. Do you know what triggered this regression? (for figuring out if it's 3.4 material) Looks like it is 3.4 (and -stable) material. I tested a few older kernels and the test passes on 3.0 but fails on 3.2 and later kernels (I have not tested 3.1). Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
Re: [PATCH] KVM: X86: Add mmx movq emulation
On Mon, May 07, 2012 at 01:30:48PM +0300, Avi Kivity wrote: On 05/07/2012 01:28 PM, Joerg Roedel wrote: On Sun, May 06, 2012 at 01:08:05PM +0300, Avi Kivity wrote: This is already in (cbe2c9d30, e59717550e). Are you looking at master instead of next? Btw. your mail about the new git-workflow states something about an auto-next branch. But I don't see that branch in the KVM tree (looking at git://git.kernel.org/pub/scm/virt/kvm/kvm.git). We forgot to generate it. Is there another branch that contains all fixes and everything for the next merge window? Basically I am looking for a branch which has the new master and next merged. Right. I'll get my scripts to generate it. (btw: auto-next = merge(upstream, master, next)). Cool thanks. That is perfect for our internal testing :) Joerg
Re: [PATCH RFC 0/5] apic: eoi optimization support
On Mon, May 07, 2012 at 12:35:12PM +0200, Ingo Molnar wrote: Michael S. Tsirkin (5): apic: fix typo EIO_ACK -> EOI_ACK and document apic: use symbolic APIC_EOI_ACK x86: add apic->eoi_write callback x86: eoi micro-optimization kvm_para: guest side for eoi avoidance arch/x86/include/asm/apic.h | 22 -- arch/x86/include/asm/apicdef.h | 2 +- arch/x86/include/asm/bitops.h | 6 ++- arch/x86/include/asm/kvm_para.h | 2 + arch/x86/kernel/apic/apic_flat_64.c | 2 + arch/x86/kernel/apic/apic_noop.c | 1 + arch/x86/kernel/apic/apic_numachip.c | 1 + arch/x86/kernel/apic/bigsmp_32.c | 1 + arch/x86/kernel/apic/es7000_32.c | 2 + arch/x86/kernel/apic/numaq_32.c | 1 + arch/x86/kernel/apic/probe_32.c | 1 + arch/x86/kernel/apic/summit_32.c | 1 + arch/x86/kernel/apic/x2apic_cluster.c | 1 + arch/x86/kernel/apic/x2apic_phys.c | 1 + arch/x86/kernel/apic/x2apic_uv_x.c | 1 + arch/x86/kernel/kvm.c | 51 ++-- arch/x86/platform/visws/visws_quirks.c | 2 +- 17 files changed, 88 insertions(+), 10 deletions(-) No objections from the x86 side. Is kvm.git a good tree to merge this through? In terms of advantages, could you please create perf stat runs that count the number of MMIOs or so? That should show a pretty obvious improvement - and that is enough as proof, no need to try to reproduce the performance win in a noisy benchmark. You mean with kvm PV, right? On real hardware the micro-optimization removes branches and maybe cache-misses, but I don't see why it would reduce MMIOs. -- MST
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 02:02 PM, Avi Kivity wrote: On 05/07/2012 11:29 AM, Ingo Molnar wrote: This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree? No objections, instead an Acked-by: Avi Kivity a...@redhat.com Thank you. Here is a benchmark result with the patches. 3 guests with 8VCPU, 8GB RAM; 1 used for kernbench (kernbench -f -H -M -o 20), the others for cpuhog (shell script: while true with an instruction), unpinned scenario. 1x: no hogs 2x: 8 hogs in one guest 3x: 8 hogs each in two guests BASE: 3.4-rc4 vanilla with CONFIG_PARAVIRT_SPINLOCK=n BASE+patch: 3.4-rc4 + debugfs + pv patches with CONFIG_PARAVIRT_SPINLOCK=y Machine: IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU (non-PLE) with 8 cores, 64GB RAM. (Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).
              BASE                 BASE+patch           %improvement
              mean (sd)            mean (sd)
case 1x:      66.0566 (74.0304)    61.3233 (68.8299)    7.16552
case 2x:      1253.2  (1795.74)    131.606 (137.358)    89.4984
case 3x:      3431.04 (5297.26)    134.964 (149.861)    96.0664
Will be working on further analysis with other benchmarks (pgbench/sysbench/ebizzy...) and further optimization.
Re: [PATCH RFC 0/5] apic: eoi optimization support
* Michael S. Tsirkin m...@redhat.com wrote: On Mon, May 07, 2012 at 12:35:12PM +0200, Ingo Molnar wrote: Michael S. Tsirkin (5): apic: fix typo EIO_ACK -> EOI_ACK and document apic: use symbolic APIC_EOI_ACK x86: add apic->eoi_write callback x86: eoi micro-optimization kvm_para: guest side for eoi avoidance arch/x86/include/asm/apic.h | 22 -- arch/x86/include/asm/apicdef.h | 2 +- arch/x86/include/asm/bitops.h | 6 ++- arch/x86/include/asm/kvm_para.h | 2 + arch/x86/kernel/apic/apic_flat_64.c | 2 + arch/x86/kernel/apic/apic_noop.c | 1 + arch/x86/kernel/apic/apic_numachip.c | 1 + arch/x86/kernel/apic/bigsmp_32.c | 1 + arch/x86/kernel/apic/es7000_32.c | 2 + arch/x86/kernel/apic/numaq_32.c | 1 + arch/x86/kernel/apic/probe_32.c | 1 + arch/x86/kernel/apic/summit_32.c | 1 + arch/x86/kernel/apic/x2apic_cluster.c | 1 + arch/x86/kernel/apic/x2apic_phys.c | 1 + arch/x86/kernel/apic/x2apic_uv_x.c | 1 + arch/x86/kernel/kvm.c | 51 ++-- arch/x86/platform/visws/visws_quirks.c | 2 +- 17 files changed, 88 insertions(+), 10 deletions(-) No objections from the x86 side. Is kvm.git a good tree to merge this through? Fine with me, but I haven't checked how widely it conflicts with existing bits: by the looks of it most of the linecount is on the core x86 side, while the kvm change is well concentrated. In terms of advantages, could you please create perf stat runs that count the number of MMIOs or so? That should show a pretty obvious improvement - and that is enough as proof, no need to try to reproduce the performance win in a noisy benchmark. You mean with kvm PV, right? On real hardware the micro-optimization removes branches and maybe cache-misses, but I don't see why it would reduce MMIOs. Yeah, on KVM. On real hw I doubt it's measurable. Thanks, Ingo
KVM call agenda for May, Tuesday 8th
Hi Please send in any agenda items you are interested in covering. Thanks, Juan.
Re: [PATCH RFC 0/5] apic: eoi optimization support
On 05/07/2012 02:40 PM, Ingo Molnar wrote: No objections from the x86 side. Is kvm.git a good tree to merge this through? Fine with me, but I haven't checked how widely it conflicts with existing bits: by the looks of it most of the linecount is on the core x86 side, while the kvm change is well concentrated. I don't see a problem with merging through tip.git - we're close to the next merge window, and the guest side rarely causes conflicts. But please don't apply the last patch yet, I want to review it more closely (esp. with the host side). -- error compiling committee.c: too many arguments to function
Re: [PATCH RFC 0/5] apic: eoi optimization support
* Avi Kivity a...@redhat.com wrote: On 05/07/2012 02:40 PM, Ingo Molnar wrote: No objections from the x86 side. Is kvm.git a good tree to merge this through? Fine with me, but I haven't checked how widely it conflicts with existing bits: by the looks of it most of the linecount is on the core x86 side, while the kvm change is well concentrated. I don't see a problem with merging through tip.git - we're close to the next merge window, and the guest side rarely causes conflicts. But please don't apply the last patch yet, I want to review it more closely (esp. with the host side). That last patch was marked don't apply yet, so I definitely planned on another iteration that incorporates all the feedback that has been given. Thanks, Ingo
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 01:58 PM, Raghavendra K T wrote: On 05/07/2012 02:02 PM, Avi Kivity wrote: On 05/07/2012 11:29 AM, Ingo Molnar wrote: This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree? No objections, instead an Acked-by: Avi Kivity a...@redhat.com Thank you. Here is a benchmark result with the patches. 3 guests with 8VCPU, 8GB RAM; 1 used for kernbench (kernbench -f -H -M -o 20), the others for cpuhog (shell script: while true with an instruction), unpinned scenario. 1x: no hogs 2x: 8 hogs in one guest 3x: 8 hogs each in two guests BASE: 3.4-rc4 vanilla with CONFIG_PARAVIRT_SPINLOCK=n BASE+patch: 3.4-rc4 + debugfs + pv patches with CONFIG_PARAVIRT_SPINLOCK=y Machine: IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU (non-PLE) with 8 cores, 64GB RAM. (Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).
              BASE                 BASE+patch           %improvement
              mean (sd)            mean (sd)
case 1x:      66.0566 (74.0304)    61.3233 (68.8299)    7.16552
case 2x:      1253.2  (1795.74)    131.606 (137.358)    89.4984
case 3x:      3431.04 (5297.26)    134.964 (149.861)    96.0664
You're calculating the improvement incorrectly. In the last case, it's not 96%, rather it's 2400% (25x). Similarly the second case is about 900% faster. -- error compiling committee.c: too many arguments to function
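Avi's correction, in numbers: a "% time reduction" and an "N times faster" speedup are different quantities, and for large wins the former saturates near 100% while the latter keeps growing. A quick sketch using the case 2x and 3x figures from the table above:

```python
cases = {
    "2x": (1253.2, 131.606),    # elapsed seconds: (BASE, BASE+patch)
    "3x": (3431.04, 134.964),
}

for name, (base, patched) in cases.items():
    reduction = (base - patched) / base * 100   # what the script reported
    speedup = base / patched                    # what "N times faster" means
    print(f"case {name}: {reduction:.1f}% less time = {speedup:.1f}x faster")
# case 2x: ~89.5% less time is a ~9.5x speedup (about 900% faster);
# case 3x: ~96.1% less time is a ~25x speedup (about 2400% faster).
```

This is why reporting the ratio (10x, 25x) conveys the magnitude far better than the percentage.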
Re: [Qemu-devel] KVM call agenda for May, Tuesday 8th
On 05/07/2012 06:47 AM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - Status of the 1.1 release Regards, Anthony Liguori Thanks, Juan.
Re: [Qemu-devel] KVM call agenda for May, Tuesday 8th
On 05/07/2012 06:47 AM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - QEMU documentation qemu-doc.texi is in a pretty awful state. I'm wondering if anyone has any ideas about how we can improve it. One thing we could do is move the entire contents of it to the wiki to allow for broader editing. I'd also be really happy to have a documentation submaintainer if anyone is interested in the role. Other ideas? Regards, Anthony Liguori Thanks, Juan.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 05:36 PM, Avi Kivity wrote: On 05/07/2012 01:58 PM, Raghavendra K T wrote: On 05/07/2012 02:02 PM, Avi Kivity wrote: On 05/07/2012 11:29 AM, Ingo Molnar wrote: This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree? No objections, instead an Acked-by: Avi Kivity a...@redhat.com [...] (Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).
              BASE                 BASE+patch           %improvement
              mean (sd)            mean (sd)
case 1x:      66.0566 (74.0304)    61.3233 (68.8299)    7.16552
case 2x:      1253.2  (1795.74)    131.606 (137.358)    89.4984
case 3x:      3431.04 (5297.26)    134.964 (149.861)    96.0664
You're calculating the improvement incorrectly. In the last case, it's not 96%, rather it's 2400% (25x). Similarly the second case is about 900% faster. You are right, my %improvement was intended to read like: 1) base takes 100 sec => patch takes 93 sec 2) base takes 100 sec => patch takes 11 sec 3) base takes 100 sec => patch takes 4 sec. The above is more confusing (and incorrect!). Better is what you told, which boils down to 10x and 25x improvement in case 2 and case 3. And IMO, this *really* gives the feeling of the magnitude of improvement with the patches. I'll change the script to report that way :).
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 04:20 PM, Raghavendra K T wrote: On 05/07/2012 05:36 PM, Avi Kivity wrote: On 05/07/2012 01:58 PM, Raghavendra K T wrote: On 05/07/2012 02:02 PM, Avi Kivity wrote: On 05/07/2012 11:29 AM, Ingo Molnar wrote: This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree? No objections, instead an Acked-by: Avi Kivity a...@redhat.com [...] (Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).
              BASE                 BASE+patch           %improvement
              mean (sd)            mean (sd)
case 1x:      66.0566 (74.0304)    61.3233 (68.8299)    7.16552
case 2x:      1253.2  (1795.74)    131.606 (137.358)    89.4984
case 3x:      3431.04 (5297.26)    134.964 (149.861)    96.0664
You're calculating the improvement incorrectly. In the last case, it's not 96%, rather it's 2400% (25x). Similarly the second case is about 900% faster. You are right, my %improvement was intended to read like: 1) base takes 100 sec => patch takes 93 sec 2) base takes 100 sec => patch takes 11 sec 3) base takes 100 sec => patch takes 4 sec. The above is more confusing (and incorrect!). Better is what you told, which boils down to 10x and 25x improvement in case 2 and case 3. And IMO, this *really* gives the feeling of the magnitude of improvement with the patches. I'll change the script to report that way :). btw, this is on non-PLE hardware, right? What are the numbers for PLE? -- error compiling committee.c: too many arguments to function
Re: Debug or diagnose tools for individual guest VM
On 05/07/2012 06:01 AM, Hailong Yang wrote: Dear all, I would like to know are there any appropriate tools for debugging or diagnosing individual guest VM performance. Similar to kvm_stat, but could distinguish information for each guest VM. You could do something like 'perf top -e kvm:\* -p pid' -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] KVM call agenda for May, Tuesday 8th
- QEMU documentation qemu-doc.texi is in a pretty awful state. I'm wondering if anyone has any ideas about how we can improve it. One thing we could do is move the entire contents of it to the wiki to allow for broader editing. What's qemu-tech.texi status? ;) I'd also be really happy to have a documentation submaintainer if anyone is interested in the role. Other ideas? IMHO, one of the problems is there are documents scattered out there, not just in one place. There are too many links on http://wiki.qemu.org/Manual. :/ If people can focus on one document, then it's easier to get it into good shape. Regards, chenwj -- Wei-Ren Chen (陳韋任) Computer Systems Lab, Institute of Information Science, Academia Sinica, Taiwan (R.O.C.) Tel:886-2-2788-3799 #1667 Homepage: http://people.cs.nctu.edu.tw/~chenwj
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 06:52 PM, Avi Kivity wrote: On 05/07/2012 04:20 PM, Raghavendra K T wrote: On 05/07/2012 05:36 PM, Avi Kivity wrote: On 05/07/2012 01:58 PM, Raghavendra K T wrote: On 05/07/2012 02:02 PM, Avi Kivity wrote: On 05/07/2012 11:29 AM, Ingo Molnar wrote: This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree? No objections, instead an Acked-by: Avi Kivity a...@redhat.com [...] (Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).
              BASE                 BASE+patch           %improvement
              mean (sd)            mean (sd)
case 1x:      66.0566 (74.0304)    61.3233 (68.8299)    7.16552
case 2x:      1253.2  (1795.74)    131.606 (137.358)    89.4984
case 3x:      3431.04 (5297.26)    134.964 (149.861)    96.0664
You're calculating the improvement incorrectly. In the last case, it's not 96%, rather it's 2400% (25x). Similarly the second case is about 900% faster. You are right, my %improvement was intended to read like: 1) base takes 100 sec => patch takes 93 sec 2) base takes 100 sec => patch takes 11 sec 3) base takes 100 sec => patch takes 4 sec. The above is more confusing (and incorrect!). Better is what you told, which boils down to 10x and 25x improvement in case 2 and case 3. And IMO, this *really* gives the feeling of the magnitude of improvement with the patches. I'll change the script to report that way :). btw, this is on non-PLE hardware, right? What are the numbers for PLE? Sure. I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
* Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]: I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version. Deferring preemption (when the vcpu is holding a lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. - vatsa
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote: * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]: I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version. Deferring preemption (when the vcpu is holding a lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. Is the improvement so low because PLE is interfering with the patch, or because PLE already does a good job? -- error compiling committee.c: too many arguments to function
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 07:19 PM, Avi Kivity wrote: On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote: * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]: I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version. Deferring preemption (when the vcpu is holding a lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. Is the improvement so low because PLE is interfering with the patch, or because PLE already does a good job? It is because PLE already does a good job (of not burning cpu). The 1-3% improvement is because the patchset knows at least who is next to hold the lock, which is lacking in PLE.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
* Avi Kivity a...@redhat.com [2012-05-07 16:49:25]: Deferring preemption (when the vcpu is holding a lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. Is the improvement so low because PLE is interfering with the patch, or because PLE already does a good job? I think it's the latter (PLE already doing a good job). - vatsa
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 07:16 PM, Srivatsa Vaddagiri wrote: * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]: I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version. Deferring preemption (when the vcpu is holding a lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. Yes, sure. I'll take this up, along with any further scalability improvements possible.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 04:53 PM, Raghavendra K T wrote: Is the improvement so low because PLE is interfering with the patch, or because PLE already does a good job? It is because PLE already does a good job (of not burning cpu). The 1-3% improvement is because the patchset knows at least who is next to hold the lock, which is lacking in PLE. Not good. Solving a problem in software that is already solved by hardware? It's okay if there are no costs involved, but here we're introducing a new ABI that we'll have to maintain for a long time. -- error compiling committee.c: too many arguments to function
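For context on "knows who is next to hold the lock": a ticket lock serves waiters in strict FIFO order, so on release the next holder is exactly identifiable, and a pv implementation can kick that one vcpu instead of letting PLE hardware guess which spinner to yield to. A minimal toy sketch (Python stand-in for the kernel's atomic ticket lock; the condition variable here stands in for the halt/kick hypercalls, and is an illustration, not the kernel's code):

```python
import threading

class TicketLock:
    """Toy ticket lock: FIFO, and the releaser knows exactly who is next."""
    def __init__(self):
        self._cv = threading.Condition()
        self._next_ticket = 0   # fetched-and-incremented atomically in real code
        self._now_serving = 0

    def acquire(self):
        with self._cv:
            ticket = self._next_ticket
            self._next_ticket += 1
            while self._now_serving != ticket:
                self._cv.wait()      # pv-spinlocks would halt the vcpu here

    def release(self):
        with self._cv:
            self._now_serving += 1
            # The next holder is precisely ticket == _now_serving; a pv
            # implementation can kick that one vcpu rather than all waiters.
            self._cv.notify_all()

# Quick check: concurrent increments stay consistent under the lock.
lock, counter = TicketLock(), 0

def worker():
    global counter
    for _ in range(1000):
        lock.acquire()
        counter += 1
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000
```

PLE only detects that some vcpu is spinning and yields it; the ticket structure above is what lets the paravirt patches target the successor directly, which is where the extra 1-3% comes from.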
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/06/2012 09:39 AM, Avi Kivity wrote: On 05/06/2012 05:35 PM, Anthony Liguori wrote: On 05/06/2012 08:11 AM, Avi Kivity wrote: libvirt is essentially the BMC for a virtual guest. I would suggest looking at implementing an IPMI interface to libvirt and exposing it to the guest through a USB RNDIS device. That's the first option. One unanswered question is what to do when the guest is down? Someone should listen for IPMI events, but we can't make it libvirt unconditionally, since many instances of libvirt are active at any one time. Note the IPMI external interface needs to be migrated, like any other. For all intents and purposes, the BMC/RSA is a separate physical machine. If you really wanted to model it, you would launch two instances of QEMU. The BMC instance would have a virtual NIC and would share a USB bus with the slave QEMU instance (probably via USBoIP). The USB bus is how the BMC exposes IPMI to the guest (via a USB rndis adapter), remote media, etc. I believe some BMC's also expose IPMI over i2c but that's pretty low bandwidth. At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. It really boils down to what you are trying to do. If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. I don't think there's a tremendous amount of value in QEMU making itself look like an IBM IMM or whatever HP/Dell's equivalent is. As I said, these stacks are hugely complicated and there are better ways of doing out of band management (like talk to libvirt directly). So what's really the use case here? Would an IPMI - libvirt bridge get you what you need? I really think that's the best path forward. 
Regards, Anthony Liguori
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/07/2012 05:30 PM, Anthony Liguori wrote: On 05/06/2012 09:39 AM, Avi Kivity wrote: On 05/06/2012 05:35 PM, Anthony Liguori wrote: On 05/06/2012 08:11 AM, Avi Kivity wrote: libvirt is essentially the BMC for a virtual guest. I would suggest looking at implementing an IPMI interface to libvirt and exposing it to the guest through a USB RNDIS device. That's the first option. One unanswered question is what to do when the guest is down? Someone should listen for IPMI events, but we can't make it libvirt unconditionally, since many instances of libvirt are active at any one time. Note the IPMI external interface needs to be migrated, like any other. For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. If you really wanted to model it, you would launch two instances of QEMU. The BMC instance would have a virtual NIC and would share a USB bus with the slave QEMU instance (probably via USBoIP). The USB bus is how the BMC exposes IPMI to the guest (via a USB rndis adapter), remote media, etc. I believe some BMC's also expose IPMI over i2c but that's pretty low bandwidth. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? It really boils down to what you are trying to do. If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. That would be most strange. 
A remote client connecting to the IPMI interface would control the power level of the host, not the guest. I don't think there's a tremendous amount of value in QEMU making itself look like an IBM IMM or whatever HP/Dell's equivalent is. As I said, these stacks are hugely complicated and there are better ways of doing out of band management (like talk to libvirt directly). I have to agree here. -- error compiling committee.c: too many arguments to function
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 07:28 PM, Avi Kivity wrote: On 05/07/2012 04:53 PM, Raghavendra K T wrote: Is the improvement so low because PLE is interfering with the patch, or because PLE already does a good job? It is because PLE already does a good job (of not burning cpu). The 1-3% improvement is because the patchset knows at least who is next to hold the lock, which is lacking in PLE. Not good. Solving a problem in software that is already solved by hardware? It's okay if there are no costs involved, but here we're introducing a new ABI that we'll have to maintain for a long time. Hmm, agree that being a step ahead of mighty hardware (and just an improvement of 1-3%) is no good for the long term (where PLE is the future). Having said that, it is hard for me to resist saying: the bottleneck is somewhere else on PLE machines, and IMHO the answer would be a combination of paravirt-spinlock + pv-flush-tlb. But I need to come up with good numbers to argue in favour of the claim. PS: Nikunj had experimented that pv-flush-tlb + paravirt-spinlock is a win on PLE where neither of them alone could prove the benefit.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 05:52 PM, Avi Kivity wrote: Having said that, it is hard for me to resist saying: the bottleneck is somewhere else on PLE machines, and IMHO the answer would be a combination of paravirt-spinlock + pv-flush-tlb. But I need to come up with good numbers to argue in favour of the claim. PS: Nikunj had experimented that pv-flush-tlb + paravirt-spinlock is a win on PLE where neither of them alone could prove the benefit. I'd like to see those numbers, then. Note: it's probably best to try very wide guests, where the overhead of iterating on all vcpus begins to show. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/07/2012 09:44 AM, Avi Kivity wrote: On 05/07/2012 05:30 PM, Anthony Liguori wrote: On 05/06/2012 09:39 AM, Avi Kivity wrote: On 05/06/2012 05:35 PM, Anthony Liguori wrote: On 05/06/2012 08:11 AM, Avi Kivity wrote: libvirt is essentially the BMC for a virtual guest. I would suggest looking at implementing an IPMI interface to libvirt and exposing it to the guest through a USB RNDIS device. That's the first option. One unanswered question is what to do when the guest is down? Someone should listen for IPMI events, but we can't make it libvirt unconditionally, since many instances of libvirt are active at any one time. Note the IPMI external interface needs to be migrated, like any other. For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. It has a separate power source for all intents and purposes. If you think of it in QOM terms, what connects the nodes together ultimately is the Vcc pin that travels across all devices. The RTC is a little special because it has a battery backed CMOS/clock but it's also handled specially. The BMC does not share Vcc. It's no different than a separate physical box. It just shares a couple buses. If you really wanted to model it, you would launch two instances of QEMU. The BMC instance would have a virtual NIC and would share a USB bus with the slave QEMU instance (probably via USBoIP). The USB bus is how the BMC exposes IPMI to the guest (via a USB rndis adapter), remote media, etc. I believe some BMC's also expose IPMI over i2c but that's pretty low bandwidth. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). The power still comes from the PCI bus. Think of something like a blade center. 
Each individual blade does not have its own BMC. There's a single common BMC that provides an IPMI interface for all blades. Yet each blade still sees an IPMI interface via a USB rndis device. You can rip out the memory, PCI devices, etc. from a box while the power is on and the BMC will be unaffected. At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? The BMC is the master, system being controlled is the slave. It really boils down to what you are trying to do. If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. That would be most strange. A remote client connecting to the IPMI interface would control the power level of the host, not the guest. IPMI with a custom backend is what I mean. That's what I mean by an IPMI - libvirt bridge. You could build a libvirt client that exposes an IPMI interface and when you issue IPMI commands, it translates them into libvirt operations. This can run as a normal process on the host and then network it to the guest via an emulated USB rndis device. Existing software on the guest shouldn't be able to tell the difference as long as it doesn't try to use I2C to talk to the BMC. I don't think there's a tremendous amount of value in QEMU making itself look like an IBM IMM or whatever HP/Dell's equivalent is. As I said, these stacks are hugely complicated and there are better ways of doing out of band management (like talk to libvirt directly). I have to agree here. Regards, Anthony Liguori
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 05:47 PM, Raghavendra K T wrote: Not good. Solving a problem in software that is already solved by hardware? It's okay if there are no costs involved, but here we're introducing a new ABI that we'll have to maintain for a long time. Hmm agree that being a step ahead of mighty hardware (and just an improvement of 1-3%) is no good for long term (where PLE is future). PLE is the present, not the future. It was introduced on later Nehalems and is present on all Westmeres. Two more processor generations have passed meanwhile. The AMD equivalent was also introduced around that timeframe. Having said that, it is hard for me to resist saying: the bottleneck is somewhere else on PLE machines and IMHO the answer would be a combination of paravirt-spinlock + pv-flush-tlb. But I need to come up with good numbers to argue in favour of the claim. PS: Nikunj had experimented that pv-flush-tlb + paravirt-spinlock is a win on PLE where only one of them alone could not prove the benefit. I'd like to see those numbers, then. Ingo, please hold on the kvm-specific patches, meanwhile.
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/07/2012 05:55 PM, Anthony Liguori wrote: For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. It has a separate power source for all intents and purposes. If you think of it in QOM terms, what connects the nodes together ultimately is the Vcc pin that travels across all devices. The RTC is a little special because it has a battery backed CMOS/clock but it's also handled specially. And we fail to emulate it correctly as well, wrt. alarms. The BMC does not share Vcc. It's no different than a separate physical box. It just shares a couple buses. It controls the main power plane, reset line, can read VGA and emulate keyboard, seems pretty well integrated. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). The power still comes from the PCI bus. Maybe. But it's on when the rest of the machine is off. So Vcc is not shared. Think of something like a blade center. Each individual blade does not have its own BMC. There's a single common BMC that provides an IPMI interface for all blades. Yet each blade still sees an IPMI interface via a USB rndis device. You can rip out the memory, PCI devices, etc. from a box while the power is on and the BMC will be unaffected. At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? The BMC is the master, system being controlled is the slave. Ah okay. It also has to read the VGA output (say via vnc) and supply keyboard input (via sendkey). It really boils down to what you are trying to do.
If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. That would be most strange. A remote client connecting to the IPMI interface would control the power level of the host, not the guest. IPMI with a custom backend is what I mean. That's what I mean by an IPMI - libvirt bridge. You could build a libvirt client that exposes an IPMI interface and when you issue IPMI commands, it translates them into libvirt operations. This can run as a normal process on the host and then network it to the guest via an emulated USB rndis device. Existing software on the guest shouldn't be able to tell the difference as long as it doesn't try to use I2C to talk to the BMC. I still like the single process solution, it is more in line with the rest of qemu and handles live migration better. But even better would be not to do this at all, and satisfy the remote management requirements using the existing tools.
Re: Adding an IPMI BMC device to KVM
Then should we also emulate one AMM virtual device? one fsp virtual device? one IVE virtual device? On Sat, May 5, 2012 at 3:10 AM, Corey Minyard tcminy...@gmail.com wrote: I've been working on adding an IPMI BMC as a virtual device under KVM.  I'm doing this for two primary reasons, one to have a better test environment than what I have now for testing IPMI issues, and second to be able to better simulate a legacy environment for customers porting legacy software. For those that don't know, IPMI is a system management interface.  Generally systems with IPMI have a small microcontroller, called a BMC, that is always on when the board is powered.  The BMC is capable of controlling power and reset on the board, and it is hooked to sensors on the board (voltage, current, temperature, the presence of things like DIMMS and power supplies, intrusion detection, and a host of other things).  The main processor on a system can communicate with the BMC over a device.  Often these systems also have a LAN interface that lets you control the system remotely even when it's off. In addition, IPMI provides access to FRU (Field Replaceable Unit) data that describes the components of the system that can be replaced.  It also has data records that describe the sensor, so it is possible to directly interpret the sensor data and know what the sensor is measuring without outside data. I've been struggling a bit with how to implement this.  There is a lot of configuration information, and you need ways to simulate the sensors.  This type of interface is a little sensitive, since it has direct access to the reset and power control of a system. I was at first considering having the BMC be an external program that KVM connected to over a chardev, with possibly a simulated LAN interface.  This has the advantage that the BMC can run even when KVM is down.  It could even start up KVM for a power up, though I'm not sure how valuable that would be. 
Plus it could be used for other virtual machines.  However, that means there is an interface to KVM over a chardev that could do nasty things, and even be a possible intrusion point.  It also means there is a separate program to maintain. You could also include the BMC inside of KVM and run it as a separate thread. That way there doesn't have to be an insecure interface.  But the BMC will need a lot of configuration data and this will add a bunch of code to KVM that's only tangentially related to it.  And you would still need a way to simulate setting sensors and such for testing things. Either way, is this interesting for including into KVM?  Does anyone have any opinions on the possible ways to implement this? Thanks, -corey -- Regards, Zhi Yong Wu
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/07/2012 10:11 AM, Avi Kivity wrote: On 05/07/2012 05:55 PM, Anthony Liguori wrote: For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. It has a separate power source for all intents and purposes. If you think of it in QOM terms, what connects the nodes together ultimately is the Vcc pin that travels across all devices. The RTC is a little special because it has a battery backed CMOS/clock but it's also handled specially. And we fail to emulate it correctly as well, wrt. alarms. The BMC does not share Vcc. It's no different than a separate physical box. It just shares a couple buses. It controls the main power plane, reset line, can read VGA and emulate keyboard, seems pretty well integrated. Emulating the keyboard is done through USB. How the VGA thing works is very vendor dependent. The VGA snooping can happen as part of the display path (essentially connected via a VGA cable) or it can be a side-band using a special graphics adapter. I think QEMU VNC emulation is a pretty good analogy actually. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). The power still comes from the PCI bus. Maybe. But it's on when the rest of the machine is off. So Vcc is not shared. That's all plumbed through the PCI bus FWIW. Think of something like a blade center. Each individual blade does not have its own BMC. There's a single common BMC that provides an IPMI interface for all blades. Yet each blade still sees an IPMI interface via a USB rndis device. You can rip out the memory, PCI devices, etc. from a box while the power is on and the BMC will be unaffected.
At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? The BMC is the master, system being controlled is the slave. Ah okay. It also has to read the VGA output (say via vnc) and supply keyboard input (via sendkey). Right, QMP + VNC is a pretty accurate analogy. It really boils down to what you are trying to do. If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. That would be most strange. A remote client connecting to the IPMI interface would control the power level of the host, not the guest. IPMI with a custom backend is what I mean. That's what I mean by an IPMI - libvirt bridge. You could build a libvirt client that exposes an IPMI interface and when you issue IPMI commands, it translates them into libvirt operations. This can run as a normal process on the host and then network it to the guest via an emulated USB rndis device. Existing software on the guest shouldn't be able to tell the difference as long as it doesn't try to use I2C to talk to the BMC. I still like the single process solution, it is more in line with the rest of qemu and handles live migration better. Two QEMU processes could be migrated in unison if you really wanted to support that... With qemu-system-mips/sh4 you could probably even run the real BMC software stack if you were so inclined :-) But even better would be not to do this at all, and satisfy the remote management requirements using the existing tools. Right. Regards, Anthony Liguori
RE: [PATCH] kvm: Enable device LTR/OBFF capability before doing guest device assignment
On Mon, 2012-05-07 at 07:58 +, Hao, Xudong wrote: -Original Message- From: Avi Kivity [mailto:a...@redhat.com] Sent: Sunday, May 06, 2012 11:34 PM To: Xudong Hao Cc: mtosa...@redhat.com; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; Zhang, Xiantao; Hao, Xudong; Alex Williamson Subject: Re: [PATCH] kvm: Enable device LTR/OBFF capability before doing guest device assignment On 05/06/2012 06:24 PM, Xudong Hao wrote: Enable device LTR/OBFF capability before doing device assignment, so that the guest can benefit from them. cc += Alex @@ -166,6 +166,10 @@ int kvm_assign_device(struct kvm *kvm, if (pdev == NULL) return -ENODEV; + /* Enable some device capabilities before doing device assignment, +* so that the guest can benefit from them. +*/ + kvm_iommu_enable_dev_caps(pdev); r = iommu_attach_device(domain, pdev->dev); Suppose we fail here. Do we need to disable_dev_caps()? If kvm_assign_device() fails we'll try to restore the state we saved in kvm_vm_ioctl_assign_device(), so ltr/obff should be brought back to their initial state. I don't think so. When a device is assigned to a guest, it's owned by a pci-stub driver; an attach_device failure here does not affect anything. If the host wants to use it, the host device driver has its own enable/disable dev_caps. Why is device assignment unique here? If there's a default value that's known to be safe, why doesn't pci_enable_device set it for everyone? Host drivers can fine tune the value later if they want.
if (r) { printk(KERN_ERR "assign device %x:%x:%x.%x failed", @@ -228,6 +232,7 @@ int kvm_deassign_device(struct kvm *kvm, PCI_SLOT(assigned_dev->host_devfn), PCI_FUNC(assigned_dev->host_devfn)); + kvm_iommu_disable_dev_caps(pdev); return 0; } @@ -351,3 +356,30 @@ int kvm_iommu_unmap_guest(struct kvm *kvm) iommu_domain_free(domain); return 0; } + +static void kvm_iommu_enable_dev_caps(struct pci_dev *pdev) +{ + /* set default value */ + unsigned long type = PCI_EXP_OBFF_SIGNAL_ALWAYS; + int snoop_lat_ns = 1024, nosnoop_lat_ns = 1024; Where does this magic number come from? The number is the max value that the register supports; it is set as the default here because we do not have any device here and we do not know what the proper value is, so it sets a default value first. The register is composed of latency scale and latency value fields. 1024 is simply the largest value the latency value can hold (+1). The scale field allows latencies up to 34,326,183,936ns to be specified, so please explain how 1024 is a universal default. + + /* LTR (Latency tolerance reporting) allows devices to send +* messages to the root complex indicating their latency +* tolerance for snooped/unsnooped memory transactions. +*/ + pci_enable_ltr(pdev); + pci_set_ltr(pdev, snoop_lat_ns, nosnoop_lat_ns); + + /* OBFF (optimized buffer flush/fill), where supported, +* can help improve energy efficiency by giving devices +* information about when interrupts and other activity +* will have a reduced power impact. +*/ + pci_enable_obff(pdev, type); +} + +static void kvm_iommu_disable_dev_caps(struct pci_dev *pdev) +{ + pci_disable_obff(pdev); + pci_disable_ltr(pdev); +} Do we need to communicate something about these capabilities to the guest? I guess you mean that the host doesn't know whether the guest wants to enable them, right? The new LTR/OBFF features are supposed to be enabled by the guest if the platform and device support them.
It looks like ltr is a two part mechanism; the capability and enable live in the pci express capability, but the tuning registers live in extended capability space. The guest doesn't yet have access to the latter since we don't have an express chipset. The capability and enable are read-only to the guest currently, same for obff. Thanks, Alex
Re: [PATCH] PCI: save/restore max Latency Value for device LTR
On Sun, May 6, 2012 at 8:11 AM, Xudong Hao xudong@linux.intel.com wrote: LTR: Save Max snoop/no-snoop Latency Value in pci_save_pcie_state, and restore them in pci_restore_pcie_state. Signed-off-by: Xudong Hao xudong@intel.com --- drivers/pci/pci.c | 12  1 files changed, 12 insertions(+), 0 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 111569c..c8573c3 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -875,6 +875,12 @@ static int pci_save_pcie_state(struct pci_dev *dev) pci_read_config_word(dev, pos + PCI_EXP_LNKCTL2, &cap[i++]); if (pcie_cap_has_sltctl2(dev->pcie_type, flags)) pci_read_config_word(dev, pos + PCI_EXP_SLTCTL2, &cap[i++]); + if (pci_ltr_supported(dev)) { + pci_read_config_word(dev, pos + PCI_LTR_MAX_SNOOP_LAT, + &cap[i++]); + pci_read_config_word(dev, pos + PCI_LTR_MAX_NOSNOOP_LAT, + &cap[i++]); + } return 0; } @@ -908,6 +914,12 @@ static void pci_restore_pcie_state(struct pci_dev *dev) pci_write_config_word(dev, pos + PCI_EXP_LNKCTL2, cap[i++]); if (pcie_cap_has_sltctl2(dev->pcie_type, flags)) pci_write_config_word(dev, pos + PCI_EXP_SLTCTL2, cap[i++]); + if (pci_ltr_supported(dev)) { + pci_write_config_word(dev, pos + PCI_LTR_MAX_SNOOP_LAT, + cap[i++]); + pci_write_config_word(dev, pos + PCI_LTR_MAX_NOSNOOP_LAT, + cap[i++]); + } } This doesn't make any sense to me. pos is the offset of the PCI Express Capability (identifier 10h). LTR is a separate extended capability (identifier 18h), so you at least have to look up its offset. Bjorn
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
* Avi Kivity a...@redhat.com wrote: PS: Nikunj had experimented that pv-flush-tlb + paravirt-spinlock is a win on PLE where only one of them alone could not prove the benefit. I'd like to see those numbers, then. Ingo, please hold on the kvm-specific patches, meanwhile. I'll hold off on the whole thing - frankly, we don't want this kind of Xen-only complexity. If KVM can make use of PLE then Xen ought to be able to do it as well. If both Xen and KVM make good use of it then that's a different matter. Thanks, Ingo
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
I think we are getting a little out of hand here, and we are mixing up concepts :). There are lots of things IPMI *can* do (including serial access, VGA snooping, LAN access, etc.) but I don't see any value in that. The main thing here is to emulate the interface to the guest. OOB management is really more appropriately handled with libvirt. How the BMC integrates into the hardware varies a *lot* between systems, but it's really kind of irrelevant. (Well, almost irrelevant, BMCs can provide a direct I2C messaging capability, and that may matter.) A guest can have one (or more) of a number of interfaces (that are all fairly bad, unfortunately). The standard ones are called KCS, BT and SMIC and they generally are directly on the ISA bus, but are in memory on non-x86 boxes (and on some x86 boxes) and sometimes on the PCI bus. Some systems also have interfaces over I2C, but that hasn't really caught on. Others have interfaces over serial ports, and that unfortunately has caught on in the ATCA world. And there are at least 3 different basic types of serial port interfaces with sub-variants :(. I'm not sure what the USB rndis device is, but I'll look at it. But there is no IPMI over USB. The big things a guest can do are sensor management, watchdog timer, reset, and power control. In complicated IPMI-based systems like ATCA, a guest may want to send messages through its local IPMI controller to other guests' IPMI controllers or to a central BMC that runs an entire chassis of systems. So that may need to be supported, depending on what people want to do and how hard they want to work on it. My proposal is to start small, with just a local interface, watchdog timer, sensors and power control. But have an architecture that would allow external LAN access, tying BMCs in different qemu instances together, perhaps serial over IPMI, and other things of that nature.
-corey On 05/07/2012 10:21 AM, Anthony Liguori wrote: On 05/07/2012 10:11 AM, Avi Kivity wrote: On 05/07/2012 05:55 PM, Anthony Liguori wrote: For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. It has a separate power source for all intents and purposes. If you think of it in QOM terms, what connects the nodes together ultimately is the Vcc pin that travels across all devices. The RTC is a little special because it has a battery backed CMOS/clock but it's also handled specially. And we fail to emulate it correctly as well, wrt. alarms. The BMC does not share Vcc. It's no different than a separate physical box. It just shares a couple buses. It controls the main power plane, reset line, can read VGA and emulate keyboard, seems pretty well integrated. Emulating the keyboard is done through USB. How the VGA thing works is very vendor dependent. The VGA snooping can happen as part of the display path (essentially connected via a VGA cable) or it can be a side-band using a special graphics adapter. I think QEMU VNC emulation is a pretty good analogy actually. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). The power still comes from the PCI bus. Maybe. But it's on when the rest of the machine is off. So Vcc is not shared. That's all plumbed through the PCI bus FWIW. Think of something like a blade center. Each individual blade does not have its own BMC. There's a single common BMC that provides an IPMI interface for all blades. Yet each blade still sees an IPMI interface via a USB rndis device. You can rip out the memory, PCI devices, etc. from a box while the power is on and the BMC will be unaffected.
At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? The BMC is the master, system being controlled is the slave. Ah okay. It also has to read the VGA output (say via vnc) and supply keyboard input (via sendkey). Right, QMP + VNC is a pretty accurate analogy. It really boils down to what you are trying to do. If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. That would be most strange. A remote client connecting to the IPMI interface would control the power level of the host, not the guest. IPMI with a custom backend is what I mean. That's what I mean by an IPMI - libvirt bridge. You could build a libvirt client that exposes an IPMI interface and when you issue IPMI commands, it translate it to
Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)
Andre? Are you able to help to answer the question below? I would like to clarify what's the expected behavior of -cpu host to be able to continue working on it. I believe the code will need to be fixed on either case, but first we need to figure out what are the expectations/requirements, to know _which_ changes will be needed. On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote: (CCing Andre Przywara, in case he can help to clarify what's the expected meaning of -cpu host) [...] I am not sure I understand what you are proposing. Let me explain the use case I am thinking about: - Feature FOO is of type (A) (e.g. just a new instruction set that doesn't require additional userspace support) - User has a Qemu version that doesn't know anything about feature FOO - User gets a new CPU that supports feature FOO - User gets a new kernel that supports feature FOO (i.e. has FOO in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User expects to get feature FOO enabled if using -cpu host, without upgrading Qemu. The problem here is: to support the above use-case, userspace needs a probing mechanism that can differentiate _new_ (previously unknown) features that are in group (A) (safe to blindly enable) from features that are in group (B) (that can't be enabled without an userspace upgrade). In short, it becomes a problem if we consider the following case: - Feature BAR is of type (B) (it can't be enabled without extra userspace support) - User has a Qemu version that doesn't know anything about feature BAR - User gets a new CPU that supports feature BAR - User gets a new kernel that supports feature BAR (i.e. has BAR in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User simply shouldn't get feature BAR enabled, even if using -cpu host, otherwise Qemu would break. If userspace always limited itself to features it knows about, it would be really easy to implement the feature without any new probing mechanism from the kernel.
But that's not how I think users expect -cpu host to work. Maybe I am wrong, I don't know. I am CCing Andre, who introduced the -cpu host feature, in case he can explain what's the expected semantics on the cases above. -- Eduardo
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
FWIW, the idea of an IPMI interface to VMs was proposed for libvirt not too long ago. See: https://bugzilla.redhat.com/show_bug.cgi?id=815136 Dave On Mon, May 07, 2012 at 01:07:45PM -0500, Corey Minyard wrote: I think we are getting a little out of hand here, and we are mixing up concepts :). There are lots of things IPMI *can* do (including serial access, VGA snooping, LAN access, etc.) but I don't see any value in that. The main thing here is to emulate the interface to the guest. OOB management is really more appropriately handled with libvirt. How the BMC integrates into the hardware varies a *lot* between systems, but it's really kind of irrelevant. (Well, almost irrelevant, BMCs can provide a direct I2C messaging capability, and that may matter.) A guest can have one (or more) of a number of interfaces (that are all fairly bad, unfortunately). The standard ones are called KCS, BT and SMIC and they generally are directly on the ISA bus, but are in memory on non-x86 boxes (and on some x86 boxes) and sometimes on the PCI bus. Some systems also have interfaces over I2C, but that hasn't really caught on. Others have interfaces over serial ports, and that unfortunately has caught on in the ATCA world. And there are at least 3 different basic types of serial port interfaces with sub-variants :(. I'm not sure what the USB rndis device is, but I'll look at it. But there is no IPMI over USB. The big things a guest can do are sensor management, watchdog timer, reset, and power control. In complicated IPMI-based systems like ATCA, a guest may want to send messages through its local IPMI controller to other guests' IPMI controllers or to a central BMC that runs an entire chassis of systems. So that may need to be supported, depending on what people want to do and how hard they want to work on it. My proposal is to start small, with just a local interface, watchdog timer, sensors and power control.
But have an architecture that would allow external LAN access, tying BMCs in different qemu instances together, perhaps serial over IPMI, and other things of that nature. -corey On 05/07/2012 10:21 AM, Anthony Liguori wrote: On 05/07/2012 10:11 AM, Avi Kivity wrote: On 05/07/2012 05:55 PM, Anthony Liguori wrote: For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. It has a separate power source for all intents and purposes. If you think of it in QOM terms, what connects the nodes together ultimately is the Vcc pin that travels across all devices. The RTC is a little special because it has a battery backed CMOS/clock but it's also handled specially. And we fail to emulate it correctly as well, wrt. alarms. The BMC does not share Vcc. It's no different than a separate physical box. It just shares a couple buses. It controls the main power place, reset line, can read VGA and emulate keyboard, seems pretty well integrated. Emulating the keyboard is done through USB. How the VGA thing works is very vendor dependent. The VGA snooping can happen as part of the display path (essentially connected via a VGA cable) or it can be a side-band using a special graphics adapter. I think QEMU VNC emulation is a pretty good analogy actually. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). The power still comes from the PCI bus. Maybe. But it's on when the rest of the machine is off. So Vcc is not shared. That's all plumbed through the PCI bus FWIW. Think of something like a blade center. Each individual blade does not have it's own BMC. There's a single common BMC that provides an IPMI interface for all blades. 
Yet each blade still sees an IPMI interface via a USB rndis device. You can rip out the memory, PCI devices, etc. from a box while the power is on and the BMC will be unaffected. At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? The BMC is the master, system being controlled is the slave. Ah okay. It also has to read the VGA output (say via vnc) and supply keyboard input (via sendkey). Right, QMP + VNC is a pretty accurate analogy. It really boils down to what you are trying to do. If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. That would be most strange. A remote client connecting to the IPMI interface
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On Mon, 7 May 2012, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: PS: Nikunj had experimented that pv-flush tlb + paravirt-spinlock is a win on PLE where only one of them alone could not prove the benefit. I'd like to see those numbers, then. Ingo, please hold on the kvm-specific patches, meanwhile. I'll hold off on the whole thing - frankly, we don't want this kind of Xen-only complexity. If KVM can make use of PLE then Xen ought to be able to do it as well. If both Xen and KVM make good use of it then that's a different matter. Aside of that, it's kinda strange that a dude named Nikunj is referenced in the argument chain, but I can't find him on the CC list. Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/07/2012 02:45 PM, Dave Allan wrote: FWIW, the idea of an IPMI interface to VMs was proposed for libvirt not too long ago. See: https://bugzilla.redhat.com/show_bug.cgi?id=815136 Well, it wouldn't be too hard to do. I already have working emulation code that does the IPMI LAN interface (including the IPMI 2.0 stuff for more reasonable security). I have a KCS interface and a minimal IPMI controller working in KVM, though I'm not quite sure the best final way to hook it in. Configuration is going to be the hardest part, but a minimal configuration for providing basic management would be easy. -corey Dave On Mon, May 07, 2012 at 01:07:45PM -0500, Corey Minyard wrote: I think we are getting a little out of hand here, and we are mixing up concepts :). There are lots of things IPMI *can* do (including serial access, VGA snooping, LAN access, etc.) but I don't see any value in that. The main thing here is to emulate the interface to the guest. OOB management is really more appropriately handled with libvirt. How the BMC integrates into the hardware varies a *lot* between systems, but it's really kind of irrelevant. (Well, almost irrelevant, BMCs can provide a direct I2C messaging capability, and that may matter.) A guest can have one (or more) of a number of interfaces (that are all fairly bad, unfortunately). The standard ones are called KCS, BT and SMIC and they generally are directly on the ISA bus, but are in memory on non-x86 boxes (and on some x86 boxes) and sometimes on the PCI bus. Some systems also have interfaces over I2C, but that hasn't really caught on. Others have interfaces over serial ports, and that unfortunately has caught on in the ATCA world. And there are at least 3 different basic types of serial port interfaces with sub-variants :(. I'm not sure what the USB rndis device is, but I'll look at it. But there is no IPMI over USB. The big things a guest can do are sensor management, watchdog timer, reset, and power control. 
In complicated IPMI-based systems like ATCA, a guest may want to send messages through its local IPMI controller to other guests' IPMI controllers or to a central BMC that runs an entire chassis of systems. So that may need to be supported, depending on what people want to do and how hard they want to work on it. My proposal is to start small, with just a local interface, watchdog timer, sensors and power control. But have an architecture that would allow external LAN access, tying BMCs in different qemu instances together, perhaps serial over IPMI, and other things of that nature. -corey On 05/07/2012 10:21 AM, Anthony Liguori wrote: On 05/07/2012 10:11 AM, Avi Kivity wrote: On 05/07/2012 05:55 PM, Anthony Liguori wrote: For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. It has a separate power source for all intents and purposes. If you think of it in QOM terms, what connects the nodes together ultimately is the Vcc pin that travels across all devices. The RTC is a little special because it has a battery backed CMOS/clock but it's also handled specially. And we fail to emulate it correctly as well, wrt. alarms. The BMC does not share Vcc. It's no different than a separate physical box. It just shares a couple buses. It controls the main power plane, reset line, can read VGA and emulate keyboard, seems pretty well integrated. Emulating the keyboard is done through USB. How the VGA thing works is very vendor dependent. The VGA snooping can happen as part of the display path (essentially connected via a VGA cable) or it can be a side-band using a special graphics adapter. I think QEMU VNC emulation is a pretty good analogy actually. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. 
We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). The power still comes from the PCI bus. Maybe. But it's on when the rest of the machine is off. So Vcc is not shared. That's all plumbed through the PCI bus FWIW. Think of something like a blade center. Each individual blade does not have its own BMC. There's a single common BMC that provides an IPMI interface for all blades. Yet each blade still sees an IPMI interface via a USB rndis device. You can rip out the memory, PCI devices, etc. from a box while the power is on and the BMC will be unaffected. At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? The BMC is the master, system being controlled is the slave. Ah okay. It also has to read the VGA output (say via vnc) and supply keyboard input (via
[PULL net-next] macvtap, vhost and virtio tools updates
There are mostly bugfixes here. I hope to merge some more patches by 3.5; in particular, vlan support fixes are waiting for Eric's ack, and a version of the tracepoint patch might be ready in time, but let's merge what's ready so it's testable. This includes a ton of zerocopy fixes by Jason - good stuff but too intrusive for 3.4, and zerocopy is experimental anyway. virtio has supported delayed interrupts for a while now, so adding support to the virtio tool made sense. Please pull into net-next and merge for 3.5. Thanks! MST

The following changes since commit e4ae004b84b315dd4b762e474f97403eac70f76a:

  netem: add ECN capability (2012-05-01 09:39:48 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next

for you to fetch changes up to c70aa540c7a9f67add11ad3161096fb95233aa2e:

  vhost: zerocopy: poll vq in zerocopy callback (2012-05-02 18:22:25 +0300)

Jason Wang (9):
  macvtap: zerocopy: fix offset calculation when building skb
  macvtap: zerocopy: fix truesize underestimation
  macvtap: zerocopy: put page when fail to get all requested user pages
  macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully
  macvtap: zerocopy: validate vectors before building skb
  vhost_net: zerocopy: fix possible NULL pointer dereference of vq->bufs
  vhost_net: re-poll only on EAGAIN or ENOBUFS
  vhost_net: zerocopy: adding and signalling immediately when fully copied
  vhost: zerocopy: poll vq in zerocopy callback

Michael S. Tsirkin (1):
  virtio/tools: add delayed interupt mode

 drivers/net/macvtap.c       | 57 ++-
 drivers/vhost/net.c         |  7 -
 drivers/vhost/vhost.c       |  1 +
 tools/virtio/linux/virtio.h |  1 +
 tools/virtio/virtio_test.c  | 26 ---
 5 files changed, 69 insertions(+), 23 deletions(-)
Re: [PATCH 1/2] vhost: basic tracepoints
On Tue, Apr 10, 2012 at 10:58:19AM +0800, Jason Wang wrote: To help with performance optimizations and debugging, this patch adds tracepoints for vhost. Note that the tracepoints are only for vhost; the net code is not touched. Two kinds of activities are traced: virtio and vhost work. Signed-off-by: Jason Wang jasow...@redhat.com Thanks for looking into this. Some questions: Do we need to prefix traces with vhost_virtio_? How about a trace for enabling/disabling interrupts? Trace for a suppressed interrupt? I think we need a vq # not a pointer. Also need some id for when there are many guests. Use the vhost thread name (includes owner pid)? Its pid? Both? Also, traces do add very small overhead when compiled but not enabled, mainly due to increasing register pressure. Need to test to make sure perf is not hurt. Some traces are just for debugging, so build them on debug kernels only? Further, there are many events; some are rare, some are common. I think we need some naming scheme so that the really useful and low overhead stuff is easier to enable, ignoring the rarely useful/higher overhead traces.
---
 drivers/vhost/trace.h | 153 +
 drivers/vhost/vhost.c | 17 +
 2 files changed, 168 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vhost/trace.h

diff --git a/drivers/vhost/trace.h b/drivers/vhost/trace.h
new file mode 100644
index 000..0423899
--- /dev/null
+++ b/drivers/vhost/trace.h
@@ -0,0 +1,153 @@
+#if !defined(_TRACE_VHOST_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VHOST_H
+
+#include <linux/tracepoint.h>
+#include "vhost.h"
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vhost
+
+/*
+ * Tracepoint for updating used flag. 
+ */
+TRACE_EVENT(vhost_virtio_update_used_flags,
+	TP_PROTO(struct vhost_virtqueue *vq),
+	TP_ARGS(vq),
+
+	TP_STRUCT__entry(
+		__field(struct vhost_virtqueue *, vq)
+		__field(u16, used_flags)
+	),
+
+	TP_fast_assign(
+		__entry->vq = vq;
+		__entry->used_flags = vq->used_flags;
+	),
+
+	TP_printk("vhost update used flag %x to vq %p notify %s",
+		  __entry->used_flags, __entry->vq,
+		  (__entry->used_flags & VRING_USED_F_NO_NOTIFY) ?
+		  "disabled" : "enabled")
+);
+
+/*
+ * Tracepoint for updating avail event.
+ */
+TRACE_EVENT(vhost_virtio_update_avail_event,
+	TP_PROTO(struct vhost_virtqueue *vq),
+	TP_ARGS(vq),
+
+	TP_STRUCT__entry(
+		__field(struct vhost_virtqueue *, vq)
+		__field(u16, avail_idx)
+	),
+
+	TP_fast_assign(
+		__entry->vq = vq;
+		__entry->avail_idx = vq->avail_idx;
+	),
+
+	TP_printk("vhost update avail idx %u(%u) for vq %p",
+		  __entry->avail_idx, __entry->avail_idx %
+		  __entry->vq->num, __entry->vq)
+);
+
+/*
+ * Tracepoint for processing descriptor.
+ */
+TRACE_EVENT(vhost_virtio_get_vq_desc,
+	TP_PROTO(struct vhost_virtqueue *vq, unsigned int index,
+		 unsigned out, unsigned int in),
+	TP_ARGS(vq, index, out, in),
+
+	TP_STRUCT__entry(
+		__field(struct vhost_virtqueue *, vq)
+		__field(unsigned int, head)
+		__field(unsigned int, out)
+		__field(unsigned int, in)
+	),
+
+	TP_fast_assign(
+		__entry->vq = vq;
+		__entry->head = index;
+		__entry->out = out;
+		__entry->in = in;
+	),
+
+	TP_printk("vhost get vq %p head %u out %u in %u",
+		  __entry->vq, __entry->head, __entry->out, __entry->in)
+
+);
+
+/*
+ * Tracepoint for signal guest. 
+ */
+TRACE_EVENT(vhost_virtio_signal,
+	TP_PROTO(struct vhost_virtqueue *vq),
+	TP_ARGS(vq),
+
+	TP_STRUCT__entry(
+		__field(struct vhost_virtqueue *, vq)
+	),
+
+	TP_fast_assign(
+		__entry->vq = vq;
+	),
+
+	TP_printk("vhost signal vq %p", __entry->vq)
+);
+
+DECLARE_EVENT_CLASS(vhost_work_template,
+	TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+	TP_ARGS(dev, work),
+
+	TP_STRUCT__entry(
+		__field(struct vhost_dev *, dev)
+		__field(struct vhost_work *, work)
+		__field(void *, function)
+	),
+
+	TP_fast_assign(
+		__entry->dev = dev;
+		__entry->work = work;
+		__entry->function = work->fn;
+	),
+
+	TP_printk("%pf for work %p dev %p",
+		  __entry->function, __entry->work, __entry->dev)
+);
+
+DEFINE_EVENT(vhost_work_template, vhost_work_queue_wakeup,
+	TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+	TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template, vhost_work_queue_coalesce,
+	TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+	TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template,
Fwd: Re: Adding an IPMI BMC device to KVM
Resending to the list, plain text only. Sorry about that... Original Message Subject: Re: Adding an IPMI BMC device to KVM Date: Mon, 07 May 2012 16:57:06 -0500 From: Corey Minyard tcminy...@gmail.com Reply-To: miny...@acm.org To: Anthony Liguori anth...@codemonkey.ws CC: kvm@vger.kernel.org, Avi Kivity a...@redhat.com On 05/05/2012 09:29 AM, Anthony Liguori wrote: You could just use a USB rndis device and then run an IPMI server on the host. That is probably the simplest way to test. I'm trying to figure out how RNDIS would be helpful here. You can't have the guest talk USB for this, it's going to talk the standard interfaces. However, I am loath to create my own protocol for talking between the BMC and QEMU. Perhaps RNDIS could be useful, but it's massive overkill. It also doesn't provide any security. If there were client and server libraries that would run over a socket, that would be tempting. I'm wondering if security is that big a deal, though. If you use a chardev, and you have QEMU make a connection to the BMC, maybe that would be ok. BMCs typically run a full OS like Linux making emulation as a device in QEMU prohibitively hard. BMCs are generally very simple 8-bit microcontrollers, unless they are something like an ATCA shelf manager. Emulation is pretty easy. Well, emulation of an ATCA shelf manager wouldn't be easy, but that's hopefully not required in the near future. -corey Regards, Anthony Liguori On May 4, 2012 2:10 PM, Corey Minyard tcminy...@gmail.com mailto:tcminy...@gmail.com wrote: I've been working on adding an IPMI BMC as a virtual device under KVM. I'm doing this for two primary reasons, one to have a better test environment than what I have now for testing IPMI issues, and second to be able to better simulate a legacy environment for customers porting legacy software. For those that don't know, IPMI is a system management interface. 
Generally systems with IPMI have a small microcontroller, called a BMC, that is always on when the board is powered. The BMC is capable of controlling power and reset on the board, and it is hooked to sensors on the board (voltage, current, temperature, the presence of things like DIMMS and power supplies, intrusion detection, and a host of other things). The main processor on a system can communicate with the BMC over a device. Often these systems also have a LAN interface that lets you control the system remotely even when it's off. In addition, IPMI provides access to FRU (Field Replaceable Unit) data that describes the components of the system that can be replaced. It also has data records that describe the sensor, so it is possible to directly interpret the sensor data and know what the sensor is measuring without outside data. I've been struggling a bit with how to implement this. There is a lot of configuration information, and you need ways to simulate the sensors. This type of interface is a little sensitive, since it has direct access to the reset and power control of a system. I was at first considering having the BMC be an external program that KVM connected to over a chardev, with possibly a simulated LAN interface. This has the advantage that the BMC can run even when KVM is down. It could even start up KVM for a power up, though I'm not sure how valuable that would be. Plus it could be used for other virtual machines. However, that means there is an interface to KVM over a chardev that could do nasty things, and even be a possible intrusion point. It also means there is a separate program to maintain. You could also include the BMC inside of KVM and run it as a separate thread. That way there doesn't have to be an insecure interface. But the BMC will need a lot of configuration data and this will add a bunch of code to KVM that's only tangentially related to it. And you would still need a way to simulate setting sensors and such for testing things. 
Either way, is this interesting for including into KVM? Does anyone have any opinions on the possible ways to implement this? Thanks, -corey
Re: possible circular locking dependency
On (05/07/12 10:52), Avi Kivity wrote: On 05/07/2012 06:47 AM, Paul E. McKenney wrote: On Sun, May 06, 2012 at 11:34:39PM +0300, Sergey Senozhatsky wrote: On (05/06/12 09:42), Paul E. McKenney wrote: On Sun, May 06, 2012 at 11:55:30AM +0300, Avi Kivity wrote: On 05/03/2012 11:02 PM, Sergey Senozhatsky wrote: Hello, 3.4-rc5 Whoa. Looks like inconsistent locking between cpufreq and synchronize_srcu_expedited(). kvm triggered this because it is one of the few users of synchronize_srcu_expedited(), but I don't think it is doing anything wrong directly. Dave, Paul? SRCU hasn't changed much in mainline for quite some time. Holding the hotplug mutex across a synchronize_srcu() is a bad idea, though. However, there is a reworked implementation (courtesy of Lai Jiangshan) in -rcu that does not acquire the hotplug mutex. Could you try that out? Paul, should I try solely -rcu, or are there several commits to pick up and apply on top of the -linus tree? If you want the smallest possible change, take the rcu/srcu branch of -rcu. If you want the works, take the rcu/next branch of -rcu. You can find -rcu at: git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git To make the difference even smaller, merge the above branch with v3.4-rc5. I'm unable to reproduce the issue on 3.4-rc6 so far. So I guess this will take some time. Sergey
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 06:49 AM, Avi Kivity wrote: On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote: * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]: I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version. Deferring preemption (when vcpu is holding lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. Is the improvement so low, because PLE is interfering with the patch, or because PLE already does a good job? How does PLE help with ticket scheduling on unlock? I thought it would just help with the actual spin loops. J
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/07/2012 01:07 PM, Corey Minyard wrote: I think we are getting a little out of hand here, and we are mixing up concepts :). There are lots of things IPMI *can* do (including serial access, VGA snooping, LAN access, etc.) but I don't see any value in that. The main thing here is to emulate the interface to the guest. OOB management is really more appropriately handled with libvirt. How the BMC integrates into the hardware varies a *lot* between systems, but it's really kind of irrelevant. (Well, almost irrelevant, BMCs can provide a direct I2C messaging capability, and that may matter.) A guest can have one (or more) of a number of interfaces (that are all fairly bad, unfortunately). The standard ones are called KCS, BT and SMIC and they generally are directly on the ISA bus, but are in memory on non-x86 boxes (and on some x86 boxes) and sometimes on the PCI bus. Some systems also have interfaces over I2C, but that hasn't really caught on. Others have interfaces over serial ports, and that unfortunately has caught on in the ATCA world. And there are at least 3 different basic types of serial port interfaces with sub-variants :(. I'm not sure what the USB rndis device is, but I'll look at it. But there is no IPMI over USB. USB rndis == USB network adapter. It's just seen by the machine as IPMI over LAN. Regards, Anthony Liguori
Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)
On 07.05.2012, at 20:21, Eduardo Habkost wrote: Andre? Are you able to help to answer the question below? I would like to clarify what's the expected behavior of -cpu host to be able to continue working on it. I believe the code will need to be fixed on either case, but first we need to figure out what are the expectations/requirements, to know _which_ changes will be needed. On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote: (CCing Andre Przywara, in case he can help to clarify what's the expected meaning of -cpu host) [...] I am not sure I understand what you are proposing. Let me explain the use case I am thinking about: - Feature FOO is of type (A) (e.g. just a new instruction set that doesn't require additional userspace support) - User has a Qemu version that doesn't know anything about feature FOO - User gets a new CPU that supports feature FOO - User gets a new kernel that supports feature FOO (i.e. has FOO in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User expects to get feature FOO enabled if using -cpu host, without upgrading Qemu. The problem here is: to support the above use-case, userspace needs a probing mechanism that can differentiate _new_ (previously unknown) features that are in group (A) (safe to blindly enable) from features that are in group (B) (that can't be enabled without an userspace upgrade). In short, it becomes a problem if we consider the following case: - Feature BAR is of type (B) (it can't be enabled without extra userspace support) - User has a Qemu version that doesn't know anything about feature BAR - User gets a new CPU that supports feature BAR - User gets a new kernel that supports feature BAR (i.e. has BAR in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User simply shouldn't get feature BAR enabled, even if using -cpu host, otherwise Qemu would break. 
If userspace always limited itself to features it knows about, it would be really easy to implement the feature without any new probing mechanism from the kernel. But that's not how I think users expect -cpu host to work. Maybe I am wrong, I don't know. I am CCing Andre, who introduced the -cpu host feature, in case he can explain what's the expected semantics on the cases above. Can you think of any feature that'd go into category B? All features I'm aware of work fine (without migration, but that one is moot for -cpu host anyway) as long as the host kvm implementation is fine with it (GET_SUPPORTED_CPUID). So they'd be category A. Alex
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/08/2012 04:45 AM, Jeremy Fitzhardinge wrote: On 05/07/2012 06:49 AM, Avi Kivity wrote: On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote: * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]: I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version. Deferring preemption (when vcpu is holding lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. Is the improvement so low, because PLE is interfering with the patch, or because PLE already does a good job? How does PLE help with ticket scheduling on unlock? I thought it would just help with the actual spin loops. Hmm. This strikes something to me. I think I should replace the 'while 1' hog with some *real job* to measure the over-commit case. I hope to see greater improvements because of the fairness and scheduling of the patch-set. Maybe all along I was measuring something equal to the 1x case.
Re: [PATCH 0/4] Unlocked TLB flush
On Thu, May 03, 2012 at 02:22:58PM +0300, Avi Kivity wrote: This patchset implements unlocked TLB flushing for KVM. An operation that generates stale TLB entries can mark the TLB as dirty instead of flushing immediately, and then flush after releasing mmu_lock but before returning to the guest or the caller. A few call sites are converted too. Note not all call sites are easily convertible; as an example, sync_page() must flush before reading the guest page table. Huh? Are you referring to:

 * Note:
 * We should flush all tlbs if spte is dropped even though guest is
 * responsible for it. Since if we don't, kvm_mmu_notifier_invalidate_page
 * and kvm_mmu_notifier_invalidate_range_start detect the mapping page isn't
 * used by guest then tlbs are not flushed, so guest is allowed to access the
 * freed pages.
 * And we increase kvm->tlbs_dirty to delay tlbs flush in this case.

With an increased dirtied_count the flush can be performed by kvm_mmu_notifier_invalidate_page.
Re: [PATCH 1/4] KVM: Add APIs for unlocked TLB flush
On Mon, May 07, 2012 at 10:59:04AM +0300, Avi Kivity wrote: On 05/07/2012 10:06 AM, Xiao Guangrong wrote: On 05/03/2012 10:11 PM, Avi Kivity wrote: On 05/03/2012 04:23 PM, Xiao Guangrong wrote: On 05/03/2012 07:22 PM, Avi Kivity wrote: Currently we flush the TLB while holding mmu_lock. This increases the lock hold time by the IPI round-trip time, increasing contention, and makes dropping the lock (for latency reasons) harder. This patch changes TLB management to be usable locklessly, introducing the following APIs: kvm_mark_tlb_dirty() - mark the TLB as containing stale entries kvm_cond_flush_remote_tlbs() - flush the TLB if it was marked as dirty These APIs can be used without holding mmu_lock (though if the TLB became stale due to shadow page table modifications, typically it will need to be called with the lock held to prevent other threads from seeing the modified page tables with the TLB unmarked and unflushed)/ Signed-off-by: Avi Kivity a...@redhat.com --- Documentation/virtual/kvm/locking.txt | 14 ++ arch/x86/kvm/paging_tmpl.h|4 ++-- include/linux/kvm_host.h | 22 +- virt/kvm/kvm_main.c | 29 - 4 files changed, 57 insertions(+), 12 deletions(-) diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt index 3b4cd3b..f6c90479 100644 --- a/Documentation/virtual/kvm/locking.txt +++ b/Documentation/virtual/kvm/locking.txt @@ -23,3 +23,17 @@ Arch:x86 Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} - tsc offset in vmcb Comment: 'raw' because updating the tsc offsets must not be preempted. + +3. TLB control +-- + +The following APIs should be used for TLB control: + + - kvm_mark_tlb_dirty() - indicates that the TLB is out of sync wrt + either guest or host page tables. 
+ - kvm_flush_remote_tlbs() - unconditionally flush the tlbs
+ - kvm_cond_flush_remote_tlbs() - flush the TLBs if previously marked
+
+These may be used without mmu_lock, though kvm_mark_tlb_dirty() needs to be
+used while holding mmu_lock if it is called due to host page table changes
+(contrast to guest page table changes).

In these patches, it seems that kvm_mark_tlb_dirty is always used under the protection of mmu-lock, yes? Correct. It's possible we'll find a use outside mmu_lock, but this isn't likely. If we need to call kvm_mark_tlb_dirty outside mmu-lock, just use kvm_flush_remote_tlbs instead:

	if (need-flush-tlb)
		flush = true;
	do something...
	if (flush)
		kvm_flush_remote_tlbs

It depends on how need-flush-tlb is computed. If it depends on sptes, then we must use kvm_cond_flush_remote_tlbs(). If both kvm_mark_tlb_dirty and kvm_cond_flush_remote_tlbs are used out of mmu-lock, I think we can use kvm_flush_remote_tlbs instead. If it is so, dirtied_count/flushed_count need not be atomic. But we always mark with mmu_lock held. Yes, so, we can change kvm_mark_tlb_dirty to:

+static inline void kvm_mark_tlb_dirty(struct kvm *kvm)
+{
+	/*
+	 * Make any changes to the page tables visible to remote flushers.
+	 */
+	smp_mb();
+	kvm->tlb_state.dirtied_count++;
+}

Yes. We'll have to change it again if we ever dirty sptes outside the lock, but that's okay. Please don't. There are readers outside mmu_lock, so it should be atomic.
Re: [PATCH 1/4] KVM: Add APIs for unlocked TLB flush
On Thu, May 03, 2012 at 05:11:01PM +0300, Avi Kivity wrote: On 05/03/2012 04:23 PM, Xiao Guangrong wrote: On 05/03/2012 07:22 PM, Avi Kivity wrote: Currently we flush the TLB while holding mmu_lock. This increases the lock hold time by the IPI round-trip time, increasing contention, and makes dropping the lock (for latency reasons) harder. This patch changes TLB management to be usable locklessly, introducing the following APIs: kvm_mark_tlb_dirty() - mark the TLB as containing stale entries kvm_cond_flush_remote_tlbs() - flush the TLB if it was marked as dirty These APIs can be used without holding mmu_lock (though if the TLB became stale due to shadow page table modifications, typically it will need to be called with the lock held to prevent other threads from seeing the modified page tables with the TLB unmarked and unflushed)/ Signed-off-by: Avi Kivity a...@redhat.com --- Documentation/virtual/kvm/locking.txt | 14 ++ arch/x86/kvm/paging_tmpl.h|4 ++-- include/linux/kvm_host.h | 22 +- virt/kvm/kvm_main.c | 29 - 4 files changed, 57 insertions(+), 12 deletions(-) diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt index 3b4cd3b..f6c90479 100644 --- a/Documentation/virtual/kvm/locking.txt +++ b/Documentation/virtual/kvm/locking.txt @@ -23,3 +23,17 @@ Arch: x86 Protects:- kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} - tsc offset in vmcb Comment: 'raw' because updating the tsc offsets must not be preempted. + +3. TLB control +-- + +The following APIs should be used for TLB control: + + - kvm_mark_tlb_dirty() - indicates that the TLB is out of sync wrt + either guest or host page tables. + - kvm_flush_remote_tlbs() - unconditionally flush the tlbs + - kvm_cond_flush_remote_tlbs() - flush the TLBs if previously marked + +These may be used without mmu_lock, though kvm_mark_tlb_dirty() needs to be +used while holding mmu_lock if it is called due to host page table changes +(contrast to guest page table changes). 
In these patches, it seems that kvm_mark_tlb_dirty is always used under the protection of mmu-lock, yes?

Correct. It's possible we'll find a use outside mmu_lock, but this isn't likely.

If both kvm_mark_tlb_dirty and kvm_cond_flush_remote_tlbs are used out of mmu-lock, I think we can use kvm_flush_remote_tlbs instead. If it is so, dirtied_count/flushed_count need not be atomic.

But we always mark with mmu_lock held.

And, it seems there is a bug:

  VCPU 0                              VCPU 1
  hold mmu-lock
  zap spte which points to 'gfn'
  mark_tlb_dirty
  release mmu-lock
                                      hold mmu-lock
                                      rmap_write_protect: see no spte
                                        pointing to gfn
                                      tlb is not flushed
                                      release mmu-lock

                                      write gfn by guest
                                      OOPS!!!
  kvm_cond_flush_remote_tlbs

We need to call kvm_cond_flush_remote_tlbs in rmap_write_protect unconditionally?

Yes, that's the point. We change

  spin_lock(mmu_lock)
  conditionally flush the tlb
  spin_unlock(mmu_lock)

to

  spin_lock(mmu_lock)
  conditionally mark the tlb as dirty
  spin_unlock(mmu_lock)
  kvm_cond_flush_remote_tlbs()

but that means the entire codebase has to be converted.

Is there any other site which expects sptes and TLBs to be in sync, other than rmap_write_protect? Please convert the flush_remote_tlbs at the end of set_spte (RW->RO) to mark_dirty.

Looks good in general (the patchset is incomplete though). One thing that is annoying is that there is no guarantee of progress for the flushed_count increment (it can, in theory, always race with a mark_dirty). But since that is only a performance, and not a correctness, aspect, it is fine. It has the advantage that it requires code to explicitly document where the TLB must be flushed, and the sites which expect sptes to be in sync with TLBs.

--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Unlocked TLB flush
On Mon, May 07, 2012 at 10:25:34PM -0300, Marcelo Tosatti wrote:
On Thu, May 03, 2012 at 02:22:58PM +0300, Avi Kivity wrote:

This patchset implements unlocked TLB flushing for KVM. An operation that generates stale TLB entries can mark the TLB as dirty instead of flushing immediately, and then flush after releasing mmu_lock but before returning to the guest or the caller. A few call sites are converted too. Note not all call sites are easily convertible; as an example, sync_page() must flush before reading the guest page table.

Huh? Are you referring to:

  /*
   * Note:
   * We should flush all tlbs if spte is dropped even though guest is
   * responsible for it. Since if we don't, kvm_mmu_notifier_invalidate_page
   * and kvm_mmu_notifier_invalidate_range_start detect the mapping page isn't
   * used by guest then tlbs are not flushed, so guest is allowed to access the
   * freed pages.
   * And we increase kvm->tlbs_dirty to delay tlbs flush in this case.
   */

With an increased dirtied_count the flush can be performed by kvm_mmu_notifier_invalidate_page. Which is what patch 1 does. Your comment regarding sync_page() above is what is outdated, unless I am missing something.
Re: virtio 3.4 patches
On Mon, 7 May 2012 08:30:21 +0300, Michael S. Tsirkin m...@redhat.com wrote: Hi Rusty, please also pick two fixes from for_linus tag on my tree I think they should be sent to Linus for 3.4: http://git.kernel.org/?p=linux/kernel/git/mst/vhost.git;a=tag;h=refs/tags/for_linus git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git refs/tags/for_linus Done! Thanks so much for handling virtio while I was away; knowing it was safely in your hands allowed me to relax without reservation. My wife thanks you, too :) Cheers, Rusty.
Re: [PATCH] virtio_blk: Drop unused request tracking list
On Fri, 30 Mar 2012 11:24:10 +0800, Asias He as...@redhat.com wrote:

Benchmark shows small performance improvement on fusion io device.

Before:
  seq-read : io=1,024MB, bw=19,982KB/s, iops=39,964, runt= 52475msec
  seq-write: io=1,024MB, bw=20,321KB/s, iops=40,641, runt= 51601msec
  rnd-read : io=1,024MB, bw=15,404KB/s, iops=30,808, runt= 68070msec
  rnd-write: io=1,024MB, bw=14,776KB/s, iops=29,552, runt= 70963msec

After:
  seq-read : io=1,024MB, bw=20,343KB/s, iops=40,685, runt= 51546msec
  seq-write: io=1,024MB, bw=20,803KB/s, iops=41,606, runt= 50404msec
  rnd-read : io=1,024MB, bw=16,221KB/s, iops=32,442, runt= 64642msec
  rnd-write: io=1,024MB, bw=15,199KB/s, iops=30,397, runt= 68991msec

Signed-off-by: Asias He as...@redhat.com

Thanks. It didn't really need a benchmark to justify this cleanup, but you certainly get points for thoroughness!

Applied,
Rusty.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 08:22 PM, Avi Kivity wrote:
On 05/07/2012 05:47 PM, Raghavendra K T wrote:

Not good. Solving a problem in software that is already solved by hardware? It's okay if there are no costs involved, but here we're introducing a new ABI that we'll have to maintain for a long time.

Hmm, agree that being a step ahead of mighty hardware (and just an improvement of 1-3%) is no good for the long term (where PLE is the future).

PLE is the present, not the future. It was introduced on later Nehalems and is present on all Westmeres. Two more processor generations have passed meanwhile. The AMD equivalent was also introduced around that timeframe.

Having said that, it is hard for me to resist saying: the bottleneck is somewhere else on PLE machines, and IMHO the answer would be a combination of paravirt-spinlock + pv-flush-tlb. But I need to come up with good numbers to argue in favour of the claim.

PS: Nikunj had experimented that pv-flush-tlb + paravirt-spinlock is a win on PLE machines where only one of them alone could not prove the benefit.

I'd like to see those numbers, then. Ingo, please hold on the kvm-specific patches, meanwhile.

Hmm. I think I messed up the fact while saying 1-3% improvement on PLE. Going by what I had posted in https://lkml.org/lkml/2012/4/5/73 (with correct calculation):

  1x  70.475  (85.6979)   63.5033 (72.7041)  15.7%
  2x  110.971 (132.829)   105.099 (128.738)   5.56%
  3x  150.265 (184.766)   138.341 (172.69)    8.62%

It was around 12% with the optimization patch posted separately with that (that one needs more experiments though). But anyway, I will come up with results for the current patch series.