PMU in KVM
Hi Gleb, I noticed that arch/x86/kvm/pmu.c is under your maintainership and I have some questions about the PMU in KVM. Thanks in advance if you can spare the time to answer them. 1. How does the PMU interact with Intel VT? For example, I can only find flags in the IA32_PERFEVTSELx MSRs to count in OS and USER mode (ring 0 and the other rings). What happens when I execute VMXON with the PMU enabled? Can I distinguish counts in root mode from counts in non-root mode? I cannot find the related descriptions in the Intel manual. 2. What is the current status of vPMU in KVM? Is it enabled automatically? And how can I use (or enable/disable) it? Thanks, Arthur -- Arthur Chunqi Li Department of Computer Science School of EECS Peking University Beijing, China
Re: PMU in KVM
On Tue, Nov 26, 2013 at 04:02:36PM +0800, Arthur Chunqi Li wrote: Hi Gleb, I noticed that arch/x86/kvm/pmu.c is under your maintainership and I have some questions about the PMU in KVM. Thanks in advance if you can spare the time to answer them. 1. How does the PMU interact with Intel VT? For example, I can only find flags in the IA32_PERFEVTSELx MSRs to count in OS and USER mode (ring 0 and the other rings). What happens when I execute VMXON with the PMU enabled? Can I distinguish counts in root mode from counts in non-root mode? I cannot find the related descriptions in the Intel manual. No, you cannot. You can disable/enable PMU counters before/after vmexit/vmentry and in this way know in what state the event was counted. That is what the PMU emulation does. 2. What is the current status of vPMU in KVM? Is it enabled automatically? And how can I use (or enable/disable) it? If you run with -cpu host on an Intel machine it will be enabled. -- Gleb.
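The hardware itself has no root/non-root filter bit in IA32_PERFEVTSELx; the separation Gleb describes is done in software by switching counters around vmentry/vmexit. On the host side, Linux perf exposes this through the exclude_host/exclude_guest bits of perf_event_attr. The following is a rough sketch, not from this thread and with error handling omitted, of counting cycles on one CPU only while it is in guest (non-root) mode:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>

    static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                    int cpu, int group_fd, unsigned long flags)
    {
            return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    /* Returns a counter fd that only counts while a guest is running. */
    static int count_guest_cycles_on_cpu(int cpu)
    {
            struct perf_event_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_CPU_CYCLES;
            attr.exclude_host = 1;   /* do not count in VMX root mode */
            attr.exclude_guest = 0;  /* count in non-root (guest) mode */

            return sys_perf_event_open(&attr, -1, cpu, -1, 0);
    }

The counter value can then be read with read(2) on the returned fd; KVM's vPMU emulation relies on the same host perf machinery when it programs counters on behalf of the guest.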
Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote: On 11/26/2013 02:12 AM, Marcelo Tosatti wrote: On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote: Also, there is no guarantee of termination (as long as sptes are deleted with the correct timing). BTW, I can't see any guarantee of termination for rculist nulls either (a writer can race with a lockless reader indefinitely, restarting the lockless walk every time). Hmm, that can be avoided by checking the dirty-bitmap before the rewalk; that means, if the dirty-bitmap has been set during lockless write-protection, it's unnecessary to write-protect its sptes. Your idea? This idea is based on the fact that the number of rmap entries is limited by RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap, we can break the rewalk at once; in the case of deleting, we can only rewalk RMAP_RECYCLE_THRESHOLD times. Please explain in more detail. Okay. My proposal is like this:

    pte_list_walk_lockless()
    {
    restart:
    +       if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
    +               return;

            code-doing-lockless-walking;
            ......
    }

Before doing the lockless walking, we check the dirty-bitmap first; if it is set we can simply skip write-protection for the gfn. That is the case where a new spte is being added into the rmap while we access the rmap locklessly. For the case of deleting an spte from the rmap, the number of entries is limited by RMAP_RECYCLE_THRESHOLD, so it is not endless. The point is that the rmap entry that you are inspecting can be constantly deleted and added to the beginning of some other list, so the code that traverses the list will never reach the end. -- Gleb.
Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote: On 11/25/2013 10:23 PM, Marcelo Tosatti wrote: On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote: On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote: On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote: [snip complicated stuff about parent_pte] I'm not really following, but note that parent_pte predates EPT (and the use of rcu in kvm), so all the complexity that is the result of trying to pack as many list entries into a cache line can be dropped. Most setups now would have exactly one list entry, which is handled specially anyway. Alternatively, the trick of storing multiple entries in one list entry can be moved to generic code; it may be useful to others. Yes, can the lockless list walking code be transformed into generic singly-linked list walking? Then the correctness can be verified independently, and KVM becomes a simple user of that interface. I'm afraid the single-entry list is not as good as we expected. In my experience, there are too many entries on the rmap, more than 300 sometimes (consider the case of a lib shared by all processes). This is without EPT though, and non-EPT hardware is not a performance king anyway. Nested EPT uses shadow paging too, but VMs hardly share any pages. With KSM they may, though. -- Gleb.
[RFC] create a single workqueue for each vm to update vm irq routing table
Hi all, When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. Any better ideas? Thanks, Zhang Haoyu
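A minimal sketch of the idea being proposed (this is not Zhang's actual patch; the per-VM workqueue field and all names are illustrative assumptions):

    struct irq_routing_free_work {
            struct work_struct work;
            struct kvm_irq_routing_table *old;
    };

    static void free_old_routing(struct work_struct *work)
    {
            struct irq_routing_free_work *w =
                    container_of(work, struct irq_routing_free_work, work);

            synchronize_rcu();      /* wait for readers of the old table... */
            kfree(w->old);          /* ...then free it, off the vcpu path */
            kfree(w);
    }

    static void defer_routing_free(struct kvm *kvm,
                                   struct kvm_irq_routing_table *old)
    {
            struct irq_routing_free_work *w = kzalloc(sizeof(*w), GFP_KERNEL);

            if (!w) {               /* fall back to waiting synchronously */
                    synchronize_rcu();
                    kfree(old);
                    return;
            }
            w->old = old;
            INIT_WORK(&w->work, free_old_routing);
            queue_work(kvm->irq_routing_wq, &w->work); /* hypothetical per-VM workqueue */
    }

With this, the vcpu thread only publishes the new table (the rcu_assign_pointer inside kvm_irq_routing_update) and returns to the guest; the grace-period wait moves to the workqueue. The rate-limiting concern raised later in the thread applies to this variant as well.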
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 13:40, Zhanghaoyu (A) wrote: When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? Paolo
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 12:40:36PM +, Zhanghaoyu (A) wrote: Hi all, When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. Why does the vcpu thread ask the hypervisor to update the irq routing table on pcpu migration? In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. Any better ideas? Thanks, Zhang Haoyu -- Gleb.
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 02:48:10PM +0200, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 12:40:36PM +, Zhanghaoyu (A) wrote: Hi all, When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. Why does the vcpu thread ask the hypervisor to update the irq routing table on pcpu migration? Ah, I misread. The guest sets irq smp_affinity, not the host. -- Gleb.
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote: On 26/11/2013 13:40, Zhanghaoyu (A) wrote: When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? It should be rate limited somehow. Since it is guest triggerable, a guest may cause the host to allocate a lot of memory this way. Is this about MSI interrupt affinity? IIRC changing INT interrupt affinity should not trigger a kvm_set_irq_routing update. If this is about MSI only, then what about changing userspace to use KVM_SIGNAL_MSI for MSI injection? -- Gleb.
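For reference, KVM_SIGNAL_MSI injects an MSI described entirely by the message address/data supplied by userspace, so the delivery does not go through the irq routing table at all. A rough userspace sketch (error handling omitted):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Inject one MSI into the VM identified by vm_fd. */
    static int inject_msi(int vm_fd, uint64_t msi_addr, uint32_t msi_data)
    {
            struct kvm_msi msi = {
                    .address_lo = (uint32_t)msi_addr,
                    .address_hi = (uint32_t)(msi_addr >> 32),
                    .data       = msi_data,
            };

            /* >0 delivered, 0 blocked by the guest, <0 error */
            return ioctl(vm_fd, KVM_SIGNAL_MSI, &msi);
    }

The address/data pair can be taken straight from the device's MSI/MSI-X entry, so no routing-table update (and hence no synchronize_rcu) is needed when the guest retargets such an interrupt.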
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 13:56, Gleb Natapov wrote: I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? It should be rate limited somehow. Since it is guest triggerable, a guest may cause the host to allocate a lot of memory this way. True, though if I understand Zhanghaoyu's proposal, a workqueue would be even worse. Paolo
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 14:18, Avi Kivity wrote: I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? Can this cause an interrupt to be delivered to the wrong (old) vcpu? No, this would be exactly the same code that is running now:

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);
    synchronize_rcu();
    kfree(old);
    return 0;

Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. There is still the problem that Gleb pointed out, though. Paolo The way Linux sets interrupt affinity, it cannot, since changing the affinity is (IIRC) done in the interrupt handler, so the next interrupt cannot be in flight and thus pick up the old interrupt routing table. However it may be vulnerable in other ways.
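A sketch of what the call_rcu() variant being suggested might look like; it assumes struct kvm_irq_routing_table grows a struct rcu_head member (an assumption made for this illustration, not an actual patch):

    static void free_irq_routing_table(struct rcu_head *head)
    {
            /* 'rcu' is the assumed rcu_head member added to the table */
            struct kvm_irq_routing_table *rt =
                    container_of(head, struct kvm_irq_routing_table, rcu);

            kfree(rt);
    }

    int kvm_set_irq_routing_deferred(struct kvm *kvm,
                                     struct kvm_irq_routing_table *new)
    {
            struct kvm_irq_routing_table *old;

            mutex_lock(&kvm->irq_lock);
            old = kvm->irq_routing;
            kvm_irq_routing_update(kvm, new);  /* rcu_assign_pointer inside */
            mutex_unlock(&kvm->irq_lock);

            call_rcu(&old->rcu, free_irq_routing_table);  /* no blocking here */
            return 0;
    }

The update path no longer blocks; the trade-off, discussed in the rest of the thread, is that nothing waits for in-flight readers before the vcpu re-enters the guest, and the deferred free becomes guest-triggerable at a high rate.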
Qmp signal event
Hi, We're connecting to an instance using the QMP socket and listening on the events in order to determine whether the shutdown came from inside the VM or was issued by the host, for example by piping 'system_powerdown'. However, we're unable to distinguish between a host shutdown issued through QMP and a TERM signal issued by kill, for example. Is there a way to determine if KVM received a TERM signal? Is it possible to send an event, just like the ones used for 'SHUTDOWN' and 'POWERDOWN', through QMP? Cheers, Jose -- Jose Antonio Lopes Ganeti Engineering Google Germany GmbH Dienerstr. 12, 80331, München Registergericht und -nummer: Hamburg, HRB 86891 Sitz der Gesellschaft: Hamburg Geschäftsführer: Graham Law, Christine Elizabeth Flores Steuernummer: 48/725/00206 Umsatzsteueridentifikationsnummer: DE813741370
Re: KVM call agenda for 2013-11-26
Juan Quintela quint...@redhat.com wrote: Hi, please send any topic that you are interested in covering. As there are no topics for the agenda, the call is cancelled. Later, Juan. Thanks, Juan. Call details: 10:00 AM to 11:00 AM EDT, every two weeks. If you need phone number details, contact me privately.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 15:36, Avi Kivity wrote: No, this would be exactly the same code that is running now:

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);
    synchronize_rcu();
    kfree(old);
    return 0;

Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) If you eliminate the synchronize_rcu, new interrupts would see the new routing table, while interrupts already in flight will get a dangling pointer. Paolo
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 04:46 PM, Paolo Bonzini wrote: On 26/11/2013 15:36, Avi Kivity wrote: No, this would be exactly the same code that is running now:

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);
    synchronize_rcu();
    kfree(old);
    return 0;

Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). It's another question whether the hardware provides the same guarantee. If you eliminate the synchronize_rcu, new interrupts would see the new routing table, while interrupts already in flight will get a dangling pointer. Sure, if you drop the synchronize_rcu(), you have to add call_rcu().
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote: On 11/26/2013 04:46 PM, Paolo Bonzini wrote: On 26/11/2013 15:36, Avi Kivity wrote: No, this would be exactly the same code that is running now:

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);
    synchronize_rcu();
    kfree(old);
    return 0;

Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). -- Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 16:03, Gleb Natapov wrote: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. Paolo
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:03 PM, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote: On 11/26/2013 04:46 PM, Paolo Bonzini wrote: On 26/11/2013 15:36, Avi Kivity wrote: No, this would be exactly the same code that is running now:

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);
    synchronize_rcu();
    kfree(old);
    return 0;

Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Consider this guest code:

    write msi entry, directing the interrupt away from this vcpu
    nop
    memset(idt, 0, sizeof(idt));

Currently, this code will never trigger a triple fault. With the change to call_rcu(), it may. Now it may be that the guest does not expect this to work (PCI writes are posted; and interrupts can be delayed indefinitely by the pci fabric), but we don't know if there's a path that guarantees the guest something that we're taking away with this change.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:20 PM, Paolo Bonzini wrote: On 26/11/2013 16:03, Gleb Natapov wrote: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 16:25, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Paolo
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:28 PM, Paolo Bonzini wrote: On 26/11/2013 16:25, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why?
Re: Elvis upstreaming plan
On Sun, Nov 24, 2013 at 11:22:17AM +0200, Razya Ladelsky wrote: 5. Add heuristics to improve I/O scheduling This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. https://github.com/abelg/virtual_io_acceleration/commit/f6a4f1a5d6b82dc754e8af8af327b8d0f043dc4d This patch should probably do something portable instead of relying on x86-only rdtscll(). Stefan
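For illustration only (this is not the Elvis code being referenced): the polling-budget style check that typically motivates rdtscll() can be expressed with a portable monotonic kernel clock such as local_clock(), for example:

    #include <linux/sched.h>

    /*
     * Hypothetical helper: returns true once more than budget_ns of polling
     * time has elapsed since start_ns. local_clock() returns monotonic
     * per-CPU nanoseconds and works on all architectures, unlike rdtscll().
     */
    static bool poll_budget_exhausted(u64 start_ns, u64 budget_ns)
    {
            return local_clock() - start_ns > budget_ns;
    }

    /* usage sketch:
     *      u64 start = local_clock();
     *      while (!poll_budget_exhausted(start, budget_ns))
     *              ... keep servicing this virtqueue ...
     */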
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 16:35, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. (BTW, PCI memory writes are posted, but configuration writes are not.) Paolo
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 02:56:10PM +0200, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote: On 26/11/2013 13:40, Zhanghaoyu (A) wrote: When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? It should be rate limited somehow. Since it is guest triggerable, a guest may cause the host to allocate a lot of memory this way. The checks in __call_rcu() should handle this, I think. These keep a per-CPU counter, which can be adjusted via rcutree.blimit, which defaults to taking evasive action if more than 10K callbacks are waiting on a given CPU. Is this about MSI interrupt affinity? IIRC changing INT interrupt affinity should not trigger a kvm_set_irq_routing update. If this is about MSI only, then what about changing userspace to use KVM_SIGNAL_MSI for MSI injection? -- Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:58 PM, Paolo Bonzini wrote: On 26/11/2013 16:35, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. It's indeed safe, but I think there's a nice win to be had if we drop the assumption. (BTW, PCI memory writes are posted, but configuration writes are not.) MSIs are configured via PCI memory writes. By itself, that doesn't buy us anything, since the guest could flush the write via a read. But I think the fact that the interrupt messages themselves are posted proves that it is safe. The fact that Linux does interrupt migration from within the interrupt handler also shows that someone else believes that it is the only safe place to do it.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote: On 11/26/2013 05:58 PM, Paolo Bonzini wrote: On 26/11/2013 16:35, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. It's indeed safe, but I think there's a nice win to be had if we drop the assumption. I'm not arguing with that, but a minor comment below: (BTW, PCI memory writes are posted, but configuration writes are not.) MSIs are configured via PCI memory writes. By itself, that doesn't buy us anything, since the guest could flush the write via a read. But I think the fact that the interrupt messages themselves are posted proves that it is safe. FYI, a PCI read flushes the interrupt itself in, too. The fact that Linux does interrupt migration from within the interrupt handler also shows that someone else believes that it is the only safe place to do it.
Re: [PATCHv2] KVM: optimize apic interrupt delivery
On Wed, Sep 12, 2012 at 08:13:54AM -0700, Paul E. McKenney wrote: On Wed, Sep 12, 2012 at 03:44:26PM +0300, Gleb Natapov wrote: On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote: On 09/12/2012 03:34 PM, Gleb Natapov wrote: On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote: On 09/12/2012 04:03 AM, Paul E. McKenney wrote: Paul, I'd like to check something with you here: this function can be triggered by userspace, any number of times; we allocate a 2K chunk of memory that is later freed by kfree_rcu. Is there a risk of DOS if RCU is delayed while lots of memory is queued up in this way? If yes, is this a generic problem with kfree_rcu that should be addressed in the core kernel? There is indeed a risk. In our case it's a 2K object. Is it a practical risk? How many kfree_rcu()s per second can a given user cause to happen? Not much more than a few hundred thousand per second per process (normal operation is zero). I managed to do 21466 per second. Strange, why so slow? Because the ftrace buffer overflows :) With a bigger buffer I get 169940. Ah, good, should not be a problem. In contrast, if you ran kfree_rcu() in a tight loop, you could probably do in excess of 100M per CPU per second. Now -that- might be a problem. Well, it -might- be a problem if you somehow figured out how to allocate memory that quickly in a steady-state manner. ;-) Good idea. Michael, it should be easy to modify kvm-unit-tests to write to the APIC ID register in a loop. I did. Memory consumption does not grow on an otherwise idle host. Very good -- the checks in __call_rcu(), which is common code invoked by kfree_rcu(), seem to be doing their job, then. These do keep a per-CPU counter, which can be adjusted via rcutree.blimit, which defaults to taking evasive action if more than 10K callbacks are waiting on a given CPU. My concern was that you might be overrunning that limit in way less than a grace period (as in about a hundred microseconds). My concern was of course unfounded -- it takes several grace periods to push 10K callbacks through. Thanx, Paul Gleb noted that Documentation/RCU/checklist.txt has this text: An especially important property of the synchronize_rcu() primitive is that it automatically self-limits: if grace periods are delayed for whatever reason, then the synchronize_rcu() primitive will correspondingly delay updates. In contrast, code using call_rcu() should explicitly limit update rate in cases where grace periods are delayed, as failing to do so can result in excessive realtime latencies or even OOM conditions. If call_rcu is self-limiting, maybe this should be documented ... Ok, thanks. -- error compiling committee.c: too many arguments to function -- Gleb.
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 06:14:27PM +0200, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 06:05:37PM +0200, Michael S. Tsirkin wrote: On Tue, Nov 26, 2013 at 02:56:10PM +0200, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote: On 26/11/2013 13:40, Zhanghaoyu (A) wrote: When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? It should be rate limited somehow. Since it is guest triggerable, a guest may cause the host to allocate a lot of memory this way. The checks in __call_rcu() should handle this, I think. These keep a per-CPU counter, which can be adjusted via rcutree.blimit, which defaults to taking evasive action if more than 10K callbacks are waiting on a given CPU. Documentation/RCU/checklist.txt has: An especially important property of the synchronize_rcu() primitive is that it automatically self-limits: if grace periods are delayed for whatever reason, then the synchronize_rcu() primitive will correspondingly delay updates. In contrast, code using call_rcu() should explicitly limit update rate in cases where grace periods are delayed, as failing to do so can result in excessive realtime latencies or even OOM conditions. I just asked Paul what this means. -- Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:11 PM, Michael S. Tsirkin wrote: On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote: On 11/26/2013 05:58 PM, Paolo Bonzini wrote: On 26/11/2013 16:35, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. It's indeed safe, but I think there's a nice win to be had if we drop the assumption. I'm not arguing with that, but a minor comment below: (BTW, PCI memory writes are posted, but configuration writes are not.) MSIs are configured via PCI memory writes. By itself, that doesn't buy us anything, since the guest could flush the write via a read. But I think the fact that the interrupt messages themselves are posted proves that it is safe. FYI, a PCI read flushes the interrupt itself in, too. I guess that kills the optimization then. Maybe you can do qrcu, whatever that is.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote: On 26/11/2013 16:03, Gleb Natapov wrote: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now:

    disable_irq()
    VCPU writes to routing table
                                        e = entry from IRQ routing table
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();
    kvm_irq_routing_update(kvm, new);
    synchronize_rcu()
    VCPU resumes execution
    enable_irq()
    receive stale irq

-- Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:24 PM, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote: On 26/11/2013 16:03, Gleb Natapov wrote: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now:

    disable_irq()
    VCPU writes to routing table
                                        e = entry from IRQ routing table
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();
    kvm_irq_routing_update(kvm, new);
    synchronize_rcu()
    VCPU resumes execution
    enable_irq()
    receive stale irq

Suppose the guest did not disable_irq() and enable_irq(), but instead had a pci read where you have the enable_irq(). After the read you cannot have a stale irq (assuming the read flushes the irq all the way to the APIC).
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 17:24, Gleb Natapov wrote:

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now:

    disable_irq()
    VCPU writes to routing table
                                        e = entry from IRQ routing table
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();
    kvm_irq_routing_update(kvm, new);
    synchronize_rcu()
    VCPU resumes execution
    enable_irq()
    receive stale irq

Adding disable/enable IRQs looks like a relatively big change. But perhaps it's not, for some reason I'm missing. Paolo
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 04:58:53PM +0100, Paolo Bonzini wrote: On 26/11/2013 16:35, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. (BTW, PCI memory writes are posted, but configuration writes are not.) Paolo You can also do a PCI read and flush out the writes.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:28 PM, Paolo Bonzini wrote: On 26/11/2013 17:24, Gleb Natapov wrote:

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now:

    disable_irq()
    VCPU writes to routing table
                                        e = entry from IRQ routing table
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();
    kvm_irq_routing_update(kvm, new);
    synchronize_rcu()
    VCPU resumes execution
    enable_irq()
    receive stale irq

Adding disable/enable IRQs looks like a relatively big change. But perhaps it's not, for some reason I'm missing. Those are guest operations, which may not be there at all.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 05:28:23PM +0100, Paolo Bonzini wrote: On 26/11/2013 17:24, Gleb Natapov wrote:

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now:

    disable_irq()
    VCPU writes to routing table
                                        e = entry from IRQ routing table
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();
    kvm_irq_routing_update(kvm, new);
    synchronize_rcu()
    VCPU resumes execution
    enable_irq()
    receive stale irq

Adding disable/enable IRQs looks like a relatively big change. But perhaps it's not, for some reason I'm missing. You will receive a stale irq even without disable/enable IRQs, of course. I put them there so that the guest would have a chance to do stupid things like zeroing the idt before receiving the interrupt, but on real HW the timing is different from what we emulate, so the same race may happen even without disable/enable IRQs. -- Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 06:27:47PM +0200, Avi Kivity wrote: On 11/26/2013 06:24 PM, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote: On 26/11/2013 16:03, Gleb Natapov wrote: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now:

    disable_irq()
    VCPU writes to routing table
                                        e = entry from IRQ routing table
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();
    kvm_irq_routing_update(kvm, new);
    synchronize_rcu()
    VCPU resumes execution
    enable_irq()
    receive stale irq

Suppose the guest did not disable_irq() and enable_irq(), but instead had a pci read where you have the enable_irq(). After the read you cannot have a stale irq (assuming the read flushes the irq all the way to the APIC). There may still be a race between the pci read and an MSI registered in the IRR. I do not believe such a read can undo IRR changes. -- Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 17:21, Avi Kivity wrote: It's indeed safe, but I think there's a nice win to be had if we drop the assumption. I'm not arguing with that, but a minor comment below: (BTW, PCI memory writes are posted, but configuration writes are not.) MSIs are configured via PCI memory writes. By itself, that doesn't buy us anything, since the guest could flush the write via a read. But I think the fact that the interrupt messages themselves are posted proves that it is safe. FYI, a PCI read flushes the interrupt itself in, too. I guess that kills the optimization then. Maybe you can do qrcu, whatever that is. It's srcu (a separate SRCU instance specific to the irq routing table), which I managed to misspell twice. Actually, it turns out that qrcu actually exists (http://lwn.net/Articles/223752/) and has extremely fast grace periods, but read_lock/read_unlock are also more expensive. So it was probably some kind of Freudian slip. Paolo
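For illustration, a dedicated SRCU domain for the routing table would let the updater wait only for routing-table readers instead of a global RCU grace period. The sketch below is an outline under stated assumptions (the kvm->irq_srcu field and the function names are made up for this example):

    #include <linux/srcu.h>

    /* reader side: the irq injection path */
    static int deliver_under_srcu(struct kvm *kvm, unsigned int gsi)
    {
            struct kvm_irq_routing_table *irq_rt;
            int idx, ret = -1;

            idx = srcu_read_lock(&kvm->irq_srcu);
            irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
            if (gsi < irq_rt->nr_rt_entries)
                    ret = 0;   /* ... walk irq_rt->map[gsi] and deliver ... */
            srcu_read_unlock(&kvm->irq_srcu, idx);
            return ret;
    }

    /* updater side */
    static void routing_table_replace(struct kvm *kvm,
                                      struct kvm_irq_routing_table *new)
    {
            struct kvm_irq_routing_table *old;

            mutex_lock(&kvm->irq_lock);
            old = kvm->irq_routing;
            rcu_assign_pointer(kvm->irq_routing, new);
            mutex_unlock(&kvm->irq_lock);

            /* waits only for readers of this SRCU domain, not for global RCU */
            synchronize_srcu(&kvm->irq_srcu);
            kfree(old);
    }

init_srcu_struct(&kvm->irq_srcu) would be done at VM creation and cleanup_srcu_struct() at teardown; call_srcu() could replace synchronize_srcu() if the updater must not block at all.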
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 06:05:37PM +0200, Michael S. Tsirkin wrote: On Tue, Nov 26, 2013 at 02:56:10PM +0200, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote: On 26/11/2013 13:40, Zhanghaoyu (A) wrote: When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? It should be rate limited somehow. Since it is guest triggerable, a guest may cause the host to allocate a lot of memory this way. The checks in __call_rcu() should handle this, I think. These keep a per-CPU counter, which can be adjusted via rcutree.blimit, which defaults to taking evasive action if more than 10K callbacks are waiting on a given CPU. Documentation/RCU/checklist.txt has: An especially important property of the synchronize_rcu() primitive is that it automatically self-limits: if grace periods are delayed for whatever reason, then the synchronize_rcu() primitive will correspondingly delay updates. In contrast, code using call_rcu() should explicitly limit update rate in cases where grace periods are delayed, as failing to do so can result in excessive realtime latencies or even OOM conditions. -- Gleb.
Re: [PATCH 0/9] kvm-unit-tests/arm: initial drop
Sorry, just noticed this - you dropped me and the kvmarm list from your reply. On Wed, Nov 20, 2013 at 11:06:11PM +, María Soler Heredia wrote: Andrew Jones drjones at redhat.com writes: This series introduces arm to kvm-unit-tests. To use this you need an arm platform or simulator capable of running kvmarm and a qemu with the mach-virt patches[2], as well as the previously mentioned virtio-testdev. Hello, I have been playing with your tests for a while and I cannot seem to get them to work all right. When I run them disabling kvm on the arm-run script, they do work, but when I run them with kvm enabled they fail. This is my output: ./arm-run arm/boot.flat -smp 1 -m 256 -append 'info 0x1000 0x1000' qemu-system-arm -device virtio-testdev -display none -serial stdio -M virt -cpu cortex-a15 -enable-kvm -kernel arm/boot.flat -smp 1 -m 256 -append info 0x1000 0x1000 kvm [1252]: load/store instruction decoding not implemented error: kvm run failed Function not implemented The above errors come from the kernel and qemu. It's easy to see under what condition you would hit them, but it's not clear to me why that condition is present for you. ./arm-run: line 16: 1251 Aborted $command $@ Return value from qemu: 134 FAIL boot_info ./arm-run arm/boot.flat -smp 1 -append 'vectors' qemu-system-arm -device virtio-testdev -display none -serial stdio -M virt -cpu cortex-a15 -enable-kvm -kernel arm/boot.flat -smp 1 -append vectors kvm [1257]: load/store instruction decoding not implemented error: kvm run failed Function not implemented ./arm-run: line 16: 1256 Aborted $command $@ Return value from qemu: 134 FAIL boot_vectors I am using FastModels Model Debugger version 8.2.028, with a model of this characteristics: Model: -- Model name: ARM_Cortex-A15 Instance: cluster.cpu0 Using CADI 2.0 interface revision 0. Version: 8.2.72 Generated by Core Generator: No Needs SimGen License: No So far I've only tested on real hardware. So this could be the difference. running the latest stable linux release and qemu-devel's latest qemu with the patches indicated here https://lists.gnu.org/archive/html/qemu-devel/2013-10/msg02428.html plus [1] http://lists.nongnu.org/archive/html/qemu-devel/2013-10/msg01815.html I tested the instalation by running a linux with the same setup using this call: qemu-system-arm \ -display none \ -enable-kvm \ -kernel zImage\ -m 128 -M virt -cpu cortex-a15 \ -drive if=none,file=linux.img,id=fs \ -device virtio-blk-device,drive=fs As I said, the tests pass if the kvm is not enabled and fail otherwise. I have added a few printfs for debugging and I can tell that the code in boot.c runs ok, but then when virtio_testdev is called (from virtio_testdev_exit) the execution throws an exception (more specifically the line *tdp++ = cpu_to_le32(va_arg(va, unsigned)); inside the first while. Hmm, even more confusing, as this isn't the first mmio access. I am not used to sending emails to this kind of list, so I don't know if I am being too specific, too little or maybe not even giving the right information. Please tell me what else you need and if you can help me solve this problem. Your details are good, but instead of just stating 'latest' for your kernel and qemu versions, please give the exact version numbers. I've been busy with other things lately, but I'm due to post a v2 of this series. I should be able to finish that off this week. When I do, I'll see if I can test it over FastModel as well this time. Thanks for starting to poke at this! 
drew -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
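[Editorial note] On the "load/store instruction decoding not implemented" error in the message above: one plausible cause (an assumption, not confirmed in this thread) is an MMIO store whose instruction carries no decode syndrome, for example a store with register writeback that a compiler may emit for an expression like "*tdp++ = ...", which KVM/ARM refuses to emulate. A hedged sketch of the usual defence, funnelling MMIO through an accessor that forces a decodable plain store, is below; names are invented for illustration.

    #include <stdint.h>

    /* Force a plain "str Rt, [Rn]" (no writeback), which KVM/ARM can decode. */
    static inline void mmio_write32(volatile void *addr, uint32_t val)
    {
            asm volatile("str %0, [%1]" : : "r" (val), "r" (addr) : "memory");
    }

    static void testdev_send(volatile uint32_t *testdev_base,
                             const uint32_t *args, int nargs)
    {
            int i;

            /* index explicitly instead of post-incrementing the MMIO pointer */
            for (i = 0; i < nargs; i++)
                    mmio_write32((volatile uint32_t *)testdev_base + i, args[i]);
    }

Whether the model versus real hardware takes a different path here is still an open question in the thread; this only illustrates the class of instruction the kernel message complains about.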
Re: Elvis upstreaming plan
Razya Ladelsky ra...@il.ibm.com writes: Hi all, I am Razya Ladelsky, I work at IBM Haifa virtualization team, which developed Elvis, presented by Abel Gordon at the last KVM forum: ELVIS video: https://www.youtube.com/watch?v=9EyweibHfEs ELVIS slides: https://drive.google.com/file/d/0BzyAwvVlQckeQmpnOHM5SnB5UVE According to the discussions that took place at the forum, upstreaming some of the Elvis approaches seems to be a good idea, which we would like to pursue. Our plan for the first patches is the following: 1.Shared vhost thread between mutiple devices This patch creates a worker thread and worker queue shared across multiple virtio devices We would like to modify the patch posted in https://github.com/abelg/virtual_io_acceleration/commit/3dc6a3ce7bcbe87363c2df8a6b6fee0c14615766 to limit a vhost thread to serve multiple devices only if they belong to the same VM as Paolo suggested to avoid isolation or cgroups concerns. Another modification is related to the creation and removal of vhost threads, which will be discussed next. I think this is an exceptionally bad idea. We shouldn't throw away isolation without exhausting every other possibility. We've seen very positive results from adding threads. We should also look at scheduling. Once you are scheduling multiple guests in a single vhost device, you now create a whole new class of DoS attacks in the best case scenario. 2. Sysfs mechanism to add and remove vhost threads This patch allows us to add and remove vhost threads dynamically. A simpler way to control the creation of vhost threads is statically determining the maximum number of virtio devices per worker via a kernel module parameter (which is the way the previously mentioned patch is currently implemented) I'd like to ask for advice here about the more preferable way to go: Although having the sysfs mechanism provides more flexibility, it may be a good idea to start with a simple static parameter, and have the first patches as simple as possible. What do you think? 3.Add virtqueue polling mode to vhost Have the vhost thread poll the virtqueues with high I/O rate for new buffers , and avoid asking the guest to kick us. https://github.com/abelg/virtual_io_acceleration/commit/26616133fafb7855cc80fac070b0572fd1aaf5d0 Ack on this. Regards, Anthony Liguori 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ac14206ea56939ecc3608dc5f978b86fa322e7b0 5. Add heuristics to improve I/O scheduling This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. https://github.com/abelg/virtual_io_acceleration/commit/f6a4f1a5d6b82dc754e8af8af327b8d0f043dc4d This patch improves the handling of the requests by the vhost thread, but could perhaps be delayed to a later time , and not submitted as one of the first Elvis patches. I'd love to hear some comments about whether this patch needs to be part of the first submission. Any other feedback on this plan will be appreciated, Thank you, Razya -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Elvis upstreaming plan
Anthony Liguori anth...@codemonkey.ws wrote on 26/11/2013 08:05:00 PM: Razya Ladelsky ra...@il.ibm.com writes: Hi all, I am Razya Ladelsky, I work at IBM Haifa virtualization team, which developed Elvis, presented by Abel Gordon at the last KVM forum: ELVIS video: https://www.youtube.com/watch?v=9EyweibHfEs ELVIS slides: https://drive.google.com/file/d/0BzyAwvVlQckeQmpnOHM5SnB5UVE According to the discussions that took place at the forum, upstreaming some of the Elvis approaches seems to be a good idea, which we would like to pursue. Our plan for the first patches is the following: 1. Shared vhost thread between multiple devices This patch creates a worker thread and worker queue shared across multiple virtio devices. We would like to modify the patch posted in https://github.com/abelg/virtual_io_acceleration/commit/3dc6a3ce7bcbe87363c2df8a6b6fee0c14615766 to limit a vhost thread to serve multiple devices only if they belong to the same VM as Paolo suggested to avoid isolation or cgroups concerns. Another modification is related to the creation and removal of vhost threads, which will be discussed next. I think this is an exceptionally bad idea. We shouldn't throw away isolation without exhausting every other possibility. Seems you have missed the important details here. Anthony, we are aware you are concerned about isolation and you believe we should not share a single vhost thread across multiple VMs. That's why Razya proposed to change the patch so we will serve multiple virtio devices using a single vhost thread only if the devices belong to the same VM. This series of patches will not allow two different VMs to share the same vhost thread. So, I don't see why this will be throwing away isolation and why this could be an exceptionally bad idea. By the way, I remember that during the KVM forum a similar approach of having a single data plane thread for many devices was discussed. We've seen very positive results from adding threads. We should also look at scheduling. ...and we have also seen exceptionally negative results from adding threads, both for vhost and data-plane. If you have a lot of idle time/cores then it makes sense to run multiple threads. But IMHO in many scenarios you don't have a lot of idle time/cores... and if you have them you would probably prefer to run more VMs/VCPUs. Hosting a single SMP VM when you have enough physical cores to run all the VCPU threads and the I/O threads is not a realistic scenario. That's why we are proposing to implement a mechanism that will enable the management stack to configure 1 thread per I/O device (as it is today) or 1 thread for many I/O devices (belonging to the same VM). Once you are scheduling multiple guests in a single vhost device, you now create a whole new class of DoS attacks in the best case scenario. Again, we are NOT proposing to schedule multiple guests in a single vhost thread. We are proposing to schedule multiple devices belonging to the same guest in a single (or multiple) vhost thread/s. 2. Sysfs mechanism to add and remove vhost threads This patch allows us to add and remove vhost threads dynamically. 
A simpler way to control the creation of vhost threads is statically determining the maximum number of virtio devices per worker via a kernel module parameter (which is the way the previously mentioned patch is currently implemented) I'd like to ask for advice here about the more preferable way to go: Although having the sysfs mechanism provides more flexibility, it may be a good idea to start with a simple static parameter, and have the first patches as simple as possible. What do you think? 3.Add virtqueue polling mode to vhost Have the vhost thread poll the virtqueues with high I/O rate for new buffers , and avoid asking the guest to kick us. https://github.com/abelg/virtual_io_acceleration/commit/ 26616133fafb7855cc80fac070b0572fd1aaf5d0 Ack on this. :) Regards, Abel. Regards, Anthony Liguori 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ ac14206ea56939ecc3608dc5f978b86fa322e7b0 5. Add heuristics to improve I/O scheduling This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. https://github.com/abelg/virtual_io_acceleration/commit/ f6a4f1a5d6b82dc754e8af8af327b8d0f043dc4d This patch improves the handling of the requests by the vhost thread, but could perhaps be delayed to a later time , and not submitted as one of the first Elvis patches. I'd love to hear some comments about whether this patch needs to be
Re: [PATCHv2] KVM: optimize apic interrupt delivery
On Tue, Nov 26, 2013 at 06:24:13PM +0200, Michael S. Tsirkin wrote: On Wed, Sep 12, 2012 at 08:13:54AM -0700, Paul E. McKenney wrote: On Wed, Sep 12, 2012 at 03:44:26PM +0300, Gleb Natapov wrote: On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote: On 09/12/2012 03:34 PM, Gleb Natapov wrote: On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote: On 09/12/2012 04:03 AM, Paul E. McKenney wrote: Paul, I'd like to check something with you here: this function can be triggered by userspace, any number of times; we allocate a 2K chunk of memory that is later freed by kfree_rcu. Is there a risk of DOS if RCU is delayed while lots of memory is queued up in this way? If yes, is this a generic problem with kfree_rcu that should be addressed in the core kernel? There is indeed a risk. In our case it's a 2K object. Is it a practical risk? How many kfree_rcu()s per second can a given user cause to happen? Not much more than a few hundred thousand per second per process (normal operation is zero). I managed to do 21466 per second. Strange, why so slow? Because ftrace buffer overflows :) With bigger buffer I get 169940. Ah, good, should not be a problem. In contrast, if you ran kfree_rcu() in a tight loop, you could probably do in excess of 100M per CPU per second. Now -that- might be a problem. Well, it -might- be a problem if you somehow figured out how to allocate memory that quickly in a steady-state manner. ;-) Good idea. Michael, it should be easy to modify kvm-unit-tests to write to the APIC ID register in a loop. I did. Memory consumption does not grow on an otherwise idle host. Very good -- the checks in __call_rcu(), which is common code invoked by kfree_rcu(), seem to be doing their job, then. These do keep a per-CPU counter, which can be adjusted via rcutree.blimit, which defaults to taking evasive action if more than 10K callbacks are waiting on a given CPU. My concern was that you might be overrunning that limit in way less than a grace period (as in about a hundred microseconds). My concern was of course unfounded -- it takes several grace periods to push 10K callbacks through. Thanx, Paul Gleb noted that Documentation/RCU/checklist.txt has this text: An especially important property of the synchronize_rcu() primitive is that it automatically self-limits: if grace periods are delayed for whatever reason, then the synchronize_rcu() primitive will correspondingly delay updates. In contrast, code using call_rcu() should explicitly limit update rate in cases where grace periods are delayed, as failing to do so can result in excessive realtime latencies or even OOM conditions. If call_rcu is self-limiting maybe this should be documented ... It would be more accurate to say that it takes some measures to limit the damage -- you can overwhelm these measures if you try hard enough. And I guess I could say something to that effect. ;-) Thanx, Paul -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
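[Editorial note] The stress test Gleb and Michael describe above (hammering the APIC ID register so the host keeps rebuilding and kfree_rcu()-ing its apic map) could look roughly like the kvm-unit-tests sketch below. The helper names and headers (apic_write()/apic_read(), APIC_ID, libcflat.h) are assumed to match the kvm-unit-tests x86 library; treat this as an illustration, not the test Michael actually ran.

    #include "libcflat.h"
    #include "apic.h"

    int main(void)
    {
            unsigned long i;
            u32 orig = apic_read(APIC_ID);

            /* Each write should force the host to rebuild its apic map. */
            for (i = 0; i < (1ul << 24); i++)
                    apic_write(APIC_ID, orig ^ ((u32)(i & 1) << 24));

            apic_write(APIC_ID, orig);
            printf("done, %lu APIC ID writes\n", i);
            return 0;
    }

While this spins in the guest, host memory usage can be watched to confirm that __call_rcu()'s batching keeps the queued 2K objects bounded.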
Re: Elvis upstreaming plan
On Tue, Nov 26, 2013 at 08:53:47PM +0200, Abel Gordon wrote: Anthony Liguori anth...@codemonkey.ws wrote on 26/11/2013 08:05:00 PM: Razya Ladelsky ra...@il.ibm.com writes: Hi all, I am Razya Ladelsky, I work at IBM Haifa virtualization team, which developed Elvis, presented by Abel Gordon at the last KVM forum: ELVIS video: https://www.youtube.com/watch?v=9EyweibHfEs ELVIS slides: https://drive.google.com/file/d/0BzyAwvVlQckeQmpnOHM5SnB5UVE According to the discussions that took place at the forum, upstreaming some of the Elvis approaches seems to be a good idea, which we would like to pursue. Our plan for the first patches is the following: 1.Shared vhost thread between mutiple devices This patch creates a worker thread and worker queue shared across multiple virtio devices We would like to modify the patch posted in https://github.com/abelg/virtual_io_acceleration/commit/ 3dc6a3ce7bcbe87363c2df8a6b6fee0c14615766 to limit a vhost thread to serve multiple devices only if they belong to the same VM as Paolo suggested to avoid isolation or cgroups concerns. Another modification is related to the creation and removal of vhost threads, which will be discussed next. I think this is an exceptionally bad idea. We shouldn't throw away isolation without exhausting every other possibility. Seems you have missed the important details here. Anthony, we are aware you are concerned about isolation and you believe we should not share a single vhost thread across multiple VMs. That's why Razya proposed to change the patch so we will serve multiple virtio devices using a single vhost thread only if the devices belong to the same VM. This series of patches will not allow two different VMs to share the same vhost thread. So, I don't see why this will be throwing away isolation and why this could be a exceptionally bad idea. By the way, I remember that during the KVM forum a similar approach of having a single data plane thread for many devices was discussed We've seen very positive results from adding threads. We should also look at scheduling. ...and we have also seen exceptionally negative results from adding threads, both for vhost and data-plane. If you have lot of idle time/cores then it makes sense to run multiple threads. But IMHO in many scenarios you don't have lot of idle time/cores.. and if you have them you would probably prefer to run more VMs/VCPUshosting a single SMP VM when you have enough physical cores to run all the VCPU threads and the I/O threads is not a realistic scenario. That's why we are proposing to implement a mechanism that will enable the management stack to configure 1 thread per I/O device (as it is today) or 1 thread for many I/O devices (belonging to the same VM). Once you are scheduling multiple guests in a single vhost device, you now create a whole new class of DoS attacks in the best case scenario. Again, we are NOT proposing to schedule multiple guests in a single vhost thread. We are proposing to schedule multiple devices belonging to the same guest in a single (or multiple) vhost thread/s. I guess a question then becomes why have multiple devices? 2. Sysfs mechanism to add and remove vhost threads This patch allows us to add and remove vhost threads dynamically. 
A simpler way to control the creation of vhost threads is statically determining the maximum number of virtio devices per worker via a kernel module parameter (which is the way the previously mentioned patch is currently implemented) I'd like to ask for advice here about the more preferable way to go: Although having the sysfs mechanism provides more flexibility, it may be a good idea to start with a simple static parameter, and have the first patches as simple as possible. What do you think? 3.Add virtqueue polling mode to vhost Have the vhost thread poll the virtqueues with high I/O rate for new buffers , and avoid asking the guest to kick us. https://github.com/abelg/virtual_io_acceleration/commit/ 26616133fafb7855cc80fac070b0572fd1aaf5d0 Ack on this. :) Regards, Abel. Regards, Anthony Liguori 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ ac14206ea56939ecc3608dc5f978b86fa322e7b0 5. Add heuristics to improve I/O scheduling This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. https://github.com/abelg/virtual_io_acceleration/commit/
Re: Elvis upstreaming plan
Razya Ladelsky ra...@il.ibm.com writes: Hi all, I am Razya Ladelsky, I work at IBM Haifa virtualization team, which developed Elvis, presented by Abel Gordon at the last KVM forum: ELVIS video: https://www.youtube.com/watch?v=9EyweibHfEs ELVIS slides: https://drive.google.com/file/d/0BzyAwvVlQckeQmpnOHM5SnB5UVE According to the discussions that took place at the forum, upstreaming some of the Elvis approaches seems to be a good idea, which we would like to pursue. Our plan for the first patches is the following: 1.Shared vhost thread between mutiple devices This patch creates a worker thread and worker queue shared across multiple virtio devices We would like to modify the patch posted in https://github.com/abelg/virtual_io_acceleration/commit/3dc6a3ce7bcbe87363c2df8a6b6fee0c14615766 to limit a vhost thread to serve multiple devices only if they belong to the same VM as Paolo suggested to avoid isolation or cgroups concerns. Another modification is related to the creation and removal of vhost threads, which will be discussed next. 2. Sysfs mechanism to add and remove vhost threads This patch allows us to add and remove vhost threads dynamically. A simpler way to control the creation of vhost threads is statically determining the maximum number of virtio devices per worker via a kernel module parameter (which is the way the previously mentioned patch is currently implemented) Does the sysfs interface aim to let the _user_ control the maximum number of devices per vhost thread or/and let the user create and destroy worker threads at will ? Setting the limit on the number of devices makes sense but I am not sure if there is any reason to actually expose an interface to create or destroy workers. Also, it might be worthwhile to think if it's better to just let the worker thread stay around (hoping it might be used again in the future) rather then destroying it.. I'd like to ask for advice here about the more preferable way to go: Although having the sysfs mechanism provides more flexibility, it may be a good idea to start with a simple static parameter, and have the first patches as simple as possible. What do you think? I am actually inclined more towards a static limit. I think that in a typical setup, the user will set this for his/her environment just once at load time and forget about it. Bandan 3.Add virtqueue polling mode to vhost Have the vhost thread poll the virtqueues with high I/O rate for new buffers , and avoid asking the guest to kick us. https://github.com/abelg/virtual_io_acceleration/commit/26616133fafb7855cc80fac070b0572fd1aaf5d0 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ac14206ea56939ecc3608dc5f978b86fa322e7b0 5. Add heuristics to improve I/O scheduling This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. https://github.com/abelg/virtual_io_acceleration/commit/f6a4f1a5d6b82dc754e8af8af327b8d0f043dc4d This patch improves the handling of the requests by the vhost thread, but could perhaps be delayed to a later time , and not submitted as one of the first Elvis patches. I'd love to hear some comments about whether this patch needs to be part of the first submission. 
Any other feedback on this plan will be appreciated, Thank you, Razya -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
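[Editorial note] The "simple static parameter" option Bandan leans towards above could be as small as the sketch below: a read-only module parameter that caps how many virtio devices one vhost worker may serve, consulted when a device is attached. The names are invented for illustration and this is not the actual Elvis patch.

    #include <linux/module.h>
    #include <linux/moduleparam.h>

    static unsigned int devs_per_worker = 1;    /* 1 == today's thread-per-device */
    module_param(devs_per_worker, uint, 0444);
    MODULE_PARM_DESC(devs_per_worker,
                     "Maximum virtio devices served by one vhost worker (same VM only)");

    struct demo_vhost_worker {
            struct task_struct *task;
            unsigned int ndevs;
            /* ... work list, lock ... */
    };

    /* Pick an existing worker of this VM with a free slot, else ask for a new one. */
    static struct demo_vhost_worker *
    demo_pick_worker(struct demo_vhost_worker **workers, int nworkers)
    {
            int i;

            for (i = 0; i < nworkers; i++)
                    if (workers[i]->ndevs < devs_per_worker)
                            return workers[i];
            return NULL;        /* caller creates a new worker thread */
    }

A sysfs knob could later replace or complement the parameter without changing this attach-time decision.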
Re: [PATCHv2] KVM: optimize apic interrupt delivery
On Tue, Nov 26, 2013 at 06:24:13PM +0200, Michael S. Tsirkin wrote: On Wed, Sep 12, 2012 at 08:13:54AM -0700, Paul E. McKenney wrote: On Wed, Sep 12, 2012 at 03:44:26PM +0300, Gleb Natapov wrote: On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote: On 09/12/2012 03:34 PM, Gleb Natapov wrote: On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote: On 09/12/2012 04:03 AM, Paul E. McKenney wrote: Paul, I'd like to check something with you here: this function can be triggered by userspace, any number of times; we allocate a 2K chunk of memory that is later freed by kfree_rcu. Is there a risk of DOS if RCU is delayed while lots of memory is queued up in this way? If yes, is this a generic problem with kfree_rcu that should be addressed in the core kernel? There is indeed a risk. In our case it's a 2K object. Is it a practical risk? How many kfree_rcu()s per second can a given user cause to happen? Not much more than a few hundred thousand per second per process (normal operation is zero). I managed to do 21466 per second. Strange, why so slow? Because ftrace buffer overflows :) With bigger buffer I get 169940. Ah, good, should not be a problem. In contrast, if you ran kfree_rcu() in a tight loop, you could probably do in excess of 100M per CPU per second. Now -that- might be a problem. Well, it -might- be a problem if you somehow figured out how to allocate memory that quickly in a steady-state manner. ;-) Good idea. Michael, it should be easy to modify kvm-unit-tests to write to the APIC ID register in a loop. I did. Memory consumption does not grow on an otherwise idle host. Very good -- the checks in __call_rcu(), which is common code invoked by kfree_rcu(), seem to be doing their job, then. These do keep a per-CPU counter, which can be adjusted via rcutree.blimit, which defaults to taking evasive action if more than 10K callbacks are waiting on a given CPU. My concern was that you might be overrunning that limit in way less than a grace period (as in about a hundred microseconds). My concern was of course unfounded -- it takes several grace periods to push 10K callbacks through. Thanx, Paul Gleb noted that Documentation/RCU/checklist.txt has this text: An especially important property of the synchronize_rcu() primitive is that it automatically self-limits: if grace periods are delayed for whatever reason, then the synchronize_rcu() primitive will correspondingly delay updates. In contrast, code using call_rcu() should explicitly limit update rate in cases where grace periods are delayed, as failing to do so can result in excessive realtime latencies or even OOM conditions. If call_rcu is self-limiting maybe this should be documented ... The documentation should be fixed, rather, to not mention that call_rcu() must be rate-limited by the user. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote: On 11/25/2013 10:23 PM, Marcelo Tosatti wrote: On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote: On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote: On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote: snip complicated stuff about parent_pte I'm not really following, but note that parent_pte predates EPT (and the use of rcu in kvm), so all the complexity that is the result of trying to pack as many list entries into a cache line can be dropped. Most setups now would have exactly one list entry, which is handled specially antyway. Alternatively, the trick of storing multiple entries in one list entry can be moved to generic code, it may be useful to others. Yes, can the lockless list walking code be transformed into generic single-linked list walking? So the correctness can be verified independently, and KVM becomes a simple user of that interface. I'am afraid the signle-entry list is not so good as we expected. In my experience, there're too many entries on rmap, more than 300 sometimes. (consider a case that a lib shared by all processes). single linked list was about moving singly-linked lockless walking to generic code. http://www.spinics.net/lists/linux-usb/msg39643.html http://marc.info/?l=linux-kernelm=103305635013575w=3 The simpler version is to maintain lockless walk on depth-1 rmap entries (and grab the lock once depth-2 entry is found). I still think rmap-lockless is more graceful: soft mmu can get benefit from it also it is promising to be used in some mmu-notify functions. :) OK. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
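[Editorial note] For readers unfamiliar with the "generic single-linked lockless walking" being discussed, the sketch below shows the standard hlist_nulls pattern from Documentation/RCU/rculist_nulls.txt that the rmap code is modelled on: every bucket ends in a distinct "nulls" marker, so a lockless reader that got carried onto another list by a concurrent delete-and-reinsert sees the wrong marker and restarts. Names with demo_ are invented; only the pattern is real.

    #include <linux/rculist_nulls.h>
    #include <linux/rcupdate.h>

    struct demo_entry {
            struct hlist_nulls_node node;
            unsigned long key;
    };

    static struct demo_entry *demo_lookup(struct hlist_nulls_head *head,
                                          unsigned long key,
                                          unsigned long bucket_id)
    {
            struct demo_entry *e;
            struct hlist_nulls_node *n;

    restart:
            rcu_read_lock();
            hlist_nulls_for_each_entry_rcu(e, n, head, node) {
                    if (e->key == key) {
                            rcu_read_unlock();
                            return e;       /* caller revalidates under the lock */
                    }
            }
            /*
             * Wrong terminating nulls value: a deleted entry moved us onto a
             * different bucket, so restart.  Note the thread's point: there is
             * no termination guarantee if a writer keeps forcing restarts.
             */
            if (get_nulls_value(n) != bucket_id) {
                    rcu_read_unlock();
                    goto restart;
            }
            rcu_read_unlock();
            return NULL;
    }

Making this walk generic, as Marcelo suggests, would let its correctness be argued once instead of inside the KVM rmap code.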
Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote: On 11/26/2013 02:12 AM, Marcelo Tosatti wrote: On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote: Also, there is no guarantee of termination (as long as sptes are deleted with the correct timing). BTW, can't see any guarantee of termination for rculist nulls either (a writer can race with a lockless reader indefinitely, restarting the lockless walk every time). Hmm, that can be avoided by checking the dirty-bitmap before rewalk, that means, if the dirty-bitmap has been set during lockless write-protection, it's unnecessary to write-protect its sptes. Your idea? This idea is based on the fact that the number of rmap entries is limited by RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap, we can break the rewalk at once; in the case of deleting, we can only rewalk RMAP_RECYCLE_THRESHOLD times. Please explain in more detail. Okay. My proposal is like this: pte_list_walk_lockless() { restart: + if (__test_bit(slot->arch.dirty_bitmap, gfn-index)) + return; code-doing-lockless-walking; .. } Before doing the lockless walking, we check the dirty-bitmap first; if it is set we can simply skip write-protection for the gfn, that is the case that a new spte is being added into the rmap while we access the rmap locklessly. The dirty bit could be set after the check. For the case of deleting an spte from the rmap, the number of entries is limited by RMAP_RECYCLE_THRESHOLD, so that is not endless. It can shrink and grow while the lockless walk is performed. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Elvis upstreaming plan
On 11/24/2013 05:22 PM, Razya Ladelsky wrote: Hi all, I am Razya Ladelsky, I work at IBM Haifa virtualization team, which developed Elvis, presented by Abel Gordon at the last KVM forum: ELVIS video: https://www.youtube.com/watch?v=9EyweibHfEs ELVIS slides: https://drive.google.com/file/d/0BzyAwvVlQckeQmpnOHM5SnB5UVE According to the discussions that took place at the forum, upstreaming some of the Elvis approaches seems to be a good idea, which we would like to pursue. Our plan for the first patches is the following: 1.Shared vhost thread between mutiple devices This patch creates a worker thread and worker queue shared across multiple virtio devices We would like to modify the patch posted in https://github.com/abelg/virtual_io_acceleration/commit/3dc6a3ce7bcbe87363c2df8a6b6fee0c14615766 to limit a vhost thread to serve multiple devices only if they belong to the same VM as Paolo suggested to avoid isolation or cgroups concerns. Another modification is related to the creation and removal of vhost threads, which will be discussed next. 2. Sysfs mechanism to add and remove vhost threads This patch allows us to add and remove vhost threads dynamically. A simpler way to control the creation of vhost threads is statically determining the maximum number of virtio devices per worker via a kernel module parameter (which is the way the previously mentioned patch is currently implemented) Any chance we can re-use the cwmq instead of inventing another mechanism? Looks like there're lots of function duplication here. Bandan has an RFC to do this. I'd like to ask for advice here about the more preferable way to go: Although having the sysfs mechanism provides more flexibility, it may be a good idea to start with a simple static parameter, and have the first patches as simple as possible. What do you think? 3.Add virtqueue polling mode to vhost Have the vhost thread poll the virtqueues with high I/O rate for new buffers , and avoid asking the guest to kick us. https://github.com/abelg/virtual_io_acceleration/commit/26616133fafb7855cc80fac070b0572fd1aaf5d0 Maybe we can make poll_stop_idle adaptive which may help the light load case. Consider guest is often slow than vhost, if we just have one or two vms, polling too much may waste cpu in this case. 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ac14206ea56939ecc3608dc5f978b86fa322e7b0 How about using trace points instead? Besides statistics, it can also help more in debugging. 5. Add heuristics to improve I/O scheduling This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. https://github.com/abelg/virtual_io_acceleration/commit/f6a4f1a5d6b82dc754e8af8af327b8d0f043dc4d This patch improves the handling of the requests by the vhost thread, but could perhaps be delayed to a later time , and not submitted as one of the first Elvis patches. I'd love to hear some comments about whether this patch needs to be part of the first submission. 
Any other feedback on this plan will be appreciated, Thank you, Razya -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Elvis upstreaming plan
On Wed, Nov 27, 2013 at 10:49:20AM +0800, Jason Wang wrote: 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ac14206ea56939ecc3608dc5f978b86fa322e7b0 How about using trace points instead? Besides statistics, it can also help more in debugging. Definitely. kvm_stats moved to ftrace a long time ago. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
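[Editorial note] The tracepoint alternative Jason and Gleb suggest above could be declared roughly as below, in the usual trace header pattern; the event and field names (vhost_vq_poll and friends) are invented for illustration, not an existing vhost tracepoint. Instead of debugfs counters, the worker emits one event per polling pass and ftrace/perf aggregate the data.

    #undef TRACE_SYSTEM
    #define TRACE_SYSTEM vhost

    #if !defined(_TRACE_VHOST_DEMO_H) || defined(TRACE_HEADER_MULTI_READ)
    #define _TRACE_VHOST_DEMO_H

    #include <linux/tracepoint.h>

    TRACE_EVENT(vhost_vq_poll,
            TP_PROTO(unsigned int vq_id, unsigned int reqs_handled, bool kicked),
            TP_ARGS(vq_id, reqs_handled, kicked),
            TP_STRUCT__entry(
                    __field(unsigned int, vq_id)
                    __field(unsigned int, reqs_handled)
                    __field(bool, kicked)
            ),
            TP_fast_assign(
                    __entry->vq_id = vq_id;
                    __entry->reqs_handled = reqs_handled;
                    __entry->kicked = kicked;
            ),
            TP_printk("vq=%u reqs=%u kicked=%d",
                      __entry->vq_id, __entry->reqs_handled, __entry->kicked)
    );

    #endif /* _TRACE_VHOST_DEMO_H */

    /* This part must be outside protection */
    #include <trace/define_trace.h>

The worker would then call trace_vhost_vq_poll(vq_id, n, kicked) in its loop, and the data shows up under /sys/kernel/debug/tracing without any custom vhost_stat script.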
Re: Elvis upstreaming plan
Hi, Razya is out for a few days, so I will try to answer the questions as well as I can: Michael S. Tsirkin m...@redhat.com wrote on 26/11/2013 11:11:57 PM: From: Michael S. Tsirkin m...@redhat.com To: Abel Gordon/Haifa/IBM@IBMIL, Cc: Anthony Liguori anth...@codemonkey.ws, abel.gor...@gmail.com, as...@redhat.com, digitale...@google.com, Eran Raichstein/Haifa/IBM@IBMIL, g...@redhat.com, jasow...@redhat.com, Joel Nider/Haifa/IBM@IBMIL, kvm@vger.kernel.org, pbonz...@redhat.com, Razya Ladelsky/Haifa/IBM@IBMIL Date: 27/11/2013 01:08 AM Subject: Re: Elvis upstreaming plan On Tue, Nov 26, 2013 at 08:53:47PM +0200, Abel Gordon wrote: Anthony Liguori anth...@codemonkey.ws wrote on 26/11/2013 08:05:00 PM: Razya Ladelsky ra...@il.ibm.com writes: [...] That's why we are proposing to implement a mechanism that will enable the management stack to configure 1 thread per I/O device (as it is today) or 1 thread for many I/O devices (belonging to the same VM). Once you are scheduling multiple guests in a single vhost device, you now create a whole new class of DoS attacks in the best case scenario. Again, we are NOT proposing to schedule multiple guests in a single vhost thread. We are proposing to schedule multiple devices belonging to the same guest in a single (or multiple) vhost thread/s. I guess a question then becomes why have multiple devices? If you mean why serve multiple devices from a single thread, the answer is that we cannot rely on the Linux scheduler, which has no knowledge of I/O queues, to do a decent job of scheduling I/O. The idea is to take over the I/O scheduling responsibilities from the kernel's thread scheduler with a more efficient I/O scheduler inside each vhost thread. By combining all of the I/O devices from the same guest (disks, network cards, etc.) in a single I/O thread, we can provide better scheduling by giving us more knowledge of the nature of the work. So now, instead of relying on the Linux scheduler to perform context switches between multiple vhost threads, we have a single thread context in which we can do the I/O scheduling more efficiently. We can closely monitor the performance needs of each queue of each device inside the vhost thread, which gives us much more information than relying on the kernel's thread scheduler. This does not expose any additional opportunities for attacks (DoS or other) than are already available, since all of the I/O traffic belongs to a single guest. You can make the argument that with low I/O loads this mechanism may not make much difference. However, when you try to maximize the utilization of your hardware (such as in a commercial scenario) this technique can gain you a large benefit. Regards, Joel Nider Virtualization Research IBM Research and Development Haifa Research Lab Phone: 972-4-829-6326 | Mobile: 972-54-3155635 E-mail: jo...@il.ibm.com Hi all, I am Razya Ladelsky, I work at IBM Haifa virtualization team, which developed Elvis, presented by Abel Gordon at the last KVM forum: ELVIS video: https://www.youtube.com/watch?v=9EyweibHfEs ELVIS slides: https://drive.google.com/file/d/0BzyAwvVlQckeQmpnOHM5SnB5UVE According to the discussions that took place at the forum, upstreaming some of the Elvis approaches seems to be a good idea, which we would like to pursue. 
Our plan for the first patches is the following: 1.Shared vhost thread between mutiple devices This patch creates a worker thread and worker queue shared across multiple virtio devices We would like to modify the patch posted in https://github.com/abelg/virtual_io_acceleration/commit/ 3dc6a3ce7bcbe87363c2df8a6b6fee0c14615766 to limit a vhost thread to serve multiple devices only if they belong to the same VM as Paolo suggested to avoid isolation or cgroups concerns. Another modification is related to the creation and removal of vhost threads, which will be discussed next. I think this is an exceptionally bad idea. We shouldn't throw away isolation without exhausting every other possibility. Seems you have missed the important details here. Anthony, we are aware you are concerned about isolation and you believe we should not share a single vhost thread across multiple VMs. That's why Razya proposed to change the patch so we will
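[Editorial note] The single-worker I/O scheduling Joel describes above boils down to a loop like the hedged sketch below: one kernel thread round-robins over every virtqueue of one VM, giving each a bounded slice and backing off when everything is idle. All names (demo_*) are invented for illustration; this is not the Elvis code.

    #include <linux/kthread.h>
    #include <linux/sched.h>

    struct demo_vq {
            bool polled;            /* high I/O rate: poll instead of waiting for kicks */
            /* ... ring state ... */
    };

    struct demo_worker {
            struct demo_vq **vqs;   /* all queues of all devices of one VM */
            int nvqs;
    };

    /* Process up to a budget of requests from this queue; returns 0 if idle. */
    static int demo_service_vq(struct demo_vq *vq)
    {
            return 0;               /* placeholder for the per-queue work */
    }

    static int demo_worker_fn(void *data)
    {
            struct demo_worker *w = data;
            int i, handled;

            while (!kthread_should_stop()) {
                    handled = 0;
                    /* round-robin: give each queue a bounded slice, then move on */
                    for (i = 0; i < w->nvqs; i++)
                            handled += demo_service_vq(w->vqs[i]);
                    if (!handled)
                            schedule_timeout_interruptible(1);  /* nothing to do, back off */
            }
            return 0;
    }

The heuristics in patch 5 of the plan would replace the fixed round-robin order and the per-queue budget with smarter decisions about when to leave a queue.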
Re: Elvis upstreaming plan
Gleb Natapov g...@redhat.com wrote on 27/11/2013 09:35:01 AM: From: Gleb Natapov g...@redhat.com To: Jason Wang jasow...@redhat.com, Cc: Razya Ladelsky/Haifa/IBM@IBMIL, kvm@vger.kernel.org, anth...@codemonkey.ws, Michael S. Tsirkin m...@redhat.com, pbonz...@redhat.com, as...@redhat.com, digitale...@google.com, abel.gor...@gmail.com, Abel Gordon/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, b...@redhat.com Date: 27/11/2013 11:35 AM Subject: Re: Elvis upstreaming plan On Wed, Nov 27, 2013 at 10:49:20AM +0800, Jason Wang wrote: 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ac14206ea56939ecc3608dc5f978b86fa322e7b0 How about using trace points instead? Besides statistics, it can also help more in debugging. Definitely. kvm_stats moved to ftrace a long time ago. -- Gleb. Ok - we will look at this newer mechanism. Joel Nider Virtualization Research IBM Research and Development Haifa Research Lab Phone: 972-4-829-6326 | Mobile: 972-54-3155635 E-mail: jo...@il.ibm.com
[Bug 65941] New: KVM Guest Solaris 10/11 - a few time in an hour time jumps for a while to 1.jan 1970
https://bugzilla.kernel.org/show_bug.cgi?id=65941 Bug ID: 65941 Summary: KVM Guest Solaris 10/11 - a few time in an hour time jumps for a while to 1.jan 1970 Product: Virtualization Version: unspecified Kernel Version: 2.6.32-431.el6.x86_64 Hardware: x86-64 OS: Linux Tree: Mainline Status: NEW Severity: high Priority: P1 Component: kvm Assignee: virtualization_...@kernel-bugs.osdl.org Reporter: s...@kosecky.eu Regression: No KVM Host: HW: HP Proliant DL360 - 2x Intel(R) Xeon(R) CPU X5570 @ 2.93GHz OS: RedHat ELS 6.5 Virtualization: # /usr/libexec/qemu-kvm --version QEMU PC emulator version 0.12.1 (qemu-kvm-0.12.1.2), Copyright (c) 2003-2008 Fabrice Bellard # KVM Guest: OS: Solaris 10, Solaris 11 64bit Startup command: /usr/libexec/qemu-kvm -name p1gdev -S -M rhel6.4.0 -enable-kvm -m 16384 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid fb28b784-0b1b-692b-92e8-d8b469bbb4e7 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/p1gdev.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/data/kvm_image/p1gdev.img,if=none,id=drive-ide0-0-0,format=raw,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=23,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:34:0e:f2,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:2 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 How to reproduce problem: - reboot Host HW and boot solaris guest - approximately after 24 hours guest will notice a few times in hour that time jumps to 1.jan 1970 for a few seconds Host and guests were updated to latest patchlevel but nothing changed. -- You are receiving this mail because: You are watching the assignee of the bug. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html