Re: [PATCH 1/1] KVM: PPC: Increase memslots to 320
On 09/12/15 04:28, Paul Mackerras wrote:
> On Wed, Nov 04, 2015 at 10:03:48AM +0100, Thomas Huth wrote:
>> Only using 32 memslots for KVM on powerpc is way too low, you can
>> nowadays hit this limit quite fast by adding a couple of PCI devices
>> and/or pluggable memory DIMMs to the guest.
>> x86 already increased the limit to 512 in total, to satisfy 256
>> pluggable DIMM slots, 3 private slots and 253 slots for other things
>> like PCI devices. On powerpc, we only have 32 pluggable DIMMs in
>
> I agree with increasing the limit. Is there a reason we have only 32
> pluggable DIMMs in QEMU on powerpc, not more? Should we be increasing
> that limit too? If so, maybe we should increase the number of memory
> slots to 512?

Hmmm ... the comment before the #define in QEMU reads:

/*
 * This defines the maximum number of DIMM slots we can have for sPAPR
 * guest. This is not defined by sPAPR but we are defining it to 32 slots
 * based on default number of slots provided by PowerPC kernel.
 */
#define SPAPR_MAX_RAM_SLOTS 32

So as far as I can see, there's indeed a possibility that we'll increase
this value once the kernel can handle more slots! So I guess you're right
and we should play safe and use more slots right from the start. I'll
send a new patch with 512 instead.

>> QEMU, not 256, so we likely do not as much slots as on x86. Thus
>
> "so we likely do not need as many slots as on x86" would be better
> English.

Oops, I'm sure that was my original intention ... anyway, thanks for
pointing it out, it's always good to get some feedback as a non-native
English speaker.

 Thomas
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 12/21] KVM: ARM64: Add reset and access handlers for PMCNTENSET and PMCNTENCLR register
On 2015/12/9 0:42, Marc Zyngier wrote:
>> +void kvm_pmu_enable_counter(struct kvm_vcpu *vcpu, u32 val, bool all_enable)
>> +{
>> +	int i;
>> +	struct kvm_pmu *pmu = &vcpu->arch.pmu;
>> +	struct kvm_pmc *pmc;
>> +
>> +	if (!all_enable)
>> +		return;
>
> You have the vcpu. Can you move the check for PMCR_EL0.E here instead of
> having it in both of the callers?

But it still needs to check PMCR_EL0.E in kvm_pmu_handle_pmcr(). When
PMCR_EL0.E == 1, it calls kvm_pmu_enable_counter(), otherwise it calls
kvm_pmu_disable_counter(). So as it checks already, just pass the result
as a parameter.

>> +	for_each_set_bit(i, (const unsigned long *)&val, ARMV8_MAX_COUNTERS) {
>
> Nonononono... If you must have to use a long, use a long. Don't cast it
> to a different type (hint: big endian).
>
>> +		pmc = &pmu->pmc[i];
>> +		if (pmc->perf_event) {
>> +			perf_event_enable(pmc->perf_event);
>> +			if (pmc->perf_event->state != PERF_EVENT_STATE_ACTIVE)
>> +				kvm_debug("fail to enable perf event\n");
>> +		}
>> +	}
>> +}
>> +
>> +/**
>> + * kvm_pmu_disable_counter - disable selected PMU counter
>> + * @vcpu: The vcpu pointer
>> + * @val: the value guest writes to PMCNTENCLR register
>> + *
>> + * Call perf_event_disable to stop counting the perf event
>> + */
>> +void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u32 val)
>> +{
>> +	int i;
>> +	struct kvm_pmu *pmu = &vcpu->arch.pmu;
>> +	struct kvm_pmc *pmc;
>> +
>
> Why are enable and disable asymmetric (handling of PMCR.E)?

To enable a counter, it needs both the PMCR_EL0.E and the corresponding
bit of PMCNTENSET_EL0 set to 1. But to disable a counter, it only needs
one of them, and when PMCR_EL0.E == 0, it disables all the counters.

Thanks,
--
Shannon
Re: [PATCH v6 14/21] KVM: ARM64: Add reset and access handlers for PMOVSSET and PMOVSCLR register
On 2015/12/9 0:59, Marc Zyngier wrote:
>> +	}
>> +
>> +	/* If all overflow bits are cleared, kick the vcpu to clear interrupt
>> +	 * pending status.
>> +	 */
>> +	if (val == 0)
>> +		kvm_vcpu_kick(vcpu);
>
> Do we really need to do so? This will be dropped on the next entry
> anyway, so I don't see the need to kick the vcpu again. Or am I missing
> something?

I thought maybe it could not set the interrupt low immediately when all
the overflow bits are cleared for some reason, so I added a kick to force
it to sync the interrupt. But as you said, I'll remove this.

Thanks,
--
Shannon
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:
> Hi Michael & Alexander:
> Thanks a lot for your comments and suggestions.

It's nice that it's appreciated, but you then go on and ignore all that
I have written here:
https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html

> We still need to support Windows guest for migration and this is why our
> patches keep all changes in the driver since it's impossible to change
> Windows kernel.

This is not a reasonable argument. It makes no sense to duplicate code
on Linux because you must duplicate code on Windows. Let's assume you
must do it in the driver on windows because windows has closed source
drivers. What does it matter? Linux can still do it as part of DMA API
and have it apply to all drivers.

> Following is my idea to do DMA tracking.
>
> Inject event to VF driver after memory iterate stage
> and before stop VCPU and then VF driver marks dirty all
> using DMA memory. The new allocated pages also need to
> be marked dirty before stopping VCPU. All dirty memory
> in this time slot will be migrated until stop-and-copy
> stage. We also need to make sure to disable VF via clearing the
> bus master enable bit for VF before migrating these memory.
>
> The dma page allocated by VF driver also needs to reserve space
> to do dummy write.

I suggested ways to do it all in the hypervisor without driver hacks, or
hide it within DMA API without need to reserve extra space. Both
approaches seem much cleaner.

--
MST
Re: [PATCH v6 07/21] KVM: ARM64: PMU: Add perf event map and introduce perf event creating function
On Wed, 9 Dec 2015 15:38:09 +0800 Shannon Zhao wrote:
>
> On 2015/12/8 23:43, Marc Zyngier wrote:
> > On 08/12/15 12:47, Shannon Zhao wrote:
> >> From: Shannon Zhao
> >> +/**
> >> + * kvm_pmu_get_counter_value - get PMU counter value
> >> + * @vcpu: The vcpu pointer
> >> + * @select_idx: The counter index
> >> + */
> >> +u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 select_idx)
> >> +{
> >> +	u64 counter, enabled, running;
> >> +	struct kvm_pmu *pmu = &vcpu->arch.pmu;
> >> +	struct kvm_pmc *pmc = &pmu->pmc[select_idx];
> >> +
> >> +	if (!vcpu_mode_is_32bit(vcpu))
> >> +		counter = vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + select_idx);
> >> +	else
> >> +		counter = vcpu_cp15(vcpu, c14_PMEVCNTR0 + select_idx);
> >> +
> >> +	if (pmc->perf_event)
> >> +		counter += perf_event_read_value(pmc->perf_event, &enabled,
> >> +						 &running);
> >> +
> >> +	return counter & pmc->bitmask;
> >
> > This one confused me for a while. Is it the case that you return
> > whatever is in the vcpu view of the counter, plus anything that perf
> > itself has counted? If so, I'd appreciate a comment here...
>
> Yes, the real counter value is the current counter value plus the value
> perf event counts. I'll add a comment.
>
> >> +}
> >> +
> >> +static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu *vcpu, u32 select_idx)
> >> +{
> >> +	if (!vcpu_mode_is_32bit(vcpu))
> >> +		return (vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMCR_E) &
> >> +		       (vcpu_sys_reg(vcpu, PMCNTENSET_EL0) >> select_idx);
> >
> > This looks wrong. Shouldn't it be:
> >
> > return ((vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMCR_E) &&
> >         (vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & (1 << select_idx)));
> >
> >> +	else
> >> +		return (vcpu_sys_reg(vcpu, c9_PMCR) & ARMV8_PMCR_E) &
> >> +		       (vcpu_sys_reg(vcpu, c9_PMCNTENSET) >> select_idx);
> >> +}
> >
> > Also, I don't really see why we need to check the 32bit version, which
> > has the exact same content.
> > > >> + > >> +static inline struct kvm_vcpu *kvm_pmc_to_vcpu(struct kvm_pmc *pmc) > >> +{ > >> + struct kvm_pmu *pmu; > >> + struct kvm_vcpu_arch *vcpu_arch; > >> + > >> + pmc -= pmc->idx; > >> + pmu = container_of(pmc, struct kvm_pmu, pmc[0]); > >> + vcpu_arch = container_of(pmu, struct kvm_vcpu_arch, pmu); > >> + return container_of(vcpu_arch, struct kvm_vcpu, arch); > >> +} > >> + > >> +/** > >> + * kvm_pmu_stop_counter - stop PMU counter > >> + * @pmc: The PMU counter pointer > >> + * > >> + * If this counter has been configured to monitor some event, release it > >> here. > >> + */ > >> +static void kvm_pmu_stop_counter(struct kvm_pmc *pmc) > >> +{ > >> + struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); > >> + u64 counter; > >> + > >> + if (pmc->perf_event) { > >> + counter = kvm_pmu_get_counter_value(vcpu, pmc->idx); > >> + if (!vcpu_mode_is_32bit(vcpu)) > >> + vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + pmc->idx) = counter; > >> + else > >> + vcpu_cp15(vcpu, c14_PMEVCNTR0 + pmc->idx) = counter; > > > > Same thing - we don't need to make a difference between 32 and 64bit. > > > So it's fine to drop all the vcpu_mode_is_32bit(vcpu) check of this > series? The only one we should take care is the PMCCNTR, right? Yes, mostly. As long as you only reason on the 64bit register set, you're pretty safe, and that in turn solves all kind of ugly endianness issues. > >> + > >> + perf_event_release_kernel(pmc->perf_event); > >> + pmc->perf_event = NULL; > >> + } > >> +} > >> + > >> +/** > >> + * kvm_pmu_set_counter_event_type - set selected counter to monitor some > >> event > >> + * @vcpu: The vcpu pointer > >> + * @data: The data guest writes to PMXEVTYPER_EL0 > >> + * @select_idx: The number of selected counter > >> + * > >> + * When OS accesses PMXEVTYPER_EL0, that means it wants to set a PMC to > >> count an > >> + * event with given hardware event number. Here we call perf_event API to > >> + * emulate this action and create a kernel perf event for it. 
> >> + */
> >> +void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u32 data,
> >> +				    u32 select_idx)
> >> +{
> >> +	struct kvm_pmu *pmu = &vcpu->arch.pmu;
> >> +	struct kvm_pmc *pmc = &pmu->pmc[select_idx];
> >> +	struct perf_event *event;
> >> +	struct perf_event_attr attr;
> >> +	u32 eventsel;
> >> +	u64 counter;
> >> +
> >> +	kvm_pmu_stop_counter(pmc);
> >
> > Wait. I didn't realize this before, but you have the vcpu right here.
> > Why don't you pass it as a parameter to kvm_pmu_stop_counter and avoid
> > the kvm_pmc_to_vcpu thing altogether?
>
> Yeah, we could pass vcpu as a parameter for this function. But the
> kvm_pmc_to_vcpu helper is also used in kvm_pmu_perf_overflow() and
> within kvm_pmu_perf_overflow it needs the pmc->idx, we couldn't pass
> vcpu as a parameter, so this helper
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/8/2015 1:12 AM, Alexander Duyck wrote:

On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu wrote:

On 12/5/2015 1:07 AM, Alexander Duyck wrote:

We still need to support Windows guest for migration and this is why our patches keep all changes in the driver since it's impossible to change Windows kernel.

That is a poor argument. I highly doubt Microsoft is interested in having to modify all of the drivers that will support direct assignment in order to support migration. They would likely request something similar to what I have in that they will want a way to do DMA tracking with minimal modification required to the drivers.

This totally depends on the NIC or other devices' vendors and they should make decision to support migration or not. If yes, they would modify driver.

Having to modify every driver that wants to support live migration is a bit much. In addition I don't see this being limited only to NIC devices. You can direct assign a number of different devices, your solution cannot be specific to NICs.

We are also adding such migration support for QAT device and so our solution will not just be limited to NIC. Now is just the beginning.

We can't limit user to only use Linux guest. So the migration feature should work for both Windows and Linux guest. If we just target to call suspend/resume during migration, the feature will be meaningless. Most cases don't want to affect user during migration a lot and so the service down time is vital. Our target is to apply SRIOV NIC passthrough to cloud service and NFV (network functions virtualization) projects which are sensitive to network performance and stability. From my opinion, we should give a chance for the device driver to implement its own migration job. Call suspend and resume callback in the driver if it doesn't care about the performance during migration.

The suspend/resume callback should be efficient in terms of time.
After all we don't want the system to stall for a long period of time when it should be either running or asleep. Having it burn cycles in a power state limbo doesn't do anyone any good. If nothing else maybe it will help to push the vendors to speed up those functions which then benefit migration and the system sleep states.

If we can benefit both migration and suspend, that would be wonderful. But migration and system pm is still different. Just for example, the driver doesn't need to put the device into deep D-status during migration and the host can do this after migration, while it's essential for system sleep. PCI configure space and interrupt config is emulated by Qemu and Qemu can migrate these configures to the new machine. The driver doesn't need to deal with such things. So I think migration still needs a different callback or different code path than device suspend/resume.

Another concern is that we have to rework PM core or PCI bus driver to call suspend/resume for passthrough devices during migration. This also blocks new feature works on the Windows.

Also you keep assuming you can keep the device running while you do the migration and you can't. You are going to corrupt the memory if you do, and you have yet to provide any means to explain how you are going to solve that.

The main problem is tracking the DMA issue. I will repost my solution in a new thread for discussion. If there is no way to mark a DMA page dirty when DMA is enabled, we have to stop DMA for a small time to do that at the last stage.

Following is my idea to do DMA tracking.

Inject event to VF driver after memory iterate stage and before stop VCPU and then VF driver marks dirty all using DMA memory. The new allocated pages also need to be marked dirty before stopping VCPU. All dirty memory in this time slot will be migrated until stop-and-copy stage. We also need to make sure to disable VF via clearing the bus master enable bit for VF before migrating these memory.
The ordering of your explanation here doesn't quite work. What needs to happen is that you have to disable DMA and then mark the pages as dirty. What the disabling of the BME does is signal to the hypervisor that the device is now stopped. The ixgbevf_suspend call already supported by the driver is almost exactly what is needed to take care of something like this.

This is why I hope to reserve a piece of space in the dma page to do dummy write. This can help to mark page dirty while not require to stop DMA and not race with DMA data.

You can't and it will still race. What concerns me is that your patches and the document you referenced earlier show a considerable lack of understanding about how DMA and device drivers work. There is a reason why device drivers have so many memory barriers and the like in them. The fact is when you have CPU and a device both accessing memory things have to be done in a very specific order and you cannot violate that. If you have a contiguous block of memory you expect the device to write into you cannot just poke a hole in
RE: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts
Hi Radim, > -Original Message- > From: Radim Krčmář [mailto:rkrc...@redhat.com] > Sent: Tuesday, November 17, 2015 3:03 AM > To: Wu, Feng> Cc: pbonz...@redhat.com; kvm@vger.kernel.org; linux-ker...@vger.kernel.org > Subject: Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted- > interrupts > > 2015-11-09 10:46+0800, Feng Wu: > > Use vector-hashing to handle lowest-priority interrupts for > > posted-interrupts. As an example, modern Intel CPUs use this > > method to handle lowest-priority interrupts. > > (I don't think it's a good idea that the algorithm differs from non-PI > lowest priority delivery. I'd make them both vector-hashing, which > would be "fun" to explain to people expecting round robin ...) > > > Signed-off-by: Feng Wu > > --- > > diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c > > +/* > > + * This routine handles lowest-priority interrupts using vector-hashing > > + * mechanism. As an example, modern Intel CPUs use this method to handle > > + * lowest-priority interrupts. > > + * > > + * Here is the details about the vector-hashing mechanism: > > + * 1. For lowest-priority interrupts, store all the possible destination > > + *vCPUs in an array. > > + * 2. Use "guest vector % max number of destination vCPUs" to find the > > right > > + *destination vCPU in the array for the lowest-priority interrupt. > > + */ > > (Is Skylake i7-6700 a modern Intel CPU? > I didn't manage to get hashing ... all interrupts always went to the > lowest APIC ID in the set :/ > Is there a simple way to verify the algorithm?) 
> > > +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm, > > + struct kvm_lapic_irq *irq) > > + > > +{ > > + unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)]; > > + unsigned int dest_vcpus = 0; > > + struct kvm_vcpu *vcpu; > > + unsigned int i, mod, idx = 0; > > + > > + vcpu = kvm_intr_vector_hashing_dest_fast(kvm, irq); > > + if (vcpu) > > + return vcpu; > > I think the rest of this function shouldn't be implemented: > - Shorthands are only for IPIs and hence don't need to be handled, > - Lowest priority physical broadcast is not supported, > - Lowest priority cluster logical broadcast is not supported, > - No point in optimizing mixed xAPIC and x2APIC mode, I read your comments again, and don't quite understand why we don't need PI optimization for mixed xAPIC and x2APIC mode. BTW, can we have mixed flat and cluster mode? Thanks, Feng > - The rest is handled by kvm_intr_vector_hashing_dest_fast(). >(Even lowest priority flat logical "broadcast".) > - We do the work twice when vcpu == NULL means that there is no >matching destination. > > Is there a valid case that can be resolved by going through all vcpus? 
> > > +
> > > +	memset(dest_vcpu_bitmap, 0, sizeof(dest_vcpu_bitmap));
> > > +
> > > +	kvm_for_each_vcpu(i, vcpu, kvm) {
> > > +		if (!kvm_apic_present(vcpu))
> > > +			continue;
> > > +
> > > +		if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
> > > +					 irq->dest_id, irq->dest_mode))
> > > +			continue;
> > > +
> > > +		__set_bit(vcpu->vcpu_id, dest_vcpu_bitmap);
> > > +		dest_vcpus++;
> > > +	}
> > > +
> > > +	if (dest_vcpus == 0)
> > > +		return NULL;
> > > +
> > > +	mod = irq->vector % dest_vcpus;
> > > +
> > > +	for (i = 0; i <= mod; i++) {
> > > +		idx = find_next_bit(dest_vcpu_bitmap, KVM_MAX_VCPUS, idx) + 1;
> > > +		BUG_ON(idx >= KVM_MAX_VCPUS);
> > > +	}
> > > +
> > > +	return kvm_get_vcpu(kvm, idx - 1);
> > > +}
> > > +EXPORT_SYMBOL_GPL(kvm_intr_vector_hashing_dest);
> > > +
> > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > @@ -816,6 +816,63 @@ out:
> > > +struct kvm_vcpu *kvm_intr_vector_hashing_dest_fast(struct kvm *kvm,
> > > +						   struct kvm_lapic_irq *irq)
>
> We now have three very similar functions :(
>
> kvm_irq_delivery_to_apic_fast
> kvm_intr_is_single_vcpu_fast
> kvm_intr_vector_hashing_dest_fast
>
> By utilizing the gcc optimizer, they can be merged without introducing
> many instructions to the hot path, kvm_irq_delivery_to_apic_fast.
> (I would eventually do it, so you can save time by ignoring this.)
>
> Thanks.
Re: [PATCH v6 12/21] KVM: ARM64: Add reset and access handlers for PMCNTENSET and PMCNTENCLR register
On Wed, 9 Dec 2015 16:35:58 +0800 Shannon Zhao wrote:
>
> On 2015/12/9 0:42, Marc Zyngier wrote:
> >> +void kvm_pmu_enable_counter(struct kvm_vcpu *vcpu, u32 val, bool all_enable)
> >> +{
> >> +	int i;
> >> +	struct kvm_pmu *pmu = &vcpu->arch.pmu;
> >> +	struct kvm_pmc *pmc;
> >> +
> >> +	if (!all_enable)
> >> +		return;
> > You have the vcpu. Can you move the check for PMCR_EL0.E here instead of
> > having it in both of the callers?
>
> But it still needs to check PMCR_EL0.E in kvm_pmu_handle_pmcr(). When
> PMCR_EL0.E == 1, it calls kvm_pmu_enable_counter(), otherwise it calls
> kvm_pmu_disable_counter(). So as it checks already, just pass the result
> as a parameter.

I've seen that, but it makes the code look ugly. At any rate, you might
as well not call enable_counter if PMCR.E==0. But splitting the lookup of
the bit and the test like you do is not nice at all. Making it
self-contained looks a lot better, and you don't have to think about the
caller.

> >> +	for_each_set_bit(i, (const unsigned long *)&val, ARMV8_MAX_COUNTERS) {
> > Nonononono... If you must have to use a long, use a long. Don't cast it
> > to a different type (hint: big endian).
> >
> >> +		pmc = &pmu->pmc[i];
> >> +		if (pmc->perf_event) {
> >> +			perf_event_enable(pmc->perf_event);
> >> +			if (pmc->perf_event->state != PERF_EVENT_STATE_ACTIVE)
> >> +				kvm_debug("fail to enable perf event\n");
> >> +		}
> >> +	}
> >> +}
> >> +
> >> +/**
> >> + * kvm_pmu_disable_counter - disable selected PMU counter
> >> + * @vcpu: The vcpu pointer
> >> + * @val: the value guest writes to PMCNTENCLR register
> >> + *
> >> + * Call perf_event_disable to stop counting the perf event
> >> + */
> >> +void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u32 val)
> >> +{
> >> +	int i;
> >> +	struct kvm_pmu *pmu = &vcpu->arch.pmu;
> >> +	struct kvm_pmc *pmc;
> >> +
> > Why are enable and disable asymmetric (handling of PMCR.E)?
> To enable a counter, it needs both the PMCR_EL0.E and the corresponding
> bit of PMCNTENSET_EL0 set to 1. But to disable a counter, it only needs
> one of them and when PMCR_EL0.E == 0, it disables all the counters.

OK.

M.
--
Jazz is not dead. It just smells funny.
Re: [PATCH v6 15/21] KVM: ARM64: Add reset and access handlers for PMUSERENR register
On 2015/12/9 1:03, Marc Zyngier wrote:
> On 08/12/15 12:47, Shannon Zhao wrote:
>> From: Shannon Zhao
>>
>> The reset value of PMUSERENR_EL0 is UNKNOWN, use reset_unknown.
>>
>> Signed-off-by: Shannon Zhao
>> ---
>>  arch/arm64/kvm/sys_regs.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index c830fde..80b66c0 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -880,7 +880,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>>  	  access_pmu_pmxevcntr },
>>  	/* PMUSERENR_EL0 */
>>  	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b000),
>> -	  trap_raz_wi },
>> +	  access_pmu_regs, reset_unknown, PMUSERENR_EL0 },
>
> So while the 64bit view of the register resets as unknown, a CPU
> resetting in 32bit mode resets as 0. I suggest you reset it as zero, and
> document that choice. You may have to revisit all the other registers
> that do reset as unknown for 64bit as well.

Sure.

BTW, here I didn't handle the bits of PMUSERENR which are used to
permit/forbid accessing some PMU registers from EL0. Do we need to add
a handler for that? Is there any way to get the exception level of the
access in the hypervisor?

Thanks,
--
Shannon
[PATCH] KVM: PPC: Increase memslots to 512
Only using 32 memslots for KVM on powerpc is way too low, you can
nowadays hit this limit quite fast by adding a couple of PCI devices
and/or pluggable memory DIMMs to the guest.

x86 already increased the KVM_USER_MEM_SLOTS to 509, to satisfy 256
pluggable DIMM slots, 3 private slots and 253 slots for other things
like PCI devices (i.e. resulting in 256 + 3 + 253 = 512 slots in
total). We should do something similar for powerpc, and since we do
not use private slots here, we can set the value to 512 directly.

While we're at it, also remove the KVM_MEM_SLOTS_NUM definition
from the powerpc-specific header since this gets defined in the
generic kvm_host.h header anyway.

Signed-off-by: Thomas Huth
---
 arch/powerpc/include/asm/kvm_host.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 887c259..2c96a72 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -38,8 +38,7 @@
 #define KVM_MAX_VCPUS		NR_CPUS
 #define KVM_MAX_VCORES		NR_CPUS
-#define KVM_USER_MEM_SLOTS 32
-#define KVM_MEM_SLOTS_NUM KVM_USER_MEM_SLOTS
+#define KVM_USER_MEM_SLOTS 512
 
 #ifdef CONFIG_KVM_MMIO
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
-- 
1.8.3.1
Re: [PATCH v6 15/21] KVM: ARM64: Add reset and access handlers for PMUSERENR register
On Wed, 9 Dec 2015 17:18:02 +0800 Shannon Zhao wrote:
>
> On 2015/12/9 1:03, Marc Zyngier wrote:
> > On 08/12/15 12:47, Shannon Zhao wrote:
> >> From: Shannon Zhao
> >>
> >> The reset value of PMUSERENR_EL0 is UNKNOWN, use reset_unknown.
> >>
> >> Signed-off-by: Shannon Zhao
> >> ---
> >>  arch/arm64/kvm/sys_regs.c | 5 +++--
> >>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> >> index c830fde..80b66c0 100644
> >> --- a/arch/arm64/kvm/sys_regs.c
> >> +++ b/arch/arm64/kvm/sys_regs.c
> >> @@ -880,7 +880,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
> >>  	  access_pmu_pmxevcntr },
> >>  	/* PMUSERENR_EL0 */
> >>  	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b000),
> >> -	  trap_raz_wi },
> >> +	  access_pmu_regs, reset_unknown, PMUSERENR_EL0 },
> > So while the 64bit view of the register resets as unknown, a CPU
> > resetting in 32bit mode resets as 0. I suggest you reset it as zero, and
> > document that choice. You may have to revisit all the other registers
> > that do reset as unknown for 64bit as well.
>
> Sure.
>
> BTW, here I didn't handle the bits of PMUSERENR which are used to
> permit/forbid accessing some PMU registers from EL0. Do we need to add
> a handler for that? Is there any way to get the exception level of the
> access in the hypervisor?

Ah, good point, I missed that. Yes, we need to be able to handle that.

To find out, you can use vcpu_mode_priv(), which returns true if the CPU
was in a high privilege mode (EL1 for 64bit, anything higher than USR on
32bit), and false otherwise. So far, the only user is arch/arm/kvm/perf.c.

Thanks,

M.
--
Without deviation from the norm, progress is not possible.
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote:
> On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:
>> Hi Michael & Alexander:
>> Thanks a lot for your comments and suggestions.
>
> It's nice that it's appreciated, but you then go on and ignore
> all that I have written here:
> https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html

No, I will reply it separately and according your suggestion to snip it
into 3 thread.

>> We still need to support Windows guest for migration and this is why our
>> patches keep all changes in the driver since it's impossible to change
>> Windows kernel.
>
> This is not a reasonable argument. It makes no sense to duplicate code
> on Linux because you must duplicate code on Windows. Let's assume you
> must do it in the driver on windows because windows has closed source
> drivers. What does it matter? Linux can still do it as part of DMA API
> and have it apply to all drivers.

Sure. Duplicated code should be encapsulated and make it able to reuse
by other drivers. Just like you said the dummy write part.

I meant the framework should not require to change Windows kernel code
(such as PM core or PCI bus driver) and this will block implementation on
the Windows. I think it's not problem to duplicate code in the Windows
drivers.

>> Following is my idea to do DMA tracking.
>>
>> Inject event to VF driver after memory iterate stage
>> and before stop VCPU and then VF driver marks dirty all
>> using DMA memory. The new allocated pages also need to
>> be marked dirty before stopping VCPU. All dirty memory
>> in this time slot will be migrated until stop-and-copy
>> stage. We also need to make sure to disable VF via clearing the
>> bus master enable bit for VF before migrating these memory.
>>
>> The dma page allocated by VF driver also needs to reserve space
>> to do dummy write.
>
> I suggested ways to do it all in the hypervisor without driver hacks, or
> hide it within DMA API without need to reserve extra space. Both
> approaches seem much cleaner.

This sounds reasonable.
We may discuss it in detail in a separate thread. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/9/2015 7:28 PM, Michael S. Tsirkin wrote: I remember reading that it's possible to implement a bus driver on windows if required. But basically I don't see how windows can be relevant to discussing guest driver patches. That discussion probably belongs on the qemu mailing list, not on lkml. I am not sure whether we can write a bus driver for Windows to support migration. But I think device vendors who want to support migration will improve their drivers if we provide such a framework in the hypervisor, which just needs them to change their driver.
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Wed, Dec 09, 2015 at 07:19:15PM +0800, Lan, Tianyu wrote: > On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote: > >On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote: > >>Hi Michael & Alexander: > >>Thanks a lot for your comments and suggestions. > > > >It's nice that it's appreciated, but you then go on and ignore > >all that I have written here: > >https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html > > > > No, I will reply it separately and according your suggestion to snip it into > 3 thread. > > >>We still need to support Windows guest for migration and this is why our > >>patches keep all changes in the driver since it's impossible to change > >>Windows kernel. > > > >This is not a reasonable argument. It makes no sense to duplicate code > >on Linux because you must duplicate code on Windows. Let's assume you > >must do it in the driver on windows because windows has closed source > >drivers. What does it matter? Linux can still do it as part of DMA API > >and have it apply to all drivers. > > > > Sure. Duplicated code should be encapsulated and make it able to reuse > by other drivers. Just like you said the dummy write part. > > I meant the framework should not require to change Windows kernel code > (such as PM core or PCI bus driver)and this will block implementation on > the Windows. I remember reading that it's possible to implement a bus driver on windows if required. But basically I don't see how windows can be relevant to discussing guest driver patches. That discussion probably belongs on the qemu mailing list, not on lkml. > I think it's not problem to duplicate code in the Windows drivers. > > >>Following is my idea to do DMA tracking. > >> > >>Inject event to VF driver after memory iterate stage > >>and before stop VCPU and then VF driver marks dirty all > >>using DMA memory. The new allocated pages also need to > >>be marked dirty before stopping VCPU.
All dirty memory > >>in this time slot will be migrated until stop-and-copy > >>stage. We also need to make sure to disable VF via clearing the > >>bus master enable bit for VF before migrating these memory. > >> > >>The dma page allocated by VF driver also needs to reserve space > >>to do dummy write. > > > >I suggested ways to do it all in the hypervisor without driver hacks, or > >hide it within DMA API without need to reserve extra space. Both > >approaches seem much cleaner. > > > > This sounds reasonable. We may discuss it detail in the separate thread.
Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts
2015-12-09 08:19+, Wu, Feng: >> -Original Message- >> From: Radim Krčmář [mailto:rkrc...@redhat.com] >> Sent: Tuesday, November 17, 2015 3:03 AM >> To: Wu, Feng>> Cc: pbonz...@redhat.com; kvm@vger.kernel.org; linux-ker...@vger.kernel.org >> Subject: Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted- >> interrupts >> >> 2015-11-09 10:46+0800, Feng Wu: >> > +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm, >> > +struct kvm_lapic_irq *irq) >> > + >> > +{ >> > + unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)]; >> > + unsigned int dest_vcpus = 0; >> > + struct kvm_vcpu *vcpu; >> > + unsigned int i, mod, idx = 0; >> > + >> > + vcpu = kvm_intr_vector_hashing_dest_fast(kvm, irq); >> > + if (vcpu) >> > + return vcpu; >> >> I think the rest of this function shouldn't be implemented: >> - Shorthands are only for IPIs and hence don't need to be handled, >> - Lowest priority physical broadcast is not supported, >> - Lowest priority cluster logical broadcast is not supported, >> - No point in optimizing mixed xAPIC and x2APIC mode, > > I read your comments again, and don't quite understand why we > don't need PI optimization for mixed xAPIC and x2APIC mode. There shouldn't be a non-hobbyist operating system that uses mixed mode, so the optimization would practically be dead code as all other cases are handled by kvm_intr_vector_hashing_dest_fast(). I think that having extra code would bring problems in the future -- we need to take care of it when refactoring KVM's APIC and we should also write a unit-test for this otherwise dead path. I don't think that the benefit for guests would ever balance those efforts. (Physical xAPIC+x2APIC mode is still somewhat reasonable and xAPIC CPUs start with LDR=0, which means that operating system doesn't need to utilize mixed mode, as defined by KVM, when switching to x2APIC.) > BTW, can we have mixed flat and cluster mode? 
Yes, KVM recognizes that mixed mode, but luckily, there are severe limitations. Notes below SDM section 10.6.2.2: All processors that have their APIC software enabled (using the spurious vector enable/disable bit) must have their DFRs (Destination Format Registers) programmed identically. I hope there isn't a human that would use it in good faith. (Only NMI/SMI/INIT/SIPI are delivered in software disabled mode and if the system uses cluster xAPIC, OS should set DFR before LDR, which doesn't trigger mixed mode either.)
Re: [PATCH kvm-unit-tests] x86: always inline functions called after set_exception_return
On 07/12/2015 21:36, David Matlack wrote: > set_exception_return forces exception handlers to return to a specific > address instead of returning to the instruction address pushed by the > CPU at the time of the exception. The unit tests apic.c and vmx.c use > this functionality to recover from expected exceptions. > > When using set_exception_return we have to be careful not to modify the > stack (such as by doing a function call) as triggering the exception will > likely jump us past the instructions which undo the stack manipulation > (such as a ret). To accomplish this, declare all functions called after > set_exception_return as __always_inline, so that the compiler always > inlines them. set_exception_return is generally not a great idea IMHO---thanks for looking at it. A couple years ago we discussed adding setjmp/longjmp to libcflat (http://www.spinics.net/lists/kvm/msg94159.html which is however missing a 32-bit version). Making the exceptions do a longjmp would be a much safer option. Paolo
[PATCH v3 1/4] VSOCK: Introduce virtio-vsock-common.ko
From: Asias He
This module contains the common code and header files for the following virtio-vsock and virtio-vhost kernel modules. Signed-off-by: Asias He Signed-off-by: Stefan Hajnoczi --- v3: * Remove unnecessary 3-way handshake, just do REQUEST/RESPONSE instead of REQUEST/RESPONSE/ACK * Remove SOCK_DGRAM support and focus on SOCK_STREAM first * Only allow host->guest connections (same security model as latest VMware) v2: * Fix peer_buf_alloc inheritance on child socket * Notify other side of SOCK_STREAM disconnect (fixes shutdown semantics) * Avoid recursive mutex_lock(tx_lock) for write_space (fixes deadlock) * Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants * Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants --- include/linux/virtio_vsock.h| 203 include/uapi/linux/virtio_ids.h | 1 + include/uapi/linux/virtio_vsock.h | 87 net/vmw_vsock/virtio_transport_common.c | 854 4 files changed, 1145 insertions(+) create mode 100644 include/linux/virtio_vsock.h create mode 100644 include/uapi/linux/virtio_vsock.h create mode 100644 net/vmw_vsock/virtio_transport_common.c diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h new file mode 100644 index 000..e54eb45 --- /dev/null +++ b/include/linux/virtio_vsock.h @@ -0,0 +1,203 @@ +/* + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so + * anyone can use the definitions to implement compatible drivers/servers: + * + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + *notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + *notice, this list of conditions and the following disclaimer in the + *documentation and/or other materials provided with the distribution. + * 3.
Neither the name of IBM nor the names of its contributors + *may be used to endorse or promote products derived from this software + *without specific prior written permission. + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * Copyright (C) Red Hat, Inc., 2013-2015 + * Copyright (C) Asias He , 2013 + * Copyright (C) Stefan Hajnoczi , 2015 + */ + +#ifndef _LINUX_VIRTIO_VSOCK_H +#define _LINUX_VIRTIO_VSOCK_H + +#include +#include +#include + +#define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE 128 +#define VIRTIO_VSOCK_DEFAULT_BUF_SIZE (1024 * 256) +#define VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE (1024 * 256) +#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE (1024 * 4) +#define VIRTIO_VSOCK_MAX_BUF_SIZE 0xUL +#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE (1024 * 64) +#define VIRTIO_VSOCK_MAX_TX_BUF_SIZE (1024 * 1024 * 16) +#define VIRTIO_VSOCK_MAX_DGRAM_SIZE(1024 * 64) + +struct vsock_transport_recv_notify_data; +struct vsock_transport_send_notify_data; +struct sockaddr_vm; +struct vsock_sock; + +enum { + VSOCK_VQ_CTRL = 0, + VSOCK_VQ_RX = 1, /* for host to guest data */ + VSOCK_VQ_TX = 2, /* for guest to host data */ + VSOCK_VQ_MAX= 3, +}; + +/* virtio transport socket state */ +struct virtio_transport { + struct virtio_transport_pkt_ops *ops; + struct vsock_sock *vsk; 
+ + u32 buf_size; + u32 buf_size_min; + u32 buf_size_max; + + struct mutex tx_lock; + struct mutex rx_lock; + + struct list_head rx_queue; + u32 rx_bytes; + + /* Protected by trans->tx_lock */ + u32 tx_cnt; + u32 buf_alloc; + u32 peer_fwd_cnt; + u32 peer_buf_alloc; + /* Protected by trans->rx_lock */ + u32 fwd_cnt; +}; + +struct virtio_vsock_pkt { + struct virtio_vsock_hdr hdr; + struct virtio_transport *trans; + struct work_struct work; + struct list_head list; + void *buf; + u32 len; + u32 off; +}; + +struct virtio_vsock_pkt_info {
[PATCH v3 0/4] Add virtio transport for AF_VSOCK
Note: the virtio-vsock device specification is currently under review but not yet finalized. Please review this code but don't merge until I send an update when the spec is finalized. Thanks! v3: * Remove unnecessary 3-way handshake, just do REQUEST/RESPONSE instead of REQUEST/RESPONSE/ACK * Remove SOCK_DGRAM support and focus on SOCK_STREAM first (also drop v2 Patch 1, it's only needed for SOCK_DGRAM) * Only allow host->guest connections (same security model as latest VMware) * Don't put vhost vsock driver into staging * Add missing Kconfig dependencies (Arnd Bergmann) * Remove unneeded variable used to store return value (Fengguang Wu and Julia Lawall ) v2: * Rebased onto Linux v4.4-rc2 * vhost: Refuse to assign reserved CIDs * vhost: Refuse guest CID if already in use * vhost: Only accept correctly addressed packets (no spoofing!) * vhost: Support flexible rx/tx descriptor layout * vhost: Add missing total_tx_buf decrement * virtio_transport: Fix total_tx_buf accounting * virtio_transport: Add virtio_transport global mutex to prevent races * common: Notify other side of SOCK_STREAM disconnect (fixes shutdown semantics) * common: Avoid recursive mutex_lock(tx_lock) for write_space (fixes deadlock) * common: Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants * common: Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants * common: Fix peer_buf_alloc inheritance on child socket This patch series adds a virtio transport for AF_VSOCK (net/vmw_vsock/). AF_VSOCK is designed for communication between virtual machines and hypervisors. It is currently only implemented for VMware's VMCI transport. This series implements the proposed virtio-vsock device specification from here: http://permalink.gmane.org/gmane.comp.emulators.virtio.devel/980 Most of the work was done by Asias He and Gerd Hoffmann a while back. I have picked up the series again. The QEMU userspace changes are here: https://github.com/stefanha/qemu/commits/vsock Why virtio-vsock? 
- Guest<->host communication is currently done over the virtio-serial device. This makes it hard to port sockets API-based applications and is limited to static ports. virtio-vsock uses the sockets API so that applications can rely on familiar SOCK_STREAM semantics. Applications on the host can easily connect to guest agents because the sockets API allows multiple connections to a listen socket (unlike virtio-serial). This simplifies the guest<->host communication and eliminates the need for extra processes on the host to arbitrate virtio-serial ports. Overview This series adds 3 pieces: 1. virtio_transport_common.ko - core virtio vsock code that uses vsock.ko 2. virtio_transport.ko - guest driver 3. drivers/vhost/vsock.ko - host driver Howto - The following kernel options are needed: CONFIG_VSOCKETS=y CONFIG_VIRTIO_VSOCKETS=y CONFIG_VIRTIO_VSOCKETS_COMMON=y CONFIG_VHOST_VSOCK=m Launch QEMU as follows: # qemu ... -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=3 Guest and host can communicate via AF_VSOCK sockets. The host's CID (address) is 2 and the guest must be assigned a CID (3 in the example above). Status -- This patch series implements the latest draft specification. Please review. 
Asias He (4): VSOCK: Introduce virtio-vsock-common.ko VSOCK: Introduce virtio-vsock.ko VSOCK: Introduce vhost-vsock.ko VSOCK: Add Makefile and Kconfig drivers/vhost/Kconfig | 10 + drivers/vhost/Makefile | 4 + drivers/vhost/vsock.c | 628 +++ drivers/vhost/vsock.h | 4 + include/linux/virtio_vsock.h| 203 include/uapi/linux/virtio_ids.h | 1 + include/uapi/linux/virtio_vsock.h | 87 net/vmw_vsock/Kconfig | 18 + net/vmw_vsock/Makefile | 2 + net/vmw_vsock/virtio_transport.c| 466 + net/vmw_vsock/virtio_transport_common.c | 854 11 files changed, 2277 insertions(+) create mode 100644 drivers/vhost/vsock.c create mode 100644 drivers/vhost/vsock.h create mode 100644 include/linux/virtio_vsock.h create mode 100644 include/uapi/linux/virtio_vsock.h create mode 100644 net/vmw_vsock/virtio_transport.c create mode 100644 net/vmw_vsock/virtio_transport_common.c -- 2.5.0
[PATCH v3 4/4] VSOCK: Add Makefile and Kconfig
From: Asias He
Enable virtio-vsock and vhost-vsock. Signed-off-by: Asias He Signed-off-by: Stefan Hajnoczi --- v3: * Don't put vhost vsock driver into staging * Add missing Kconfig dependencies (Arnd Bergmann ) --- drivers/vhost/Kconfig | 10 ++ drivers/vhost/Makefile | 4 net/vmw_vsock/Kconfig | 18 ++ net/vmw_vsock/Makefile | 2 ++ 4 files changed, 34 insertions(+) diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig index 533eaf0..a1bb4c2 100644 --- a/drivers/vhost/Kconfig +++ b/drivers/vhost/Kconfig @@ -21,6 +21,16 @@ config VHOST_SCSI Say M here to enable the vhost_scsi TCM fabric module for use with virtio-scsi guests +config VHOST_VSOCK + tristate "vhost virtio-vsock driver" + depends on VSOCKETS && EVENTFD + select VIRTIO_VSOCKETS_COMMON + select VHOST + select VHOST_RING + default n + ---help--- + Say M here to enable the vhost-vsock for virtio-vsock guests + config VHOST_RING tristate ---help--- diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile index e0441c3..6b012b9 100644 --- a/drivers/vhost/Makefile +++ b/drivers/vhost/Makefile @@ -4,5 +4,9 @@ vhost_net-y := net.o obj-$(CONFIG_VHOST_SCSI) += vhost_scsi.o vhost_scsi-y := scsi.o +obj-$(CONFIG_VHOST_VSOCK) += vhost_vsock.o +vhost_vsock-y := vsock.o + obj-$(CONFIG_VHOST_RING) += vringh.o + obj-$(CONFIG_VHOST)+= vhost.o diff --git a/net/vmw_vsock/Kconfig b/net/vmw_vsock/Kconfig index 14810ab..74e0bc8 100644 --- a/net/vmw_vsock/Kconfig +++ b/net/vmw_vsock/Kconfig @@ -26,3 +26,21 @@ config VMWARE_VMCI_VSOCKETS To compile this driver as a module, choose M here: the module will be called vmw_vsock_vmci_transport. If unsure, say N. + +config VIRTIO_VSOCKETS + tristate "virtio transport for Virtual Sockets" + depends on VSOCKETS && VIRTIO + select VIRTIO_VSOCKETS_COMMON + help + This module implements a virtio transport for Virtual Sockets. + + Enable this transport if your Virtual Machine runs on Qemu/KVM.
+ + To compile this driver as a module, choose M here: the module + will be called virtio_vsock_transport. If unsure, say N. + +config VIRTIO_VSOCKETS_COMMON + tristate + ---help--- + This option is selected by any driver which needs to access + the virtio_vsock. diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile index 2ce52d7..cf4c294 100644 --- a/net/vmw_vsock/Makefile +++ b/net/vmw_vsock/Makefile @@ -1,5 +1,7 @@ obj-$(CONFIG_VSOCKETS) += vsock.o obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += vmw_vsock_vmci_transport.o +obj-$(CONFIG_VIRTIO_VSOCKETS) += virtio_transport.o +obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += virtio_transport_common.o vsock-y += af_vsock.o vsock_addr.o -- 2.5.0
[PATCH v3 3/4] VSOCK: Introduce vhost-vsock.ko
From: Asias He
VM sockets vhost transport implementation. This module runs in host kernel. Signed-off-by: Asias He Signed-off-by: Stefan Hajnoczi --- v3: * Remove unneeded variable used to store return value (Fengguang Wu and Julia Lawall ) v2: * Add missing total_tx_buf decrement * Support flexible rx/tx descriptor layout * Refuse to assign reserved CIDs * Refuse guest CID if already in use * Only accept correctly addressed packets --- drivers/vhost/vsock.c | 628 ++ drivers/vhost/vsock.h | 4 + 2 files changed, 632 insertions(+) create mode 100644 drivers/vhost/vsock.c create mode 100644 drivers/vhost/vsock.h diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c new file mode 100644 index 000..3c0034a --- /dev/null +++ b/drivers/vhost/vsock.c @@ -0,0 +1,628 @@ +/* + * vhost transport for vsock + * + * Copyright (C) 2013-2015 Red Hat, Inc. + * Author: Asias He + * Stefan Hajnoczi + * + * This work is licensed under the terms of the GNU GPL, version 2. + */ +#include +#include +#include +#include +#include +#include + +#include +#include "vhost.h" +#include "vsock.h" + +#define VHOST_VSOCK_DEFAULT_HOST_CID 2 + +static int vhost_transport_socket_init(struct vsock_sock *vsk, + struct vsock_sock *psk); + +enum { + VHOST_VSOCK_FEATURES = VHOST_FEATURES, +}; + +/* Used to track all the vhost_vsock instances on the system.
*/ +static LIST_HEAD(vhost_vsock_list); +static DEFINE_MUTEX(vhost_vsock_mutex); + +struct vhost_vsock_virtqueue { + struct vhost_virtqueue vq; +}; + +struct vhost_vsock { + /* Vhost device */ + struct vhost_dev dev; + /* Vhost vsock virtqueue*/ + struct vhost_vsock_virtqueue vqs[VSOCK_VQ_MAX]; + /* Link to global vhost_vsock_list*/ + struct list_head list; + /* Head for pkt from host to guest */ + struct list_head send_pkt_list; + /* Work item to send pkt */ + struct vhost_work send_pkt_work; + /* Wait queue for send pkt */ + wait_queue_head_t queue_wait; + /* Used for global tx buf limitation */ + u32 total_tx_buf; + /* Guest context id this vhost_vsock instance handles */ + u32 guest_cid; +}; + +static u32 vhost_transport_get_local_cid(void) +{ + return VHOST_VSOCK_DEFAULT_HOST_CID; +} + +static struct vhost_vsock *vhost_vsock_get(u32 guest_cid) +{ + struct vhost_vsock *vsock; + + mutex_lock(&vhost_vsock_mutex); + list_for_each_entry(vsock, &vhost_vsock_list, list) { + if (vsock->guest_cid == guest_cid) { + mutex_unlock(&vhost_vsock_mutex); + return vsock; + } + } + mutex_unlock(&vhost_vsock_mutex); + + return NULL; +} + +static void +vhost_transport_do_send_pkt(struct vhost_vsock *vsock, + struct vhost_virtqueue *vq) +{ + bool added = false; + + mutex_lock(&vq->mutex); + vhost_disable_notify(&vsock->dev, vq); + for (;;) { + struct virtio_vsock_pkt *pkt; + struct iov_iter iov_iter; + unsigned out, in; + struct sock *sk; + size_t nbytes; + size_t len; + int head; + + if (list_empty(&vsock->send_pkt_list)) { + vhost_enable_notify(&vsock->dev, vq); + break; + } + + head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov), +&out, &in, NULL, NULL); + pr_debug("%s: head = %d\n", __func__, head); + if (head < 0) + break; + + if (head == vq->num) { + if (unlikely(vhost_enable_notify(&vsock->dev, vq))) { + vhost_disable_notify(&vsock->dev, vq); + continue; + } + break; + } + + pkt = list_first_entry(&vsock->send_pkt_list, + struct virtio_vsock_pkt, list); + list_del_init(&pkt->list); + + if (out) { + virtio_transport_free_pkt(pkt); + vq_err(vq, "Expected 0 output buffers, got %u\n", out); + break; + } + + len = iov_length(&vq->iov[out], in); + iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len); + + nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter); + if (nbytes != sizeof(pkt->hdr)) { + virtio_transport_free_pkt(pkt); + vq_err(vq, "Faulted on copying pkt hdr\n"); + break; + } + + nbytes = copy_to_iter(pkt->buf, pkt->len,
[PATCH v3 2/4] VSOCK: Introduce virtio-vsock.ko
From: Asias He
VM sockets virtio transport implementation. This module runs in guest kernel. Signed-off-by: Asias He Signed-off-by: Stefan Hajnoczi --- v2: * Fix total_tx_buf accounting * Add virtio_transport global mutex to prevent races --- net/vmw_vsock/virtio_transport.c | 466 +++ 1 file changed, 466 insertions(+) create mode 100644 net/vmw_vsock/virtio_transport.c diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c new file mode 100644 index 000..df65dca --- /dev/null +++ b/net/vmw_vsock/virtio_transport.c @@ -0,0 +1,466 @@ +/* + * virtio transport for vsock + * + * Copyright (C) 2013-2015 Red Hat, Inc. + * Author: Asias He + * Stefan Hajnoczi + * + * Some of the code is taken from Gerd Hoffmann's + * early virtio-vsock proof-of-concept bits. + * + * This work is licensed under the terms of the GNU GPL, version 2. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static struct workqueue_struct *virtio_vsock_workqueue; +static struct virtio_vsock *the_virtio_vsock; +static DEFINE_MUTEX(the_virtio_vsock_mutex); /* protects the_virtio_vsock */ +static void virtio_vsock_rx_fill(struct virtio_vsock *vsock); + +struct virtio_vsock { + /* Virtio device */ + struct virtio_device *vdev; + /* Virtio virtqueue */ + struct virtqueue *vqs[VSOCK_VQ_MAX]; + /* Wait queue for send pkt */ + wait_queue_head_t queue_wait; + /* Work item to send pkt */ + struct work_struct tx_work; + /* Work item to recv pkt */ + struct work_struct rx_work; + /* Mutex to protect send pkt*/ + struct mutex tx_lock; + /* Mutex to protect recv pkt*/ + struct mutex rx_lock; + /* Number of recv buffers */ + int rx_buf_nr; + /* Number of max recv buffers */ + int rx_buf_max_nr; + /* Used for global tx buf limitation */ + u32 total_tx_buf; + /* Guest context id, just like guest ip address */ + u32 guest_cid; +}; + +static struct virtio_vsock *virtio_vsock_get(void) +{ + return the_virtio_vsock; +} + +static u32
virtio_transport_get_local_cid(void) +{ + struct virtio_vsock *vsock = virtio_vsock_get(); + + return vsock->guest_cid; +} + +static int +virtio_transport_send_pkt(struct vsock_sock *vsk, + struct virtio_vsock_pkt_info *info) +{ + u32 src_cid, src_port, dst_cid, dst_port; + int ret, in_sg = 0, out_sg = 0; + struct virtio_transport *trans; + struct virtio_vsock_pkt *pkt; + struct virtio_vsock *vsock; + struct scatterlist hdr, buf, *sgs[2]; + struct virtqueue *vq; + u32 pkt_len = info->pkt_len; + DEFINE_WAIT(wait); + + vsock = virtio_vsock_get(); + if (!vsock) + return -ENODEV; + + src_cid = virtio_transport_get_local_cid(); + src_port = vsk->local_addr.svm_port; + if (!info->remote_cid) { + dst_cid = vsk->remote_addr.svm_cid; + dst_port = vsk->remote_addr.svm_port; + } else { + dst_cid = info->remote_cid; + dst_port = info->remote_port; + } + + trans = vsk->trans; + vq = vsock->vqs[VSOCK_VQ_TX]; + + if (pkt_len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE) + pkt_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE; + pkt_len = virtio_transport_get_credit(trans, pkt_len); + /* Do not send zero length OP_RW pkt*/ + if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW) + return pkt_len; + + /* Respect global tx buf limitation */ + mutex_lock(&vsock->tx_lock); + while (pkt_len + vsock->total_tx_buf > VIRTIO_VSOCK_MAX_TX_BUF_SIZE) { + prepare_to_wait_exclusive(&vsock->queue_wait, &wait, + TASK_UNINTERRUPTIBLE); + mutex_unlock(&vsock->tx_lock); + schedule(); + mutex_lock(&vsock->tx_lock); + finish_wait(&vsock->queue_wait, &wait); + } + vsock->total_tx_buf += pkt_len; + mutex_unlock(&vsock->tx_lock); + + pkt = virtio_transport_alloc_pkt(vsk, info, pkt_len, +src_cid, src_port, +dst_cid, dst_port); + if (!pkt) { + mutex_lock(&vsock->tx_lock); + vsock->total_tx_buf -= pkt_len; + mutex_unlock(&vsock->tx_lock); + virtio_transport_put_credit(trans, pkt_len); + return -ENOMEM; + } + + pr_debug("%s:info->pkt_len= %d\n", __func__, info->pkt_len); + + /* Will be released in virtio_transport_send_pkt_work */ + sock_hold(&trans->vsk->sk); + virtio_transport_inc_tx_pkt(pkt); + + /* Put pkt in the virtqueue */ + sg_init_one(&hdr, &pkt->hdr,
Re: live migration vs device assignment (motivation)
On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote: I thought about what this is doing at the high level, and I do see some value in what you are trying to do, but I also think we need to clarify the motivation a bit more. What you are saying is not really what the patches are doing. And with that clearer understanding of the motivation in mind (assuming it actually captures a real need), I would also like to suggest some changes. Motivation: Most current solutions for migration with a passthrough device are based on PCI hotplug, but that has side effects and can't work for all devices. For NIC devices: the PCI hotplug solution can work around network device migration by switching between VF and PF, but switching the network interface introduces service downtime. I tested the service downtime by putting the VF and a PV interface into a bonded interface and pinging the bonded interface while plugging and unplugging the VF. 1) About 100ms when adding the VF 2) About 30ms when removing the VF It also requires the guest to do the switch configuration. These are hard for our customers to manage and deploy. To maintain PV performance during migration, the host side also needs to assign a VF to the PV device, which affects scalability. These factors block SR-IOV NIC passthrough usage in cloud services and OPNFV, which require high network performance and stability. For other kinds of devices, hotplug is hard to make work. We are also adding migration support for the QAT (QuickAssist Technology) device. QAT device use case introduction: server, networking, big data, and storage applications use QuickAssist Technology to offload servers from handling compute-intensive operations, such as: 1) Symmetric cryptography functions including cipher operations and authentication operations 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve cryptography 3) Compression and decompression functions including DEFLATE and LZS PCI hotplug will not work for such devices during migration, and these operations will fail when the device is unplugged.
So we are trying to implement a new solution which really migrates device state to the target machine, with low service downtime, and doesn't disturb the user during migration.
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Wed, Dec 9, 2015 at 1:28 AM, Lan, Tianyu wrote: > > > On 12/8/2015 1:12 AM, Alexander Duyck wrote: >> >> On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu wrote: >>> >>> On 12/5/2015 1:07 AM, Alexander Duyck wrote: > > > > We still need to support Windows guest for migration and this is why > our > patches keep all changes in the driver since it's impossible to change > Windows kernel. That is a poor argument. I highly doubt Microsoft is interested in having to modify all of the drivers that will support direct assignment in order to support migration. They would likely request something similar to what I have in that they will want a way to do DMA tracking with minimal modification required to the drivers. >>> >>> >>> >>> This totally depends on the NIC or other devices' vendors and they >>> should make decision to support migration or not. If yes, they would >>> modify driver. >> >> >> Having to modify every driver that wants to support live migration is >> a bit much. In addition I don't see this being limited only to NIC >> devices. You can direct assign a number of different devices, your >> solution cannot be specific to NICs. > > > We are also adding such migration support for QAT device and so our > solution will not just be limited to NICs. Now is just the beginning. Agreed, but still QAT is networking related. My advice would be to look at something else that works from within a different subsystem such as storage. All I am saying is that your solution is very networking centric. > We can't limit user to only use Linux guest. So the migration feature > should work for both Windows and Linux guest. Right now what your solution is doing is to limit things so that only the Intel NICs can support this since it will require driver modification across the board. Instead what I have proposed should make it so that once you have done the work there should be very little work that has to be done on your port to support any device.
>> >>> If just target to call suspend/resume during migration, the feature will >>> be meaningless. Most cases don't want to affect user during migration >>> a lot and so the service down time is vital. Our target is to apply >>> SRIOV NIC passthough to cloud service and NFV(network functions >>> virtualization) projects which are sensitive to network performance >>> and stability. From my opinion, We should give a change for device >>> driver to implement itself migration job. Call suspend and resume >>> callback in the driver if it doesn't care the performance during >>> migration. >> >> >> The suspend/resume callback should be efficient in terms of time. >> After all we don't want the system to stall for a long period of time >> when it should be either running or asleep. Having it burn cycles in >> a power state limbo doesn't do anyone any good. If nothing else maybe >> it will help to push the vendors to speed up those functions which >> then benefit migration and the system sleep states. > > > If we can benefit both migration and suspend, that would be wonderful. > But migration and system pm is still different. Just for example, > driver doesn't need to put device into deep D-status during migration > and host can do this after migration while it's essential for > system sleep. PCI configure space and interrupt config is emulated by > Qemu and Qemu can migrate these configures to new machine. Driver > doesn't need to deal with such thing. So I think migration still needs a > different callback or different code path than device suspend/resume. SR-IOV devices are considered to be in D3 as soon as you clear the bus master enable bit. They don't actually have a PCIe power management block in their configuration space. The advantage of the suspend/resume approach is that the D0->D3->D0 series of transitions should trigger a PCIe reset on the device. As such the resume call is capable of fully reinitializing a device. 
As far as migrating the interrupts themselves moving live interrupts is problematic. You are more likely to throw them out of sync since the state of the device will not match the state of what you migrated for things like the pending bit array so if there is a device that actually depending on those bits you might run into issues. > Another concern is that we have to rework PM core ore PCI bus driver > to call suspend/resume for passthrough devices during migration. This > also blocks new feature works on the Windows. If I am not mistaken the Windows drivers have a similar feature that is called when you disable or enable an interface. I believe the motivation for using D3 when a device has been disabled is to save power on the system since in D3 the device should be in its lowest power state. >> >> Also you keep assuming you can keep the device running while you do >> the migration and you can't. You are going to corrupt the
Re: live migration vs device assignment (motivation)
On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu wrote:
> For other kind of devices, it's hard to work.
> We are also adding migration support for QAT(QuickAssist Technology)
> device.
>
> QAT device user case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive
> operations, such as:
> 1) Symmetric cryptography functions including cipher operations and
>    authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic
>    curve cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
>
> PCI hotplug will not work for such devices during migration and these
> operations will fail when unplug device.

I assume the problem is that with a PCI hotplug event you are losing
the state information for the device, do I have that right?

Looking over the QAT drivers it doesn't seem like any of them support
the suspend/resume PM calls. I would imagine it makes it difficult for
a system with a QAT card in it to be able to drop the system to a low
power state.

You might want to try enabling suspend/resume support for the devices
on bare metal before you attempt to take on migration as it would
provide you with a good testing framework to see what needs to be
saved/restored within the device and in what order before you attempt
to do the same while migrating from one system to another.

- Alex
[Patch V0] This patch adds some support required for KVM in order to support LMCE.
- Add support for MSR_IA32_MCG_EXT_CTL
- Add MCG_LMCE_P to KVM_MCE_CAP_SUPPORTED
- Changes to IA32_FEATURE_CONTROL: allow this MSR to be defined not
  just for nested VMM, but now it's required for Local MCE.

Reviewed-by: Andi Kleen
Reviewed-by: Tony Luck
Tested-by: Gong Chen
Signed-off-by: Ashok Raj
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx.c              | 26 +-
 arch/x86/kvm/x86.c              | 17 -
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 30cfd64..6940141 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -525,6 +525,7 @@ struct kvm_vcpu_arch {
 	u64 mcg_cap;
 	u64 mcg_status;
 	u64 mcg_ctl;
+	u64 mcg_ext_ctl;
 	u64 *mce_banks;

 	/* Cache MMIO info */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 87acc52..c2ce9f4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2747,6 +2747,20 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
 	return 0;
 }

+bool can_feature_control_exist(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * There are some features that require BIOS enabling.
+	 * In such cases BIOS is supposed to set this bit and indicate
+	 * the feature is enabled and available to the OS.
+	 * Local Machine Check Exception (LMCE) is one such feature.
+	 */
+	if (vcpu->arch.mcg_cap & MCG_LMCE_P)
+		return true;
+
+	return (nested_vmx_allowed(vcpu));
+}
+
 /*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
  */
@@ -2789,9 +2803,11 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		msr_info->data = vmcs_read64(GUEST_BNDCFGS);
 		break;
 	case MSR_IA32_FEATURE_CONTROL:
-		if (!nested_vmx_allowed(vcpu))
+		if (can_feature_control_exist(vcpu))
+			msr_info->data =
+				to_vmx(vcpu)->nested.msr_ia32_feature_control;
+		else
 			return 1;
-		msr_info->data = to_vmx(vcpu)->nested.msr_ia32_feature_control;
 		break;
 	case MSR_IA32_VMX_BASIC ...
MSR_IA32_VMX_VMFUNC:
 		if (!nested_vmx_allowed(vcpu))
@@ -2882,9 +2898,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		ret = kvm_set_msr_common(vcpu, msr_info);
 		break;
 	case MSR_IA32_FEATURE_CONTROL:
-		if (!nested_vmx_allowed(vcpu) ||
-		    (to_vmx(vcpu)->nested.msr_ia32_feature_control &
-		     FEATURE_CONTROL_LOCKED && !msr_info->host_initiated))
+		if ((can_feature_control_exist(vcpu) == false) ||
+		    ((to_vmx(vcpu)->nested.msr_ia32_feature_control &
+		      FEATURE_CONTROL_LOCKED) && !msr_info->host_initiated))
 			return 1;
 		vmx->nested.msr_ia32_feature_control = data;
 		if (msr_info->host_initiated && data == 0)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 00462bd..0da3871 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -70,7 +70,7 @@
 #define MAX_IO_MSRS 256
 #define KVM_MAX_MCE_BANKS 32
-#define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P)
+#define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P | MCG_LMCE_P)

 #define emul_to_vcpu(ctxt) \
 	container_of(ctxt, struct kvm_vcpu, arch.emulate_ctxt)
@@ -974,6 +974,7 @@ static u32 emulated_msrs[] = {
 	MSR_IA32_MISC_ENABLE,
 	MSR_IA32_MCG_STATUS,
 	MSR_IA32_MCG_CTL,
+	MSR_IA32_MCG_EXT_CTL,
 	MSR_IA32_SMBASE,
 };
@@ -1913,6 +1914,13 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 			return -1;
 		vcpu->arch.mcg_ctl = data;
 		break;
+	case MSR_IA32_MCG_EXT_CTL:
+		if (!(mcg_cap & MCG_LMCE_P))
+			return 1;
+		if (data != 0 && data != 0x1)
+			return -1;
+		vcpu->arch.mcg_ext_ctl = data;
+		break;
 	default:
 		if (msr >= MSR_IA32_MC0_CTL &&
 		    msr < MSR_IA32_MCx_CTL(bank_num)) {
@@ -2170,6 +2178,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
+	case MSR_IA32_MCG_EXT_CTL:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
 		return set_msr_mce(vcpu, msr, data);
@@ -2266,6 +2275,11 @@ static int get_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 		return 1;
 		data = vcpu->arch.mcg_ctl;
 		break;
+
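To make the intended semantics concrete, here is a small standalone C model of the MSR_IA32_MCG_EXT_CTL write checks in the set_msr_mce() hunk above. The struct is a simplified stand-in for the kernel's vcpu state, and MCG_LMCE_P is reproduced here rather than pulled from kernel headers; this is an illustration of the patch's logic, not the patch itself.

```c
#include <assert.h>
#include <stdint.h>

#define MCG_LMCE_P (1ULL << 27)  /* LMCE capability bit in IA32_MCG_CAP */

/* Simplified stand-in for the relevant kvm_vcpu_arch fields. */
struct vcpu_mce {
    uint64_t mcg_cap;
    uint64_t mcg_ext_ctl;
};

/* Model of the MSR_IA32_MCG_EXT_CTL case in set_msr_mce():
 * the write is refused unless the vCPU advertises MCG_LMCE_P, and
 * only 0 or 1 (the LMCE_EN bit) may be written.  Return values
 * mirror the patch: 1 and -1 are failures, 0 means accepted. */
static int set_mcg_ext_ctl(struct vcpu_mce *v, uint64_t data)
{
    if (!(v->mcg_cap & MCG_LMCE_P))
        return 1;       /* capability not exposed to this guest */
    if (data != 0 && data != 1)
        return -1;      /* only the LMCE_EN bit is defined */
    v->mcg_ext_ctl = data;
    return 0;
}
```

Guests that were not given MCG_LMCE_P in their MCG_CAP thus never see the new MSR succeed, which matches the guarded `case` added to kvm_set_msr_common().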
Re: [PATCH v3 0/4] Add virtio transport for AF_VSOCK
On Wed, Dec 09, 2015 at 08:03:49PM +0800, Stefan Hajnoczi wrote: > Note: the virtio-vsock device specification is currently under review but not > yet finalized. Please review this code but don't merge until I send an update > when the spec is finalized. Thanks! Yes, this should have RFC in the subject. > v3: > * Remove unnecessary 3-way handshake, just do REQUEST/RESPONSE instead >of REQUEST/RESPONSE/ACK > * Remove SOCK_DGRAM support and focus on SOCK_STREAM first >(also drop v2 Patch 1, it's only needed for SOCK_DGRAM) > * Only allow host->guest connections (same security model as latest >VMware) > * Don't put vhost vsock driver into staging > * Add missing Kconfig dependencies (Arnd Bergmann) > * Remove unneeded variable used to store return value >(Fengguang Wu and Julia Lawall > ) > > v2: > * Rebased onto Linux v4.4-rc2 > * vhost: Refuse to assign reserved CIDs > * vhost: Refuse guest CID if already in use > * vhost: Only accept correctly addressed packets (no spoofing!) > * vhost: Support flexible rx/tx descriptor layout > * vhost: Add missing total_tx_buf decrement > * virtio_transport: Fix total_tx_buf accounting > * virtio_transport: Add virtio_transport global mutex to prevent races > * common: Notify other side of SOCK_STREAM disconnect (fixes shutdown >semantics) > * common: Avoid recursive mutex_lock(tx_lock) for write_space (fixes > deadlock) > * common: Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants > * common: Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants > * common: Fix peer_buf_alloc inheritance on child socket > > This patch series adds a virtio transport for AF_VSOCK (net/vmw_vsock/). > AF_VSOCK is designed for communication between virtual machines and > hypervisors. It is currently only implemented for VMware's VMCI transport. 
> > This series implements the proposed virtio-vsock device specification from > here: > http://permalink.gmane.org/gmane.comp.emulators.virtio.devel/980 > > Most of the work was done by Asias He and Gerd Hoffmann a while back. I have > picked up the series again. > > The QEMU userspace changes are here: > https://github.com/stefanha/qemu/commits/vsock > > Why virtio-vsock? > - > Guest<->host communication is currently done over the virtio-serial device. > This makes it hard to port sockets API-based applications and is limited to > static ports. > > virtio-vsock uses the sockets API so that applications can rely on familiar > SOCK_STREAM semantics. Applications on the host can easily connect to guest > agents because the sockets API allows multiple connections to a listen socket > (unlike virtio-serial). This simplifies the guest<->host communication and > eliminates the need for extra processes on the host to arbitrate virtio-serial > ports. > > Overview > > This series adds 3 pieces: > > 1. virtio_transport_common.ko - core virtio vsock code that uses vsock.ko > > 2. virtio_transport.ko - guest driver > > 3. drivers/vhost/vsock.ko - host driver > > Howto > - > The following kernel options are needed: > CONFIG_VSOCKETS=y > CONFIG_VIRTIO_VSOCKETS=y > CONFIG_VIRTIO_VSOCKETS_COMMON=y > CONFIG_VHOST_VSOCK=m > > Launch QEMU as follows: > # qemu ... -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=3 > > Guest and host can communicate via AF_VSOCK sockets. The host's CID (address) > is 2 and the guest must be assigned a CID (3 in the example above). > > Status > -- > This patch series implements the latest draft specification. Please review. 
>
> Asias He (4):
>   VSOCK: Introduce virtio-vsock-common.ko
>   VSOCK: Introduce virtio-vsock.ko
>   VSOCK: Introduce vhost-vsock.ko
>   VSOCK: Add Makefile and Kconfig
>
>  drivers/vhost/Kconfig                   |  10 +
>  drivers/vhost/Makefile                  |   4 +
>  drivers/vhost/vsock.c                   | 628 +++
>  drivers/vhost/vsock.h                   |   4 +
>  include/linux/virtio_vsock.h            | 203
>  include/uapi/linux/virtio_ids.h         |   1 +
>  include/uapi/linux/virtio_vsock.h       |  87
>  net/vmw_vsock/Kconfig                   |  18 +
>  net/vmw_vsock/Makefile                  |   2 +
>  net/vmw_vsock/virtio_transport.c        | 466 +
>  net/vmw_vsock/virtio_transport_common.c | 854
>
>  11 files changed, 2277 insertions(+)
>  create mode 100644 drivers/vhost/vsock.c
>  create mode 100644 drivers/vhost/vsock.h
>  create mode 100644 include/linux/virtio_vsock.h
>  create mode 100644 include/uapi/linux/virtio_vsock.h
>  create mode 100644 net/vmw_vsock/virtio_transport.c
>  create mode 100644 net/vmw_vsock/virtio_transport_common.c
>
> --
> 2.5.0
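As a sketch of how the "Howto" above is exercised from userspace: since AF_VSOCK reuses the standard sockets API, a host-side client needs nothing beyond `struct sockaddr_vm`. This is a hypothetical example, not code from the series; the CID 3 matches the `guest-cid` in the QEMU command line above, the port number is arbitrary, and the connect will only succeed on a host with the vhost-vsock module loaded and a guest listening.

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>   /* struct sockaddr_vm */

/* Connect from the host to a guest agent listening on a vsock port.
 * CID 2 is the host itself; guests get CIDs >= 3 (3 in the QEMU
 * example above).  Returns a connected fd, or -1 on failure (e.g.
 * vhost_vsock not loaded, or no listener in the guest). */
int vsock_connect(unsigned int cid, unsigned int port)
{
    struct sockaddr_vm addr;
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.svm_family = AF_VSOCK;
    addr.svm_cid = cid;         /* guest-cid from the QEMU command line */
    addr.svm_port = port;       /* arbitrary example port */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

This is exactly the property the cover letter highlights over virtio-serial: a listen socket in the guest can accept multiple such connections with plain `accept(2)`.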
kvmclock doesn't work, help?
I'm trying to clean up kvmclock and I can't get it to work at all. My
host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.

If I boot an SMP (2 vcpus) guest, tracing says:

 qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0
 qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
 qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 1
 qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 1
 qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 1

If I boot a UP guest, tracing says:

 qemu-system-x86-2567 [001] 102370.447484: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 1
 qemu-system-x86-2571 [002] 102370.447688: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 1

I suspect, but I haven't verified, that this is fallout from:

commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
Author: Marcelo Tosatti
Date:   Wed May 14 12:43:24 2014 -0300

    KVM: x86: disable master clock if TSC is reset during suspend

    Updating system_time from the kernel clock once master clock
    has been enabled can result in time backwards event, in case
    kernel clock
    frequency is lower than TSC frequency. Disable master clock in
    case it is necessary to update it from the resume path.

    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Paolo Bonzini

Can we please stop making kvmclock more complex? It's a beast right
now, and not in a good way. It's far too tangled with the vclock
machinery on both the host and guest sides, the pvclock stuff is not
well thought out (even in principle in an ABI sense), and it's never
been clear to me what problem exactly the kvmclock stuff is supposed
to solve.

I'm somewhat tempted to suggest that we delete kvmclock entirely and
start over. A correctly functioning KVM guest using TSC (i.e. ignoring
kvmclock entirely) seems to work rather more reliably and considerably
faster than a kvmclock guest.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
Re: kvmclock doesn't work, help?
On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonziniwrote: > > > On 09/12/2015 22:10, Andy Lutomirski wrote: >> Can we please stop making kvmclock more complex? It's a beast right >> now, and not in a good way. It's far too tangled with the vclock >> machinery on both the host and guest sides, the pvclock stuff is not >> well thought out (even in principle in an ABI sense), and it's never >> been clear to my what problem exactly the kvmclock stuff is supposed >> to solve. > > It's supposed to solve the problem that: > > - not all hosts have a working TSC Fine, but we don't need any vdso integration for that. > > - even if they all do, virtual machines can be migrated (or > saved/restored) to a host with a different TSC frequency OK, I buy that. So we want to export a linear function that the guest applies to the TSC so the guest can apply it. I suppose we also want ntp frequency corrections on the host to propagate to the guest. > > - any MMIO- or PIO-based mechanism to access the current time is orders > of magnitude slower than the TSC and less precise too. Yup. But TSC by itself gets that benefit, too. > >> I'm somewhat tempted to suggest that we delete kvmclock entirely and >> start over. A correctly functioning KVM guest using TSC (i.e. >> ignoring kvmclock entirely) seems to work rather more reliably and >> considerably faster than a kvmclock guest. > > If all your hosts have a working TSC and you don't do migration or > save/restore, that's a valid configuration. It's not a good default, > however. Er? kvmclock is still really quite slow and buggy. And the patch I identified is definitely a problem here: [ 136.131241] KVM: disabling fast timing permanently due to inability to recover from suspend I got that on the host with this whitespace-damaged patch: if (backwards_tsc) { u64 delta_cyc = max_tsc - local_tsc; + if (!backwards_tsc_observed) + pr_warn("KVM: disabling fast timing permanently due to inability to recover from suspend\n"); when I suspended and resumed. 
Can anyone explain what problem 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve? On brief inspection, it just seems to be incorrect. Shouldn't KVM's normal TSC logic handle that case right? After all, all vcpus should be paused when we resume from suspend. At worst, we should just need kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus. (Actually, shouldn't we do that regardless of which way the TSC jumped on suspend/resume? After all, the TSC-to-wall-clock offset is quite likely to change except on the very small handful of CPUs (if any) that keep the TSC running in S3 and hibernate.)

--Andy
Re: kvmclock doesn't work, help?
On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonziniwrote: > > > On 09/12/2015 22:49, Andy Lutomirski wrote: >> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini wrote: >>> >>> >>> On 09/12/2015 22:10, Andy Lutomirski wrote: Can we please stop making kvmclock more complex? It's a beast right now, and not in a good way. It's far too tangled with the vclock machinery on both the host and guest sides, the pvclock stuff is not well thought out (even in principle in an ABI sense), and it's never been clear to my what problem exactly the kvmclock stuff is supposed to solve. >>> >>> It's supposed to solve the problem that: >>> >>> - not all hosts have a working TSC >> >> Fine, but we don't need any vdso integration for that. > > Well, you still want a fast time source. That was a given. :) If the host can't figure out how to give *itself* a fast time source, I'd be surprised if KVM can manage to give the guest a fast, reliable time source. > >>> - even if they all do, virtual machines can be migrated (or >>> saved/restored) to a host with a different TSC frequency >>> >>> - any MMIO- or PIO-based mechanism to access the current time is orders >>> of magnitude slower than the TSC and less precise too. >> >> Yup. But TSC by itself gets that benefit, too. > > Yes, the problem is if you want to solve all three of them. The first > two are solved by the ACPI PM timer with a decent resolution (70 > ns---much faster anyway than an I/O port access). The third is solved > by TSC. To solve all three, you need kvmclock. Still confused. Is kvmclock really used in cases where even the host can't pull of working TSC? > I'm somewhat tempted to suggest that we delete kvmclock entirely and start over. A correctly functioning KVM guest using TSC (i.e. ignoring kvmclock entirely) seems to work rather more reliably and considerably faster than a kvmclock guest. >>> >>> If all your hosts have a working TSC and you don't do migration or >>> save/restore, that's a valid configuration. 
It's not a good default, >>> however. >> >> Er? >> >> kvmclock is still really quite slow and buggy. > > Unless it takes 3-4000 clock cycles for a gettimeofday, which it > shouldn't even with vdso disabled, it's definitely not slower than PIO. > >> And the patch I identified is definitely a problem here: >> >> [ 136.131241] KVM: disabling fast timing permanently due to inability >> to recover from suspend >> >> I got that on the host with this whitespace-damaged patch: >> >> if (backwards_tsc) { >> u64 delta_cyc = max_tsc - local_tsc; >> + if (!backwards_tsc_observed) >> + pr_warn("KVM: disabling fast timing >> permanently due to inability to recover from suspend\n"); >> >> when I suspended and resumed. >> >> Can anyone explain what problem >> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve? On >> brief inspection, it just seems to be incorrect. Shouldn't KVM's >> normal TSC logic handle that case right? After all, all vcpus should >> be paused when we resume from suspend. At worst, we should just need >> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus. (Actually, >> shouldn't we do that regardless of which way the TSC jumped on >> suspend/resume? After all, the jTSC-to-wall-clock offset is quite >> likely to change except on the very small handful of CPUs (if any) >> that keep the TSC running in S3 and hibernate. > > I don't recall the details of that patch, so Marcelo will have to answer > this, or Alex too since he chimed in the original thread. At least it > should be made conditional on the existence of a VM at suspend time (and > the master clock stuff should be made per VM, as I suggested at > https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html). > > It would indeed be great if the master clock could be dropped. But I'm > definitely missing some of the subtle details. :( Me, too. Anyway, see the attached untested patch. Marcelo? 
--Andy

From e4a5e834d3fb6fc2499966e1af42cb5bd59f4410 Mon Sep 17 00:00:00 2001
Message-Id:
From: Andy Lutomirski
Date: Wed, 9 Dec 2015 14:21:05 -0800
Subject: [PATCH] x86/kvm: On KVM re-enable (e.g. after suspend), update clocks

This gets rid of the "did TSC go backwards" logic and just updates all
clocks. It should work better (no more disabling of fast timing) and
more reliably (all of the clocks are actually updated).

Signed-off-by: Andy Lutomirski
---
 arch/x86/kvm/x86.c | 75 +++---
 1 file changed, 3 insertions(+), 72 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eed32283d22c..c88f91f4b1a3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -123,8 +123,6 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 unsigned
Re: kvmclock doesn't work, help?
On 09/12/2015 23:27, Andy Lutomirski wrote: > On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonziniwrote: >> On 09/12/2015 22:49, Andy Lutomirski wrote: >>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini wrote: On 09/12/2015 22:10, Andy Lutomirski wrote: > Can we please stop making kvmclock more complex? It's a beast right > now, and not in a good way. It's far too tangled with the vclock > machinery on both the host and guest sides, the pvclock stuff is not > well thought out (even in principle in an ABI sense), and it's never > been clear to my what problem exactly the kvmclock stuff is supposed > to solve. It's supposed to solve the problem that: - not all hosts have a working TSC >>> >>> Fine, but we don't need any vdso integration for that. >> >> Well, you still want a fast time source. That was a given. :) > > If the host can't figure out how to give *itself* a fast time source, > I'd be surprised if KVM can manage to give the guest a fast, reliable > time source. There's no vdso integration unless the host has a constant, nonstop (fully "working") TSC. That's the meaning of PVCLOCK_TSC_STABLE_BIT. So, correction: if you can pull it off, you still want a fast time source. Otherwise, you still want one that is as fast as possible, especially on the kernel side. - even if they all do, virtual machines can be migrated (or saved/restored) to a host with a different TSC frequency - any MMIO- or PIO-based mechanism to access the current time is orders of magnitude slower than the TSC and less precise too. >> >> the problem is if you want to solve all three of them. The first >> two are solved by the ACPI PM timer with a decent resolution (70 >> ns---much faster anyway than an I/O port access). The third is solved >> by TSC. To solve all three, you need kvmclock. > > Still confused. Is kvmclock really used in cases where even the host > can't pull of working TSC? You can certainly provide kvmclock even if you lack constant-rate or nonstop TSC. 
Those are only a requirement for vdso. If the host has a constant-rate TSC, but the rate differs per physical CPU (common on older NUMA machines), you can easily provide a working kvmclock. It cannot support vdso because you'll need to read the time from a non-preemptible section, but it will work because KVM can update the kvmclock parameters on VCPU migration, and it's still faster than anything else. (The purpose of the now-gone migration notifiers was to support vdso even in this case.)

If the host doesn't even have a constant-rate TSC, you can still provide kernel-only kvmclock reads through cpufreq notifiers.

Paolo
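For reference, the guest-side read that all of this machinery feeds is a lockless, version-checked read of a per-vCPU record. Below is a userspace C model of that protocol: the field names mirror the pvclock record the host publishes, but the struct layout is simplified, the barriers are only noted in comments, and the sample values in the test are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of the per-vCPU record the host publishes
 * (fields after struct pvclock_vcpu_time_info). */
struct time_info {
    volatile uint32_t version;  /* odd while the host is mid-update */
    uint64_t tsc_timestamp;     /* guest TSC when the host last updated */
    uint64_t system_time;       /* nanoseconds at tsc_timestamp */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point scale factor */
    int8_t   tsc_shift;
};

/* Scale a TSC delta to nanoseconds: (delta << shift) * mul >> 32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;
    return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Compute the time for a given TSC reading, retrying if the host
 * changed the record underneath us (version check, seqlock-style).
 * A real guest issues read barriers around the field loads. */
static uint64_t read_clock(const struct time_info *ti, uint64_t tsc)
{
    uint32_t ver;
    uint64_t ns;

    do {
        ver = ti->version;
        ns = ti->system_time +
             scale_delta(tsc - ti->tsc_timestamp,
                         ti->tsc_to_system_mul, ti->tsc_shift);
    } while ((ver & 1) || ver != ti->version);

    return ns;
}
```

The mul/shift pair is exactly the "linear function applied to the TSC" discussed earlier in the thread: the host rewrites it on migration to a machine with a different TSC frequency, and the version field is what lets it do so without guest-side locking.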
Re: kvmclock doesn't work, help?
On 09/12/2015 22:10, Andy Lutomirski wrote: > Can we please stop making kvmclock more complex? It's a beast right > now, and not in a good way. It's far too tangled with the vclock > machinery on both the host and guest sides, the pvclock stuff is not > well thought out (even in principle in an ABI sense), and it's never > been clear to my what problem exactly the kvmclock stuff is supposed > to solve. It's supposed to solve the problem that: - not all hosts have a working TSC - even if they all do, virtual machines can be migrated (or saved/restored) to a host with a different TSC frequency - any MMIO- or PIO-based mechanism to access the current time is orders of magnitude slower than the TSC and less precise too. > I'm somewhat tempted to suggest that we delete kvmclock entirely and > start over. A correctly functioning KVM guest using TSC (i.e. > ignoring kvmclock entirely) seems to work rather more reliably and > considerably faster than a kvmclock guest. If all your hosts have a working TSC and you don't do migration or save/restore, that's a valid configuration. It's not a good default, however. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
x86: Don't report guest userspace emulation error to userspace, why?
Commit a2b9e6c1a35afcc09 ("KVM: x86: Don't report guest userspace emulation error to userspace") says:

    Commit fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures
    to user-space") disabled the reporting of L2 (nested guest)
    emulation failures to userspace due to race-condition between a
    vmexit and the instruction emulator. The same rational applies also
    to userspace applications that are permitted by the guest OS to
    access MMIO area or perform PIO.

    This patch extends the current behavior - of injecting a #UD
    instead of reporting it to userspace - also for guest userspace
    code.

I searched the archives but failed to find anything. Can someone please explain why this is needed? Or, why not let userspace decide what to do based on the cpl, whether to continue execution or kill the guest? Is the assumption here that this is what userspace always wants?
Re: kvmclock doesn't work, help?
On Wed, Dec 9, 2015 at 2:27 PM, Andy Lutomirskiwrote: > On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini wrote: >> >> >> On 09/12/2015 22:49, Andy Lutomirski wrote: >>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini wrote: On 09/12/2015 22:10, Andy Lutomirski wrote: > Can we please stop making kvmclock more complex? It's a beast right > now, and not in a good way. It's far too tangled with the vclock > machinery on both the host and guest sides, the pvclock stuff is not > well thought out (even in principle in an ABI sense), and it's never > been clear to my what problem exactly the kvmclock stuff is supposed > to solve. It's supposed to solve the problem that: - not all hosts have a working TSC >>> >>> Fine, but we don't need any vdso integration for that. >> >> Well, you still want a fast time source. That was a given. :) > > If the host can't figure out how to give *itself* a fast time source, > I'd be surprised if KVM can manage to give the guest a fast, reliable > time source. > >> - even if they all do, virtual machines can be migrated (or saved/restored) to a host with a different TSC frequency - any MMIO- or PIO-based mechanism to access the current time is orders of magnitude slower than the TSC and less precise too. >>> >>> Yup. But TSC by itself gets that benefit, too. >> >> Yes, the problem is if you want to solve all three of them. The first >> two are solved by the ACPI PM timer with a decent resolution (70 >> ns---much faster anyway than an I/O port access). The third is solved >> by TSC. To solve all three, you need kvmclock. > > Still confused. Is kvmclock really used in cases where even the host > can't pull of working TSC? > >> > I'm somewhat tempted to suggest that we delete kvmclock entirely and > start over. A correctly functioning KVM guest using TSC (i.e. > ignoring kvmclock entirely) seems to work rather more reliably and > considerably faster than a kvmclock guest. 
If all your hosts have a working TSC and you don't do migration or save/restore, that's a valid configuration. It's not a good default, however. >>> >>> Er? >>> >>> kvmclock is still really quite slow and buggy. >> >> Unless it takes 3-4000 clock cycles for a gettimeofday, which it >> shouldn't even with vdso disabled, it's definitely not slower than PIO. >> >>> And the patch I identified is definitely a problem here: >>> >>> [ 136.131241] KVM: disabling fast timing permanently due to inability >>> to recover from suspend >>> >>> I got that on the host with this whitespace-damaged patch: >>> >>> if (backwards_tsc) { >>> u64 delta_cyc = max_tsc - local_tsc; >>> + if (!backwards_tsc_observed) >>> + pr_warn("KVM: disabling fast timing >>> permanently due to inability to recover from suspend\n"); >>> >>> when I suspended and resumed. >>> >>> Can anyone explain what problem >>> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve? On >>> brief inspection, it just seems to be incorrect. Shouldn't KVM's >>> normal TSC logic handle that case right? After all, all vcpus should >>> be paused when we resume from suspend. At worst, we should just need >>> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus. (Actually, >>> shouldn't we do that regardless of which way the TSC jumped on >>> suspend/resume? After all, the jTSC-to-wall-clock offset is quite >>> likely to change except on the very small handful of CPUs (if any) >>> that keep the TSC running in S3 and hibernate. >> >> I don't recall the details of that patch, so Marcelo will have to answer >> this, or Alex too since he chimed in the original thread. At least it >> should be made conditional on the existence of a VM at suspend time (and >> the master clock stuff should be made per VM, as I suggested at >> https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html). >> >> It would indeed be great if the master clock could be dropped. But I'm >> definitely missing some of the subtle details. 
:( > > Me, too. > > Anyway, see the attached untested patch. Marcelo? That patch seems to work. I have valid timing before and after host suspend. When I suspend and resume the host with a running guest, I get: [ 26.504071] clocksource: timekeeping watchdog: Marking clocksource 'tsc' as unstable because the skew is too large: [ 26.505253] clocksource: 'kvm-clock' wd_now: 66744c542 wd_last: 564b09794 mask: [ 26.506436] clocksource: 'tsc' cs_now: fee310b133c8 cs_last: cf5d0b952 mask: in the guest, which is arguably correct. KVM could be further improved to update the tsc offset after suspend/resume to get rid of that artifact. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a
Re: kvmclock doesn't work, help?
On 09/12/2015 22:49, Andy Lutomirski wrote: > On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonziniwrote: >> >> >> On 09/12/2015 22:10, Andy Lutomirski wrote: >>> Can we please stop making kvmclock more complex? It's a beast right >>> now, and not in a good way. It's far too tangled with the vclock >>> machinery on both the host and guest sides, the pvclock stuff is not >>> well thought out (even in principle in an ABI sense), and it's never >>> been clear to my what problem exactly the kvmclock stuff is supposed >>> to solve. >> >> It's supposed to solve the problem that: >> >> - not all hosts have a working TSC > > Fine, but we don't need any vdso integration for that. Well, you still want a fast time source. That was a given. :) >> - even if they all do, virtual machines can be migrated (or >> saved/restored) to a host with a different TSC frequency >> >> - any MMIO- or PIO-based mechanism to access the current time is orders >> of magnitude slower than the TSC and less precise too. > > Yup. But TSC by itself gets that benefit, too. Yes, the problem is if you want to solve all three of them. The first two are solved by the ACPI PM timer with a decent resolution (70 ns---much faster anyway than an I/O port access). The third is solved by TSC. To solve all three, you need kvmclock. >>> I'm somewhat tempted to suggest that we delete kvmclock entirely and >>> start over. A correctly functioning KVM guest using TSC (i.e. >>> ignoring kvmclock entirely) seems to work rather more reliably and >>> considerably faster than a kvmclock guest. >> >> If all your hosts have a working TSC and you don't do migration or >> save/restore, that's a valid configuration. It's not a good default, >> however. > > Er? > > kvmclock is still really quite slow and buggy. Unless it takes 3-4000 clock cycles for a gettimeofday, which it shouldn't even with vdso disabled, it's definitely not slower than PIO. 
> And the patch I identified is definitely a problem here:
>
> [  136.131241] KVM: disabling fast timing permanently due to inability
> to recover from suspend
>
> I got that on the host with this whitespace-damaged patch:
>
>         if (backwards_tsc) {
>                 u64 delta_cyc = max_tsc - local_tsc;
> +               if (!backwards_tsc_observed)
> +                       pr_warn("KVM: disabling fast timing
> permanently due to inability to recover from suspend\n");
>
> when I suspended and resumed.
>
> Can anyone explain what problem
> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
> brief inspection, it just seems to be incorrect.  Shouldn't KVM's
> normal TSC logic handle that case right?  After all, all vcpus should
> be paused when we resume from suspend.  At worst, we should just need
> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
> shouldn't we do that regardless of which way the TSC jumped on
> suspend/resume?  After all, the TSC-to-wall-clock offset is quite
> likely to change except on the very small handful of CPUs (if any)
> that keep the TSC running in S3 and hibernate.

I don't recall the details of that patch, so Marcelo will have to
answer this, or Alex too since he chimed in the original thread.  At
least it should be made conditional on the existence of a VM at
suspend time (and the master clock stuff should be made per VM, as I
suggested at
https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).

It would indeed be great if the master clock could be dropped.  But
I'm definitely missing some of the subtle details. :(

Paolo
Re: live migration vs device assignment (motivation)
On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:
> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> > I thought about what this is doing at the high level, and I do see
> > some value in what you are trying to do, but I also think we need to
> > clarify the motivation a bit more.  What you are saying is not really
> > what the patches are doing.
> >
> > And with that clearer understanding of the motivation in mind
> > (assuming it actually captures a real need), I would also like to
> > suggest some changes.
>
> Motivation:
> Most current solutions for migration with a passthrough device are
> based on PCI hotplug, but it has side effects and can't work for all
> devices.
>
> For NIC devices:
> The PCI hotplug solution can work around network device migration
> via switching VF and PF.

This is just more confusion.  Hotplug is just a way to add and remove
devices.  Switching VF and PF is up to guest and hypervisor.

> But switching the network interface will introduce service downtime.
>
> I tested the service downtime via putting VF and PV interfaces
> into a bonded interface and pinging the bonded interface during plug
> and unplug of the VF:
> 1) About 100ms when adding the VF
> 2) About 30ms when removing the VF

OK, and what's the source of the downtime?  I'm guessing that's just
arp being repopulated.  So simply save and re-populate it.  That would
be a much cleaner solution.  Or maybe there's a timer there that just
delays hotplug for no reason.  Fix it, everyone will benefit.

> It also requires the guest to do switch configuration.

That's just wrong.  If you want a switch, you need to configure a
switch.

> These are hard to manage and deploy for our customers.

So kernel wants to remain flexible, and the stack is configurable.
Downside: customers need to deploy userspace to configure it.  Your
solution: a hard-coded configuration within kernel and hypervisor.
Sorry, this makes no sense.  If kernel is easier for you to deploy
than userspace, you need to rethink your deployment strategy.

> To maintain PV performance during migration, the host side also needs
> to assign a VF to the PV device.  This affects scalability.

No idea what this means.

> These factors block SR-IOV NIC passthrough usage in cloud services and
> OPNFV, which require high network performance and stability.

Everyone needs performance and scalability.

> For other kinds of devices, it's hard to make this work.
> We are also adding migration support for QAT (QuickAssist Technology)
> devices.
>
> QAT device use case introduction:
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive
> operations, such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic
> curve cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
>
> PCI hotplug will not work for such devices during migration, and these
> operations will fail when unplugging the device.
>
> So we are trying to implement a new solution which really migrates
> device state to the target machine and won't affect the user during
> migration, with low service downtime.

Let's assume for the sake of the argument that there's a lot going on
and removing the device is just too slow (though you should figure out
what's going on before giving up and just building something new from
scratch).  I still don't think you should be migrating state.  That's
just too fragile, and it also means you depend on the driver to be
nice and shut down the device on source, so you can not migrate at
will.  Instead, reset the device on destination and re-initialize it.

-- 
MST
[PATCH 1/5] x86/kvm: On KVM re-enable (e.g. after suspend), update clocks
This gets rid of the "did TSC go backwards" logic and just updates all clocks. It should work better (no more disabling of fast timing) and more reliably (all of the clocks are actually updated). Signed-off-by: Andy Lutomirski--- arch/x86/kvm/x86.c | 75 +++--- 1 file changed, 3 insertions(+), 72 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index eed32283d22c..c88f91f4b1a3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -123,8 +123,6 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR); unsigned int __read_mostly lapic_timer_advance_ns = 0; module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR); -static bool __read_mostly backwards_tsc_observed = false; - #define KVM_NR_SHARED_MSRS 16 struct kvm_shared_msrs_global { @@ -1671,7 +1669,6 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm) >master_cycle_now); ka->use_master_clock = host_tsc_clocksource && vcpus_matched - && !backwards_tsc_observed && !ka->boot_vcpu_runs_old_kvmclock; if (ka->use_master_clock) @@ -7369,88 +7366,22 @@ int kvm_arch_hardware_enable(void) struct kvm_vcpu *vcpu; int i; int ret; - u64 local_tsc; - u64 max_tsc = 0; - bool stable, backwards_tsc = false; kvm_shared_msr_cpu_online(); ret = kvm_x86_ops->hardware_enable(); if (ret != 0) return ret; - local_tsc = rdtsc(); - stable = !check_tsc_unstable(); list_for_each_entry(kvm, _list, vm_list) { kvm_for_each_vcpu(i, vcpu, kvm) { - if (!stable && vcpu->cpu == smp_processor_id()) + if (vcpu->cpu == smp_processor_id()) { kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); - if (stable && vcpu->arch.last_host_tsc > local_tsc) { - backwards_tsc = true; - if (vcpu->arch.last_host_tsc > max_tsc) - max_tsc = vcpu->arch.last_host_tsc; + kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, +vcpu); } } } - /* -* Sometimes, even reliable TSCs go backwards. This happens on -* platforms that reset TSC during suspend or hibernate actions, but -* maintain synchronization. We must compensate. 
Fortunately, we can -* detect that condition here, which happens early in CPU bringup, -* before any KVM threads can be running. Unfortunately, we can't -* bring the TSCs fully up to date with real time, as we aren't yet far -* enough into CPU bringup that we know how much real time has actually -* elapsed; our helper function, get_kernel_ns() will be using boot -* variables that haven't been updated yet. -* -* So we simply find the maximum observed TSC above, then record the -* adjustment to TSC in each VCPU. When the VCPU later gets loaded, -* the adjustment will be applied. Note that we accumulate -* adjustments, in case multiple suspend cycles happen before some VCPU -* gets a chance to run again. In the event that no KVM threads get a -* chance to run, we will miss the entire elapsed period, as we'll have -* reset last_host_tsc, so VCPUs will not have the TSC adjusted and may -* loose cycle time. This isn't too big a deal, since the loss will be -* uniform across all VCPUs (not to mention the scenario is extremely -* unlikely). It is possible that a second hibernate recovery happens -* much faster than a first, causing the observed TSC here to be -* smaller; this would require additional padding adjustment, which is -* why we set last_host_tsc to the local tsc observed here. -* -* N.B. - this code below runs only on platforms with reliable TSC, -* as that is the only way backwards_tsc is set above. Also note -* that this runs for ALL vcpus, which is not a bug; all VCPUs should -* have the same delta_cyc adjustment applied if backwards_tsc -* is detected. Note further, this adjustment is only done once, -* as we reset last_host_tsc on all VCPUs to stop this from being -* called multiple times (one for each physical CPU bringup). -* -* Platforms with unreliable TSCs don't have to deal with this, they -* will be compensated by the logic in vcpu_load, which sets the TSC to -* catchup mode. 
This will catchup all VCPUs to real time, but cannot -* guarantee that they stay in perfect synchronization. -*/ - if (backwards_tsc) { -
[PATCH 2/5] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
From: Andy LutomirskiThe pvclock vdso code was too abstracted to understand easily and excessively paranoid. Simplify it for a huge speedup. This opens the door for additional simplifications, as the vdso no longer accesses the pvti for any vcpu other than vcpu 0. Before, vclock_gettime using kvm-clock took about 45ns on my machine. With this change, it takes 29ns, which is almost as fast as the pure TSC implementation. Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vclock_gettime.c | 81 1 file changed, 46 insertions(+), 35 deletions(-) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index ca94fa649251..c325ba1bdddf 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) static notrace cycle_t vread_pvclock(int *mode) { - const struct pvclock_vsyscall_time_info *pvti; + const struct pvclock_vcpu_time_info *pvti = _pvti(0)->pvti; cycle_t ret; - u64 last; - u32 version; - u8 flags; - unsigned cpu, cpu1; - + u64 tsc, pvti_tsc; + u64 last, delta, pvti_system_time; + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift; /* -* Note: hypervisor must guarantee that: -* 1. cpu ID number maps 1:1 to per-CPU pvclock time info. -* 2. that per-CPU pvclock time info is updated if the -*underlying CPU changes. -* 3. that version is increased whenever underlying CPU -*changes. +* Note: The kernel and hypervisor must guarantee that cpu ID +* number maps 1:1 to per-CPU pvclock time info. +* +* Because the hypervisor is entirely unaware of guest userspace +* preemption, it cannot guarantee that per-CPU pvclock time +* info is updated if the underlying CPU changes or that that +* version is increased whenever underlying CPU changes. * +* On KVM, we are guaranteed that pvti updates for any vCPU are +* atomic as seen by *all* vCPUs. This is an even stronger +* guarantee than we get with a normal seqlock. 
+* +* On Xen, we don't appear to have that guarantee, but Xen still +* supplies a valid seqlock using the version field. + +* We only do pvclock vdso timing at all if +* PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to +* mean that all vCPUs have matching pvti and that the TSC is +* synced, so we can just look at vCPU 0's pvti. */ - do { - cpu = __getcpu() & VGETCPU_CPU_MASK; - /* TODO: We can put vcpu id into higher bits of pvti.version. -* This will save a couple of cycles by getting rid of -* __getcpu() calls (Gleb). -*/ - - pvti = get_pvti(cpu); - - version = __pvclock_read_cycles(>pvti, , ); - - /* -* Test we're still on the cpu as well as the version. -* We could have been migrated just after the first -* vgetcpu but before fetching the version, so we -* wouldn't notice a version change. -*/ - cpu1 = __getcpu() & VGETCPU_CPU_MASK; - } while (unlikely(cpu != cpu1 || - (pvti->pvti.version & 1) || - pvti->pvti.version != version)); - - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT))) + + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) { *mode = VCLOCK_NONE; + return 0; + } + + do { + version = pvti->version; + + /* This is also a read barrier, so we'll read version first. */ + tsc = rdtsc_ordered(); + + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul; + pvti_tsc_shift = pvti->tsc_shift; + pvti_system_time = pvti->system_time; + pvti_tsc = pvti->tsc_timestamp; + + /* Make sure that the version double-check is last. */ + smp_rmb(); + } while (unlikely((version & 1) || version != pvti->version)); + + delta = tsc - pvti_tsc; + ret = pvti_system_time + + pvclock_scale_delta(delta, pvti_tsc_to_system_mul, + pvti_tsc_shift); /* refer to tsc.c read_tsc() comment for rationale */ last = gtod->cycle_last; -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GIT PULL] VFIO fixes for v4.4-rc5
Hi Linus,

The following changes since commit 8005c49d9aea74d382f474ce11afbbc7d7130bec:

  Linux 4.4-rc1 (2015-11-15 17:00:27 -0800)

are available in the git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v4.4-rc5

for you to fetch changes up to ae5515d66362b9d96cdcfce504567f0b8b7bd83e:

  Revert: "vfio: Include No-IOMMU mode" (2015-12-04 08:38:42 -0700)

VFIO fixes for v4.4-rc5

 - Various fixes for removing redundancy, const'ifying structs,
   avoiding stack usage, fixing WARN usage (Krzysztof Kozlowski,
   Julia Lawall, Kees Cook, Dan Carpenter)

 - Revert No-IOMMU mode as the intended user has not emerged
   (Alex Williamson)

Alex Williamson (1):
      Revert: "vfio: Include No-IOMMU mode"

Dan Carpenter (1):
      vfio: fix a warning message

Julia Lawall (1):
      vfio-pci: constify pci_error_handlers structures

Kees Cook (1):
      vfio: platform: remove needless stack usage

Krzysztof Kozlowski (1):
      vfio: Drop owner assignment from platform_driver

 drivers/vfio/Kconfig                         |  15 ---
 drivers/vfio/pci/vfio_pci.c                  |  10 +-
 drivers/vfio/platform/vfio_platform.c        |   1 -
 drivers/vfio/platform/vfio_platform_common.c |   5 +-
 drivers/vfio/vfio.c                          | 188 +--
 include/linux/vfio.h                         |   3 -
 include/uapi/linux/vfio.h                    |   7 -
 7 files changed, 13 insertions(+), 216 deletions(-)
RE: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts
> -----Original Message-----
> From: Radim Krčmář [mailto:rkrc...@redhat.com]
> Sent: Wednesday, December 9, 2015 10:54 PM
> To: Wu, Feng
> Cc: pbonz...@redhat.com; kvm@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-
> interrupts
>
> 2015-12-09 08:19+, Wu, Feng:
> >> -----Original Message-----
> >> From: Radim Krčmář [mailto:rkrc...@redhat.com]
> >> Sent: Tuesday, November 17, 2015 3:03 AM
> >> To: Wu, Feng
> >> Cc: pbonz...@redhat.com; kvm@vger.kernel.org; linux-ker...@vger.kernel.org
> >> Subject: Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-
> >> interrupts
> >>
> >> 2015-11-09 10:46+0800, Feng Wu:
> >> > +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
> >> > +                                              struct kvm_lapic_irq *irq)
> >> > +{
> >> > +	unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
> >> > +	unsigned int dest_vcpus = 0;
> >> > +	struct kvm_vcpu *vcpu;
> >> > +	unsigned int i, mod, idx = 0;
> >> > +
> >> > +	vcpu = kvm_intr_vector_hashing_dest_fast(kvm, irq);
> >> > +	if (vcpu)
> >> > +		return vcpu;
> >>
> >> I think the rest of this function shouldn't be implemented:
> >> - Shorthands are only for IPIs and hence don't need to be handled,
> >> - Lowest priority physical broadcast is not supported,
> >> - Lowest priority cluster logical broadcast is not supported,
> >> - No point in optimizing mixed xAPIC and x2APIC mode,
> >
> > I read your comments again, and don't quite understand why we
> > don't need PI optimization for mixed xAPIC and x2APIC mode.
>
> There shouldn't be a non-hobbyist operating system that uses mixed
> mode, so the optimization would practically be dead code as all other
> cases are handled by kvm_intr_vector_hashing_dest_fast().

Thanks a lot for your elaboration!

> I think that having extra code would bring problems in the future -- we
> need to take care of it when refactoring KVM's APIC and we should also
> write a unit-test for this otherwise dead path.  I don't think that the
> benefit for guests would ever balance those efforts.
>
> (Physical xAPIC+x2APIC mode is still somewhat reasonable and xAPIC CPUs
> start with LDR=0, which means that operating system doesn't need to
> utilize mixed mode, as defined by KVM, when switching to x2APIC.)

I think you mean Physical xAPIC+Physical x2APIC mode, right?  For
physical mode, we don't use LDR in any case, do we?  So in physical
mode, we only use the APIC ID, and that is why they can be mixed -- is
my understanding correct?  Thanks a lot!

> > BTW, can we have mixed flat and cluster mode?
>
> Yes, KVM recognizes that mixed mode, but luckily, there are severe
> limitations.
>
> Notes below SDM section 10.6.2.2:
>   All processors that have their APIC software enabled (using the
>   spurious vector enable/disable bit) must have their DFRs (Destination
>   Format Registers) programmed identically.

Thanks for pointing this out, good to know it!

> I hope there isn't a human that would use it in good faith.
>
> (Only NMI/SMI/INIT/SIPI are delivered in software disabled mode and if
> the system uses cluster xAPIC, OS should set DFR before LDR, which
> doesn't trigger mixed mode either.)

Just curious: if the APIC is software disabled and in xAPIC mode, and
the OS sets different values of DFR for different APICs, then when the
OS sets LDR, KVM can trigger mixed flat and cluster mode, right?

Thanks,
Feng
[PATCH 0/5] x86: KVM vdso and clock improvements
NB: patch 1 doesn't really belong here, but it makes this a lot easier
for me to test.  Patch 1, if it's okay at all, should go through the
kvm tree.  The rest should probably go through tip:x86/vdso once
they're reviewed.

I'll do a followup to enable vdso pvclock on 32-bit guests.  I'm not
currently set up to test it.  (The KVM people could also do it very
easily on top of these patches.)

Andy Lutomirski (5):
  x86/kvm: On KVM re-enable (e.g. after suspend), update clocks
  x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
  x86/vdso: Remove pvclock fixmap machinery
  x86/vdso: Enable vdso pvclock access on all vdso variants

 arch/x86/entry/vdso/vclock_gettime.c  | 151 --
 arch/x86/entry/vdso/vdso-layout.lds.S |   3 +-
 arch/x86/entry/vdso/vdso2c.c          |   3 +
 arch/x86/entry/vdso/vma.c             |  14
 arch/x86/include/asm/fixmap.h         |   5 --
 arch/x86/include/asm/pvclock.h        |  14 ++--
 arch/x86/include/asm/vdso.h           |   1 +
 arch/x86/kernel/kvmclock.c            |  11 ++-
 arch/x86/kernel/pvclock.c             |  24 --
 arch/x86/kvm/x86.c                    |  75 +
 10 files changed, 110 insertions(+), 191 deletions(-)

-- 
2.5.0
[PATCH 3/5] x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
Signed-off-by: Andy Lutomirski--- arch/x86/entry/vdso/vclock_gettime.c | 20 arch/x86/entry/vdso/vdso-layout.lds.S | 3 ++- arch/x86/entry/vdso/vdso2c.c | 3 +++ arch/x86/entry/vdso/vma.c | 13 + arch/x86/include/asm/pvclock.h| 9 + arch/x86/include/asm/vdso.h | 1 + arch/x86/kernel/kvmclock.c| 5 + 7 files changed, 41 insertions(+), 13 deletions(-) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index c325ba1bdddf..5dd363d54348 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -36,6 +36,11 @@ static notrace cycle_t vread_hpet(void) } #endif +#ifdef CONFIG_PARAVIRT_CLOCK +extern u8 pvclock_page + __attribute__((visibility("hidden"))); +#endif + #ifndef BUILD_VDSO32 #include @@ -62,23 +67,14 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz) #ifdef CONFIG_PARAVIRT_CLOCK -static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) +static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void) { - const struct pvclock_vsyscall_time_info *pvti_base; - int idx = cpu / (PAGE_SIZE/PVTI_SIZE); - int offset = cpu % (PAGE_SIZE/PVTI_SIZE); - - BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END); - - pvti_base = (struct pvclock_vsyscall_time_info *) - __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx); - - return _base[offset]; + return (const struct pvclock_vsyscall_time_info *)_page; } static notrace cycle_t vread_pvclock(int *mode) { - const struct pvclock_vcpu_time_info *pvti = _pvti(0)->pvti; + const struct pvclock_vcpu_time_info *pvti = _pvti0()->pvti; cycle_t ret; u64 tsc, pvti_tsc; u64 last, delta, pvti_system_time; diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S index de2c921025f5..4158acc17df0 100644 --- a/arch/x86/entry/vdso/vdso-layout.lds.S +++ b/arch/x86/entry/vdso/vdso-layout.lds.S @@ -25,7 +25,7 @@ SECTIONS * segment. */ - vvar_start = . - 2 * PAGE_SIZE; + vvar_start = . 
- 3 * PAGE_SIZE; vvar_page = vvar_start; /* Place all vvars at the offsets in asm/vvar.h. */ @@ -36,6 +36,7 @@ SECTIONS #undef EMIT_VVAR hpet_page = vvar_start + PAGE_SIZE; + pvclock_page = vvar_start + 2 * PAGE_SIZE; . = SIZEOF_HEADERS; diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c index 785d9922b106..491020b2826d 100644 --- a/arch/x86/entry/vdso/vdso2c.c +++ b/arch/x86/entry/vdso/vdso2c.c @@ -73,6 +73,7 @@ enum { sym_vvar_start, sym_vvar_page, sym_hpet_page, + sym_pvclock_page, sym_VDSO_FAKE_SECTION_TABLE_START, sym_VDSO_FAKE_SECTION_TABLE_END, }; @@ -80,6 +81,7 @@ enum { const int special_pages[] = { sym_vvar_page, sym_hpet_page, + sym_pvclock_page, }; struct vdso_sym { @@ -91,6 +93,7 @@ struct vdso_sym required_syms[] = { [sym_vvar_start] = {"vvar_start", true}, [sym_vvar_page] = {"vvar_page", true}, [sym_hpet_page] = {"hpet_page", true}, + [sym_pvclock_page] = {"pvclock_page", true}, [sym_VDSO_FAKE_SECTION_TABLE_START] = { "VDSO_FAKE_SECTION_TABLE_START", false }, diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index 64df47148160..aa828191c654 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -100,6 +100,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) .name = "[vvar]", .pages = no_pages, }; + struct pvclock_vsyscall_time_info *pvti; if (calculate_addr) { addr = vdso_addr(current->mm->start_stack, @@ -169,6 +170,18 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) } #endif + pvti = pvclock_pvti_cpu0_va(); + if (pvti && image->sym_pvclock_page) { + ret = remap_pfn_range(vma, + text_start + image->sym_pvclock_page, + __pa(pvti) >> PAGE_SHIFT, + PAGE_SIZE, + PAGE_READONLY); + + if (ret) + goto up_fail; + } + up_fail: if (ret) current->mm->context.vdso = NULL; diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index 7a6bed5c08bc..3864398c7cb2 100644 --- a/arch/x86/include/asm/pvclock.h +++ 
b/arch/x86/include/asm/pvclock.h @@ -4,6 +4,15 @@ #include #include +#ifdef CONFIG_PARAVIRT_CLOCK +extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void); +#else +static inline struct pvclock_vsyscall_time_info
Re: [PATCH] KVM: PPC: Increase memslots to 512
On Wed, Dec 09, 2015 at 11:34:07AM +0100, Thomas Huth wrote:
> Only using 32 memslots for KVM on powerpc is way too low, you can
> nowadays hit this limit quite fast by adding a couple of PCI devices
> and/or pluggable memory DIMMs to the guest.
>
> x86 already increased the KVM_USER_MEM_SLOTS to 509, to satisfy 256
> pluggable DIMM slots, 3 private slots and 253 slots for other things
> like PCI devices (i.e. resulting in 256 + 3 + 253 = 512 slots in
> total). We should do something similar for powerpc, and since we do
> not use private slots here, we can set the value to 512 directly.
>
> While we're at it, also remove the KVM_MEM_SLOTS_NUM definition
> from the powerpc-specific header since this gets defined in the
> generic kvm_host.h header anyway.
>
> Signed-off-by: Thomas Huth

Thanks, applied to my kvm-ppc-next branch.

Paul.
Re: [PATCH] KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8
On Fri, Nov 20, 2015 at 09:11:45AM +0100, Thomas Huth wrote:
> In the old DABR register, the BT (Breakpoint Translation) bit
> is bit number 61. In the new DAWRX register, the WT (Watchpoint
> Translation) bit is bit number 59. So to move the DABR-BT bit
> into the position of the DAWRX-WT bit, it has to be shifted by
> two, not only by one. This fixes hardware watchpoints in gdb of
> older guests that only use the H_SET_DABR/X interface instead
> of the new H_SET_MODE interface.
>
> Signed-off-by: Thomas Huth

Thanks, applied to my kvm-ppc-next branch, with cc: sta...@vger.kernel.org.

Paul.
Re: [PATCH] kvm: remove unused variable 'vcpu_book3s'
On Tue, Dec 01, 2015 at 08:42:10PM -0300, Geyslan G. Bem wrote:
> The vcpu_book3s struct is assigned but never used. So remove it.
>
> Signed-off-by: Geyslan G. Bem

Thanks, applied to my kvm-ppc-next branch.

Paul.
[PATCH 4/5] x86/vdso: Remove pvclock fixmap machinery
Signed-off-by: Andy Lutomirski--- arch/x86/entry/vdso/vclock_gettime.c | 1 - arch/x86/entry/vdso/vma.c| 1 + arch/x86/include/asm/fixmap.h| 5 - arch/x86/include/asm/pvclock.h | 5 - arch/x86/kernel/kvmclock.c | 6 -- arch/x86/kernel/pvclock.c| 24 6 files changed, 1 insertion(+), 41 deletions(-) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index 5dd363d54348..59a98c25bde7 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -45,7 +45,6 @@ extern u8 pvclock_page #include #include -#include #include notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index aa828191c654..b8f69e264ac4 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h index f80d70009ff8..6d7d0e52ed5a 100644 --- a/arch/x86/include/asm/fixmap.h +++ b/arch/x86/include/asm/fixmap.h @@ -19,7 +19,6 @@ #include #include #include -#include #ifdef CONFIG_X86_32 #include #include @@ -72,10 +71,6 @@ enum fixed_addresses { #ifdef CONFIG_X86_VSYSCALL_EMULATION VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT, #endif -#ifdef CONFIG_PARAVIRT_CLOCK - PVCLOCK_FIXMAP_BEGIN, - PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1, -#endif #endif FIX_DBGP_BASE, FIX_EARLYCON_MEM_BASE, diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index 3864398c7cb2..66df22b2e0c9 100644 --- a/arch/x86/include/asm/pvclock.h +++ b/arch/x86/include/asm/pvclock.h @@ -100,10 +100,5 @@ struct pvclock_vsyscall_time_info { } __attribute__((__aligned__(SMP_CACHE_BYTES))); #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info) -#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1) - -int __init pvclock_init_vsyscall(struct 
pvclock_vsyscall_time_info *i, -int size); -struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu); #endif /* _ASM_X86_PVCLOCK_H */ diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index ec1b06dc82d2..72cef58693c7 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -310,7 +310,6 @@ int __init kvm_setup_vsyscall_timeinfo(void) { #ifdef CONFIG_X86_64 int cpu; - int ret; u8 flags; struct pvclock_vcpu_time_info *vcpu_time; unsigned int size; @@ -330,11 +329,6 @@ int __init kvm_setup_vsyscall_timeinfo(void) return 1; } - if ((ret = pvclock_init_vsyscall(hv_clock, size))) { - put_cpu(); - return ret; - } - put_cpu(); kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK; diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 2f355d229a58..99bfc025111d 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -140,27 +140,3 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock, set_normalized_timespec(ts, now.tv_sec, now.tv_nsec); } - -#ifdef CONFIG_X86_64 -/* - * Initialize the generic pvclock vsyscall state. This will allocate - * a/some page(s) for the per-vcpu pvclock information, set up a - * fixmap mapping for the page(s) - */ - -int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i, -int size) -{ - int idx; - - WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE); - - for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) { - __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx, -__pa(i) + (idx*PAGE_SIZE), -PAGE_KERNEL_VVAR); - } - - return 0; -} -#endif -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/5] x86/vdso: Enable vdso pvclock access on all vdso variants
Now that pvclock doesn't require access to the fixmap, all vdso variants
can use it. The kernel side isn't wired up for 32-bit kernels yet, but
this covers 32-bit and x32 userspace on 64-bit kernels.

Signed-off-by: Andy Lutomirski
---
 arch/x86/entry/vdso/vclock_gettime.c | 91 ++++++++++++++----------------------
 1 file changed, 40 insertions(+), 51 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 59a98c25bde7..8602f06c759f 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -17,8 +17,10 @@
 #include
 #include
 #include
+#include
 #include
 #include
+#include

 #define gtod ((vsyscall_gtod_data))
@@ -43,10 +45,6 @@ extern u8 pvclock_page

 #ifndef BUILD_VDSO32

-#include
-#include
-#include
-
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
 	long ret;
@@ -64,8 +62,42 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
 	return ret;
 }

-#ifdef CONFIG_PARAVIRT_CLOCK
+#else
+
+notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
+{
+	long ret;
+
+	asm(
+		"mov %%ebx, %%edx \n"
+		"mov %2, %%ebx \n"
+		"call __kernel_vsyscall \n"
+		"mov %%edx, %%ebx \n"
+		: "=a" (ret)
+		: "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
+		: "memory", "edx");
+	return ret;
+}
+
+notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
+{
+	long ret;
+
+	asm(
+		"mov %%ebx, %%edx \n"
+		"mov %2, %%ebx \n"
+		"call __kernel_vsyscall \n"
+		"mov %%edx, %%ebx \n"
+		: "=a" (ret)
+		: "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
+		: "memory", "edx");
+	return ret;
+}
+
+#endif
+
+#ifdef CONFIG_PARAVIRT_CLOCK
 static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
 {
 	return (const struct pvclock_vsyscall_time_info *)_page;
 }
@@ -109,9 +141,9 @@ static notrace cycle_t vread_pvclock(int *mode)
 	do {
 		version = pvti->version;

-		/* This is also a read barrier, so we'll read version first. */
-		tsc = rdtsc_ordered();
+		smp_rmb();
+		tsc = rdtsc_ordered();
 		pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
 		pvti_tsc_shift = pvti->tsc_shift;
 		pvti_system_time = pvti->system_time;
@@ -126,7 +158,7 @@ static notrace cycle_t vread_pvclock(int *mode)
 	pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
 			    pvti_tsc_shift);

-	/* refer to vread_tsc() comment for rationale */
 	last = gtod->cycle_last;

 	if (likely(ret >= last))
@@ -136,49 +168,6 @@ static notrace cycle_t vread_pvclock(int *mode)
 }
 #endif

-#else
-
-notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
-{
-	long ret;
-
-	asm(
-		"mov %%ebx, %%edx \n"
-		"mov %2, %%ebx \n"
-		"call __kernel_vsyscall \n"
-		"mov %%edx, %%ebx \n"
-		: "=a" (ret)
-		: "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
-		: "memory", "edx");
-	return ret;
-}
-
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-	long ret;
-
-	asm(
-		"mov %%ebx, %%edx \n"
-		"mov %2, %%ebx \n"
-		"call __kernel_vsyscall \n"
-		"mov %%edx, %%ebx \n"
-		: "=a" (ret)
-		: "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
-		: "memory", "edx");
-	return ret;
-}
-
-#ifdef CONFIG_PARAVIRT_CLOCK
-
-static notrace cycle_t vread_pvclock(int *mode)
-{
-	*mode = VCLOCK_NONE;
-	return 0;
-}
-#endif
-
-#endif
-
 notrace static cycle_t vread_tsc(void)
 {
 	cycle_t ret = (cycle_t)rdtsc_ordered();
--
2.5.0
Re: live migration vs device assignment (motivation)
On 12/10/2015 1:14 AM, Alexander Duyck wrote:

On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu wrote:

For other kinds of devices, it's hard to make this work. We are also adding migration support for QAT (QuickAssist Technology) devices.

QAT use case introduction: server, networking, big data, and storage applications use QuickAssist Technology to offload servers from handling compute-intensive operations, such as:
1) Symmetric cryptography functions, including cipher operations and authentication operations
2) Public key functions, including RSA, Diffie-Hellman, and elliptic curve cryptography
3) Compression and decompression functions, including DEFLATE and LZS

PCI hotplug will not work for such devices during migration, and these operations will fail when the device is unplugged.

I assume the problem is that with a PCI hotplug event you are losing the state information for the device, do I have that right?

Looking over the QAT drivers, it doesn't seem like any of them support the suspend/resume PM calls. I would imagine that makes it difficult for a system with a QAT card in it to drop to a low power state. You might want to try enabling suspend/resume support for the devices on bare metal before you attempt to take on migration, as it would provide you with a good testing framework to see what needs to be saved/restored within the device, and in what order, before you attempt to do the same while migrating from one system to another.

Sure. The suspend/resume job is under way. Actually, we have already enabled QAT migration internally; we are doing more testing and fixing bugs.

- Alex
[GIT PULL] Please pull my kvm-ppc-fixes branch
Hi Paolo,

I have a small patch that I would like to get into 4.4 because it fixes
a bug which, for certain kernel configs, allows userspace to crash the
kernel. The configs are those for which KVM_BOOK3S_64_HV is set (y or m)
and KVM_BOOK3S_64_PR is not. Fortunately most distros that enable
KVM_BOOK3S_64_HV also enable KVM_BOOK3S_64_PR, as far as I can tell.

Thanks,
Paul.

The following changes since commit 09922076003ad66de41ea14d2f8c3b4a16ec7774:

  Merge tag 'kvm-arm-for-v4.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master (2015-12-04 18:32:32 +0100)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc.git kvm-ppc-fixes

for you to fetch changes up to c20875a3e638e4a03e099b343ec798edd1af5cc6:

  KVM: PPC: Book3S HV: Prohibit setting illegal transaction state in MSR (2015-12-10 11:34:27 +1100)

Paul Mackerras (1):
      KVM: PPC: Book3S HV: Prohibit setting illegal transaction state in MSR

 arch/powerpc/kvm/book3s_hv.c | 6 ++
 1 file changed, 6 insertions(+)
Re: [PATCH v9 0/5] implement vNVDIMM
New version, new week, and unfortunate new ping... :(

On 12/02/2015 03:20 PM, Xiao Guangrong wrote:

This patchset can be found at:
      https://github.com/xiaogr/qemu.git nvdimm-v9

It is based on the pci branch on Michael's tree; the top commit is:
      commit 0c73277af7 (vhost-user-test: fix crash with glib < 2.36).

Changelog in v9:
- the changes address Michael's comments:
  1) move the control parameter to -machine; it is off by default and can be enabled by, for example, -machine pc,nvdimm
  2) introduce a macro to define "NCAL"
  3) abstract the function, nvdimm_build_device_dsm(), to clean up the code
  4) adjust the code style of the dsm method
  5) add a spec reference in the code comment
- other: pick up Stefan's Reviewed-by

Changelog in v8:
We split the long patch series into small parts; as you see now, this is the first part, which enables NVDIMM without label data support.
The command line has been changed because some patches simplifying things have not been included in this series; you should specify the file size exactly, using parameters as follows:
      memory-backend-file,id=mem1,share,mem-path=/tmp/nvdimm1,size=10G \
      -device nvdimm,memdev=mem1,id=nv1

Changelog in v7:
- changes from Vladimir Sementsov-Ogievskiy's comments:
  1) let gethugepagesize() report when fstat fails instead of returning the normal page size
  2) rename open_file_path to open_ram_file_path
  3) better log the error message by using error_setg_errno
  4) update the commit log to explain hugepage detection on Windows
- changes from Eduardo Habkost's comments:
  1) use 'Error**' to collect error messages for qemu_file_get_page_size()
  2) move the gethugepagesize() replacement to the same patch to make it easier to review
  3) introduce qemu_get_file_size to unify the code with raw_getlength()
- changes from Stefan's comments:
  1) check that the memory region is large enough to contain the DSM output buffer
- changes from Eric Blake's comments:
  1) update the shell command in the commit log to generate the patch which drops the 'pc-dimm' prefix
- others: pick up Reviewed-by from Stefan, Vladimir Sementsov-Ogievskiy, and Eric Blake.

Changelog in v6:
- changes from Stefan's comments:
  1) fix code style of struct naming to use CamelCase
  2) fix offset + length overflow when reading/writing label data
  3) compile hw/acpi/nvdimm.c per target so that TARGET_PAGE_SIZE can be used to replace getpagesize()

Changelog in v5:
- changes from Michael's comments:
  1) prefix nvdimm_ to everything in NVDIMM source files
  2) make parsing of _DSM Arg3 clearer
  3) comment style fixes
  4) drop a single-use definition
  5) fix dirty dsm buffer lost due to memory writes happening on the host
  6) check that the dsm buffer is big enough to contain the input data
  7) use build_append_int_noprefix to store single values in a GArray
- changes from Michael's and Igor's comments:
  1) introduce an 'nvdimm-support' parameter to control nvdimm enablement; it is disabled for 2.4 and earlier versions to keep live migration compatible
  2) only reserve 1 RAM page and 4 bytes of IO port for NVDIMM ACPI virtualization
- changes from Stefan's comments:
  1) do endian adjustment for the buffer length
- changes from Bharata B Rao's comments:
  1) fix compile on ppc
- others:
  1) the buffer length is taken directly from the IO read rather than from dsm memory
  2) fix dirty label data lost due to memory writes happening on the host

Changelog in v4:
- changes from Michael's comments:
  1) show the message, "Memory is not allocated from HugeTlbfs", if file-based memory is not allocated from hugetlbfs
  2) introduce a function, acpi_get_nvdimm_state(), to get NVDIMMState from Machine
  3) statically define the UUID and make its operation clearer
  4) use GArray to build device structures to avoid potential buffer overflow
  5) improve comments in the code
  6) improve code style
- changes from Igor's comments:
  1) add the NVDIMM ACPI spec document
  2) use a serialized method to avoid a Mutex
  3) move NVDIMM ACPI code to hw/acpi/nvdimm.c
  4) introduce a common ASL method used by _DSM for all devices to reduce ACPI size
  5) handle the UUID in ACPI AML code. BTW, I'd keep handling the revision in QEMU; it's better to upgrade QEMU to support Rev2 in the future
- changes from Stefan's comments:
  1) copy input data from DSM memory to a local buffer to avoid potential issues, as DSM memory is visible to the guest. Output data is handled in a similar way
- changes from Dan's comments:
  1) drop the static namespace, as Linux already supports label-less nvdimm devices
- changes from Vladimir's comments:
  1) print a better message, "failed to get file size for %s, can't create backend on it", if any file operation fails to obtain the file size
- others: create a git repo on github.com for better
Re: live migration vs device assignment (motivation)
On 12/10/2015 4:07 AM, Michael S. Tsirkin wrote:

On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:

On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:

I thought about what this is doing at the high level, and I do see some value in what you are trying to do, but I also think we need to clarify the motivation a bit more. What you are saying is not really what the patches are doing. And with that clearer understanding of the motivation in mind (assuming it actually captures a real need), I would also like to suggest some changes.

Motivation: Most current solutions for migration with a passthrough device are based on PCI hotplug, but that has side effects and can't work for all devices.

For NIC devices: the PCI hotplug solution can work around network device migration by switching between VF and PF.

This is just more confusion. Hotplug is just a way to add and remove devices. Switching between VF and PF is up to the guest and hypervisor.

This is a combination. Because it's not possible to migrate device state in the current world (which is what we are doing), existing solutions for migrating a VM with a passthrough NIC rely on PCI hotplug: unplug the VF before starting migration and then switch the network from the VF NIC to a PV NIC in order to maintain the network connection; plug the VF in again after migration and then switch from PV back to VF. The bond driver provides a way to switch between the PV and VF NIC automatically with the same IP and MAC, and so the bond driver is preferred. But switching the network interface will introduce service downtime. I tested the service downtime by putting the VF and PV interfaces into a bonded interface and pinging the bonded interface during VF plug and unplug:
1) About 100ms when adding the VF
2) About 30ms when deleting the VF

OK, and what's the source of the downtime? I'm guessing that's just arp being repopulated. So simply save and re-populate it. That would be a much cleaner solution. Or maybe there's a timer there that just delays hotplug for no reason. Fix it, everyone will benefit.

It also requires the guest to do switch configuration.

That's just wrong. If you want a switch, you need to configure a switch.

I meant the configuration of the switching operation between PV and VF.

These are hard to manage and deploy for our customers.

So the kernel wants to remain flexible, and the stack is configurable. Downside: customers need to deploy userspace to configure it. Your solution: a hard-coded configuration within the kernel and hypervisor. Sorry, this makes no sense. If the kernel is easier for you to deploy than userspace, you need to rethink your deployment strategy.

This is one factor. To maintain PV performance during migration, the host side also needs to assign a VF to the PV device. This affects scalability.

No idea what this means.

These factors block SR-IOV NIC passthrough usage in cloud services and OPNFV, which require high network performance and stability.

Everyone needs performance and scalability.

For other kinds of devices, it's hard to make this work. We are also adding migration support for QAT (QuickAssist Technology) devices.

QAT use case introduction: server, networking, big data, and storage applications use QuickAssist Technology to offload servers from handling compute-intensive operations, such as:
1) Symmetric cryptography functions, including cipher operations and authentication operations
2) Public key functions, including RSA, Diffie-Hellman, and elliptic curve cryptography
3) Compression and decompression functions, including DEFLATE and LZS

PCI hotplug will not work for such devices during migration, and these operations will fail when the device is unplugged. So we are trying to implement a new solution which really migrates device state to the target machine and won't affect the user during migration, with low service downtime.

Let's assume for the sake of the argument that there's a lot going on and removing the device is just too slow (though you should figure out what's going on before giving up and just building something new from scratch).

No, we can find a PV NIC as a backup for the VF NIC during migration, but that doesn't work for other kinds of devices since there is no backup for them. E.g. when migration happens while a user is compressing files via QAT, it's impossible to remove the QAT device at that point. If we do that, the compress operation will fail and affect the user experience.

I still don't think you should be migrating state. That's just too fragile, and it also means you depend on the driver to be nice and shut down the device on the source, so you cannot migrate at will. Instead, reset the device on the destination and re-initialize it.

Yes, saving and restoring device state relies on the driver, and so we rework the driver to make it more migration-friendly.
Re: [PATCH] vgaarb: fix signal handling in vga_get()
On Mon, Nov 30, 2015 at 04:17:31AM +0200, Kirill A. Shutemov wrote:
> There are a few defects in vga_get() related to signal handling:
>
> - we shouldn't check for pending signals in the TASK_UNINTERRUPTIBLE
>   case;
>
> - if we find a pending signal we must remove ourselves from the wait
>   queue and change the task state back to running;
>
> - -ERESTARTSYS is more appropriate, I guess.
>
> Signed-off-by: Kirill A. Shutemov

Ping?

David, this patch fixes a crash. It would be nice to get it into 4.4.

> ---
>
> Alex, I'm trying to get KVM with VGA passthrough working properly. I have
> i915 (HD 4600) on the host and a GTX 580 for the guest. The guest GPU is
> not capable of EFI, so I have to use x-vga=on. It kinda works with your
> patch for i915.enable_hd_vgaarb=1. But the guest refuses to initialize
> the GPU after KVM was not shut down correctly, resulting in a host crash
> like this:
>
> BUG: unable to handle kernel paging request at 880870187ed8
> IP: [] 0x880870187ed8
> PGD 2129067 PUD 800841e3
> Oops: 0011 [#1] PREEMPT SMP
> Modules linked in: iwlmvm iwlwifi
> CPU: 6 PID: 3983 Comm: qemu-system-x86 Not tainted 4.3.0-gentoo #6
> Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD7 TH/Z87X-UD7 TH-CF, BIOS F5a 06/12/2014
> task: 88087a91 ti: 8808632c task.ti: 8808632c
> RIP: 0010:[] [] 0x880870187ed8
> RSP: 0018:8808632c3d08 EFLAGS: 00010006
> RAX: 880870187db0 RBX: 70187f58 RCX:
> RDX: RSI: 0003 RDI: 880870187db0
> RBP: 8808632c3d48 R08: R09:
> R10: 000103c0 R11: 0293 R12: 81ea03c8
> R13: 8104c7cb R14: R15: 0003
> FS: 7f984f9b2700() GS:88089f38() knlGS:
> CS: 0010 DS: ES: CR0: 80050033
> CR2: 880870187ed8 CR3: 0008645f8000 CR4: 001426e0
> Stack:
>  810cc83d 632c3d28 81ea03c0
>  0046 0003
>  8808632c3d80 810cca44 88087af63800 0286
> Call Trace:
>  [] ? __wake_up_common+0x4d/0x80
>  [] __wake_up+0x34/0x50
>  [] __vga_put+0x73/0xd0
>  [] vga_put+0x54/0x80
>  [] vfio_pci_vga_rw+0x1d2/0x220
>  [] vfio_pci_rw+0x33/0x60
>  [] vfio_pci_write+0x17/0x20
>  [] vfio_device_fops_write+0x26/0x30
>  [] __vfs_write+0x23/0xe0
>  [] ? __vfs_read+0x23/0xd0
>  [] ? do_vfs_ioctl+0x2b5/0x490
>  [] vfs_write+0xa4/0x190
>  [] SyS_pwrite64+0x66/0xa0
>  [] entry_SYSCALL_64_fastpath+0x12/0x6a
> Code: 88 ff ff e0 7e 18 70 08 88 ff ff 00 8c 57 76 08 88 ff ff 20 7f 18 70 08 88 ff ff 08 7f 18 70 08 88 ff ff 94 51 1a 81 ff ff ff ff <09> 00 00 00 00 00 00 00 01 8c 57 76 08 88 ff ff 00 8c 57 76 08
> RIP [] 0x880870187ed8
> RSP
> CR2: 880870187ed8
>
> The patch fixes the crash, but doesn't help with getting the GPU in the
> guest working again.
>
> Any ideas?
>
> ---
>  drivers/gpu/vga/vgaarb.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
> index 3166e4bc4eb6..9abcaa53bd25 100644
> --- a/drivers/gpu/vga/vgaarb.c
> +++ b/drivers/gpu/vga/vgaarb.c
> @@ -395,8 +395,10 @@ int vga_get(struct pci_dev *pdev, unsigned int rsrc, int interruptible)
>  		set_current_state(interruptible ?
>  				  TASK_INTERRUPTIBLE :
>  				  TASK_UNINTERRUPTIBLE);
> -		if (signal_pending(current)) {
> -			rc = -EINTR;
> +		if (interruptible && signal_pending(current)) {
> +			__set_current_state(TASK_RUNNING);
> +			remove_wait_queue(_wait_queue, );
> +			rc = -ERESTARTSYS;
>  			break;
>  		}
>  		schedule();
> --
> 2.6.3

--
 Kirill A. Shutemov