Re: [PATCH 1/1] KVM: PPC: Increase memslots to 320

2015-12-09 Thread Thomas Huth
On 09/12/15 04:28, Paul Mackerras wrote:
> On Wed, Nov 04, 2015 at 10:03:48AM +0100, Thomas Huth wrote:
>> Only using 32 memslots for KVM on powerpc is way too low, you can
>> nowadays hit this limit quite fast by adding a couple of PCI devices
>> and/or pluggable memory DIMMs to the guest.
>> x86 already increased the limit to 512 in total, to satisfy 256
>> pluggable DIMM slots, 3 private slots and 253 slots for other things
>> like PCI devices. On powerpc, we only have 32 pluggable DIMMs in
> 
> I agree with increasing the limit.  Is there a reason we have only 32
> pluggable DIMMs in QEMU on powerpc, not more?  Should we be increasing
> that limit too?  If so, maybe we should increase the number of memory
> slots to 512?

Hmmm ... the comment before the #define in QEMU reads:

/*
 * This defines the maximum number of DIMM slots we can have for sPAPR
 * guest. This is not defined by sPAPR but we are defining it to 32 slots
 * based on default number of slots provided by PowerPC kernel.
 */
#define SPAPR_MAX_RAM_SLOTS 32

So as far as I can see, there's indeed a possibility that we'll
increase this value once the kernel can handle more slots!

So I guess you're right and we should play safe and use more slots
right from the start. I'll send a new patch with 512 instead.

>> QEMU, not 256, so we likely do not as much slots as on x86. Thus
> 
> "so we likely do not need as many slots as on x86" would be better
> English.

Oops, I'm sure that was my original intention ... anyway, thanks for
pointing it out, it's always good to get some feedback as a non-native
English speaker.

 Thomas

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 12/21] KVM: ARM64: Add reset and access handlers for PMCNTENSET and PMCNTENCLR register

2015-12-09 Thread Shannon Zhao


On 2015/12/9 0:42, Marc Zyngier wrote:
>> +void kvm_pmu_enable_counter(struct kvm_vcpu *vcpu, u32 val, bool all_enable)
>> > +{
>> > +  int i;
>> > +  struct kvm_pmu *pmu = &vcpu->arch.pmu;
>> > +  struct kvm_pmc *pmc;
>> > +
>> > +  if (!all_enable)
>> > +  return;
> You have the vcpu. Can you move the check for PMCR_EL0.E here instead of
> having it in both of the callers?
> 
But it still needs to check PMCR_EL0.E in kvm_pmu_handle_pmcr(). When
PMCR_EL0.E == 1, it calls kvm_pmu_enable_counter(), otherwise it calls
kvm_pmu_disable_counter(). So as it checks already, just pass the result
as a parameter.

>> > +
>> > +  for_each_set_bit(i, (const unsigned long *)&val, ARMV8_MAX_COUNTERS) {
> Nonononono... If you must have to use a long, use a long. Don't cast it
> to a different type (hint: big endian).
> 
>> > +  pmc = &pmu->pmc[i];
>> > +  if (pmc->perf_event) {
>> > +  perf_event_enable(pmc->perf_event);
>> > +  if (pmc->perf_event->state != PERF_EVENT_STATE_ACTIVE)
>> > +  kvm_debug("fail to enable perf event\n");
>> > +  }
>> > +  }
>> > +}
>> > +
>> > +/**
>> > + * kvm_pmu_disable_counter - disable selected PMU counter
>> > + * @vcpu: The vcpu pointer
>> > + * @val: the value guest writes to PMCNTENCLR register
>> > + *
>> > + * Call perf_event_disable to stop counting the perf event
>> > + */
>> > +void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u32 val)
>> > +{
>> > +  int i;
>> > +  struct kvm_pmu *pmu = &vcpu->arch.pmu;
>> > +  struct kvm_pmc *pmc;
>> > +
> Why are enable and disable asymmetric (handling of PMCR.E)?
> 
To enable a counter, it needs both the PMCR_EL0.E and the corresponding
bit of PMCNTENSET_EL0 set to 1. But to disable a counter, it only needs
one of them and when PMCR_EL0.E == 0, it disables all the counters.

Thanks,
-- 
Shannon



Re: [PATCH v6 14/21] KVM: ARM64: Add reset and access handlers for PMOVSSET and PMOVSCLR register

2015-12-09 Thread Shannon Zhao


On 2015/12/9 0:59, Marc Zyngier wrote:
>> > +  }
>> > +
>> > +  /* If all overflow bits are cleared, kick the vcpu to clear interrupt
>> > +   * pending status.
>> > +   */
>> > +  if (val == 0)
>> > +  kvm_vcpu_kick(vcpu);
> Do we really need to do so? This will be dropped on the next entry
> anyway, so i don't see the need to kick the vcpu again. Or am I missing
> something?
> 
I thought that for some reason the interrupt might not be set low
immediately when all the overflow bits are cleared, so I added a kick to
force it to sync the interrupt state. But as you said, I'll remove this.

Thanks,
-- 
Shannon



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Michael S. Tsirkin
On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:
> Hi Michael & Alexander:
> Thanks a lot for your comments and suggestions.

It's nice that it's appreciated, but you then go on and ignore
all that I have written here:
https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html

> We still need to support Windows guest for migration and this is why our
> patches keep all changes in the driver since it's impossible to change
> Windows kernel.

This is not a reasonable argument.  It makes no sense to duplicate code
on Linux because you must duplicate code on Windows.  Let's assume you
must do it in the driver on windows because windows has closed source
drivers.  What does it matter? Linux can still do it as part of DMA API
and have it apply to all drivers.

> Following is my idea to do DMA tracking.
> 
> Inject an event into the VF driver after the memory-iterate stage
> and before stopping the VCPU; the VF driver then marks all memory
> currently used for DMA as dirty. Newly allocated pages also need
> to be marked dirty before the VCPU is stopped. All memory dirtied
> in this time slot will be migrated during the stop-and-copy stage.
> We also need to make sure to disable the VF, by clearing its bus
> master enable bit, before migrating this memory.
> 
> The DMA pages allocated by the VF driver also need to reserve
> space for a dummy write.

I suggested ways to do it all in the hypervisor without driver hacks, or
hide it within DMA API without need to reserve extra space. Both
approaches seem much cleaner.

-- 
MST


Re: [PATCH v6 07/21] KVM: ARM64: PMU: Add perf event map and introduce perf event creating function

2015-12-09 Thread Marc Zyngier
On Wed, 9 Dec 2015 15:38:09 +0800
Shannon Zhao  wrote:

> 
> 
> On 2015/12/8 23:43, Marc Zyngier wrote:
> > On 08/12/15 12:47, Shannon Zhao wrote:
> >> From: Shannon Zhao 
> >> +/**
> >> + * kvm_pmu_get_counter_value - get PMU counter value
> >> + * @vcpu: The vcpu pointer
> >> + * @select_idx: The counter index
> >> + */
> >> +u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 select_idx)
> >> +{
> >> +  u64 counter, enabled, running;
> >> +  struct kvm_pmu *pmu = &vcpu->arch.pmu;
> >> +  struct kvm_pmc *pmc = &pmu->pmc[select_idx];
> >> +
> >> +  if (!vcpu_mode_is_32bit(vcpu))
> >> +  counter = vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + select_idx);
> >> +  else
> >> +  counter = vcpu_cp15(vcpu, c14_PMEVCNTR0 + select_idx);
> >> +
> >> +  if (pmc->perf_event)
> >> +  counter += perf_event_read_value(pmc->perf_event, &enabled,
> >> +   &running);
> >> +
> >> +  return counter & pmc->bitmask;
> > 
> > This one confused me for a while. Is it the case that you return
> > whatever is in the vcpu view of the counter, plus anything that perf
> > itself has counted? If so, I'd appreciate a comment here...
> > 
> Yes, the real counter value is the current counter value plus the value
> perf event counts. I'll add a comment.
> 
> >> +}
> >> +
> >> +static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu *vcpu, u32 select_idx)
> >> +{
> >> +  if (!vcpu_mode_is_32bit(vcpu))
> >> +  return (vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMCR_E) &
> >> + (vcpu_sys_reg(vcpu, PMCNTENSET_EL0) >> select_idx);
> > 
> > This looks wrong. Shouldn't it be:
> > 
> > return ((vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMCR_E) &&
> > (vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & (1 << select_idx)));
> > 
> >> +  else
> >> +  return (vcpu_sys_reg(vcpu, c9_PMCR) & ARMV8_PMCR_E) &
> >> + (vcpu_sys_reg(vcpu, c9_PMCNTENSET) >> select_idx);
> >> +}
> > 
> > Also, I don't really see why we need to check the 32bit version, which
> > has the exact same content.
> > 
> >> +
> >> +static inline struct kvm_vcpu *kvm_pmc_to_vcpu(struct kvm_pmc *pmc)
> >> +{
> >> +  struct kvm_pmu *pmu;
> >> +  struct kvm_vcpu_arch *vcpu_arch;
> >> +
> >> +  pmc -= pmc->idx;
> >> +  pmu = container_of(pmc, struct kvm_pmu, pmc[0]);
> >> +  vcpu_arch = container_of(pmu, struct kvm_vcpu_arch, pmu);
> >> +  return container_of(vcpu_arch, struct kvm_vcpu, arch);
> >> +}
> >> +
> >> +/**
> >> + * kvm_pmu_stop_counter - stop PMU counter
> >> + * @pmc: The PMU counter pointer
> >> + *
> >> + * If this counter has been configured to monitor some event, release it here.
> >> + */
> >> +static void kvm_pmu_stop_counter(struct kvm_pmc *pmc)
> >> +{
> >> +  struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
> >> +  u64 counter;
> >> +
> >> +  if (pmc->perf_event) {
> >> +  counter = kvm_pmu_get_counter_value(vcpu, pmc->idx);
> >> +  if (!vcpu_mode_is_32bit(vcpu))
> >> +  vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + pmc->idx) = counter;
> >> +  else
> >> +  vcpu_cp15(vcpu, c14_PMEVCNTR0 + pmc->idx) = counter;
> > 
> > Same thing - we don't need to make a difference between 32 and 64bit.
> > 
> So it's fine to drop all the vcpu_mode_is_32bit(vcpu) check of this
> series? The only one we should take care is the PMCCNTR, right?

Yes, mostly. As long as you only reason on the 64bit register set,
you're pretty safe, and that in turn solves all kind of ugly endianness
issues.

> >> +
> >> +  perf_event_release_kernel(pmc->perf_event);
> >> +  pmc->perf_event = NULL;
> >> +  }
> >> +}
> >> +
> >> +/**
> >> + * kvm_pmu_set_counter_event_type - set selected counter to monitor some event
> >> + * @vcpu: The vcpu pointer
> >> + * @data: The data guest writes to PMXEVTYPER_EL0
> >> + * @select_idx: The number of selected counter
> >> + *
> >> + * When OS accesses PMXEVTYPER_EL0, that means it wants to set a PMC to count an
> >> + * event with given hardware event number. Here we call perf_event API to
> >> + * emulate this action and create a kernel perf event for it.
> >> + */
> >> +void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u32 data,
> >> +  u32 select_idx)
> >> +{
> >> +  struct kvm_pmu *pmu = &vcpu->arch.pmu;
> >> +  struct kvm_pmc *pmc = &pmu->pmc[select_idx];
> >> +  struct perf_event *event;
> >> +  struct perf_event_attr attr;
> >> +  u32 eventsel;
> >> +  u64 counter;
> >> +
> >> +  kvm_pmu_stop_counter(pmc);
> > 
> > Wait. I didn't realize this before, but you have the vcpu right here.
> > Why don't you pass it as a parameter to kvm_pmu_stop_counter and avoid
> > the kvm_pmc_to_vcpu thing altogether?
> > 
> Yeah, we could pass vcpu as a parameter for this function. But the
> kvm_pmc_to_vcpu helper is also used in kvm_pmu_perf_overflow() and
> within kvm_pmu_perf_overflow it needs the pmc->idx, we couldn't pass
> vcpu as a parameter, so this helper 

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu



On 12/8/2015 1:12 AM, Alexander Duyck wrote:

On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu  wrote:

On 12/5/2015 1:07 AM, Alexander Duyck wrote:



We still need to support Windows guest for migration and this is why our
patches keep all changes in the driver since it's impossible to change
Windows kernel.



That is a poor argument.  I highly doubt Microsoft is interested in
having to modify all of the drivers that will support direct assignment
in order to support migration.  They would likely request something
similar to what I have in that they will want a way to do DMA tracking
with minimal modification required to the drivers.



This totally depends on the NIC or other device vendors, and they
should decide whether to support migration or not. If yes, they would
modify the driver.


Having to modify every driver that wants to support live migration is
a bit much.  In addition I don't see this being limited only to NIC
devices.  You can direct assign a number of different devices, your
solution cannot be specific to NICs.


We are also adding such migration support for the QAT device, so our
solution will not be limited to NICs. This is just the beginning.

We can't limit users to Linux guests only. So the migration feature
should work for both Windows and Linux guests.




If the only goal is to call suspend/resume during migration, the feature
will be meaningless. Most use cases don't want migration to affect the
user much, so the service downtime is vital. Our target is to apply
SRIOV NIC passthrough to cloud service and NFV (network functions
virtualization) projects, which are sensitive to network performance
and stability. In my opinion, we should give the device driver a chance
to implement its own migration job, and call the suspend and resume
callbacks in the driver if it doesn't care about performance during migration.


The suspend/resume callback should be efficient in terms of time.
After all we don't want the system to stall for a long period of time
when it should be either running or asleep.  Having it burn cycles in
a power state limbo doesn't do anyone any good.  If nothing else maybe
it will help to push the vendors to speed up those functions which
then benefit migration and the system sleep states.


If we can benefit both migration and suspend, that would be wonderful.
But migration and system PM are still different. Just for example, the
driver doesn't need to put the device into a deep D-state during
migration (the host can do this after migration), while that is
essential for system sleep. The PCI configuration space and interrupt
config are emulated by QEMU, and QEMU can migrate them to the new
machine, so the driver doesn't need to deal with such things. So I think
migration still needs a different callback or a different code path than
device suspend/resume.

Another concern is that we would have to rework the PM core or the PCI
bus driver to call suspend/resume for passthrough devices during
migration. This also blocks the feature work on Windows.



Also you keep assuming you can keep the device running while you do
the migration and you can't.  You are going to corrupt the memory if
you do, and you have yet to provide any means to explain how you are
going to solve that.



The main problem is the DMA tracking issue. I will repost my solution in
a new thread for discussion. If there is no way to mark DMA pages dirty
while DMA is enabled, we have to stop DMA for a short time to do that at
the last stage.

Following is my idea to do DMA tracking.

Inject an event into the VF driver after the memory-iterate stage
and before stopping the VCPU; the VF driver then marks all memory
currently used for DMA as dirty. Newly allocated pages also need
to be marked dirty before the VCPU is stopped. All memory dirtied
in this time slot will be migrated during the stop-and-copy stage.
We also need to make sure to disable the VF, by clearing its bus
master enable bit, before migrating this memory.



The ordering of your explanation here doesn't quite work.  What needs to
happen is that you have to disable DMA and then mark the pages as dirty.
What the disabling of the BME does is signal to the hypervisor that
the device is now stopped.  The ixgbevf_suspend call already supported
by the driver is almost exactly what is needed to take care of something
like this.



This is why I hope to reserve a piece of space in the DMA page to do a
dummy write. This can help mark the page dirty while not requiring DMA
to stop and not racing with the DMA data.


You can't and it will still race.  What concerns me is that your
patches and the document you referenced earlier show a considerable
lack of understanding about how DMA and device drivers work.  There is
a reason why device drivers have so many memory barriers and the like
in them.  The fact is when you have CPU and a device both accessing
memory things have to be done in a very specific order and you cannot
violate that.

If you have a contiguous block of memory you expect the device to
write into you cannot just poke a hole in 

RE: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-12-09 Thread Wu, Feng
Hi Radim,

> -Original Message-
> From: Radim Krčmář [mailto:rkrc...@redhat.com]
> Sent: Tuesday, November 17, 2015 3:03 AM
> To: Wu, Feng 
> Cc: pbonz...@redhat.com; kvm@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-
> interrupts
> 
> 2015-11-09 10:46+0800, Feng Wu:
> > Use vector-hashing to handle lowest-priority interrupts for
> > posted-interrupts. As an example, modern Intel CPUs use this
> > method to handle lowest-priority interrupts.
> 
> (I don't think it's a good idea that the algorithm differs from non-PI
>  lowest priority delivery.  I'd make them both vector-hashing, which
>  would be "fun" to explain to people expecting round robin ...)
> 
> > Signed-off-by: Feng Wu 
> > ---
> > diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> > +/*
> > + * This routine handles lowest-priority interrupts using vector-hashing
> > + * mechanism. As an example, modern Intel CPUs use this method to handle
> > + * lowest-priority interrupts.
> > + *
> > + * Here are the details of the vector-hashing mechanism:
> > + * 1. For lowest-priority interrupts, store all the possible destination
> > + *vCPUs in an array.
> > + * 2. Use "guest vector % max number of destination vCPUs" to find the right
> > + *destination vCPU in the array for the lowest-priority interrupt.
> > + */
> 
> (Is Skylake i7-6700 a modern Intel CPU?
>  I didn't manage to get hashing ... all interrupts always went to the
>  lowest APIC ID in the set :/
>  Is there a simple way to verify the algorithm?)
> 
> > +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
> > + struct kvm_lapic_irq *irq)
> > +
> > +{
> > +   unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
> > +   unsigned int dest_vcpus = 0;
> > +   struct kvm_vcpu *vcpu;
> > +   unsigned int i, mod, idx = 0;
> > +
> > +   vcpu = kvm_intr_vector_hashing_dest_fast(kvm, irq);
> > +   if (vcpu)
> > +   return vcpu;
> 
> I think the rest of this function shouldn't be implemented:
>  - Shorthands are only for IPIs and hence don't need to be handled,
>  - Lowest priority physical broadcast is not supported,
>  - Lowest priority cluster logical broadcast is not supported,
>  - No point in optimizing mixed xAPIC and x2APIC mode,

I read your comments again, and don't quite understand why we
don't need PI optimization for mixed xAPIC and x2APIC mode.

BTW, can we have mixed flat and cluster mode?

Thanks,
Feng

>  - The rest is handled by kvm_intr_vector_hashing_dest_fast().
>(Even lowest priority flat logical "broadcast".)
>  - We do the work twice when vcpu == NULL means that there is no
>matching destination.
> 
> Is there a valid case that can be resolved by going through all vcpus?
> 
> > +
> > +   memset(dest_vcpu_bitmap, 0, sizeof(dest_vcpu_bitmap));
> > +
> > +   kvm_for_each_vcpu(i, vcpu, kvm) {
> > +   if (!kvm_apic_present(vcpu))
> > +   continue;
> > +
> > +   if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
> > +   irq->dest_id, irq->dest_mode))
> > +   continue;
> > +
> > +   __set_bit(vcpu->vcpu_id, dest_vcpu_bitmap);
> > +   dest_vcpus++;
> > +   }
> > +
> > +   if (dest_vcpus == 0)
> > +   return NULL;
> > +
> > +   mod = irq->vector % dest_vcpus;
> > +
> > +   for (i = 0; i <= mod; i++) {
> > +   idx = find_next_bit(dest_vcpu_bitmap, KVM_MAX_VCPUS, idx) + 1;
> > +   BUG_ON(idx >= KVM_MAX_VCPUS);
> > +   }
> > +
> > +   return kvm_get_vcpu(kvm, idx - 1);
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_intr_vector_hashing_dest);
> > +
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > @@ -816,6 +816,63 @@ out:
> > +struct kvm_vcpu *kvm_intr_vector_hashing_dest_fast(struct kvm *kvm,
> > +  struct kvm_lapic_irq *irq)
> 
> We now have three very similar functions :(
> 
>   kvm_irq_delivery_to_apic_fast
>   kvm_intr_is_single_vcpu_fast
>   kvm_intr_vector_hashing_dest_fast
> 
> By utilizing the gcc optimizer, they can be merged without introducing
> many instructions to the hot path, kvm_irq_delivery_to_apic_fast.
> (I would eventually do it, so you can save time by ignoring this.)
> 
> Thanks.


Re: [PATCH v6 12/21] KVM: ARM64: Add reset and access handlers for PMCNTENSET and PMCNTENCLR register

2015-12-09 Thread Marc Zyngier
On Wed, 9 Dec 2015 16:35:58 +0800
Shannon Zhao  wrote:

> 
> 
> On 2015/12/9 0:42, Marc Zyngier wrote:
> >> +void kvm_pmu_enable_counter(struct kvm_vcpu *vcpu, u32 val, bool all_enable)
> >> > +{
> >> > +int i;
> >> > +struct kvm_pmu *pmu = &vcpu->arch.pmu;
> >> > +struct kvm_pmc *pmc;
> >> > +
> >> > +if (!all_enable)
> >> > +return;
> > You have the vcpu. Can you move the check for PMCR_EL0.E here instead of
> > having it in both of the callers?
> > 
> But it still needs to check PMCR_EL0.E in kvm_pmu_handle_pmcr(). When
> PMCR_EL0.E == 1, it calls kvm_pmu_enable_counter(), otherwise it calls
> kvm_pmu_disable_counter(). So as it checks already, just pass the result
> as a parameter.

I've seen that, but it makes the code look ugly. At any rate, you might
as well not call enable_counter if PMCR.E==0. But splitting the lookup
of the bit and the test like you do is not nice at all. Making it
self-contained looks a lot better, and you don't have to think about
the caller.

> >> > +
> >> > +for_each_set_bit(i, (const unsigned long *)&val, ARMV8_MAX_COUNTERS) {
> > Nonononono... If you must have to use a long, use a long. Don't cast it
> > to a different type (hint: big endian).
> > 
> >> > +pmc = &pmu->pmc[i];
> >> > +if (pmc->perf_event) {
> >> > +perf_event_enable(pmc->perf_event);
> >> > +if (pmc->perf_event->state != PERF_EVENT_STATE_ACTIVE)
> >> > +kvm_debug("fail to enable perf event\n");
> >> > +}
> >> > +}
> >> > +}
> >> > +
> >> > +/**
> >> > + * kvm_pmu_disable_counter - disable selected PMU counter
> >> > + * @vcpu: The vcpu pointer
> >> > + * @val: the value guest writes to PMCNTENCLR register
> >> > + *
> >> > + * Call perf_event_disable to stop counting the perf event
> >> > + */
> >> > +void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u32 val)
> >> > +{
> >> > +int i;
> >> > +struct kvm_pmu *pmu = &vcpu->arch.pmu;
> >> > +struct kvm_pmc *pmc;
> >> > +
> > Why are enable and disable asymmetric (handling of PMCR.E)?
> > 
> To enable a counter, it needs both the PMCR_EL0.E and the corresponding
> bit of PMCNTENSET_EL0 set to 1. But to disable a counter, it only needs
> one of them and when PMCR_EL0.E == 0, it disables all the counters.

OK.

M.
-- 
Jazz is not dead. It just smells funny.


Re: [PATCH v6 15/21] KVM: ARM64: Add reset and access handlers for PMUSERENR register

2015-12-09 Thread Shannon Zhao


On 2015/12/9 1:03, Marc Zyngier wrote:
> On 08/12/15 12:47, Shannon Zhao wrote:
>> > From: Shannon Zhao 
>> > 
>> > The reset value of PMUSERENR_EL0 is UNKNOWN, use reset_unknown.
>> > 
>> > Signed-off-by: Shannon Zhao 
>> > ---
>> >  arch/arm64/kvm/sys_regs.c | 5 +++--
>> >  1 file changed, 3 insertions(+), 2 deletions(-)
>> > 
>> > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> > index c830fde..80b66c0 100644
>> > --- a/arch/arm64/kvm/sys_regs.c
>> > +++ b/arch/arm64/kvm/sys_regs.c
>> > @@ -880,7 +880,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>> >  access_pmu_pmxevcntr },
>> >/* PMUSERENR_EL0 */
>> >{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b000),
>> > -trap_raz_wi },
>> > +access_pmu_regs, reset_unknown, PMUSERENR_EL0 },
> So while the 64bit view of the register resets as unknown, a CPU
> resetting in 32bit mode resets as 0. I suggest you reset it as zero, and
> document that choice. You may have to revisit all the other registers
> that do reset as unknown for 64bit as well.
> 
Sure.

BTW, here I didn't handle the bits of PMUSERENR which are used to
permit/forbid accessing some PMU registers from EL0. Does it need to add
the handler? Is there any way to get the exceptional level of the
accessing in hypervisor?

Thanks,
-- 
Shannon



[PATCH] KVM: PPC: Increase memslots to 512

2015-12-09 Thread Thomas Huth
Only using 32 memslots for KVM on powerpc is way too low, you can
nowadays hit this limit quite fast by adding a couple of PCI devices
and/or pluggable memory DIMMs to the guest.

x86 already increased the KVM_USER_MEM_SLOTS to 509, to satisfy 256
pluggable DIMM slots, 3 private slots and 253 slots for other things
like PCI devices (i.e. resulting in 256 + 3 + 253 = 512 slots in
total). We should do something similar for powerpc, and since we do
not use private slots here, we can set the value to 512 directly.

While we're at it, also remove the KVM_MEM_SLOTS_NUM definition
from the powerpc-specific header since this gets defined in the
generic kvm_host.h header anyway.

Signed-off-by: Thomas Huth 
---
 arch/powerpc/include/asm/kvm_host.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 887c259..2c96a72 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -38,8 +38,7 @@
 
 #define KVM_MAX_VCPUS  NR_CPUS
 #define KVM_MAX_VCORES NR_CPUS
-#define KVM_USER_MEM_SLOTS 32
-#define KVM_MEM_SLOTS_NUM KVM_USER_MEM_SLOTS
+#define KVM_USER_MEM_SLOTS 512
 
 #ifdef CONFIG_KVM_MMIO
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
-- 
1.8.3.1



Re: [PATCH v6 15/21] KVM: ARM64: Add reset and access handlers for PMUSERENR register

2015-12-09 Thread Marc Zyngier
On Wed, 9 Dec 2015 17:18:02 +0800
Shannon Zhao  wrote:

> 
> 
> On 2015/12/9 1:03, Marc Zyngier wrote:
> > On 08/12/15 12:47, Shannon Zhao wrote:
> >> > From: Shannon Zhao 
> >> > 
> >> > The reset value of PMUSERENR_EL0 is UNKNOWN, use reset_unknown.
> >> > 
> >> > Signed-off-by: Shannon Zhao 
> >> > ---
> >> >  arch/arm64/kvm/sys_regs.c | 5 +++--
> >> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >> > 
> >> > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> >> > index c830fde..80b66c0 100644
> >> > --- a/arch/arm64/kvm/sys_regs.c
> >> > +++ b/arch/arm64/kvm/sys_regs.c
> >> > @@ -880,7 +880,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
> >> >access_pmu_pmxevcntr },
> >> >  /* PMUSERENR_EL0 */
> >> >  { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b000),
> >> > -  trap_raz_wi },
> >> > +  access_pmu_regs, reset_unknown, PMUSERENR_EL0 },
> > So while the 64bit view of the register resets as unknown, a CPU
> > resetting in 32bit mode resets as 0. I suggest you reset it as zero, and
> > document that choice. You may have to revisit all the other registers
> > that do reset as unknown for 64bit as well.
> > 
> Sure.
> 
> BTW, here I didn't handle the bits of PMUSERENR which are used to
> permit/forbid accessing some PMU registers from EL0. Does it need to add
> the handler? Is there any way to get the exceptional level of the
> accessing in hypervisor?

Ah, good point, I missed that. Yes, we need to be able to handle that.

To find out, you can use vcpu_mode_priv(), which returns true if the
CPU was in a high privilege mode (EL1 for 64bit, anything higher than
USR on 32bit), and false otherwise. So far, the only user is
arch/arm/kvm/perf.c.

Thanks,

M.
-- 
Without deviation from the norm, progress is not possible.


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu

On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote:

On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:

Hi Michael & Alexander:
Thanks a lot for your comments and suggestions.


It's nice that it's appreciated, but you then go on and ignore
all that I have written here:
https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html



No, I will reply to it separately and, following your suggestion, split it
into 3 threads.



We still need to support Windows guest for migration and this is why our
patches keep all changes in the driver since it's impossible to change
Windows kernel.


This is not a reasonable argument.  It makes no sense to duplicate code
on Linux because you must duplicate code on Windows.  Let's assume you
must do it in the driver on windows because windows has closed source
drivers.  What does it matter? Linux can still do it as part of DMA API
and have it apply to all drivers.



Sure. Duplicated code should be encapsulated and made reusable by
other drivers, just like the dummy write part you mentioned.

I meant the framework should not require changes to Windows kernel code
(such as the PM core or PCI bus driver), as that would block implementation
on Windows.

I think it's not a problem to duplicate code in the Windows drivers.


The following is my idea for DMA tracking.

Inject an event to the VF driver after the memory iteration stage
and before stopping the VCPU; the VF driver then marks dirty all
memory in use for DMA. Newly allocated pages also need to be
marked dirty before stopping the VCPU. All memory dirtied in this
time slot will be migrated during the stop-and-copy stage. We also
need to make sure to disable the VF, by clearing its bus master
enable bit, before migrating this memory.

DMA pages allocated by the VF driver also need to reserve space
for the dummy write.


I suggested ways to do it all in the hypervisor without driver hacks, or
hide it within DMA API without need to reserve extra space. Both
approaches seem much cleaner.



This sounds reasonable. We may discuss it in detail in a separate thread.



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu

On 12/9/2015 7:28 PM, Michael S. Tsirkin wrote:

I remember reading that it's possible to implement a bus driver
on windows if required.  But basically I don't see how windows can be
relevant to discussing guest driver patches. That discussion
probably belongs on the qemu mailing list, not on lkml.


I am not sure whether we can write a bus driver for Windows to support
migration. But I think device vendors who want to support migration will
improve their drivers if we provide such a framework in the hypervisor
that only requires them to change their drivers.


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Michael S. Tsirkin
On Wed, Dec 09, 2015 at 07:19:15PM +0800, Lan, Tianyu wrote:
> On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote:
> >On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:
> >>Hi Michael & Alexander:
> >>Thanks a lot for your comments and suggestions.
> >
> >It's nice that it's appreciated, but you then go on and ignore
> >all that I have written here:
> >https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html
> >
> 
> No, I will reply to it separately and, following your suggestion, split it
> into 3 threads.
> 
> >>We still need to support Windows guest for migration and this is why our
> >>patches keep all changes in the driver since it's impossible to change
> >>Windows kernel.
> >
> >This is not a reasonable argument.  It makes no sense to duplicate code
> >on Linux because you must duplicate code on Windows.  Let's assume you
> >must do it in the driver on windows because windows has closed source
> >drivers.  What does it matter? Linux can still do it as part of DMA API
> >and have it apply to all drivers.
> >
> 
> Sure. Duplicated code should be encapsulated and made reusable by
> other drivers, just like the dummy write part you mentioned.
> 
> I meant the framework should not require changes to Windows kernel code
> (such as the PM core or PCI bus driver), as that would block implementation
> on Windows.

I remember reading that it's possible to implement a bus driver
on windows if required.  But basically I don't see how windows can be
relevant to discussing guest driver patches. That discussion
probably belongs on the qemu mailing list, not on lkml.

> I think it's not a problem to duplicate code in the Windows drivers.
> 
> >>The following is my idea for DMA tracking.
> >>
> >>Inject an event to the VF driver after the memory iteration stage
> >>and before stopping the VCPU; the VF driver then marks dirty all
> >>memory in use for DMA. Newly allocated pages also need to be
> >>marked dirty before stopping the VCPU. All memory dirtied in this
> >>time slot will be migrated during the stop-and-copy stage. We also
> >>need to make sure to disable the VF, by clearing its bus master
> >>enable bit, before migrating this memory.
> >>
> >>DMA pages allocated by the VF driver also need to reserve space
> >>for the dummy write.
> >
> >I suggested ways to do it all in the hypervisor without driver hacks, or
> >hide it within DMA API without need to reserve extra space. Both
> >approaches seem much cleaner.
> >
> 
> This sounds reasonable. We may discuss it in detail in a separate thread.


Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-12-09 Thread Radim Krčmář
2015-12-09 08:19+, Wu, Feng:
>> -Original Message-
>> From: Radim Krčmář [mailto:rkrc...@redhat.com]
>> Sent: Tuesday, November 17, 2015 3:03 AM
>> To: Wu, Feng 
>> Cc: pbonz...@redhat.com; kvm@vger.kernel.org; linux-ker...@vger.kernel.org
>> Subject: Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-
>> interrupts
>> 
>> 2015-11-09 10:46+0800, Feng Wu:
>> > +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
>> > +struct kvm_lapic_irq *irq)
>> > +
>> > +{
>> > +  unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
>> > +  unsigned int dest_vcpus = 0;
>> > +  struct kvm_vcpu *vcpu;
>> > +  unsigned int i, mod, idx = 0;
>> > +
>> > +  vcpu = kvm_intr_vector_hashing_dest_fast(kvm, irq);
>> > +  if (vcpu)
>> > +  return vcpu;
>> 
>> I think the rest of this function shouldn't be implemented:
>>  - Shorthands are only for IPIs and hence don't need to be handled,
>>  - Lowest priority physical broadcast is not supported,
>>  - Lowest priority cluster logical broadcast is not supported,
>>  - No point in optimizing mixed xAPIC and x2APIC mode,
> 
> I read your comments again, and don't quite understand why we
> don't need PI optimization for mixed xAPIC and x2APIC mode.

There shouldn't be a non-hobbyist operating system that uses mixed mode,
so the optimization would practically be dead code as all other cases
are handled by kvm_intr_vector_hashing_dest_fast().

I think that having extra code would bring problems in the future -- we
need to take care of it when refactoring KVM's APIC and we should also
write a unit-test for this otherwise dead path.  I don't think that the
benefit for guests would ever balance those efforts.

(Physical xAPIC+x2APIC mode is still somewhat reasonable and xAPIC CPUs
 start with LDR=0, which means that operating system doesn't need to
 utilize mixed mode, as defined by KVM, when switching to x2APIC.)

> BTW, can we have mixed flat and cluster mode?

Yes, KVM recognizes that mixed mode, but luckily, there are severe
limitations.

Notes below SDM section 10.6.2.2:
  All processors that have their APIC software enabled (using the
  spurious vector enable/disable bit) must have their DFRs (Destination
  Format Registers) programmed identically.

I hope there isn't a human that would use it in good faith.

(Only NMI/SMI/INIT/SIPI are delivered in software disabled mode and if
 the system uses cluster xAPIC, OS should set DFR before LDR, which
 doesn't trigger mixed mode either.)


Re: [PATCH kvm-unit-tests] x86: always inline functions called after set_exception_return

2015-12-09 Thread Paolo Bonzini


On 07/12/2015 21:36, David Matlack wrote:
> set_exception_return forces exceptions handlers to return to a specific
> address instead of returning to the instruction address pushed by the
> CPU at the time of the exception. The unit tests apic.c and vmx.c use
> this functionality to recover from expected exceptions.
> 
> When using set_exception_return we have to be careful not to modify the
> stack (such as by doing a function call) as triggering the exception will
> likely jump us past the instructions which undo the stack manipulation
> (such as a ret). To accomplish this, declare all functions called after
> set_exception_return as __always_inline, so that the compiler always
> inlines them.

set_exception_return is generally not a great idea IMHO---thanks for
looking at it.

A couple years ago we discussed adding setjmp/longjmp to libcflat
(http://www.spinics.net/lists/kvm/msg94159.html which is however missing
a 32-bit version).  Making the exceptions do a longjmp would be a much
safer option.

Paolo


[PATCH v3 1/4] VSOCK: Introduce virtio-vsock-common.ko

2015-12-09 Thread Stefan Hajnoczi
From: Asias He 

This module contains the common code and header files for the following
virtio-vsock and virtio-vhost kernel modules.

Signed-off-by: Asias He 
Signed-off-by: Stefan Hajnoczi 
---
v3:
 * Remove unnecessary 3-way handshake, just do REQUEST/RESPONSE instead
   of REQUEST/RESPONSE/ACK
 * Remove SOCK_DGRAM support and focus on SOCK_STREAM first
 * Only allow host->guest connections (same security model as latest
   VMware)
v2:
 * Fix peer_buf_alloc inheritance on child socket
 * Notify other side of SOCK_STREAM disconnect (fixes shutdown
   semantics)
 * Avoid recursive mutex_lock(tx_lock) for write_space (fixes deadlock)
 * Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants
 * Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants
---
 include/linux/virtio_vsock.h| 203 
 include/uapi/linux/virtio_ids.h |   1 +
 include/uapi/linux/virtio_vsock.h   |  87 
 net/vmw_vsock/virtio_transport_common.c | 854 
 4 files changed, 1145 insertions(+)
 create mode 100644 include/linux/virtio_vsock.h
 create mode 100644 include/uapi/linux/virtio_vsock.h
 create mode 100644 net/vmw_vsock/virtio_transport_common.c

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
new file mode 100644
index 000..e54eb45
--- /dev/null
+++ b/include/linux/virtio_vsock.h
@@ -0,0 +1,203 @@
+/*
+ * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
+ * anyone can use the definitions to implement compatible drivers/servers:
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ *may be used to endorse or promote products derived from this software
+ *without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * Copyright (C) Red Hat, Inc., 2013-2015
+ * Copyright (C) Asias He , 2013
+ * Copyright (C) Stefan Hajnoczi , 2015
+ */
+
+#ifndef _LINUX_VIRTIO_VSOCK_H
+#define _LINUX_VIRTIO_VSOCK_H
+
+#include 
+#include 
+#include 
+
+#define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE  128
+#define VIRTIO_VSOCK_DEFAULT_BUF_SIZE  (1024 * 256)
+#define VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE  (1024 * 256)
+#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE   (1024 * 4)
+#define VIRTIO_VSOCK_MAX_BUF_SIZE  0xUL
+#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE  (1024 * 64)
+#define VIRTIO_VSOCK_MAX_TX_BUF_SIZE   (1024 * 1024 * 16)
+#define VIRTIO_VSOCK_MAX_DGRAM_SIZE(1024 * 64)
+
+struct vsock_transport_recv_notify_data;
+struct vsock_transport_send_notify_data;
+struct sockaddr_vm;
+struct vsock_sock;
+
+enum {
+   VSOCK_VQ_CTRL   = 0,
+   VSOCK_VQ_RX = 1, /* for host to guest data */
+   VSOCK_VQ_TX = 2, /* for guest to host data */
+   VSOCK_VQ_MAX= 3,
+};
+
+/* virtio transport socket state */
+struct virtio_transport {
+   struct virtio_transport_pkt_ops *ops;
+   struct vsock_sock *vsk;
+
+   u32 buf_size;
+   u32 buf_size_min;
+   u32 buf_size_max;
+
+   struct mutex tx_lock;
+   struct mutex rx_lock;
+
+   struct list_head rx_queue;
+   u32 rx_bytes;
+
+   /* Protected by trans->tx_lock */
+   u32 tx_cnt;
+   u32 buf_alloc;
+   u32 peer_fwd_cnt;
+   u32 peer_buf_alloc;
+   /* Protected by trans->rx_lock */
+   u32 fwd_cnt;
+};
+
+struct virtio_vsock_pkt {
+   struct virtio_vsock_hdr hdr;
+   struct virtio_transport *trans;
+   struct work_struct work;
+   struct list_head list;
+   void *buf;
+   u32 len;
+   u32 off;
+};
+
+struct virtio_vsock_pkt_info {

[PATCH v3 0/4] Add virtio transport for AF_VSOCK

2015-12-09 Thread Stefan Hajnoczi
Note: the virtio-vsock device specification is currently under review but not
yet finalized.  Please review this code but don't merge until I send an update
when the spec is finalized.  Thanks!

v3:
 * Remove unnecessary 3-way handshake, just do REQUEST/RESPONSE instead
   of REQUEST/RESPONSE/ACK
 * Remove SOCK_DGRAM support and focus on SOCK_STREAM first
   (also drop v2 Patch 1, it's only needed for SOCK_DGRAM)
 * Only allow host->guest connections (same security model as latest
   VMware)
 * Don't put vhost vsock driver into staging
 * Add missing Kconfig dependencies (Arnd Bergmann )
 * Remove unneeded variable used to store return value
   (Fengguang Wu  and Julia Lawall
   )

v2:
 * Rebased onto Linux v4.4-rc2
 * vhost: Refuse to assign reserved CIDs
 * vhost: Refuse guest CID if already in use
 * vhost: Only accept correctly addressed packets (no spoofing!)
 * vhost: Support flexible rx/tx descriptor layout
 * vhost: Add missing total_tx_buf decrement
 * virtio_transport: Fix total_tx_buf accounting
 * virtio_transport: Add virtio_transport global mutex to prevent races
 * common: Notify other side of SOCK_STREAM disconnect (fixes shutdown
   semantics)
 * common: Avoid recursive mutex_lock(tx_lock) for write_space (fixes deadlock)
 * common: Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants
 * common: Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants
 * common: Fix peer_buf_alloc inheritance on child socket

This patch series adds a virtio transport for AF_VSOCK (net/vmw_vsock/).
AF_VSOCK is designed for communication between virtual machines and
hypervisors.  It is currently only implemented for VMware's VMCI transport.

This series implements the proposed virtio-vsock device specification from
here:
http://permalink.gmane.org/gmane.comp.emulators.virtio.devel/980

Most of the work was done by Asias He and Gerd Hoffmann a while back.  I have
picked up the series again.

The QEMU userspace changes are here:
https://github.com/stefanha/qemu/commits/vsock

Why virtio-vsock?
-
Guest<->host communication is currently done over the virtio-serial device.
This makes it hard to port sockets API-based applications and is limited to
static ports.

virtio-vsock uses the sockets API so that applications can rely on familiar
SOCK_STREAM semantics.  Applications on the host can easily connect to guest
agents because the sockets API allows multiple connections to a listen socket
(unlike virtio-serial).  This simplifies the guest<->host communication and
eliminates the need for extra processes on the host to arbitrate virtio-serial
ports.

Overview

This series adds 3 pieces:

1. virtio_transport_common.ko - core virtio vsock code that uses vsock.ko

2. virtio_transport.ko - guest driver

3. drivers/vhost/vsock.ko - host driver

Howto
-
The following kernel options are needed:
  CONFIG_VSOCKETS=y
  CONFIG_VIRTIO_VSOCKETS=y
  CONFIG_VIRTIO_VSOCKETS_COMMON=y
  CONFIG_VHOST_VSOCK=m

Launch QEMU as follows:
  # qemu ... -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=3

Guest and host can communicate via AF_VSOCK sockets.  The host's CID (address)
is 2 and the guest must be assigned a CID (3 in the example above).

Status
--
This patch series implements the latest draft specification.  Please review.

Asias He (4):
  VSOCK: Introduce virtio-vsock-common.ko
  VSOCK: Introduce virtio-vsock.ko
  VSOCK: Introduce vhost-vsock.ko
  VSOCK: Add Makefile and Kconfig

 drivers/vhost/Kconfig   |  10 +
 drivers/vhost/Makefile  |   4 +
 drivers/vhost/vsock.c   | 628 +++
 drivers/vhost/vsock.h   |   4 +
 include/linux/virtio_vsock.h| 203 
 include/uapi/linux/virtio_ids.h |   1 +
 include/uapi/linux/virtio_vsock.h   |  87 
 net/vmw_vsock/Kconfig   |  18 +
 net/vmw_vsock/Makefile  |   2 +
 net/vmw_vsock/virtio_transport.c| 466 +
 net/vmw_vsock/virtio_transport_common.c | 854 
 11 files changed, 2277 insertions(+)
 create mode 100644 drivers/vhost/vsock.c
 create mode 100644 drivers/vhost/vsock.h
 create mode 100644 include/linux/virtio_vsock.h
 create mode 100644 include/uapi/linux/virtio_vsock.h
 create mode 100644 net/vmw_vsock/virtio_transport.c
 create mode 100644 net/vmw_vsock/virtio_transport_common.c

-- 
2.5.0



[PATCH v3 4/4] VSOCK: Add Makefile and Kconfig

2015-12-09 Thread Stefan Hajnoczi
From: Asias He 

Enable virtio-vsock and vhost-vsock.

Signed-off-by: Asias He 
Signed-off-by: Stefan Hajnoczi 
---
v3:
 * Don't put vhost vsock driver into staging
 * Add missing Kconfig dependencies (Arnd Bergmann )
---
 drivers/vhost/Kconfig  | 10 ++
 drivers/vhost/Makefile |  4 
 net/vmw_vsock/Kconfig  | 18 ++
 net/vmw_vsock/Makefile |  2 ++
 4 files changed, 34 insertions(+)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 533eaf0..a1bb4c2 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -21,6 +21,16 @@ config VHOST_SCSI
Say M here to enable the vhost_scsi TCM fabric module
for use with virtio-scsi guests
 
+config VHOST_VSOCK
+   tristate "vhost virtio-vsock driver"
+   depends on VSOCKETS && EVENTFD
+   select VIRTIO_VSOCKETS_COMMON
+   select VHOST
+   select VHOST_RING
+   default n
+   ---help---
+   Say M here to enable the vhost-vsock for virtio-vsock guests
+
 config VHOST_RING
tristate
---help---
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index e0441c3..6b012b9 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -4,5 +4,9 @@ vhost_net-y := net.o
 obj-$(CONFIG_VHOST_SCSI) += vhost_scsi.o
 vhost_scsi-y := scsi.o
 
+obj-$(CONFIG_VHOST_VSOCK) += vhost_vsock.o
+vhost_vsock-y := vsock.o
+
 obj-$(CONFIG_VHOST_RING) += vringh.o
+
 obj-$(CONFIG_VHOST)+= vhost.o
diff --git a/net/vmw_vsock/Kconfig b/net/vmw_vsock/Kconfig
index 14810ab..74e0bc8 100644
--- a/net/vmw_vsock/Kconfig
+++ b/net/vmw_vsock/Kconfig
@@ -26,3 +26,21 @@ config VMWARE_VMCI_VSOCKETS
 
  To compile this driver as a module, choose M here: the module
  will be called vmw_vsock_vmci_transport. If unsure, say N.
+
+config VIRTIO_VSOCKETS
+   tristate "virtio transport for Virtual Sockets"
+   depends on VSOCKETS && VIRTIO
+   select VIRTIO_VSOCKETS_COMMON
+   help
+ This module implements a virtio transport for Virtual Sockets.
+
+ Enable this transport if your Virtual Machine runs on Qemu/KVM.
+
+ To compile this driver as a module, choose M here: the module
+ will be called virtio_vsock_transport. If unsure, say N.
+
+config VIRTIO_VSOCKETS_COMMON
+   tristate
+   ---help---
+ This option is selected by any driver which needs to access
+ the virtio_vsock.
diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile
index 2ce52d7..cf4c294 100644
--- a/net/vmw_vsock/Makefile
+++ b/net/vmw_vsock/Makefile
@@ -1,5 +1,7 @@
 obj-$(CONFIG_VSOCKETS) += vsock.o
 obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += vmw_vsock_vmci_transport.o
+obj-$(CONFIG_VIRTIO_VSOCKETS) += virtio_transport.o
+obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += virtio_transport_common.o
 
 vsock-y += af_vsock.o vsock_addr.o
 
-- 
2.5.0



[PATCH v3 3/4] VSOCK: Introduce vhost-vsock.ko

2015-12-09 Thread Stefan Hajnoczi
From: Asias He 

VM sockets vhost transport implementation. This module runs in host
kernel.

Signed-off-by: Asias He 
Signed-off-by: Stefan Hajnoczi 
---
v3:
 * Remove unneeded variable used to store return value
   (Fengguang Wu  and Julia Lawall
   )
v2:
 * Add missing total_tx_buf decrement
 * Support flexible rx/tx descriptor layout
 * Refuse to assign reserved CIDs
 * Refuse guest CID if already in use
 * Only accept correctly addressed packets
---
 drivers/vhost/vsock.c | 628 ++
 drivers/vhost/vsock.h |   4 +
 2 files changed, 632 insertions(+)
 create mode 100644 drivers/vhost/vsock.c
 create mode 100644 drivers/vhost/vsock.h

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
new file mode 100644
index 000..3c0034a
--- /dev/null
+++ b/drivers/vhost/vsock.c
@@ -0,0 +1,628 @@
+/*
+ * vhost transport for vsock
+ *
+ * Copyright (C) 2013-2015 Red Hat, Inc.
+ * Author: Asias He 
+ * Stefan Hajnoczi 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include "vhost.h"
+#include "vsock.h"
+
+#define VHOST_VSOCK_DEFAULT_HOST_CID   2
+
+static int vhost_transport_socket_init(struct vsock_sock *vsk,
+  struct vsock_sock *psk);
+
+enum {
+   VHOST_VSOCK_FEATURES = VHOST_FEATURES,
+};
+
+/* Used to track all the vhost_vsock instances on the system. */
+static LIST_HEAD(vhost_vsock_list);
+static DEFINE_MUTEX(vhost_vsock_mutex);
+
+struct vhost_vsock_virtqueue {
+   struct vhost_virtqueue vq;
+};
+
+struct vhost_vsock {
+   /* Vhost device */
+   struct vhost_dev dev;
+   /* Vhost vsock virtqueue*/
+   struct vhost_vsock_virtqueue vqs[VSOCK_VQ_MAX];
+   /* Link to global vhost_vsock_list*/
+   struct list_head list;
+   /* Head for pkt from host to guest */
+   struct list_head send_pkt_list;
+   /* Work item to send pkt */
+   struct vhost_work send_pkt_work;
+   /* Wait queue for send pkt */
+   wait_queue_head_t queue_wait;
+   /* Used for global tx buf limitation */
+   u32 total_tx_buf;
+   /* Guest context id this vhost_vsock instance handles */
+   u32 guest_cid;
+};
+
+static u32 vhost_transport_get_local_cid(void)
+{
+   return VHOST_VSOCK_DEFAULT_HOST_CID;
+}
+
+static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
+{
+   struct vhost_vsock *vsock;
+
+   mutex_lock(&vhost_vsock_mutex);
+   list_for_each_entry(vsock, &vhost_vsock_list, list) {
+   if (vsock->guest_cid == guest_cid) {
+   mutex_unlock(&vhost_vsock_mutex);
+   return vsock;
+   }
+   }
+   mutex_unlock(&vhost_vsock_mutex);
+
+   return NULL;
+}
+
+static void
+vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
+   struct vhost_virtqueue *vq)
+{
+   bool added = false;
+
+   mutex_lock(&vq->mutex);
+   vhost_disable_notify(&vsock->dev, vq);
+   for (;;) {
+   struct virtio_vsock_pkt *pkt;
+   struct iov_iter iov_iter;
+   unsigned out, in;
+   struct sock *sk;
+   size_t nbytes;
+   size_t len;
+   int head;
+
+   if (list_empty(&vsock->send_pkt_list)) {
+   vhost_enable_notify(&vsock->dev, vq);
+   break;
+   }
+
+   head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+&out, &in, NULL, NULL);
+   pr_debug("%s: head = %d\n", __func__, head);
+   if (head < 0)
+   break;
+
+   if (head == vq->num) {
+   if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
+   vhost_disable_notify(&vsock->dev, vq);
+   continue;
+   }
+   break;
+   }
+
+   pkt = list_first_entry(&vsock->send_pkt_list,
+  struct virtio_vsock_pkt, list);
+   list_del_init(&pkt->list);
+
+   if (out) {
+   virtio_transport_free_pkt(pkt);
+   vq_err(vq, "Expected 0 output buffers, got %u\n", out);
+   break;
+   }
+
+   len = iov_length(&vq->iov[out], in);
+   iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
+
+   nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
+   if (nbytes != sizeof(pkt->hdr)) {
+   virtio_transport_free_pkt(pkt);
+   vq_err(vq, "Faulted on copying pkt hdr\n");
+   break;
+   }
+
+   nbytes = copy_to_iter(pkt->buf, pkt->len, 

[PATCH v3 2/4] VSOCK: Introduce virtio-vsock.ko

2015-12-09 Thread Stefan Hajnoczi
From: Asias He 

VM sockets virtio transport implementation. This module runs in guest
kernel.

Signed-off-by: Asias He 
Signed-off-by: Stefan Hajnoczi 
---
v2:
 * Fix total_tx_buf accounting
 * Add virtio_transport global mutex to prevent races
---
 net/vmw_vsock/virtio_transport.c | 466 +++
 1 file changed, 466 insertions(+)
 create mode 100644 net/vmw_vsock/virtio_transport.c

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
new file mode 100644
index 000..df65dca
--- /dev/null
+++ b/net/vmw_vsock/virtio_transport.c
@@ -0,0 +1,466 @@
+/*
+ * virtio transport for vsock
+ *
+ * Copyright (C) 2013-2015 Red Hat, Inc.
+ * Author: Asias He 
+ * Stefan Hajnoczi 
+ *
+ * Some of the code is take from Gerd Hoffmann 's
+ * early virtio-vsock proof-of-concept bits.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static struct workqueue_struct *virtio_vsock_workqueue;
+static struct virtio_vsock *the_virtio_vsock;
+static DEFINE_MUTEX(the_virtio_vsock_mutex); /* protects the_virtio_vsock */
+static void virtio_vsock_rx_fill(struct virtio_vsock *vsock);
+
+struct virtio_vsock {
+   /* Virtio device */
+   struct virtio_device *vdev;
+   /* Virtio virtqueue */
+   struct virtqueue *vqs[VSOCK_VQ_MAX];
+   /* Wait queue for send pkt */
+   wait_queue_head_t queue_wait;
+   /* Work item to send pkt */
+   struct work_struct tx_work;
+   /* Work item to recv pkt */
+   struct work_struct rx_work;
+   /* Mutex to protect send pkt*/
+   struct mutex tx_lock;
+   /* Mutex to protect recv pkt*/
+   struct mutex rx_lock;
+   /* Number of recv buffers */
+   int rx_buf_nr;
+   /* Number of max recv buffers */
+   int rx_buf_max_nr;
+   /* Used for global tx buf limitation */
+   u32 total_tx_buf;
+   /* Guest context id, just like guest ip address */
+   u32 guest_cid;
+};
+
+static struct virtio_vsock *virtio_vsock_get(void)
+{
+   return the_virtio_vsock;
+}
+
+static u32 virtio_transport_get_local_cid(void)
+{
+   struct virtio_vsock *vsock = virtio_vsock_get();
+
+   return vsock->guest_cid;
+}
+
+static int
+virtio_transport_send_pkt(struct vsock_sock *vsk,
+ struct virtio_vsock_pkt_info *info)
+{
+   u32 src_cid, src_port, dst_cid, dst_port;
+   int ret, in_sg = 0, out_sg = 0;
+   struct virtio_transport *trans;
+   struct virtio_vsock_pkt *pkt;
+   struct virtio_vsock *vsock;
+   struct scatterlist hdr, buf, *sgs[2];
+   struct virtqueue *vq;
+   u32 pkt_len = info->pkt_len;
+   DEFINE_WAIT(wait);
+
+   vsock = virtio_vsock_get();
+   if (!vsock)
+   return -ENODEV;
+
+   src_cid = virtio_transport_get_local_cid();
+   src_port = vsk->local_addr.svm_port;
+   if (!info->remote_cid) {
+   dst_cid = vsk->remote_addr.svm_cid;
+   dst_port = vsk->remote_addr.svm_port;
+   } else {
+   dst_cid = info->remote_cid;
+   dst_port = info->remote_port;
+   }
+
+   trans = vsk->trans;
+   vq = vsock->vqs[VSOCK_VQ_TX];
+
+   if (pkt_len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
+   pkt_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
+   pkt_len = virtio_transport_get_credit(trans, pkt_len);
+   /* Do not send zero length OP_RW pkt*/
+   if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
+   return pkt_len;
+
+   /* Respect global tx buf limitation */
+   mutex_lock(&vsock->tx_lock);
+   while (pkt_len + vsock->total_tx_buf > VIRTIO_VSOCK_MAX_TX_BUF_SIZE) {
+   prepare_to_wait_exclusive(&vsock->queue_wait, &wait,
+ TASK_UNINTERRUPTIBLE);
+   mutex_unlock(&vsock->tx_lock);
+   schedule();
+   mutex_lock(&vsock->tx_lock);
+   finish_wait(&vsock->queue_wait, &wait);
+   }
+   vsock->total_tx_buf += pkt_len;
+   mutex_unlock(>tx_lock);
+
+   pkt = virtio_transport_alloc_pkt(vsk, info, pkt_len,
+src_cid, src_port,
+dst_cid, dst_port);
+   if (!pkt) {
+   mutex_lock(&vsock->tx_lock);
+   vsock->total_tx_buf -= pkt_len;
+   mutex_unlock(&vsock->tx_lock);
+   virtio_transport_put_credit(trans, pkt_len);
+   return -ENOMEM;
+   }
+
+   pr_debug("%s:info->pkt_len= %d\n", __func__, info->pkt_len);
+
+   /* Will be released in virtio_transport_send_pkt_work */
+   sock_hold(&trans->vsk->sk);
+   virtio_transport_inc_tx_pkt(pkt);
+
+   /* Put pkt in the virtqueue */
+   sg_init_one(&hdr, &pkt->hdr, 

Re: live migration vs device assignment (motivation)

2015-12-09 Thread Lan, Tianyu

On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:

I thought about what this is doing at the high level, and I do see some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.


Motivation:
Most current solutions for migration with a passthrough device are based
on PCI hotplug, but that has side effects and can't work for all devices.

For NIC device:
The PCI hotplug solution can work around network device migration
by switching between the VF and the PF.

But switching network interfaces introduces service downtime.

I tested the service downtime by putting the VF and PV interfaces
into a bonded interface and pinging the bonded interface while plugging
and unplugging the VF:
1) About 100ms when adding the VF
2) About 30ms when deleting the VF

It also requires the guest to do the switch configuration. These are hard
for our customers to manage and deploy. To maintain PV performance during
migration, the host side also needs to assign a VF to the PV device. This
affects scalability.

These factors block SR-IOV NIC passthrough usage in cloud services and
OPNFV, which require high network performance and stability.


For other kinds of devices, this approach is hard to make work.
We are also adding migration support for QAT(QuickAssist Technology) device.

QAT device user case introduction.
Server, networking, big data, and storage applications use QuickAssist 
Technology to offload servers from handling compute-intensive 
operations, such as:
1) Symmetric cryptography functions including cipher operations and 
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic 
curve cryptography

3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration, and these 
operations will fail when the device is unplugged.


So we are trying to implement a new solution which really migrates
device state to the target machine and won't affect the user during
migration, with low service downtime.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Alexander Duyck
On Wed, Dec 9, 2015 at 1:28 AM, Lan, Tianyu  wrote:
>
>
> On 12/8/2015 1:12 AM, Alexander Duyck wrote:
>>
>> On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu  wrote:
>>>
>>> On 12/5/2015 1:07 AM, Alexander Duyck wrote:
>
>
>
> We still need to support Windows guest for migration and this is why
> our
> patches keep all changes in the driver since it's impossible to change
> Windows kernel.



 That is a poor argument.  I highly doubt Microsoft is interested in
 having to modify all of the drivers that will support direct assignment
 in order to support migration.  They would likely request something
 similar to what I have in that they will want a way to do DMA tracking
 with minimal modification required to the drivers.
>>>
>>>
>>>
>>> This totally depends on the NIC or other device vendors, and they
>>> should make the decision to support migration or not. If yes, they would
>>> modify the driver.
>>
>>
>> Having to modify every driver that wants to support live migration is
>> a bit much.  In addition I don't see this being limited only to NIC
>> devices.  You can direct assign a number of different devices, your
>> solution cannot be specific to NICs.
>
>
> We are also adding such migration support for the QAT device, so our
> solution will not just be limited to NICs. This is just the beginning.

Agreed, but still QAT is networking related.  My advice would be to
look at something else that works from within a different subsystem
such as storage.  All I am saying is that your solution is very
networking centric.

> We can't limit user to only use Linux guest. So the migration feature
> should work for both Windows and Linux guest.

Right now what your solution is doing is to limit things so that only
the Intel NICs can support this since it will require driver
modification across the board.  Instead what I have proposed should
make it so that once you have done the work there should be very
little work that has to be done on your port to support any device.

>>
>>> If the only target is to call suspend/resume during migration, the feature
>>> will be meaningless. In most cases we don't want to affect the user much
>>> during migration, so the service downtime is vital. Our target is to apply
>>> SR-IOV NIC passthrough to cloud service and NFV (network functions
>>> virtualization) projects, which are sensitive to network performance
>>> and stability. In my opinion, we should give the device driver a chance
>>> to implement its own migration job, and call the suspend and resume
>>> callbacks in the driver if it doesn't care about performance during
>>> migration.
>>
>>
>> The suspend/resume callback should be efficient in terms of time.
>> After all we don't want the system to stall for a long period of time
>> when it should be either running or asleep.  Having it burn cycles in
>> a power state limbo doesn't do anyone any good.  If nothing else maybe
>> it will help to push the vendors to speed up those functions which
>> then benefit migration and the system sleep states.
>
>
> If we can benefit both migration and suspend, that would be wonderful.
> But migration and system PM are still different. Just for example, the
> driver doesn't need to put the device into a deep D-state during
> migration (the host can do this after migration), while that is
> essential for system sleep. PCI config space and interrupt
> configuration are emulated by QEMU, and QEMU can migrate them to the
> new machine; the driver doesn't need to deal with such things. So I
> think migration still needs a different callback or a different code
> path than device suspend/resume.

SR-IOV devices are considered to be in D3 as soon as you clear the bus
master enable bit.  They don't actually have a PCIe power management
block in their configuration space.  The advantage of the
suspend/resume approach is that the D0->D3->D0 series of transitions
should trigger a PCIe reset on the device.  As such the resume call is
capable of fully reinitializing a device.

As far as migrating the interrupts themselves goes, moving live
interrupts is problematic.  You are more likely to throw them out of
sync, since the state of the device will not match the state of what you
migrated for things like the pending bit array, so if there is a device
that actually depends on those bits you might run into issues.

> Another concern is that we have to rework the PM core or PCI bus driver
> to call suspend/resume for passthrough devices during migration. This
> also blocks the feature from working on Windows.

If I am not mistaken the Windows drivers have a similar feature that
is called when you disable or enable an interface.  I believe the
motivation for using D3 when a device has been disabled is to save
power on the system since in D3 the device should be in its lowest
power state.

>>
>> Also you keep assuming you can keep the device running while you do
>> the migration and you can't.  You are going to corrupt the 

Re: live migration vs device assignment (motivation)

2015-12-09 Thread Alexander Duyck
On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu  wrote:

> For other kinds of devices, this approach is hard to make work.
> We are also adding migration support for QAT(QuickAssist Technology) device.
>
> QAT device user case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive operations,
> such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
>
> PCI hotplug will not work for such devices during migration, and these
> operations will fail when the device is unplugged.

I assume the problem is that with a PCI hotplug event you are losing
the state information for the device, do I have that right?

Looking over the QAT drivers it doesn't seem like any of them support
the suspend/resume PM calls.  I would imagine it makes it difficult
for a system with a QAT card in it to be able to drop the system to a
low power state.  You might want to try enabling suspend/resume
support for the devices on bare metal before you attempt to take on
migration as it would provide you with a good testing framework to see
what needs to be saved/restored within the device and in what order
before you attempt to do the same while migrating from one system to
another.

- Alex


[Patch V0] This patch adds some support required for KVM in order to support LMCE.

2015-12-09 Thread Ashok Raj
- Add support for MSR_IA32_MCG_EXT_CTL
- Add MCG_LMCE_P to KVM_MCE_CAP_SUPPORTED
- Changes to IA32_FEATURE_CONTROL: allow this MSR to be defined not just for
  nested VMM; it is now also required for Local MCE.

Reviewed-by: Andi Kleen 
Reviewed-by: Tony Luck 
Tested-by: Gong Chen 
Signed-off-by: Ashok Raj 
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx.c  | 26 +-
 arch/x86/kvm/x86.c  | 17 -
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 30cfd64..6940141 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -525,6 +525,7 @@ struct kvm_vcpu_arch {
u64 mcg_cap;
u64 mcg_status;
u64 mcg_ctl;
+   u64 mcg_ext_ctl;
u64 *mce_banks;
 
/* Cache MMIO info */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 87acc52..c2ce9f4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2747,6 +2747,20 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
return 0;
 }
 
+bool can_feature_control_exist(struct kvm_vcpu *vcpu)
+{
+   /*
+* There are some features that require BIOS enabling.
+* In such cases BIOS is supposed to set this bit and indicate
+* the feature is enabled and available to the OS.
+* Local Machine Check Exception (LMCE) is one such feature.
+*/
+   if (vcpu->arch.mcg_cap & MCG_LMCE_P)
+   return true;
+
+   return (nested_vmx_allowed(vcpu));
+}
+
 /*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
@@ -2789,9 +2803,11 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = vmcs_read64(GUEST_BNDCFGS);
break;
case MSR_IA32_FEATURE_CONTROL:
-   if (!nested_vmx_allowed(vcpu))
+   if (can_feature_control_exist(vcpu))
+   msr_info->data =
+   to_vmx(vcpu)->nested.msr_ia32_feature_control;
+   else
return 1;
-   msr_info->data = to_vmx(vcpu)->nested.msr_ia32_feature_control;
break;
case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
if (!nested_vmx_allowed(vcpu))
@@ -2882,9 +2898,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
ret = kvm_set_msr_common(vcpu, msr_info);
break;
case MSR_IA32_FEATURE_CONTROL:
-   if (!nested_vmx_allowed(vcpu) ||
-   (to_vmx(vcpu)->nested.msr_ia32_feature_control &
-FEATURE_CONTROL_LOCKED && !msr_info->host_initiated))
+   if ((can_feature_control_exist(vcpu) == false) ||
+   ((to_vmx(vcpu)->nested.msr_ia32_feature_control &
+FEATURE_CONTROL_LOCKED) && !msr_info->host_initiated))
return 1;
vmx->nested.msr_ia32_feature_control = data;
if (msr_info->host_initiated && data == 0)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 00462bd..0da3871 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -70,7 +70,7 @@
 
 #define MAX_IO_MSRS 256
 #define KVM_MAX_MCE_BANKS 32
-#define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P)
+#define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P | MCG_LMCE_P)
 
 #define emul_to_vcpu(ctxt) \
container_of(ctxt, struct kvm_vcpu, arch.emulate_ctxt)
@@ -974,6 +974,7 @@ static u32 emulated_msrs[] = {
MSR_IA32_MISC_ENABLE,
MSR_IA32_MCG_STATUS,
MSR_IA32_MCG_CTL,
+   MSR_IA32_MCG_EXT_CTL,
MSR_IA32_SMBASE,
 };
 
@@ -1913,6 +1914,13 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 data)
return -1;
vcpu->arch.mcg_ctl = data;
break;
+   case MSR_IA32_MCG_EXT_CTL:
+   if (!(mcg_cap & MCG_LMCE_P))
+   return 1;
+   if (data != 0 && data != 0x1)
+   return -1;
+   vcpu->arch.mcg_ext_ctl = data;
+   break;
default:
if (msr >= MSR_IA32_MC0_CTL &&
msr < MSR_IA32_MCx_CTL(bank_num)) {
@@ -2170,6 +2178,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
+   case MSR_IA32_MCG_EXT_CTL:
case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
return set_msr_mce(vcpu, msr, data);
 
@@ -2266,6 +2275,11 @@ static int get_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
return 1;
data = vcpu->arch.mcg_ctl;
break;
+

Re: [PATCH v3 0/4] Add virtio transport for AF_VSOCK

2015-12-09 Thread Michael S. Tsirkin
On Wed, Dec 09, 2015 at 08:03:49PM +0800, Stefan Hajnoczi wrote:
> Note: the virtio-vsock device specification is currently under review but not
> yet finalized.  Please review this code but don't merge until I send an update
> when the spec is finalized.  Thanks!

Yes, this should have RFC in the subject.

> v3:
>  * Remove unnecessary 3-way handshake, just do REQUEST/RESPONSE instead
>of REQUEST/RESPONSE/ACK
>  * Remove SOCK_DGRAM support and focus on SOCK_STREAM first
>(also drop v2 Patch 1, it's only needed for SOCK_DGRAM)
>  * Only allow host->guest connections (same security model as latest
>VMware)
>  * Don't put vhost vsock driver into staging
>  * Add missing Kconfig dependencies (Arnd Bergmann )
>  * Remove unneeded variable used to store return value
>(Fengguang Wu  and Julia Lawall
>)
> 
> v2:
>  * Rebased onto Linux v4.4-rc2
>  * vhost: Refuse to assign reserved CIDs
>  * vhost: Refuse guest CID if already in use
>  * vhost: Only accept correctly addressed packets (no spoofing!)
>  * vhost: Support flexible rx/tx descriptor layout
>  * vhost: Add missing total_tx_buf decrement
>  * virtio_transport: Fix total_tx_buf accounting
>  * virtio_transport: Add virtio_transport global mutex to prevent races
>  * common: Notify other side of SOCK_STREAM disconnect (fixes shutdown
>semantics)
>  * common: Avoid recursive mutex_lock(tx_lock) for write_space (fixes 
> deadlock)
>  * common: Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants
>  * common: Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants
>  * common: Fix peer_buf_alloc inheritance on child socket
> 
> This patch series adds a virtio transport for AF_VSOCK (net/vmw_vsock/).
> AF_VSOCK is designed for communication between virtual machines and
> hypervisors.  It is currently only implemented for VMware's VMCI transport.
> 
> This series implements the proposed virtio-vsock device specification from
> here:
> http://permalink.gmane.org/gmane.comp.emulators.virtio.devel/980
> 
> Most of the work was done by Asias He and Gerd Hoffmann a while back.  I have
> picked up the series again.
> 
> The QEMU userspace changes are here:
> https://github.com/stefanha/qemu/commits/vsock
> 
> Why virtio-vsock?
> -----------------
> Guest<->host communication is currently done over the virtio-serial device.
> This makes it hard to port sockets API-based applications and is limited to
> static ports.
> 
> virtio-vsock uses the sockets API so that applications can rely on familiar
> SOCK_STREAM semantics.  Applications on the host can easily connect to guest
> agents because the sockets API allows multiple connections to a listen socket
> (unlike virtio-serial).  This simplifies the guest<->host communication and
> eliminates the need for extra processes on the host to arbitrate virtio-serial
> ports.
> 
> Overview
> --------
> This series adds 3 pieces:
> 
> 1. virtio_transport_common.ko - core virtio vsock code that uses vsock.ko
> 
> 2. virtio_transport.ko - guest driver
> 
> 3. drivers/vhost/vsock.ko - host driver
> 
> Howto
> -----
> The following kernel options are needed:
>   CONFIG_VSOCKETS=y
>   CONFIG_VIRTIO_VSOCKETS=y
>   CONFIG_VIRTIO_VSOCKETS_COMMON=y
>   CONFIG_VHOST_VSOCK=m
> 
> Launch QEMU as follows:
>   # qemu ... -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=3
> 
> Guest and host can communicate via AF_VSOCK sockets.  The host's CID (address)
> is 2 and the guest must be assigned a CID (3 in the example above).
> 
> Status
> ------
> This patch series implements the latest draft specification.  Please review.
> 
> Asias He (4):
>   VSOCK: Introduce virtio-vsock-common.ko
>   VSOCK: Introduce virtio-vsock.ko
>   VSOCK: Introduce vhost-vsock.ko
>   VSOCK: Add Makefile and Kconfig
> 
>  drivers/vhost/Kconfig   |  10 +
>  drivers/vhost/Makefile  |   4 +
>  drivers/vhost/vsock.c   | 628 +++
>  drivers/vhost/vsock.h   |   4 +
>  include/linux/virtio_vsock.h| 203 
>  include/uapi/linux/virtio_ids.h |   1 +
>  include/uapi/linux/virtio_vsock.h   |  87 
>  net/vmw_vsock/Kconfig   |  18 +
>  net/vmw_vsock/Makefile  |   2 +
>  net/vmw_vsock/virtio_transport.c| 466 +
>  net/vmw_vsock/virtio_transport_common.c | 854 
> 
>  11 files changed, 2277 insertions(+)
>  create mode 100644 drivers/vhost/vsock.c
>  create mode 100644 drivers/vhost/vsock.h
>  create mode 100644 include/linux/virtio_vsock.h
>  create mode 100644 include/uapi/linux/virtio_vsock.h
>  create mode 100644 net/vmw_vsock/virtio_transport.c
>  create mode 100644 net/vmw_vsock/virtio_transport_common.c
> 
> -- 
> 2.5.0

kvmclock doesn't work, help?

2015-12-09 Thread Andy Lutomirski
I'm trying to clean up kvmclock and I can't get it to work at all.  My
host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.

If I boot an SMP (2 vcpus) guest, tracing says:

 qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 0
 qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
 qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1
 qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1
 qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1


If I boot a UP guest, tracing says:

 qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1
 qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1

I suspect, but I haven't verified, that this is fallout from:

commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
Author: Marcelo Tosatti 
Date:   Wed May 14 12:43:24 2014 -0300

KVM: x86: disable master clock if TSC is reset during suspend

Updating system_time from the kernel clock once master clock
has been enabled can result in time backwards event, in case
kernel clock frequency is lower than TSC frequency.

Disable master clock in case it is necessary to update it
from the resume path.

Signed-off-by: Marcelo Tosatti 
Signed-off-by: Paolo Bonzini 


Can we please stop making kvmclock more complex?  It's a beast right
now, and not in a good way.  It's far too tangled with the vclock
machinery on both the host and guest sides, the pvclock stuff is not
well thought out (even in principle in an ABI sense), and it's never
been clear to me what problem exactly the kvmclock stuff is supposed
to solve.


I'm somewhat tempted to suggest that we delete kvmclock entirely and
start over.  A correctly functioning KVM guest using TSC (i.e.
ignoring kvmclock entirely) seems to work rather more reliably and
considerably faster than a kvmclock guest.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: kvmclock doesn't work, help?

2015-12-09 Thread Andy Lutomirski
On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini  wrote:
>
>
> On 09/12/2015 22:10, Andy Lutomirski wrote:
>> Can we please stop making kvmclock more complex?  It's a beast right
>> now, and not in a good way.  It's far too tangled with the vclock
>> machinery on both the host and guest sides, the pvclock stuff is not
>> well thought out (even in principle in an ABI sense), and it's never
>> been clear to me what problem exactly the kvmclock stuff is supposed
>> to solve.
>
> It's supposed to solve the problem that:
>
> - not all hosts have a working TSC

Fine, but we don't need any vdso integration for that.

>
> - even if they all do, virtual machines can be migrated (or
> saved/restored) to a host with a different TSC frequency

OK, I buy that.  So we want to export a linear function that the guest
can apply to the TSC.

I suppose we also want ntp frequency corrections on the host to
propagate to the guest.

>
> - any MMIO- or PIO-based mechanism to access the current time is orders
> of magnitude slower than the TSC and less precise too.

Yup.  But TSC by itself gets that benefit, too.

>
>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>> start over.  A correctly functioning KVM guest using TSC (i.e.
>> ignoring kvmclock entirely) seems to work rather more reliably and
>> considerably faster than a kvmclock guest.
>
> If all your hosts have a working TSC and you don't do migration or
> save/restore, that's a valid configuration.  It's not a good default,
> however.

Er?

kvmclock is still really quite slow and buggy.  And the patch I
identified is definitely a problem here:

[  136.131241] KVM: disabling fast timing permanently due to inability
to recover from suspend

I got that on the host with this whitespace-damaged patch:

if (backwards_tsc) {
u64 delta_cyc = max_tsc - local_tsc;
+   if (!backwards_tsc_observed)
+   pr_warn("KVM: disabling fast timing
permanently due to inability to recover from suspend\n");

when I suspended and resumed.

Can anyone explain what problem
16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
brief inspection, it just seems to be incorrect.  Shouldn't KVM's
normal TSC logic handle that case right?  After all, all vcpus should
be paused when we resume from suspend.  At worst, we should just need
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
shouldn't we do that regardless of which way the TSC jumped on
suspend/resume?  After all, the TSC-to-wall-clock offset is quite
likely to change except on the very small handful of CPUs (if any)
that keep the TSC running in S3 and hibernate.)

--Andy


Re: kvmclock doesn't work, help?

2015-12-09 Thread Andy Lutomirski
On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini  wrote:
>
>
> On 09/12/2015 22:49, Andy Lutomirski wrote:
>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini  wrote:
>>>
>>>
>>> On 09/12/2015 22:10, Andy Lutomirski wrote:
 Can we please stop making kvmclock more complex?  It's a beast right
 now, and not in a good way.  It's far too tangled with the vclock
 machinery on both the host and guest sides, the pvclock stuff is not
 well thought out (even in principle in an ABI sense), and it's never
 been clear to me what problem exactly the kvmclock stuff is supposed
 to solve.
>>>
>>> It's supposed to solve the problem that:
>>>
>>> - not all hosts have a working TSC
>>
>> Fine, but we don't need any vdso integration for that.
>
> Well, you still want a fast time source.  That was a given. :)

If the host can't figure out how to give *itself* a fast time source,
I'd be surprised if KVM can manage to give the guest a fast, reliable
time source.

>
>>> - even if they all do, virtual machines can be migrated (or
>>> saved/restored) to a host with a different TSC frequency
>>>
>>> - any MMIO- or PIO-based mechanism to access the current time is orders
>>> of magnitude slower than the TSC and less precise too.
>>
>> Yup.  But TSC by itself gets that benefit, too.
>
> Yes, the problem is if you want to solve all three of them.  The first
> two are solved by the ACPI PM timer with a decent resolution (70
> ns---much faster anyway than an I/O port access).  The third is solved
> by TSC.  To solve all three, you need kvmclock.

Still confused.  Is kvmclock really used in cases where even the host
can't pull off a working TSC?

>
 I'm somewhat tempted to suggest that we delete kvmclock entirely and
 start over.  A correctly functioning KVM guest using TSC (i.e.
 ignoring kvmclock entirely) seems to work rather more reliably and
 considerably faster than a kvmclock guest.
>>>
>>> If all your hosts have a working TSC and you don't do migration or
>>> save/restore, that's a valid configuration.  It's not a good default,
>>> however.
>>
>> Er?
>>
>> kvmclock is still really quite slow and buggy.
>
> Unless it takes 3-4000 clock cycles for a gettimeofday, which it
> shouldn't even with vdso disabled, it's definitely not slower than PIO.
>
>> And the patch I identified is definitely a problem here:
>>
>> [  136.131241] KVM: disabling fast timing permanently due to inability
>> to recover from suspend
>>
>> I got that on the host with this whitespace-damaged patch:
>>
>> if (backwards_tsc) {
>> u64 delta_cyc = max_tsc - local_tsc;
>> +   if (!backwards_tsc_observed)
>> +   pr_warn("KVM: disabling fast timing
>> permanently due to inability to recover from suspend\n");
>>
>> when I suspended and resumed.
>>
>> Can anyone explain what problem
>> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
>> brief inspection, it just seems to be incorrect.  Shouldn't KVM's
>> normal TSC logic handle that case right?  After all, all vcpus should
>> be paused when we resume from suspend.  At worst, we should just need
>> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
>> shouldn't we do that regardless of which way the TSC jumped on
>> suspend/resume?  After all, the TSC-to-wall-clock offset is quite
>> likely to change except on the very small handful of CPUs (if any)
>> that keep the TSC running in S3 and hibernate.)
>
> I don't recall the details of that patch, so Marcelo will have to answer
> this, or Alex too since he chimed in the original thread.  At least it
> should be made conditional on the existence of a VM at suspend time (and
> the master clock stuff should be made per VM, as I suggested at
> https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).
>
> It would indeed be great if the master clock could be dropped.  But I'm
> definitely missing some of the subtle details. :(

Me, too.

Anyway, see the attached untested patch.  Marcelo?

--Andy
From e4a5e834d3fb6fc2499966e1af42cb5bd59f4410 Mon Sep 17 00:00:00 2001
Message-Id: 
From: Andy Lutomirski 
Date: Wed, 9 Dec 2015 14:21:05 -0800
Subject: [PATCH] x86/kvm: On KVM re-enable (e.g. after suspend), update clocks

This gets rid of the "did TSC go backwards" logic and just updates
all clocks.  It should work better (no more disabling of fast
timing) and more reliably (all of the clocks are actually updated).

Signed-off-by: Andy Lutomirski 
---
 arch/x86/kvm/x86.c | 75 +++---
 1 file changed, 3 insertions(+), 72 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eed32283d22c..c88f91f4b1a3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -123,8 +123,6 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 unsigned 

Re: kvmclock doesn't work, help?

2015-12-09 Thread Paolo Bonzini


On 09/12/2015 23:27, Andy Lutomirski wrote:
> On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini  wrote:
>> On 09/12/2015 22:49, Andy Lutomirski wrote:
>>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini  wrote:


 On 09/12/2015 22:10, Andy Lutomirski wrote:
> Can we please stop making kvmclock more complex?  It's a beast right
> now, and not in a good way.  It's far too tangled with the vclock
> machinery on both the host and guest sides, the pvclock stuff is not
> well thought out (even in principle in an ABI sense), and it's never
> been clear to me what problem exactly the kvmclock stuff is supposed
> to solve.

 It's supposed to solve the problem that:

 - not all hosts have a working TSC
>>>
>>> Fine, but we don't need any vdso integration for that.
>>
>> Well, you still want a fast time source.  That was a given. :)
> 
> If the host can't figure out how to give *itself* a fast time source,
> I'd be surprised if KVM can manage to give the guest a fast, reliable
> time source.

There's no vdso integration unless the host has a constant, nonstop
(fully "working") TSC.  That's the meaning of PVCLOCK_TSC_STABLE_BIT.

So, correction:  if you can pull it off, you still want a fast time
source.  Otherwise, you still want one that is as fast as possible,
especially on the kernel side.

 - even if they all do, virtual machines can be migrated (or
 saved/restored) to a host with a different TSC frequency

 - any MMIO- or PIO-based mechanism to access the current time is orders
 of magnitude slower than the TSC and less precise too.
>>
>> the problem is if you want to solve all three of them.  The first
>> two are solved by the ACPI PM timer with a decent resolution (70
>> ns---much faster anyway than an I/O port access).  The third is solved
>> by TSC.  To solve all three, you need kvmclock.
> 
> Still confused.  Is kvmclock really used in cases where even the host
> can't pull off a working TSC?

You can certainly provide kvmclock even if you lack constant-rate or
nonstop TSC.  Those are only a requirement for vdso.

If the host has a constant-rate TSC, but the rate differs per physical
CPU (common on older NUMA machines), you can easily provide a working
kvmclock.  It cannot support vdso because you'll need to read the time
from a non-preemptable section, but it will work because KVM can update
the kvmclock parameters on VCPU migration, and it's still faster than
anything else.  (The purpose of the now-gone migration notifiers was to
support vdso even in this case).

If the host doesn't even have constant-rate TSC, you can still provide
kernel-only kvmclock reads through cpufreq notifiers.

Paolo


Re: kvmclock doesn't work, help?

2015-12-09 Thread Paolo Bonzini


On 09/12/2015 22:10, Andy Lutomirski wrote:
> Can we please stop making kvmclock more complex?  It's a beast right
> now, and not in a good way.  It's far too tangled with the vclock
> machinery on both the host and guest sides, the pvclock stuff is not
> well thought out (even in principle in an ABI sense), and it's never
> been clear to me what problem exactly the kvmclock stuff is supposed
> to solve.

It's supposed to solve the problem that:

- not all hosts have a working TSC

- even if they all do, virtual machines can be migrated (or
saved/restored) to a host with a different TSC frequency

- any MMIO- or PIO-based mechanism to access the current time is orders
of magnitude slower than the TSC and less precise too.

> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> start over.  A correctly functioning KVM guest using TSC (i.e.
> ignoring kvmclock entirely) seems to work rather more reliably and
> considerably faster than a kvmclock guest.

If all your hosts have a working TSC and you don't do migration or
save/restore, that's a valid configuration.  It's not a good default,
however.

Paolo


x86: Don't report guest userspace emulation error to userspace, why?

2015-12-09 Thread Bandan Das

Commit a2b9e6c1a35afcc09:

KVM: x86: Don't report guest userspace emulation error to userspace

Commit fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures to
user-space") disabled the reporting of L2 (nested guest) emulation
failures to userspace due to a race condition between a vmexit and the
instruction emulator. The same rationale also applies to userspace
applications that are permitted by the guest OS to access MMIO areas or
perform PIO.

This patch extends the current behavior - of injecting a #UD instead of
reporting it to userspace - also for guest userspace code.

I searched the archives but failed to find anything. Can someone please
explain why this is needed? Or, why not let userspace decide what to do based
on the cpl, whether to continue execution or kill the guest? Is the assumption
here that this is what userspace always wants?


Re: kvmclock doesn't work, help?

2015-12-09 Thread Andy Lutomirski
On Wed, Dec 9, 2015 at 2:27 PM, Andy Lutomirski  wrote:
> On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini  wrote:
>>
>>
>> On 09/12/2015 22:49, Andy Lutomirski wrote:
>>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini  wrote:


 On 09/12/2015 22:10, Andy Lutomirski wrote:
> Can we please stop making kvmclock more complex?  It's a beast right
> now, and not in a good way.  It's far too tangled with the vclock
> machinery on both the host and guest sides, the pvclock stuff is not
> well thought out (even in principle in an ABI sense), and it's never
> been clear to me what problem exactly the kvmclock stuff is supposed
> to solve.

 It's supposed to solve the problem that:

 - not all hosts have a working TSC
>>>
>>> Fine, but we don't need any vdso integration for that.
>>
>> Well, you still want a fast time source.  That was a given. :)
>
> If the host can't figure out how to give *itself* a fast time source,
> I'd be surprised if KVM can manage to give the guest a fast, reliable
> time source.
>
>>
 - even if they all do, virtual machines can be migrated (or
 saved/restored) to a host with a different TSC frequency

 - any MMIO- or PIO-based mechanism to access the current time is orders
 of magnitude slower than the TSC and less precise too.
>>>
>>> Yup.  But TSC by itself gets that benefit, too.
>>
>> Yes, the problem is if you want to solve all three of them.  The first
>> two are solved by the ACPI PM timer with a decent resolution (70
>> ns---much faster anyway than an I/O port access).  The third is solved
>> by TSC.  To solve all three, you need kvmclock.
>
> Still confused.  Is kvmclock really used in cases where even the host
> can't pull off a working TSC?
>
>>
> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> start over.  A correctly functioning KVM guest using TSC (i.e.
> ignoring kvmclock entirely) seems to work rather more reliably and
> considerably faster than a kvmclock guest.

 If all your hosts have a working TSC and you don't do migration or
 save/restore, that's a valid configuration.  It's not a good default,
 however.
>>>
>>> Er?
>>>
>>> kvmclock is still really quite slow and buggy.
>>
>> Unless it takes 3-4000 clock cycles for a gettimeofday, which it
>> shouldn't even with vdso disabled, it's definitely not slower than PIO.
>>
>>> And the patch I identified is definitely a problem here:
>>>
>>> [  136.131241] KVM: disabling fast timing permanently due to inability
>>> to recover from suspend
>>>
>>> I got that on the host with this whitespace-damaged patch:
>>>
>>> if (backwards_tsc) {
>>> u64 delta_cyc = max_tsc - local_tsc;
>>> +   if (!backwards_tsc_observed)
>>> +   pr_warn("KVM: disabling fast timing
>>> permanently due to inability to recover from suspend\n");
>>>
>>> when I suspended and resumed.
>>>
>>> Can anyone explain what problem
>>> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
>>> brief inspection, it just seems to be incorrect.  Shouldn't KVM's
>>> normal TSC logic handle that case right?  After all, all vcpus should
>>> be paused when we resume from suspend.  At worst, we should just need
>>> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
>>> shouldn't we do that regardless of which way the TSC jumped on
>>> suspend/resume?  After all, the jTSC-to-wall-clock offset is quite
>>> likely to change except on the very small handful of CPUs (if any)
>>> that keep the TSC running in S3 and hibernate.)
>>
>> I don't recall the details of that patch, so Marcelo will have to answer
>> this, or Alex too since he chimed in the original thread.  At least it
>> should be made conditional on the existence of a VM at suspend time (and
>> the master clock stuff should be made per VM, as I suggested at
>> https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).
>>
>> It would indeed be great if the master clock could be dropped.  But I'm
>> definitely missing some of the subtle details. :(
>
> Me, too.
>
> Anyway, see the attached untested patch.  Marcelo?

That patch seems to work.  I have valid timing before and after host
suspend.  When I suspend and resume the host with a running guest, I
get:

[   26.504071] clocksource: timekeeping watchdog: Marking clocksource
'tsc' as unstable because the skew is too large:
[   26.505253] clocksource:   'kvm-clock' wd_now:
66744c542 wd_last: 564b09794 mask: 
[   26.506436] clocksource:   'tsc' cs_now:
fee310b133c8 cs_last: cf5d0b952 mask: 

in the guest, which is arguably correct.  KVM could be further
improved to update the tsc offset after suspend/resume to get rid of
that artifact.

--Andy

Re: kvmclock doesn't work, help?

2015-12-09 Thread Paolo Bonzini


On 09/12/2015 22:49, Andy Lutomirski wrote:
> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini  wrote:
>>
>>
>> On 09/12/2015 22:10, Andy Lutomirski wrote:
>>> Can we please stop making kvmclock more complex?  It's a beast right
>>> now, and not in a good way.  It's far too tangled with the vclock
>>> machinery on both the host and guest sides, the pvclock stuff is not
>>> well thought out (even in principle in an ABI sense), and it's never
>>> been clear to me what problem exactly the kvmclock stuff is supposed
>>> to solve.
>>
>> It's supposed to solve the problem that:
>>
>> - not all hosts have a working TSC
> 
> Fine, but we don't need any vdso integration for that.

Well, you still want a fast time source.  That was a given. :)

>> - even if they all do, virtual machines can be migrated (or
>> saved/restored) to a host with a different TSC frequency
>> 
>> - any MMIO- or PIO-based mechanism to access the current time is orders
>> of magnitude slower than the TSC and less precise too.
> 
> Yup.  But TSC by itself gets that benefit, too.

Yes, the problem is if you want to solve all three of them.  The first
two are solved by the ACPI PM timer with a decent resolution (70
ns---much faster anyway than an I/O port access).  The third is solved
by TSC.  To solve all three, you need kvmclock.

>>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>>> start over.  A correctly functioning KVM guest using TSC (i.e.
>>> ignoring kvmclock entirely) seems to work rather more reliably and
>>> considerably faster than a kvmclock guest.
>>
>> If all your hosts have a working TSC and you don't do migration or
>> save/restore, that's a valid configuration.  It's not a good default,
>> however.
> 
> Er?
> 
> kvmclock is still really quite slow and buggy.

Unless it takes 3-4000 clock cycles for a gettimeofday, which it
shouldn't even with vdso disabled, it's definitely not slower than PIO.

> And the patch I identified is definitely a problem here:
> 
> [  136.131241] KVM: disabling fast timing permanently due to inability
> to recover from suspend
> 
> I got that on the host with this whitespace-damaged patch:
> 
> if (backwards_tsc) {
> u64 delta_cyc = max_tsc - local_tsc;
> +   if (!backwards_tsc_observed)
> +   pr_warn("KVM: disabling fast timing
> permanently due to inability to recover from suspend\n");
> 
> when I suspended and resumed.
> 
> Can anyone explain what problem
> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
> brief inspection, it just seems to be incorrect.  Shouldn't KVM's
> normal TSC logic handle that case right?  After all, all vcpus should
> be paused when we resume from suspend.  At worst, we should just need
> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
> shouldn't we do that regardless of which way the TSC jumped on
> suspend/resume?  After all, the jTSC-to-wall-clock offset is quite
> likely to change except on the very small handful of CPUs (if any)
> that keep the TSC running in S3 and hibernate.)

I don't recall the details of that patch, so Marcelo will have to answer
this, or Alex too since he chimed in the original thread.  At least it
should be made conditional on the existence of a VM at suspend time (and
the master clock stuff should be made per VM, as I suggested at
https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).

It would indeed be great if the master clock could be dropped.  But I'm
definitely missing some of the subtle details. :(

Paolo


Re: live migration vs device assignment (motivation)

2015-12-09 Thread Michael S. Tsirkin
On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:
> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >I thought about what this is doing at the high level, and I do have some
> >value in what you are trying to do, but I also think we need to clarify
> >the motivation a bit more.  What you are saying is not really what the
> >patches are doing.
> >
> >And with that clearer understanding of the motivation in mind (assuming
> >it actually captures a real need), I would also like to suggest some
> >changes.
> 
> Motivation:
> Most current solutions for migration with a passthrough device are based on
> PCI hotplug, but it has side effects and can't work for all devices.
> 
> For NIC device:
> The PCI hotplug solution can work around network device migration
> by switching between the VF and the PF.

This is just more confusion. Hotplug is just a way to add and remove
devices. Switching between the VF and PF is up to the guest and hypervisor.

> But switching the network interface introduces service downtime.
> 
> I measured the service downtime by putting the VF and PV interfaces
> into a bonded interface and pinging the bonded interface while
> plugging and unplugging the VF.
> 1) About 100ms when add VF
> 2) About 30ms when del VF

OK and what's the source of the downtime?
I'm guessing that's just ARP being repopulated.  So simply save and
re-populate it.

There would be a much cleaner solution.

Or maybe there's a timer there that just delays hotplug
for no reason. Fix it, everyone will benefit.

> It also requires the guest to do switch configuration.

That's just wrong. If you want a switch, you need to
configure a switch.

> These are hard for
> our customers to manage and deploy.

The kernel wants to remain flexible, and the stack is
configurable. Downside: customers need to deploy userspace
to configure it. Your solution: a hard-coded configuration
within the kernel and hypervisor.  Sorry, this makes no sense.
If the kernel is easier for you to deploy than userspace,
you need to rethink your deployment strategy.

> To maintain PV performance during
> migration, the host side also needs to assign a VF to the PV device. This
> affects scalability.

No idea what this means.

> These factors block SR-IOV NIC passthrough usage in cloud services and
> OPNFV, which require high network performance and stability.

Everyone needs performance and scalability.

> 
> For other kinds of devices, this approach is hard to make work.
> We are also adding migration support for the QAT (QuickAssist
> Technology) device.
> 
> QAT device use case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive operations,
> such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
> 
> PCI hotplug will not work for such devices during migration, and these
> operations will fail when the device is unplugged.
> 
> So we are trying to implement a new solution which really migrates
> device state to the target machine and won't affect the user during
> migration, with low service downtime.

Let's assume for the sake of the argument that there's a lot going on
and removing the device is just too slow (though you should figure out
what's going on before giving up and just building something new from
scratch).

I still don't think you should be migrating state.  That's just too
fragile, and it also means you depend on the driver to be nice and shut
down the device on the source, so you cannot migrate at will.  Instead,
reset the device on the destination and re-initialize it.

-- 
MST


[PATCH 1/5] x86/kvm: On KVM re-enable (e.g. after suspend), update clocks

2015-12-09 Thread Andy Lutomirski
This gets rid of the "did TSC go backwards" logic and just updates
all clocks.  It should work better (no more disabling of fast
timing) and more reliably (all of the clocks are actually updated).

Signed-off-by: Andy Lutomirski 
---
 arch/x86/kvm/x86.c | 75 +++---
 1 file changed, 3 insertions(+), 72 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eed32283d22c..c88f91f4b1a3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -123,8 +123,6 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 unsigned int __read_mostly lapic_timer_advance_ns = 0;
 module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
 
-static bool __read_mostly backwards_tsc_observed = false;
-
 #define KVM_NR_SHARED_MSRS 16
 
 struct kvm_shared_msrs_global {
@@ -1671,7 +1669,6 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
					&ka->master_cycle_now);
 
ka->use_master_clock = host_tsc_clocksource && vcpus_matched
-   && !backwards_tsc_observed
&& !ka->boot_vcpu_runs_old_kvmclock;
 
if (ka->use_master_clock)
@@ -7369,88 +7366,22 @@ int kvm_arch_hardware_enable(void)
struct kvm_vcpu *vcpu;
int i;
int ret;
-   u64 local_tsc;
-   u64 max_tsc = 0;
-   bool stable, backwards_tsc = false;
 
kvm_shared_msr_cpu_online();
ret = kvm_x86_ops->hardware_enable();
if (ret != 0)
return ret;
 
-   local_tsc = rdtsc();
-   stable = !check_tsc_unstable();
	list_for_each_entry(kvm, &vm_list, vm_list) {
kvm_for_each_vcpu(i, vcpu, kvm) {
-   if (!stable && vcpu->cpu == smp_processor_id())
+   if (vcpu->cpu == smp_processor_id()) {
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-   if (stable && vcpu->arch.last_host_tsc > local_tsc) {
-   backwards_tsc = true;
-   if (vcpu->arch.last_host_tsc > max_tsc)
-   max_tsc = vcpu->arch.last_host_tsc;
+   kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE,
+vcpu);
}
}
}
 
-   /*
-* Sometimes, even reliable TSCs go backwards.  This happens on
-* platforms that reset TSC during suspend or hibernate actions, but
-* maintain synchronization.  We must compensate.  Fortunately, we can
-* detect that condition here, which happens early in CPU bringup,
-* before any KVM threads can be running.  Unfortunately, we can't
-* bring the TSCs fully up to date with real time, as we aren't yet far
-* enough into CPU bringup that we know how much real time has actually
-* elapsed; our helper function, get_kernel_ns() will be using boot
-* variables that haven't been updated yet.
-*
-* So we simply find the maximum observed TSC above, then record the
-* adjustment to TSC in each VCPU.  When the VCPU later gets loaded,
-* the adjustment will be applied.  Note that we accumulate
-* adjustments, in case multiple suspend cycles happen before some VCPU
-* gets a chance to run again.  In the event that no KVM threads get a
-* chance to run, we will miss the entire elapsed period, as we'll have
-* reset last_host_tsc, so VCPUs will not have the TSC adjusted and may
-* loose cycle time.  This isn't too big a deal, since the loss will be
-* uniform across all VCPUs (not to mention the scenario is extremely
-* unlikely). It is possible that a second hibernate recovery happens
-* much faster than a first, causing the observed TSC here to be
-* smaller; this would require additional padding adjustment, which is
-* why we set last_host_tsc to the local tsc observed here.
-*
-* N.B. - this code below runs only on platforms with reliable TSC,
-* as that is the only way backwards_tsc is set above.  Also note
-* that this runs for ALL vcpus, which is not a bug; all VCPUs should
-* have the same delta_cyc adjustment applied if backwards_tsc
-* is detected.  Note further, this adjustment is only done once,
-* as we reset last_host_tsc on all VCPUs to stop this from being
-* called multiple times (one for each physical CPU bringup).
-*
-* Platforms with unreliable TSCs don't have to deal with this, they
-* will be compensated by the logic in vcpu_load, which sets the TSC to
-* catchup mode.  This will catchup all VCPUs to real time, but cannot
-* guarantee that they stay in perfect synchronization.
-*/
-   if (backwards_tsc) {
-

[PATCH 2/5] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-12-09 Thread Andy Lutomirski
From: Andy Lutomirski 

The pvclock vdso code was too abstracted to understand easily and
excessively paranoid.  Simplify it for a huge speedup.

This opens the door for additional simplifications, as the vdso no
longer accesses the pvti for any vcpu other than vcpu 0.

Before, vclock_gettime using kvm-clock took about 45ns on my machine.
With this change, it takes 29ns, which is almost as fast as the pure TSC
implementation.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/vdso/vclock_gettime.c | 81 
 1 file changed, 46 insertions(+), 35 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index ca94fa649251..c325ba1bdddf 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info 
*get_pvti(int cpu)
 
 static notrace cycle_t vread_pvclock(int *mode)
 {
-   const struct pvclock_vsyscall_time_info *pvti;
+   const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
cycle_t ret;
-   u64 last;
-   u32 version;
-   u8 flags;
-   unsigned cpu, cpu1;
-
+   u64 tsc, pvti_tsc;
+   u64 last, delta, pvti_system_time;
+   u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
 
/*
-* Note: hypervisor must guarantee that:
-* 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
-* 2. that per-CPU pvclock time info is updated if the
-*underlying CPU changes.
-* 3. that version is increased whenever underlying CPU
-*changes.
+* Note: The kernel and hypervisor must guarantee that cpu ID
+* number maps 1:1 to per-CPU pvclock time info.
+*
+* Because the hypervisor is entirely unaware of guest userspace
+* preemption, it cannot guarantee that per-CPU pvclock time
+* info is updated if the underlying CPU changes or that that
+* version is increased whenever underlying CPU changes.
 *
+* On KVM, we are guaranteed that pvti updates for any vCPU are
+* atomic as seen by *all* vCPUs.  This is an even stronger
+* guarantee than we get with a normal seqlock.
+*
+* On Xen, we don't appear to have that guarantee, but Xen still
+* supplies a valid seqlock using the version field.
+
+* We only do pvclock vdso timing at all if
+* PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
+* mean that all vCPUs have matching pvti and that the TSC is
+* synced, so we can just look at vCPU 0's pvti.
 */
-   do {
-   cpu = __getcpu() & VGETCPU_CPU_MASK;
-   /* TODO: We can put vcpu id into higher bits of pvti.version.
-* This will save a couple of cycles by getting rid of
-* __getcpu() calls (Gleb).
-*/
-
-   pvti = get_pvti(cpu);
-
-   version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
-
-   /*
-* Test we're still on the cpu as well as the version.
-* We could have been migrated just after the first
-* vgetcpu but before fetching the version, so we
-* wouldn't notice a version change.
-*/
-   cpu1 = __getcpu() & VGETCPU_CPU_MASK;
-   } while (unlikely(cpu != cpu1 ||
- (pvti->pvti.version & 1) ||
- pvti->pvti.version != version));
-
-   if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+
+   if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
*mode = VCLOCK_NONE;
+   return 0;
+   }
+
+   do {
+   version = pvti->version;
+
+   /* This is also a read barrier, so we'll read version first. */
+   tsc = rdtsc_ordered();
+
+   pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
+   pvti_tsc_shift = pvti->tsc_shift;
+   pvti_system_time = pvti->system_time;
+   pvti_tsc = pvti->tsc_timestamp;
+
+   /* Make sure that the version double-check is last. */
+   smp_rmb();
+   } while (unlikely((version & 1) || version != pvti->version));
+
+   delta = tsc - pvti_tsc;
+   ret = pvti_system_time +
+   pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
+   pvti_tsc_shift);
 
/* refer to tsc.c read_tsc() comment for rationale */
last = gtod->cycle_last;
-- 
2.5.0



[GIT PULL] VFIO fixes for v4.4-rc5

2015-12-09 Thread Alex Williamson
Hi Linus,

The following changes since commit 8005c49d9aea74d382f474ce11afbbc7d7130bec:

  Linux 4.4-rc1 (2015-11-15 17:00:27 -0800)

are available in the git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v4.4-rc5

for you to fetch changes up to ae5515d66362b9d96cdcfce504567f0b8b7bd83e:

  Revert: "vfio: Include No-IOMMU mode" (2015-12-04 08:38:42 -0700)


VFIO fixes for v4.4-rc5
 - Various fixes for removing redundancy, const'ifying structs,
   avoiding stack usage, fixing WARN usage (Krzysztof Kozlowski,
   Julia Lawall, Kees Cook, Dan Carpenter)
 - Revert No-IOMMU mode as the intended user has not emerged
   (Alex Williamson)


Alex Williamson (1):
  Revert: "vfio: Include No-IOMMU mode"

Dan Carpenter (1):
  vfio: fix a warning message

Julia Lawall (1):
  vfio-pci: constify pci_error_handlers structures

Kees Cook (1):
  vfio: platform: remove needless stack usage

Krzysztof Kozlowski (1):
  vfio: Drop owner assignment from platform_driver

 drivers/vfio/Kconfig |  15 ---
 drivers/vfio/pci/vfio_pci.c  |  10 +-
 drivers/vfio/platform/vfio_platform.c|   1 -
 drivers/vfio/platform/vfio_platform_common.c |   5 +-
 drivers/vfio/vfio.c  | 188 +--
 include/linux/vfio.h |   3 -
 include/uapi/linux/vfio.h|   7 -
 7 files changed, 13 insertions(+), 216 deletions(-)



RE: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-12-09 Thread Wu, Feng


> -Original Message-
> From: Radim Krčmář [mailto:rkrc...@redhat.com]
> Sent: Wednesday, December 9, 2015 10:54 PM
> To: Wu, Feng 
> Cc: pbonz...@redhat.com; kvm@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-
> interrupts
> 
> 2015-12-09 08:19+, Wu, Feng:
> >> -Original Message-
> >> From: Radim Krčmář [mailto:rkrc...@redhat.com]
> >> Sent: Tuesday, November 17, 2015 3:03 AM
> >> To: Wu, Feng 
> >> Cc: pbonz...@redhat.com; kvm@vger.kernel.org; linux-
> ker...@vger.kernel.org
> >> Subject: Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-
> >> interrupts
> >>
> >> 2015-11-09 10:46+0800, Feng Wu:
> >> > +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
> >> > +  struct kvm_lapic_irq *irq)
> >> > +
> >> > +{
> >> > +unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
> >> > +unsigned int dest_vcpus = 0;
> >> > +struct kvm_vcpu *vcpu;
> >> > +unsigned int i, mod, idx = 0;
> >> > +
> >> > +vcpu = kvm_intr_vector_hashing_dest_fast(kvm, irq);
> >> > +if (vcpu)
> >> > +return vcpu;
> >>
> >> I think the rest of this function shouldn't be implemented:
> >>  - Shorthands are only for IPIs and hence don't need to be handled,
> >>  - Lowest priority physical broadcast is not supported,
> >>  - Lowest priority cluster logical broadcast is not supported,
> >>  - No point in optimizing mixed xAPIC and x2APIC mode,
> >
> > I read your comments again, and don't quite understand why we
> > don't need PI optimization for mixed xAPIC and x2APIC mode.
> 
> There shouldn't be a non-hobbyist operating system that uses mixed mode,
> so the optimization would practically be dead code as all other cases
> are handled by kvm_intr_vector_hashing_dest_fast().

Thanks a lot for your elaboration!

> 
> I think that having extra code would bring problems in the future -- we
> need to take care of it when refactoring KVM's APIC and we should also
> write a unit-test for this otherwise dead path.  I don't think that the
> benefit for guests would ever balance those efforts.
> 
> (Physical xAPIC+x2APIC mode is still somewhat reasonable and xAPIC CPUs
>  start with LDR=0, which means that operating system doesn't need to
>  utilize mixed mode, as defined by KVM, when switching to x2APIC.)

I think you mean Physical xAPIC + Physical x2APIC mode, right? For physical
mode, we don't use the LDR in any case, do we? So in physical mode we only
use the APIC ID, which is why they can be mixed. Is my understanding correct?
Thanks a lot!

> 
> > BTW, can we have mixed flat and cluster mode?
> 
> Yes, KVM recognizes that mixed mode, but luckily, there are severe
> limitations.
> 
> Notes below SDM section 10.6.2.2:
>   All processors that have their APIC software enabled (using the
>   spurious vector enable/disable bit) must have their DFRs (Destination
>   Format Registers) programmed identically.

Thanks for pointing this out, good to know it!

> 
> I hope there isn't a human that would use it in good faith.
> 
> (Only NMI/SMI/INIT/SIPI are delivered in software disabled mode and if
>  the system uses cluster xAPIC, OS should set DFR before LDR, which
>  doesn't trigger mixed mode either.)

Just curious: if the APIC is software-disabled and in xAPIC mode, and the OS
sets different DFR values for different APICs, then when the OS sets the LDR,
KVM can trigger mixed flat and cluster mode, right?

Thanks,
Feng



[PATCH 0/5] x86: KVM vdso and clock improvements

2015-12-09 Thread Andy Lutomirski
NB: patch 1 doesn't really belong here, but it makes this a lot
easier for me to test.  Patch 1, if it's okay at all, should go
through the kvm tree.  The rest should probably go through
tip:x86/vdso once they're reviewed.

I'll do a followup to enable vdso pvclock on 32-bit guests.
I'm not currently set up to test it.  (The KVM people could also
do it very easily on top of these patches.)

Andy Lutomirski (5):
  x86/kvm: On KVM re-enable (e.g. after suspend), update clocks
  x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
  x86/vdso: Remove pvclock fixmap machinery
  x86/vdso: Enable vdso pvclock access on all vdso variants

 arch/x86/entry/vdso/vclock_gettime.c  | 151 --
 arch/x86/entry/vdso/vdso-layout.lds.S |   3 +-
 arch/x86/entry/vdso/vdso2c.c  |   3 +
 arch/x86/entry/vdso/vma.c |  14 
 arch/x86/include/asm/fixmap.h |   5 --
 arch/x86/include/asm/pvclock.h|  14 ++--
 arch/x86/include/asm/vdso.h   |   1 +
 arch/x86/kernel/kvmclock.c|  11 ++-
 arch/x86/kernel/pvclock.c |  24 --
 arch/x86/kvm/x86.c|  75 +
 10 files changed, 110 insertions(+), 191 deletions(-)

-- 
2.5.0



[PATCH 3/5] x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap

2015-12-09 Thread Andy Lutomirski
Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/vdso/vclock_gettime.c  | 20 
 arch/x86/entry/vdso/vdso-layout.lds.S |  3 ++-
 arch/x86/entry/vdso/vdso2c.c  |  3 +++
 arch/x86/entry/vdso/vma.c | 13 +
 arch/x86/include/asm/pvclock.h|  9 +
 arch/x86/include/asm/vdso.h   |  1 +
 arch/x86/kernel/kvmclock.c|  5 +
 7 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index c325ba1bdddf..5dd363d54348 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -36,6 +36,11 @@ static notrace cycle_t vread_hpet(void)
 }
 #endif
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+extern u8 pvclock_page
+   __attribute__((visibility("hidden")));
+#endif
+
 #ifndef BUILD_VDSO32
 
 #include 
@@ -62,23 +67,14 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, 
struct timezone *tz)
 
 #ifdef CONFIG_PARAVIRT_CLOCK
 
-static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
+static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
 {
-   const struct pvclock_vsyscall_time_info *pvti_base;
-   int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
-   int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
-
-   BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
-
-   pvti_base = (struct pvclock_vsyscall_time_info *)
-   __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
-
-   return &pvti_base[offset];
+   return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
 }
 
 static notrace cycle_t vread_pvclock(int *mode)
 {
-   const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
+   const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
cycle_t ret;
u64 tsc, pvti_tsc;
u64 last, delta, pvti_system_time;
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S 
b/arch/x86/entry/vdso/vdso-layout.lds.S
index de2c921025f5..4158acc17df0 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -25,7 +25,7 @@ SECTIONS
 * segment.
 */
 
-   vvar_start = . - 2 * PAGE_SIZE;
+   vvar_start = . - 3 * PAGE_SIZE;
vvar_page = vvar_start;
 
/* Place all vvars at the offsets in asm/vvar.h. */
@@ -36,6 +36,7 @@ SECTIONS
 #undef EMIT_VVAR
 
hpet_page = vvar_start + PAGE_SIZE;
+   pvclock_page = vvar_start + 2 * PAGE_SIZE;
 
. = SIZEOF_HEADERS;
 
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 785d9922b106..491020b2826d 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -73,6 +73,7 @@ enum {
sym_vvar_start,
sym_vvar_page,
sym_hpet_page,
+   sym_pvclock_page,
sym_VDSO_FAKE_SECTION_TABLE_START,
sym_VDSO_FAKE_SECTION_TABLE_END,
 };
@@ -80,6 +81,7 @@ enum {
 const int special_pages[] = {
sym_vvar_page,
sym_hpet_page,
+   sym_pvclock_page,
 };
 
 struct vdso_sym {
@@ -91,6 +93,7 @@ struct vdso_sym required_syms[] = {
[sym_vvar_start] = {"vvar_start", true},
[sym_vvar_page] = {"vvar_page", true},
[sym_hpet_page] = {"hpet_page", true},
+   [sym_pvclock_page] = {"pvclock_page", true},
[sym_VDSO_FAKE_SECTION_TABLE_START] = {
"VDSO_FAKE_SECTION_TABLE_START", false
},
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 64df47148160..aa828191c654 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -100,6 +100,7 @@ static int map_vdso(const struct vdso_image *image, bool 
calculate_addr)
.name = "[vvar]",
.pages = no_pages,
};
+   struct pvclock_vsyscall_time_info *pvti;
 
if (calculate_addr) {
addr = vdso_addr(current->mm->start_stack,
@@ -169,6 +170,18 @@ static int map_vdso(const struct vdso_image *image, bool 
calculate_addr)
}
 #endif
 
+   pvti = pvclock_pvti_cpu0_va();
+   if (pvti && image->sym_pvclock_page) {
+   ret = remap_pfn_range(vma,
+ text_start + image->sym_pvclock_page,
+ __pa(pvti) >> PAGE_SHIFT,
+ PAGE_SIZE,
+ PAGE_READONLY);
+
+   if (ret)
+   goto up_fail;
+   }
+
 up_fail:
if (ret)
current->mm->context.vdso = NULL;
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 7a6bed5c08bc..3864398c7cb2 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,6 +4,15 @@
 #include 
 #include 
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
+#else
+static inline struct pvclock_vsyscall_time_info 

Re: [PATCH] KVM: PPC: Increase memslots to 512

2015-12-09 Thread Paul Mackerras
On Wed, Dec 09, 2015 at 11:34:07AM +0100, Thomas Huth wrote:
> Only using 32 memslots for KVM on powerpc is way too low, you can
> nowadays hit this limit quite fast by adding a couple of PCI devices
> and/or pluggable memory DIMMs to the guest.
> 
> x86 already increased the KVM_USER_MEM_SLOTS to 509, to satisfy 256
> pluggable DIMM slots, 3 private slots and 253 slots for other things
> like PCI devices (i.e. resulting in 256 + 3 + 253 = 512 slots in
> total). We should do something similar for powerpc, and since we do
> not use private slots here, we can set the value to 512 directly.
> 
> While we're at it, also remove the KVM_MEM_SLOTS_NUM definition
> from the powerpc-specific header since this gets defined in the
> generic kvm_host.h header anyway.
> 
> Signed-off-by: Thomas Huth 

Thanks, applied to my kvm-ppc-next branch.

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8

2015-12-09 Thread Paul Mackerras
On Fri, Nov 20, 2015 at 09:11:45AM +0100, Thomas Huth wrote:
> In the old DABR register, the BT (Breakpoint Translation) bit
> is bit number 61. In the new DAWRX register, the WT (Watchpoint
> Translation) bit is bit number 59. So to move the DABR-BT bit
> into the position of the DAWRX-WT bit, it has to be shifted by
> two, not only by one. This fixes hardware watchpoints in gdb of
> older guests that only use the H_SET_DABR/X interface instead
> of the new H_SET_MODE interface.
> 
> Signed-off-by: Thomas Huth 

Thanks, applied to my kvm-ppc-next branch, with cc: sta...@vger.kernel.org.

Paul.


Re: [PATCH] kvm: remove unused variable 'vcpu_book3s'

2015-12-09 Thread Paul Mackerras
On Tue, Dec 01, 2015 at 08:42:10PM -0300, Geyslan G. Bem wrote:
> The vcpu_book3s struct is assigned but never used. So remove it.
> 
> Signed-off-by: Geyslan G. Bem 

Thanks, applied to my kvm-ppc-next branch.

Paul.


[PATCH 4/5] x86/vdso: Remove pvclock fixmap machinery

2015-12-09 Thread Andy Lutomirski
Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/vdso/vclock_gettime.c |  1 -
 arch/x86/entry/vdso/vma.c|  1 +
 arch/x86/include/asm/fixmap.h|  5 -
 arch/x86/include/asm/pvclock.h   |  5 -
 arch/x86/kernel/kvmclock.c   |  6 --
 arch/x86/kernel/pvclock.c| 24 
 6 files changed, 1 insertion(+), 41 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 5dd363d54348..59a98c25bde7 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -45,7 +45,6 @@ extern u8 pvclock_page
 
 #include 
 #include 
-#include 
 #include 
 
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index aa828191c654..b8f69e264ac4 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index f80d70009ff8..6d7d0e52ed5a 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -19,7 +19,6 @@
 #include 
 #include 
 #include 
-#include 
 #ifdef CONFIG_X86_32
 #include 
 #include 
@@ -72,10 +71,6 @@ enum fixed_addresses {
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
 #endif
-#ifdef CONFIG_PARAVIRT_CLOCK
-   PVCLOCK_FIXMAP_BEGIN,
-   PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
-#endif
 #endif
FIX_DBGP_BASE,
FIX_EARLYCON_MEM_BASE,
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 3864398c7cb2..66df22b2e0c9 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -100,10 +100,5 @@ struct pvclock_vsyscall_time_info {
 } __attribute__((__aligned__(SMP_CACHE_BYTES)));
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
-#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1)
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
-int size);
-struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
 
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ec1b06dc82d2..72cef58693c7 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -310,7 +310,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 {
 #ifdef CONFIG_X86_64
int cpu;
-   int ret;
u8 flags;
struct pvclock_vcpu_time_info *vcpu_time;
unsigned int size;
@@ -330,11 +329,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
return 1;
}
 
-   if ((ret = pvclock_init_vsyscall(hv_clock, size))) {
-   put_cpu();
-   return ret;
-   }
-
put_cpu();
 
kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 2f355d229a58..99bfc025111d 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -140,27 +140,3 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
*wall_clock,
 
set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
-
-#ifdef CONFIG_X86_64
-/*
- * Initialize the generic pvclock vsyscall state.  This will allocate
- * a/some page(s) for the per-vcpu pvclock information, set up a
- * fixmap mapping for the page(s)
- */
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
-int size)
-{
-   int idx;
-
-   WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
-
-   for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
-   __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
-__pa(i) + (idx*PAGE_SIZE),
-PAGE_KERNEL_VVAR);
-   }
-
-   return 0;
-}
-#endif
-- 
2.5.0



[PATCH 5/5] x86/vdso: Enable vdso pvclock access on all vdso variants

2015-12-09 Thread Andy Lutomirski
Now that pvclock doesn't require access to the fixmap, all vdso
variants can use it.

The kernel side isn't wired up for 32-bit kernels yet, but this
covers 32-bit and x32 userspace on 64-bit kernels.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/vdso/vclock_gettime.c | 91 
 1 file changed, 40 insertions(+), 51 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 59a98c25bde7..8602f06c759f 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -17,8 +17,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 
#define gtod (&VVAR(vsyscall_gtod_data))
 
@@ -43,10 +45,6 @@ extern u8 pvclock_page
 
 #ifndef BUILD_VDSO32
 
-#include 
-#include 
-#include 
-
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
long ret;
@@ -64,8 +62,42 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, 
struct timezone *tz)
return ret;
 }
 
-#ifdef CONFIG_PARAVIRT_CLOCK
 
+#else
+
+notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
+{
+   long ret;
+
+   asm(
+   "mov %%ebx, %%edx \n"
+   "mov %2, %%ebx \n"
+   "call __kernel_vsyscall \n"
+   "mov %%edx, %%ebx \n"
+   : "=a" (ret)
+   : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
+   : "memory", "edx");
+   return ret;
+}
+
+notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
+{
+   long ret;
+
+   asm(
+   "mov %%ebx, %%edx \n"
+   "mov %2, %%ebx \n"
+   "call __kernel_vsyscall \n"
+   "mov %%edx, %%ebx \n"
+   : "=a" (ret)
+   : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
+   : "memory", "edx");
+   return ret;
+}
+
+#endif
+
+#ifdef CONFIG_PARAVIRT_CLOCK
 static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
 {
	return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
@@ -109,9 +141,9 @@ static notrace cycle_t vread_pvclock(int *mode)
do {
version = pvti->version;
 
-   /* This is also a read barrier, so we'll read version first. */
-   tsc = rdtsc_ordered();
+   smp_rmb();
 
+   tsc = rdtsc_ordered();
pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
pvti_tsc_shift = pvti->tsc_shift;
pvti_system_time = pvti->system_time;
@@ -126,7 +158,7 @@ static notrace cycle_t vread_pvclock(int *mode)
pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
pvti_tsc_shift);
 
-   /* refer to tsc.c read_tsc() comment for rationale */
+   /* refer to vread_tsc() comment for rationale */
last = gtod->cycle_last;
 
if (likely(ret >= last))
@@ -136,49 +168,6 @@ static notrace cycle_t vread_pvclock(int *mode)
 }
 #endif
 
-#else
-
-notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
-{
-   long ret;
-
-   asm(
-   "mov %%ebx, %%edx \n"
-   "mov %2, %%ebx \n"
-   "call __kernel_vsyscall \n"
-   "mov %%edx, %%ebx \n"
-   : "=a" (ret)
-   : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
-   : "memory", "edx");
-   return ret;
-}
-
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-   long ret;
-
-   asm(
-   "mov %%ebx, %%edx \n"
-   "mov %2, %%ebx \n"
-   "call __kernel_vsyscall \n"
-   "mov %%edx, %%ebx \n"
-   : "=a" (ret)
-   : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
-   : "memory", "edx");
-   return ret;
-}
-
-#ifdef CONFIG_PARAVIRT_CLOCK
-
-static notrace cycle_t vread_pvclock(int *mode)
-{
-   *mode = VCLOCK_NONE;
-   return 0;
-}
-#endif
-
-#endif
-
 notrace static cycle_t vread_tsc(void)
 {
cycle_t ret = (cycle_t)rdtsc_ordered();
-- 
2.5.0



Re: live migration vs device assignment (motivation)

2015-12-09 Thread Lan, Tianyu



On 12/10/2015 1:14 AM, Alexander Duyck wrote:

On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu  wrote:


For other kinds of devices, this approach is hard to make work.
We are also adding migration support for the QAT (QuickAssist Technology)
device.

QAT device user case introduction.
Server, networking, big data, and storage applications use QuickAssist
Technology to offload servers from handling compute-intensive operations,
such as:
1) Symmetric cryptography functions including cipher operations and
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
cryptography
3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration and these
operations will fail when unplug device.


I assume the problem is that with a PCI hotplug event you are losing
the state information for the device, do I have that right?

Looking over the QAT drivers it doesn't seem like any of them support
the suspend/resume PM calls.  I would imagine it makes it difficult
for a system with a QAT card in it to be able to drop the system to a
low power state.  You might want to try enabling suspend/resume
support for the devices on bare metal before you attempt to take on
migration as it would provide you with a good testing framework to see
what needs to be saved/restored within the device and in what order
before you attempt to do the same while migrating from one system to
another.


Sure. The suspend/resume job is under way.
Actually, we have enabled QAT migration internally; we are doing more
testing and fixing bugs.




- Alex




[GIT PULL] Please pull my kvm-ppc-fixes branch

2015-12-09 Thread Paul Mackerras
Hi Paolo,

I have a small patch that I would like to get into 4.4 because it
fixes a bug which for certain kernel configs allows userspace to crash
the kernel.  The configs are those for which KVM_BOOK3S_64_HV is set
(y or m) and KVM_BOOK3S_64_PR is not.  Fortunately most distros that
enable KVM_BOOK3S_64_HV also enable KVM_BOOK3S_64_PR, as far as I can
tell.

Thanks,
Paul.

The following changes since commit 09922076003ad66de41ea14d2f8c3b4a16ec7774:

  Merge tag 'kvm-arm-for-v4.4-rc4' of 
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master 
(2015-12-04 18:32:32 +0100)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc.git kvm-ppc-fixes

for you to fetch changes up to c20875a3e638e4a03e099b343ec798edd1af5cc6:

  KVM: PPC: Book3S HV: Prohibit setting illegal transaction state in MSR 
(2015-12-10 11:34:27 +1100)


Paul Mackerras (1):
  KVM: PPC: Book3S HV: Prohibit setting illegal transaction state in MSR

 arch/powerpc/kvm/book3s_hv.c | 6 ++
 1 file changed, 6 insertions(+)



Re: [PATCH v9 0/5] implement vNVDIMM

2015-12-09 Thread Xiao Guangrong


New version, new week, and unfortunate new ping... :(


On 12/02/2015 03:20 PM, Xiao Guangrong wrote:

This patchset can be found at:
   https://github.com/xiaogr/qemu.git nvdimm-v9

It is based on pci branch on Michael's tree and the top commit is:
commit 0c73277af7 (vhost-user-test: fix crash with glib < 2.36).

Changelog in v9:
- the changes address Michael's comments:
   1) move the control parameter to -machine and it is off on default, then
  it can be enabled by, for example, -machine pc,nvdimm
   2) introduce a macro to define "NCAL"
   3) abstract the function, nvdimm_build_device_dsm(), to clean up the
  code
   4) adjust the code style of dsm method
   5) add spec reference in the code comment

other:
   pick up Stefan's Reviewed-by

Changelog in v8:
We split the long patch series into the small parts, as you see now, this
is the first part which enables NVDIMM without label data support.

The command line has been changed because some patches simplifying the
things have not been included into this series, you should specify the
file size exactly using the parameters as follows:
memory-backend-file,id=mem1,share,mem-path=/tmp/nvdimm1,size=10G \
-device nvdimm,memdev=mem1,id=nv1

Changelog in v7:
- changes from Vladimir Sementsov-Ogievskiy's comments:
   1) let gethugepagesize() realize if fstat is failed instead of get
  normal page size
   2) rename  open_file_path to open_ram_file_path
   3) better log the error message by using error_setg_errno
   4) update commit in the commit log to explain hugepage detection on
  Windows

- changes from Eduardo Habkost's comments:
   1) use 'Error**' to collect error message for qemu_file_get_page_size()
   2) move gethugepagesize() replacement to the same patch to make it
  better for review
   3) introduce qemu_get_file_size to unity the code with raw_getlength()

- changes from Stefan's comments:
   1) check the memory region is large enough to contain DSM output
  buffer

- changes from Eric Blake's comments:
   1) update the shell command in the commit log to generate the patch
  which drops 'pc-dimm' prefix

- others:
   pick up Reviewed-by from Stefan, Vladimir Sementsov-Ogievskiy, and
   Eric Blake.

Changelog in v6:
- changes from Stefan's comments:
   1) fix code style of struct naming by CamelCase way
   2) fix offset + length overflow when read/write label data
   3) compile hw/acpi/nvdimm.c for per target so that TARGET_PAGE_SIZE can
  be used to replace getpagesize()

Changelog in v5:
- changes from Michael's comments:
   1) prefix nvdimm_ to everything in NVDIMM source files
   2) make parsing _DSM Arg3 more clear
   3) comment style fix
   5) drop single used definition
   6) fix dirty dsm buffer lost due to memory write happened on host
   7) check dsm buffer if it is big enough to contain input data
   8) use build_append_int_noprefix to store single value to GArray

- changes from Michael's and Igor's comments:
   1) introduce 'nvdimm-support' parameter to control nvdimm
  enablement and it is disabled for 2.4 and its earlier versions
  to make live migration compatible
   2) only reserve 1 RAM page and 4 bytes IO Port for NVDIMM ACPI
  virtualization

- changes from Stefan's comments:
   1) do endian adjustment for the buffer length

- changes from Bharata B Rao's comments:
   1) fix compile on ppc

- others:
   1) the buffer length is directly got from IO read rather than got
  from dsm memory
   2) fix dirty label data lost due to memory write happened on host

Changelog in v4:
- changes from Michael's comments:
   1) show the message, "Memory is not allocated from HugeTlbfs", if file
  based memory is not allocated from hugetlbfs.
   2) introduce function, acpi_get_nvdimm_state(), to get NVDIMMState
  from Machine.
   3) statically define UUID and make its operation more clear
   4) use GArray to build device structures to avoid potential buffer
  overflow
   4) improve comments in the code
   5) improve code style

- changes from Igor's comments:
   1) add NVDIMM ACPI spec document
   2) use serialized method to avoid Mutex
   3) move NVDIMM ACPI's code to hw/acpi/nvdimm.c
   4) introduce a common ASL method used by _DSM for all devices to reduce
  ACPI size
   5) handle UUID in ACPI AML code. BTW, i'd keep handling revision in QEMU
  it's better to upgrade QEMU to support Rev2 in the future

- changes from Stefan's comments:
   1) copy input data from DSM memory to local buffer to avoid potential
  issues as DSM memory is visible to guest. Output data is handled
  in a similar way

- changes from Dan's comments:
   1) drop static namespace as Linux has already supported label-less
  nvdimm devices

- changes from Vladimir's comments:
   1) print better message, "failed to get file size for %s, can't create
  backend on it", if any file operation filed to obtain file size

- others:
   create a git repo on github.com for better 

Re: live migration vs device assignment (motivation)

2015-12-09 Thread Lan, Tianyu


On 12/10/2015 4:07 AM, Michael S. Tsirkin wrote:

On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:

On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:

I thought about what this is doing at the high level, and I do have some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.


Motivation:
Most current solutions for migration with a passthrough device are based
on PCI hotplug, but that has side effects and can't work for all devices.

For NIC device:
The PCI hotplug solution can work around network device migration
by switching between the VF and PF.


This is just more confusion. hotplug is just a way to add and remove
devices. switching VF and PF is up to guest and hypervisor.


This is a combination. Because device state can't be migrated in the
current world (which is what we are working on), existing solutions for
migrating a VM with a passthrough NIC rely on PCI hotplug: unplug the VF
before starting migration and switch the network from the VF NIC to a PV
NIC to maintain the network connection, then plug the VF again after
migration and switch from PV back to VF. The bond driver provides a way
to switch between the PV and VF NICs automatically while keeping the
same IP and MAC, so it is the preferred mechanism.





But switching network interface will introduce service down time.

I tested the service down time via putting VF and PV interface
into a bonded interface and ping the bonded interface during plug
and unplug VF.
1) About 100ms when add VF
2) About 30ms when del VF


OK and what's the source of the downtime?
I'm guessing that's just arp being repopulated.  So simply save and
re-populate it.

There would be a much cleaner solution.

Or maybe there's a timer there that just delays hotplug
for no reason. Fix it, everyone will benefit.


It also requires guest to do switch configuration.


That's just wrong. if you want a switch, you need to
configure a switch.


I meant the configuration of the switching operation between PV and VF.




These are hard for our customers to
manage and deploy.


So kernel want to remain flexible, and the stack is
configurable. Downside: customers need to deploy userspace
to configure it. Your solution: a hard-coded configuration
within kernel and hypervisor.  Sorry, this makes no sense.
If kernel is easier for you to deploy than userspace,
you need to rethink your deployment strategy.


This is one factor.




To maintain PV performance during
migration, the host side also needs to assign a VF to the PV device.
This affects scalability.


No idea what this means.


These factors block SR-IOV NIC passthrough usage in cloud services and
OPNFV, which require high network performance and stability.


Everyone needs performance and scalability.



For other kinds of devices, this approach is hard to make work.
We are also adding migration support for the QAT (QuickAssist Technology)
device.

QAT device user case introduction.
Server, networking, big data, and storage applications use QuickAssist
Technology to offload servers from handling compute-intensive operations,
such as:
1) Symmetric cryptography functions including cipher operations and
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
cryptography
3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration and these
operations will fail when unplug device.

So we are trying to implement a new solution which really migrates
device state to the target machine, with low service downtime so that
users are not affected during migration.


Let's assume for the sake of the argument that there's a lot going on
and removing the device is just too slow (though you should figure out
what's going on before giving up and just building something new from
scratch).


No. We can use a PV NIC as a backup for the VF NIC during migration, but
that doesn't work for other kinds of devices, since there is no backup
for them. E.g. when migration happens while users are compressing files
via QAT, it's impossible to remove the QAT device at that point. If we
did, the compression operation would fail and hurt the user experience.




I still don't think you should be migrating state.  That's just too
fragile, and it also means you depend on driver to be nice and shut down
device on source, so you can not migrate at will.  Instead, reset device
on destination and re-initialize it.



Yes, saving and restoring device state relies on the driver, so we
rework the driver to make it more migration-friendly.





Re: [PATCH] vgaarb: fix signal handling in vga_get()

2015-12-09 Thread Kirill A. Shutemov
On Mon, Nov 30, 2015 at 04:17:31AM +0200, Kirill A. Shutemov wrote:
> There are a few defects in vga_get() related to signal handling:
> 
>   - we shouldn't check for pending signals for TASK_UNINTERRUPTIBLE
> case;
> 
>   - if we find a pending signal we must remove ourselves from the wait
> queue and change the task state back to running;
> 
>   - -ERESTARTSYS is more appropriate, I guess.
> 
> Signed-off-by: Kirill A. Shutemov 

Ping?

David, this patch fixes crash. It would be nice to get it into 4.4.

> ---
> 
> Alex, I'm trying to get KVM with VGA passthrough working properly. I have
> i915 (HD 4600) on the host and a GTX 580 for the guest. The guest GPU is
> not capable of EFI, so I have to use x-vga=on. It kind of works with your
> patch for i915.enable_hd_vgaarb=1. But the guest refuses to initialize the
> GPU after KVM was not shut down correctly, resulting in a host crash like
> this:
> 
> BUG: unable to handle kernel paging request at 880870187ed8
> IP: [] 0x880870187ed8
> PGD 2129067 PUD 800841e3
> Oops: 0011 [#1] PREEMPT SMP
> Modules linked in: iwlmvm iwlwifi
> CPU: 6 PID: 3983 Comm: qemu-system-x86 Not tainted 4.3.0-gentoo #6
> Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD7 TH/Z87X-UD7 TH-CF, BIOS 
> F5a 06/12/2014
> task: 88087a91 ti: 8808632c task.ti: 8808632c
> RIP: 0010:[]  [] 0x880870187ed8
> RSP: 0018:8808632c3d08  EFLAGS: 00010006
> RAX: 880870187db0 RBX: 70187f58 RCX: 
> RDX:  RSI: 0003 RDI: 880870187db0
> RBP: 8808632c3d48 R08:  R09: 
> R10: 000103c0 R11: 0293 R12: 81ea03c8
> R13: 8104c7cb R14:  R15: 0003
> FS:  7f984f9b2700() GS:88089f38() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 880870187ed8 CR3: 0008645f8000 CR4: 001426e0
> Stack:
>  810cc83d 632c3d28  81ea03c0
>  0046 0003  
>  8808632c3d80 810cca44 88087af63800 0286
> Call Trace:
>  [] ? __wake_up_common+0x4d/0x80
>  [] __wake_up+0x34/0x50
>  [] __vga_put+0x73/0xd0
>  [] vga_put+0x54/0x80
>  [] vfio_pci_vga_rw+0x1d2/0x220
>  [] vfio_pci_rw+0x33/0x60
>  [] vfio_pci_write+0x17/0x20
>  [] vfio_device_fops_write+0x26/0x30
>  [] __vfs_write+0x23/0xe0
>  [] ? __vfs_read+0x23/0xd0
>  [] ? do_vfs_ioctl+0x2b5/0x490
>  [] vfs_write+0xa4/0x190
>  [] SyS_pwrite64+0x66/0xa0
>  [] entry_SYSCALL_64_fastpath+0x12/0x6a
> Code: 88 ff ff e0 7e 18 70 08 88 ff ff 00 8c 57 76 08 88 ff ff 20 7f 18 70 08 
> 88 ff ff 08 7f 18 70 08 88 ff ff 94 51 1a 81 ff ff ff ff <09> 00 00 00 00 00 
> 00 00 01 8c 57 76 08 88 ff ff 00 8c 57 76 08
> RIP  [] 0x880870187ed8
>  RSP 
> CR2: 880870187ed8
> 
> The patch fixes the crash, but doesn't help with getting the GPU in the
> guest working again.
> 
> Any ideas?
> 
> ---
>  drivers/gpu/vga/vgaarb.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
> index 3166e4bc4eb6..9abcaa53bd25 100644
> --- a/drivers/gpu/vga/vgaarb.c
> +++ b/drivers/gpu/vga/vgaarb.c
> @@ -395,8 +395,10 @@ int vga_get(struct pci_dev *pdev, unsigned int rsrc, int 
> interruptible)
>   set_current_state(interruptible ?
> TASK_INTERRUPTIBLE :
> TASK_UNINTERRUPTIBLE);
> - if (signal_pending(current)) {
> - rc = -EINTR;
> + if (interruptible && signal_pending(current)) {
> + __set_current_state(TASK_RUNNING);
> + remove_wait_queue(_wait_queue, );
> + rc = -ERESTARTSYS;
>   break;
>   }
>   schedule();
> -- 
> 2.6.3
> 

-- 
 Kirill A. Shutemov