Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-28 Thread Michael S. Tsirkin
On Fri, Oct 28, 2016 at 11:39:44AM +0200, Paolo Bonzini wrote:
> 
> 
> On 27/10/2016 19:06, Radim Krčmář wrote:
> > 2016-10-27 19:51+0300, Michael S. Tsirkin:
> >> On Thu, Oct 27, 2016 at 06:44:00PM +0200, Radim Krčmář wrote:
> >>> 2016-10-27 00:42+0300, Michael S. Tsirkin:
>  On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
> > 2016-10-14 20:21+0200, Paolo Bonzini:
> >> On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
> >> posted interrupts turn out to be slower than interrupt injection via
> >> KVM_REQ_EVENT.
> >>
> >> This patch optimizes a bit the IRR update, avoiding expensive atomic
> >> operations in the common case where PI.ON=0 at vmentry or the PIR vector
> >> is mostly zero.  This saves at least 20 cycles (1%) per vmexit, as
> >> measured by kvm-unit-tests' inl_from_qemu test (20 runs):
> >>
> >>       | enable_apicv=1  |  enable_apicv=0
> >>       | mean    stdev   |  mean    stdev
> >> ------|-----------------|-----------------
> >> before| 5826    32.65   |  5765    47.09
> >> after | 5809    43.42   |  5777    77.02
> >>
> >> Of course, any change in the right column is just placebo effect. :)
> >> The savings are bigger if interrupts are frequent.
> >>
> >> Signed-off-by: Paolo Bonzini 
> >> ---
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
> >>    (unsigned long *)&pi_desc->control);
> >>  }
> >>  
> >> +static inline void pi_clear_on(struct pi_desc *pi_desc)
> >> +{
> >> +  clear_bit(POSTED_INTR_ON,
> >> +    (unsigned long *)&pi_desc->control);
> >> +}
> >
> > We should add an explicit smp_mb__after_atomic() for extra correctness,
> > because clear_bit() does not guarantee a memory barrier and we must make
> > sure that pir reads can't be reordered before it.
> > x86 clear_bit() currently uses locked instruction, though.
> 
>  smp_mb__after_atomic is empty on x86 so it's
>  a documentation thing, not a correctness thing anyway.
> >>>
> >>> All atomics currently contain a barrier, but the code is also
> >>> future-proofing, not just documentation: implementation of clear_bit()
> >>> could drop the barrier and smp_mb__after_atomic() would then become a
> >>> real barrier.
> >>>
> >>> Adding dma_mb__after_atomic() would be even better as this bug could
> >>> happen even on a uniprocessor with an assigned device, but people who
> >>> buy a SMP chip to run a UP kernel deserve it.
> >>
> >> Not doing dma so does not seem to make sense ...
> > 
> > IOMMU does -- it writes to the PIR and sets ON asynchronously.
> 
> I can use either __smp_mb__after_atomic or virt_mb__after_atomic.  The
> difference is documentation only, since all of them are
> compiler-barriers only on x86.

A comment is also an option.

> Preferences?
> 
> Thanks,
> 
> Paolo

virt_ is for a VM guest. Pls don't use for host side code.
I thought it's clear enough but maybe I should add
more documentation.

> >> Why do you need a barrier on a UP kernel?
> > 
> > If pi_clear_on() doesn't contain a memory barrier (possible future),
> > then we have the following race: (pir[0] begins as 0.)
> > 
> > KVM                             |  IOMMU
> > --------------------------------+----------------
> > pir_val = ACCESS_ONCE(pir[0])   |
> >                                 | pir[0] = 123
> >                                 | pi_set_on()
> > pi_clear_on()                   |
> > if (pir_val)                    |
> > 
> > ACCESS_ONCE() does not prevent the CPU from prefetching pir[0]
> > (ACCESS_ONCE does nothing in this patch), so if there was 0 in pir[0]
> > before the IOMMU wrote to it, then our optimization to avoid the xchg
> > would yield a false negative and the interrupt would be lost.
> > 
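For reference, a minimal sketch of the ordering this sub-thread converges on,
assuming the eventual vmx.c caller shape and the barrier name discussed above
(illustrative, not the literal patch):

    static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
    {
            struct vcpu_vmx *vmx = to_vmx(vcpu);

            if (!pi_test_on(&vmx->pi_desc))
                    return;         /* common case: nothing posted */

            pi_clear_on(&vmx->pi_desc);
            /*
             * The ON clear must be ordered before the pir[] reads in
             * kvm_apic_update_irr(): the IOMMU writes pir[] first and
             * sets ON second.  On x86 clear_bit() is a LOCKed RMW, so
             * this is only a compiler barrier there, but it keeps the
             * code correct if clear_bit() ever loses its implicit
             * barrier.
             */
            smp_mb__after_atomic();
            kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
    }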




Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-28 Thread Paolo Bonzini


On 27/10/2016 19:06, Radim Krčmář wrote:
> 2016-10-27 19:51+0300, Michael S. Tsirkin:
>> On Thu, Oct 27, 2016 at 06:44:00PM +0200, Radim Krčmář wrote:
>>> 2016-10-27 00:42+0300, Michael S. Tsirkin:
 On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
> 2016-10-14 20:21+0200, Paolo Bonzini:
>> On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
>> posted interrupts turn out to be slower than interrupt injection via
>> KVM_REQ_EVENT.
>>
>> This patch optimizes a bit the IRR update, avoiding expensive atomic
>> operations in the common case where PI.ON=0 at vmentry or the PIR vector
>> is mostly zero.  This saves at least 20 cycles (1%) per vmexit, as
>> measured by kvm-unit-tests' inl_from_qemu test (20 runs):
>>
>>       | enable_apicv=1  |  enable_apicv=0
>>       | mean    stdev   |  mean    stdev
>> ------|-----------------|-----------------
>> before| 5826    32.65   |  5765    47.09
>> after | 5809    43.42   |  5777    77.02
>>
>> Of course, any change in the right column is just placebo effect. :)
>> The savings are bigger if interrupts are frequent.
>>
>> Signed-off-by: Paolo Bonzini 
>> ---
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
>>  (unsigned long *)&pi_desc->control);
>>  }
>>  
>> +static inline void pi_clear_on(struct pi_desc *pi_desc)
>> +{
>> +clear_bit(POSTED_INTR_ON,
>> +  (unsigned long *)&pi_desc->control);
>> +}
>
> We should add an explicit smp_mb__after_atomic() for extra correctness,
> because clear_bit() does not guarantee a memory barrier and we must make
> sure that pir reads can't be reordered before it.
> x86 clear_bit() currently uses locked instruction, though.

 smp_mb__after_atomic is empty on x86 so it's
 a documentation thing, not a correctness thing anyway.
>>>
>>> All atomics currently contain a barrier, but the code is also
>>> future-proofing, not just documentation: implementation of clear_bit()
>>> could drop the barrier and smp_mb__after_atomic() would then become a
>>> real barrier.
>>>
>>> Adding dma_mb__after_atomic() would be even better as this bug could
>>> happen even on a uniprocessor with an assigned device, but people who
>>> buy a SMP chip to run a UP kernel deserve it.
>>
>> Not doing dma so does not seem to make sense ...
> 
> IOMMU does -- it writes to the PIR and sets ON asynchronously.

I can use either __smp_mb__after_atomic or virt_mb__after_atomic.  The
difference is documentation only, since all of them are
compiler-barriers only on x86.

Preferences?

Thanks,

Paolo

>> Why do you need a barrier on a UP kernel?
> 
> If pi_clear_on() doesn't contain a memory barrier (possible future),
> then we have the following race: (pir[0] begins as 0.)
> 
> KVM                             |  IOMMU
> --------------------------------+----------------
> pir_val = ACCESS_ONCE(pir[0])   |
>                                 | pir[0] = 123
>                                 | pi_set_on()
> pi_clear_on()                   |
> if (pir_val)                    |
> 
> ACCESS_ONCE() does not prevent the CPU from prefetching pir[0]
> (ACCESS_ONCE does nothing in this patch), so if there was 0 in pir[0]
> before the IOMMU wrote to it, then our optimization to avoid the xchg
> would yield a false negative and the interrupt would be lost.
> 


Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-27 Thread Radim Krčmář
2016-10-27 19:51+0300, Michael S. Tsirkin:
> On Thu, Oct 27, 2016 at 06:44:00PM +0200, Radim Krčmář wrote:
>> 2016-10-27 00:42+0300, Michael S. Tsirkin:
>> > On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
>> >> 2016-10-14 20:21+0200, Paolo Bonzini:
>> >> > On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
>> >> > posted interrupts turn out to be slower than interrupt injection via
>> >> > KVM_REQ_EVENT.
>> >> > 
>> >> > This patch optimizes a bit the IRR update, avoiding expensive atomic
>> >> > operations in the common case where PI.ON=0 at vmentry or the PIR vector
>> >> > is mostly zero.  This saves at least 20 cycles (1%) per vmexit, as
>> >> > measured by kvm-unit-tests' inl_from_qemu test (20 runs):
>> >> > 
>> >> >       | enable_apicv=1  |  enable_apicv=0
>> >> >       | mean    stdev   |  mean    stdev
>> >> > ------|-----------------|-----------------
>> >> > before| 5826    32.65   |  5765    47.09
>> >> > after | 5809    43.42   |  5777    77.02
>> >> > 
>> >> > Of course, any change in the right column is just placebo effect. :)
>> >> > The savings are bigger if interrupts are frequent.
>> >> > 
>> >> > Signed-off-by: Paolo Bonzini 
>> >> > ---
>> >> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> >> > @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
>> >> >    (unsigned long *)&pi_desc->control);
>> >> >  }
>> >> >  
>> >> > +static inline void pi_clear_on(struct pi_desc *pi_desc)
>> >> > +{
>> >> > +   clear_bit(POSTED_INTR_ON,
>> >> > + (unsigned long *)&pi_desc->control);
>> >> > +}
>> >> 
>> >> We should add an explicit smp_mb__after_atomic() for extra correctness,
>> >> because clear_bit() does not guarantee a memory barrier and we must make
>> >> sure that pir reads can't be reordered before it.
>> >> x86 clear_bit() currently uses locked instruction, though.
>> > 
>> > smp_mb__after_atomic is empty on x86 so it's
>> > a documentation thing, not a correctness thing anyway.
>> 
>> All atomics currently contain a barrier, but the code is also
>> future-proofing, not just documentation: implementation of clear_bit()
>> could drop the barrier and smp_mb__after_atomic() would then become a
>> real barrier.
>> 
>> Adding dma_mb__after_atomic() would be even better as this bug could
>> happen even on a uniprocessor with an assigned device, but people who
>> buy a SMP chip to run a UP kernel deserve it.
> 
> Not doing dma so does not seem to make sense ...

IOMMU does -- it writes to the PIR and sets ON asynchronously.

> Why do you need a barrier on a UP kernel?

If pi_clear_on() doesn't contain a memory barrier (possible future),
then we have the following race: (pir[0] begins as 0.)

KVM                             |  IOMMU
--------------------------------+----------------
pir_val = ACCESS_ONCE(pir[0])   |
                                | pir[0] = 123
                                | pi_set_on()
pi_clear_on()                   |
if (pir_val)                    |

ACCESS_ONCE() does not prevent the CPU from prefetching pir[0]
(ACCESS_ONCE does nothing in this patch), so if there was 0 in pir[0]
before the IOMMU wrote to it, then our optimization to avoid the xchg
would yield a false negative and the interrupt would be lost.


Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-27 Thread Michael S. Tsirkin
On Thu, Oct 27, 2016 at 06:44:00PM +0200, Radim Krčmář wrote:
> 2016-10-27 00:42+0300, Michael S. Tsirkin:
> > On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
> >> 2016-10-14 20:21+0200, Paolo Bonzini:
> >> > On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
> >> > posted interrupts turn out to be slower than interrupt injection via
> >> > KVM_REQ_EVENT.
> >> > 
> >> > This patch optimizes a bit the IRR update, avoiding expensive atomic
> >> > operations in the common case where PI.ON=0 at vmentry or the PIR vector
> >> > is mostly zero.  This saves at least 20 cycles (1%) per vmexit, as
> >> > measured by kvm-unit-tests' inl_from_qemu test (20 runs):
> >> > 
> >> >       | enable_apicv=1  |  enable_apicv=0
> >> >       | mean    stdev   |  mean    stdev
> >> > ------|-----------------|-----------------
> >> > before| 5826    32.65   |  5765    47.09
> >> > after | 5809    43.42   |  5777    77.02
> >> > 
> >> > Of course, any change in the right column is just placebo effect. :)
> >> > The savings are bigger if interrupts are frequent.
> >> > 
> >> > Signed-off-by: Paolo Bonzini 
> >> > ---
> >> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> > @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
> >> >  (unsigned long *)&pi_desc->control);
> >> >  }
> >> >  
> >> > +static inline void pi_clear_on(struct pi_desc *pi_desc)
> >> > +{
> >> > +clear_bit(POSTED_INTR_ON,
> >> > +  (unsigned long *)&pi_desc->control);
> >> > +}
> >> 
> >> We should add an explicit smp_mb__after_atomic() for extra correctness,
> >> because clear_bit() does not guarantee a memory barrier and we must make
> >> sure that pir reads can't be reordered before it.
> >> x86 clear_bit() currently uses locked instruction, though.
> > 
> > smp_mb__after_atomic is empty on x86 so it's
> > a documentation thing, not a correctness thing anyway.
> 
> All atomics currently contain a barrier, but the code is also
> future-proofing, not just documentation: implementation of clear_bit()
> could drop the barrier and smp_mb__after_atomic() would then become a
> real barrier.
> 
> Adding dma_mb__after_atomic() would be even better as this bug could
> happen even on a uniprocessor with an assigned device, but people who
> buy a SMP chip to run a UP kernel deserve it.

Not doing dma so does not seem to make sense ...
Why do you need a barrier on a UP kernel?

-- 
MST


Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-27 Thread Radim Krčmář
2016-10-27 00:42+0300, Michael S. Tsirkin:
> On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
>> 2016-10-14 20:21+0200, Paolo Bonzini:
>> > On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
>> > posted interrupts turn out to be slower than interrupt injection via
>> > KVM_REQ_EVENT.
>> > 
>> > This patch optimizes a bit the IRR update, avoiding expensive atomic
>> > operations in the common case where PI.ON=0 at vmentry or the PIR vector
>> > is mostly zero.  This saves at least 20 cycles (1%) per vmexit, as
>> > measured by kvm-unit-tests' inl_from_qemu test (20 runs):
>> > 
>> >       | enable_apicv=1  |  enable_apicv=0
>> >       | mean    stdev   |  mean    stdev
>> > ------|-----------------|-----------------
>> > before| 5826    32.65   |  5765    47.09
>> > after | 5809    43.42   |  5777    77.02
>> > 
>> > Of course, any change in the right column is just placebo effect. :)
>> > The savings are bigger if interrupts are frequent.
>> > 
>> > Signed-off-by: Paolo Bonzini 
>> > ---
>> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> > @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
>> >    (unsigned long *)&pi_desc->control);
>> >  }
>> >  
>> > +static inline void pi_clear_on(struct pi_desc *pi_desc)
>> > +{
>> > +  clear_bit(POSTED_INTR_ON,
>> > +  (unsigned long *)&pi_desc->control);
>> > +}
>> 
>> We should add an explicit smp_mb__after_atomic() for extra correctness,
>> because clear_bit() does not guarantee a memory barrier and we must make
>> sure that pir reads can't be reordered before it.
>> x86 clear_bit() currently uses locked instruction, though.
> 
> smp_mb__after_atomic is empty on x86 so it's
> a documentation thing, not a correctness thing anyway.

All atomics currently contain a barrier, but the code is also
future-proofing, not just documentation: implementation of clear_bit()
could drop the barrier and smp_mb__after_atomic() would then become a
real barrier.

Adding dma_mb__after_atomic() would be even better as this bug could
happen even on a uniprocessor with an assigned device, but people who
buy a SMP chip to run a UP kernel deserve it.
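For what it's worth, no dma_mb__after_atomic() exists in the tree; what is
being proposed would look roughly like this (hypothetical sketch, not an
existing kernel API):

    /*
     * Hypothetical macro, per the suggestion above.  Unlike
     * smp_mb__after_atomic(), it must order an atomic RMW against later
     * reads even on CONFIG_SMP=n, because the other observer is a
     * device, not another CPU.
     */
    #ifdef CONFIG_X86
    /* x86 bitops use LOCK-prefixed instructions, which are full barriers
     * regardless of SMP, so a compiler barrier suffices. */
    #define dma_mb__after_atomic()  barrier()
    #else
    #define dma_mb__after_atomic()  mb()
    #endif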


Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-26 Thread Michael S. Tsirkin
On Wed, Oct 19, 2016 at 04:45:48AM -0700, Paul E. McKenney wrote:
> On Sun, Oct 16, 2016 at 05:29:24AM +0300, Michael S. Tsirkin wrote:
> > On Sat, Oct 15, 2016 at 03:47:45AM -0400, Paolo Bonzini wrote:
> > > 
> > > > > On Oct 14, 2016, at 11:56 AM, Paolo Bonzini  wrote:
> > > > >>> 
> > > > >>> for (i = 0; i <= 7; i++) {
> > > > >>> -   pir_val = xchg(&pir[i], 0);
> > > > >>> -   if (pir_val)
> > > > >>> +   pir_val = READ_ONCE(pir[i]);
> > > > >> 
> > > > >> Out of curiosity, do you really need this READ_ONCE?
> > > > > 
> > > > > The answer can only be "depends on the compiler's whims". :)
> > > > > If you think of READ_ONCE as a C11 relaxed atomic load, then yes.
> > > > 
> > > > Hm.. So the idea is to make the code "race-free" in the sense
> > > > that every concurrent memory access is done using READ_ONCE/WRITE_ONCE?
> > > > 
> > > > If that is the case, I think there are many other cases that need to be
> > > > changed, for example apic->irr_pending and vcpu->arch.pv.pv_unhalted.
> > > 
> > > There is no documentation for this in the kernel tree unfortunately.
> > > But yes, I think we should do that.  Using READ_ONCE/WRITE_ONCE around
> > > memory barriers is a start.
> > > 
> > > Paolo
> > 
> > I'm beginning to think that if a value is always (maybe except for init
> > where we don't much care about the code size anyway) accessed through
> > *_ONCE macros, we should just mark it volatile and be done with it. The
> > code will look cleaner, and there will be less space for errors
> > like forgetting *_ONCE macros.
> > 
> > Would such code (where all accesses are done through
> > READ_ONCE/WRITE_ONCE otherwise) be an exception to
> > volatile-considered-harmful.txt rules?
> > 
> > Cc Paul and Jonathan (for volatile-considered-harmful.txt).
> 
> One concern would be the guy reading the code, to whom it might not
> be obvious that the underlying access was volatile, especially if
> the reference was a few levels down.
> 
>   Thanx, Paul

So the guy reading the code will think access is unsafe, check it up and
realise it's fine.  Is this a big deal?  Isn't it better than the
reverse where one forgets to use READ_ONCE and gets a runtime bug?

My point is that this text:

The key point to understand with regard to volatile is that its purpose is
to suppress optimization, which is almost never what one really wants to do.

doesn't seem to apply anymore since we have hundreds of users of
READ_ONCE the point of which is exactly to suppress optimization.

I'm guessing this was written when we only had barrier() which is quite
unlike READ_ONCE in that it's not tied to a specific memory reference.

-- 
MST
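The trade-off being debated, side by side (a sketch in kernel-style C; the
struct and field are stand-ins, not the actual KVM declarations):

    /* Option A: plain field.  Every racy access must remember READ_ONCE(),
     * and a forgotten one is a silent bug: */
    struct pv_a { bool pv_unhalted; };
    static bool check_a(struct pv_a *s) { return READ_ONCE(s->pv_unhalted); }

    /* Option B: volatile field.  Every access is "once" by construction,
     * but, per Paul's concern, a reader several call levels up can no
     * longer see at the use site that the access is racy: */
    struct pv_b { volatile bool pv_unhalted; };
    static bool check_b(struct pv_b *s) { return s->pv_unhalted; }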


Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-26 Thread Michael S. Tsirkin
On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
> 2016-10-14 20:21+0200, Paolo Bonzini:
> > On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
> > posted interrupts turn out to be slower than interrupt injection via
> > KVM_REQ_EVENT.
> > 
> > This patch optimizes a bit the IRR update, avoiding expensive atomic
> > operations in the common case where PI.ON=0 at vmentry or the PIR vector
> > is mostly zero.  This saves at least 20 cycles (1%) per vmexit, as
> > measured by kvm-unit-tests' inl_from_qemu test (20 runs):
> > 
> >       | enable_apicv=1  |  enable_apicv=0
> >       | mean    stdev   |  mean    stdev
> > ------|-----------------|-----------------
> > before| 5826    32.65   |  5765    47.09
> > after | 5809    43.42   |  5777    77.02
> > 
> > Of course, any change in the right column is just placebo effect. :)
> > The savings are bigger if interrupts are frequent.
> > 
> > Signed-off-by: Paolo Bonzini 
> > ---
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
> >   (unsigned long *)&pi_desc->control);
> >  }
> >  
> > +static inline void pi_clear_on(struct pi_desc *pi_desc)
> > +{
> > +   clear_bit(POSTED_INTR_ON,
> > + (unsigned long *)&pi_desc->control);
>     ^^
>     bad whitespace.
> 
> > +}
> 
> We should add an explicit smp_mb__after_atomic() for extra correctness,
> because clear_bit() does not guarantee a memory barrier and we must make
> sure that pir reads can't be reordered before it.
> x86 clear_bit() currently uses locked instruction, though.

smp_mb__after_atomic is empty on x86 so it's
a documentation thing, not a correctness thing anyway.

> > +
> >  static inline int pi_test_on(struct pi_desc *pi_desc)
> >  {
> > return test_bit(POSTED_INTR_ON,
> 
> Other than that,
> 
> Reviewed-by: Radim Krčmář 


Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-26 Thread Radim Krčmář
2016-10-14 20:21+0200, Paolo Bonzini:
> On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
> posted interrupts turn out to be slower than interrupt injection via
> KVM_REQ_EVENT.
> 
> This patch optimizes a bit the IRR update, avoiding expensive atomic
> operations in the common case where PI.ON=0 at vmentry or the PIR vector
> is mostly zero.  This saves at least 20 cycles (1%) per vmexit, as
> measured by kvm-unit-tests' inl_from_qemu test (20 runs):
> 
>       | enable_apicv=1  |  enable_apicv=0
>       | mean    stdev   |  mean    stdev
> ------|-----------------|-----------------
> before| 5826    32.65   |  5765    47.09
> after | 5809    43.42   |  5777    77.02
> 
> Of course, any change in the right column is just placebo effect. :)
> The savings are bigger if interrupts are frequent.
> 
> Signed-off-by: Paolo Bonzini 
> ---
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
>   (unsigned long *)&pi_desc->control);
>  }
>  
> +static inline void pi_clear_on(struct pi_desc *pi_desc)
> +{
> + clear_bit(POSTED_INTR_ON,
> +   (unsigned long *)&pi_desc->control);
      ^^
      bad whitespace.

> +}

We should add an explicit smp_mb__after_atomic() for extra correctness,
because clear_bit() does not guarantee a memory barrier and we must make
sure that pir reads can't be reordered before it.
x86 clear_bit() currently uses locked instruction, though.

> +
>  static inline int pi_test_on(struct pi_desc *pi_desc)
>  {
>   return test_bit(POSTED_INTR_ON,

Other than that,

Reviewed-by: Radim Krčmář 


Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-19 Thread Paul E. McKenney
On Sun, Oct 16, 2016 at 05:29:24AM +0300, Michael S. Tsirkin wrote:
> On Sat, Oct 15, 2016 at 03:47:45AM -0400, Paolo Bonzini wrote:
> > 
> > > > On Oct 14, 2016, at 11:56 AM, Paolo Bonzini  wrote:
> > > >>> 
> > > >>>   for (i = 0; i <= 7; i++) {
> > > >>> - pir_val = xchg(&pir[i], 0);
> > > >>> - if (pir_val)
> > > >>> + pir_val = READ_ONCE(pir[i]);
> > > >> 
> > > >> Out of curiosity, do you really need this READ_ONCE?
> > > > 
> > > > The answer can only be "depends on the compiler's whims". :)
> > > > If you think of READ_ONCE as a C11 relaxed atomic load, then yes.
> > > 
> > > Hm.. So the idea is to make the code "race-free" in the sense
> > > that every concurrent memory access is done using READ_ONCE/WRITE_ONCE?
> > > 
> > > If that is the case, I think there are many other cases that need to be
> > > changed, for example apic->irr_pending and vcpu->arch.pv.pv_unhalted.
> > 
> > There is no documentation for this in the kernel tree unfortunately.
> > But yes, I think we should do that.  Using READ_ONCE/WRITE_ONCE around
> > memory barriers is a start.
> > 
> > Paolo
> 
> I'm beginning to think that if a value is always (maybe except for init
> where we don't much care about the code size anyway) accessed through
> *_ONCE macros, we should just mark it volatile and be done with it. The
> code will look cleaner, and there will be less space for errors
> like forgetting *_ONCE macros.
> 
> Would such code (where all accesses are done through
> READ_ONCE/WRITE_ONCE otherwise) be an exception to
> volatile-considered-harmful.txt rules?
> 
> Cc Paul and Jonathan (for volatile-considered-harmful.txt).

One concern would be the guy reading the code, to whom it might not
be obvious that the underlying access was volatile, especially if
the reference was a few levels down.

Thanx, Paul



Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-17 Thread Paolo Bonzini


On 16/10/2016 05:21, Michael S. Tsirkin wrote:
> On Fri, Oct 14, 2016 at 08:21:27PM +0200, Paolo Bonzini wrote:
>> On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
>> posted interrupts turn out to be slower than interrupt injection via
>> KVM_REQ_EVENT.
>>
>> This patch optimizes a bit the IRR update, avoiding expensive atomic
>> operations in the common case where PI.ON=0 at vmentry or the PIR vector
>> is mostly zero.  This saves at least 20 cycles (1%) per vmexit, as
>> measured by kvm-unit-tests' inl_from_qemu test (20 runs):
>>
>>       | enable_apicv=1  |  enable_apicv=0
>>       | mean    stdev   |  mean    stdev
>> ------|-----------------|-----------------
>> before| 5826    32.65   |  5765    47.09
>> after | 5809    43.42   |  5777    77.02
>>
>> Of course, any change in the right column is just placebo effect. :)
>> The savings are bigger if interrupts are frequent.
>>
>> Signed-off-by: Paolo Bonzini 
>> ---
>>  arch/x86/kvm/lapic.c | 6 --
>>  arch/x86/kvm/vmx.c   | 9 -
>>  2 files changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 23b99f305382..63a442aefc12 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -342,9 +342,11 @@ void __kvm_apic_update_irr(u32 *pir, void *regs)
>>  u32 i, pir_val;
>>  
>>  for (i = 0; i <= 7; i++) {
>> -pir_val = xchg(&pir[i], 0);
>> -if (pir_val)
>> +pir_val = READ_ONCE(pir[i]);
>> +if (pir_val) {
>> +pir_val = xchg(&pir[i], 0);
>>  *((u32 *)(regs + APIC_IRR + i * 0x10)) |= pir_val;
>> +}
>>  }
>>  }
>>  EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);
> 
> gcc doesn't seem to unroll this loop and it's
> probably worth unrolling it.
> 
> The following seems to do the trick for me on upstream - I didn't
> benchmark it though. Is there a kvm unit test for interrupts?

No, not yet.  Purely from a branch-prediction point of view I wouldn't
be so sure that it's worth unrolling it.  Because of the xchg you cannot
make the code branchless, and having a branch per iteration puts some
pressure on the BTB.

This is only a hot path in workloads that have a lot of interrupts and a
lot of vmexits.  If it's always the same interrupt (actually the same
group of 32 interrupts) that triggers, then the branch predictor will
predict the history very easily (e.g. 1000 1000 1000...).

Paolo

> --->
> 
> kvm: unroll the loop in __kvm_apic_update_irr.
> 
> This is hot data path in interrupt-rich workloads, worth unrolling.
> 
> Signed-off-by: Michael S. Tsirkin 
> 
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b62c852..0c3462c 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -337,7 +337,8 @@ static u8 count_vectors(void *bitmap)
>   return count;
>  }
>  
> -void __kvm_apic_update_irr(u32 *pir, void *regs)
> +void __attribute__((optimize("unroll-loops")))
> +__kvm_apic_update_irr(u32 *pir, void *regs)
>  {
>   u32 i, pir_val;
>  
> 
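For comparison, the same effect without the GCC-specific attribute: a
hand-unrolled sketch (the helper macro is invented; the per-word logic follows
the READ_ONCE patch above):

    /* Sketch: manual unroll of __kvm_apic_update_irr().  Note the |= uses
     * the xchg() result, not the first read, since pir[i] may gain bits
     * between the READ_ONCE() and the xchg(). */
    #define KVM_IRR_WORD(i)                                             \
            do {                                                        \
                    u32 v = READ_ONCE(pir[i]);                          \
                    if (v)                                              \
                            *((u32 *)(regs + APIC_IRR + (i) * 0x10)) |= \
                                    xchg(&pir[i], 0);                   \
            } while (0)

    void __kvm_apic_update_irr(u32 *pir, void *regs)
    {
            KVM_IRR_WORD(0); KVM_IRR_WORD(1); KVM_IRR_WORD(2); KVM_IRR_WORD(3);
            KVM_IRR_WORD(4); KVM_IRR_WORD(5); KVM_IRR_WORD(6); KVM_IRR_WORD(7);
    }

Whether this beats the plain loop is exactly the branch-prediction question
Paolo raises above: it trades BTB pressure for eight static branch sites.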


Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-15 Thread Michael S. Tsirkin
On Fri, Oct 14, 2016 at 08:21:27PM +0200, Paolo Bonzini wrote:
> On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
> posted interrupts turn out to be slower than interrupt injection via
> KVM_REQ_EVENT.
> 
> This patch optimizes a bit the IRR update, avoiding expensive atomic
> operations in the common case where PI.ON=0 at vmentry or the PIR vector
> is mostly zero.  This saves at least 20 cycles (1%) per vmexit, as
> measured by kvm-unit-tests' inl_from_qemu test (20 runs):
> 
>       | enable_apicv=1  |  enable_apicv=0
>       | mean    stdev   |  mean    stdev
> ------|-----------------|-----------------
> before| 5826    32.65   |  5765    47.09
> after | 5809    43.42   |  5777    77.02
> 
> Of course, any change in the right column is just placebo effect. :)
> The savings are bigger if interrupts are frequent.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/lapic.c | 6 --
>  arch/x86/kvm/vmx.c   | 9 -
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 23b99f305382..63a442aefc12 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -342,9 +342,11 @@ void __kvm_apic_update_irr(u32 *pir, void *regs)
>   u32 i, pir_val;
>  
>   for (i = 0; i <= 7; i++) {
> - pir_val = xchg(&pir[i], 0);
> - if (pir_val)
> + pir_val = READ_ONCE(pir[i]);
> + if (pir_val) {
> + pir_val = xchg(&pir[i], 0);
>   *((u32 *)(regs + APIC_IRR + i * 0x10)) |= pir_val;
> + }
>   }
>  }
>  EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);

gcc doesn't seem to unroll this loop and it's
probably worth unrolling it.

The following seems to do the trick for me on upstream - I didn't
benchmark it though. Is there a kvm unit test for interrupts?

--->

kvm: unroll the loop in __kvm_apic_update_irr.

This is hot data path in interrupt-rich workloads, worth unrolling.

Signed-off-by: Michael S. Tsirkin 


diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b62c852..0c3462c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -337,7 +337,8 @@ static u8 count_vectors(void *bitmap)
return count;
 }
 
-void __kvm_apic_update_irr(u32 *pir, void *regs)
+void __attribute__((optimize("unroll-loops")))
+__kvm_apic_update_irr(u32 *pir, void *regs)
 {
u32 i, pir_val;
 


Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-15 Thread Michael S. Tsirkin
On Sat, Oct 15, 2016 at 03:47:45AM -0400, Paolo Bonzini wrote:
> 
> > > On Oct 14, 2016, at 11:56 AM, Paolo Bonzini  wrote:
> > >>> 
> > >>> for (i = 0; i <= 7; i++) {
> > >>> -   pir_val = xchg(&pir[i], 0);
> > >>> -   if (pir_val)
> > >>> +   pir_val = READ_ONCE(pir[i]);
> > >> 
> > >> Out of curiosity, do you really need this READ_ONCE?
> > > 
> > > The answer can only be "depends on the compiler's whims". :)
> > > If you think of READ_ONCE as a C11 relaxed atomic load, then yes.
> > 
> > Hm.. So the idea is to make the code "race-free" in the sense
> > that every concurrent memory access is done using READ_ONCE/WRITE_ONCE?
> > 
> > If that is the case, I think there are many other cases that need to be
> > changed, for example apic->irr_pending and vcpu->arch.pv.pv_unhalted.
> 
> There is no documentation for this in the kernel tree unfortunately.
> But yes, I think we should do that.  Using READ_ONCE/WRITE_ONCE around
> memory barriers is a start.
> 
> Paolo

I'm beginning to think that if a value is always accessed through the
*_ONCE macros (except perhaps during init, where we don't much care
about code size anyway), we should just mark it volatile and be done
with it.  The code will look cleaner, and there will be less room for
errors like a forgotten *_ONCE macro.

Would such code (where every access would otherwise be done through
READ_ONCE/WRITE_ONCE) be an exception to the
volatile-considered-harmful.txt rules?
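
For what it's worth, for scalar types the *_ONCE macros already boil
down to volatile casts; a simplified model (not the exact
include/linux/compiler.h definitions, which also deal with accesses
that could tear):

/* simplified model of the scalar case only */
#define READ_ONCE_MODEL(x)	(*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE_MODEL(x, val)	(*(volatile typeof(x) *)&(x) = (val))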

Cc Paul and Jonathan (for volatile-considered-harmful.txt).


-- 
MST


Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-15 Thread Paolo Bonzini

> > On Oct 14, 2016, at 11:56 AM, Paolo Bonzini  wrote:
> >>> 
> >>>   for (i = 0; i <= 7; i++) {
> >>> - pir_val = xchg(&pir[i], 0);
> >>> - if (pir_val)
> >>> + pir_val = READ_ONCE(pir[i]);
> >> 
> >> Out of curiosity, do you really need this READ_ONCE?
> > 
> > The answer can only be "depends on the compiler's whims". :)
> > If you think of READ_ONCE as a C11 relaxed atomic load, then yes.
> 
> Hm.. So the idea is to make the code "race-free" in the sense
> that every concurrent memory access is done using READ_ONCE/WRITE_ONCE?
> 
> If that is the case, I think there are many other cases that need to be
> changed, for example apic->irr_pending and vcpu->arch.pv.pv_unhalted.

There is no documentation for this in the kernel tree unfortunately.
But yes, I think we should do that.  Using READ_ONCE/WRITE_ONCE around
memory barriers is a start.
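
As a minimal sketch of that convention, for a hypothetical flag/data
pair (illustration only, not code from this series):

static int shared_data;
static int flag;

static void publish(int v)
{
	WRITE_ONCE(shared_data, v);
	smp_wmb();		/* order the data store before the flag store */
	WRITE_ONCE(flag, 1);
}

static int consume(int *v)
{
	if (!READ_ONCE(flag))
		return 0;
	smp_rmb();		/* order the flag load before the data load */
	*v = READ_ONCE(shared_data);
	return 1;
}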

Paolo


Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-14 Thread Nadav Amit

> On Oct 14, 2016, at 11:56 AM, Paolo Bonzini  wrote:
>>> 
>>> for (i = 0; i <= 7; i++) {
>>> -   pir_val = xchg(&pir[i], 0);
>>> -   if (pir_val)
>>> +   pir_val = READ_ONCE(pir[i]);
>> 
>> Out of curiosity, do you really need this READ_ONCE?
> 
> The answer can only be "depends on the compiler's whims". :)
> If you think of READ_ONCE as a C11 relaxed atomic load, then yes.

Hm.. So the idea is to make the code "race-free" in the sense
that every concurrent memory access is done using READ_ONCE/WRITE_ONCE?

If that is the case, I think there are many other cases that need to be
changed, for example apic->irr_pending and vcpu->arch.pv.pv_unhalted.
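
For instance, assuming such a conversion were done, accesses to
apic->irr_pending might be wrapped like this (hypothetical helpers,
illustration only):

static inline void lapic_set_irr_pending(struct kvm_lapic *apic)
{
	WRITE_ONCE(apic->irr_pending, true);
}

static inline bool lapic_irr_pending(struct kvm_lapic *apic)
{
	return READ_ONCE(apic->irr_pending);
}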




Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-14 Thread Paolo Bonzini
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 23b99f305382..63a442aefc12 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -342,9 +342,11 @@ void __kvm_apic_update_irr(u32 *pir, void *regs)
> > u32 i, pir_val;
> > 
> > for (i = 0; i <= 7; i++) {
> > -   pir_val = xchg(&pir[i], 0);
> > -   if (pir_val)
> > +   pir_val = READ_ONCE(pir[i]);
> 
> Out of curiosity, do you really need this READ_ONCE?

The answer can only be "depends on the compiler's whims". :)
If you think of READ_ONCE as a C11 relaxed atomic load, then yes.
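
Concretely: a plain load of pir[i] races with the IOMMU's writes, so
the compiler is free to tear, refetch, or duplicate it; READ_ONCE pins
it to a single full-width load.  A userspace C11 analogy (hypothetical
names, not kernel code):

#include <stdatomic.h>
#include <stdint.h>

/* Racy: undefined behaviour under concurrent writes; the compiler may
 * tear or refetch this load. */
static uint32_t peek_plain(const uint32_t *p)
{
	return *p;
}

/* Moral equivalent of READ_ONCE: exactly one full-width load, with no
 * ordering implied beyond the load itself. */
static uint32_t peek_once(const _Atomic uint32_t *p)
{
	return atomic_load_explicit(p, memory_order_relaxed);
}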

Paolo


Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-14 Thread Nadav Amit
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 23b99f305382..63a442aefc12 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -342,9 +342,11 @@ void __kvm_apic_update_irr(u32 *pir, void *regs)
>   u32 i, pir_val;
> 
>   for (i = 0; i <= 7; i++) {
> - pir_val = xchg(&pir[i], 0);
> - if (pir_val)
> + pir_val = READ_ONCE(pir[i]);

Out of curiosity, do you really need this READ_ONCE?



[PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-14 Thread Paolo Bonzini
On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
posted interrupts turn out to be slower than interrupt injection via
KVM_REQ_EVENT.

This patch slightly optimizes the IRR update, avoiding expensive atomic
operations in the common case where PI.ON=0 at vmentry or the PIR vector
is mostly zero.  This saves at least 20 cycles (1%) per vmexit, as
measured by kvm-unit-tests' inl_from_qemu test (20 runs):

       | enable_apicv=1  | enable_apicv=0
       | mean    stdev   | mean    stdev
-------|-----------------|----------------
before | 5826    32.65   | 5765    47.09
after  | 5809    43.42   | 5777    77.02

Of course, any change in the right column is just placebo effect. :)
The savings are bigger if interrupts are frequent.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/lapic.c | 6 ++++--
 arch/x86/kvm/vmx.c   | 9 ++++++++-
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 23b99f305382..63a442aefc12 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -342,9 +342,11 @@ void __kvm_apic_update_irr(u32 *pir, void *regs)
u32 i, pir_val;
 
for (i = 0; i <= 7; i++) {
-   pir_val = xchg(&pir[i], 0);
-   if (pir_val)
+   pir_val = READ_ONCE(pir[i]);
+   if (pir_val) {
+   pir_val = xchg(&pir[i], 0);
*((u32 *)(regs + APIC_IRR + i * 0x10)) |= pir_val;
+   }
}
 }
 EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2577183b40d9..7c79d6c6b6ed 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
(unsigned long *)&pi_desc->control);
 }
 
+static inline void pi_clear_on(struct pi_desc *pi_desc)
+{
+   clear_bit(POSTED_INTR_ON,
+ (unsigned long *)&pi_desc->control);
+}
+
 static inline int pi_test_on(struct pi_desc *pi_desc)
 {
return test_bit(POSTED_INTR_ON,
@@ -4854,9 +4860,10 @@ static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
 
-   if (!pi_test_and_clear_on(&vmx->pi_desc))
+   if (!pi_test_on(&vmx->pi_desc))
return;
 
+   pi_clear_on(&vmx->pi_desc);
kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
 }
 
-- 
1.8.3.1
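
The shape of the fast path, modeled in plain C11 (drain_word is a
hypothetical stand-in for one 32-bit PIR word, not code from the
patch):

#include <stdatomic.h>
#include <stdint.h>

static uint32_t drain_word(_Atomic uint32_t *slot)
{
	/* Cheap relaxed peek first: in the common case the word is zero
	 * and the locked read-modify-write is skipped entirely. */
	uint32_t v = atomic_load_explicit(slot, memory_order_relaxed);

	if (!v)
		return 0;

	/* Slow path: atomically claim the bits we saw, plus any posted
	 * after the peek. */
	return atomic_exchange_explicit(slot, 0, memory_order_acq_rel);
}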



