Re: [PATCH v2 2/4] KVM: Add paravirt remote TLB flush

2017-11-15 Thread Wanpeng Li
2017-11-15 17:54 GMT+08:00 Peter Zijlstra :
> On Wed, Nov 15, 2017 at 04:43:32PM +0800, Wanpeng Li wrote:
>> Hi Peterz,
>>
>> I found a big performance difference, as I discussed with you several days ago.
>>
>> ebizzy -M
>>             vanilla   static/local cpumask   per-cpu cpumask
>>  8 vCPUs    10152     10083                  10117
>> 16 vCPUs     1224      4866                  10008
>> 24 vCPUs     1109      3871                   9928
>> 32 vCPUs     1025      3375                   9811
>>
>> In addition, I can see ~50% of perf top time spent in
>> smp_call_function_many() and ~30% in call_function_interrupt() in the
>> guest when running ebizzy with the static/local cpumask variable.
>> However, I can hardly observe this IPI traffic at all after changing
>> to the per-cpu variable. Any opinions?
>
> That doesn't really make sense.. :/
>
> So a single static variable is broken (multiple CPUs can call
> flush_tlb_others() concurrently and overwrite each other's masks). But I
> don't see why a per-cpu variable would be much slower than an on-stack
> variable.

The ebizzy score is bigger-is-better, so the per-cpu variable is 2~3
times better than on-stack. Actually, I found out what happens here. :)

+	for_each_possible_cpu(cpu) {
+		zalloc_cpumask_var_node(per_cpu_ptr(&__pv_tlb_mask, cpu),
+					GFP_KERNEL, cpu_to_node(cpu));
+	}

This zalloc_cpumask_var_node() fails to allocate the per-cpu memory, so
the cpumask pointer stays NULL. There is a check in my
kvm_flush_tlb_others():

+ if (unlikely(!flushmask))
+ return;

So kvm_flush_tlb_others() skips all the TLB shootdowns. I think that is
why the overcommit score is as high as the non-overcommit one; it also
explains why I can't observe the IPI-related functions in perf top.

Regards,
Wanpeng Li


Re: [PATCH v2 2/4] KVM: Add paravirt remote TLB flush

2017-11-15 Thread Peter Zijlstra
On Wed, Nov 15, 2017 at 04:43:32PM +0800, Wanpeng Li wrote:
> Hi Peterz,
> 
> I found a big performance difference, as I discussed with you several days ago.
> 
> ebizzy -M
>             vanilla   static/local cpumask   per-cpu cpumask
>  8 vCPUs    10152     10083                  10117
> 16 vCPUs     1224      4866                  10008
> 24 vCPUs     1109      3871                   9928
> 32 vCPUs     1025      3375                   9811
> 
> In addition, I can see ~50% of perf top time spent in
> smp_call_function_many() and ~30% in call_function_interrupt() in the
> guest when running ebizzy with the static/local cpumask variable.
> However, I can hardly observe this IPI traffic at all after changing
> to the per-cpu variable. Any opinions?

That doesn't really make sense.. :/

So a single static variable is broken (multiple CPUs can call
flush_tlb_others() concurrently and overwrite each other's masks). But I
don't see why a per-cpu variable would be much slower than an on-stack
variable.




Re: [PATCH v2 2/4] KVM: Add paravirt remote TLB flush

2017-11-15 Thread Wanpeng Li
2017-11-10 16:24 GMT+08:00 Paolo Bonzini :
> On 10/11/2017 08:04, Wanpeng Li wrote:
>> From: Wanpeng Li 
>>
>> The remote flushing APIs do a busy wait, which is fine in the
>> bare-metal scenario. But within a guest, the vCPUs might have been
>> preempted or blocked. In this scenario, the initiator vCPU would end
>> up busy-waiting for a long amount of time.
>>
>> This patch set implements paravirtualized TLB flushing that makes
>> sure not to wait for vCPUs that are sleeping; instead, all the
>> sleeping vCPUs flush the TLB on guest entry.
>>
>> The best result is achieved when we're overcommitting the host by
>> running multiple vCPUs on each pCPU. In this case PV TLB flush avoids
>> touching vCPUs which are not scheduled and avoids the wait on the
>> main CPU.
>>
>> Tested on a Haswell i7 desktop with 4 cores (2HT), so 8 pCPUs, running
>> ebizzy in one Linux guest.
>>
>> ebizzy -M
>>             vanilla   optimized   boost
>>  8 vCPUs    10152     10083       -0.68%
>> 16 vCPUs     1224      4866       297.5%
>> 24 vCPUs     1109      3871       249%
>> 32 vCPUs     1025      3375       229.3%
>>
>> Cc: Paolo Bonzini 
>> Cc: Radim Krčmář 
>> Signed-off-by: Wanpeng Li 
>> ---
>>  Documentation/virtual/kvm/cpuid.txt  |  4 
>>  arch/x86/include/uapi/asm/kvm_para.h |  2 ++
>>  arch/x86/kernel/kvm.c| 31 +++
>>  3 files changed, 37 insertions(+)
>>
>> diff --git a/Documentation/virtual/kvm/cpuid.txt 
>> b/Documentation/virtual/kvm/cpuid.txt
>> index 117066a..9693fcc 100644
>> --- a/Documentation/virtual/kvm/cpuid.txt
>> +++ b/Documentation/virtual/kvm/cpuid.txt
>> @@ -60,6 +60,10 @@ KVM_FEATURE_PV_DEDICATED || 8 || guest checks this feature bit
>>                                    ||    || mizations such as usage of
>>                                    ||    || qspinlocks.
>> ------------------------------------------
>> +KVM_FEATURE_PV_TLB_FLUSH          || 9  || guest checks this feature bit
>> +                                  ||    || before enabling paravirtualized
>> +                                  ||    || tlb flush.
>> +------------------------------------------
>>  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||24 || host will warn if no guest-side
>>                                    ||    || per-cpu warps are expected in
>>                                    ||    || kvmclock.
>> diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
>> b/arch/x86/include/uapi/asm/kvm_para.h
>> index 9ead1ed..a028479 100644
>> --- a/arch/x86/include/uapi/asm/kvm_para.h
>> +++ b/arch/x86/include/uapi/asm/kvm_para.h
>> @@ -25,6 +25,7 @@
>>  #define KVM_FEATURE_PV_EOI   6
>>  #define KVM_FEATURE_PV_UNHALT7
>>  #define KVM_FEATURE_PV_DEDICATED 8
>> +#define KVM_FEATURE_PV_TLB_FLUSH 9
>>
>>  /* The last 8 bits are used to indicate how to interpret the flags field
>>   * in pvclock structure. If no bits are set, all flags are ignored.
>> @@ -53,6 +54,7 @@ struct kvm_steal_time {
>>
>>  #define KVM_VCPU_NOT_PREEMPTED  (0 << 0)
>>  #define KVM_VCPU_PREEMPTED  (1 << 0)
>> +#define KVM_VCPU_SHOULD_FLUSH   (1 << 1)
>>
>>  #define KVM_CLOCK_PAIRING_WALLCLOCK 0
>>  struct kvm_clock_pairing {
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 66ed3bc..50f4b6a 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -465,6 +465,33 @@ static void __init kvm_apf_trap_init(void)
>>   update_intr_gate(X86_TRAP_PF, async_page_fault);
>>  }
>>
>> +static cpumask_t flushmask;
>
> Hi Wanpeng,
>
> are you going to send v3 with a percpu variable?

Hi Peterz,

I found a big performance difference, as I discussed with you several days ago.

ebizzy -M
             vanilla   static/local cpumask   per-cpu cpumask
 8 vCPUs    10152     10083                  10117
16 vCPUs     1224      4866                  10008
24 vCPUs     1109      3871                   9928
32 vCPUs     1025      3375                   9811

In addition, I can see ~50% of perf top time spent in
smp_call_function_many() and ~30% in call_function_interrupt() in the
guest when running ebizzy with the static/local cpumask variable.
However, I can hardly observe this IPI traffic at all after changing
to the per-cpu variable. Any opinions?

Regards,
Wanpeng Li

>
> Paolo
>
>> +static void kvm_flush_tlb_others(const struct cpumask *cpumask,
>> + const struct flush_tlb_info *info)
>> +{
>> + u8 state;
>> + int cpu;
>> + struct kvm_steal_time *src;
>> +
>> + cpumask_copy(&flushmask, cpumask);
>> + /*
>> +  * We have to call flush only on online vCPUs. And
>> +  * queue flush_on_enter for pre-empted vCPUs
>> +  */
>> + for_each_cpu(cpu, cpumask) {
>> + src = 

Re: [PATCH v2 2/4] KVM: Add paravirt remote TLB flush

2017-11-10 Thread Wanpeng Li
2017-11-10 16:33 GMT+08:00 Wanpeng Li :
> 2017-11-10 16:24 GMT+08:00 Paolo Bonzini :
>> On 10/11/2017 08:04, Wanpeng Li wrote:
>>> From: Wanpeng Li 
>>>
>>> The remote flushing APIs do a busy wait, which is fine in the
>>> bare-metal scenario. But within a guest, the vCPUs might have been
>>> preempted or blocked. In this scenario, the initiator vCPU would end
>>> up busy-waiting for a long amount of time.
>>>
>>> This patch set implements paravirtualized TLB flushing that makes
>>> sure not to wait for vCPUs that are sleeping; instead, all the
>>> sleeping vCPUs flush the TLB on guest entry.
>>>
>>> The best result is achieved when we're overcommitting the host by
>>> running multiple vCPUs on each pCPU. In this case PV TLB flush avoids
>>> touching vCPUs which are not scheduled and avoids the wait on the
>>> main CPU.
>>>
>>> Tested on a Haswell i7 desktop with 4 cores (2HT), so 8 pCPUs, running
>>> ebizzy in one Linux guest.
>>>
>>> ebizzy -M
>>>             vanilla   optimized   boost
>>>  8 vCPUs    10152     10083       -0.68%
>>> 16 vCPUs     1224      4866       297.5%
>>> 24 vCPUs     1109      3871       249%
>>> 32 vCPUs     1025      3375       229.3%
>>>
>>> Cc: Paolo Bonzini 
>>> Cc: Radim Krčmář 
>>> Signed-off-by: Wanpeng Li 
>>> ---
>>>  Documentation/virtual/kvm/cpuid.txt  |  4 
>>>  arch/x86/include/uapi/asm/kvm_para.h |  2 ++
>>>  arch/x86/kernel/kvm.c| 31 +++
>>>  3 files changed, 37 insertions(+)
>>>
>>> diff --git a/Documentation/virtual/kvm/cpuid.txt 
>>> b/Documentation/virtual/kvm/cpuid.txt
>>> index 117066a..9693fcc 100644
>>> --- a/Documentation/virtual/kvm/cpuid.txt
>>> +++ b/Documentation/virtual/kvm/cpuid.txt
>>> @@ -60,6 +60,10 @@ KVM_FEATURE_PV_DEDICATED || 8 || guest checks this feature bit
>>>                                    ||    || mizations such as usage of
>>>                                    ||    || qspinlocks.
>>> ------------------------------------------
>>> +KVM_FEATURE_PV_TLB_FLUSH          || 9  || guest checks this feature bit
>>> +                                  ||    || before enabling paravirtualized
>>> +                                  ||    || tlb flush.
>>> +------------------------------------------
>>>  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||24 || host will warn if no guest-side
>>>                                    ||    || per-cpu warps are expected in
>>>                                    ||    || kvmclock.
>>> diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
>>> b/arch/x86/include/uapi/asm/kvm_para.h
>>> index 9ead1ed..a028479 100644
>>> --- a/arch/x86/include/uapi/asm/kvm_para.h
>>> +++ b/arch/x86/include/uapi/asm/kvm_para.h
>>> @@ -25,6 +25,7 @@
>>>  #define KVM_FEATURE_PV_EOI   6
>>>  #define KVM_FEATURE_PV_UNHALT7
>>>  #define KVM_FEATURE_PV_DEDICATED 8
>>> +#define KVM_FEATURE_PV_TLB_FLUSH 9
>>>
>>>  /* The last 8 bits are used to indicate how to interpret the flags field
>>>   * in pvclock structure. If no bits are set, all flags are ignored.
>>> @@ -53,6 +54,7 @@ struct kvm_steal_time {
>>>
>>>  #define KVM_VCPU_NOT_PREEMPTED  (0 << 0)
>>>  #define KVM_VCPU_PREEMPTED  (1 << 0)
>>> +#define KVM_VCPU_SHOULD_FLUSH   (1 << 1)
>>>
>>>  #define KVM_CLOCK_PAIRING_WALLCLOCK 0
>>>  struct kvm_clock_pairing {
>>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>>> index 66ed3bc..50f4b6a 100644
>>> --- a/arch/x86/kernel/kvm.c
>>> +++ b/arch/x86/kernel/kvm.c
>>> @@ -465,6 +465,33 @@ static void __init kvm_apf_trap_init(void)
>>>   update_intr_gate(X86_TRAP_PF, async_page_fault);
>>>  }
>>>
>>> +static cpumask_t flushmask;
>>
>> Hi Wanpeng,
>>
>> are you going to send v3 with a percpu variable?
>
> Yeah, I just completed v3 according to Peterz's comments in another
> guy's thread; I will send it out after completing the testing.

This is how it looks: https://pastebin.com/raw/L2vqu4cZ

Regards,
Wanpeng Li

>
> Regards,
> Wanpeng Li
>
>>
>> Paolo
>>
>>> +static void kvm_flush_tlb_others(const struct cpumask *cpumask,
>>> + const struct flush_tlb_info *info)
>>> +{
>>> + u8 state;
>>> + int cpu;
>>> + struct kvm_steal_time *src;
>>> +
>>> + cpumask_copy(&flushmask, cpumask);
>>> + /*
>>> +  * We have to call flush only on online vCPUs. And
>>> +  * queue flush_on_enter for pre-empted vCPUs
>>> +  */
>>> + for_each_cpu(cpu, cpumask) {
>>> + src = &per_cpu(steal_time, cpu);
>>> + state = src->preempted;
>>> + if ((state & KVM_VCPU_PREEMPTED)) {
>>> + if (cmpxchg(&src->preempted, state, state |
>>> + KVM_VCPU_SHOULD_FLUSH) == state)
>>> + __cpumask_clear_cpu(cpu, &flushmask);
>>> + }
>>> + }
>>> +
>>> + native_flush_tlb_others(&flushmask, 

Re: [PATCH v2 2/4] KVM: Add paravirt remote TLB flush

2017-11-10 Thread Wanpeng Li
2017-11-10 16:24 GMT+08:00 Paolo Bonzini :
> On 10/11/2017 08:04, Wanpeng Li wrote:
>> From: Wanpeng Li 
>>
>> The remote flushing APIs do a busy wait, which is fine in the
>> bare-metal scenario. But within a guest, the vCPUs might have been
>> preempted or blocked. In this scenario, the initiator vCPU would end
>> up busy-waiting for a long amount of time.
>>
>> This patch set implements paravirtualized TLB flushing that makes
>> sure not to wait for vCPUs that are sleeping; instead, all the
>> sleeping vCPUs flush the TLB on guest entry.
>>
>> The best result is achieved when we're overcommitting the host by
>> running multiple vCPUs on each pCPU. In this case PV TLB flush avoids
>> touching vCPUs which are not scheduled and avoids the wait on the
>> main CPU.
>>
>> Tested on a Haswell i7 desktop with 4 cores (2HT), so 8 pCPUs, running
>> ebizzy in one Linux guest.
>>
>> ebizzy -M
>>             vanilla   optimized   boost
>>  8 vCPUs    10152     10083       -0.68%
>> 16 vCPUs     1224      4866       297.5%
>> 24 vCPUs     1109      3871       249%
>> 32 vCPUs     1025      3375       229.3%
>>
>> Cc: Paolo Bonzini 
>> Cc: Radim Krčmář 
>> Signed-off-by: Wanpeng Li 
>> ---
>>  Documentation/virtual/kvm/cpuid.txt  |  4 
>>  arch/x86/include/uapi/asm/kvm_para.h |  2 ++
>>  arch/x86/kernel/kvm.c| 31 +++
>>  3 files changed, 37 insertions(+)
>>
>> diff --git a/Documentation/virtual/kvm/cpuid.txt 
>> b/Documentation/virtual/kvm/cpuid.txt
>> index 117066a..9693fcc 100644
>> --- a/Documentation/virtual/kvm/cpuid.txt
>> +++ b/Documentation/virtual/kvm/cpuid.txt
>> @@ -60,6 +60,10 @@ KVM_FEATURE_PV_DEDICATED || 8 || guest checks this feature bit
>>                                    ||    || mizations such as usage of
>>                                    ||    || qspinlocks.
>> ------------------------------------------
>> +KVM_FEATURE_PV_TLB_FLUSH          || 9  || guest checks this feature bit
>> +                                  ||    || before enabling paravirtualized
>> +                                  ||    || tlb flush.
>> +------------------------------------------
>>  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||24 || host will warn if no guest-side
>>                                    ||    || per-cpu warps are expected in
>>                                    ||    || kvmclock.
>> diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
>> b/arch/x86/include/uapi/asm/kvm_para.h
>> index 9ead1ed..a028479 100644
>> --- a/arch/x86/include/uapi/asm/kvm_para.h
>> +++ b/arch/x86/include/uapi/asm/kvm_para.h
>> @@ -25,6 +25,7 @@
>>  #define KVM_FEATURE_PV_EOI   6
>>  #define KVM_FEATURE_PV_UNHALT7
>>  #define KVM_FEATURE_PV_DEDICATED 8
>> +#define KVM_FEATURE_PV_TLB_FLUSH 9
>>
>>  /* The last 8 bits are used to indicate how to interpret the flags field
>>   * in pvclock structure. If no bits are set, all flags are ignored.
>> @@ -53,6 +54,7 @@ struct kvm_steal_time {
>>
>>  #define KVM_VCPU_NOT_PREEMPTED  (0 << 0)
>>  #define KVM_VCPU_PREEMPTED  (1 << 0)
>> +#define KVM_VCPU_SHOULD_FLUSH   (1 << 1)
>>
>>  #define KVM_CLOCK_PAIRING_WALLCLOCK 0
>>  struct kvm_clock_pairing {
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 66ed3bc..50f4b6a 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -465,6 +465,33 @@ static void __init kvm_apf_trap_init(void)
>>   update_intr_gate(X86_TRAP_PF, async_page_fault);
>>  }
>>
>> +static cpumask_t flushmask;
>
> Hi Wanpeng,
>
> are you going to send v3 with a percpu variable?

Yeah, I just completed v3 according to Peterz's comments in another
guy's thread; I will send it out after completing the testing.

Regards,
Wanpeng Li

>
> Paolo
>
>> +static void kvm_flush_tlb_others(const struct cpumask *cpumask,
>> + const struct flush_tlb_info *info)
>> +{
>> + u8 state;
>> + int cpu;
>> + struct kvm_steal_time *src;
>> +
>> + cpumask_copy(&flushmask, cpumask);
>> + /*
>> +  * We have to call flush only on online vCPUs. And
>> +  * queue flush_on_enter for pre-empted vCPUs
>> +  */
>> + for_each_cpu(cpu, cpumask) {
>> + src = &per_cpu(steal_time, cpu);
>> + state = src->preempted;
>> + if ((state & KVM_VCPU_PREEMPTED)) {
>> + if (cmpxchg(&src->preempted, state, state |
>> + KVM_VCPU_SHOULD_FLUSH) == state)
>> + __cpumask_clear_cpu(cpu, &flushmask);
>> + }
>> + }
>> +
>> + native_flush_tlb_others(&flushmask, info);
>> +}
>> +
>>  void __init kvm_guest_init(void)
>>  {
>>   int i;
>> @@ -484,6 +511,10 @@ void __init kvm_guest_init(void)
>>   pv_time_ops.steal_clock = kvm_steal_clock;
>>   }
>>
>> + if 

Re: [PATCH v2 2/4] KVM: Add paravirt remote TLB flush

2017-11-10 Thread Paolo Bonzini
On 10/11/2017 08:04, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> The remote flushing APIs do a busy wait, which is fine in the
> bare-metal scenario. But within a guest, the vCPUs might have been
> preempted or blocked. In this scenario, the initiator vCPU would end
> up busy-waiting for a long amount of time.
>
> This patch set implements paravirtualized TLB flushing that makes
> sure not to wait for vCPUs that are sleeping; instead, all the
> sleeping vCPUs flush the TLB on guest entry.
>
> The best result is achieved when we're overcommitting the host by
> running multiple vCPUs on each pCPU. In this case PV TLB flush avoids
> touching vCPUs which are not scheduled and avoids the wait on the
> main CPU.
> 
> Tested on a Haswell i7 desktop with 4 cores (2HT), so 8 pCPUs, running
> ebizzy in one Linux guest.
> 
> ebizzy -M 
>             vanilla   optimized   boost
>  8 vCPUs    10152     10083       -0.68%
> 16 vCPUs     1224      4866       297.5%
> 24 vCPUs     1109      3871       249%
> 32 vCPUs     1025      3375       229.3%
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Signed-off-by: Wanpeng Li 
> ---
>  Documentation/virtual/kvm/cpuid.txt  |  4 
>  arch/x86/include/uapi/asm/kvm_para.h |  2 ++
>  arch/x86/kernel/kvm.c| 31 +++
>  3 files changed, 37 insertions(+)
> 
> diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
> index 117066a..9693fcc 100644
> --- a/Documentation/virtual/kvm/cpuid.txt
> +++ b/Documentation/virtual/kvm/cpuid.txt
> @@ -60,6 +60,10 @@ KVM_FEATURE_PV_DEDICATED   || 8 || guest checks this feature bit
>                                     ||   || mizations such as usage of
>                                     ||   || qspinlocks.
>  --
> +KVM_FEATURE_PV_TLB_FLUSH           || 9 || guest checks this feature bit
> +                                   ||   || before enabling paravirtualized
> +                                   ||   || tlb flush.
> +--
>  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||24 || host will warn if no guest-side
>                                     ||   || per-cpu warps are expected in
>                                     ||   || kvmclock.
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 9ead1ed..a028479 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -25,6 +25,7 @@
>  #define KVM_FEATURE_PV_EOI   6
>  #define KVM_FEATURE_PV_UNHALT7
>  #define KVM_FEATURE_PV_DEDICATED 8
> +#define KVM_FEATURE_PV_TLB_FLUSH 9
>  
>  /* The last 8 bits are used to indicate how to interpret the flags field
>   * in pvclock structure. If no bits are set, all flags are ignored.
> @@ -53,6 +54,7 @@ struct kvm_steal_time {
>  
>  #define KVM_VCPU_NOT_PREEMPTED  (0 << 0)
>  #define KVM_VCPU_PREEMPTED  (1 << 0)
> +#define KVM_VCPU_SHOULD_FLUSH   (1 << 1)
>  
>  #define KVM_CLOCK_PAIRING_WALLCLOCK 0
>  struct kvm_clock_pairing {
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 66ed3bc..50f4b6a 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -465,6 +465,33 @@ static void __init kvm_apf_trap_init(void)
>   update_intr_gate(X86_TRAP_PF, async_page_fault);
>  }
>  
> +static cpumask_t flushmask;

Hi Wanpeng,

are you going to send v3 with a percpu variable?

Paolo

> +static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> + const struct flush_tlb_info *info)
> +{
> + u8 state;
> + int cpu;
> + struct kvm_steal_time *src;
> +
> + cpumask_copy(&flushmask, cpumask);
> + /*
> +  * We have to call flush only on online vCPUs. And
> +  * queue flush_on_enter for pre-empted vCPUs
> +  */
> + for_each_cpu(cpu, cpumask) {
> + src = &per_cpu(steal_time, cpu);
> + state = src->preempted;
> + if ((state & KVM_VCPU_PREEMPTED)) {
> + if (cmpxchg(&src->preempted, state, state |
> + KVM_VCPU_SHOULD_FLUSH) == state)
> + __cpumask_clear_cpu(cpu, &flushmask);
> + }
> + }
> +
> + native_flush_tlb_others(&flushmask, info);
> +}
> +
>  void __init kvm_guest_init(void)
>  {
>   int i;
> @@ -484,6 +511,10 @@ void __init kvm_guest_init(void)
>   pv_time_ops.steal_clock = kvm_steal_clock;
>   }
>  
> + if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
> + !kvm_para_has_feature(KVM_FEATURE_PV_DEDICATED))
> + pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
> +
>   if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
>

[PATCH v2 2/4] KVM: Add paravirt remote TLB flush

2017-11-09 Thread Wanpeng Li
From: Wanpeng Li 

Remote flushing APIs do a busy wait, which is fine in a bare-metal
scenario. But within the guest, the vcpus might have been preempted
or blocked. In this scenario, the initiator vcpu would end up
busy-waiting for a long amount of time.

This patch set implements para-virt flush tlbs making sure that it
does not wait for vcpus that are sleeping. And all the sleeping vcpus
flush the tlb on guest enter.

The best result is achieved when we're overcommitting the host by running 
multiple vCPUs on each pCPU. In this case PV tlb flush avoids touching 
vCPUs which are not scheduled and avoids the wait on the main CPU.

Test on a Haswell i7 desktop 4 cores (2HT), so 8 pCPUs, running ebizzy in 
one linux guest.

ebizzy -M 
            vanilla    optimized    boost
 8 vCPUs     10152        10083    -0.68%
16 vCPUs      1224         4866    297.5%
24 vCPUs      1109         3871      249%
32 vCPUs      1025         3375    229.3%

Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Signed-off-by: Wanpeng Li 
---
 Documentation/virtual/kvm/cpuid.txt  |  4 ++++
 arch/x86/include/uapi/asm/kvm_para.h |  2 ++
 arch/x86/kernel/kvm.c                | 31 +++++++++++++++++++++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
index 117066a..9693fcc 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -60,6 +60,10 @@ KVM_FEATURE_PV_DEDICATED   || 8 || guest checks this feature bit
                                    ||   || mizations such as usage of
                                    ||   || qspinlocks.
 --
+KVM_FEATURE_PV_TLB_FLUSH           || 9 || guest checks this feature bit
+                                   ||   || before enabling paravirtualized
+                                   ||   || tlb flush.
+--
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||24 || host will warn if no guest-side
                                    ||   || per-cpu warps are expected in
                                    ||   || kvmclock.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 9ead1ed..a028479 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -25,6 +25,7 @@
 #define KVM_FEATURE_PV_EOI 6
 #define KVM_FEATURE_PV_UNHALT  7
 #define KVM_FEATURE_PV_DEDICATED   8
+#define KVM_FEATURE_PV_TLB_FLUSH   9
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -53,6 +54,7 @@ struct kvm_steal_time {
 
 #define KVM_VCPU_NOT_PREEMPTED  (0 << 0)
 #define KVM_VCPU_PREEMPTED  (1 << 0)
+#define KVM_VCPU_SHOULD_FLUSH   (1 << 1)
 
 #define KVM_CLOCK_PAIRING_WALLCLOCK 0
 struct kvm_clock_pairing {
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 66ed3bc..50f4b6a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -465,6 +465,33 @@ static void __init kvm_apf_trap_init(void)
update_intr_gate(X86_TRAP_PF, async_page_fault);
 }
 
+static cpumask_t flushmask;
+
+static void kvm_flush_tlb_others(const struct cpumask *cpumask,
+   const struct flush_tlb_info *info)
+{
+   u8 state;
+   int cpu;
+   struct kvm_steal_time *src;
+
+	cpumask_copy(&flushmask, cpumask);
+	/*
+	 * We have to call flush only on online vCPUs. And
+	 * queue flush_on_enter for pre-empted vCPUs
+	 */
+	for_each_cpu(cpu, cpumask) {
+		src = &per_cpu(steal_time, cpu);
+		state = src->preempted;
+		if ((state & KVM_VCPU_PREEMPTED)) {
+			if (cmpxchg(&src->preempted, state, state |
+				KVM_VCPU_SHOULD_FLUSH) == state)
+				__cpumask_clear_cpu(cpu, &flushmask);
+		}
+	}
+
+	native_flush_tlb_others(&flushmask, info);
+}
+
 void __init kvm_guest_init(void)
 {
int i;
@@ -484,6 +511,10 @@ void __init kvm_guest_init(void)
pv_time_ops.steal_clock = kvm_steal_clock;
}
 
+   if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
+   !kvm_para_has_feature(KVM_FEATURE_PV_DEDICATED))
+   pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
+
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
-- 
2.7.4
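As the follow-up discussion in this thread shows, the single static flushmask is racy (concurrent flush_tlb_others() callers overwrite each other's masks), and v3 is expected to switch to a per-cpu cpumask_var_t. A rough sketch of that direction, with the allocation-failure check Wanpeng mentions; names, placement, and the init path are assumptions, not the actual v3 patch:

```c
/* Sketch only: per-CPU flush mask, as discussed in the thread. */
static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);

static void __init kvm_alloc_cpumask(void)
{
	int cpu;

	/* Must run after the allocators are up; doing this too early is
	 * exactly the failure mode debugged above (NULL per-cpu masks). */
	for_each_possible_cpu(cpu)
		zalloc_cpumask_var_node(per_cpu_ptr(&__pv_tlb_mask, cpu),
					GFP_KERNEL, cpu_to_node(cpu));
}

static void kvm_flush_tlb_others(const struct cpumask *cpumask,
				 const struct flush_tlb_info *info)
{
	struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_tlb_mask);

	/* Note: silently returning here skips the flush entirely, which is
	 * why a failed allocation inflated the overcommit scores above. */
	if (unlikely(!flushmask))
		return;

	cpumask_copy(flushmask, cpumask);
	/* ... same preempted/cmpxchg loop as in the patch, operating on
	 * *flushmask instead of the shared static mask ... */
	native_flush_tlb_others(flushmask, info);
}
```

Because each CPU owns its mask and flush_tlb_others() runs with preemption disabled, concurrent callers no longer clobber one another, which is the correctness problem Peter points out with the single static variable.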


