Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
On 2017/9/14 17:19, Wanpeng Li wrote: 2017-09-14 16:36 GMT+08:00 Quan Xu <quan@gmail.com>: on 2017/9/13 19:56, Yang Zhang wrote: On 2017/8/29 22:56, Michael S. Tsirkin wrote: On Tue, Aug 29, 2017 at 11:46:34AM +, Yang Zhang wrote:

Some latency-intensive workloads see an obvious performance drop when running inside a VM.

But are we trading a lot of CPU for a bit of lower latency?

The main reason is that the overhead is amplified when running inside a VM. The largest cost I have seen is in the idle path. This patch introduces a new mechanism to poll for a while before entering the idle state. If a reschedule is needed during the poll, we do not need to go through the heavy-overhead path.

Isn't it the job of an idle driver to find the best way to halt the CPU? It looks like just by adding a cstate we can make it halt at higher latencies only. And at lower latencies, if it's doing a good job we can hopefully use mwait to stop the CPU. In fact I have been experimenting with exactly that. Some initial results are encouraging but I could use help with testing and especially tuning. If you can help pls let me know!

Quan, can you help to test it and give results? Thanks.

Hi, MST

I have tested the patch "intel_idle: add pv cstates when running on kvm" on a recent host that allows guests to execute mwait without an exit. I have also tested our patch "[RFC PATCH v2 0/7] x86/idle: add halt poll support", upstream Linux, and idle=poll. The following are the results (which look better than ever before, as I ran the test cases on a more powerful machine).

For __netperf__, the first column is the transaction rate per second, the second column is CPU utilization.

1. upstream linux

   [Wanpeng Li] This "upstream linux" means that the kvm adaptive halt-polling is disabled, after confirming with Xu Quan.

   [Quan Xu] upstream linux -- the source code is just from upstream Linux, without our patch or MST's patch.. yes, we disable kvm halt-polling (halt_poll_ns=0) for _all_ of the following cases.

   Quan

   Regards,
   Wanpeng Li

   28371.7 bits/s -- 76.6 %CPU

2. idle=poll
   34372 bit/s -- 999.3 %CPU

3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different values of the parameter 'halt_poll_threshold':
   28362.7 bits/s -- 74.7 %CPU   (halt_poll_threshold=1)
   32949.5 bits/s -- 82.5 %CPU   (halt_poll_threshold=2)
   39717.9 bits/s -- 104.1 %CPU  (halt_poll_threshold=3)
   40137.9 bits/s -- 104.4 %CPU  (halt_poll_threshold=4)
   40079.8 bits/s -- 105.6 %CPU  (halt_poll_threshold=5)

4. "intel_idle: add pv cstates when running on kvm"
   33041.8 bits/s -- 999.4 %CPU

For __ctxsw__, the first column is the time per process context switch, the second column is CPU utilization.

1. upstream linux
   3624.19 ns/ctxsw -- 191.9 %CPU

2. idle=poll
   3419.66 ns/ctxsw -- 999.2 %CPU

3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different values of the parameter 'halt_poll_threshold':
   1123.40 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=1)
   1127.38 ns/ctxsw -- 199.7 %CPU  (halt_poll_threshold=2)
   1113.58 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=3)
   1117.12 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=4)
   1121.62 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=5)

4. "intel_idle: add pv cstates when running on kvm"
   3427.59 ns/ctxsw -- 999.4 %CPU

-Quan
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
on 2017/9/1 13:57, Quan Xu wrote: on 2017/8/29 20:45, Peter Zijlstra wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote:

Add poll in do_idle. For a UP VM, if there is a running task it will not go into the idle path, so we only enable poll in SMP VMs.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>

Broken SoB chain.

Peter, I can't follow 'Broken SoB chain'.. could you explain more about it?

Peter, ping..

Quan

-Quan

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6c23e30..b374744 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }
 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }

And not a word on why we need a new arch hook. What's wrong with arch_cpu_idle_enter() for instance?
Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
on 2017/9/13 19:56, Yang Zhang wrote: On 2017/8/29 22:56, Michael S. Tsirkin wrote: On Tue, Aug 29, 2017 at 11:46:34AM +, Yang Zhang wrote:

Some latency-intensive workloads see an obvious performance drop when running inside a VM.

But are we trading a lot of CPU for a bit of lower latency?

The main reason is that the overhead is amplified when running inside a VM. The largest cost I have seen is in the idle path. This patch introduces a new mechanism to poll for a while before entering the idle state. If a reschedule is needed during the poll, we do not need to go through the heavy-overhead path.

Isn't it the job of an idle driver to find the best way to halt the CPU? It looks like just by adding a cstate we can make it halt at higher latencies only. And at lower latencies, if it's doing a good job we can hopefully use mwait to stop the CPU. In fact I have been experimenting with exactly that. Some initial results are encouraging but I could use help with testing and especially tuning. If you can help pls let me know!

Quan, can you help to test it and give results? Thanks.

Hi, MST

I have tested the patch "intel_idle: add pv cstates when running on kvm" on a recent host that allows guests to execute mwait without an exit. I have also tested our patch "[RFC PATCH v2 0/7] x86/idle: add halt poll support", upstream Linux, and idle=poll. The following are the results (which look better than ever before, as I ran the test cases on a more powerful machine).

For __netperf__, the first column is the transaction rate per second, the second column is CPU utilization.

1. upstream linux
   28371.7 bits/s -- 76.6 %CPU

2. idle=poll
   34372 bit/s -- 999.3 %CPU

3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different values of the parameter 'halt_poll_threshold':
   28362.7 bits/s -- 74.7 %CPU   (halt_poll_threshold=1)
   32949.5 bits/s -- 82.5 %CPU   (halt_poll_threshold=2)
   39717.9 bits/s -- 104.1 %CPU  (halt_poll_threshold=3)
   40137.9 bits/s -- 104.4 %CPU  (halt_poll_threshold=4)
   40079.8 bits/s -- 105.6 %CPU  (halt_poll_threshold=5)

4. "intel_idle: add pv cstates when running on kvm"
   33041.8 bits/s -- 999.4 %CPU

For __ctxsw__, the first column is the time per process context switch, the second column is CPU utilization.

1. upstream linux
   3624.19 ns/ctxsw -- 191.9 %CPU

2. idle=poll
   3419.66 ns/ctxsw -- 999.2 %CPU

3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different values of the parameter 'halt_poll_threshold':
   1123.40 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=1)
   1127.38 ns/ctxsw -- 199.7 %CPU  (halt_poll_threshold=2)
   1113.58 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=3)
   1117.12 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=4)
   1121.62 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=5)

4. "intel_idle: add pv cstates when running on kvm"
   3427.59 ns/ctxsw -- 999.4 %CPU

-Quan
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
On 2017/9/1 14:49, Quan Xu wrote: on 2017/8/29 22:39, Borislav Petkov wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote:

Add poll in do_idle. For a UP VM, if there is a running task it will not go into the idle path, so we only enable poll in SMP VMs.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Kyle Huey <m...@kylehuey.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Len Brown <len.br...@intel.com>
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/process.c | 7 +++
 kernel/sched/idle.c       | 2 ++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 3ca1980..def4113 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -332,6 +332,13 @@ void arch_cpu_idle(void)
 	x86_idle();
 }
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
+void arch_cpu_idle_poll(void)
+{
+	paravirt_idle_poll();
+}
+#endif

So this will get called on any system which has CONFIG_PARAVIRT enabled *even* if they're not running any guests. Huh?

Borislav, on second thought, it would be better to run this under an IF condition, to make sure the system is actually running guests.

Quan

Borislav, yes, this will get called on any system which has CONFIG_PARAVIRT enabled. But if it is not running any guests, the callback is paravirt_nop(), which IIUC is similar to the other paravirt_* defaults, such as paravirt_pgd_free()..

- Quan
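For readers skimming this thread, the behaviour described above can be sketched roughly as follows. This is an illustration of the idea only, not the actual patch; the pv_poll_nop and pv_idle_poll_cb names are invented for the example.

/*
 * Illustrative sketch, not the actual patch: with CONFIG_PARAVIRT the
 * idle-poll hook is always called, but unless a hypervisor guest driver
 * installs a real poll routine the callback stays a nop, so bare metal
 * only pays for one indirect call.
 */
static void pv_poll_nop(void)
{
	/* default: do nothing, mirrors the paravirt_nop() convention */
}

static void (*pv_idle_poll_cb)(void) = pv_poll_nop;

void arch_cpu_idle_poll(void)
{
	pv_idle_poll_cb();	/* stays pv_poll_nop() when no guest driver hooked it */
}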
Re: [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle
On 2017/8/29 20:46, Peter Zijlstra wrote: On Tue, Aug 29, 2017 at 11:46:41AM +, Yang Zhang wrote:

In ttwu_do_wakeup, it will update avg_idle when waking up from idle. Here we just reuse this logic to update the poll time. It may be a little late to update the poll in ttwu_do_wakeup, but the test results show no obvious performance gap compared with updating the poll in the irq handler.

One problem is that idle_stamp is only used by the CFS scheduler. But that is OK since CFS is the default scheduling policy, and considering only it should be enough.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>

Same broken SoB chain, and not a useful word on why you need to adjust this crap to begin with.

What you want that poll duration to be related to is the cost of a VMEXIT/VMENTER cycle, not however long we happened to be idle. So no.

Peter, I think you are right.. IIUC, the time we happened to be idle may contain a chain of VMEXIT/VMENTER cycles, which would mainly (except for the last VMEXIT/VMENTER cycle) be just idle loops. Right?

As you mentioned, the poll duration should be related to the cost of __a__ VMEXIT/VMENTER cycle. However, it is very difficult to measure a VMEXIT/VMENTER cycle accurately from a kvm guest; we could find an approximate value by dropping the idle loops from the time we happened to be idle.. Does that make sense?

Quan
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
on 2017/8/29 22:39, Borislav Petkov wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote:

Add poll in do_idle. For a UP VM, if there is a running task it will not go into the idle path, so we only enable poll in SMP VMs.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Kyle Huey <m...@kylehuey.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Len Brown <len.br...@intel.com>
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/process.c | 7 +++
 kernel/sched/idle.c       | 2 ++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 3ca1980..def4113 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -332,6 +332,13 @@ void arch_cpu_idle(void)
 	x86_idle();
 }
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
+void arch_cpu_idle_poll(void)
+{
+	paravirt_idle_poll();
+}
+#endif

So this will get called on any system which has CONFIG_PARAVIRT enabled *even* if they're not running any guests. Huh?

Borislav, yes, this will get called on any system which has CONFIG_PARAVIRT enabled. But if it is not running any guests, the callback is paravirt_nop(), which IIUC is similar to the other paravirt_* defaults, such as paravirt_pgd_free()..

- Quan
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
on 2017/8/29 20:45, Peter Zijlstra wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote:

Add poll in do_idle. For a UP VM, if there is a running task it will not go into the idle path, so we only enable poll in SMP VMs.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>

Broken SoB chain.

Peter, I can't follow 'Broken SoB chain'.. could you explain more about it?

-Quan

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6c23e30..b374744 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }
 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }

And not a word on why we need a new arch hook. What's wrong with arch_cpu_idle_enter() for instance?
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-15 22:43, Rik van Riel wrote:

Can you explain why you believe that?

For example, a vcpu thread is running in kvm mode and has to stop under a critical condition. QEMU sends an IPI to cause a VM-exit to happen immediately, but this IPI does not make the vcpu return to QEMU. IIUC this vcpu thread will still continue to run in kvm mode when it is woken up on the target machine. With your patch, I don't see a chance to load the guest FPU or XSTATE until it returns to QEMU and enters kvm mode again. Then the FPU or XSTATE status is inconsistent for a small window; what's even worse is that the vcpu is running. Did I misunderstand?

Quan
Alibaba Cloud

Getting the guest FPU or XSTATE is done under the vcpu->mutex. This patch switches out guest and userspace FPU/XSTATE under the vcpu->mutex, and switches it back before releasing the vcpu->mutex. By the time a KVM_GET_FPU has obtained the vcpu->mutex, the guest FPU state will be in vcpu->arch.guest_fpu.state, where you expect it to be.

What am I missing?
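A simplified sketch of the flow Rik describes, moving the guest FPU load/put out to the KVM_RUN path under vcpu->mutex, might look like the following. The surrounding ioctl structure is condensed and the wrapper name is assumed; only kvm_load_guest_fpu/kvm_put_guest_fpu and vcpu_run come from the discussion.

/*
 * Simplified sketch, not the real kvm code: the guest FPU is loaded
 * once after the vcpu ioctl has taken vcpu->mutex and put back before
 * the mutex is released.  A later KVM_GET_FPU also runs under
 * vcpu->mutex, so it sees the guest state saved back in
 * vcpu->arch.guest_fpu.state.
 */
static int vcpu_run_with_guest_fpu(struct kvm_vcpu *vcpu)
{
	int r;

	/* caller (the KVM_RUN ioctl path) already holds vcpu->mutex */
	kvm_load_guest_fpu(vcpu);	/* guest FPU now lives in the registers;
					 * a context switch saves it into
					 * current->thread.fpu and restores it
					 * when the thread runs again */
	r = vcpu_run(vcpu);

	kvm_put_guest_fpu(vcpu);	/* guest FPU written back to
					 * vcpu->arch.guest_fpu.state */
	return r;
}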
Re: [PATCH RFC 0/7] kvm pvtimer
On 2017/12/14 19:56, Paolo Bonzini wrote: On 13/12/2017 17:28, Konrad Rzeszutek Wilk wrote:

1) VM idle path and network req/resp services:

Does this go away if you don't hit the idle path? Meaning if you loop without hitting HLT/MWAIT? I am assuming the issue you are facing is the latency - that is, the first time the guest comes out of HLT and responds to the packet, the latency is much higher than without? And the arming of the timer?

2) process context switches.

Is that related to 1)? That is, the 'schedule' call and the process going to sleep waiting for an interrupt or timer?

This all sounds like issues with low-CPU-usage workloads where you need low latency responses?

Even high CPU usage, as long as there is a small idle time. The cost of setting the TSC deadline timer twice is about 3000 cycles. However, I think Amazon's approach of not intercepting HLT/MWAIT/PAUSE can recover most of the performance and it's way less intrusive.

Paolo, could you share Amazon's patch or the LKML link? Thanks.

Quan

Thanks,

Paolo
Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten
On 2017/12/14 21:15, Gonglei (Arei) wrote:

-----Original Message-----
From: Liran Alon [mailto:liran.a...@oracle.com]
Sent: Thursday, December 14, 2017 9:02 PM
To: Gonglei (Arei); pbonz...@redhat.com; rkrc...@redhat.com
Cc: k...@vger.kernel.org; linux-kernel@vger.kernel.org; Huangweidong (C)
Subject: Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten

On 14/12/17 14:23, Gonglei wrote:

We hit a bug in our test while running PCMark 10 in a Windows 7 VM. The VM got stuck and the wallclock hung after several minutes of running PCMark 10 in it. It is quite easy to reproduce the bug with upstream KVM and QEMU.

We found that KVM cannot inject any RTC irq to the VM after it hangs; it fails to deliver the irq in ioapic_set_irq() because the RTC irq is still pending in ioapic->irr.

static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
			  int irq_level, bool line_status)
{
	...
	if (!irq_level) {
		ioapic->irr &= ~mask;
		ret = 1;
		goto out;
	}
	...
	if ((edge && old_irr == ioapic->irr) ||
	    (!edge && entry.fields.remote_irr)) {
		ret = 0;
		goto out;
	}

According to the RTC spec, after the RTC injects a high-level irq, the OS will read CMOS register C to clear the irq flag and pull down the irq electric pin. For QEMU, we emulate the read operation in cmos_ioport_read(), but the guest OS will issue a write operation first to tell which register will be read by the following access, and we use s->cmos_index to record that register.

But in our test, we found there is a possible situation in which a vCPU fails to read RTC_REG_C to clear the irq. This can happen when two vCPUs are writing/reading registers at the same time. For example, vcpu0 is trying to read RTC_REG_C, so it writes RTC_REG_C first, making s->cmos_index RTC_REG_C; but before it reads register C, vcpu1 is going to read RTC_YEAR and changes s->cmos_index to RTC_YEAR with its write. The next operation of vcpu0 therefore reads RTC_YEAR. In this case, we miss calling qemu_irq_lower(s->irq) to clear the irq. After this, KVM will never inject the RTC irq, and the Windows VM hangs.

If I understood correctly, this looks to me like a race-condition bug in the Windows guest kernel. On real hardware this race condition will also cause RTC_YEAR to be read instead of RTC_REG_C. The guest kernel should make sure that 2 CPUs do not attempt to read a CMOS register in parallel, as they can override each other's cmos_index. See for example how the Linux kernel makes sure to avoid this kind of issue in rtc_cmos_read() (arch/x86/kernel/rtc.c) by grabbing a cmos_lock.

Yes, I knew that.

Let's clear the IRR of the rtc when the corresponding EOI is received, to avoid the issue.

Can you elaborate a bit more on why it makes sense to put such a workaround in KVM code instead of declaring this a guest kernel bug?

I agree it's a Windows bug. The odd thing is that there is no such problem on the Xen platform.

Have you analyzed why it is not a problem on Xen?

Quan

Thanks,
-Gonglei

Regards,
-Liran

Suggested-by: Paolo Bonzini
Signed-off-by: Gonglei
---
Thanks to Paolo for providing a good solution. :)

 arch/x86/kvm/ioapic.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
index 4e822ad..5022d63 100644
--- a/arch/x86/kvm/ioapic.c
+++ b/arch/x86/kvm/ioapic.c
@@ -160,6 +160,7 @@ static void rtc_irq_eoi(struct kvm_ioapic *ioapic, struct kvm_vcpu *vcpu)
 {
 	if (test_and_clear_bit(vcpu->vcpu_id,
 			       ioapic->rtc_status.dest_map.map)) {
+		ioapic->irr &= ~(1 << RTC_GSI);
 		--ioapic->rtc_status.pending_eoi;
 		rtc_status_pending_eoi_check_valid(ioapic);
 	}
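As a side note on the guest-side serialization Liran refers to, the idea behind grabbing a cmos_lock around the index/data access can be sketched as below. This paraphrases the locking discipline rather than quoting the actual arch/x86/kernel/rtc.c code; the helper name is invented, and 0x70/0x71 are the standard RTC index/data ports.

/*
 * Illustrative sketch of the guest-side discipline being referenced:
 * selecting the CMOS index and reading the data must happen under one
 * lock, so another CPU cannot retarget the index register in between.
 * Simplified names; not the actual kernel implementation.
 */
static DEFINE_SPINLOCK(cmos_lock);

static unsigned char cmos_read_serialized(unsigned char reg)
{
	unsigned long flags;
	unsigned char val;

	spin_lock_irqsave(&cmos_lock, flags);
	outb(reg, 0x70);	/* write the index (what QEMU tracks as s->cmos_index) */
	val = inb(0x71);	/* read the data before anyone can reselect the index */
	spin_unlock_irqrestore(&cmos_lock, flags);

	return val;
}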
Re: [PATCH RFC 3/7] KVM: timer: synchronize tsc-deadline timestamp for guest
On 2017/12/08 23:06, Konrad Rzeszutek Wilk wrote: On Fri, Dec 08, 2017 at 04:39:46PM +0800, Quan Xu wrote: From: Ben Luo <bn0...@gmail.com>

In general, a KVM guest programs the tsc-deadline timestamp into the MSR_IA32_TSC_DEADLINE MSR. This causes a VM-exit, and then KVM handles the timer for the guest. With this series, the tsc-deadline timestamp is mostly recorded in a shared page instead, with fewer VM-exits. We introduce a periodically working kthread to scan the shared page and synchronize the timer settings for the guest on a dedicated CPU.

That sounds like a race. Meaning the guest may ask for too small a window and this 'working thread to scan' may not get to it fast enough?

Yes, you are right. So ..

.. Meaning we miss the deadline to inject the timer in the guest. Or is this part of the PV MSR semantics - that it will only work for a certain range of values, and anything less than, say, 1ms should not use the PV MSR?

.. for these timers, we have to program the tsc-deadline timestamps to MSR_IA32_TSC_DEADLINE as normal, which causes a VM-exit, and KVM will signal the working thread through an IPI to program the timer, instead of registering it on the current CPU (patch 0004). More detail in patch 0007.

Quan
Alibaba Cloud
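To make the mechanism under discussion easier to follow, here is a hedged sketch of what the shared-page contract and the scanning kthread could look like. The record layout, field names, the pvtimer_arm_for_vcpu() helper and the 50us scan period are illustrative assumptions, not the actual patches.

/*
 * Illustrative sketch (not the actual patches): the guest writes its
 * next tsc-deadline into a per-vCPU record in a shared page instead of
 * the MSR; a host kthread pinned to a dedicated CPU periodically scans
 * the records and arms a timer for each deadline it finds.
 */
struct pvtimer_vcpu_event {		/* per-vCPU record; layout assumed */
	u64 expire_tsc;			/* guest-written tsc deadline, 0 = none pending */
};

static int pvtimer_scan_thread(void *data)
{
	struct pvtimer_vcpu_event *ev = data;	/* mapped guest/host shared page */

	while (!kthread_should_stop()) {
		int i;

		for (i = 0; i < KVM_MAX_VCPUS; i++) {
			/* consume the record atomically so the guest can rearm it */
			u64 deadline = xchg(&ev[i].expire_tsc, 0);

			if (deadline)
				pvtimer_arm_for_vcpu(i, deadline);	/* assumed helper */
		}
		/* the scan period trades injection latency against host CPU use */
		schedule_timeout_interruptible(usecs_to_jiffies(50));
	}
	return 0;
}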
Re: [PATCH] KVM: x86: avoid unnecessary XSETBV on guest entry
On 2017/12/13 20:51, Paolo Bonzini wrote:

xsetbv can be expensive when running on nested virtualization, try to avoid it.

Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
---
 arch/x86/kvm/x86.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0f82e2cbf64c..daa1918031df 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -702,7 +702,8 @@ static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
 	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) && !vcpu->guest_xcr0_loaded) {
 		/* kvm_set_xcr() also depends on this */
-		xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
+		if (vcpu->arch.xcr0 != host_xcr0)
+			xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
 		vcpu->guest_xcr0_loaded = 1;
 	}
 }

Reviewed-by: Quan Xu <quan@gmail.com>
Re: [PATCH RFC 0/7] kvm pvtimer
On 2017/12/14 00:28, Konrad Rzeszutek Wilk wrote: On Wed, Dec 13, 2017 at 11:25:13PM +0800, Quan Xu wrote: On Fri, Dec 8, 2017 at 11:10 PM, Konrad Rzeszutek Wilk <konrad.w...@oracle.com> wrote: On Fri, Dec 08, 2017 at 04:39:43PM +0800, Quan Xu wrote: From: Ben Luo <bn0...@gmail.com>

This patchset introduces a new paravirtualized mechanism to reduce the VM-exits caused by guest timer accesses.

And how bad is this blip in arming the timer? And how often do you get this timer to be armed? Or better yet - what are the workloads in which you found this VMExit to be painful? Thanks.

Or better yet - what are the workloads in which you found this VMExit to be painful?

One painful point is the VM idle path.. for some network req/resp services, or benchmarks of process context switches..

So:

1) VM idle path and network req/resp services:

Does this go away if you don't hit the idle path? Meaning if you loop without hitting HLT/MWAIT?

We still hit HLT.. we can use it together with https://lkml.org/lkml/2017/8/29/279 ..

I am assuming the issue you are facing is the latency - that is, the first time the guest comes out of HLT and responds to the packet, the latency is much higher than without?

Yes.

And the arming of the timer?

2) process context switches.

Is that related to 1)? That is, the 'schedule' call and the process going to sleep waiting for an interrupt or timer?

This all sounds like issues with low-CPU-usage workloads where you need low latency responses?

Yes, it is also helpful for some timer-intensive services.

Quan
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 17:53, Thomas Gleixner wrote: On Thu, 16 Nov 2017, Quan Xu wrote: On 2017-11-16 06:03, Thomas Gleixner wrote:

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 		target_state = &drv->states[index];
 	}
+#ifdef CONFIG_PARAVIRT
+	paravirt_idle_poll();
+
+	if (need_resched())
+		return -EBUSY;
+#endif

That's just plain wrong. We don't want to see any of this PARAVIRT crap in anything outside the architecture/hypervisor interfacing code which really needs it.

The problem can and must be solved at the generic level in the first place to gather the data which can be used to make such decisions.

How that information is used might be either completely generic or requires system specific variants. But as long as we don't have any information at all we cannot discuss that.

Please sit down and write up which data needs to be considered to make decisions about probabilistic polling. Then we need to compare and contrast that with the data which is necessary to make power/idle state decisions.

I would be very surprised if this data would not overlap by at least 90%.

Peter, tglx, thanks for your comments..

Rethinking this patch set:

1. which data needs to be considered to make decisions about probabilistic polling

I really need to write up which data needs to be considered to make decisions about probabilistic polling.

Over the last several months, I always focused on the data _from idle to reschedule_, in order to bypass the idle loops. Unfortunately, this makes me touch scheduler/idle/nohz code inevitably.

With tglx's suggestion, the data which is necessary to make power/idle state decisions is the last idle state's residency time. IIUC this data is the duration from idle to wakeup, which may be caused by a reschedule irq or another irq.

I also tested that the reschedule irq overlaps by more than 90% (tracing the need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for one minute.

Given that overlap, I think I can feed the last idle state's residency time into the decisions about probabilistic polling, as @dev->last_residency does. It is much easier to get that data.

2. do a HV specific idle driver (function)

So far, power management is not exposed to the guest.. idle is simple for a KVM guest: calling "sti" / "hlt" (cpuidle_idle_call() --> default_idle_call())..

Thanks to the Xen guys, who have implemented the paravirt framework, I can implement it as easily as the following:

--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void)
 	update_intr_gate(X86_TRAP_PF, async_page_fault);
 }
+static __cpuidle void kvm_safe_halt(void)
+{
+	/* 1. POLL, if need_resched() --> return */
+
+	asm volatile("sti; hlt": : :"memory");	/* 2. halt */
+
+	/* 3. get the last idle state's residency time */
+
+	/* 4. update poll duration based on last idle state's residency time */
+}
+
 void __init kvm_guest_init(void)
 {
 	int i;
@@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
 	if (kvmclock_vsyscall)
 		kvm_setup_vsyscall_timeinfo();
+	pv_irq_ops.safe_halt = kvm_safe_halt;
+
 #ifdef CONFIG_SMP

Then I have no need to introduce a new pvops, and never need to modify scheduler/idle/nohz code again. Also I can narrow all of the code down into arch/x86/kernel/kvm.c.

If this is the right direction, I will send a new patch set next week..

Thanks,

Quan
Alibaba Cloud
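Filling in the four numbered steps of the skeleton above, one possible shape of such a kvm_safe_halt() is sketched below. The per-CPU poll_duration_ns bookkeeping and the update_poll_duration() helper are assumptions based on the earlier patches in this series, not settled code.

/*
 * Sketch of the idea outlined in steps 1-4 above; illustrative only.
 */
static DEFINE_PER_CPU(u64, poll_duration_ns);

static __cpuidle void kvm_safe_halt(void)
{
	u64 poll_ns = this_cpu_read(poll_duration_ns);
	u64 start = ktime_get_ns(), halt_start, residency;

	/* 1. poll: a pending reschedule lets us skip hlt (and its VM-exit) */
	while (ktime_get_ns() - start < poll_ns) {
		if (need_resched()) {
			local_irq_enable();	/* safe_halt is expected to leave irqs on */
			return;
		}
		cpu_relax();
	}

	/* 2. halt */
	halt_start = ktime_get_ns();
	asm volatile("sti; hlt" : : : "memory");

	/* 3. last idle state's residency time */
	residency = ktime_get_ns() - halt_start;

	/* 4. feed the residency into a grow/shrink rule for the next poll
	 *    window; update_poll_duration() is an assumed helper here */
	this_cpu_write(poll_duration_ns, update_poll_duration(poll_ns, residency));
}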
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-17 19:36, Thomas Gleixner wrote: On Fri, 17 Nov 2017, Quan Xu wrote: On 2017-11-16 17:53, Thomas Gleixner wrote:

That's just plain wrong. We don't want to see any of this PARAVIRT crap in anything outside the architecture/hypervisor interfacing code which really needs it.

The problem can and must be solved at the generic level in the first place to gather the data which can be used to make such decisions.

How that information is used might be either completely generic or requires system specific variants. But as long as we don't have any information at all we cannot discuss that.

Please sit down and write up which data needs to be considered to make decisions about probabilistic polling. Then we need to compare and contrast that with the data which is necessary to make power/idle state decisions.

I would be very surprised if this data would not overlap by at least 90%.

1. which data needs to be considered to make decisions about probabilistic polling

I really need to write up which data needs to be considered to make decisions about probabilistic polling.

Over the last several months, I always focused on the data _from idle to reschedule_, in order to bypass the idle loops. Unfortunately, this makes me touch scheduler/idle/nohz code inevitably.

With tglx's suggestion, the data which is necessary to make power/idle state decisions is the last idle state's residency time. IIUC this data is the duration from idle to wakeup, which may be caused by a reschedule irq or another irq.

That's part of the picture, but not complete.

tglx, could you share more? I am very curious about it..

I also tested that the reschedule irq overlaps by more than 90% (tracing the need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for one minute.

Given that overlap, I think I can feed the last idle state's residency time into the decisions about probabilistic polling, as @dev->last_residency does. It is much easier to get that data.

That's only true for your particular use case.

2. do a HV specific idle driver (function)

So far, power management is not exposed to the guest.. idle is simple for a KVM guest: calling "sti" / "hlt" (cpuidle_idle_call() --> default_idle_call())..

Thanks to the Xen guys, who have implemented the paravirt framework, I can implement it as easily as the following:

--- a/arch/x86/kernel/kvm.c

Your email client is using a very strange formatting.

My bad, I inserted spaces to highlight the code.

This is definitely better than what you proposed so far and implementing it as a proof of concept seems to be worthwhile. But I doubt that this is the final solution. It's not generic and not necessarily suitable for all use case scenarios.

Yes, I am exhausted :):) Could you tell me what the gap is to being generic and suitable for all use case scenarios? Is it the lack of irq/idle predictors?

I really want to upstream it for all public cloud users/providers.. As the kvm host side has a similar mechanism, is it possible to upstream it with the following conditions?:

1) add a QEMU configuration option to enable/disable it, disabled by default.
2) add some "TODO" comments near the code.
3) ...

Anyway, thanks for your help..

Quan
Alibaba Cloud
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-16 18:21, Paolo Bonzini wrote: On 16/11/2017 06:06, Quan Xu wrote:

when the vcpu thread is scheduled out, the pkru value in current->thread.fpu.state may be the host pkru value, instead of the guest pkru value (of course, this _assumes_ that the pkru is in current->thread.fpu.state as well). In this way, the pkru may be a corner case.

Rik may correct me, but I think this is not possible. Preemption is disabled all the time while PKRU = guest_pkru (which is only during vmx_vcpu_run).

Refer to the following code:

vcpu_enter_guest()
{
	...
	preempt_disable();
	...
	kvm_x86_ops->run(vcpu);	/* actually it is vmx_vcpu_run() */
	...
	preempt_enable();
}

preempt_disable() is called before kvm_x86_ops->run.. In kvm_x86_ops->run, the pkru is restored with host_pkru (IF guest_pkru != host_pkru). All of this happens under preempt_disable(). So it is not true that preemption is disabled all the time while PKRU = guest_pkru..

However, it seems there is still some gap.. As Rik said, "at context switch time, the context switch code will save the guest FPU state to current->thread.fpu when the VCPU thread is scheduled out."

After preempt_enable() in vcpu_enter_guest(), the vcpu thread is scheduled out. In kvm_x86_ops->run, the PKRU has been restored with host_pkru (IF guest_pkru != host_pkru).. instead of guest_pkru.. Then the PKRU is host_pkru; how do we save guest_pkru to current->thread.fpu? As mentioned, all this _assumes_ that the pkru is in current->thread.fpu.state as well.

Thanks,

Quan
Alibaba Cloud

Context switching will only happen in vcpu_enter_guest() after preempt_enable() for a preemptible kernel, or in vcpu_run via cond_resched() for a non-preemptible kernel.

Thanks,

Paolo

VM migration again: in case source_host_pkru_value != guest_pkru_value and target_host_pkru_value == guest_pkru_value.. the pkru status would be inconsistent..
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-16 20:18, Paolo Bonzini wrote: On 16/11/2017 13:12, Quan Xu wrote:

However, it seems there is still some gap.. As Rik said, "at context switch time, the context switch code will save the guest FPU state to current->thread.fpu when the VCPU thread is scheduled out."

By "guest FPU state" Rik means "guest FPU with host PKRU".

:( actually it is host_pkru, just with different names..

Guest PKRU is always stored in vcpu->arch.pkru rather than in the guest FPU state, so guest PKRU will never be in current->thread.fpu.state either. KVM_GET_XSAVE will return the guest FPU state together with vcpu->arch.pkru, and migration will work properly.

Agreed, I'll fix it.. but that's why I am concerned.. there are so many ways to write PKRU with host_pkru/guest_pkru.. After migration, KVM_GET|SET_XSAVE restores the right pkru.. but we introduce another path:

-- When the VCPU thread is scheduled back in, the context switch code will restore current->thread.fpu to the FPU registers.

There is still a window in which current->thread.fpu is restored to the FPU registers before entering guest mode and before preempt_disable(). On the target machine, after migration, the pkru value in current->thread.fpu is source_host_pkru. In case source_host_pkru_value != guest_pkru_value and target_host_pkru_value == guest_pkru_value.. source_host_pkru_value may be restored to PKRU.. making the pkru status inconsistent..

Thanks

Quan
Alibaba Cloud

Thanks,

Paolo

After preempt_enable() in vcpu_enter_guest(), the vcpu thread is scheduled out. In kvm_x86_ops->run, the PKRU has been restored with host_pkru (IF guest_pkru != host_pkru).. instead of guest_pkru.. Then the PKRU is host_pkru; how do we save guest_pkru to current->thread.fpu? As mentioned, all this _assumes_ that the pkru is in current->thread.fpu.state as well.

Thanks,

Quan
Alibaba Cloud

Context switching will only happen in vcpu_enter_guest() after preempt_enable() for a preemptible kernel, or in vcpu_run via cond_resched() for a non-preemptible kernel.

Thanks,

Paolo

VM migration again: in case source_host_pkru_value != guest_pkru_value and target_host_pkru_value == guest_pkru_value.. the pkru status would be inconsistent..
Re: [PATCH RFC v3 4/6] Documentation: Add three sysctls for smart idle poll
On 2017/11/13 23:08, Ingo Molnar wrote: * Quan Xu <quan.x...@gmail.com> wrote: From: Quan Xu <quan@gmail.com>

To reduce the cost of the poll, we introduce three sysctls to control the poll time when running as a virtual machine with paravirt.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
---
 Documentation/sysctl/kernel.txt | 35 +++
 arch/x86/kernel/paravirt.c      |  4
 include/linux/kernel.h          |  6 ++
 kernel/sysctl.c                 | 34 ++
 4 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 694968c..30c25fb 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -714,6 +714,41 @@ kernel tries to allocate a number starting from this one.

==

paravirt_poll_grow: (X86 only)

Multiplier used to increase the poll time. This is expected to take effect only when running as a virtual machine with CONFIG_PARAVIRT enabled. This can't bring any benefit on bare metal even with CONFIG_PARAVIRT enabled.

By default this value is 2. Possible values to set are in range {2..16}.

==

paravirt_poll_shrink: (X86 only)

Divisor used to reduce the poll time. This is expected to take effect only when running as a virtual machine with CONFIG_PARAVIRT enabled. This can't bring any benefit on bare metal even with CONFIG_PARAVIRT enabled.

By default this value is 2. Possible values to set are in range {2..16}.

==

paravirt_poll_threshold_ns: (X86 only)

Controls the maximum poll time before entering the real idle path. This is expected to take effect only when running as a virtual machine with CONFIG_PARAVIRT enabled. This can't bring any benefit on bare metal even with CONFIG_PARAVIRT enabled.

By default this value is 0, which means do not poll. Possible values to set are in range {0..50}. Change the value to non-zero if running latency-bound workloads in a virtual machine.

I absolutely hate it how this hybrid idle loop polling mechanism is not self-tuning!

Ingo, actually it is self-tuning..

Please make it all work fine by default, and automatically so, instead of adding three random parameters...

.. I will make it all fine by default. However, cloud environments are diverse; could I keep only the paravirt_poll_threshold_ns parameter (the maximum poll time), which is similar to the "adaptive halt-polling" Wanpeng mentioned? Then the user can turn it off, or find an appropriate threshold for some odd scenario..

Thanks for your comments!!

Quan
Alibaba Cloud

And if it cannot be done automatically then we should rather not do it at all. Maybe the next submitter of a similar feature can think of a better approach.

Thanks,

Ingo
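As a reading aid, the way the three knobs are meant to interact can be expressed as a small update rule. This is an interpretation of the documentation above, not the exact code in paravirt.c; block_ns stands for how long the CPU actually stayed idle last time, and the helper name is invented.

/*
 * Sketch of the intended self-tuning rule for the three sysctls
 * documented above; an interpretation, not the patch itself.
 */
unsigned int paravirt_poll_grow = 2;		/* sysctl */
unsigned int paravirt_poll_shrink = 2;		/* sysctl */
unsigned int paravirt_poll_threshold_ns;	/* sysctl, 0 = no poll */

static unsigned int update_poll_ns(unsigned int poll_ns, unsigned int block_ns)
{
	if (block_ns <= paravirt_poll_threshold_ns) {
		/* wakeup came quickly: polling longer may avoid the halt */
		poll_ns = poll_ns ? poll_ns * paravirt_poll_grow : block_ns;
		if (poll_ns > paravirt_poll_threshold_ns)
			poll_ns = paravirt_poll_threshold_ns;
	} else {
		/* long idle period: shrink the poll window to save CPU */
		poll_ns /= paravirt_poll_shrink;
	}

	return poll_ns;
}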
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 15:12, Wanpeng Li wrote: 2017-11-14 15:02 GMT+08:00 Quan Xu <quan@gmail.com>: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu <quan@gmail.com>

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in the idle path, which will poll for a while before we enter the real idle state.

In virtualization, the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which will hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

Our solution is to poll for a while and not enter the real idle path if we get a schedule event during polling. Polling may waste CPU, so we adopt a smart polling mechanism to reduce useless polling.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Juergen Gross <jgr...@suse.com>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-de...@lists.xenproject.org

Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary?

Juergen, here is the data we get when running benchmark netperf:

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   29031.6 bit/s -- 76.1 %CPU

2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
   35787.7 bit/s -- 129.4 %CPU

3. w/ kvm dynamic poll:
   35735.6 bit/s -- 200.0 %CPU

Actually we can reduce the CPU utilization by sleeping for a period of time, as has already been done in the poll logic of the IO subsystem; then we can improve the algorithm in kvm instead of introducing another, duplicate one in the kvm guest.

We really appreciate upstream's kvm dynamic poll mechanism, which is really helpful for a lot of scenarios..

However, as the description said, in virtualization the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

For upstream's kvm dynamic poll mechanism, even if you could provide a better algorithm, how could you bypass the timer access (LAPIC timer or TSC deadline timer), or the hardware context switch between virtual machine and hypervisor? I know this is a tradeoff.

Furthermore, here is the data we get when running the benchmark contextswitch to measure the latency (lower is better):

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   3402.9 ns/ctxsw -- 199.8 %CPU

2. w/ patch and disable kvm dynamic poll:
   1163.5 ns/ctxsw -- 205.5 %CPU

3. w/ kvm dynamic poll:
   2280.6 ns/ctxsw -- 199.5 %CPU

So these two solutions are quite similar, but not duplicates.. That's also why we add a generic idle poll before entering the real idle path: when a reschedule event is pending, we can bypass the real idle path.

Quan
Alibaba Cloud

Regards,
Wanpeng Li

4. w/ patch and w/ kvm dynamic poll:
   42225.3 bit/s -- 198.7 %CPU

5. idle=poll:
   37081.7 bit/s -- 998.1 %CPU

w/ this patch, we improve performance by 23%.. and we could even improve performance by 45.4% if we use the patch together with kvm dynamic poll. Also the cost of CPU is much lower than in the 'idle=poll' case..

Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too.

I assume this feature will be ported to other archs.. a new pvops makes the code clean and easy to maintain. Also I tried to add it into an existing pvops, but it doesn't match.

Quan
Alibaba Cloud

Juergen
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu <quan@gmail.com>

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in the idle path, which will poll for a while before we enter the real idle state.

In virtualization, the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which will hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

Our solution is to poll for a while and not enter the real idle path if we get a schedule event during polling. Polling may waste CPU, so we adopt a smart polling mechanism to reduce useless polling.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Juergen Gross <jgr...@suse.com>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-de...@lists.xenproject.org

Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary?

Juergen, here is the data we get when running benchmark netperf:

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   29031.6 bit/s -- 76.1 %CPU

2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
   35787.7 bit/s -- 129.4 %CPU

3. w/ kvm dynamic poll:
   35735.6 bit/s -- 200.0 %CPU

4. w/ patch and w/ kvm dynamic poll:
   42225.3 bit/s -- 198.7 %CPU

5. idle=poll:
   37081.7 bit/s -- 998.1 %CPU

w/ this patch, we improve performance by 23%.. and we could even improve performance by 45.4% if we use the patch together with kvm dynamic poll. Also the cost of CPU is much lower than in the 'idle=poll' case..

Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too.

I assume this feature will be ported to other archs.. a new pvops makes the code clean and easy to maintain. Also I tried to add it into an existing pvops, but it doesn't match.

Quan
Alibaba Cloud

Juergen
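For readers who have not seen the rest of the series, the pvops plumbing being debated is roughly the following. It is a sketch based on the description above, with simplified declarations; kvm_idle_poll() is an assumed name for the KVM guest poll routine.

/*
 * Rough shape of the pvops plumbing discussed in this thread
 * (illustrative, not the exact patch): a tiny ops structure whose
 * .poll member is called from the idle path, with KVM installing its
 * own poll routine at guest init time.
 */
struct pv_idle_ops {
	void (*poll)(void);
};

struct pv_idle_ops pv_idle_ops = {
	.poll = paravirt_nop,		/* harmless default on bare metal */
};

static inline void paravirt_idle_poll(void)
{
	pv_idle_ops.poll();
}

/* KVM guest init (sketch): switch in the real poll routine */
static void __init kvm_guest_idle_init(void)
{
	if (kvm_para_available())
		pv_idle_ops.poll = kvm_idle_poll;	/* assumed guest poll routine */
}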
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 18:27, Juergen Gross wrote: On 14/11/17 10:38, Quan Xu wrote: On 2017/11/14 15:30, Juergen Gross wrote: On 14/11/17 08:02, Quan Xu wrote: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu <quan@gmail.com>

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in the idle path, which will poll for a while before we enter the real idle state.

In virtualization, the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which will hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

Our solution is to poll for a while and not enter the real idle path if we get a schedule event during polling. Polling may waste CPU, so we adopt a smart polling mechanism to reduce useless polling.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Juergen Gross <jgr...@suse.com>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-de...@lists.xenproject.org

Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary?

Juergen, here is the data we get when running benchmark netperf:

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   29031.6 bit/s -- 76.1 %CPU

2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
   35787.7 bit/s -- 129.4 %CPU

3. w/ kvm dynamic poll:
   35735.6 bit/s -- 200.0 %CPU

4. w/ patch and w/ kvm dynamic poll:
   42225.3 bit/s -- 198.7 %CPU

5. idle=poll:
   37081.7 bit/s -- 998.1 %CPU

w/ this patch, we improve performance by 23%.. and we could even improve performance by 45.4% if we use the patch together with kvm dynamic poll. Also the cost of CPU is much lower than in the 'idle=poll' case..

I don't question the general idea. I just think pvops isn't the best way to implement it.

Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too.

I assume this feature will be ported to other archs..

(sorry, a typo.. s/other archs/other hypervisors/ -- it refers to hypervisors like Xen, HyperV and VMware..)

A new pvops makes the code clean and easy to maintain. Also I tried to add it into an existing pvops, but it doesn't match.

You are aware that pvops is x86 only?

Yes, I'm aware..

I really don't see the big difference in maintainability compared to the static key / function pointer variant:

void (*guest_idle_poll_func)(void);
struct static_key guest_idle_poll_key __read_mostly;

static inline void guest_idle_poll(void)
{
	if (static_key_false(&guest_idle_poll_key))
		guest_idle_poll_func();
}

Thank you for your sample code :) I agree there is no big difference.. I think we are discussing two things:

1) x86 VMs on different hypervisors
2) VMs of different archs on the kvm hypervisor

What I want to do is x86 VMs on different hypervisors, such as kvm / xen / hyperv ..

Why limit the solution to x86 if the more general solution isn't harder? As you didn't give any reason why the pvops approach is better, other than that you don't care about non-x86 platforms, you won't get an "Ack" from me for this patch.

It just looks a little odder to me. I understand you care about non-x86 archs.

Are you aware of 'pv_time_ops' for the arm64/arm/x86 archs, defined in:
- arch/arm64/include/asm/paravirt.h
- arch/x86/include/asm/paravirt_types.h
- arch/arm/include/asm/paravirt.h

I am unfamiliar with arm code. IIUC, if you'd implement pv_idle_ops for the arm/arm64 arch, you'd define the same structure in
- arch/arm64/include/asm/paravirt.h or
- arch/arm/include/asm/paravirt.h
.. instead of a static key / function, then implement a real function in
- arch/arm/kernel/paravirt.c.

Also I wonder HOW/WHERE to define a static key/function that benefits both x86 and non-x86 archs?

Quan
Alibaba Cloud

And KVM would just need to set guest_idle_poll_func and enable the static key. Works on non-x86 architectures, too.

.. Referring to 'pv_mmu_ops', HyperV and Xen can implement their own functions for 'pv_mmu_ops'. I think it is the same for pv_idle_ops.

With the above explanation, do you still think I need to define the static key/function pointer variant?

btw, any interest in porting it to Xen HVM guests? :)

Maybe. But this should work for Xen on ARM, too.

Juergen
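Building on Juergen's snippet quoted above, the KVM-guest-side registration he describes could look like the sketch below; kvm_idle_poll() is an assumed poll routine, and the declarations are the ones from his example.

/*
 * Illustrative only: what "KVM would just need to set
 * guest_idle_poll_func and enable the static key" could look like,
 * reusing the guest_idle_poll_func / guest_idle_poll_key declarations
 * from the snippet above.
 */
static void __init kvm_register_idle_poll(void)
{
	guest_idle_poll_func = kvm_idle_poll;		/* assumed KVM poll routine */
	static_key_slow_inc(&guest_idle_poll_key);	/* flip the fast-path check on */
}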
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 15:30, Juergen Gross wrote: On 14/11/17 08:02, Quan Xu wrote: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu <quan@gmail.com>

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in the idle path, which will poll for a while before we enter the real idle state.

In virtualization, the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which will hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

Our solution is to poll for a while and not enter the real idle path if we get a schedule event during polling. Polling may waste CPU, so we adopt a smart polling mechanism to reduce useless polling.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Juergen Gross <jgr...@suse.com>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-de...@lists.xenproject.org

Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary?

Juergen, here is the data we get when running benchmark netperf:

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   29031.6 bit/s -- 76.1 %CPU

2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
   35787.7 bit/s -- 129.4 %CPU

3. w/ kvm dynamic poll:
   35735.6 bit/s -- 200.0 %CPU

4. w/ patch and w/ kvm dynamic poll:
   42225.3 bit/s -- 198.7 %CPU

5. idle=poll:
   37081.7 bit/s -- 998.1 %CPU

w/ this patch, we improve performance by 23%.. and we could even improve performance by 45.4% if we use the patch together with kvm dynamic poll. Also the cost of CPU is much lower than in the 'idle=poll' case..

I don't question the general idea. I just think pvops isn't the best way to implement it.

Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too.

I assume this feature will be ported to other archs..

(sorry, a typo.. s/other archs/other hypervisors/ -- it refers to hypervisors like Xen, HyperV and VMware..)

A new pvops makes the code clean and easy to maintain. Also I tried to add it into an existing pvops, but it doesn't match.

You are aware that pvops is x86 only?

Yes, I'm aware..

I really don't see the big difference in maintainability compared to the static key / function pointer variant:

void (*guest_idle_poll_func)(void);
struct static_key guest_idle_poll_key __read_mostly;

static inline void guest_idle_poll(void)
{
	if (static_key_false(&guest_idle_poll_key))
		guest_idle_poll_func();
}

Thank you for your sample code :) I agree there is no big difference.. I think we are discussing two things:

1) x86 VMs on different hypervisors
2) VMs of different archs on the kvm hypervisor

What I want to do is x86 VMs on different hypervisors, such as kvm / xen / hyperv ..

And KVM would just need to set guest_idle_poll_func and enable the static key. Works on non-x86 architectures, too.

.. Referring to 'pv_mmu_ops', HyperV and Xen can implement their own functions for 'pv_mmu_ops'. I think it is the same for pv_idle_ops.

With the above explanation, do you still think I need to define the static key/function pointer variant?

btw, any interest in porting it to Xen HVM guests? :)

Quan
Alibaba Cloud
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 16:22, Wanpeng Li wrote: 2017-11-14 16:15 GMT+08:00 Quan Xu <quan@gmail.com>: On 2017/11/14 15:12, Wanpeng Li wrote: 2017-11-14 15:02 GMT+08:00 Quan Xu <quan@gmail.com>: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu <quan@gmail.com>

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in the idle path, which will poll for a while before we enter the real idle state.

In virtualization, the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which will hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

Our solution is to poll for a while and not enter the real idle path if we get a schedule event during polling. Polling may waste CPU, so we adopt a smart polling mechanism to reduce useless polling.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Juergen Gross <jgr...@suse.com>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-de...@lists.xenproject.org

Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary?

Juergen, here is the data we get when running benchmark netperf:

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   29031.6 bit/s -- 76.1 %CPU

2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
   35787.7 bit/s -- 129.4 %CPU

3. w/ kvm dynamic poll:
   35735.6 bit/s -- 200.0 %CPU

Actually we can reduce the CPU utilization by sleeping for a period of time, as has already been done in the poll logic of the IO subsystem; then we can improve the algorithm in kvm instead of introducing another, duplicate one in the kvm guest.

We really appreciate upstream's kvm dynamic poll mechanism, which is really helpful for a lot of scenarios..

However, as the description said, in virtualization the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

For upstream's kvm dynamic poll mechanism, even if you could provide a better algorithm, how could you bypass the timer access (LAPIC timer or TSC deadline timer), or the hardware context switch between virtual machine and hypervisor? I know this is a tradeoff.

Furthermore, here is the data we get when running the benchmark contextswitch to measure the latency (lower is better):

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   3402.9 ns/ctxsw -- 199.8 %CPU

2. w/ patch and disable kvm dynamic poll:
   1163.5 ns/ctxsw -- 205.5 %CPU

3. w/ kvm dynamic poll:
   2280.6 ns/ctxsw -- 199.5 %CPU

So these two solutions are quite similar, but not duplicates.. That's also why we add a generic idle poll before entering the real idle path: when a reschedule event is pending, we can bypass the real idle path.

There is a similar logic in the idle governor/driver, so how does this patchset influence the decision in the idle governor/driver when running on bare metal (power management is not exposed to the guest, so we will not enter the idle driver in the guest)?

This is expected to take effect only when running as a virtual machine with the proper CONFIG_* enabled. This cannot work on bare metal even with the proper CONFIG_* enabled.

Quan
Alibaba Cloud
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 17:45, Daniel Lezcano wrote: On 16/11/2017 10:12, Quan Xu wrote: On 2017-11-16 06:03, Thomas Gleixner wrote: On Wed, 15 Nov 2017, Peter Zijlstra wrote: On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote: From: Yang Zhang <yang.zhang...@gmail.com>

Implement a generic idle poll which resembles the functionality found in arch/. Provide a weak arch_cpu_idle_poll function which can be overridden by the architecture code if needed.

No, we want less of those magic hooks, not more.

Interrupts arrive which may not cause a reschedule in idle loops. In a KVM guest, this costs several VM-exit/VM-entry cycles: a VM-entry for the interrupt and a VM-exit immediately afterwards. This also becomes more expensive than on bare metal. Add a generic idle poll before entering the real idle path. When a reschedule event is pending, we can bypass the real idle path.

Why not do a HV specific idle driver?

If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. See commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit looks similar. But let's not proliferate that. I'd rather see that go away.

Agreed. Even though we can get more benefit than commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu") in a kvm guest, I won't proliferate that..

But the irq_timings stuff is heading in the same direction, with a more complex prediction logic which should tell you pretty well how long that idle period is going to be, and in case of an interrupt-heavy workload this would skip the extra work of stopping and restarting the tick and provide a very good input into a polling decision.

Interesting. I have tested with the IRQ_TIMINGS related code, which seems not to be working so far.

I don't know how you tested it, can you elaborate what you meant by "seems not working so far"?

Daniel, I tried to enable IRQ_TIMINGS* manually and used irq_timings_next_event() to return an estimation of the earliest interrupt. However I got a constant.

There is still some work to do to be more efficient. The prediction based on the irq timings is all right if the interrupts have a simple periodicity. But as soon as there is a pattern, the current code can't handle it properly and makes bad predictions.

I'm working on a self-learning pattern detection which is too heavy for the kernel, and with it we should be able to detect the patterns properly and re-adjust the period if it changes. I'm in the process of making it suitable for kernel code (both math and perf).

One improvement which can be done right now, and which can help you, is the interrupt rate on the CPU. It is possible to compute it, and that will give accurate information for the polling decision.

As tglx said, talk to each other / work together to make it usable for all use cases.

Could you share how to enable it to get the interrupt rate on the CPU? I can try it in a cloud scenario. Of course, I'd like to work with you to improve it.

Quan
Alibaba Cloud
Re: [Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support
On 2017-11-16 05:31, Konrad Rzeszutek Wilk wrote: On Mon, Nov 13, 2017 at 06:05:59PM +0800, Quan Xu wrote: From: Yang Zhang <yang.zhang...@gmail.com> Some latency-intensive workloads have seen an obvious performance drop when running inside a VM. The main reason is that the overhead is amplified when running inside a VM. The largest cost I have seen is inside the idle path. Meaning a VMEXIT b/c it is a 'halt' operation? And then going back in the guest (VMRESUME) takes time. And hence your latency gets all whacked b/c of this? Konrad, I can't follow 'b/c' here.. sorry. So if I understand - you want to use your _full_ timeslice (of the guest) without ever (or as much as possible) going into the hypervisor? as much as possible. Which means in effect you don't care about power-saving or CPUfreq savings, you just want to eat the full CPU for snack? actually, we care about power-saving. The poll duration is self-tuning, otherwise it would be almost the same as 'halt=poll'. Also, we always send out the CPU usage of the netperf/ctxsw benchmarks along with the results. We got much better performance with only a limited increase in CPU usage. This patch introduces a new mechanism to poll for a while before entering the idle state. If a reschedule is needed during the poll, then we don't need to go through the heavy overhead path. Schedule of what? The guest or the host? a reschedule by the guest scheduler.. it is the guest. Quan Alibaba Cloud
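To make the 'self-tuning' point concrete, the poll step in front of halt is roughly shaped as below; the names and the grow/shrink policy are simplified for illustration, not copied from the series.

static u64 poll_duration_ns;			/* per-CPU and tunable in the real series */
static u64 poll_threshold_ns = 200000;		/* illustrative cap on the poll window */

static void guest_idle_poll(void)
{
	u64 start = ktime_get_ns();
	bool found_work = false;

	while (ktime_get_ns() - start < poll_duration_ns) {
		if (need_resched()) {
			found_work = true;
			break;
		}
		cpu_relax();
	}

	/* self-tuning: grow the window when polling paid off, shrink it when it was wasted */
	if (found_work) {
		poll_duration_ns *= 2;
		if (poll_duration_ns > poll_threshold_ns)
			poll_duration_ns = poll_threshold_ns;
	} else {
		poll_duration_ns /= 2;
	}
}

When no work shows up, the window shrinks before we fall through to halt, which is how the CPU usage stays bounded compared to idle=poll.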
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-16 12:21, Rik van Riel wrote: On Thu, 2017-11-16 at 10:50 +0800, Quan Xu wrote: On 2017-11-15 22:43, Rik van Riel wrote: Can you explain why you believe that? for example, a vcpu thread is running in kvm mode and is stopped under a critical condition. QEMU sends an IPI to cause a VM-exit to happen immediately, and this IPI doesn't make the vcpu return to QEMU. IIUC this vcpu thread will still continue to run in kvm mode when it is woken up on the target machine. with your patch, I don't see a chance to load guest FPU or XSTATE until it returns to QEMU and runs kvm mode again. then the FPU or XSTATE status is inconsistent for a small window, and what's even worse is that the vcpu is running. Did I misunderstand? At context switch time, the context switch code will save the guest FPU state to current->thread.fpu when the VCPU thread is scheduled out. When the VCPU thread is scheduled back in, the context switch code will restore current->thread.fpu to the FPU registers. good catch! Also, as your comment says, PKRU is switched out separately at VMENTER and VMEXIT time, but with a lot of IF conditions.. the pkru may be restored with the host pkru after VMEXIT. when the vcpu thread is scheduled out, the pkru value in current->thread.fpu.state may be the host pkru value instead of the guest pkru value (of course, this _assumes_ that the pkru is in current->thread.fpu.state as well). in this way, the pkru may be a corner case. Consider VM migration again: in case source_host_pkru_value != guest_pkru_value and target_host_pkru_value == guest_pkru_value, the pkru status would be inconsistent.. Quan Alibaba Cloud The VCPU thread will never run with anything else than the guest FPU state, while inside the KVM_RUN code.
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-17 01:50, Paolo Bonzini wrote: On 16/11/2017 15:28, Quan Xu wrote: vcpu->srcu_idx = srcu_read_lock(&vcpu->srcu); + kvm_load_guest_fpu(vcpu); + for (;;) { if (kvm_vcpu_running(vcpu)) { r = vcpu_enter_guest(vcpu); << as Rik dropped kvm_load_guest_fpu(vcpu) in vcpu_enter_guest() .. in case we are still in kvm mode, how do we make sure the pkru is always the right one before entering guest mode, and that we are not preempted before preempt_disable() after migration? :( As you know: 1) kvm_load_guest_fpu doesn't load the guest PKRU, it keeps the userspace PKRU. 2) the guest PKRU is only ever set in a preemption-disabled area Thus, context switch always sees the userspace PKRU. The guest PKRU is only set the next time vmx_vcpu_run executes. Paolo Paolo, thanks for your explanation!! :-) Rik, could you cc me in v2? Quan
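For readers following the PKRU part: the switch Paolo refers to lives in the preemption-disabled VM entry/exit path, roughly like the fragment below (simplified from vmx_vcpu_run(); the conditions in the real code are more involved).

	/* just before VMLAUNCH/VMRESUME, preemption already disabled */
	if (static_cpu_has(X86_FEATURE_PKU) &&
	    vcpu->arch.pkru != vmx->host_pkru)
		__write_pkru(vcpu->arch.pkru);		/* install the guest PKRU */

	/* ... guest runs, VM exit ... */

	if (static_cpu_has(X86_FEATURE_PKU)) {
		vcpu->arch.pkru = __read_pkru();	/* capture what the guest left behind */
		if (vcpu->arch.pkru != vmx->host_pkru)
			__write_pkru(vmx->host_pkru);	/* back to the userspace PKRU before preemption is possible */
	}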
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 06:03, Thomas Gleixner wrote: On Wed, 15 Nov 2017, Peter Zijlstra wrote: On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote: From: Yang Zhang <yang.zhang...@gmail.com> Implement a generic idle poll which resembles the functionality found in arch/. Provide a weak arch_cpu_idle_poll function which can be overridden by the architecture code if needed. No, we want less of those magic hooks, not more. Interrupts arrive which may not cause a reschedule in idle loops. In a KVM guest, this costs several VM-exit/VM-entry cycles: VM-entry for the interrupt and a VM-exit immediately afterwards. This also becomes more expensive than bare metal. Add a generic idle poll before entering the real idle path. When a reschedule event is pending, we can bypass the real idle path. Why not do a HV specific idle driver? If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. See commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit looks similar. But let's not proliferate that. I'd rather see that go away. agreed. Even though we can get more benefit than commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu") in a kvm guest, I won't proliferate that.. But the irq_timings stuff is heading into the same direction, with a more complex prediction logic which should tell you pretty well how long that idle period is going to be, and in case of an interrupt heavy workload this would skip the extra work of stopping and restarting the tick and provide a very good input into a polling decision. interesting. I have tested with the IRQ_TIMINGS related code, which seems not working so far. Also I'd like to help as much as I can. This can be handled either in a HV specific idle driver or even in the generic core code. If the interrupt does not arrive within the predicted time, then you can assume that the flood stopped and invoke halt or whatever. That avoids all of that 'tunable and tweakable' x86 specific hackery and utilizes common functionality which is mostly there already. here is some sample code: poll for a while before entering halt in cpuidle_enter_state(). If I get a reschedule event, then don't try to enter halt. (I hope this is the right direction as Peter mentioned in another email) --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, target_state = &drv->states[index]; } +#ifdef CONFIG_PARAVIRT + paravirt_idle_poll(); + + if (need_resched()) + return -EBUSY; +#endif + /* Take note of the planned idle state. */ sched_idle_set_state(target_state); thanks, Quan Alibaba Cloud
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 16:45, Peter Zijlstra wrote: On Wed, Nov 15, 2017 at 11:03:08PM +0100, Thomas Gleixner wrote: If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. Sure. And there are people who want to do the same for native. Adding more ugly and special cases just isn't the way to go about doing that. I'm fairly sure I've told the various groups that want to tinker with this to work together on this. I've also in fairly significant detail sketched how to rework the idle code and idle predictors. At this point I'm too tired to dig any of that up, so I'll just keep saying no to patches that don't even attempt to go in the right direction. Peter, point taken. I have really considered this factor and tried my best not to interfere with the scheduler/idle code. If the irq_timings code is ready, I can use it directly. I think irq_timings is not an easy task; I'd like to help as much as I can. Also, I won't try to touch the tick_nohz* code again. As tglx suggested, this can be handled either in a HV specific idle driver or even in the generic core code. I hope this is in the right direction. Quan Alibaba Cloud
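For the record, the HV-specific idle driver direction would look roughly like the skeleton below. This is only meant to show the shape, not a working driver: kvm_poll_enter()/kvm_halt_enter() are hypothetical enter callbacks (a bounded poll on need_resched(), and safe_halt() respectively), and the latency numbers are made up.

static int kvm_poll_enter(struct cpuidle_device *dev,
			  struct cpuidle_driver *drv, int index);	/* hypothetical: bounded poll */
static int kvm_halt_enter(struct cpuidle_device *dev,
			  struct cpuidle_driver *drv, int index);	/* hypothetical: safe_halt() */

static struct cpuidle_driver kvm_guest_idle_driver = {
	.name		= "kvm_guest_idle",
	.owner		= THIS_MODULE,
	.states = {
		{
			.name			= "POLL",
			.desc			= "bounded poll on need_resched()",
			.exit_latency		= 0,
			.target_residency	= 0,
			.enter			= kvm_poll_enter,
		},
		{
			.name			= "HLT",
			.desc			= "halt (causes a VM-exit)",
			.exit_latency		= 10,		/* illustrative, in us */
			.target_residency	= 20,
			.enter			= kvm_halt_enter,
		},
	},
	.state_count	= 2,
};

Registration would go through cpuidle_register(&kvm_guest_idle_driver, NULL) from the guest init code, and the existing governor would then pick between polling and halting based on the predicted idle length.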
[PATCH] KVM: VMX: drop I/O permission bitmaps
From: Quan Xu <quan@gmail.com> Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson <jmatt...@google.com> Signed-off-by: Radim Krčmář <rkrc...@redhat.com> Signed-off-by: Quan Xu <quan@gmail.com> --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), FIELD64(MSR_BITMAP, msr_bitmap), FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr), FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr), @@ -943,8 +941,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +954,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3626,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5439,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6741,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); -- 1.7.1
[PATCH RFC 0/7] kvm pvtimer
From: Ben Luo
This patchset introduces a new paravirtualized mechanism to reduce the VM-exits caused by guest timer accesses. In general, a KVM guest programs the tsc-deadline timestamp into the MSR_IA32_TSC_DEADLINE MSR. This causes a VM-exit, and then KVM handles this timer for the guest. Also, KVM always registers the timer on the CPU the vCPU was running on. When the vCPU thread is rescheduled to another CPU, the timer is migrated to the target CPU as well. When the timer expires, the timer interrupt can make a guest-mode vCPU VM-exit on this CPU. When pvtimer is enabled: - The tsc-deadline timestamp is mostly recorded in a share page, with fewer VM-exits. We introduce a periodically working kthread to scan the share page and synchronize the timer setting for the guest on a dedicated CPU. - Since the working kthread scans periodically, some of the timer events may be lost or delayed. We have to program these tsc-deadline timestamps to MSR_IA32_TSC_DEADLINE as normal, which will cause a VM-exit, and KVM will signal the working thread through an IPI to program the timer, instead of registering it on the current CPU. - The timer interrupt will be delivered by the posted interrupt mechanism to vCPUs, with fewer VM-exits. Ben Luo (7): kvm: x86: emulate MSR_KVM_PV_TIMER_EN MSR kvm: x86: add a function to exchange value KVM: timer: synchronize tsc-deadline timestamp for guest KVM: timer: program timer to a dedicated CPU KVM: timer: ignore timer migration if pvtimer is enabled Doc/KVM: introduce a new cpuid bit for kvm pvtimer kvm: guest: reprogram guest timer Documentation/virtual/kvm/cpuid.txt |4 + arch/x86/include/asm/kvm_host.h |5 + arch/x86/include/asm/kvm_para.h |9 ++ arch/x86/include/uapi/asm/kvm_para.h |7 ++ arch/x86/kernel/apic/apic.c |9 +- arch/x86/kernel/kvm.c | 38 arch/x86/kvm/cpuid.c |1 + arch/x86/kvm/lapic.c | 170 +- arch/x86/kvm/lapic.h | 11 ++ arch/x86/kvm/x86.c | 15 +++- include/linux/kvm_host.h |3 + virt/kvm/kvm_main.c | 42 + 12 files changed, 308 insertions(+), 6 deletions(-)
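To condense the guest-side decision described above into one place (patch 7 below has the real code; guest_set_tsc_deadline() is an illustrative wrapper for what lapic_next_deadline() -> kvm_pv_timer_next_event() does):

static void guest_set_tsc_deadline(u64 deadline)
{
	struct pvtimer_vcpu_event_info *sp = this_cpu_ptr(&pvtimer_shared_buf);
	u64 now = rdtsc();

	/* always publish the new deadline in the share page */
	xchg(&sp->expire_tsc, deadline);

	/*
	 * Fall back to the ordinary (exiting) MSR write when the host's
	 * polling kthread will not scan the share page in time, or when
	 * the deadline is already (almost) due.
	 */
	if (deadline < sp->next_sync_tsc ||
	    deadline < now + PVTIMER_PADDING)
		wrmsrl(MSR_IA32_TSC_DEADLINE, deadline);
	/* otherwise: no MSR write, hence no VM-exit */
}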
[PATCH RFC 2/7] kvm: x86: add a function to exchange value
From: Ben Luo <bn0...@gmail.com> Introduce kvm_xchg_guest_cached to exchange value with guest page atomically. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- include/linux/kvm_host.h |3 +++ virt/kvm/kvm_main.c | 42 ++ 2 files changed, 45 insertions(+), 0 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 6882538..32949ed 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -688,6 +688,9 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, void *data, int offset, unsigned long len); int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc, gpa_t gpa, unsigned long len); +unsigned long kvm_xchg_guest_cached(struct kvm *kvm, + struct gfn_to_hva_cache *ghc, unsigned long offset, + unsigned long new, int size); int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len); int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len); struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 9deb5a2..3149e17 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2010,6 +2010,48 @@ int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, } EXPORT_SYMBOL_GPL(kvm_read_guest_cached); +unsigned long kvm_xchg_guest_cached(struct kvm *kvm, + struct gfn_to_hva_cache *ghc, unsigned long offset, + unsigned long new, int size) +{ + unsigned long r; + void *kva; + struct page *page; + kvm_pfn_t pfn; + + WARN_ON(offset > ghc->len); + + pfn = gfn_to_pfn_atomic(kvm, ghc->gpa >> PAGE_SHIFT); + page = kvm_pfn_to_page(pfn); + + if (is_error_page(page)) + return -EFAULT; + + kva = kmap_atomic(page) + offset_in_page(ghc->gpa) + offset; + switch (size) { + case 1: + r = xchg((char *)kva, new); + break; + case 2: + r = xchg((short *)kva, new); + break; + case 4: + r = xchg((int *)kva, new); + break; + case 8: + r = xchg((long *)kva, new); + break; + default: + kunmap_atomic(kva); + return -EFAULT; + } + + kunmap_atomic(kva); + mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT); + + return r; +} + int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len) { const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0))); -- 1.7.1
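For context, the way patch 3 uses this helper is an atomic fetch-and-clear of the deadline the guest published, along these lines:

	/* read the guest's published deadline and clear it in one atomic step */
	expire_tsc = kvm_xchg_guest_cached(vcpu->kvm, &vcpu->arch.pv_timer.data,
			offsetof(struct pvtimer_vcpu_event_info, expire_tsc),
			0UL, 8);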
[PATCH RFC 1/7] kvm: x86: emulate MSR_KVM_PV_TIMER_EN MSR
From: Ben Luo <bn0...@gmail.com> Guest enables pv timer functionality using this MSR Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- arch/x86/include/asm/kvm_host.h |5 + arch/x86/include/uapi/asm/kvm_para.h |6 ++ arch/x86/kvm/lapic.c | 22 ++ arch/x86/kvm/lapic.h |6 ++ arch/x86/kvm/x86.c |8 5 files changed, 47 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c73e493..641b4aa 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -684,6 +684,11 @@ struct kvm_vcpu_arch { bool pv_unhalted; } pv; + struct { + u64 msr_val; + struct gfn_to_hva_cache data; + } pv_timer; + int pending_ioapic_eoi; int pending_external_vector; diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 554aa8f..3dd6116 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -41,6 +41,7 @@ #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 #define MSR_KVM_STEAL_TIME 0x4b564d03 #define MSR_KVM_PV_EOI_EN 0x4b564d04 +#define MSR_KVM_PV_TIMER_EN0x4b564d05 struct kvm_steal_time { __u64 steal; @@ -64,6 +65,11 @@ struct kvm_clock_pairing { #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1))) #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1) +struct pvtimer_vcpu_event_info { + __u64 expire_tsc; + __u64 next_sync_tsc; +} __attribute__((__packed__)); + #define KVM_MAX_MMU_OP_BATCH 32 #define KVM_ASYNC_PF_ENABLED (1 << 0) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 36c90d6..55c9ba3 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1991,6 +1991,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) kvm_lapic_set_base(vcpu, vcpu->arch.apic_base | MSR_IA32_APICBASE_BSP); vcpu->arch.pv_eoi.msr_val = 0; + vcpu->arch.pv_timer.msr_val = 0; apic_update_ppr(apic); if (vcpu->arch.apicv_active) { kvm_x86_ops->apicv_post_state_restore(vcpu); @@ -2478,6 +2479,27 @@ int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data) addr, sizeof(u8)); } +int kvm_lapic_enable_pv_timer(struct kvm_vcpu *vcpu, u64 data) +{ + u64 addr = data & ~KVM_MSR_ENABLED; + int ret; + + if (!lapic_in_kernel(vcpu)) + return 1; + + if (!IS_ALIGNED(addr, 4)) + return 1; + + vcpu->arch.pv_timer.msr_val = data; + if (!pv_timer_enabled(vcpu)) + return 0; + + ret = kvm_gfn_to_hva_cache_init(vcpu->kvm, >arch.pv_timer.data, + addr, sizeof(struct pvtimer_vcpu_event_info)); + + return ret; +} + void kvm_apic_accept_events(struct kvm_vcpu *vcpu) { struct kvm_lapic *apic = vcpu->arch.apic; diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 4b9935a..539a738 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -113,6 +113,7 @@ static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu) } int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data); +int kvm_lapic_enable_pv_timer(struct kvm_vcpu *vcpu, u64 data); void kvm_lapic_init(void); void kvm_lapic_exit(void); @@ -207,6 +208,11 @@ static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu) return lapic_in_kernel(vcpu) && test_bit(KVM_APIC_INIT, >arch.apic->pending_events); } +static inline bool pv_timer_enabled(struct kvm_vcpu *vcpu) +{ + return vcpu->arch.pv_timer.msr_val & KVM_MSR_ENABLED; +} + bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector); void wait_lapic_expire(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/x86.c 
b/arch/x86/kvm/x86.c index 03869eb..5668774 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1025,6 +1025,7 @@ bool kvm_rdpmc(struct kvm_vcpu *vcpu) HV_X64_MSR_STIMER0_CONFIG, HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, MSR_KVM_PV_EOI_EN, + MSR_KVM_PV_TIMER_EN, MSR_IA32_TSC_ADJUST, MSR_IA32_TSCDEADLINE, @@ -2279,6 +2280,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if (kvm_lapic_enable_pv_eoi(vcpu, data)) return 1; break; + case MSR_KVM_PV_TIMER_EN: + if (kvm_lapic_enable_pv_timer(vcpu, data)) + return 1; + break; case MSR
[PATCH RFC 3/7] KVM: timer: synchronize tsc-deadline timestamp for guest
From: Ben Luo <bn0...@gmail.com> In general, KVM guest programs tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR. This will cause a VM-exit, and then KVM handles this timer for guest. The tsc-deadline timestamp is mostly recorded in share page with less VM-exit. We Introduce a periodically working kthread to scan share page and synchronize timer setting for guest on a dedicated CPU. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- arch/x86/kvm/lapic.c | 138 ++ arch/x86/kvm/lapic.h |5 ++ 2 files changed, 143 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 55c9ba3..20a23bb 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -36,6 +36,10 @@ #include #include #include +#include +#include +#include +#include #include "kvm_cache_regs.h" #include "irq.h" #include "trace.h" @@ -70,6 +74,12 @@ #define APIC_BROADCAST 0xFF #define X2APIC_BROADCAST 0xul +static struct hrtimer pv_sync_timer; +static long pv_timer_period_ns = PVTIMER_PERIOD_NS; +static struct task_struct *pv_timer_polling_worker; + +module_param(pv_timer_period_ns, long, 0644); + static inline int apic_test_vector(int vec, void *bitmap) { return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); @@ -2542,8 +2552,130 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu) } } +static enum hrtimer_restart pv_sync_timer_callback(struct hrtimer *timer) +{ + hrtimer_forward_now(timer, ns_to_ktime(pv_timer_period_ns)); + wake_up_process(pv_timer_polling_worker); + + return HRTIMER_RESTART; +} + +void kvm_apic_sync_pv_timer(void *data) +{ + struct kvm_vcpu *vcpu = data; + struct kvm_lapic *apic = vcpu->arch.apic; + unsigned long flags, this_tsc_khz = vcpu->arch.virtual_tsc_khz; + u64 guest_tsc, expire_tsc; + long rem_tsc; + + if (!lapic_in_kernel(vcpu) || !pv_timer_enabled(vcpu)) + return; + + local_irq_save(flags); + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); + rem_tsc = ktime_to_ns(hrtimer_get_remaining(_sync_timer)) + * this_tsc_khz; + if (rem_tsc <= 0) + rem_tsc += pv_timer_period_ns * this_tsc_khz; + do_div(rem_tsc, 100L); + + /* +* make sure guest_tsc and rem_tsc are assigned before to update +* next_sync_tsc. +*/ + smp_wmb(); + kvm_xchg_guest_cached(vcpu->kvm, >arch.pv_timer.data, + offsetof(struct pvtimer_vcpu_event_info, next_sync_tsc), + guest_tsc + rem_tsc, 8); + + /* make sure next_sync_tsc is visible */ + smp_wmb(); + + expire_tsc = kvm_xchg_guest_cached(vcpu->kvm, >arch.pv_timer.data, + offsetof(struct pvtimer_vcpu_event_info, expire_tsc), + 0UL, 8); + + /* make sure expire_tsc is visible */ + smp_wmb(); + + if (expire_tsc) { + if (expire_tsc > guest_tsc) + /* +* As we bind this thread to a dedicated CPU through +* IPI, the timer is registered on that dedicated +* CPU here. 
+*/ + kvm_set_lapic_tscdeadline_msr(apic->vcpu, expire_tsc); + else + /* deliver immediately if expired */ + kvm_apic_local_deliver(apic, APIC_LVTT); + } + local_irq_restore(flags); +} + +static int pv_timer_polling(void *arg) +{ + struct kvm *kvm; + struct kvm_vcpu *vcpu; + int i; + mm_segment_t oldfs = get_fs(); + + while (1) { + set_current_state(TASK_INTERRUPTIBLE); + + if (kthread_should_stop()) { + __set_current_state(TASK_RUNNING); + break; + } + + spin_lock(_lock); + __set_current_state(TASK_RUNNING); + list_for_each_entry(kvm, _list, vm_list) { + set_fs(USER_DS); + use_mm(kvm->mm); + kvm_for_each_vcpu(i, vcpu, kvm) { + kvm_apic_sync_pv_timer(vcpu); + } + unuse_mm(kvm->mm); + set_fs(oldfs); + } + + spin_unlock(_lock); + + schedule(); + } + + return 0; +} + +static void kvm_pv_timer_init(void) +{ + ktime_t ktime = ktime_set(0, pv_timer_period_ns); + + hrtimer_init(_sync_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED); + pv_sync_timer.function = _sync_timer_callback; + + /* kthread for pv_timer
[PATCH RFC 7/7] kvm: guest: reprogram guest timer
From: Ben Luo <bn0...@gmail.com> In general, KVM guest programs tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR. When pvtimer is enabled, we introduce a new mechanism to reprogram KVM guest timer. A periodically working kthread scans share page and synchronize timer setting for guest on a dedicated CPU. The next time event of the periodically working kthread is a threshold to decide whether to program tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR, or to share page. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- arch/x86/include/asm/kvm_para.h |9 + arch/x86/kernel/apic/apic.c |9 ++--- arch/x86/kernel/kvm.c | 38 ++ 3 files changed, 53 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index c373e44..109e706 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -5,6 +5,7 @@ #include #include #include +#include extern void kvmclock_init(void); extern int kvm_register_clock(char *txt); @@ -92,6 +93,8 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1, void kvm_async_pf_task_wait(u32 token, int interrupt_kernel); void kvm_async_pf_task_wake(u32 token); u32 kvm_read_and_reset_pf_reason(void); +int kvm_pv_timer_next_event(unsigned long tsc, + struct clock_event_device *evt); extern void kvm_disable_steal_time(void); #ifdef CONFIG_PARAVIRT_SPINLOCKS @@ -126,6 +129,12 @@ static inline void kvm_disable_steal_time(void) { return; } + +static inline int kvm_pv_timer_next_event(unsigned long tsc, + struct clock_event_device *evt) +{ + return 0; +} #endif #endif /* _ASM_X86_KVM_PARA_H */ diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index ff89177..286c1b3 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -471,10 +471,13 @@ static int lapic_next_event(unsigned long delta, static int lapic_next_deadline(unsigned long delta, struct clock_event_device *evt) { - u64 tsc; + u64 tsc = rdtsc() + (((u64) delta) * TSC_DIVISOR); - tsc = rdtsc(); - wrmsrl(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR)); + /* TODO: undisciplined function call */ + if (kvm_pv_timer_next_event(tsc, evt)) + return 0; + + wrmsrl(MSR_IA32_TSC_DEADLINE, tsc); return 0; } diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 8bb9594..ec7aff1 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -328,6 +328,35 @@ static notrace void kvm_guest_apic_eoi_write(u32 reg, u32 val) apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK); } +static DEFINE_PER_CPU(int, pvtimer_enabled); +static DEFINE_PER_CPU(struct pvtimer_vcpu_event_info, + pvtimer_shared_buf) = {0}; +#define PVTIMER_PADDING25000 +int kvm_pv_timer_next_event(unsigned long tsc, + struct clock_event_device *evt) +{ + struct pvtimer_vcpu_event_info *src; + u64 now; + + if (!this_cpu_read(pvtimer_enabled)) + return false; + + src = this_cpu_ptr(_shared_buf); + xchg((u64 *)>expire_tsc, tsc); + + barrier(); + + if (tsc < src->next_sync_tsc) + return false; + + rdtscll(now); + if (tsc < now || tsc - now < PVTIMER_PADDING) + return false; + + return true; +} +EXPORT_SYMBOL_GPL(kvm_pv_timer_next_event); + static void kvm_guest_cpu_init(void) { if (!kvm_para_available()) @@ -362,6 +391,15 @@ static void kvm_guest_cpu_init(void) if (has_steal_clock) kvm_register_steal_time(); + + if (kvm_para_has_feature(KVM_FEATURE_PV_TIMER)) { + unsigned long data; + + data = slow_virt_to_phys(this_cpu_ptr(_shared_buf)) + 
| KVM_MSR_ENABLED; + wrmsrl(MSR_KVM_PV_TIMER_EN, data); + this_cpu_write(pvtimer_enabled, 1); + } } static void kvm_pv_disable_apf(void) -- 1.7.1
[PATCH RFC 6/7] Doc/KVM: introduce a new cpuid bit for kvm pvtimer
From: Ben Luo <bn0...@gmail.com> KVM_FEATURE_PV_TIMER enables guest to check whether pvtimer can be enabled in guest. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- Documentation/virtual/kvm/cpuid.txt |4 arch/x86/include/uapi/asm/kvm_para.h |1 + arch/x86/kvm/cpuid.c |1 + 3 files changed, 6 insertions(+), 0 deletions(-) diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt index 3c65feb..b26b31c 100644 --- a/Documentation/virtual/kvm/cpuid.txt +++ b/Documentation/virtual/kvm/cpuid.txt @@ -54,6 +54,10 @@ KVM_FEATURE_PV_UNHALT || 7 || guest checks this feature bit || || before enabling paravirtualized || || spinlock support. -- +KVM_FEATURE_PV_TIMER || 8 || guest checks this feature bit + || || before enabling paravirtualized + || || timer support. +-- KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||24 || host will warn if no guest-side || || per-cpu warps are expected in || || kvmclock. diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 3dd6116..46734a8 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -25,6 +25,7 @@ #define KVM_FEATURE_STEAL_TIME 5 #define KVM_FEATURE_PV_EOI 6 #define KVM_FEATURE_PV_UNHALT 7 +#define KVM_FEATURE_PV_TIMER 8 /* The last 8 bits are used to indicate how to interpret the flags field * in pvclock structure. If no bits are set, all flags are ignored. diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 0099e10..e02fd23 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -593,6 +593,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, (1 << KVM_FEATURE_CLOCKSOURCE2) | (1 << KVM_FEATURE_ASYNC_PF) | (1 << KVM_FEATURE_PV_EOI) | +(1 << KVM_FEATURE_PV_TIMER) | (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) | (1 << KVM_FEATURE_PV_UNHALT); -- 1.7.1
[PATCH RFC 5/7] KVM: timer: ignore timer migration if pvtimer is enabled
From: Ben Luo <bn0...@gmail.com> When pvtimer is enabled, KVM programs timer to a dedicated CPU through IPI. Whether the vCPU is on the dedicated CPU or any other CPU, the timer interrupt will be delivered properly. No need to migrate timer. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- arch/x86/kvm/lapic.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 5835a27..265efe6 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2282,7 +2282,7 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) { struct hrtimer *timer; - if (!lapic_in_kernel(vcpu)) + if (!lapic_in_kernel(vcpu) || pv_timer_enabled(vcpu)) return; timer = >arch.apic->lapic_timer.timer; -- 1.7.1
[PATCH RFC 4/7] KVM: timer: program timer to a dedicated CPU
From: Ben Luo <bn0...@gmail.com> KVM always registers timer on the CPU which vCPU was running on. Even though vCPU thread is rescheduled to another CPU, the timer will be migrated to the target CPU as well. When timer expired, timer interrupt could make guest-mode vCPU VM-exit on this CPU. Since the working kthread scans periodically, some of the timer events may be lost or delayed. We have to program these tsc- deadline timestamps to MSR_IA32_TSC_DEADLINE as normal, which will cause VM-exit and KVM will signal the working thread through IPI to program timer, instread of registering on current CPU. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- arch/x86/kvm/lapic.c |8 +++- arch/x86/kvm/x86.c |7 ++- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 20a23bb..5835a27 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2072,7 +2072,13 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data) struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer); struct kvm_lapic *apic = container_of(ktimer, struct kvm_lapic, lapic_timer); - apic_timer_expired(apic); + + if (pv_timer_enabled(apic->vcpu)) { + kvm_apic_local_deliver(apic, APIC_LVTT); + if (apic_lvtt_tscdeadline(apic)) + apic->lapic_timer.tscdeadline = 0; + } else + apic_timer_expired(apic); if (lapic_is_periodic(apic)) { advance_periodic_target_expiration(apic); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 5668774..3cbb223 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -26,6 +26,7 @@ #include "tss.h" #include "kvm_cache_regs.h" #include "x86.h" +#include "lapic.h" #include "cpuid.h" #include "pmu.h" #include "hyperv.h" @@ -2196,7 +2197,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff: return kvm_x2apic_msr_write(vcpu, msr, data); case MSR_IA32_TSCDEADLINE: - kvm_set_lapic_tscdeadline_msr(vcpu, data); + if (pv_timer_enabled(vcpu)) + smp_call_function_single(PVTIMER_SYNC_CPU, + kvm_apic_sync_pv_timer, vcpu, 0); + else + kvm_set_lapic_tscdeadline_msr(vcpu, data); break; case MSR_IA32_TSC_ADJUST: if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) { -- 1.7.1
Re: [PATCH v3] KVM: X86: Fix load RFLAGS w/o the fixed bit
On 2017/12/07 16:30, Wanpeng Li wrote: From: Wanpeng Li <wanpeng...@hotmail.com> *** Guest State *** CR0: actual=0x0030, shadow=0x6010, gh_mask=fff7 CR4: actual=0x2050, shadow=0x, gh_mask=e871 CR3 = 0xfffbc000 RSP = 0x RIP = 0x RFLAGS=0x DR7 = 0x0400 ^^ The failed vmentry is triggered by the following testcase when ept=Y: #include #include #include #include #include #include #include long r[5]; int main() { r[2] = open("/dev/kvm", O_RDONLY); r[3] = ioctl(r[2], KVM_CREATE_VM, 0); r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7); struct kvm_regs regs = { .rflags = 0, }; ioctl(r[4], KVM_SET_REGS, ); ioctl(r[4], KVM_RUN, 0); } X86 RFLAGS bit 1 is fixed set, userspace can simply clearing bit 1 of RFLAGS with KVM_SET_REGS ioctl which results in vmentry fails. This patch fixes it by oring X86_EFLAGS_FIXED during ioctl. Suggested-by: Jim Mattson <jmatt...@google.com> Reviewed-by: David Hildenbrand <da...@redhat.com> Cc: Paolo Bonzini <pbonz...@redhat.com> Cc: Radim Krčmář <rkrc...@redhat.com> Cc: Jim Mattson <jmatt...@google.com> Signed-off-by: Wanpeng Li <wanpeng...@hotmail.com> Reviewed-by: Quan Xu <quan@gmail.com>
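The hunk itself is not quoted above; the fix described boils down to a one-liner in the KVM_SET_REGS path, roughly (the exact surrounding context may differ from this sketch):

	/* in kvm_arch_vcpu_ioctl_set_regs(): never let userspace clear the fixed bit */
	kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED);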
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/09 00:18, David Hildenbrand wrote: On 08.12.2017 11:22, Quan Xu wrote: From: Quan Xu <quan@gmail.com> Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson <jmatt...@google.com> Signed-off-by: Radim Krčmář <rkrc...@redhat.com> Signed-off-by: Quan Xu <quan@gmail.com> --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), FIELD64(MSR_BITMAP, msr_bitmap), FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr), FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr), @@ -943,8 +941,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +954,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3626,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5439,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6741,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); Looks good to me. David, thanks for your review. We could now also drop diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a5d2856eb28b..732e93332f4c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -10639,7 +10639,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, * Rather, exit every time. */ exec_control &= ~CPU_BASED_USE_IO_BITMAPS; - exec_control |= CPU_BASED_UNCOND_IO_EXITING; vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control); As this will be implicitly set by vmx_exec_control() good catch. yes, L0 merge vmcs01 with vmcsl2 to create vmcs02 and run L2. 
however, as this code is related to nested virtualization, I'd better introduce it in another new patch with a "KVM: nVMX" prefix subject after this patch is applied. furthermore, I think I could drop these two lines in another new patch: - exec_control &= ~CPU_BASED_USE_IO_BITMAPS; - exec_control |= CPU_BASED_UNCOND_IO_EXITING; as CPU_BASED_USE_IO_BITMAPS is already cleared after this patch. Quan
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/09 01:31, Jim Mattson wrote: On Fri, Dec 8, 2017 at 2:22 AM, Quan Xu <quan@gmail.com> wrote: From: Quan Xu <quan@gmail.com> Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson <jmatt...@google.com> Signed-off-by: Radim Krčmář <rkrc...@redhat.com> Signed-off-by: Quan Xu <quan@gmail.com> --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), These two lines should stay. Jim, could you explain why these two lines should stay? IIUC, the main concern is from nested virtualization, which still uses io_bitmap_a/io_bitmap_b.. if so, we really need to further clean up these code, as CPU_BASED_USE_IO_BITMAPS is clear, and CPU_BASED_UNCOND_IO_EXITING is set for both L0/L2. after new patches which I mentioned in this thread. right? Alibaba Cloud Quan
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/12 02:08, Jim Mattson wrote: Removing these two lines from the initialization of field_to_offset_table[] means that vmcs_field_to_offset() will return -ENOENT for IO_BITMAP_A or IO_BITMAP_B. Hence, handle_vmread and handle_vmwrite will incorrectly report these fields as unsupported VMCS components if an L1 hypervisor tries to access them. I will fix in v2. Quan Alibaba Cloud On Sun, Dec 10, 2017 at 9:37 PM, Quan Xu <quan@gmail.com> wrote: On 2017/12/09 01:31, Jim Mattson wrote: On Fri, Dec 8, 2017 at 2:22 AM, Quan Xu <quan@gmail.com> wrote: From: Quan Xu <quan@gmail.com> Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson <jmatt...@google.com> Signed-off-by: Radim Krčmář <rkrc...@redhat.com> Signed-off-by: Quan Xu <quan@gmail.com> --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), These two lines should stay. Jim, could you explain why these two lines should stay? IIUC, the main concern is from nested virtualization, which still uses io_bitmap_a/io_bitmap_b.. if so, we really need to further clean up these code, as CPU_BASED_USE_IO_BITMAPS is clear, and CPU_BASED_UNCOND_IO_EXITING is set for both L0/L2. after new patches which I mentioned in this thread. right? Alibaba Cloud Quan
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/09 00:18, David Hildenbrand wrote: On 08.12.2017 11:22, Quan Xu wrote: From: Quan Xu <quan@gmail.com> Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson <jmatt...@google.com> Signed-off-by: Radim Krčmář <rkrc...@redhat.com> Signed-off-by: Quan Xu <quan@gmail.com> --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), FIELD64(MSR_BITMAP, msr_bitmap), FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr), FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr), @@ -943,8 +941,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +954,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3626,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5439,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6741,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); Looks good to me. We could now also drop diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a5d2856eb28b..732e93332f4c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -10639,7 +10639,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, * Rather, exit every time. */ exec_control &= ~CPU_BASED_USE_IO_BITMAPS; - exec_control |= CPU_BASED_UNCOND_IO_EXITING; vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control); As this will be implicitly set by vmx_exec_control() David, I find some nVMX code is still using vmcs12 (.e.g. nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) dropping it would make these logic error handling.. we better leave it as is.. Quan Alibaba Cloud
[PATCH v2] KVM: VMX: drop I/O permission bitmaps
From: Quan Xu <quan@gmail.com> Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson <jmatt...@google.com> Signed-off-by: Radim Krčmář <rkrc...@redhat.com> Signed-off-by: Quan Xu <quan@gmail.com> --- arch/x86/kvm/vmx.c | 15 +-- 1 files changed, 1 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..41aaf5e 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -943,8 +943,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +956,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3628,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5441,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6743,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); -- 1.7.1
Re: [PATCH] vfio: mdev: make a couple of functions and structure vfio_mdev_driver static
On 2017/12/22 07:12, Xiongwei Song wrote: The functions vfio_mdev_probe, vfio_mdev_remove and the structure vfio_mdev_driver are only used in this file, so make them static. Clean up sparse warnings: drivers/vfio/mdev/vfio_mdev.c:114:5: warning: no previous prototype for 'vfio_mdev_probe' [-Wmissing-prototypes] drivers/vfio/mdev/vfio_mdev.c:121:6: warning: no previous prototype for 'vfio_mdev_remove' [-Wmissing-prototypes] Signed-off-by: Xiongwei Song <sxwj...@gmail.com> --- drivers/vfio/mdev/vfio_mdev.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c index fa848a701b8b..d230620fe02d 100644 --- a/drivers/vfio/mdev/vfio_mdev.c +++ b/drivers/vfio/mdev/vfio_mdev.c @@ -111,19 +111,19 @@ static const struct vfio_device_ops vfio_mdev_dev_ops = { .mmap = vfio_mdev_mmap, }; -int vfio_mdev_probe(struct device *dev) +static int vfio_mdev_probe(struct device *dev) { struct mdev_device *mdev = to_mdev_device(dev); return vfio_add_group_dev(dev, _mdev_dev_ops, mdev); } -void vfio_mdev_remove(struct device *dev) +static void vfio_mdev_remove(struct device *dev) { vfio_del_group_dev(dev); } -struct mdev_driver vfio_mdev_driver = { +static struct mdev_driver vfio_mdev_driver = { .name = "vfio_mdev", .probe = vfio_mdev_probe, .remove = vfio_mdev_remove, Reviewed-by: Quan Xu <quan@gmail.com>
Re: [PATCH] KVM: nVMX: remove unnecessary vmwrite from L2->L1 vmexit
On 2018/01/02 17:47, Liran Alon wrote: On 02/01/18 00:58, Paolo Bonzini wrote: The POSTED_INTR_NV field is constant (though it differs between the vmcs01 and vmcs02), there is no need to reload it on vmexit to L1. Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> --- arch/x86/kvm/vmx.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e6223fe8faa1..1e184830a295 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -11610,9 +11610,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu, */ vmx_flush_tlb(vcpu, true); } - /* Restore posted intr vector. */ - if (nested_cpu_has_posted_intr(vmcs12)) - vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR); vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs); vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->host_ia32_sysenter_esp); Reviewed-by: Liran Alon <liran.a...@oracle.com> I would also add to commit message: Fixes: 06a5524f091b ("KVM: nVMX: Fix posted intr delivery when vcpu is in guest mode") Reviewed-by: Quan Xu <quan@gmail.com>
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
on 2017/8/29 20:45, Peter Zijlstra wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote: Add poll in do_idle. For UP VM, if there are running task, it will not goes into idle path, so we only enable poll in SMP VM. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Broken SoB chain. Peter, I can't follow 'Broken SoB chain'.. could you explain more about it? -Quan diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 6c23e30..b374744 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void) } /* Weak implementations for optional arch specific functions */ +void __weak arch_cpu_idle_poll(void) { } void __weak arch_cpu_idle_prepare(void) { } void __weak arch_cpu_idle_enter(void) { } And not a word on why we need a new arch hook. What's wrong with arch_cpu_idle_enter() for instance?
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
on 2017/8/29 22:39, Borislav Petkov wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote: Add poll in do_idle. For UP VM, if there are running task, it will not goes into idle path, so we only enable poll in SMP VM. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Kyle Huey Cc: Andy Lutomirski Cc: Len Brown Cc: linux-kernel@vger.kernel.org --- arch/x86/kernel/process.c | 7 +++ kernel/sched/idle.c | 2 ++ 2 files changed, 9 insertions(+) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 3ca1980..def4113 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -332,6 +332,13 @@ void arch_cpu_idle(void) x86_idle(); } +#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT) +void arch_cpu_idle_poll(void) +{ + paravirt_idle_poll(); +} +#endif So this will get called on any system which has CONFIG_PARAVIRT enabled *even* if they're not running any guests. Huh? Borislav , yes, this will get called on any system which has CONFIG_PARAVIRT enabled. but if they are not running any guests, the callback is paravirt_nop() , IIUC which is as similar as the other paravirt_*, such as paravirt_pgd_free().. - Quan
Re: [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle
On 2017/8/29 20:46, Peter Zijlstra wrote: On Tue, Aug 29, 2017 at 11:46:41AM +, Yang Zhang wrote: In ttwu_do_wakeup, it will update avg_idle when waking up from idle. Here we just reuse this logic to update the poll time. It may be a little late to update the poll in ttwu_do_wakeup, but the test result shows no obvious performance gap compared with updating the poll in the irq handler. one problem is that idle_stamp is only used with the CFS scheduler. But it is ok since that is the default scheduler policy, and considering only it should be enough. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Same broken SoB chain, and not a useful word on why you need to adjust this crap to begin with. What you want that poll duration to be related to is the cost of a VMEXIT/VMENTER cycle, not however long we happened to be idle. So no. Peter, I think you are right.. IIUC, the time we happened to be idle may contain a chain of VMEXIT/VMENTER cycles, which would be mainly (except the last VMEXIT/VMENTER cycle) just idle loops. right? as you mentioned, the poll duration should be related to the cost of __a__ VMEXIT/VMENTER cycle. however, it is very difficult to measure a VMEXIT/VMENTER cycle accurately from a kvm guest; we could find an approximate one -- dropping the idle loops from the time we happened to be idle.. does that make sense? Quan
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
On 2017/9/1 14:49, Quan Xu wrote: on 2017/8/29 22:39, Borislav Petkov wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote: Add poll in do_idle. For UP VM, if there are running task, it will not goes into idle path, so we only enable poll in SMP VM. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Kyle Huey Cc: Andy Lutomirski Cc: Len Brown Cc: linux-kernel@vger.kernel.org --- arch/x86/kernel/process.c | 7 +++ kernel/sched/idle.c | 2 ++ 2 files changed, 9 insertions(+) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 3ca1980..def4113 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -332,6 +332,13 @@ void arch_cpu_idle(void) x86_idle(); } +#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT) +void arch_cpu_idle_poll(void) +{ + paravirt_idle_poll(); +} +#endif So this will get called on any system which has CONFIG_PARAVIRT enabled *even* if they're not running any guests. Huh? Borislav, on second thought, it is better to run this under an IF condition, to make sure we are actually running as a guest (see the sketch below). Quan Borislav , yes, this will get called on any system which has CONFIG_PARAVIRT enabled. but if they are not running any guests, the callback is paravirt_nop() , IIUC which is as similar as the other paravirt_*, such as paravirt_pgd_free().. - Quan
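One way to express that IF condition, as a sketch (X86_FEATURE_HYPERVISOR is set whenever we run under a hypervisor; whether the series ends up using exactly this check is still open):

void arch_cpu_idle_poll(void)
{
	/* only poll when we are actually a guest; bare metal skips the pv hook entirely */
	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		paravirt_idle_poll();
}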
Re: [PATCH RFC 0/7] kvm pvtimer
On 2017/12/14 19:56, Paolo Bonzini wrote: On 13/12/2017 17:28, Konrad Rzeszutek Wilk wrote: 1) VM idle path and network req/resp services: Does this go away if you don't hit the idle path? Meaning if you loop without hitting HLT/MWAIT? I am assuming the issue you are facing is the latency - that is, the first time the guest comes out of HLT and responds to the packet, the latency is much higher than without? And the arming of the timer? 2) process context switches. Is that related to 1)? That is the 'schedule' call and the process going to sleep waiting for an interrupt or timer? This all sounds like issues with low-CPU usage workloads where you need low latency responses? Even high-CPU usage, as long as there is a small idle time. The cost of setting the TSC deadline timer twice is about 3000 cycles. However, I think Amazon's approach of not intercepting HLT/MWAIT/PAUSE can recover most of the performance and it's way less intrusive. Paolo, could you share Amazon's patch or the LKML link? thanks. Quan Thanks, Paolo
Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten
On 2017/12/14 21:15, Gonglei (Arei) wrote: -Original Message- From: Liran Alon [mailto:liran.a...@oracle.com] Sent: Thursday, December 14, 2017 9:02 PM To: Gonglei (Arei); pbonz...@redhat.com; rkrc...@redhat.com Cc: k...@vger.kernel.org; linux-kernel@vger.kernel.org; Huangweidong (C) Subject: Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten On 14/12/17 14:23, Gonglei wrote: We hit a bug in our test while run PCMark 10 in a windows 7 VM, The VM got stuck and the wallclock was hang after several minutes running PCMark 10 in it. It is quite easily to reproduce the bug with the upstream KVM and Qemu. We found that KVM can not inject any RTC irq to VM after it was hang, it fails to Deliver irq in ioapic_set_irq() because RTC irq is still pending in ioapic->irr. static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq, int irq_level, bool line_status) { ... if (!irq_level) { ioapic->irr &= ~mask; ret = 1; goto out; } ... if ((edge && old_irr == ioapic->irr) || (!edge && entry.fields.remote_irr)) { ret = 0; goto out; } According to RTC spec, after RTC injects a High level irq, OS will read CMOS's register C to to clear the irq flag, and pull down the irq electric pin. For Qemu, we will emulate the reading operation in cmos_ioport_read(), but Guest OS will fire a write operation before to tell which register will be read after this write, where we use s->cmos_index to record the following register to read. But in our test, we found that there is a possible situation that Vcpu fails to read RTC_REG_C to clear irq, This could happens while two VCpus are writing/reading registers at the same time, for example, vcpu 0 is trying to read RTC_REG_C, so it write RTC_REG_C first, where the s->cmos_index will be RTC_REG_C, but before it tries to read register C, another vcpu1 is going to read RTC_YEAR, it changes s->cmos_index to RTC_YEAR by a writing action. The next operation of vcpu0 will be lead to read RTC_YEAR, In this case, we will miss calling qemu_irq_lower(s->irq) to clear the irq. After this, kvm will never inject RTC irq, and Windows VM will hang. If I understood correctly, this looks to me like a race-condition bug in the Windows guest kernel. In real-hardware this race-condition will also cause the RTC_YEAR to be read instead of RTC_REG_C. Guest kernel should make sure that 2 CPUs does not attempt to read a CMOS register in parallel as they can override each other's cmos_index. See for example how Linux kernel makes sure to avoid such kind of issues in rtc_cmos_read() (arch/x86/kernel/rtc.c) by grabbing a cmos_lock. Yes, I knew that. Let's clear IRR of rtc when corresponding EOI is gotten to avoid the issue. Can you elaborate a bit more why it makes sense to put such workaround in KVM code instead of declaring this as guest kernel bug? I agree it's a Windows bug. The big problem is there is not problem on Xen platform. have you analyzed why it is not a problem on Xen? Quan Thanks, -Gonglei Regards, -Liran Suggested-by: Paolo Bonzini Signed-off-by: Gonglei --- Thanks to Paolo provides a good solution. 
:) arch/x86/kvm/ioapic.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c index 4e822ad..5022d63 100644 --- a/arch/x86/kvm/ioapic.c +++ b/arch/x86/kvm/ioapic.c @@ -160,6 +160,7 @@ static void rtc_irq_eoi(struct kvm_ioapic *ioapic, struct kvm_vcpu *vcpu) { if (test_and_clear_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map.map)) { + ioapic->irr &= ~(1 << RTC_GSI); --ioapic->rtc_status.pending_eoi; rtc_status_pending_eoi_check_valid(ioapic); }
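For reference, the guest-side discipline Liran describes looks roughly like this (paraphrased from the Linux guest's rtc_cmos_read() in arch/x86/kernel/rtc.c; the lock name and helper are simplified here):

	static DEFINE_SPINLOCK(cmos_lock);

	static unsigned char cmos_read_locked(unsigned char addr)
	{
		unsigned long flags;
		unsigned char val;

		/* The index write (port 0x70) and the data read (port 0x71)
		 * are done under one lock, so another CPU cannot change the
		 * CMOS index in between -- exactly the step the Windows race
		 * above skips. */
		spin_lock_irqsave(&cmos_lock, flags);
		outb(addr, 0x70);
		val = inb(0x71);
		spin_unlock_irqrestore(&cmos_lock, flags);

		return val;
	}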
Re: [PATCH] vfio: mdev: make a couple of functions and structure vfio_mdev_driver static
On 2017/12/22 07:12, Xiongwei Song wrote: The functions vfio_mdev_probe, vfio_mdev_remove and the structure vfio_mdev_driver are only used in this file, so make them static. Clean up sparse warnings: drivers/vfio/mdev/vfio_mdev.c:114:5: warning: no previous prototype for 'vfio_mdev_probe' [-Wmissing-prototypes] drivers/vfio/mdev/vfio_mdev.c:121:6: warning: no previous prototype for 'vfio_mdev_remove' [-Wmissing-prototypes] Signed-off-by: Xiongwei Song --- drivers/vfio/mdev/vfio_mdev.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c index fa848a701b8b..d230620fe02d 100644 --- a/drivers/vfio/mdev/vfio_mdev.c +++ b/drivers/vfio/mdev/vfio_mdev.c @@ -111,19 +111,19 @@ static const struct vfio_device_ops vfio_mdev_dev_ops = { .mmap = vfio_mdev_mmap, }; -int vfio_mdev_probe(struct device *dev) +static int vfio_mdev_probe(struct device *dev) { struct mdev_device *mdev = to_mdev_device(dev); return vfio_add_group_dev(dev, _mdev_dev_ops, mdev); } -void vfio_mdev_remove(struct device *dev) +static void vfio_mdev_remove(struct device *dev) { vfio_del_group_dev(dev); } -struct mdev_driver vfio_mdev_driver = { +static struct mdev_driver vfio_mdev_driver = { .name = "vfio_mdev", .probe = vfio_mdev_probe, .remove = vfio_mdev_remove, Reviewed-by: Quan Xu
Re: [PATCH RFC v3 4/6] Documentation: Add three sysctls for smart idle poll
On 2017/11/13 23:08, Ingo Molnar wrote: * Quan Xu wrote: From: Quan Xu To reduce the cost of poll, we introduce three sysctl to control the poll time when running as a virtual machine with paravirt. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu --- Documentation/sysctl/kernel.txt | 35 +++ arch/x86/kernel/paravirt.c |4 include/linux/kernel.h |6 ++ kernel/sysctl.c | 34 ++ 4 files changed, 79 insertions(+), 0 deletions(-) diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 694968c..30c25fb 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -714,6 +714,41 @@ kernel tries to allocate a number starting from this one. == +paravirt_poll_grow: (X86 only) + +Multiplied value to increase the poll time. This is expected to take +effect only when running as a virtual machine with CONFIG_PARAVIRT +enabled. This can't bring any benifit on bare mental even with +CONFIG_PARAVIRT enabled. + +By default this value is 2. Possible values to set are in range {2..16}. + +== + +paravirt_poll_shrink: (X86 only) + +Divided value to reduce the poll time. This is expected to take effect +only when running as a virtual machine with CONFIG_PARAVIRT enabled. +This can't bring any benifit on bare mental even with CONFIG_PARAVIRT +enabled. + +By default this value is 2. Possible values to set are in range {2..16}. + +== + +paravirt_poll_threshold_ns: (X86 only) + +Controls the maximum poll time before entering real idle path. This is +expected to take effect only when running as a virtual machine with +CONFIG_PARAVIRT enabled. This can't bring any benifit on bare mental +even with CONFIG_PARAVIRT enabled. + +By default, this value is 0 means not to poll. Possible values to set +are in range {0..50}. Change the value to non-zero if running +latency-bound workloads in a virtual machine. I absolutely hate it how this hybrid idle loop polling mechanism is not self-tuning! Ingo, actually it is self-tuning.. Please make it all work fine by default, and automatically so, instead of adding three random parameters... .. I will make it all fine by default. howerver cloud environment is of diversity, could I only leave paravirt_poll_threshold_ns parameter (the maximum poll time), which is as similar as "adaptive halt-polling" Wanpeng mentioned.. then user can turn it off, or find an appropriate threshold for some odd scenario.. thanks for your comments!! Quan Alibaba Cloud And if it cannot be done automatically then we should rather not do it at all. Maybe the next submitter of a similar feature can think of a better approach. Thanks, Ingo
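To make the "self-tuning" claim concrete: with only paravirt_poll_threshold_ns exposed, the grow/shrink update can stay internal, roughly as below (a sketch, not the series itself; the helper name and the seed value are made up, the sysctl names are the ones from the patch):

	static DEFINE_PER_CPU(unsigned long, poll_duration_ns);

	static void update_poll_duration(bool poll_found_work)
	{
		unsigned long val = this_cpu_read(poll_duration_ns);

		if (poll_found_work) {
			/* Polling paid off: grow the window, capped by the sysctl. */
			if (!val)
				val = 1000;	/* arbitrary seed for the sketch */
			val *= paravirt_poll_grow;
			if (val > paravirt_poll_threshold_ns)
				val = paravirt_poll_threshold_ns;
		} else {
			/* The window expired without a reschedule event: shrink it. */
			val /= paravirt_poll_shrink;
		}

		this_cpu_write(poll_duration_ns, val);
	}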
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Juergen Gross Cc: Alok Kataria Cc: Rusty Russell Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU 4. w/patch and w/ kvm dynamic poll: 42225.3 bit/s -- 198.7 %CPU 5. idle=poll 37081.7 bit/s -- 998.1 %CPU w/ this patch, we will improve performance by 23%.. even we could improve performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the cost of CPU is much lower than 'idle=poll' case.. Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too. I assume this feature will be ported to other archs.. a new pvops makes code clean and easy to maintain. also I tried to add it into existed pvops, but it doesn't match. Quan Alibaba Cloud Juergen
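For context, the new pvops hook being debated boils down to something like the following (the struct and field name are taken from the commit message; the wrapper and the plain indirect call are assumptions, not copied from the series):

	struct pv_idle_ops {
		void (*poll)(void);	/* spin briefly before the real idle path */
	};

	extern struct pv_idle_ops pv_idle_ops;

	static inline void paravirt_idle_poll(void)
	{
		/* The real series presumably routes this through the usual
		 * PVOP_* call macros; a plain indirect call shows the idea. */
		pv_idle_ops.poll();
	}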
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 15:12, Wanpeng Li wrote: 2017-11-14 15:02 GMT+08:00 Quan Xu : On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Juergen Gross Cc: Alok Kataria Cc: Rusty Russell Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU Actually we can reduce the CPU utilization by sleeping a period of time as what has already been done in the poll logic of IO subsystem, then we can improve the algorithm in kvm instead of introduing another duplicate one in the kvm guest. We really appreciate upstream's kvm dynamic poll mechanism, which is really helpful for a lot of scenario.. However, as description said, in virtualization, idle path includes several heavy operations includes timer access (LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. for upstream's kvm dynamic poll mechanism, even you could provide a better algorism, how could you bypass timer access (LAPIC timer or TSC deadline timer), or a hardware context switch between virtual machine and hypervisor. I know these is a tradeoff. Furthermore, here is the data we get when running benchmark contextswitch to measure the latency(lower is better): 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 3402.9 ns/ctxsw -- 199.8 %CPU 2. w/ patch and disable kvm dynamic poll: 1163.5 ns/ctxsw -- 205.5 %CPU 3. w/ kvm dynamic poll: 2280.6 ns/ctxsw -- 199.5 %CPU so, these tow solution are quite similar, but not duplicate.. that's also why to add a generic idle poll before enter real idle path. When a reschedule event is pending, we can bypass the real idle path. Quan Alibaba Cloud Regards, Wanpeng Li 4. w/patch and w/ kvm dynamic poll: 42225.3 bit/s -- 198.7 %CPU 5. idle=poll 37081.7 bit/s -- 998.1 %CPU w/ this patch, we will improve performance by 23%.. even we could improve performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the cost of CPU is much lower than 'idle=poll' case.. Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too. 
I assume this feature will be ported to other archs.. a new pvops makes code clean and easy to maintain. also I tried to add it into existed pvops, but it doesn't match. Quan Alibaba Cloud Juergen
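To make the comparison with KVM's host-side halt-polling concrete, the guest-side poll is conceptually just a short spin on need_resched() before the normal idle entry, roughly as below (the helper name and poll_duration_tsc are illustrative, not names from the series):

	static void guest_poll_before_idle(void)
	{
		u64 start = rdtsc();

		/* A wakeup that lands inside this window never pays for the
		 * HLT VM-exit, the LAPIC/TSC-deadline reprogramming in the
		 * idle path, or the later VM-entry. */
		while (!need_resched() && rdtsc() - start < poll_duration_tsc)
			cpu_relax();
	}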
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 15:30, Juergen Gross wrote: On 14/11/17 08:02, Quan Xu wrote: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Juergen Gross Cc: Alok Kataria Cc: Rusty Russell Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU 4. w/patch and w/ kvm dynamic poll: 42225.3 bit/s -- 198.7 %CPU 5. idle=poll 37081.7 bit/s -- 998.1 %CPU w/ this patch, we will improve performance by 23%.. even we could improve performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the cost of CPU is much lower than 'idle=poll' case.. I don't question the general idea. I just think pvops isn't the best way to implement it. Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too. I assume this feature will be ported to other archs.. a new pvops makes sorry, a typo.. /other archs/other hypervisors/ it refers hypervisor like Xen, HyperV and VMware).. code clean and easy to maintain. also I tried to add it into existed pvops, but it doesn't match. You are aware that pvops is x86 only? yes, I'm aware.. I really don't see the big difference in maintainability compared to the static key / function pointer variant: void (*guest_idle_poll_func)(void); struct static_key guest_idle_poll_key __read_mostly; static inline void guest_idle_poll(void) { if (static_key_false(_idle_poll_key)) guest_idle_poll_func(); } thank you for your sample code :) I agree there is no big difference.. I think we are discussion for two things: 1) x86 VM on different hypervisors 2) different archs VM on kvm hypervisor What I want to do is x86 VM on different hypervisors, such as kvm / xen / hyperv .. And KVM would just need to set guest_idle_poll_func and enable the static key. Works on non-x86 architectures, too. .. referred to 'pv_mmu_ops', HyperV and Xen can implement their own functions for 'pv_mmu_ops'. I think it is the same to pv_idle_ops. with above explaination, do you still think I need to define the static key/function pointer variant? btw, any interest to port it to Xen HVM guest? :) Quan Alibaba Cloud
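Written out in full, the static key / function pointer variant Juergen is proposing reads:

	void (*guest_idle_poll_func)(void);
	struct static_key guest_idle_poll_key __read_mostly;

	static inline void guest_idle_poll(void)
	{
		if (static_key_false(&guest_idle_poll_key))
			guest_idle_poll_func();
	}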
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 16:22, Wanpeng Li wrote: 2017-11-14 16:15 GMT+08:00 Quan Xu : On 2017/11/14 15:12, Wanpeng Li wrote: 2017-11-14 15:02 GMT+08:00 Quan Xu : On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Juergen Gross Cc: Alok Kataria Cc: Rusty Russell Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU Actually we can reduce the CPU utilization by sleeping a period of time as what has already been done in the poll logic of IO subsystem, then we can improve the algorithm in kvm instead of introduing another duplicate one in the kvm guest. We really appreciate upstream's kvm dynamic poll mechanism, which is really helpful for a lot of scenario.. However, as description said, in virtualization, idle path includes several heavy operations includes timer access (LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. for upstream's kvm dynamic poll mechanism, even you could provide a better algorism, how could you bypass timer access (LAPIC timer or TSC deadline timer), or a hardware context switch between virtual machine and hypervisor. I know these is a tradeoff. Furthermore, here is the data we get when running benchmark contextswitch to measure the latency(lower is better): 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 3402.9 ns/ctxsw -- 199.8 %CPU 2. w/ patch and disable kvm dynamic poll: 1163.5 ns/ctxsw -- 205.5 %CPU 3. w/ kvm dynamic poll: 2280.6 ns/ctxsw -- 199.5 %CPU so, these tow solution are quite similar, but not duplicate.. that's also why to add a generic idle poll before enter real idle path. When a reschedule event is pending, we can bypass the real idle path. There is a similar logic in the idle governor/driver, so how this patchset influence the decision in the idle governor/driver when running on bare-metal(power managment is not exposed to the guest so we will not enter into idle driver in the guest)? This is expected to take effect only when running as a virtual machine with proper CONFIG_* enabled. This can not work on bare mental even with proper CONFIG_* enabled. Quan Alibaba Cloud
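A small sketch of the "only takes effect in a VM" point (assumed wiring on top of the pv_idle_ops sketch above; kvm_idle_poll and the init helper are hypothetical names): the hook defaults to a nop and only the KVM guest init code replaces it, so bare metal never polls.

	struct pv_idle_ops pv_idle_ops = {
		.poll = paravirt_nop,	/* bare metal: idle entry is unchanged */
	};

	static void __init kvm_guest_idle_poll_init(void)
	{
		/* Hypothetical guest-side hookup: only when running on KVM. */
		if (kvm_para_available())
			pv_idle_ops.poll = kvm_idle_poll;
	}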
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 18:27, Juergen Gross wrote: On 14/11/17 10:38, Quan Xu wrote: On 2017/11/14 15:30, Juergen Gross wrote: On 14/11/17 08:02, Quan Xu wrote: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Juergen Gross Cc: Alok Kataria Cc: Rusty Russell Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU 4. w/patch and w/ kvm dynamic poll: 42225.3 bit/s -- 198.7 %CPU 5. idle=poll 37081.7 bit/s -- 998.1 %CPU w/ this patch, we will improve performance by 23%.. even we could improve performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the cost of CPU is much lower than 'idle=poll' case.. I don't question the general idea. I just think pvops isn't the best way to implement it. Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too. I assume this feature will be ported to other archs.. a new pvops makes sorry, a typo.. /other archs/other hypervisors/ it refers hypervisor like Xen, HyperV and VMware).. code clean and easy to maintain. also I tried to add it into existed pvops, but it doesn't match. You are aware that pvops is x86 only? yes, I'm aware.. I really don't see the big difference in maintainability compared to the static key / function pointer variant: void (*guest_idle_poll_func)(void); struct static_key guest_idle_poll_key __read_mostly; static inline void guest_idle_poll(void) { if (static_key_false(_idle_poll_key)) guest_idle_poll_func(); } thank you for your sample code :) I agree there is no big difference.. I think we are discussion for two things: 1) x86 VM on different hypervisors 2) different archs VM on kvm hypervisor What I want to do is x86 VM on different hypervisors, such as kvm / xen / hyperv .. Why limit the solution to x86 if the more general solution isn't harder? As you didn't give any reason why the pvops approach is better other than you don't care for non-x86 platforms you won't get an "Ack" from me for this patch. It just looks a little odder to me. I understand you care about no-x86 arch. 
Are you aware of 'pv_time_ops' for the arm64/arm/x86 archs, defined in - arch/arm64/include/asm/paravirt.h - arch/x86/include/asm/paravirt_types.h - arch/arm/include/asm/paravirt.h I am unfamiliar with the arm code. IIUC, if you'd implement pv_idle_ops for the arm/arm64 archs, you'd define the same structure in - arch/arm64/include/asm/paravirt.h or - arch/arm/include/asm/paravirt.h .. instead of a static key / function. then implement a real function in - arch/arm/kernel/paravirt.c. Also I wonder HOW/WHERE to define a static key/function so that it benefits both x86 and non-x86 archs? Quan Alibaba Cloud And KVM would just need to set guest_idle_poll_func and enable the static key. Works on non-x86 architectures, too. .. referring to 'pv_mmu_ops', HyperV and Xen can implement their own functions for 'pv_mmu_ops'. I think it is the same for pv_idle_ops. With the above explanation, do you still think I need to define the static key/function pointer variant? btw, any interest to port it to Xen HVM guest? :) Maybe. But this should work for Xen on ARM, too. Juergen
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 17:45, Daniel Lezcano wrote: On 16/11/2017 10:12, Quan Xu wrote: On 2017-11-16 06:03, Thomas Gleixner wrote: On Wed, 15 Nov 2017, Peter Zijlstra wrote: On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote: From: Yang Zhang Implement a generic idle poll which resembles the functionality found in arch/. Provide weak arch_cpu_idle_poll function which can be overridden by the architecture code if needed. No, we want less of those magic hooks, not more. Interrupts arrive which may not cause a reschedule in idle loops. In KVM guest, this costs several VM-exit/VM-entry cycles, VM-entry for interrupts and VM-exit immediately. Also this becomes more expensive than bare metal. Add a generic idle poll before enter real idle path. When a reschedule event is pending, we can bypass the real idle path. Why not do a HV specific idle driver? If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. See commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit looks similar. But lets not proliferate that. I'd rather see that go away. agreed. Even we can get more benifit than commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu") in kvm guest. I won't proliferate that.. But the irq_timings stuff is heading into the same direction, with a more complex prediction logic which should tell you pretty good how long that idle period is going to be and in case of an interrupt heavy workload this would skip the extra work of stopping and restarting the tick and provide a very good input into a polling decision. interesting. I have tested with IRQ_TIMINGS related code, which seems not working so far. I don't know how you tested it, can you elaborate what you meant by "seems not working so far" ? Daniel, I tried to enable IRQ_TIMINGS* manually. used irq_timings_next_event() to return estimation of the earliest interrupt. However I got a constant. There are still some work to do to be more efficient. The prediction based on the irq timings is all right if the interrupts have a simple periodicity. But as soon as there is a pattern, the current code can't handle it properly and does bad predictions. I'm working on a self-learning pattern detection which is too heavy for the kernel, and with it we should be able to detect properly the patterns and re-ajust the period if it changes. I'm in the process of making it suitable for kernel code (both math and perf). One improvement which can be done right now and which can help you is the interrupts rate on the CPU. It is possible to compute it and that will give an accurate information for the polling decision. As tglx said, talk to each other / work together to make it usable for all use cases. could you share how to enable it to get the interrupts rate on the CPU? I can try it in cloud scenario. of course, I'd like to work with you to improve it. Quan Alibaba Cloud
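One self-contained way to estimate a per-CPU interrupt rate for the polling decision Daniel mentions (illustrative only; this is not his irq_timings work, and the helper name is made up) is to sample the aggregated per-CPU interrupt count and divide by the elapsed time:

	static u64 cpu_irq_rate(unsigned int cpu, u64 *last_count, u64 *last_ns)
	{
		u64 now = ktime_get_ns();
		u64 count = kstat_cpu_irqs_sum(cpu);
		u64 rate = 0;

		/* Interrupts per second since the previous sample. */
		if (*last_ns && now > *last_ns)
			rate = div64_u64((count - *last_count) * NSEC_PER_SEC,
					 now - *last_ns);

		*last_count = count;
		*last_ns = now;

		return rate;
	}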
Re: [Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support
On 2017-11-16 05:31, Konrad Rzeszutek Wilk wrote: On Mon, Nov 13, 2017 at 06:05:59PM +0800, Quan Xu wrote: From: Yang Zhang Some latency-intensive workload have seen obviously performance drop when running inside VM. The main reason is that the overhead is amplified when running inside VM. The most cost I have seen is inside idle path. Meaning a VMEXIT b/c it is a 'halt' operation? And then going back into the guest (VMRESUME) takes time. And hence your latency gets all whacked b/c of this? Konrad, I can't follow 'b/c' here.. sorry. So if I understand - you want to use your _full_ timeslice (of the guest) without ever (or as much as possible) going into the hypervisor? as much as possible. Which means in effect you don't care about power-saving or CPUfreq savings, you just want to eat the full CPU for snack? actually, we care about power-saving. The poll duration is self-tuning, otherwise it is almost the same as 'idle=poll'. Also, we always report CPU usage along with the netperf/ctxsw benchmark results. We got much more performance with only a limited increase in CPU usage. This patch introduces a new mechanism to poll for a while before entering idle state. If schedule is needed during poll, then we don't need to goes through the heavy overhead path. Schedule of what? The guest or the host? a reschedule in the guest scheduler.. it is the guest. Quan Alibaba Cloud
Re: [PATCH] KVM: nVMX: remove unnecessary vmwrite from L2->L1 vmexit
On 2018/01/02 17:47, Liran Alon wrote: On 02/01/18 00:58, Paolo Bonzini wrote: The POSTED_INTR_NV field is constant (though it differs between the vmcs01 and vmcs02), there is no need to reload it on vmexit to L1. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e6223fe8faa1..1e184830a295 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -11610,9 +11610,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu, */ vmx_flush_tlb(vcpu, true); } - /* Restore posted intr vector. */ - if (nested_cpu_has_posted_intr(vmcs12)) - vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR); vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs); vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->host_ia32_sysenter_esp); Reviewed-by: Liran Alon I would also add to commit message: Fixes: 06a5524f091b ("KVM: nVMX: Fix posted intr delivery when vcpu is in guest mode") Reviewed-by: Quan Xu
Re: [PATCH v3] KVM: X86: Fix load RFLAGS w/o the fixed bit
On 2017/12/07 16:30, Wanpeng Li wrote: From: Wanpeng Li *** Guest State *** CR0: actual=0x0030, shadow=0x6010, gh_mask=fff7 CR4: actual=0x2050, shadow=0x, gh_mask=e871 CR3 = 0xfffbc000 RSP = 0x RIP = 0x RFLAGS=0x DR7 = 0x0400 ^^ The failed vmentry is triggered by the following testcase when ept=Y: #include #include #include #include #include #include #include long r[5]; int main() { r[2] = open("/dev/kvm", O_RDONLY); r[3] = ioctl(r[2], KVM_CREATE_VM, 0); r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7); struct kvm_regs regs = { .rflags = 0, }; ioctl(r[4], KVM_SET_REGS, ); ioctl(r[4], KVM_RUN, 0); } X86 RFLAGS bit 1 is fixed set, userspace can simply clearing bit 1 of RFLAGS with KVM_SET_REGS ioctl which results in vmentry fails. This patch fixes it by oring X86_EFLAGS_FIXED during ioctl. Suggested-by: Jim Mattson Reviewed-by: David Hildenbrand Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Jim Mattson Signed-off-by: Wanpeng Li Reviewed-by: Quan Xu
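The essence of the fix is a one-liner in the KVM_SET_REGS handler in arch/x86/kvm/x86.c; presumably along the lines of:

	/* RFLAGS bit 1 is architecturally fixed to 1, so OR it back in no
	 * matter what userspace passed; a zero rflags can then no longer
	 * produce an invalid guest state and a failed vmentry. */
	kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED);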
[PATCH RFC 1/7] kvm: x86: emulate MSR_KVM_PV_TIMER_EN MSR
From: Ben Luo Guest enables pv timer functionality using this MSR Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- arch/x86/include/asm/kvm_host.h |5 + arch/x86/include/uapi/asm/kvm_para.h |6 ++ arch/x86/kvm/lapic.c | 22 ++ arch/x86/kvm/lapic.h |6 ++ arch/x86/kvm/x86.c |8 5 files changed, 47 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c73e493..641b4aa 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -684,6 +684,11 @@ struct kvm_vcpu_arch { bool pv_unhalted; } pv; + struct { + u64 msr_val; + struct gfn_to_hva_cache data; + } pv_timer; + int pending_ioapic_eoi; int pending_external_vector; diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 554aa8f..3dd6116 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -41,6 +41,7 @@ #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 #define MSR_KVM_STEAL_TIME 0x4b564d03 #define MSR_KVM_PV_EOI_EN 0x4b564d04 +#define MSR_KVM_PV_TIMER_EN0x4b564d05 struct kvm_steal_time { __u64 steal; @@ -64,6 +65,11 @@ struct kvm_clock_pairing { #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1))) #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1) +struct pvtimer_vcpu_event_info { + __u64 expire_tsc; + __u64 next_sync_tsc; +} __attribute__((__packed__)); + #define KVM_MAX_MMU_OP_BATCH 32 #define KVM_ASYNC_PF_ENABLED (1 << 0) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 36c90d6..55c9ba3 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1991,6 +1991,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) kvm_lapic_set_base(vcpu, vcpu->arch.apic_base | MSR_IA32_APICBASE_BSP); vcpu->arch.pv_eoi.msr_val = 0; + vcpu->arch.pv_timer.msr_val = 0; apic_update_ppr(apic); if (vcpu->arch.apicv_active) { kvm_x86_ops->apicv_post_state_restore(vcpu); @@ -2478,6 +2479,27 @@ int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data) addr, sizeof(u8)); } +int kvm_lapic_enable_pv_timer(struct kvm_vcpu *vcpu, u64 data) +{ + u64 addr = data & ~KVM_MSR_ENABLED; + int ret; + + if (!lapic_in_kernel(vcpu)) + return 1; + + if (!IS_ALIGNED(addr, 4)) + return 1; + + vcpu->arch.pv_timer.msr_val = data; + if (!pv_timer_enabled(vcpu)) + return 0; + + ret = kvm_gfn_to_hva_cache_init(vcpu->kvm, >arch.pv_timer.data, + addr, sizeof(struct pvtimer_vcpu_event_info)); + + return ret; +} + void kvm_apic_accept_events(struct kvm_vcpu *vcpu) { struct kvm_lapic *apic = vcpu->arch.apic; diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 4b9935a..539a738 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -113,6 +113,7 @@ static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu) } int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data); +int kvm_lapic_enable_pv_timer(struct kvm_vcpu *vcpu, u64 data); void kvm_lapic_init(void); void kvm_lapic_exit(void); @@ -207,6 +208,11 @@ static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu) return lapic_in_kernel(vcpu) && test_bit(KVM_APIC_INIT, >arch.apic->pending_events); } +static inline bool pv_timer_enabled(struct kvm_vcpu *vcpu) +{ + return vcpu->arch.pv_timer.msr_val & KVM_MSR_ENABLED; +} + bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector); void wait_lapic_expire(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 03869eb..5668774 100644 --- a/arch/x86/kvm/x86.c +++ 
b/arch/x86/kvm/x86.c @@ -1025,6 +1025,7 @@ bool kvm_rdpmc(struct kvm_vcpu *vcpu) HV_X64_MSR_STIMER0_CONFIG, HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, MSR_KVM_PV_EOI_EN, + MSR_KVM_PV_TIMER_EN, MSR_IA32_TSC_ADJUST, MSR_IA32_TSCDEADLINE, @@ -2279,6 +2280,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if (kvm_lapic_enable_pv_eoi(vcpu, data)) return 1; break; + case MSR_KVM_PV_TIMER_EN: + if (kvm_lapic_enable_pv_timer(vcpu, data)) + return 1; + break; case MSR_IA32_MCG_CTL: case MSR_IA32_MCG_STATUS: @@ -2510,6 +2515,9 @@ int kvm_get_msr_common(struc
[PATCH RFC 0/7] kvm pvtimer
From: Ben Luo This patchset introduces a new paravirtualized mechanism to reduce VM-exit caused by guest timer accessing. In general, KVM guest programs tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR. This will cause a VM-exit, and then KVM handles this timer for guest. Also kvm always registers timer on the CPU which vCPU was running on. Even though vCPU thread is rescheduled to another CPU, the timer will be migrated to the target CPU as well. When timer expired, timer interrupt could make guest-mode vCPU VM-exit on this CPU. When pvtimer is enabled: - The tsc-deadline timestamp is mostly recorded in share page with less VM-exit. We Introduce a periodically working kthread to scan share page and synchronize timer setting for guest on a dedicated CPU. - Since the working kthread scans periodically, some of the timer events may be lost or delayed. We have to program these tsc-deadline timestamps to MSR_IA32_TSC_DEADLINE as normal, which will cause VM-exit and KVM will signal the working thread through IPI to program timer, instread of registering on current CPU. - Timer interrupt will be delivered by posted interrupt mechanism to vCPUs with less VM-exit. Ben Luo (7): kvm: x86: emulate MSR_KVM_PV_TIMER_EN MSR kvm: x86: add a function to exchange value KVM: timer: synchronize tsc-deadline timestamp for guest KVM: timer: program timer to a dedicated CPU KVM: timer: ignore timer migration if pvtimer is enabled Doc/KVM: introduce a new cpuid bit for kvm pvtimer kvm: guest: reprogram guest timer Documentation/virtual/kvm/cpuid.txt |4 + arch/x86/include/asm/kvm_host.h |5 + arch/x86/include/asm/kvm_para.h |9 ++ arch/x86/include/uapi/asm/kvm_para.h |7 ++ arch/x86/kernel/apic/apic.c |9 +- arch/x86/kernel/kvm.c| 38 arch/x86/kvm/cpuid.c |1 + arch/x86/kvm/lapic.c | 170 +- arch/x86/kvm/lapic.h | 11 ++ arch/x86/kvm/x86.c | 15 +++- include/linux/kvm_host.h |3 + virt/kvm/kvm_main.c | 42 + 12 files changed, 308 insertions(+), 6 deletions(-)
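Condensed into one helper, the guest-side decision described above (using struct pvtimer_vcpu_event_info from patch 1/7; this restatement is a sketch for readability, not part of the series):

	static bool pv_timer_can_skip_msr_write(struct pvtimer_vcpu_event_info *buf,
						u64 deadline_tsc, u64 now_tsc,
						u64 padding_tsc)
	{
		/* Always publish the deadline for the host-side sync kthread. */
		xchg(&buf->expire_tsc, deadline_tsc);

		/* A deadline due before the kthread's next scan, already in
		 * the past, or about to fire must still be programmed through
		 * MSR_IA32_TSC_DEADLINE (the VM-exit path), or it would be
		 * lost or delayed. */
		if (deadline_tsc < buf->next_sync_tsc)
			return false;
		if (deadline_tsc < now_tsc || deadline_tsc - now_tsc < padding_tsc)
			return false;

		/* Otherwise the sync kthread will pick it up: no VM-exit needed. */
		return true;
	}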
[PATCH RFC 2/7] kvm: x86: add a function to exchange value
From: Ben Luo Introduce kvm_xchg_guest_cached to exchange value with guest page atomically. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- include/linux/kvm_host.h |3 +++ virt/kvm/kvm_main.c | 42 ++ 2 files changed, 45 insertions(+), 0 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 6882538..32949ed 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -688,6 +688,9 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, void *data, int offset, unsigned long len); int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc, gpa_t gpa, unsigned long len); +unsigned long kvm_xchg_guest_cached(struct kvm *kvm, + struct gfn_to_hva_cache *ghc, unsigned long offset, + unsigned long new, int size); int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len); int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len); struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 9deb5a2..3149e17 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2010,6 +2010,48 @@ int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, } EXPORT_SYMBOL_GPL(kvm_read_guest_cached); +unsigned long kvm_xchg_guest_cached(struct kvm *kvm, + struct gfn_to_hva_cache *ghc, unsigned long offset, + unsigned long new, int size) +{ + unsigned long r; + void *kva; + struct page *page; + kvm_pfn_t pfn; + + WARN_ON(offset > ghc->len); + + pfn = gfn_to_pfn_atomic(kvm, ghc->gpa >> PAGE_SHIFT); + page = kvm_pfn_to_page(pfn); + + if (is_error_page(page)) + return -EFAULT; + + kva = kmap_atomic(page) + offset_in_page(ghc->gpa) + offset; + switch (size) { + case 1: + r = xchg((char *)kva, new); + break; + case 2: + r = xchg((short *)kva, new); + break; + case 4: + r = xchg((int *)kva, new); + break; + case 8: + r = xchg((long *)kva, new); + break; + default: + kunmap_atomic(kva); + return -EFAULT; + } + + kunmap_atomic(kva); + mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT); + + return r; +} + int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len) { const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0))); -- 1.7.1
[PATCH RFC 5/7] KVM: timer: ignore timer migration if pvtimer is enabled
From: Ben Luo When pvtimer is enabled, KVM programs timer to a dedicated CPU through IPI. Whether the vCPU is on the dedicated CPU or any other CPU, the timer interrupt will be delivered properly. No need to migrate timer. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- arch/x86/kvm/lapic.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 5835a27..265efe6 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2282,7 +2282,7 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) { struct hrtimer *timer; - if (!lapic_in_kernel(vcpu)) + if (!lapic_in_kernel(vcpu) || pv_timer_enabled(vcpu)) return; timer = >arch.apic->lapic_timer.timer; -- 1.7.1
[PATCH RFC 3/7] KVM: timer: synchronize tsc-deadline timestamp for guest
From: Ben Luo In general, KVM guest programs tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR. This will cause a VM-exit, and then KVM handles this timer for guest. The tsc-deadline timestamp is mostly recorded in share page with less VM-exit. We Introduce a periodically working kthread to scan share page and synchronize timer setting for guest on a dedicated CPU. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- arch/x86/kvm/lapic.c | 138 ++ arch/x86/kvm/lapic.h |5 ++ 2 files changed, 143 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 55c9ba3..20a23bb 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -36,6 +36,10 @@ #include #include #include +#include +#include +#include +#include #include "kvm_cache_regs.h" #include "irq.h" #include "trace.h" @@ -70,6 +74,12 @@ #define APIC_BROADCAST 0xFF #define X2APIC_BROADCAST 0xul +static struct hrtimer pv_sync_timer; +static long pv_timer_period_ns = PVTIMER_PERIOD_NS; +static struct task_struct *pv_timer_polling_worker; + +module_param(pv_timer_period_ns, long, 0644); + static inline int apic_test_vector(int vec, void *bitmap) { return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); @@ -2542,8 +2552,130 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu) } } +static enum hrtimer_restart pv_sync_timer_callback(struct hrtimer *timer) +{ + hrtimer_forward_now(timer, ns_to_ktime(pv_timer_period_ns)); + wake_up_process(pv_timer_polling_worker); + + return HRTIMER_RESTART; +} + +void kvm_apic_sync_pv_timer(void *data) +{ + struct kvm_vcpu *vcpu = data; + struct kvm_lapic *apic = vcpu->arch.apic; + unsigned long flags, this_tsc_khz = vcpu->arch.virtual_tsc_khz; + u64 guest_tsc, expire_tsc; + long rem_tsc; + + if (!lapic_in_kernel(vcpu) || !pv_timer_enabled(vcpu)) + return; + + local_irq_save(flags); + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); + rem_tsc = ktime_to_ns(hrtimer_get_remaining(_sync_timer)) + * this_tsc_khz; + if (rem_tsc <= 0) + rem_tsc += pv_timer_period_ns * this_tsc_khz; + do_div(rem_tsc, 100L); + + /* +* make sure guest_tsc and rem_tsc are assigned before to update +* next_sync_tsc. +*/ + smp_wmb(); + kvm_xchg_guest_cached(vcpu->kvm, >arch.pv_timer.data, + offsetof(struct pvtimer_vcpu_event_info, next_sync_tsc), + guest_tsc + rem_tsc, 8); + + /* make sure next_sync_tsc is visible */ + smp_wmb(); + + expire_tsc = kvm_xchg_guest_cached(vcpu->kvm, >arch.pv_timer.data, + offsetof(struct pvtimer_vcpu_event_info, expire_tsc), + 0UL, 8); + + /* make sure expire_tsc is visible */ + smp_wmb(); + + if (expire_tsc) { + if (expire_tsc > guest_tsc) + /* +* As we bind this thread to a dedicated CPU through +* IPI, the timer is registered on that dedicated +* CPU here. 
+*/ + kvm_set_lapic_tscdeadline_msr(apic->vcpu, expire_tsc); + else + /* deliver immediately if expired */ + kvm_apic_local_deliver(apic, APIC_LVTT); + } + local_irq_restore(flags); +} + +static int pv_timer_polling(void *arg) +{ + struct kvm *kvm; + struct kvm_vcpu *vcpu; + int i; + mm_segment_t oldfs = get_fs(); + + while (1) { + set_current_state(TASK_INTERRUPTIBLE); + + if (kthread_should_stop()) { + __set_current_state(TASK_RUNNING); + break; + } + + spin_lock(_lock); + __set_current_state(TASK_RUNNING); + list_for_each_entry(kvm, _list, vm_list) { + set_fs(USER_DS); + use_mm(kvm->mm); + kvm_for_each_vcpu(i, vcpu, kvm) { + kvm_apic_sync_pv_timer(vcpu); + } + unuse_mm(kvm->mm); + set_fs(oldfs); + } + + spin_unlock(_lock); + + schedule(); + } + + return 0; +} + +static void kvm_pv_timer_init(void) +{ + ktime_t ktime = ktime_set(0, pv_timer_period_ns); + + hrtimer_init(_sync_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED); + pv_sync_timer.function = _sync_timer_callback; + + /* kthread for pv_timer sync buffer */ + pv_timer_polling_worker = kthread_create(pv_timer_polling, NULL, +
[PATCH RFC 7/7] kvm: guest: reprogram guest timer
From: Ben Luo In general, KVM guest programs tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR. When pvtimer is enabled, we introduce a new mechanism to reprogram KVM guest timer. A periodically working kthread scans share page and synchronize timer setting for guest on a dedicated CPU. The next time event of the periodically working kthread is a threshold to decide whether to program tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR, or to share page. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- arch/x86/include/asm/kvm_para.h |9 + arch/x86/kernel/apic/apic.c |9 ++--- arch/x86/kernel/kvm.c | 38 ++ 3 files changed, 53 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index c373e44..109e706 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -5,6 +5,7 @@ #include #include #include +#include extern void kvmclock_init(void); extern int kvm_register_clock(char *txt); @@ -92,6 +93,8 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1, void kvm_async_pf_task_wait(u32 token, int interrupt_kernel); void kvm_async_pf_task_wake(u32 token); u32 kvm_read_and_reset_pf_reason(void); +int kvm_pv_timer_next_event(unsigned long tsc, + struct clock_event_device *evt); extern void kvm_disable_steal_time(void); #ifdef CONFIG_PARAVIRT_SPINLOCKS @@ -126,6 +129,12 @@ static inline void kvm_disable_steal_time(void) { return; } + +static inline int kvm_pv_timer_next_event(unsigned long tsc, + struct clock_event_device *evt) +{ + return 0; +} #endif #endif /* _ASM_X86_KVM_PARA_H */ diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index ff89177..286c1b3 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -471,10 +471,13 @@ static int lapic_next_event(unsigned long delta, static int lapic_next_deadline(unsigned long delta, struct clock_event_device *evt) { - u64 tsc; + u64 tsc = rdtsc() + (((u64) delta) * TSC_DIVISOR); - tsc = rdtsc(); - wrmsrl(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR)); + /* TODO: undisciplined function call */ + if (kvm_pv_timer_next_event(tsc, evt)) + return 0; + + wrmsrl(MSR_IA32_TSC_DEADLINE, tsc); return 0; } diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 8bb9594..ec7aff1 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -328,6 +328,35 @@ static notrace void kvm_guest_apic_eoi_write(u32 reg, u32 val) apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK); } +static DEFINE_PER_CPU(int, pvtimer_enabled); +static DEFINE_PER_CPU(struct pvtimer_vcpu_event_info, + pvtimer_shared_buf) = {0}; +#define PVTIMER_PADDING25000 +int kvm_pv_timer_next_event(unsigned long tsc, + struct clock_event_device *evt) +{ + struct pvtimer_vcpu_event_info *src; + u64 now; + + if (!this_cpu_read(pvtimer_enabled)) + return false; + + src = this_cpu_ptr(_shared_buf); + xchg((u64 *)>expire_tsc, tsc); + + barrier(); + + if (tsc < src->next_sync_tsc) + return false; + + rdtscll(now); + if (tsc < now || tsc - now < PVTIMER_PADDING) + return false; + + return true; +} +EXPORT_SYMBOL_GPL(kvm_pv_timer_next_event); + static void kvm_guest_cpu_init(void) { if (!kvm_para_available()) @@ -362,6 +391,15 @@ static void kvm_guest_cpu_init(void) if (has_steal_clock) kvm_register_steal_time(); + + if (kvm_para_has_feature(KVM_FEATURE_PV_TIMER)) { + unsigned long data; + + data = slow_virt_to_phys(this_cpu_ptr(_shared_buf)) + | KVM_MSR_ENABLED; + wrmsrl(MSR_KVM_PV_TIMER_EN, data); + 
this_cpu_write(pvtimer_enabled, 1); + } } static void kvm_pv_disable_apf(void) -- 1.7.1
[PATCH RFC 6/7] Doc/KVM: introduce a new cpuid bit for kvm pvtimer
From: Ben Luo KVM_FEATURE_PV_TIMER enables guest to check whether pvtimer can be enabled in guest. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- Documentation/virtual/kvm/cpuid.txt |4 arch/x86/include/uapi/asm/kvm_para.h |1 + arch/x86/kvm/cpuid.c |1 + 3 files changed, 6 insertions(+), 0 deletions(-) diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt index 3c65feb..b26b31c 100644 --- a/Documentation/virtual/kvm/cpuid.txt +++ b/Documentation/virtual/kvm/cpuid.txt @@ -54,6 +54,10 @@ KVM_FEATURE_PV_UNHALT || 7 || guest checks this feature bit || || before enabling paravirtualized || || spinlock support. -- +KVM_FEATURE_PV_TIMER || 8 || guest checks this feature bit + || || before enabling paravirtualized + || || timer support. +-- KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||24 || host will warn if no guest-side || || per-cpu warps are expected in || || kvmclock. diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 3dd6116..46734a8 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -25,6 +25,7 @@ #define KVM_FEATURE_STEAL_TIME 5 #define KVM_FEATURE_PV_EOI 6 #define KVM_FEATURE_PV_UNHALT 7 +#define KVM_FEATURE_PV_TIMER 8 /* The last 8 bits are used to indicate how to interpret the flags field * in pvclock structure. If no bits are set, all flags are ignored. diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 0099e10..e02fd23 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -593,6 +593,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, (1 << KVM_FEATURE_CLOCKSOURCE2) | (1 << KVM_FEATURE_ASYNC_PF) | (1 << KVM_FEATURE_PV_EOI) | +(1 << KVM_FEATURE_PV_TIMER) | (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) | (1 << KVM_FEATURE_PV_UNHALT); -- 1.7.1
[PATCH RFC 4/7] KVM: timer: program timer to a dedicated CPU
From: Ben Luo KVM always registers timer on the CPU which vCPU was running on. Even though vCPU thread is rescheduled to another CPU, the timer will be migrated to the target CPU as well. When timer expired, timer interrupt could make guest-mode vCPU VM-exit on this CPU. Since the working kthread scans periodically, some of the timer events may be lost or delayed. We have to program these tsc- deadline timestamps to MSR_IA32_TSC_DEADLINE as normal, which will cause VM-exit and KVM will signal the working thread through IPI to program timer, instread of registering on current CPU. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- arch/x86/kvm/lapic.c |8 +++- arch/x86/kvm/x86.c |7 ++- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 20a23bb..5835a27 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2072,7 +2072,13 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data) struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer); struct kvm_lapic *apic = container_of(ktimer, struct kvm_lapic, lapic_timer); - apic_timer_expired(apic); + + if (pv_timer_enabled(apic->vcpu)) { + kvm_apic_local_deliver(apic, APIC_LVTT); + if (apic_lvtt_tscdeadline(apic)) + apic->lapic_timer.tscdeadline = 0; + } else + apic_timer_expired(apic); if (lapic_is_periodic(apic)) { advance_periodic_target_expiration(apic); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 5668774..3cbb223 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -26,6 +26,7 @@ #include "tss.h" #include "kvm_cache_regs.h" #include "x86.h" +#include "lapic.h" #include "cpuid.h" #include "pmu.h" #include "hyperv.h" @@ -2196,7 +2197,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff: return kvm_x2apic_msr_write(vcpu, msr, data); case MSR_IA32_TSCDEADLINE: - kvm_set_lapic_tscdeadline_msr(vcpu, data); + if (pv_timer_enabled(vcpu)) + smp_call_function_single(PVTIMER_SYNC_CPU, + kvm_apic_sync_pv_timer, vcpu, 0); + else + kvm_set_lapic_tscdeadline_msr(vcpu, data); break; case MSR_IA32_TSC_ADJUST: if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) { -- 1.7.1
[PATCH] KVM: VMX: drop I/O permission bitmaps
From: Quan Xu Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson Signed-off-by: Radim Krčmář Signed-off-by: Quan Xu --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), FIELD64(MSR_BITMAP, msr_bitmap), FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr), FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr), @@ -943,8 +941,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +954,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3626,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5439,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6741,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); -- 1.7.1
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/09 00:18, David Hildenbrand wrote: On 08.12.2017 11:22, Quan Xu wrote: From: Quan Xu Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson Signed-off-by: Radim Krčmář Signed-off-by: Quan Xu --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), FIELD64(MSR_BITMAP, msr_bitmap), FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr), FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr), @@ -943,8 +941,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +954,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3626,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5439,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6741,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); Looks good to me. David, thanks for your review. We could now also drop diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a5d2856eb28b..732e93332f4c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -10639,7 +10639,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, * Rather, exit every time. */ exec_control &= ~CPU_BASED_USE_IO_BITMAPS; - exec_control |= CPU_BASED_UNCOND_IO_EXITING; vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control); As this will be implicitly set by vmx_exec_control() good catch. yes, L0 merge vmcs01 with vmcsl2 to create vmcs02 and run L2. however as this code is related to nested virtualization, I better introduce it in another new patch with "KVM: NVMX" prefix subject after this patch is applied. 
furthermore, I think I could drop these two lines in another new patch: - exec_control &= ~CPU_BASED_USE_IO_BITMAPS; - exec_control |= CPU_BASED_UNCOND_IO_EXITING; as CPU_BASED_USE_IO_BITMAPS is cleared after this patch. Quan
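For completeness, that follow-up nVMX cleanup would then presumably be just this, combining David's hunk with the line above (untested sketch on top of the current patch):

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	 * Rather, exit every time.
 	 */
-	exec_control &= ~CPU_BASED_USE_IO_BITMAPS;
-	exec_control |= CPU_BASED_UNCOND_IO_EXITING;
 	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);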
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/09 01:31, Jim Mattson wrote: On Fri, Dec 8, 2017 at 2:22 AM, Quan Xu wrote: From: Quan Xu Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson Signed-off-by: Radim Krčmář Signed-off-by: Quan Xu --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), These two lines should stay. Jim, could you explain why these two lines should stay? IIUC, the main concern is from nested virtualization, which still uses io_bitmap_a/io_bitmap_b.. if so, we really need to further clean up these code, as CPU_BASED_USE_IO_BITMAPS is clear, and CPU_BASED_UNCOND_IO_EXITING is set for both L0/L2. after new patches which I mentioned in this thread. right? Alibaba Cloud Quan
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/09 00:18, David Hildenbrand wrote: On 08.12.2017 11:22, Quan Xu wrote: From: Quan Xu Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson Signed-off-by: Radim Krčmář Signed-off-by: Quan Xu --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), FIELD64(MSR_BITMAP, msr_bitmap), FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr), FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr), @@ -943,8 +941,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +954,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3626,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5439,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6741,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); Looks good to me. We could now also drop diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a5d2856eb28b..732e93332f4c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -10639,7 +10639,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, * Rather, exit every time. */ exec_control &= ~CPU_BASED_USE_IO_BITMAPS; - exec_control |= CPU_BASED_UNCOND_IO_EXITING; vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control); As this will be implicitly set by vmx_exec_control() David, I find some nVMX code is still using vmcs12 (.e.g. nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) dropping it would make these logic error handling.. we better leave it as is.. Quan Alibaba Cloud
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/12 02:08, Jim Mattson wrote: Removing these two lines from the initialization of field_to_offset_table[] means that vmcs_field_to_offset() will return -ENOENT for IO_BITMAP_A or IO_BITMAP_B. Hence, handle_vmread and handle_vmwrite will incorrectly report these fields as unsupported VMCS components if an L1 hypervisor tries to access them. I will fix in v2. Quan Alibaba Cloud On Sun, Dec 10, 2017 at 9:37 PM, Quan Xu wrote: On 2017/12/09 01:31, Jim Mattson wrote: On Fri, Dec 8, 2017 at 2:22 AM, Quan Xu wrote: From: Quan Xu Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson Signed-off-by: Radim Krčmář Signed-off-by: Quan Xu --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), These two lines should stay. Jim, could you explain why these two lines should stay? IIUC, the main concern is from nested virtualization, which still uses io_bitmap_a/io_bitmap_b.. if so, we really need to further clean up these code, as CPU_BASED_USE_IO_BITMAPS is clear, and CPU_BASED_UNCOND_IO_EXITING is set for both L0/L2. after new patches which I mentioned in this thread. right? Alibaba Cloud Quan
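For readers following along, the lookup Jim refers to is roughly the following (simplified; the real table is built from the FIELD()/FIELD64() initializers), and handle_vmread()/handle_vmwrite() turn the -ENOENT into VMXERR_UNSUPPORTED_VMCS_COMPONENT for L1:

static inline short vmcs_field_to_offset(unsigned long field)
{
	/* entries removed from the initializer stay 0 and read as "unsupported" */
	if (field >= ARRAY_SIZE(vmcs_field_to_offset_table) ||
	    vmcs_field_to_offset_table[field] == 0)
		return -ENOENT;

	return vmcs_field_to_offset_table[field];
}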
[PATCH v2] KVM: VMX: drop I/O permission bitmaps
From: Quan Xu Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson Signed-off-by: Radim Krčmář Signed-off-by: Quan Xu --- arch/x86/kvm/vmx.c | 15 +-- 1 files changed, 1 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..41aaf5e 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -943,8 +943,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +956,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3628,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5441,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6743,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); -- 1.7.1
Re: [PATCH RFC 3/7] KVM: timer: synchronize tsc-deadline timestamp for guest
On 2017/12/08 23:06, Konrad Rzeszutek Wilk wrote: On Fri, Dec 08, 2017 at 04:39:46PM +0800, Quan Xu wrote: From: Ben Luo In general, KVM guest programs tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR. This will cause a VM-exit, and then KVM handles this timer for guest. The tsc-deadline timestamp is mostly recorded in share page with less VM-exit. We Introduce a periodically working kthread to scan share page and synchronize timer setting for guest on a dedicated CPU. That sounds like a race. Meaning the guest may put too small window and this 'working thread to scan' may not get to it fast enough? yes, you are right. So .. . Meaning we miss the deadline to inject the timer in the guest. Or is this part of this PV MSR semantics - that it will only work for certain amount of values and anything less than say 1ms should not use the PV MSR? .. for these timers, We have to program these tsc-deadline timestamps to MSR_IA32_TSC_DEADLINE as normal, which will cause VM-exit and KVM will signal the working thread through IPI to program timer, instead of registering on current CPU (patch 0004). more detail in patch 0007. Quan Alibaba Cloud
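So the guest-side write path has to choose between the two mechanisms based on how far away the deadline is. A rough sketch of that decision — PV_TIMER_MIN_DELTA_TSC and the pvtimer_shared per-cpu page are assumptions, not the posted code:

static void pv_timer_set_tsc_deadline(u64 deadline_tsc)
{
	/*
	 * Too close for the scanning kthread to catch in time:
	 * write the MSR as usual and take the VM-exit.
	 */
	if (deadline_tsc <= rdtsc() + PV_TIMER_MIN_DELTA_TSC) {
		wrmsrl(MSR_IA32_TSC_DEADLINE, deadline_tsc);
		return;
	}

	/* Far enough out: publish it in the shared page, no VM-exit. */
	WRITE_ONCE(this_cpu_ptr(&pvtimer_shared)->expire_tsc, deadline_tsc);
}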
Re: [PATCH RFC 0/7] kvm pvtimer
On 2017/12/14 00:28, Konrad Rzeszutek Wilk wrote: On Wed, Dec 13, 2017 at 11:25:13PM +0800, Quan Xu wrote: On Fri, Dec 8, 2017 at 11:10 PM, Konrad Rzeszutek Wilk < konrad.w...@oracle.com> wrote: On Fri, Dec 08, 2017 at 04:39:43PM +0800, Quan Xu wrote: From: Ben Luo This patchset introduces a new paravirtualized mechanism to reduce VM-exit caused by guest timer accessing. And how bad is this blib in arming the timer? And how often do you get this timer to be armed? OR better yet - what are the workloads in which you found this VMExit to be painful? Thanks. Or better yet - what are the workloads in which you found this VMExit to be painful? one painful point is from VM idle path.. for some network req/resp services, or benchmark of process context switches.. So: 1) VM idle path and network req/resp services: Does this go away if you don't hit the idle path? Meaning if you loop without hitting HLT/MWAIT? we still hit HLT.. we can use it with https://lkml.org/lkml/2017/8/29/279 .. I am assuming the issue you are facing is the latency - that is first time the guest comes from HLT and responds to the packet the latency is much higher than without? yes, And the arming of the timer? 2) process context switches. Is that related to the 1)? That is the 'schedule' call and the process going to sleep waiting for an interrupt or timer? This all sounds like issues with low-CPU usage workloads where you need low latency responses? yes, it is also helpful to some timer-intensive services. Quan
Re: [PATCH] KVM: x86: avoid unnecessary XSETBV on guest entry
On 2017/12/13 20:51, Paolo Bonzini wrote: xsetbv can be expensive when running on nested virtualization, try to avoid it. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0f82e2cbf64c..daa1918031df 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -702,7 +702,8 @@ static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu) if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) && !vcpu->guest_xcr0_loaded) { /* kvm_set_xcr() also depends on this */ - xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0); + if (vcpu->arch.xcr0 != host_xcr0) + xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0); vcpu->guest_xcr0_loaded = 1; } } Reviewed-by: Quan Xu
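Presumably the unload side can get the same guard, since it is the mirror of the hunk above; a sketch only, not part of the posted patch:

static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
{
	if (vcpu->guest_xcr0_loaded) {
		if (vcpu->arch.xcr0 != host_xcr0)
			xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
		vcpu->guest_xcr0_loaded = 0;
	}
}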
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-15 22:43, Rik van Riel wrote: Can you explain why you believe that? for example, a vcpu thread is running in kvm mode when it is stopped under a critical condition, e.g. for live migration. QEMU sends an IPI to cause a VM-exit to happen immediately, and this IPI doesn't make the vcpu return to QEMU. IIUC this vcpu thread will still continue to run in kvm mode when it is woken up on the target machine. with your patch, I don't see a chance to load guest FPU or XSTATE until it returns to QEMU and runs in kvm mode again. then the FPU or XSTATE status is inconsistent for a small window, and what's even worse is that the vcpu is running. Did I misunderstand? Quan Alibaba Cloud Getting the guest FPU or XSTATE is done under the vcpu->mutex. This patch switches out guest and userspace FPU/XSTATE under the vcpu->mutex, and switches it back before releasing the vcpu->mutex. By the time a KVM_GET_FPU has obtained the vcpu->mutex, the guest FPU state will be in vcpu->arch.guest_fpu.state, where you expect it to be. What am I missing?
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-16 12:21, Rik van Riel wrote: On Thu, 2017-11-16 at 10:50 +0800, Quan Xu wrote: On 2017-11-15 22:43, Rik van Riel wrote: Can you explain why you believe that? for example, a vcpu thread is running in kvm mode when it is stopped under a critical condition, e.g. for live migration. QEMU sends an IPI to cause a VM-exit to happen immediately, and this IPI doesn't make the vcpu return to QEMU. IIUC this vcpu thread will still continue to run in kvm mode when it is woken up on the target machine. with your patch, I don't see a chance to load guest FPU or XSTATE until it returns to QEMU and runs in kvm mode again. then the FPU or XSTATE status is inconsistent for a small window, and what's even worse is that the vcpu is running. Did I misunderstand? At context switch time, the context switch code will save the guest FPU state to current->thread.fpu when the VCPU thread is scheduled out. When the VCPU thread is scheduled back in, the context switch code will restore current->thread.fpu to the FPU registers. good catch! Also, as you commented, PKRU is switched out separately at VMENTER and VMEXIT time, but with a lot of IF conditions.. the pkru may be restored with the host pkru after VMEXIT. when the vcpu thread is scheduled out, the pkru value in current->thread.fpu.state may be the host pkru value, instead of the guest pkru value (of course, this _assumes_ that the pkru is in current->thread.fpu.state as well). in this way, the pkru may be a corner case. VM migration again: in the case where source_host_pkru_value != guest_pkru_value and target_host_pkru_value == guest_pkru_value.. the pkru status would be inconsistent.. Quan Alibaba Cloud The VCPU thread will never run with anything else than the guest FPU state, while inside the KVM_RUN code.
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 06:03, Thomas Gleixner wrote: On Wed, 15 Nov 2017, Peter Zijlstra wrote: On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote: From: Yang Zhang Implement a generic idle poll which resembles the functionality found in arch/. Provide weak arch_cpu_idle_poll function which can be overridden by the architecture code if needed. No, we want less of those magic hooks, not more. Interrupts arrive which may not cause a reschedule in idle loops. In a KVM guest, this costs several VM-exit/VM-entry cycles, VM-entry for interrupts and VM-exit immediately. Also this becomes more expensive than bare metal. Add a generic idle poll before enter real idle path. When a reschedule event is pending, we can bypass the real idle path. Why not do a HV specific idle driver? If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. See commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit looks similar. But let's not proliferate that. I'd rather see that go away. agreed. Even though we could get more benefit than commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu") in a kvm guest, I won't proliferate that.. But the irq_timings stuff is heading into the same direction, with a more complex prediction logic which should tell you pretty well how long that idle period is going to be, and in case of an interrupt heavy workload this would skip the extra work of stopping and restarting the tick and provide a very good input into a polling decision. interesting. I have tested with the IRQ_TIMINGS related code, which does not seem to work so far. Also I'd like to help as much as I can. This can be handled either in a HV specific idle driver or even in the generic core code. If the interrupt does not arrive within the predicted time, then you can assume that the flood stopped and invoke halt or whatever. That avoids all of that 'tunable and tweakable' x86 specific hackery and utilizes common functionality which is mostly there already. here is some sample code. Poll for a while before entering halt in cpuidle_enter_state(). If I get a reschedule event, then don't try to enter halt. (I hope this is the right direction as Peter mentioned in another email) --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, target_state = &drv->states[index]; } +#ifdef CONFIG_PARAVIRT + paravirt_idle_poll(); + + if (need_resched()) + return -EBUSY; +#endif + /* Take note of the planned idle state. */ sched_idle_set_state(target_state); thanks, Quan Alibaba Cloud
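For reference, the paravirt_idle_poll() used in the hunk above is meant to be nothing more than a bounded spin on need_resched(). A minimal sketch, with poll_ns as a made-up tunable:

static unsigned long poll_ns = 50000;	/* hypothetical tunable: 50us */

static void kvm_idle_poll(void)
{
	u64 end = ktime_get_ns() + poll_ns;

	/* spin until either a reschedule is pending or the window expires */
	do {
		if (need_resched())
			break;
		cpu_relax();
	} while (ktime_get_ns() < end);
}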
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 16:45, Peter Zijlstra wrote: On Wed, Nov 15, 2017 at 11:03:08PM +0100, Thomas Gleixner wrote: If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. Sure. And there are people who want to do the same for native. Adding more ugly and special cases just isn't the way to go about doing that. I'm fairly sure I've told the various groups that want to tinker with this to work together on this. I've also in fairly significant detail sketched how to rework the idle code and idle predictors. At this point I'm too tired to dig any of that up, so I'll just keep saying no to patches that don't even attempt to go in the right direction. Peter, take care. I really have considered this factor, and try my best not to interfere with scheduler/idle code. if irq_timings code is ready, I can use it directly. I think irq_timings is not an easy task, I'd like to help as much as I can. Also don't try to touch tick_nohz* code again. as tglx suggested, this can be handled either in a HV specific idle driver or even in the generic core code. I hope this is in the right direction. Quan Alibaba Cloud
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-16 18:21, Paolo Bonzini wrote: On 16/11/2017 06:06, Quan Xu wrote: when vcpu thread is scheduled out, the pkru value in current->thread.fpu.state may be the host pkru value, instead of guest pkru value (of course, this _assumes_ that the pkru is in current->thread.fpu.state as well). in this way, the pkru may be a coner case. Rik may correct me, but I think this is not possible. Preemption is disabled all the time while PKRU = guest_pkru (which is only during vmx_vcpu_run). refer to the following code, vcpu_enter_guest() { ... preempt_disable() ... kvm_x86_ops->run(vcpu); (actually it is vmx_vcpu_run()) ... ... preempt_enable(); } when preempt_disable before to run kvm_x86_ops->run.. in kvm_x86_ops->run, the pkru is restored with host_pkru (IF guest_pkru != host_pkru). all this happened under preempt_disable(). it's not true that preemtion is disable all the time while PKRU = guest_pkru.. However it seems there is still some gap.. as Rik said, "at context switch time, the context switch code will save the guest FPU state to current->thread.fpu when the VCPU thread is scheduled out." after preempt_enable() in vcpu_enter_guest(), the vcpu thread is scheduled out, in kvm_x86_ops->run, the PKRU has been restored with host_pkru (IF guest_pkru != host_pkru).. instead of guest_pkru.. then the PKRU is host_pkru, how to save guest_pkru current->thread.fpu? as mentioned, all this _assumes_ that the pkru is in current->thread.fpu.state as well. thanks, Quan Alibaba Cloud Context switching will only happen in vcpu_enter_guest() after preempt_enable() for a preemptible kernel, or in vcpu_run via cond_resched() for a non-preemptible kernel. Thanks, Paolo VM migration again, in case, source_host_pkru_value != guest_pkru_value, target_host_pkru_value == guest_pkru_value.. the pkru status would be inconsistent..
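For reference, the VMENTER/VMEXIT PKRU switch being discussed looks roughly like this in vmx_vcpu_run() for this kernel generation (quoted from memory, so treat as approximate); note that all of it runs inside the preempt_disable()/preempt_enable() window shown above:

	/* before VM entry: install the guest PKRU if it differs */
	if (static_cpu_has(X86_FEATURE_PKU) &&
	    kvm_read_cr4_bits(vcpu, X86_CR4_PKE) &&
	    vcpu->arch.pkru != vmx->host_pkru)
		__write_pkru(vcpu->arch.pkru);

	/* ... VMLAUNCH/VMRESUME ... */

	/* after VM exit: read back the guest PKRU and restore the host value */
	if (static_cpu_has(X86_FEATURE_PKU) &&
	    kvm_read_cr4_bits(vcpu, X86_CR4_PKE)) {
		vcpu->arch.pkru = __read_pkru();
		if (vcpu->arch.pkru != vmx->host_pkru)
			__write_pkru(vmx->host_pkru);
	}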
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-16 20:18, Paolo Bonzini wrote: On 16/11/2017 13:12, Quan Xu wrote: However it seems there is still some gap.. as Rik said, "at context switch time, the context switch code will save the guest FPU state to current->thread.fpu when the VCPU thread is scheduled out." By "guest FPU state" Rik means "guest FPU with host PKRU". :( actually it is host_pkru, just with different names.. Guest PKRU is always stored in vcpu->arch.pkru rather than in the guest FPU state, so guest PKRU will never be in current->thread.fpu.state either. KVM_GET_XSAVE will the guest FPU state with vcpu->arch.pkru and migration will work properly. agreed, I fix it.. that's why I concern.. there are so much methods to write PKRU with host_pkru/guest_pkru.. after migration, KVM_GET|SET_XSAVE restore the right pkru.. but we introduce another method: -- When the VCPU thread is scheduled back in, the context switch code will restore current->thread.fpu to the FPU registers. there is still a window to restore current->thread.fpu to the FPU registers before enter guest mode and preempt_disable(). on target machine, after migration, the pkru value is source_host_pkru in current->thread.fpu. in case, source_host_pkru_value != guest_pkru_value, target_host_pkru_value == guest_pkru_value.. source_host_pkru_value may be restored to PKRU.. make pkru status inconsistent.. thanks Quan Alibaba Cloud Thanks, Paolo after preempt_enable() in vcpu_enter_guest(), the vcpu thread is scheduled out, in kvm_x86_ops->run, the PKRU has been restored with host_pkru (IF guest_pkru != host_pkru).. instead of guest_pkru.. then the PKRU is host_pkru, how to save guest_pkru current->thread.fpu? as mentioned, all this _assumes_ that the pkru is in current->thread.fpu.state as well. thanks, Quan Alibaba Cloud Context switching will only happen in vcpu_enter_guest() after preempt_enable() for a preemptible kernel, or in vcpu_run via cond_resched() for a non-preemptible kernel. Thanks, Paolo VM migration again, in case, source_host_pkru_value != guest_pkru_value, target_host_pkru_value == guest_pkru_value.. the pkru status would be inconsistent..
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-17 01:50, Paolo Bonzini wrote: On 16/11/2017 15:28, Quan Xu wrote: vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); + kvm_load_guest_fpu(vcpu); + for (;;) { if (kvm_vcpu_running(vcpu)) { r = vcpu_enter_guest(vcpu); << as Rik dropped kvm_load_guest_fpu(vcpu) in vcpu_enter_guest() .. in case it is still in kvm mode, how do we make sure the pkru is always the right one before entering guest mode, if the thread can be preempted before preempt_disable() after migration? :( As you know: 1) kvm_load_guest_fpu doesn't load the guest PKRU, it keeps the userspace PKRU. 2) the guest PKRU is only ever set in a preemption-disabled area. Thus, context switch always sees the userspace PKRU. The guest PKRU is only set the next time vmx_vcpu_run executes. Paolo Paolo, thanks for your explanation!!:-) Rik, could you cc me in v2? Quan
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 17:53, Thomas Gleixner wrote: On Thu, 16 Nov 2017, Quan Xu wrote: On 2017-11-16 06:03, Thomas Gleixner wrote: --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, target_state = &drv->states[index]; } +#ifdef CONFIG_PARAVIRT + paravirt_idle_poll(); + + if (need_resched()) + return -EBUSY; +#endif That's just plain wrong. We don't want to see any of this PARAVIRT crap in anything outside the architecture/hypervisor interfacing code which really needs it. The problem can and must be solved at the generic level in the first place to gather the data which can be used to make such decisions. How that information is used might be either completely generic or requires system specific variants. But as long as we don't have any information at all we cannot discuss that. Please sit down and write up which data needs to be considered to make decisions about probabilistic polling. Then we need to compare and contrast that with the data which is necessary to make power/idle state decisions. I would be very surprised if this data would not overlap by at least 90%. Peter, tglx Thanks for your comments.. rethinking this patch set, 1. which data needs to be considered to make decisions about probabilistic polling I really need to write up which data needs to be considered to make decisions about probabilistic polling. Over the last several months, I have always focused on the data _from idle to reschedule_, then to bypass the idle loops. unfortunately, this makes me touch scheduler/idle/nohz code inevitably. with tglx's suggestion, the data which is necessary to make power/idle state decisions is the last idle state's residency time. IIUC this data is the duration from idle to wakeup, which may be ended by a reschedule irq or another irq. I also tested that the reschedule irq overlaps by more than 90% (tracing the need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for one minute. given that overlap, I think I can use the last idle state's residency time to make decisions about probabilistic polling, as @dev->last_residency does. this data is much easier to get. 2. do a HV specific idle driver (function) so far, power management is not exposed to the guest.. idle is simple for a KVM guest, calling "sti" / "hlt" (cpuidle_idle_call() --> default_idle_call()).. thanks to the Xen guys, who have implemented the paravirt framework. I can implement it as easily as the following: --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void) update_intr_gate(X86_TRAP_PF, async_page_fault); } +static __cpuidle void kvm_safe_halt(void) +{ + /* 1. POLL, if need_resched() --> return */ + + asm volatile("sti; hlt": : :"memory"); /* 2. halt */ + + /* 3. get the last idle state's residency time */ + + /* 4. update poll duration based on last idle state's residency time */ +} + void __init kvm_guest_init(void) { int i; @@ -490,6 +496,8 @@ void __init kvm_guest_init(void) if (kvmclock_vsyscall) kvm_setup_vsyscall_timeinfo(); + pv_irq_ops.safe_halt = kvm_safe_halt; + #ifdef CONFIG_SMP then I no longer need to introduce a new pvops, and never have to modify scheduler/idle/nohz code again. also I can keep all of the code within arch/x86/kernel/kvm.c. If this is in the right direction, I will send a new patch set next week.. thanks, Quan Alibaba Cloud
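To make steps 1-4 of the skeleton concrete, a slightly fuller (still untested) sketch — the grow/shrink policy and the 50us/500us/10us numbers are placeholders, not measured values:

static u64 kvm_halt_poll_ns = 50000;		/* placeholder start value */
#define KVM_HALT_POLL_NS_MAX	500000		/* placeholder cap */
#define KVM_HALT_POLL_NS_MIN	10000		/* placeholder floor */

static __cpuidle void kvm_safe_halt(void)
{
	u64 start = ktime_get_ns();
	u64 poll_end = start + kvm_halt_poll_ns;
	u64 residency;

	/* 1. poll: if a reschedule shows up, skip halt entirely */
	do {
		if (need_resched()) {
			local_irq_enable();
			return;
		}
		cpu_relax();
	} while (ktime_get_ns() < poll_end);

	/* 2. halt */
	asm volatile("sti; hlt" : : : "memory");

	/* 3. last idle residency, measured from the start of the poll */
	residency = ktime_get_ns() - start;

	/*
	 * 4. short sleeps: polling would have caught the wakeup, grow the window;
	 *    long sleeps: polling is wasted cycles, shrink it
	 */
	if (residency < kvm_halt_poll_ns)
		kvm_halt_poll_ns = min_t(u64, kvm_halt_poll_ns * 2,
					 KVM_HALT_POLL_NS_MAX);
	else
		kvm_halt_poll_ns = max_t(u64, kvm_halt_poll_ns / 2,
					 KVM_HALT_POLL_NS_MIN);
}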
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-17 19:36, Thomas Gleixner wrote: On Fri, 17 Nov 2017, Quan Xu wrote: On 2017-11-16 17:53, Thomas Gleixner wrote: That's just plain wrong. We don't want to see any of this PARAVIRT crap in anything outside the architecture/hypervisor interfacing code which really needs it. The problem can and must be solved at the generic level in the first place to gather the data which can be used to make such decisions. How that information is used might be either completely generic or requires system specific variants. But as long as we don't have any information at all we cannot discuss that. Please sit down and write up which data needs to be considered to make decisions about probabilistic polling. Then we need to compare and contrast that with the data which is necessary to make power/idle state decisions. I would be very surprised if this data would not overlap by at least 90%. 1. which data needs to considerd to make decisions about probabilistic polling I really need to write up which data needs to considerd to make decisions about probabilistic polling. At last several months, I always focused on the data _from idle to reschedule_, then to bypass the idle loops. unfortunately, this makes me touch scheduler/idle/nohz code inevitably. with tglx's suggestion, the data which is necessary to make power/idle state decisions, is the last idle state's residency time. IIUC this data is duration from idle to wakeup, which maybe by reschedule irq or other irq. That's part of the picture, but not complete. tglx, could you share more? I am very curious about it.. I also test that the reschedule irq overlap by more than 90% (trace the need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for one minute. as the overlap, I think I can input the last idle state's residency time to make decisions about probabilistic polling, as @dev->last_residency does. it is much easier to get data. That's only true for your particular use case. 2. do a HV specific idle driver (function) so far, power management is not exposed to guest.. idle is simple for KVM guest, calling "sti" / "hlt"(cpuidle_idle_call() --> default_idle_call()).. thanks Xen guys, who has implemented the paravirt framework. I can implement it as easy as following: --- a/arch/x86/kernel/kvm.c Your email client is using a very strange formatting. my bad, I insert space to highlight these code. This is definitely better than what you proposed so far and implementing it as a prove of concept seems to be worthwhile. But I doubt that this is the final solution. It's not generic and not necessarily suitable for all use case scenarios. yes, I am exhausted :):) could you tell me the gap to be generic and necessarily suitable for all use case scenarios? as lack of irq/idle predictors? I really want to upstream it for all of public cloud users/providers.. as kvm host has a similar one, is it possible to upstream with following conditions? : 1). add a QEMU configuration, whether enable or not, by default disable. 2). add some "TODO" comments near the code. 3). ... anyway, thanks for your help.. Quan Alibaba Cloud